get_job()
get_job() gives data about the currently running Job:
import clusterscope
job = clusterscope.get_job()
This then enables all of the methods below:
get_cpus()
If inside a Slurm Job, returns the number of CPUs allocated on the Node according to Slurm (SLURM_CPUS_ON_NODE); if not, returns the number of CPUs available on the local node.
import clusterscope
job = clusterscope.get_job()
cpus_count = job.get_cpus()
print(cpus_count, type(cpus_count))
# 80 <class 'int'>
get_gpus()
If inside a Slurm Job, returns the number of GPUs allocated on the Node according to Slurm (SLURM_GPUS_ON_NODE); if not, returns the number of GPUs available on the local node.
import clusterscope
job = clusterscope.get_job()
gpus_count = job.get_gpus()
print(gpus_count, type(gpus_count))
# 2 <class 'int'>
get_job_id()
If inside a Slurm Job, returns the Job ID; if not, returns 0.
import clusterscope
job = clusterscope.get_job()
job_id = job.get_job_id()
print(job_id, type(job_id))
# <job-id> <class 'int'>
get_job_name()
If inside a Slurm Job, returns the Job Name.
import clusterscope
job = clusterscope.get_job()
job_name = job.get_job_name()
print(job_name, type(job_name))
# <job-name> <class 'str'>
get_global_rank()
Method to get the Global Rank of the current process. In order of preference, this returns:
- the RANK env var
- if inside a Slurm Job, the SLURM_PROCID env var
- 0
import clusterscope
job = clusterscope.get_job()
global_rank = job.get_global_rank()
print(global_rank, type(global_rank))
# 0 <class 'int'>
get_local_rank()
Method to get the Local Rank of the current process. In order of preference, this returns:
- the LOCAL_RANK env var
- if inside a Slurm Job, the SLURM_LOCALID env var
- 0
import clusterscope
job = clusterscope.get_job()
local_rank = job.get_local_rank()
print(local_rank, type(local_rank))
# 0 <class 'int'>
get_world_size()
Method to get the World Size of the current job. In order of preference, this returns:
- the WORLD_SIZE env var
- if inside a Slurm Job, the SLURM_NTASKS env var
- 1
import clusterscope
job = clusterscope.get_job()
world_size = job.get_world_size()
print(world_size, type(world_size))
# 10 <class 'int'>
get_is_rank_zero()
Returns a boolean that indicates whether get_global_rank() returns 0.
import clusterscope
job = clusterscope.get_job()
is_rank_zero = job.get_is_rank_zero()
print(is_rank_zero, type(is_rank_zero))
# True <class 'bool'>
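A common pattern is to guard work that should run once per job, for example (a minimal sketch):

import clusterscope

job = clusterscope.get_job()
if job.get_is_rank_zero():
    # runs only in the rank-zero process, e.g. logging or checkpointing
    print("hello from rank zero")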
get_master_port()
Method to get the Rendezvous Master Port. In order of preference, this returns:
- the MASTER_PORT env var
- a random int from (20_000, 60_000), seeded with SLURM_JOB_ID
- -1
import clusterscope
job = clusterscope.get_job()
master_port = job.get_master_port()
print(master_port, type(master_port))
# 20000 <class 'int'>
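For intuition, the seeded port could be derived along these lines; this is a sketch of the assumed behavior, not necessarily clusterscope's exact implementation:

import os
import random

def seeded_master_port() -> int:
    # Sketch: derive a stable port from SLURM_JOB_ID so every task in the
    # job computes the same port without any communication.
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is None:
        return -1
    return random.Random(int(job_id)).randint(20_000, 60_000)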
get_master_addr()
Method to get the Rendezvous Master Address. In order of preference, this returns:
- the MASTER_ADDR env var
- if inside a Slurm Job, the first node from: scontrol show hostnames os.environ["SLURM_JOB_NODELIST"]
- 127.0.0.1
import clusterscope
job = clusterscope.get_job()
master_addr = job.get_master_addr()
print(master_addr, type(master_addr))
# 127.0.0.1 <class 'str'>
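The Slurm branch could be reproduced along these lines; again, a sketch, not necessarily the library's exact code:

import os
import subprocess

def first_node_from_slurm() -> str:
    # Sketch: expand the Slurm node list and take the first hostname.
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if not nodelist:
        return "127.0.0.1"
    hostnames = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return hostnames[0]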
set_torch_distributed_env_from_slurm()
Method to set torch distributed env vars from Slurm vars. This assigns values as below:
Torch Distributed      Slurm Var
WORLD_SIZE             SLURM_NTASKS
RANK                   SLURM_PROCID
LOCAL_WORLD_SIZE       SLURM_NTASKS_PER_NODE
LOCAL_RANK             SLURM_LOCALID
MASTER_ADDR            get_master_addr()
MASTER_PORT            get_master_port()
CUDA_VISIBLE_DEVICES   SLURM_LOCALID
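A minimal usage sketch (the init_process_group call is illustrative and assumes PyTorch with NCCL is available):

import clusterscope
import torch.distributed as dist

job = clusterscope.get_job()
job.set_torch_distributed_env_from_slurm()
# torch.distributed can now read WORLD_SIZE, RANK, MASTER_ADDR, etc. from the env
dist.init_process_group(backend="nccl")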