
get_job()

get_job() returns an object with data about the currently running Job:

import clusterscope
job = clusterscope.get_job()

The returned object then exposes all of the methods below:

get_cpus()

If inside a Slurm Job, returns the number of CPUs allocated on the Node according to Slurm (SLURM_CPUS_ON_NODE); otherwise, returns the number of CPUs available on the local node.

import clusterscope
job = clusterscope.get_job()
cpus_count = job.get_cpus()
print(cpus_count, type(cpus_count))
# 80 <class 'int'>
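
A common use is sizing a worker pool to the Slurm allocation rather than to the whole machine. A minimal sketch (the pool and workload here are illustrative, not part of clusterscope):

import clusterscope
from concurrent.futures import ProcessPoolExecutor

job = clusterscope.get_job()
# Size the pool to the CPUs Slurm allocated, rather than os.cpu_count()
with ProcessPoolExecutor(max_workers=job.get_cpus()) as pool:
    results = list(pool.map(abs, [-1, -2, -3]))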

get_gpus()

If inside a Slurm Job, returns the number of GPUs allocated on the Node according to Slurm (SLURM_GPUS_ON_NODE); otherwise, returns the number of GPUs available on the local node.

import clusterscope
job = clusterscope.get_job()
gpus_count = job.get_gpus()
print(gpus_count, type(gpus_count))
# 2 <class 'int'>
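
For example, this can sanity-check that the framework sees the same devices Slurm allocated; a sketch assuming PyTorch (torch is an assumption here, not required by clusterscope):

import clusterscope
import torch  # assumption: PyTorch is the framework in use

job = clusterscope.get_job()
# The devices visible to the framework should match the Slurm allocation
assert torch.cuda.device_count() == job.get_gpus()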

get_job_id()

If inside a Slurm Job, returns the Job ID; otherwise, returns 0.

import clusterscope
job = clusterscope.get_job()
job_id = job.get_job_id()
print(job_id, type(job_id))
# <job-id> <class 'int'>

get_job_name()

If inside a Slurm Job, returns the name of the Job.

import clusterscope
job = clusterscope.get_job()
job_name = job.get_job_name()
print(job_name, type(job_name))
# <job-name> <class 'str'>
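
These identifiers are handy for tagging run artifacts; an illustrative sketch (the output path is hypothetical):

import clusterscope

job = clusterscope.get_job()
# Hypothetical output directory tagged with the Job's name and ID
out_dir = f"runs/{job.get_job_name()}_{job.get_job_id()}"
print(out_dir)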

get_global_rank()

Method to get the Global Rank of the current process. In order of preference, this returns:

  • RANK env var
  • if Slurm Job, SLURM_PROCID env var
  • 0

import clusterscope
job = clusterscope.get_job()
global_rank = job.get_global_rank()
print(global_rank, type(global_rank))
# 0 <class 'int'>

get_local_rank()

Method to get the Local Rank of the current process. In order of preference, this returns:

  • LOCAL_RANK env var
  • if Slurm Job, SLURM_LOCALID env var
  • 0

import clusterscope
job = clusterscope.get_job()
local_rank = job.get_local_rank()
print(local_rank, type(local_rank))
# 0 <class 'int'>

get_world_size()

Method to get the World Size of the current Job. In order of preference, this returns:

  • WORLD_SIZE env var
  • if Slurm Job, SLURM_NTASKS env var
  • 1

import clusterscope
job = clusterscope.get_job()
world_size = job.get_world_size()
print(world_size, type(world_size))
# 10 <class 'int'>
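
The precedence for all three lookups follows the same pattern; roughly equivalent logic, as a sketch (illustrative, not clusterscope's actual code):

import os

# RANK/LOCAL_RANK/WORLD_SIZE win if set; the SLURM_* vars only exist
# inside a Slurm Job, so the nested lookup approximates the precedence
global_rank = int(os.environ.get("RANK", os.environ.get("SLURM_PROCID", 0)))
local_rank = int(os.environ.get("LOCAL_RANK", os.environ.get("SLURM_LOCALID", 0)))
world_size = int(os.environ.get("WORLD_SIZE", os.environ.get("SLURM_NTASKS", 1)))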

get_is_rank_zero()

Returns a boolean that indicates whether get_global_rank() returns 0.

import clusterscope
job = clusterscope.get_job()
is_rank_zero = job.get_is_rank_zero()
print(is_rank_zero, type(is_rank_zero))
# True <class 'bool'>
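
A typical use is gating side effects to a single process; an illustrative sketch:

import clusterscope

job = clusterscope.get_job()
if job.get_is_rank_zero():
    # e.g. only one process should write logs or checkpoints
    print("running one-time setup on rank zero")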

get_master_port()

Method to get the Rendezvous Master Port. In order of preference, this returns:

  • MASTER_PORT env var
  • A random int in (20_000, 60_000), seeded with SLURM_JOB_ID (or -1 when not in a Slurm Job)

import clusterscope
job = clusterscope.get_job()
master_port = job.get_master_port()
print(master_port, type(master_port))
# 20000 <class 'int'>
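
Seeding with the Job ID means every process in the same Job computes the same port without communicating; roughly equivalent logic (illustrative, not clusterscope's actual code):

import os
import random

# Same SLURM_JOB_ID -> same seed -> same port on every process
seed = int(os.environ.get("SLURM_JOB_ID", -1))
port = random.Random(seed).randint(20_000, 60_000)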

get_master_addr()

Method to get the Rendezvous Master Address. In order of preference, this returns:

  • MASTER_ADDR env var
  • if Slurm Job, the first hostname from scontrol show hostnames $SLURM_JOB_NODELIST
  • 127.0.0.1

import clusterscope
job = clusterscope.get_job()
master_addr = job.get_master_addr()
print(master_addr, type(master_addr))
# 127.0.0.1 <class 'str'>
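
The Slurm branch expands the compact node list and takes the first host; a rough sketch of that step (illustrative, not clusterscope's actual code):

import os
import subprocess

# Expand the compact node list (e.g. "node[01-04]") into hostnames
# and take the first one as the rendezvous host
hostnames = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.splitlines()
master_addr = hostnames[0]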

set_torch_distributed_env_from_slurm()

Method to set the torch.distributed environment variables from Slurm variables. It assigns values as follows:

Torch Distributed Var     Slurm Var
WORLD_SIZE                SLURM_NTASKS
RANK                      SLURM_PROCID
LOCAL_WORLD_SIZE          SLURM_NTASKS_PER_NODE
LOCAL_RANK                SLURM_LOCALID
MASTER_ADDR               get_master_addr()
MASTER_PORT               get_master_port()
CUDA_VISIBLE_DEVICES      SLURM_LOCALID
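
Following the pattern of the other methods, a usage sketch (the torch.distributed call is an illustrative follow-up, not part of clusterscope):

import clusterscope
import torch.distributed as dist  # assumption: PyTorch is in use

job = clusterscope.get_job()
job.set_torch_distributed_env_from_slurm()
# torch.distributed can now initialize from the env vars set above
dist.init_process_group(backend="nccl", init_method="env://")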