fairseq2.gang

This module provides the implementation of the Gang class and its related classes for managing collective operations in a distributed environment.
Classes

- class fairseq2.gang.Gang
  Bases: ABC

  Represents a set of processes that work collectively.
  - abstract all_reduce(tensor, op)
    Reduce tensor across all processes.

    Parameters:
      - tensor (Tensor) – The input and output tensor of the operation.
      - op (ReduceOperation) – The element-wise reduce operation.
  - abstract all_gather(output_tensor, input_tensor)
    Gather tensors from all processes and put them in output_tensor.
  - abstract all_gather_to_list(output_tensors, input_tensor)
    Gather tensors from all processes and put them in output_tensors.
  - abstract broadcast(tensor, source_rank=0)
    Broadcast tensor from source_rank to all processes.
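The collectives above operate in place and are implemented by the concrete gangs described below. As a minimal sketch of writing code against the abstract Gang interface, the hypothetical helper below averages a per-process loss; it assumes that ReduceOperation exposes a SUM member and that a gang reports its process count as size, neither of which is spelled out in this excerpt.

```python
from torch import Tensor

from fairseq2.gang import Gang, ReduceOperation


def global_mean_loss(gang: Gang, loss: Tensor) -> Tensor:
    """Average a per-process loss across every process in ``gang`` (hypothetical helper)."""
    # all_reduce treats ``loss`` as both input and output, so it is modified in place.
    gang.all_reduce(loss, ReduceOperation.SUM)

    # Assumes the gang exposes the number of participating processes as ``size``.
    return loss / gang.size
```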
- class fairseq2.gang.AbstractGang(rank, size, device)
  Bases: Gang

  Provides a skeletal implementation of Gang.

  Parameters:
    - rank (int) – The rank of this process in the gang.
    - size (int) – The number of processes that are part of the gang.
    - device (Device) – The associated device.
- final class fairseq2.gang.FakeGang(device=None, *, rank=0, size=1)
  Bases: AbstractGang

  Represents a non-distributed gang for local use.

  Parameters:
    - device (Device | None) – The associated device.
    - rank (int) – The emulated rank of this process in the gang.
    - size (int) – The emulated number of processes that are part of the gang.
  - all_reduce(tensor, op)
    Reduce tensor across all processes.

    Parameters:
      - tensor (Tensor) – The input and output tensor of the operation.
      - op (ReduceOperation) – The element-wise reduce operation.
  - all_gather(output_tensor, input_tensor)
    Gather tensors from all processes and put them in output_tensor.
  - all_gather_to_list(output_tensors, input_tensor)
    Gather tensors from all processes and put them in output_tensors.
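Because a FakeGang has a single member, its collectives reduce to local copies, which makes it handy for unit tests and single-device runs. A minimal sketch, assuming the documented defaults (rank 0, size 1) and that ReduceOperation provides a SUM member:

```python
import torch

from fairseq2.gang import FakeGang, ReduceOperation

# A non-distributed gang: this process is rank 0 of a gang of size 1.
gang = FakeGang()

loss = torch.tensor(2.5)

# With a single process, the reduced sum is just the local value, so
# distributed-style code runs unchanged on one device.
gang.all_reduce(loss, ReduceOperation.SUM)
```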
- final class fairseq2.gang.ProcessGroupGang(pg, device, *, monitor_pg=None)
  Bases: AbstractGang

  Represents a gang that wraps a process group.

  Parameters:
    - pg (ProcessGroup) – The process group to wrap.
    - device (Device) – The associated device.
  - classmethod init_root_process_group(device, *, timeout=None, high_priority=False, num_threads=None, monitored=False)
    Initialize the root process group and wrap it as a gang.

    Parameters:
      - device (device) – The device for which to initialize the gang. For CUDA devices, NCCL is used; for CPU, Gloo.
      - timeout (timedelta | None) – The timeout for collective operations. If None, the default timeout value (15 minutes) will be used.
      - high_priority (bool) – If True, the underlying collective operations will be performed on high-priority channels (e.g. CUDA streams).
      - num_threads (int | None) – The number of threads to use for inter-op parallelism.
      - monitored (bool) – If True, puts a monitored barrier before every collective call for troubleshooting purposes.

    Return type: ProcessGroupGang
  - all_reduce(tensor, op)
    Reduce tensor across all processes.

    Parameters:
      - tensor (Tensor) – The input and output tensor of the operation.
      - op (ReduceOperation) – The element-wise reduce operation.
  - all_gather(output_tensor, input_tensor)
    Gather tensors from all processes and put them in output_tensor.
  - all_gather_to_list(output_tensors, input_tensor)
    Gather tensors from all processes and put them in output_tensors.
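A minimal sketch of bringing up a ProcessGroupGang in a distributed job. It assumes the process was launched through the usual torch.distributed environment (for example with torchrun, so the process group can rendezvous) and that ReduceOperation exposes a SUM member; on a GPU host each rank would normally select its own CUDA device.

```python
import torch

from fairseq2.gang import ProcessGroupGang, ReduceOperation

# In a real job, each rank would pick its own device (e.g. cuda:{LOCAL_RANK});
# CPU is used here so the sketch also works with the Gloo backend.
device = torch.device("cpu")

# Initialize the root process group (NCCL for CUDA, Gloo for CPU) and wrap it.
gang = ProcessGroupGang.init_root_process_group(device, monitored=True)

# Every process contributes 1; after the reduction the tensor holds the gang size.
ones = torch.ones(1, device=device)
gang.all_reduce(ones, ReduceOperation.SUM)
```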
Functions
- fairseq2.gang.setup_root_gang(device, *, timeout=None, high_priority=False, monitored=False)
  Create the root gang of this process.

  Parameters:
    - device (device) – The device for which to initialize the gang. For CUDA devices, NCCL is used; for CPU, Gloo.
    - timeout (timedelta | None) – The timeout for collective operations. If None, the default timeout value (15 minutes) will be used.
    - high_priority (bool) – If True, the underlying collective operations will be performed on high-priority channels (e.g. CUDA streams).
    - monitored (bool) – If True, puts a monitored barrier before every collective call for troubleshooting purposes.

  Return type: Gang
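A minimal sketch of creating the root gang; as above, it assumes the process was launched with the environment torch.distributed expects, and that the returned gang reports its rank and size as attributes.

```python
import torch

from fairseq2.gang import setup_root_gang

# Each rank would normally pass its own device (e.g. cuda:{LOCAL_RANK}).
root_gang = setup_root_gang(torch.device("cpu"), monitored=True)

# Assumed attributes: the gang's own rank and the number of processes in it.
print(f"rank {root_gang.rank} of {root_gang.size}")
```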
- fairseq2.gang.setup_parallel_gangs(root_gang, *, tp_size=1)
  Set up gangs to be used for data and model parallelism.

  For instance, if we have 8 devices denoted by g0 to g7 and 2 devices are used for tensor parallelism, this function will make 4 tensor parallel gangs and 2 data parallel gangs:

  - 4 tensor parallel gangs: [g0, g1], [g2, g3], [g4, g5], [g6, g7]
  - 2 data parallel gangs: [g0, g2, g4, g6], [g1, g3, g5, g7]

  For efficiency, the caller should make sure adjacent ranks are on the same host. For example, if there are two hosts with a total of 16 GPUs, ranks 0 to 7 belong to the first host and ranks 8 to 15 belong to the second host.
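A minimal sketch of the 8-process, tp_size=2 layout described above. The return value is not documented in this excerpt; the sketch assumes the function hands back the data-parallel and tensor-parallel gangs for the calling rank, so treat the unpacking as an assumption.

```python
import torch

from fairseq2.gang import setup_parallel_gangs, setup_root_gang

# Root gang over all 8 processes (g0..g7 in the example above).
root_gang = setup_root_gang(torch.device("cpu"))

# Assumed return shape: the (data-parallel, tensor-parallel) gangs of this rank.
dp_gang, tp_gang = setup_parallel_gangs(root_gang, tp_size=2)

# With tp_size=2, this rank sits in a tensor parallel gang of size 2 and a
# data parallel gang spanning the remaining dimension (size 4 with 8 processes).
assert tp_gang.size == 2
assert dp_gang.size == root_gang.size // 2
```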