fairseq2.gang
This module provides the implementation of the Gang class and its related classes for managing collective operations in a distributed environment.
Classes
- class fairseq2.gang.Gang
  Bases: ABC
  Represents a set of processes that work collectively.
  - abstract all_reduce(tensor, op)
    Reduce `tensor` across all processes.
    - Parameters:
      - tensor (Tensor) – The input and output tensor of the operation.
      - op (ReduceOperation) – The element-wise reduce operation.
  - abstract all_gather(output_tensor, input_tensor)
    Gather tensors from all processes and put them in `output_tensor`.
  - abstract all_gather_to_list(output_tensors, input_tensor)
    Gather tensors from all processes and put them in `output_tensors`.
  - abstract broadcast(tensor, source_rank=0)
    Broadcast `tensor` from `source_rank` to all processes.
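A minimal sketch of how these collectives might be composed, assuming that a Gang instance exposes a `device` attribute (suggested by AbstractGang's constructor below) and that `ReduceOperation` is importable from fairseq2.gang with a `SUM` member; both are assumptions, not confirmed by this page:

```python
import torch

from fairseq2.gang import Gang, ReduceOperation


def sum_across_processes(gang: Gang, value: float) -> float:
    # Each process contributes one scalar; after all_reduce, every
    # rank holds the sum over all ranks.
    t = torch.tensor(value, device=gang.device)  # `gang.device` is assumed
    gang.all_reduce(t, ReduceOperation.SUM)  # `SUM` member is assumed
    return float(t)
```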
- class fairseq2.gang.AbstractGang(rank, size, device)
  Bases: Gang
  Provides a skeletal implementation of Gang.
  - Parameters:
    - rank – The rank of this process in the gang.
    - size – The number of processes that are part of the gang.
    - device (Device) – The associated device.
- final class fairseq2.gang.FakeGang(*, rank=0, size=1, device=None)
  Bases: AbstractGang
  Represents a non-distributed gang for local use.
  - Parameters:
    - rank – The rank of this process in the gang.
    - size – The number of processes that are part of the gang.
    - device (Device | None) – The associated device.
  - all_reduce(tensor, op)
    Reduce `tensor` across all processes.
    - Parameters:
      - tensor (Tensor) – The input and output tensor of the operation.
      - op (ReduceOperation) – The element-wise reduce operation.
  - all_gather(output_tensor, input_tensor)
    Gather tensors from all processes and put them in `output_tensor`.
  - all_gather_to_list(output_tensors, input_tensor)
    Gather tensors from all processes and put them in `output_tensors`.
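A short sketch of using FakeGang to run gang-dependent code in a single process, e.g. in unit tests; the `ReduceOperation.SUM` member is an assumption, as above:

```python
import torch

from fairseq2.gang import FakeGang, ReduceOperation

# A non-distributed gang: rank 0 of a gang of size 1, on the CPU.
gang = FakeGang(device=torch.device("cpu"))

t = torch.tensor([1.0, 2.0])
gang.all_reduce(t, ReduceOperation.SUM)  # with size=1, `t` is unchanged
```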
- final class fairseq2.gang.ProcessGroupGang(pg, device, *, monitor_pg=None)
  Bases: AbstractGang
  Represents a gang that wraps a process group.
  - Parameters:
    - pg (ProcessGroup) – The process group to wrap.
    - device (Device) – The associated device.
  - classmethod init_default_process_group(*, device=None, timeout=None, num_threads=None, monitored=False, ok_initialized=False)
    Initialize the default process group and wrap it as a gang.
    - Parameters:
      - device (Device | None) – If None, the gang will use the default CUDA device of the process if CUDA is available; otherwise, it will use the CPU.
      - timeout (timedelta | None) – The timeout for collective operations. If None, the default timeout value (15 minutes) will be used.
      - num_threads (int | None) – The number of threads to use for intra-op parallelism.
      - monitored (bool) – If True, puts a monitored barrier before every collective call.
      - ok_initialized (bool) – If True, does not raise an error if the default process group is already initialized.
    - Return type:
      ProcessGroupGang
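A sketch of initializing the default process group under a torchrun-style launcher; the keyword arguments follow the parameter list above:

```python
from datetime import timedelta

from fairseq2.gang import ProcessGroupGang

# Creates (or, thanks to `ok_initialized`, reuses) the default process
# group and wraps it as a gang; the device is inferred as described in
# the `device` parameter above.
gang = ProcessGroupGang.init_default_process_group(
    timeout=timedelta(minutes=30), ok_initialized=True
)
```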
  - static from_process_group(pg, device)
    Wrap `pg` as a gang.
    - Parameters:
      - pg (ProcessGroup) – The process group to wrap.
      - device (Device) – The associated device.
    - Return type:
      ProcessGroupGang
  - classmethod from_default_process_group()
    Wrap the default process group as a gang.
    - Return type:
      ProcessGroupGang
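If torch.distributed was already initialized elsewhere, the existing default process group can be wrapped instead of creating a new one; a minimal sketch:

```python
import torch.distributed as dist

from fairseq2.gang import ProcessGroupGang

# Assume some other component is responsible for process group setup.
if not dist.is_initialized():
    dist.init_process_group(backend="gloo")

gang = ProcessGroupGang.from_default_process_group()
```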
  - all_reduce(tensor, op)
    Reduce `tensor` across all processes.
    - Parameters:
      - tensor (Tensor) – The input and output tensor of the operation.
      - op (ReduceOperation) – The element-wise reduce operation.
  - all_gather(output_tensor, input_tensor)
    Gather tensors from all processes and put them in `output_tensor`.
  - all_gather_to_list(output_tensors, input_tensor)
    Gather tensors from all processes and put them in `output_tensors`.
Functions
- fairseq2.gang.setup_default_gang(*, device=None, timeout=None, monitored=False)
  Make the default gang of this process.
  - Parameters:
    - device (Device | None) – If None, the gang will use the default CUDA device of the process if CUDA is available; otherwise, it will use the CPU.
    - timeout (timedelta | None) – The timeout for collective operations. If None, the default timeout value (15 minutes) will be used.
    - monitored (bool) – If True, puts a monitored barrier before every collective call.
  - Return type:
    Gang
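A one-call sketch; with no arguments, the device is picked as described for init_default_process_group above:

```python
from fairseq2.gang import setup_default_gang

# Uses the default CUDA device if available, otherwise the CPU.
root_gang = setup_default_gang()
```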
- fairseq2.gang.setup_parallel_gangs(root_gang, *, tp_size=1)
  Make gangs to be used for data and tensor parallelism (see the usage sketch after this entry).
  For instance, if we have 8 devices denoted by g0 to g7 and 2 devices are used for tensor parallelism, this function will make 4 tensor parallel gangs and 2 data parallel gangs:
  - 4 tensor parallel gangs: [g0, g1], [g2, g3], [g4, g5], [g6, g7]
  - 2 data parallel gangs: [g0, g2, g4, g6], [g1, g3, g5, g7]
  For efficiency, the caller should make sure adjacent ranks are on the same host. For example, if there are two hosts with a total of 16 GPUs, ranks 0 to 7 should belong to the first host and ranks 8 to 15 to the second host.
  - Parameters:
    - root_gang (Gang) – The gang over all processes, to be split into data and tensor parallel gangs.
    - tp_size (int) – The number of devices used for tensor parallelism.
  - Returns:
    Three gangs: the root gang, the data parallel gang that this process is part of, and the tensor parallel gang that this process is part of.
  - Return type:
    Gangs
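A sketch of splitting the root gang for 2-way tensor parallelism; the attribute names on the returned Gangs object (`dp`, `tp`) are assumptions based on the return description above:

```python
from fairseq2.gang import setup_default_gang, setup_parallel_gangs

root_gang = setup_default_gang()

# With 8 processes and tp_size=2, this yields the layout from the
# example above: tensor parallel gangs of size 2 and data parallel
# gangs of size 4.
gangs = setup_parallel_gangs(root_gang, tp_size=2)

dp_gang = gangs.dp  # attribute names are assumptions
tp_gang = gangs.tp
```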
- fairseq2.gang.broadcast_flag(gang, flag, source_rank=0)
  Broadcast `flag` to all processes in `gang` from `source_rank`.
  - Return type:
    bool
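A typical use is agreeing on a control decision, e.g. early stopping, so that every rank takes the same branch before the next collective; a minimal sketch:

```python
from fairseq2.gang import Gang, broadcast_flag

def sync_stop(gang: Gang, local_stop: bool) -> bool:
    # Only the value on `source_rank` matters; every rank receives
    # rank 0's decision and can branch consistently.
    return broadcast_flag(gang, local_stop, source_rank=0)
```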
- fairseq2.gang.get_local_world_size()
  Return the local world size of the running job.
  - Return type:
    int
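A minimal sketch; with a torchrun-style launcher this typically equals the number of ranks per host (an assumption about the launcher, not stated above):

```python
from fairseq2.gang import get_local_world_size

ranks_per_host = get_local_world_size()
```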