spdl.io.transfer_tensor

transfer_tensor(batch: T, /, *, num_caches: int = 4) → T[source]

Transfers PyTorch CPU Tensors to CUDA in a dedicated stream.

This function wraps calls to torch.Tensor.pin_memory() and torch.Tensor.to(), and executes them in a dedicated CUDA stream.

When called in a background thread, the data transfer overlaps with the GPU computation happening in the foreground thread (such as training and inference).

See also

Multi-threading (custom) - the intended way to use this function in a Pipeline.
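The overlap pattern described above can be sketched in plain Python. This is a minimal illustration, not SPDL's actual Pipeline: `transfer` and `compute` are hypothetical stand-ins (in real code they would be spdl.io.transfer_tensor and a training or inference step), and a single-worker thread pool prefetches the next batch while the current one is being processed.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: in real code, `transfer` would be
# spdl.io.transfer_tensor and `compute` a training/inference step.
def transfer(batch):
    return [x * 2 for x in batch]  # placeholder for the CPU-to-GPU copy

def compute(batch):
    return sum(batch)  # placeholder for GPU computation

batches = [[1, 2], [3, 4], [5, 6]]
results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(transfer, batches[0])  # prefetch the first batch
    for nxt in batches[1:] + [None]:
        batch = future.result()  # wait for the in-flight transfer
        if nxt is not None:
            # Start transferring the next batch while computing on this one.
            future = pool.submit(transfer, nxt)
        results.append(compute(batch))

print(results)  # [6, 14, 22]
```

Because the background thread only performs the transfer, the GPU computation in the foreground thread never waits for anything except the one transfer already in flight.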

[Figure: parallelism_transfer.png, illustrating the data transfer overlapping with GPU computation]

Concretely, it performs the following operations.

  1. If a dedicated CUDA stream is not found in the calling thread's thread-local storage, creates and stashes one. (The target device is determined by the "LOCAL_RANK" environment variable.)

  2. Activates the CUDA stream.

  3. Traverses the given object recursively and transfers tensors to the GPU. Data is first copied to page-locked memory via the pin_memory() method, then transferred to the GPU asynchronously (i.e. .to(non_blocking=True)).

  4. Synchronizes the stream to ensure that all data transfers are complete.
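The recursive traversal in step 3 can be sketched in plain Python. This is an illustrative sketch, not SPDL's actual implementation: `_apply` is a hypothetical helper, and the leaf function here multiplies values where the real code would call pin_memory() and .to(non_blocking=True) on tensors only, passing other leaves through.

```python
import dataclasses

def _apply(obj, fn):
    """Recursively apply `fn` to leaves, preserving container structure."""
    if isinstance(obj, (list, tuple)):
        return type(obj)(_apply(v, fn) for v in obj)
    if isinstance(obj, dict):
        return {k: _apply(v, fn) for k, v in obj.items()}
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.replace(
            obj,
            **{f.name: _apply(getattr(obj, f.name), fn)
               for f in dataclasses.fields(obj)},
        )
    # Leaf: the real code would transfer only torch.Tensor leaves, e.g.
    # tensor.pin_memory().to("cuda", non_blocking=True)
    return fn(obj)

@dataclasses.dataclass
class Batch:
    images: list
    label: int

out = _apply(Batch(images=[1, 2], label=3), lambda x: x * 10)
print(out)  # Batch(images=[10, 20], label=30)
```

Note that the returned object has the same container types as the input, which is what lets the function advertise the `T → T` signature.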

Parameters:
  • batch – A torch.Tensor or a composition of tensors in container types such as list, tuple, dict, and dataclass.

  • num_caches

    The number of recently transferred batches to keep references to. This parameter helps mitigate race conditions when using multi-threading with multiple CUDA streams.

    See PyTorch CUDA Race Condition in Multi-threading for details on the rationale behind this parameter.

Returns:

An object of the same type as the input, with the PyTorch tensors transferred to the CUDA device.