Custom parts reference¶
Sparse CUDA kernels¶
1. Building the kernels¶
xFormers transparently supports CUDA kernels to implement sparse attention computations, some of which are based on Sputnik. These kernels require xFormers to be installed from source, and the target machine to be able to compile CUDA source code.
git clone git@github.com:fairinternal/xformers.git
conda create --name xformer_env python=3.8
conda activate xformer_env
cd xformers
pip install -r requirements.txt
pip install -e .
Common issues are related to:
- the NVCC version and the current CUDA runtime not matching. You can often change the CUDA runtime with module unload cuda; module load cuda/xx.x, and you may need to change nvcc as well
- the version of GCC that you're using not matching the current NVCC capabilities
- the TORCH_CUDA_ARCH_LIST env variable not being set to the architectures that you want to support. A suggested setup (slow to build but comprehensive) is export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;8.0;8.6"
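As a rough checklist, the points above can be verified from the shell before building. This is a sketch, not part of the xFormers docs: the module commands assume an Environment Modules setup, and cuda/xx.x stands for whichever toolkit version your cluster provides.

```shell
# Check that the nvcc toolkit version matches the CUDA runtime PyTorch was built for
nvcc --version
python -c "import torch; print(torch.version.cuda)"

# Check the GCC version against what this nvcc release supports
gcc --version

# Build for a broad set of GPU architectures (slow to build but comprehensive)
export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;8.0;8.6"
```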
The sparse attention computation is automatically triggered when using the scaled dot-product attention with a sparse enough mask (currently, fewer than 30% of the values are true). There is nothing specific to do, and a couple of examples are provided in the tutorials.
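The density check described above can be sketched in plain Python. The 30% threshold is taken from the text; `mask_density` is an illustrative helper, not an xFormers function.

```python
def mask_density(mask):
    """Fraction of True entries in a boolean attention mask (list of lists)."""
    total = sum(len(row) for row in mask)
    true_count = sum(sum(row) for row in mask)
    return true_count / total

# A causal (lower-triangular) mask over 4 positions: 10 of the 16 entries are True.
causal = [[j <= i for j in range(4)] for i in range(4)]

# Per the text above, the sparse kernels kick in when fewer than 30% of the
# mask values are true; this small causal mask is too dense to qualify.
use_sparse_kernels = mask_density(causal) < 0.30
```

For longer sequences a causal mask does drop below the threshold: the density of a lower-triangular mask is roughly 1/2, per-band masks much less, which is why long-sequence patterns tend to trigger the sparse path.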
Triton parts¶
We use Triton to implement the following parts. They are only visible on a CUDA-enabled machine, and Triton needs to be installed (pip install triton); if either of these conditions is not met, a warning is issued.
2. Possible usage¶
The following parts are independent and can be used as-is in any model, provided the above requirements are fulfilled (Triton is installed and a CUDA GPU is present). They are used by default, when possible, in some of the xFormers building blocks.
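The availability gating described above can be sketched as follows. This is a minimal illustration, not the actual xFormers mechanism; `triton_parts_available` is a hypothetical helper name.

```python
import warnings


def triton_parts_available():
    """Return True when both Triton and a CUDA device are available."""
    try:
        import triton  # noqa: F401
        import torch
    except ImportError:
        # Triton (or PyTorch) is not installed: fall back to the default path.
        return False
    return torch.cuda.is_available()


# Mirror the behaviour described in the text: warn when the Triton-backed
# parts cannot be used, instead of failing hard.
if not triton_parts_available():
    warnings.warn("Triton or a CUDA GPU is unavailable; Triton parts are disabled.")
```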