Attention mechanisms

class xformers.components.attention.ScaledDotProduct(dropout: float = 0.0, causal: bool = False, seq_len: Optional[int] = None, to_seq_len: Optional[int] = None, *args, **kwargs)[source]

Bases: Attention

Implements the scaled dot-product attention proposed in "Attention Is All You Need", Vaswani et al. (2017).

mask: Optional[AttentionMask]
forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs) → Tensor[source]

att_mask: an optional 2D or 3D mask which masks attention at certain positions.

  • If the mask is boolean, a value of True keeps the value, while a value of False masks it.

    Key padding masks (dimension: batch x sequence length) and attention masks (dimension: sequence length x sequence length OR batch x sequence length x sequence length) can be combined and passed in here. The maybe_merge_masks method provided in the utils can be used for that merging.

  • If the mask has a float type, an additive mask is expected (masked values are -inf)
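
A minimal usage sketch (the tensor shapes and the dropout value are assumptions made for this example, not requirements):

    import torch
    from xformers.components.attention import ScaledDotProduct

    # Assumed layout: (batch [* num_heads], sequence length, head dimension)
    B, S, D = 2, 16, 32
    q, k, v = (torch.randn(B, S, D) for _ in range(3))

    attention = ScaledDotProduct(dropout=0.1, causal=False)

    # Boolean mask: True keeps a position, False masks it out.
    # An AttentionMask instance could be passed instead of a raw tensor.
    bool_mask = torch.tril(torch.ones(S, S, dtype=torch.bool))

    out = attention(q, k, v, att_mask=bool_mask)  # same shape as q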

class xformers.components.attention.LocalAttention(dropout: float = 0.0, causal: bool = False, window_size: int = 5, force_sparsity: bool = False, *args, **kwargs)[source]

Bases: Attention

__init__(dropout: float = 0.0, causal: bool = False, window_size: int = 5, force_sparsity: bool = False, *args, **kwargs)[source]

An implementation of sliding-window attention, as proposed in RoutingTransformer, Longformer or BigBird.

Parameters
  • dropout (float) – the probability of an output being randomly dropped at training time

  • causal (bool) – apply a causal mask, so that a position cannot attend to future positions

  • window_size (int) – the overall window size for local attention. An odd number is expected if the mask is not causal, since the window is evenly distributed on both sides of each query

forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs)[source]
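
A hedged usage sketch (shapes and the window size are illustrative):

    import torch
    from xformers.components.attention import LocalAttention

    B, S, D = 2, 64, 32
    q, k, v = (torch.randn(B, S, D) for _ in range(3))

    # window_size is odd here since causal=False: the window is centred on each query
    local_attention = LocalAttention(dropout=0.0, causal=False, window_size=5)
    out = local_attention(q, k, v)
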
class xformers.components.attention.LinformerAttention(dropout: float, seq_len: int, k: Optional[int] = None, *args, **kwargs)[source]

Bases: Attention

__init__(dropout: float, seq_len: int, k: Optional[int] = None, *args, **kwargs)[source]

Linformer attention mechanism, from "Linformer: Self-Attention with Linear Complexity", Wang et al. (2020). The original notation is kept as-is.

forward(q: Tensor, k: Tensor, v: Tensor, *args, **kwargs)[source]
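
A hedged usage sketch; note that seq_len is fixed at construction time, and the value of k (the projected length) is illustrative:

    import torch
    from xformers.components.attention import LinformerAttention

    B, S, D = 2, 256, 64
    q, key, v = (torch.randn(B, S, D) for _ in range(3))

    # k is the compressed sequence length that keys and values are projected to
    linformer = LinformerAttention(dropout=0.0, seq_len=S, k=32)
    out = linformer(q, key, v)
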
class xformers.components.attention.NystromAttention(dropout: float, num_heads: int, num_landmarks: int = 64, landmark_pooling: Optional[Module] = None, causal: bool = False, use_razavi_pinverse: bool = True, pinverse_original_init: bool = False, inv_iterations: int = 6, v_skip_connection: Optional[Module] = None, conv_kernel_size: Optional[int] = None, *args, **kwargs)[source]

Bases: Attention

__init__(dropout: float, num_heads: int, num_landmarks: int = 64, landmark_pooling: Optional[Module] = None, causal: bool = False, use_razavi_pinverse: bool = True, pinverse_original_init: bool = False, inv_iterations: int = 6, v_skip_connection: Optional[Module] = None, conv_kernel_size: Optional[int] = None, *args, **kwargs)[source]

Nystrom attention mechanism, from Nystromformer.

"A Nystrom-based Algorithm for Approximating Self-Attention."
Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., Singh, V. (2021)

Reference codebase: https://github.com/mlpen/Nystromformer
forward(q: Tensor, k: Tensor, v: Tensor, key_padding_mask: Optional[Tensor] = None, *args, **kwargs)[source]
key_padding_mask: only a key padding mask is accepted here. The size must be (batch size, sequence length) or (batch size * num_heads, 1, sequence length). If the dimensions are not correct, the mask will be ignored. An additive mask is expected, meaning float values using -inf to mask values.
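
A hedged usage sketch, assuming inputs laid out as (batch * num_heads, sequence length, head dimension):

    import torch
    from xformers.components.attention import NystromAttention

    B, H, S, D = 2, 4, 128, 32
    q, k, v = (torch.randn(B * H, S, D) for _ in range(3))

    nystrom = NystromAttention(dropout=0.0, num_heads=H, num_landmarks=32)

    # Additive key padding mask, shape (batch size, sequence length):
    # 0 keeps a position, -inf masks it
    key_padding_mask = torch.zeros(B, S)
    key_padding_mask[:, -16:] = float("-inf")

    out = nystrom(q, k, v, key_padding_mask=key_padding_mask)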

class xformers.components.attention.RandomAttention(dropout: float, causal: bool = False, r: float = 0.01, constant_masking: bool = True, force_sparsity: bool = False, *args, **kwargs)[source]

Bases: Attention

__init__(dropout: float, causal: bool = False, r: float = 0.01, constant_masking: bool = True, force_sparsity: bool = False, *args, **kwargs)[source]

“Random” attention, as proposed for instance in BigBird. “Random” means here that each query can attend to a random set of keys. This implementation is sparse-aware, meaning that the empty attention parts will not be represented in memory.

Parameters
  • r (float) – the ratio in [0,1] of keys that the query can attend to

  • constant_masking (bool) – if True, keep the same random set for all queries.

forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs)[source]
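
A hedged usage sketch (the shapes and the value of r are illustrative):

    import torch
    from xformers.components.attention import RandomAttention

    B, S, D = 2, 64, 32
    q, k, v = (torch.randn(B, S, D) for _ in range(3))

    # Each query attends to roughly r * sequence_length randomly chosen keys
    random_attention = RandomAttention(dropout=0.0, causal=False, r=0.1, constant_masking=True)
    out = random_attention(q, k, v)
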
class xformers.components.attention.OrthoFormerAttention(dropout: float, num_landmarks: int = 32, subsample_fraction: float = 1.0, landmark_selection: LandmarkSelection = LandmarkSelection.Orthogonal, *args, **kwargs)[source]

Bases: Attention

__init__(dropout: float, num_landmarks: int = 32, subsample_fraction: float = 1.0, landmark_selection: LandmarkSelection = LandmarkSelection.Orthogonal, *args, **kwargs)[source]

Orthoformer attention mechanism.

"Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers"
Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer,
C., Vedaldi, A., Henriques, J. (2021)

Reference codebase: https://github.com/facebookresearch/Motionformer
forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs)[source]
class xformers.components.attention.GlobalAttention(dropout: float, attention_query_mask: Tensor, causal: bool = False, force_sparsity: bool = False, *_, **__)[source]

Bases: Attention

__init__(dropout: float, attention_query_mask: Tensor, causal: bool = False, force_sparsity: bool = False, *_, **__)[source]

Global attention, as proposed for instance in BigBird or Longformer.

Global means in that case that the queries positively labelled in the `attention_query_mask` can attend to all the other queries, while the queries negatively labelled in the `attention_query_mask` cannot attend to any other query.

This implementation is sparse-aware, meaning that the empty attention parts will not be represented in memory.

Parameters
  • dropout (float) – probability of an element being zeroed

  • attention_query_mask (torch.Tensor) – a boolean mask over the queries; where it is True, the corresponding query can attend to all the others

forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *_, **__)[source]
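
A hedged usage sketch; the (sequence length, 1) shape of attention_query_mask is an assumption made for this example:

    import torch
    from xformers.components.attention import GlobalAttention

    B, S, D = 2, 64, 32

    # One flag per query position: True marks a "global" query
    attention_query_mask = torch.zeros(S, 1, dtype=torch.bool)
    attention_query_mask[:4] = True  # e.g. the first 4 tokens act as global tokens

    global_attention = GlobalAttention(dropout=0.0, attention_query_mask=attention_query_mask)

    q, k, v = (torch.randn(B, S, D) for _ in range(3))
    out = global_attention(q, k, v)
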
class xformers.components.attention.FavorAttention(causal: bool = False, dropout: float = 0.0, dim_features: Optional[int] = None, dim_head: Optional[int] = None, iter_before_redraw: Optional[int] = None, feature_map_type: FeatureMapType = FeatureMapType.SMReg, normalize_inputs: bool = False, *_, **__)[source]

Bases: Attention

__init__(causal: bool = False, dropout: float = 0.0, dim_features: Optional[int] = None, dim_head: Optional[int] = None, iter_before_redraw: Optional[int] = None, feature_map_type: FeatureMapType = FeatureMapType.SMReg, normalize_inputs: bool = False, *_, **__)[source]

Kernelized attention, as proposed in Performers ("Rethinking Attention with Performers", K. Choromanski et al. (2020)).

FAVOR stands for “Fast Attention Via positive Orthogonal Random features”

Parameters
  • dropout (float) – the probability of an output to be randomly dropped at training time

  • dim_features (int) – the dimension of the random features space

  • iter_before_redraw (int) – the number of steps (forward calls) before a redraw of the features

  • feature_map_type (FeatureMapType) – the type of feature map being used, for instance orthogonal random features

forward(q: Tensor, k: Tensor, v: Tensor, *_, **__)[source]
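
A hedged usage sketch; the shapes and iter_before_redraw value are illustrative, and dim_features and feature_map_type are left to their defaults:

    import torch
    from xformers.components.attention import FavorAttention

    B, S, D = 2, 64, 32
    q, k, v = (torch.randn(B, S, D) for _ in range(3))

    favor = FavorAttention(
        dropout=0.0,
        dim_head=D,              # per-head dimension of the inputs
        iter_before_redraw=100,  # redraw the random features every 100 forward calls
    )
    out = favor(q, k, v)
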
class xformers.components.attention.Attention(dropout: Optional[float] = None, *args, **kwargs)[source]

Bases: Module

The base Attention mechanism, typically a sub-part of multi-head attention.

classmethod from_config(config: AttentionConfig) → Self[source]
abstract forward(q: Tensor, k: Tensor, v: Tensor, *args, **kwargs) → Tensor[source]
class xformers.components.attention.AttentionMask(additive_mask: Tensor, is_causal: bool = False)[source]

Bases: object

Holds an attention mask, along with a couple of helpers and attributes.

to_bool() → Tensor[source]
classmethod from_bool(x: Tensor) → Self[source]

Create an AttentionMask given a boolean pattern.

Warning: we assume here that True implies that the value should be computed.

classmethod from_multiplicative(x: Tensor) → Self[source]

Create an AttentionMask given a multiplicative attention mask.

classmethod make_causal(seq_len: int, to_seq_len: Optional[int] = None, device: Optional[device] = None, dtype: Optional[dtype] = None) → Self[source]
make_crop(seq_len: int, to_seq_len: Optional[int] = None) → AttentionMask[source]

Return a cropped attention mask, whose underlying tensor is a view of this one

property device
property is_sparse
property ndim
property dtype
property shape
to(device: Optional[device] = None, dtype: Optional[dtype] = None) → AttentionMask[source]
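
A hedged sketch of the helpers above (the mask sizes are illustrative):

    import torch
    from xformers.components.attention import AttentionMask

    S = 16

    # Causal (lower-triangular) mask, stored internally in additive form
    causal = AttentionMask.make_causal(seq_len=S)

    # Build a mask from a boolean pattern: True means "compute this position"
    keep = torch.rand(S, S) > 0.1
    mask = AttentionMask.from_bool(keep)

    # Inspect and convert
    print(mask.shape, mask.dtype, mask.is_sparse)
    bool_view = mask.to_bool()
    cropped = causal.make_crop(seq_len=8)
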
xformers.components.attention.build_attention(config: Union[Dict[str, Any], AttentionConfig])[source]

Builds an attention from a config.

This assumes a 'name' key in the config which is used to determine what attention class to instantiate. For instance, a config {"name": "my_attention", "foo": "bar"} will find a class that was registered as "my_attention" (see register_attention()) and call .from_config on it.
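
A hedged example, assuming the built-in ScaledDotProduct is registered under the name "scaled_dot_product"; the extra keys are matched against the corresponding attention config:

    from xformers.components.attention import build_attention

    my_config = {
        "name": "scaled_dot_product",  # assumed registration name of ScaledDotProduct
        "dropout": 0.1,
        "causal": True,
    }

    attention = build_attention(my_config)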

xformers.components.attention.register_attention(name: str, config: Any = xformers.components.attention.base.AttentionConfig)

Registers a subclass.

This decorator allows xFormers to instantiate a given subclass from a configuration file, even if the class itself is not part of the xFormers library.
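
A hedged sketch of registering a custom attention; MyAttention, MyAttentionConfig and my_setting are hypothetical names used only for this example:

    from dataclasses import dataclass

    import torch
    from xformers.components.attention import Attention, build_attention, register_attention
    from xformers.components.attention.base import AttentionConfig


    @dataclass
    class MyAttentionConfig(AttentionConfig):
        my_setting: int = 4  # hypothetical extra setting


    @register_attention("my_attention", MyAttentionConfig)
    class MyAttention(Attention):
        def __init__(self, dropout: float = 0.0, my_setting: int = 4, *args, **kwargs):
            super().__init__(dropout=dropout)
            self.my_setting = my_setting

        def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, *args, **kwargs):
            # Toy behaviour: plain (unscaled) dot-product attention
            att = torch.softmax(q @ k.transpose(-2, -1), dim=-1)
            return att @ v


    # The custom class can now be built from a plain config dictionary
    attention = build_attention({"name": "my_attention", "dropout": 0.0, "my_setting": 8})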