Attention mechanisms¶
- class xformers.components.attention.ScaledDotProduct(dropout: float = 0.0, causal: bool = False, seq_len: Optional[int] = None, to_seq_len: Optional[int] = None, *args, **kwargs)[source]¶
Bases: Attention
Implements the scaled dot-product attention proposed in “Attention Is All You Need”, Vaswani et al. (2017).
- mask: Optional[AttentionMask]¶
- forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs) Tensor [source]¶
att_mask – A 2D or 3D mask which ignores attention at certain positions.
- If the mask is boolean, a value of True keeps the value, while a value of False masks it. Key padding masks (dimension: batch x sequence length) and attention masks (dimension: sequence length x sequence length OR batch x sequence length x sequence length) can be combined and passed in here. The maybe_merge_masks method provided in the utils can be used for that merging.
- If the mask has a float type, then an additive mask is expected (masked values are -inf).
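Example – a minimal usage sketch (tensor shapes and the plain boolean mask are illustrative assumptions, not requirements of the API):

    import torch
    from xformers.components.attention import ScaledDotProduct

    attention = ScaledDotProduct(dropout=0.1, causal=False)

    B, S, D = 2, 16, 64  # batch, sequence length, head dimension (illustrative)
    q, k, v = (torch.randn(B, S, D) for _ in range(3))

    # Boolean 2D mask: True keeps a position, False masks it out
    att_mask = torch.ones(S, S).tril().bool()

    out = attention(q, k, v, att_mask=att_mask)  # same shape as q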
- class xformers.components.attention.LocalAttention(dropout: float = 0.0, causal: bool = False, window_size: int = 5, force_sparsity: bool = False, *args, **kwargs)[source]¶
Bases: Attention
- __init__(dropout: float = 0.0, causal: bool = False, window_size: int = 5, force_sparsity: bool = False, *args, **kwargs)[source]¶
An implementation of sliding-window attention, as proposed in Routing Transformer, Longformer or BigBird.
- Parameters
dropout (float) – the probability of an output to be randomly dropped at training time
causal (bool) – apply a causal mask, so that attention cannot be paid to future positions
window_size (int) – the overall window size for local attention. An odd number is expected if the mask is not causal, as the window is distributed evenly on both sides of each query
- forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs)[source]¶
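Example – a minimal sketch (shapes are illustrative; the odd window size follows the constraint stated above for non-causal masks):

    import torch
    from xformers.components.attention import LocalAttention

    # Non-causal local attention: the window is centred on each query,
    # hence the odd window_size
    attention = LocalAttention(dropout=0.0, causal=False, window_size=5)

    B, S, D = 2, 128, 32  # illustrative shapes
    q, k, v = (torch.randn(B, S, D) for _ in range(3))
    out = attention(q, k, v)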
- class xformers.components.attention.LinformerAttention(dropout: float, seq_len: int, k: Optional[int] = None, *args, **kwargs)[source]¶
Bases: Attention
- class xformers.components.attention.NystromAttention(dropout: float, num_heads: int, num_landmarks: int = 64, landmark_pooling: Optional[Module] = None, causal: bool = False, use_razavi_pinverse: bool = True, pinverse_original_init: bool = False, inv_iterations: int = 6, v_skip_connection: Optional[Module] = None, conv_kernel_size: Optional[int] = None, *args, **kwargs)[source]¶
Bases: Attention
- __init__(dropout: float, num_heads: int, num_landmarks: int = 64, landmark_pooling: Optional[Module] = None, causal: bool = False, use_razavi_pinverse: bool = True, pinverse_original_init: bool = False, inv_iterations: int = 6, v_skip_connection: Optional[Module] = None, conv_kernel_size: Optional[int] = None, *args, **kwargs)[source]¶
Nystrom attention mechanism, from Nystromformer.
"A Nystrom-based Algorithm for Approximating Self-Attention." Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., Singh, V. (2021) Reference codebase: https://github.com/mlpen/Nystromformer
- forward(q: Tensor, k: Tensor, v: Tensor, key_padding_mask: Optional[Tensor] = None, *args, **kwargs)[source]¶
- key_padding_mask – Only a key padding mask is accepted here. The size must be (batch size, sequence length) or (batch size * num_heads, 1, sequence length). If the dimensions are not correct, the mask will be ignored. An additive mask is expected, meaning float values using “-inf” to mask values.
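Example – a sketch of how the key padding mask could be passed, assuming the per-head (batch size * num_heads, sequence length, head dimension) layout produced by the multi-head wrapper; shapes are illustrative:

    import torch
    from xformers.components.attention import NystromAttention

    num_heads = 4
    attention = NystromAttention(dropout=0.0, num_heads=num_heads, num_landmarks=64)

    B, S, D = 2, 256, 16  # illustrative shapes, per-head dimension D
    q, k, v = (torch.randn(B * num_heads, S, D) for _ in range(3))

    # Additive key padding mask, shape (batch size, sequence length):
    # 0. keeps a key, -inf masks it (here the last 16 positions)
    key_padding_mask = torch.zeros(B, S)
    key_padding_mask[:, -16:] = float("-inf")

    out = attention(q, k, v, key_padding_mask=key_padding_mask)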
- class xformers.components.attention.RandomAttention(dropout: float, causal: bool = False, r: float = 0.01, constant_masking: bool = True, force_sparsity: bool = False, *args, **kwargs)[source]¶
Bases: Attention
- __init__(dropout: float, causal: bool = False, r: float = 0.01, constant_masking: bool = True, force_sparsity: bool = False, *args, **kwargs)[source]¶
“Random” attention, as proposed for instance in BigBird. Random means in this case that each query can attend to a random set of keys. This implementation is sparse-aware, meaning that the empty parts of the attention matrix are not represented in memory.
- forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs)[source]¶
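Example – a minimal sketch (shapes and the sparsity ratio r are illustrative):

    import torch
    from xformers.components.attention import RandomAttention

    # Each query attends to roughly r * sequence_length randomly chosen keys
    attention = RandomAttention(dropout=0.0, causal=False, r=0.05)

    B, S, D = 2, 128, 32
    q, k, v = (torch.randn(B, S, D) for _ in range(3))
    out = attention(q, k, v)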
- class xformers.components.attention.OrthoFormerAttention(dropout: float, num_landmarks: int = 32, subsample_fraction: float = 1.0, landmark_selection: LandmarkSelection = LandmarkSelection.Orthogonal, *args, **kwargs)[source]¶
Bases: Attention
- __init__(dropout: float, num_landmarks: int = 32, subsample_fraction: float = 1.0, landmark_selection: LandmarkSelection = LandmarkSelection.Orthogonal, *args, **kwargs)[source]¶
Orthoformer attention mechanism.
"Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers" Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer, C., Vedaldi, A., Henriques, J. (2021) Reference codebase: https://github.com/facebookresearch/Motionformer
- forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *args, **kwargs)[source]¶
- class xformers.components.attention.GlobalAttention(dropout: float, attention_query_mask: Tensor, causal: bool = False, force_sparsity: bool = False, *_, **__)[source]¶
Bases: Attention
- __init__(dropout: float, attention_query_mask: Tensor, causal: bool = False, force_sparsity: bool = False, *_, **__)[source]¶
Global attention, as proposed for instance in BigBird or Longformer.
Global means in this case that the queries positively labelled in the attention_query_mask can attend to all the other queries, while the queries negatively labelled in the attention_query_mask cannot attend to any other query.
This implementation is sparse-aware, meaning that the empty parts of the attention matrix are not represented in memory.
- Parameters
dropout (float) – probability of an element to be zeroed
attention_query_mask (torch.Tensor) – boolean mask over query positions; where True, the corresponding query can attend to all the others
- forward(q: Tensor, k: Tensor, v: Tensor, att_mask: Optional[Union[Tensor, AttentionMask]] = None, *_, **__)[source]¶
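Example – a sketch of the intended usage; the (sequence length, 1) boolean layout of attention_query_mask is an assumption here, check it against your xformers version:

    import torch
    from xformers.components.attention import GlobalAttention

    S = 128
    # Assumed layout: one boolean flag per query position (column vector).
    # Here the first 4 positions are "global" queries.
    attention_query_mask = torch.zeros(S, 1, dtype=torch.bool)
    attention_query_mask[:4] = True

    attention = GlobalAttention(dropout=0.0, attention_query_mask=attention_query_mask)

    B, D = 2, 32
    q, k, v = (torch.randn(B, S, D) for _ in range(3))
    out = attention(q, k, v)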
- class xformers.components.attention.FavorAttention(causal: bool = False, dropout: float = 0.0, dim_features: Optional[int] = None, dim_head: Optional[int] = None, iter_before_redraw: Optional[int] = None, feature_map_type: FeatureMapType = FeatureMapType.SMReg, normalize_inputs: bool = False, *_, **__)[source]¶
Bases: Attention
- __init__(causal: bool = False, dropout: float = 0.0, dim_features: Optional[int] = None, dim_head: Optional[int] = None, iter_before_redraw: Optional[int] = None, feature_map_type: FeatureMapType = FeatureMapType.SMReg, normalize_inputs: bool = False, *_, **__)[source]¶
Kernelized attention, as proposed in Performers (“Rethinking Attention with Performers”, K. Choromanski et al., 2020).
FAVOR stands for “Fast Attention Via positive Orthogonal Random features”.
- Parameters
dropout (float) – the probability of an output to be randomly dropped at training time
dim_features (int) – the dimension of the random features space
iter_before_redraw (int) – the number of steps (forward calls) before a redraw of the features
feature_map_type (FeatureMapType) – the type of feature map being used, for instance orthogonal random features
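Example – a minimal sketch (dimensions are illustrative; when dim_features is not given, a reasonable number of random features is assumed to be derived from dim_head):

    import torch
    from xformers.components.attention import FavorAttention

    # Kernelized (FAVOR) attention; dim_head matches the per-head dimension of q/k/v
    attention = FavorAttention(dropout=0.0, dim_head=32, iter_before_redraw=100)

    B, S, D = 2, 128, 32
    q, k, v = (torch.randn(B, S, D) for _ in range(3))
    out = attention(q, k, v)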
- class xformers.components.attention.Attention(dropout: Optional[float] = None, *args, **kwargs)[source]¶
Bases: Module
The base Attention mechanism, which is typically a sub-part of the multi-head attention
- class xformers.components.attention.AttentionMask(additive_mask: Tensor, is_causal: bool = False)[source]¶
Bases: object
Holds an attention mask, along with a couple of helpers and attributes.
- classmethod from_bool(x: Tensor) Self [source]¶
Create an AttentionMask given a boolean pattern.
Warning: we assume here that True implies that the value should be computed.
- classmethod from_multiplicative(x: Tensor) Self [source]¶
Create an AttentionMask given a multiplicative attention mask.
- classmethod make_causal(seq_len: int, to_seq_len: Optional[int] = None, device: Optional[device] = None, dtype: Optional[dtype] = None) Self [source]¶
- make_crop(seq_len: int, to_seq_len: Optional[int] = None) AttentionMask [source]¶
Return a cropped attention mask, whose underlying tensor is a view of this one
- property device¶
- property is_sparse¶
- property ndim¶
- property dtype¶
- property shape¶
- to(device: Optional[device] = None, dtype: Optional[dtype] = None) AttentionMask [source]¶
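Example – a short sketch combining the helpers above with ScaledDotProduct (shapes are illustrative):

    import torch
    from xformers.components.attention import AttentionMask, ScaledDotProduct

    S = 16

    # Additive causal mask
    causal_mask = AttentionMask.make_causal(seq_len=S)

    # Or build one from a boolean pattern (True means "keep this position")
    bool_mask = AttentionMask.from_bool(torch.ones(S, S).tril().bool())

    attention = ScaledDotProduct()
    q, k, v = (torch.randn(2, S, 32) for _ in range(3))
    out = attention(q, k, v, att_mask=causal_mask)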
- xformers.components.attention.build_attention(config: Union[Dict[str, Any], AttentionConfig])[source]¶
Builds an attention from a config.
This assumes a ‘name’ key in the config which is used to determine which attention class to instantiate. For instance, a config {“name”: “my_attention”, “foo”: “bar”} will find the class that was registered as “my_attention” (see register_attention()) and call .from_config on it.
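Example – assuming the scaled dot-product attention above is registered under the name “scaled_dot_product”:

    from xformers.components.attention import build_attention

    # The "name" key selects the registered attention class,
    # the remaining keys feed its config
    attention = build_attention(
        {
            "name": "scaled_dot_product",
            "dropout": 0.1,
            "causal": True,
        }
    )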
- xformers.components.attention.register_attention(name: str, config: Any = AttentionConfig)¶
Registers a subclass.
This decorator allows xFormers to instantiate a given subclass from a configuration file, even if the class itself is not part of the xFormers library.
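Example – a sketch of registering a custom mechanism (the class below is purely illustrative):

    import torch
    from xformers.components.attention import Attention, register_attention


    @register_attention("my_attention")
    class MyAttention(Attention):
        def __init__(self, dropout: float = 0.0, *args, **kwargs):
            super().__init__()
            self.dropout = torch.nn.Dropout(dropout)

        def forward(self, q, k, v, *args, **kwargs):
            # Plain softmax(QK^T / sqrt(d)) V, for illustration only
            att = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
            return self.dropout(torch.softmax(att, dim=-1)) @ v


    # "my_attention" can now be built from a config, e.g.
    # build_attention({"name": "my_attention", "dropout": 0.1})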