Multi Head Attention

class xformers.components.MultiHeadDispatch(dim_model: int, num_heads: int, attention: Attention, bias: Tuple[bool, bool, bool, bool] = (True, True, True, True), residual_dropout: float = 0.0, use_separate_proj_weight: bool = True, dim_key: Optional[int] = None, dim_value: Optional[int] = None, in_proj_container: Optional[InputProjection] = None, use_rotary_embeddings: Optional[bool] = False, out_proj: Optional[Module] = None, *args, **kwargs)[source]

A multi-head masked self-attention dispatch mechanism, with a final projection, following the architecture proposed in "Attention Is All You Need", Vaswani et al. (2017).

Both the attention mechanism and the projections can vary. This class can be used to wrap any of the proposed attention mechanisms and make it multi-head aware, but using it is optional.

  • dim_model – The model/embedding dimension

  • num_heads – The number of heads being used

  • attention – The attention mechanism (needs to be registered to the xformers library)

  • bias – Whether to use bias for the projections: (Q, K, V, Output)

  • residual_dropout – Amount of dropout on the residual path

  • use_separate_proj_weight – Use different weights for the Q, K, V projections

  • dim_key – Optionally use a different dimension for the key

  • dim_value – Optionally use a different dimension for the value

  • in_proj_container – Optionally provide the input projection module

  • use_rotary_embeddings – Use rotary embeddings

  • out_proj – Optionally provide the output projection module

forward(query: Tensor, key: Optional[Tensor] = None, value: Optional[Tensor] = None, att_mask: Optional[Tensor] = None, key_padding_mask: Optional[Tensor] = None) → Tensor[source]

Expected input dimensions are [batch size, sequence length, embed dim]. Output dimensions are [batch size, sequence length, embed dim].
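The dispatch splits the embedding dimension across heads and merges it back after attention, which is why input and output shapes match. A plain-PyTorch sketch of that shape bookkeeping (illustrative only, not the xformers implementation):

```python
import torch

batch, seq, dim_model, num_heads = 2, 16, 64, 4
head_dim = dim_model // num_heads  # each head attends over a slice of the embedding

x = torch.randn(batch, seq, dim_model)
# Split heads: [B, S, E] -> [B, H, S, E/H]
heads = x.view(batch, seq, num_heads, head_dim).transpose(1, 2)
# ... the wrapped attention mechanism runs per head here ...
# Merge heads back: [B, H, S, E/H] -> [B, S, E]
merged = heads.transpose(1, 2).reshape(batch, seq, dim_model)
print(merged.shape)  # torch.Size([2, 16, 64])
```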

classmethod from_config(config: MultiHeadDispatchConfig)[source]
training: bool