Multi-Head Attention¶
- class xformers.components.MultiHeadDispatch(dim_model: int, num_heads: int, attention: Attention, bias: Tuple[bool, bool, bool, bool] = (True, True, True, True), residual_dropout: float = 0.0, use_separate_proj_weight: bool = True, dim_key: Optional[int] = None, dim_value: Optional[int] = None, in_proj_container: Optional[InputProjection] = None, use_rotary_embeddings: Optional[bool] = False, out_proj: Optional[Module] = None, *args, **kwargs)[source]¶
A multi-head masked self-attention dispatch mechanism, with a projection at the end, following the architecture proposed in "Attention Is All You Need", Vaswani et al.
The actual attention mechanism can vary, as can the projections. This class can be used to wrap any of the proposed attention mechanisms and make it multi-head aware, but its use is optional (see the usage sketch at the end of this section).
- Parameters
dim_model – The model/embedding dimension
num_heads – The number of heads being used
attention – The attention mechanism (needs to be registered with the xformers library)
bias – Whether to use a bias for each of the projections: (Q, K, V, output)
residual_dropout – Amount of dropout on the residual path
use_separate_proj_weight – Use different weights for the Q, K, V projections
dim_key – Optionally use a different dimension for the key
dim_value – Optionally use a different dimension for the value
in_proj_container – Optionally provide the input projection module
use_rotary_embeddings – Use rotary embeddings
out_proj – Optionally provide the output projection module
- forward(query: Tensor, key: Optional[Tensor] = None, value: Optional[Tensor] = None, att_mask: Optional[Tensor] = None, key_padding_mask: Optional[Tensor] = None) Tensor [source]¶
Expected input dimensions are [batch size, sequence length, embed dim]. Output dimensions are the same: [batch size, sequence length, embed dim].
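A minimal usage sketch, assuming the ScaledDotProduct attention shipped with xformers and illustrative sizes; exact import paths and constructor arguments may differ across versions:

```python
import torch

from xformers.components import MultiHeadDispatch
from xformers.components.attention import ScaledDotProduct

BATCH, SEQ, EMB, HEADS = 2, 1024, 512, 8  # hypothetical sizes

# Wrap a registered attention mechanism (here: vanilla scaled dot-product)
# so that it becomes multi-head aware.
attention = ScaledDotProduct(dropout=0.1, causal=False)

multi_head = MultiHeadDispatch(
    dim_model=EMB,
    num_heads=HEADS,
    attention=attention,
    bias=(True, True, True, True),  # bias for the Q, K, V and output projections
    residual_dropout=0.1,
)

x = torch.randn(BATCH, SEQ, EMB)          # [batch size, sequence length, embed dim]
y = multi_head(query=x, key=x, value=x)   # self-attention
assert y.shape == (BATCH, SEQ, EMB)       # output keeps the input dimensions
```

Because the attention mechanism is passed in rather than hard-coded, swapping ScaledDotProduct for any other attention registered with xformers leaves the dispatch code unchanged.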