neuraltrain.models.transformer.TransformerEncoder

class neuraltrain.models.transformer.TransformerEncoder(*, heads: int = 8, depth: int = 12, cross_attend: bool = False, causal: bool = False, attn_flash: bool = False, attn_dropout: float = 0.1, ff_mult: int = 4, ff_dropout: float = 0.0, use_scalenorm: bool = True, use_rmsnorm: bool = False, rel_pos_bias: bool = False, alibi_pos_bias: bool = False, rotary_pos_emb: bool = True, rotary_xpos: bool = False, residual_attn: bool = False, scale_residual: bool = True, layer_dropout: float = 0.0)[source]

Transformer encoder/decoder built on top of x_transformers.

Parameters:
  • heads (int) – Number of attention heads.

  • depth (int) – Number of Transformer layers.

  • cross_attend (bool) – Enable cross-attention (decoder mode).

  • causal (bool) – If True, build a causal Decoder instead of an Encoder.

  • attn_flash (bool) – Use Flash Attention. Not compatible with alibi_pos_bias.

  • attn_dropout (float) – Dropout probability inside the attention layers.

  • ff_mult (int) – Feed-forward expansion factor; the inner feed-forward dimension is the model dimension times ff_mult (ff_dim = dim * ff_mult).

  • ff_dropout (float) – Dropout probability in the feed-forward layers.

  • use_scalenorm (bool) – Use ScaleNorm instead of LayerNorm.

  • use_rmsnorm (bool) – Use RMSNorm instead of LayerNorm.

  • rel_pos_bias (bool) – Use relative positional bias.

  • alibi_pos_bias (bool) – Use ALiBi positional bias.

  • rotary_pos_emb (bool) – Use rotary positional embeddings.

  • rotary_xpos (bool) – Use xPos extension for rotary embeddings.

  • residual_attn (bool) – Use residual attention: add the previous layer's attention scores to the current layer's (RealFormer-style).

  • scale_residual (bool) – Scale residual connections.

  • layer_dropout (float) – Probability of dropping an entire Transformer layer during training.
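The constraints implied by the parameters above can be sketched in plain Python. This is a hypothetical illustration of the configuration logic, not neuraltrain's actual implementation: the helper name select_layer_config and the returned dict keys are assumptions, but the rules it encodes (causal=True selects a Decoder rather than an Encoder, and attn_flash cannot be combined with alibi_pos_bias) come directly from the parameter descriptions.

```python
def select_layer_config(causal=False, cross_attend=False, attn_flash=False,
                        alibi_pos_bias=False, rotary_pos_emb=True):
    """Hypothetical sketch of how the constructor's flags might be resolved.

    Mirrors two documented rules: Flash Attention is incompatible with
    ALiBi, and `causal=True` builds a causal Decoder instead of an Encoder.
    """
    # Documented incompatibility: Flash Attention cannot use ALiBi bias.
    if attn_flash and alibi_pos_bias:
        raise ValueError("attn_flash is not compatible with alibi_pos_bias")
    # causal=True means a causal Decoder stack; otherwise an Encoder stack.
    mode = "Decoder" if causal else "Encoder"
    return {
        "mode": mode,
        "cross_attend": cross_attend,
        "attn_flash": attn_flash,
        "alibi_pos_bias": alibi_pos_bias,
        "rotary_pos_emb": rotary_pos_emb,
    }

# Example: a causal decoder with cross-attention (decoder mode).
cfg = select_layer_config(causal=True, cross_attend=True)
```

For instance, requesting attn_flash=True together with alibi_pos_bias=True would raise a ValueError under this sketch, while the defaults resolve to a non-causal Encoder with rotary positional embeddings.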