neuraltrain.models.conformer.Conformer

class neuraltrain.models.conformer.Conformer(*, num_heads: int = 4, ffn_dim: int = 144, num_layers: int = 2, depthwise_conv_kernel_size: int = 31, dropout: float = 0.1, use_group_norm: bool = False, convolution_first: bool = False)[source]

Reference: Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition”, Interspeech 2020. See https://arxiv.org/abs/2005.08100.

The Conformer combines Transformer-style self-attention with convolution modules to capture both global context and short-range (local) temporal dependencies.

Parameters:
  • num_heads (int, optional) – Number of attention heads in the multi-head self-attention layers. Each head attends to a different representation subspace, letting the model capture several temporal relationships in parallel.

  • ffn_dim (int, optional) – Dimension of the feed-forward layer inside each Conformer block. Acts as the hidden expansion size for each token.

  • num_layers (int, optional) – Number of Conformer layers to stack. Controls the model’s depth and temporal abstraction capacity.

  • depthwise_conv_kernel_size (int, optional) – Kernel size of the depthwise convolution in each convolution module. Controls the temporal receptive field of local processing.

  • dropout (float, optional) – Dropout probability applied within the Conformer layers. Helps regularize the model and prevent overfitting.

  • use_group_norm (bool, optional) – Whether to use GroupNorm instead of BatchNorm inside the convolutional modules. GroupNorm can be more stable for small batch sizes.

  • convolution_first (bool, optional) – If True, applies the convolutional module before the self-attention module inside each block. In practice this may slightly alter inductive bias but rarely changes performance significantly.

Note: the parameter defaults shown above are the settings used in the original paper.
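To illustrate how depthwise_conv_kernel_size and num_layers interact, the sketch below computes the temporal receptive field contributed by the stacked convolution modules alone. This is illustrative arithmetic, not part of the neuraltrain API; it assumes stride-1 depthwise convolutions with one convolution module per layer, and ignores self-attention, which is global and therefore does not limit the receptive field.

```python
def conv_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field (in frames) of `num_layers` stacked stride-1
    depthwise convolutions, each with the given kernel size.

    Each layer extends the field by (kernel_size - 1) frames on top of
    the single frame seen by the input itself.
    """
    return 1 + num_layers * (kernel_size - 1)

# With the defaults from the signature above (kernel 31, 2 layers),
# the convolutional path alone spans 61 frames.
print(conv_receptive_field(31, 2))  # → 61
```

Doubling num_layers or enlarging the kernel widens this local window linearly, which is one way to trade depth against per-layer kernel size when tuning the model.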