.. _api-models-gemma4:

======================
fairseq2.models.gemma4
======================

.. currentmodule:: fairseq2.models.gemma4

The Gemma 4 module provides support for Google's Gemma 4 model family, including dense (E2B, E4B, 31B) and Mixture-of-Experts (26B-A4B) variants, each with base and instruction-tuned versions. The architecture features Per-Layer Embeddings (PLE), partial rotary position encodings, KV sharing across attention layers, QK/V-norm, logit soft-capping, and an optional Conformer-based audio tower for multimodal inference.

Architecture Overview
---------------------

.. image:: /_static/img/gemma4/gemma4_architecture.svg
   :alt: Gemma 4 decoder architecture
   :align: center

Model Variants
--------------

.. image:: /_static/img/gemma4/gemma4_variants.svg
   :alt: Gemma 4 model variant comparison
   :align: center

.. list-table:: Model Variant Summary
   :header-rows: 1
   :widths: 20 15 15 15 15 20

   * - Variant
     - Params
     - Layers
     - model_dim
     - Features
     - Active Params
   * - E2B / E2B-it
     - 2B
     - 26
     - 2048
     - KV-share
     - 2B (dense)
   * - E4B / E4B-it
     - 7.5B
     - 34
     - 2560
     - PLE, Audio, KV-share
     - 7.5B (dense)
   * - 31B / 31B-it
     - 30.7B
     - 48
     - 4608
     - KV-share, Double MLP
     - 30.7B (dense)
   * - 26B-A4B / 26B-A4B-it
     - 25.2B
     - 34
     - 2560
     - MoE (128 experts, top-2)
     - ~3.8B

Quick Start
-----------

.. code-block:: python

    from fairseq2.models.gemma4 import get_gemma4_model_hub, get_gemma4_tokenizer_hub

    # Get the model hub
    hub = get_gemma4_model_hub()

    # Load a model
    model = hub.load_model("gemma4_e4b")

    # Load the corresponding tokenizer
    tokenizer = get_gemma4_tokenizer_hub().load_tokenizer("gemma4_e4b")

    # Encode text into token indices
    encoder = tokenizer.create_encoder()
    tokens = encoder("The future of AI is")

Available Models
----------------

The following model architectures are registered:

- ``gemma4_e2b`` / ``gemma4_e2b_it`` - 2B parameters (dense)
- ``gemma4_e4b`` / ``gemma4_e4b_it`` - 7.5B parameters (dense, with PLE + audio)
- ``gemma4_31b`` / ``gemma4_31b_it`` - 30.7B parameters (dense)
- ``gemma4_26b_a4b`` / ``gemma4_26b_a4b_it`` - 25.2B total / ~3.8B active (MoE)

All models use:

- Vocabulary size: 262,144
- Tied embeddings with soft-capped final logits (cap=30.0)
- Mixed sliding (window=512) and full (global) attention layers
- QK-norm and V-norm (non-learnable) on attention
- Partial rotary position encoding (50%) on full attention layers
- GELU(tanh) activation in GLU feed-forward networks

Key Architectural Features
--------------------------

**Per-Layer Embeddings (PLE)** (E4B only)

A learned projection splits the input embedding into per-layer contributions. Each contribution is gated by a sigmoid before being added to the hidden state at its decoder layer. This replaces the traditional single-embedding approach.

**KV Sharing**

Adjacent sliding-attention layers share key-value projections. A SOURCE layer computes K/V; CONSUMER layers reuse the pre-computed K/V. This saves memory and compute because consumer layers neither compute nor store their own K/V.

**K=V Attention**

On full (global) attention layers, the value projection is removed and the key projection output is reused as both K and V (after separate norms).

**Mixture of Experts (MoE)** (26B-A4B only)

Each decoder layer contains a router that selects the top-2 experts from a pool of 128. The shared FFN runs in parallel with the expert mixture.
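The top-2 routing described above can be sketched in plain Python. This is an illustrative sketch, not the fairseq2 implementation: the names ``route_top_k`` and ``moe_layer``, the softmax-then-renormalize weighting, and the choice to sum the shared FFN output with the routed output are all assumptions, and scalars stand in for hidden-state tensors.

.. code-block:: python

    import math


    def softmax(xs):
        """Numerically stable softmax over a list of floats."""
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]


    def route_top_k(router_logits, k=2):
        """Return (expert_index, weight) pairs for the k highest-scoring experts.

        Assumption: weights are renormalized over the selected experts.
        """
        probs = softmax(router_logits)
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]


    def moe_layer(x, router_logits, experts, shared_ffn, k=2):
        """Add the shared FFN output to the weighted top-k expert outputs.

        Assumption: the parallel branches are combined by summation.
        """
        routed = sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))
        return shared_ffn(x) + routed


    # Toy usage: 4 hypothetical experts instead of 128, each scaling its input.
    experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
    print(moe_layer(1.0, [0.0, 2.0, 0.0, 2.0], experts, shared_ffn=lambda x: x))

Only the selected experts are evaluated for a given token, which is why the active parameter count (~3.8B) is much smaller than the total (25.2B).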
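The soft-capped final logits listed under Available Models (cap=30.0) can be illustrated in a similar way. The sketch below assumes the common ``cap * tanh(logits / cap)`` formulation. The source does not state the exact formula, and ``soft_cap`` is a hypothetical helper rather than a fairseq2 function.

.. code-block:: python

    import math


    def soft_cap(logits, cap=30.0):
        """Squash logits smoothly into the open interval (-cap, cap).

        Assumption: cap * tanh(x / cap); small values pass through almost
        unchanged, large values saturate near +/- cap.
        """
        return [cap * math.tanh(x / cap) for x in logits]


    # Toy usage: a small logit is nearly unchanged, a large one saturates.
    print(soft_cap([1.0, 1000.0]))

Unlike hard clipping, this mapping stays differentiable and preserves the ordering of the logits.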
**Audio Tower** (E4B only)

A Conformer encoder processes log-mel spectrograms into audio embeddings that are injected at ``