fairseq2.models.llama

The LLaMA module provides support for LLaMA language models from Meta AI. It includes model configurations, hub access, tokenizers, and utilities for loading and working with LLaMA models.

Quick Start

from fairseq2.models.llama import get_llama_model_hub, get_llama_tokenizer_hub

# Get the model hub
hub = get_llama_model_hub()

# Load a model
model = hub.load_model("llama3_2_1b")

# Load corresponding tokenizer (uses HuggingFace tokenizer by default)
tokenizer = get_llama_tokenizer_hub().load_tokenizer("llama3_2_1b")

# Generate some text
text = "The future of AI is"
encoder = tokenizer.create_encoder()
encoded = encoder(text)
# ... model inference code ...
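
The encoded tokens can be decoded back to text as well. A minimal round-trip sketch, continuing the snippet above and assuming the tokenizer exposes a create_decoder() counterpart to create_encoder():

decoder = tokenizer.create_decoder()  # assumed counterpart to create_encoder()

print(encoded)           # tensor of token indices
print(decoder(encoded))  # decodes back to (approximately) the original text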

Tokenizer

The LLaMA tokenizer in fairseq2 supports multiple implementations (a programmatic loading example follows the list):

  1. HuggingFace Tokenizer (Default):

    The default and recommended implementation using HuggingFace’s tokenizer.

    Asset Card Example:

    name: llama3
    tokenizer: "/path/to/Llama-3.1-8B"  # HuggingFace tokenizer directory
    tokenizer_family: llama
    

    The tokenizer directory should contain the standard HuggingFace files, e.g. config.json, tokenizer.json, tokenizer_config.json, and special_tokens_map.json.

  2. Tiktoken Implementation:

    An alternative implementation based on Tiktoken.

    Asset Card Example:

    name: tiktoken_llama_instruct
    tokenizer_config_override:
        impl: tiktoken
        use_eot: True  # For instruction models
    tokenizer_family: llama
    tokenizer: "/path/to/tokenizer.model"  # Tiktoken model file
    
  3. SentencePiece Implementation:

    An implementation based on SentencePiece (only available for LLaMA-1 and LLaMA-2).

    Asset Card Example:

    name: sp_llama
    tokenizer_config_override:
        impl: sp
    tokenizer_family: llama
    tokenizer: "/path/to/tokenizer.model"  # SentencePiece model file
    
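
An asset card is not strictly required: the hub's load_custom_tokenizer can also point the default HuggingFace implementation at a local tokenizer directory. A sketch, with a placeholder path:

from pathlib import Path

from fairseq2.models.llama import get_llama_tokenizer_hub
from fairseq2.models.llama.tokenizer import LLaMATokenizerConfig

# Default (HuggingFace) implementation; the directory path is illustrative
config = LLaMATokenizerConfig(impl="hg")

hub = get_llama_tokenizer_hub()
tokenizer = hub.load_custom_tokenizer(Path("/path/to/Llama-3.1-8B"), config)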

Special Tokens

The tokenizer handles several special tokens:

  • <|begin_of_text|> - Beginning of text marker

  • <|end_of_text|> - End of text marker (default)

  • <|eot_id|> - End of turn marker (when use_eot=True)

  • <|start_header_id|> - Start of header

  • <|end_header_id|> - End of header

For instruction models (e.g., llama3_2_1b_instruct), use_eot=True is set by default, which means:

from fairseq2.data.tokenizers import load_tokenizer

# Load instruct model tokenizer
tokenizer = load_tokenizer("llama3_2_1b_instruct")

# Will use <|eot_id|> as EOS token
assert tokenizer._eos_token == "<|eot_id|>"
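
Conversely, base (non-instruct) cards keep <|end_of_text|> as the EOS token; a quick check, reusing the same internal attribute as above:

# Base model tokenizer keeps the default end-of-text marker
base_tokenizer = load_tokenizer("llama3_2_1b")
assert base_tokenizer._eos_token == "<|end_of_text|>"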

Tokenizer Modes

The tokenizer supports different modes via create_encoder(mode=...):

  • default: Adds BOS and EOS tokens

  • prompt: Adds BOS token only

  • prompt_response: Adds EOS token only

  • as_is: No special tokens added

encoder = tokenizer.create_encoder(mode="prompt")
# Only adds <|begin_of_text|>

encoder = tokenizer.create_encoder(mode="prompt_response")
# Only adds the EOS token (<|eot_id|> or <|end_of_text|>, depending on use_eot)
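
The effect of each mode is easiest to see by encoding the same text with all four and comparing token counts. A small sketch, assuming tokenizer has been loaded as in the examples above:

text = "Hello, world!"

for mode in ("default", "prompt", "prompt_response", "as_is"):
    encoder = tokenizer.create_encoder(mode=mode)
    print(mode, len(encoder(text)))

# "default" yields two more tokens than "as_is" (BOS and EOS);
# "prompt" and "prompt_response" each yield one more.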

Model Hub

get_llama_model_hub

fairseq2.models.llama.get_llama_model_hub()

Returns the model hub for LLaMA models.

Return type:

ModelHub[ModelT, ModelConfigT]

get_llama_tokenizer_hub

fairseq2.models.llama.get_llama_tokenizer_hub()

Returns the tokenizer hub for LLaMA tokenizers.

Return type:

TokenizerHub[TokenizerT, TokenizerConfigT]

Model Configuration

LLaMAConfig

class fairseq2.models.llama.LLaMAConfig(*, model_dim=4096, max_seq_len=2048, vocab_size=32000, pad_idx=None, tied_embeddings=False, num_layers=32, num_attn_heads=32, num_key_value_heads=32, ffn_inner_dim=16384, ffn_inner_dim_scale=0.6666666666666666, ffn_inner_dim_multiplier=1.0, ffn_inner_dim_multiple_of=256, rope_theta=10000.0, use_scaled_rope=False, rope_scale=<factory>, dropout_p=0.0, init_std=None, init_std_scale='layer', shard_embed_dim=True)[source]

Bases: object

Holds the configuration of a LLaMA model.

The default values correspond to the base architecture as described in Touvron et al. [4].

model_dim: int = 4096

The dimensionality of the model.

max_seq_len: int = 2048

The maximum sequence length.

vocab_size: int = 32000

The size of the vocabulary.

pad_idx: int | None = None

The index of the PAD symbol in the vocabulary.

tied_embeddings: bool = False

If True, ties the embedding table and the output projection layer.

num_layers: int = 32

The number of decoder layers.

num_attn_heads: int = 32

The number of attention heads in decoder layers.

num_key_value_heads: int = 32

The number of key/value heads for Grouped Query Attention.

ffn_inner_dim: int = 16384

The dimensionality of inner projection layers in feed-forward networks.

ffn_inner_dim_scale: float = 0.6666666666666666

The scale factor for the dimensionality of inner projection layers in feed-forward networks.

ffn_inner_dim_multiplier: float = 1.0

The multiplier for the dimensionality of inner projection layers in feed-forward networks.

ffn_inner_dim_multiple_of: int = 256

The dimensionality of inner projection layers in feed-forward networks is rounded up to the nearest multiple of this value (see the sizing sketch after this configuration listing).

rope_theta: float = 10000.0

The coefficient of the long-term decay of the Rotary position encoder.

use_scaled_rope: bool = False

If True, scales Rotary encoder frequencies to the context length.

rope_scale: LLaMARoPEScaleConfig

If not None, specifies scaling parameters for the Rotary position encoder, aiming to increase the effective context length.

dropout_p: float = 0.0

The dropout probability on outputs of Transformer layers.

init_std: float | None = None

If not None, the standard deviation to initialize input embeddings and projection weights; otherwise, model_dim ** -0.5 will be used instead.

init_std_scale: Literal['none', 'layer', 'stack'] = 'layer'

The method to use to scale init_std per layer. If ‘none’, no scaling will be applied. If ‘layer’, init_std will be scaled by the depth of the layer. If ‘stack’, init_std will be scaled by the total depth of the decoder.

shard_embed_dim: bool = True

If True, shards the embedding dimension for tensor parallelism.
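
The FFN-related fields interact: the effective inner dimension is derived from ffn_inner_dim, ffn_inner_dim_scale, ffn_inner_dim_multiplier, and ffn_inner_dim_multiple_of. Below is a sketch of a small custom configuration; the rounding arithmetic in the comment follows the field descriptions above and the original LLaMA sizing recipe, and is shown for illustration only:

from fairseq2.models.llama import LLaMAConfig

config = LLaMAConfig(
    model_dim=1024,
    max_seq_len=4096,
    vocab_size=32000,
    num_layers=8,
    num_attn_heads=8,
    num_key_value_heads=8,
    ffn_inner_dim=4096,         # 4 * model_dim, before scaling
    ffn_inner_dim_scale=2 / 3,
    ffn_inner_dim_multiplier=1.0,
    ffn_inner_dim_multiple_of=256,
)

# Effective FFN inner dim: 4096 * (2/3) * 1.0 = 2730.67, rounded up to the
# nearest multiple of 256 -> 2816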

Tokenizer Configuration

LLaMATokenizerConfig

class fairseq2.models.llama.LLaMATokenizerConfig(impl: "Literal['sp', 'tiktoken', 'hg']" = 'sp', use_eot: 'bool' = False, split_regex: 'str | None' = None)[source]

Bases: object

Configuration for LLaMA tokenizer.

Key Parameters:

  • impl - Implementation to use: “hg” (default), “tiktoken”, or “sp”

  • use_eot - Whether to use <|eot_id|> as EOS token (True for instruction models)

  • split_regex - Custom regex pattern for tiktoken implementation

Complete Examples

Using HuggingFace Tokenizer

from fairseq2.models.llama import get_llama_tokenizer_hub

# Load default HuggingFace tokenizer
tokenizer = get_llama_tokenizer_hub().load_tokenizer("llama3_2_1b")

# Create encoder in different modes
default_encoder = tokenizer.create_encoder()  # Adds BOS and EOS
prompt_encoder = tokenizer.create_encoder(mode="prompt")  # Only BOS

# Encode text
text = "Hello, world!"
tokens = default_encoder(text)

Using Tiktoken Implementation

from fairseq2.models.llama import get_llama_tokenizer_hub
from fairseq2.models.llama.tokenizer import LLaMATokenizerConfig
from pathlib import Path

# Configure tiktoken implementation
config = LLaMATokenizerConfig(impl="tiktoken", use_eot=True)

# Load tokenizer with custom config
hub = get_llama_tokenizer_hub()
tokenizer = hub.load_custom_tokenizer(Path("/path/to/tokenizer.model"), config)
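
Because use_eot=True, encoders created from this tokenizer terminate sequences with <|eot_id|> rather than <|end_of_text|>, for example:

# EOS-only mode appends <|eot_id|> here (since use_eot=True)
encoder = tokenizer.create_encoder(mode="prompt_response")
tokens = encoder("Sure, here is a joke.")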

Chat Template Support

The HuggingFace implementation includes support for chat templates through the HuggingFace tokenizer’s apply_chat_template method:

from fairseq2.models.llama import get_llama_tokenizer_hub

# Load tokenizer
tokenizer = get_llama_tokenizer_hub().load_tokenizer("llama3_2_1b")

# Prepare chat messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the chicken cross the road?"}
]

# Format using the chat template of the wrapped HuggingFace tokenizer
# (accessed here through fairseq2-internal attributes)
formatted_text = tokenizer._model._tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Then encode the formatted text; "as_is" avoids adding BOS/EOS again,
# since the chat template already inserts the special tokens
encoder = tokenizer.create_encoder(mode="as_is")
tokens = encoder(formatted_text)

See Also