fairseq2.models.llama

The LLaMA module provides support for LLaMA language models from Meta AI. It includes model configurations, hub access, tokenizers, and utilities for loading and working with LLaMA models.

Quick Start

from fairseq2.models.llama import get_llama_model_hub, get_llama_tokenizer_hub

# Get the model hub
hub = get_llama_model_hub()

# Load a model
model = hub.load_model("llama3_2_1b")

# Load corresponding tokenizer (uses HuggingFace tokenizer by default)
tokenizer = get_llama_tokenizer_hub().load_tokenizer("llama3_2_1b")

# Generate some text
text = "The future of AI is"
encoder = tokenizer.create_encoder()
encoded = encoder(text)
# ... model inference code ...
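
The encoded tokens can be decoded back to text as well. A minimal round-trip sketch, continuing the snippet above and assuming the tokenizer exposes a create_decoder() counterpart to create_encoder():

decoder = tokenizer.create_decoder()  # assumed counterpart to create_encoder()

print(encoded)           # tensor of token indices
print(decoder(encoded))  # decodes back to (approximately) the original text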

Tokenizer

The LLaMA tokenizer in fairseq2 supports multiple implementations (a programmatic loading example follows the list):

  1. HuggingFace Tokenizer (Default):

    The default and recommended implementation using HuggingFace’s tokenizer.

    Asset Card Example:

    name: llama3
    tokenizer: "/path/to/Llama-3.1-8B"  # HuggingFace tokenizer directory
    tokenizer_family: llama
    

    The tokenizer directory should contain the standard HuggingFace files, e.g. config.json, tokenizer.json, tokenizer_config.json, and special_tokens_map.json.

  2. Tiktoken Implementation:

    An alternative implementation based on Tiktoken.

    Asset Card Example:

    name: tiktoken_llama_instruct
    tokenizer_config_override:
        impl: tiktoken
        use_eot: True  # For instruction models
    tokenizer_family: llama
    tokenizer: "/path/to/tokenizer.model"  # Tiktoken model file
    
  3. SentencePiece Implementation:

    An implementation based on SentencePiece (only available for LLaMA-1 and LLaMA-2).

    Asset Card Example:

    name: sp_llama
    tokenizer_config_override:
        impl: sp
    tokenizer_family: llama
    tokenizer: "/path/to/tokenizer.model"  # SentencePiece model file
    
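
An asset card is not strictly required: the hub's load_custom_tokenizer can also point the default HuggingFace implementation at a local tokenizer directory. A sketch, with a placeholder path:

from pathlib import Path

from fairseq2.models.llama import get_llama_tokenizer_hub
from fairseq2.models.llama.tokenizer import LLaMATokenizerConfig

# Default (HuggingFace) implementation; the directory path is illustrative
config = LLaMATokenizerConfig(impl="hg")

hub = get_llama_tokenizer_hub()
tokenizer = hub.load_custom_tokenizer(Path("/path/to/Llama-3.1-8B"), config)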

Special Tokens

The tokenizer handles several special tokens:

  • <|begin_of_text|> - Beginning of text marker

  • <|end_of_text|> - End of text marker (default)

  • <|eot_id|> - End of turn marker (when use_eot=True)

  • <|start_header_id|> - Start of header

  • <|end_header_id|> - End of header

For instruction models (e.g., llama3_2_1b_instruct), use_eot=True is set by default, which means:

from fairseq2.data.tokenizers import load_tokenizer

# Load instruct model tokenizer
tokenizer = load_tokenizer("llama3_2_1b_instruct")

# Will use <|eot_id|> as EOS token
assert tokenizer._eos_token == "<|eot_id|>"
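
Conversely, base (non-instruct) cards keep <|end_of_text|> as the EOS token; a quick check, reusing the same internal attribute as above:

# Base model tokenizer keeps the default end-of-text marker
base_tokenizer = load_tokenizer("llama3_2_1b")
assert base_tokenizer._eos_token == "<|end_of_text|>"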

Tokenizer Modes

The tokenizer supports different modes via create_encoder(mode=...):

  • default: Adds BOS and EOS tokens

  • prompt: Adds BOS token only

  • prompt_response: Adds EOS token only

  • as_is: No special tokens added

encoder = tokenizer.create_encoder(mode="prompt")
# Only adds <|begin_of_text|>

encoder = tokenizer.create_encoder(mode="prompt_response")
# Only adds the EOS token (<|eot_id|> or <|end_of_text|>, depending on use_eot)
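
The effect of each mode is easiest to see by encoding the same text with all four and comparing token counts. A small sketch, assuming tokenizer has been loaded as in the examples above:

text = "Hello, world!"

for mode in ("default", "prompt", "prompt_response", "as_is"):
    encoder = tokenizer.create_encoder(mode=mode)
    print(mode, len(encoder(text)))

# "default" yields two more tokens than "as_is" (BOS and EOS);
# "prompt" and "prompt_response" each yield one more.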

Model Hub

get_llama_model_hub

fairseq2.models.llama.get_llama_model_hub()

Returns the model hub for LLaMA models.

Return type:

ModelHub[ModelT, ModelConfigT]

get_llama_tokenizer_hub

fairseq2.models.llama.get_llama_tokenizer_hub()

Returns the tokenizer hub for LLaMA tokenizers.

Return type:

TokenizerHub[TokenizerT, TokenizerConfigT]

Model Configuration

LLaMAConfig

class fairseq2.models.llama.LLaMAConfig(*, model_dim=4096, max_seq_len=2048, vocab_size=32000, pad_idx=None, tied_embeddings=False, num_layers=32, num_attn_heads=32, num_key_value_heads=32, ffn_inner_dim=16384, ffn_inner_dim_scale=0.6666666666666666, ffn_inner_dim_multiplier=1.0, ffn_inner_dim_multiple_of=256, rope_theta=10000.0, use_scaled_rope=False, rope_scale=<factory>, dropout_p=0.0, init_std=None, init_std_scale='layer', shard_embed_dim=True)[source]

Bases: object

Holds the configuration of a LLaMA model.

The default values correspond to the base architecture as described in Touvron et al. [4].

model_dim: int = 4096

The dimensionality of the model.

max_seq_len: int = 2048

The maximum sequence length.

vocab_size: int = 32000

The size of the vocabulary.

pad_idx: int | None = None

The index of the PAD symbol in the vocabulary.

tied_embeddings: bool = False

If True, ties the embedding table and the output projection layer.

num_layers: int = 32

The number of decoder layers.

num_attn_heads: int = 32

The number of attention heads in decoder layers.

num_key_value_heads: int = 32

The number of key/value heads for Grouped Query Attention.

ffn_inner_dim: int = 16384

The dimensionality of inner projection layers in feed-forward networks.

ffn_inner_dim_scale: float = 0.6666666666666666

The scale factor for the dimensionality of inner projection layers in feed-forward networks.

ffn_inner_dim_multiplier: float = 1.0

The multiplier for the dimensionality of inner projection layers in feed-forward networks.

ffn_inner_dim_multiple_of: int = 256

The dimensionality of inner projection layers in feed-forward networks is rounded up to the nearest multiple of this value (see the sizing sketch after this configuration listing).

rope_theta: float = 10000.0

The coefficient of the long-term decay of the Rotary position encoder.

use_scaled_rope: bool = False

If True, scales Rotary encoder frequencies to the context length.

rope_scale: LLaMARoPEScaleConfig

If not None, specifies scaling parameters for the Rotary position encoder, aiming to increase the effective context length.

dropout_p: float = 0.0

The dropout probability on outputs of Transformer layers.

init_std: float | None = None

If not None, the standard deviation to initialize input embeddings and projection weights; otherwise, model_dim ** -0.5 will be used instead.

init_std_scale: Literal['none', 'layer', 'stack'] = 'layer'

The method to use to scale init_std per layer. If ‘none’, no scaling will be applied. If ‘layer’, init_std will be scaled by the depth of the layer. If ‘stack’, init_std will be scaled by the total depth of the decoder.

shard_embed_dim: bool = True

If True, shards the embedding dimension for tensor parallelism.
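
The FFN-related fields interact: the effective inner dimension is derived from ffn_inner_dim, ffn_inner_dim_scale, ffn_inner_dim_multiplier, and ffn_inner_dim_multiple_of. Below is a sketch of a small custom configuration; the rounding arithmetic in the comment follows the field descriptions above and the original LLaMA sizing recipe, and is shown for illustration only:

from fairseq2.models.llama import LLaMAConfig

config = LLaMAConfig(
    model_dim=1024,
    max_seq_len=4096,
    vocab_size=32000,
    num_layers=8,
    num_attn_heads=8,
    num_key_value_heads=8,
    ffn_inner_dim=4096,         # 4 * model_dim, before scaling
    ffn_inner_dim_scale=2 / 3,
    ffn_inner_dim_multiplier=1.0,
    ffn_inner_dim_multiple_of=256,
)

# Effective FFN inner dim: 4096 * (2/3) * 1.0 = 2730.67, rounded up to the
# nearest multiple of 256 -> 2816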

Tokenizer Configuration

LLaMATokenizerConfig

class fairseq2.models.llama.LLaMATokenizerConfig(impl: "Literal['sp', 'tiktoken', 'hg']" = 'sp', use_eot: 'bool' = False, split_regex: 'str | None' = None)[source]

Bases: object

Configuration for LLaMA tokenizer.

Key Parameters:

  • impl - Implementation to use: “hg” (default), “tiktoken”, or “sp”

  • use_eot - Whether to use <|eot_id|> as EOS token (True for instruction models)

  • split_regex - Custom regex pattern for tiktoken implementation

Complete Examples

Using HuggingFace Tokenizer

from fairseq2.models.llama import get_llama_tokenizer_hub

# Load default HuggingFace tokenizer
tokenizer = get_llama_tokenizer_hub().load_tokenizer("llama3_2_1b")

# Create encoder in different modes
default_encoder = tokenizer.create_encoder()  # Adds BOS and EOS
prompt_encoder = tokenizer.create_encoder(mode="prompt")  # Only BOS

# Encode text
text = "Hello, world!"
tokens = default_encoder(text)

Using Tiktoken Implementation

from fairseq2.models.llama import get_llama_tokenizer_hub
from fairseq2.models.llama.tokenizer import LLaMATokenizerConfig
from pathlib import Path

# Configure tiktoken implementation
config = LLaMATokenizerConfig(impl="tiktoken", use_eot=True)

# Load tokenizer with custom config
hub = get_llama_tokenizer_hub()
tokenizer = hub.load_custom_tokenizer(Path("/path/to/tokenizer.model"), config)
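
Because use_eot=True, encoders created from this tokenizer terminate sequences with <|eot_id|> rather than <|end_of_text|>, for example:

# EOS-only mode appends <|eot_id|> here (since use_eot=True)
encoder = tokenizer.create_encoder(mode="prompt_response")
tokens = encoder("Sure, here is a joke.")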

Chat Template Support

The HuggingFace implementation includes support for chat templates through the HuggingFace tokenizer’s apply_chat_template method:

from fairseq2.models.llama import get_llama_tokenizer_hub

# Load tokenizer
tokenizer = get_llama_tokenizer_hub().load_tokenizer("llama3_2_1b")

# Prepare chat messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the chicken cross the road?"}
]

# Format using the chat template of the wrapped HuggingFace tokenizer
# (accessed here through fairseq2-internal attributes)
formatted_text = tokenizer._model._tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Then encode the formatted text; "as_is" avoids adding BOS/EOS again,
# since the chat template already inserts the special tokens
encoder = tokenizer.create_encoder(mode="as_is")
tokens = encoder(formatted_text)

See Also