fairseq2.data.tokenizers¶
The tokenizer API has multiple concrete implementations for different tokenization algorithms.
The main Tokenizer interface defines the contract for creating encoders and decoders, while concrete implementations
handle specific tokenization methods such as SentencePiece and tiktoken.
Base Classes¶
- class fairseq2.data.tokenizers.Tokenizer[source]¶
Bases: ABC
Represents a tokenizer to encode and decode text.
- abstract create_encoder(*, task: str | None = None, lang: str | None = None, mode: str | None = None, device: device | None = None, pin_memory: bool = False) TokenEncoder[source]¶
Constructs a token encoder.
The valid arguments for the task, lang, and mode parameters are implementation specific. Refer to concrete Tokenizer subclasses for more information.
- Parameters:
task – The task for which to generate token indices. Typically, task is used to distinguish between different tasks such as ‘translation’ or ‘transcription’.
lang – The language of generated token indices. Typically, multilingual translation tasks use lang to distinguish between different languages such as ‘en-US’ or ‘de-DE’.
mode – The mode in which to generate token indices. Typically, translation tasks use mode to distinguish between different modes such as ‘source’ or ‘target’.
device – The device on which to construct tensors.
pin_memory – If True, uses pinned memory while constructing tensors.
- abstract create_raw_encoder(*, device: device | None = None, pin_memory: bool = False) TokenEncoder[source]¶
Constructs a raw token encoder with no control symbols.
- Parameters:
device – The device on which to construct tensors.
pin_memory – If True, uses pinned memory while constructing tensors.
- abstract create_decoder(*, skip_special_tokens: bool = False) TokenDecoder[source]¶
Constructs a token decoder.
- abstract property vocab_info: VocabularyInfo¶
The vocabulary information associated with the tokenizer.
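The contract above can be sketched with a toy, torch-free stand-in. The WhitespaceTokenizer below is hypothetical and not part of fairseq2: it returns plain Python lists instead of tensors and omits the device and pin_memory parameters, which require torch.

```python
# A minimal, hypothetical stand-in for the Tokenizer contract. It splits on
# whitespace, maps tokens to indices via a fixed vocabulary, and returns
# plain lists of ints instead of tensors.
class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                               # token -> index
        self.inverse = {i: t for t, i in vocab.items()}  # index -> token

    def create_encoder(self, *, task=None, lang=None, mode=None):
        # task/lang/mode are accepted but unused in this toy implementation;
        # real tokenizers use them to select control symbols.
        def encode(text):
            return [self.vocab[t] for t in text.split()]
        return encode

    def create_decoder(self, *, skip_special_tokens=False):
        def decode(indices):
            return " ".join(self.inverse[i] for i in indices)
        return decode

vocab = {"the": 0, "future": 1, "of": 2, "AI": 3}
tok = WhitespaceTokenizer(vocab)
indices = tok.create_encoder()("the future of AI")  # [0, 1, 2, 3]
text = tok.create_decoder()(indices)                # "the future of AI"
```

Note that, as in the real interface, the tokenizer itself is a factory: encoding and decoding are done by the callables it creates, not by the tokenizer object directly.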
- class fairseq2.data.tokenizers.TokenEncoder[source]¶
Bases: ABC
Encodes text into tokens or token indices.
- class fairseq2.data.tokenizers.TokenDecoder[source]¶
Bases: ABC
Decodes text from tokens or token indices.
- class fairseq2.data.tokenizers.VocabularyInfo(size: int, unk_idx: int | None, bos_idx: int | None, eos_idx: int | None, pad_idx: int | None, boh_idx: int | None = None, eoh_idx: int | None = None)[source]¶
Bases: object
Describes the vocabulary used by a tokenizer.
Quick Start¶
Loading a Tokenizer¶
from fairseq2.data.tokenizers import load_tokenizer
tokenizer = load_tokenizer("qwen3_0.6b")
Loading a Specific Model’s Tokenizer¶
from fairseq2.models.qwen import get_qwen_tokenizer_hub
hub = get_qwen_tokenizer_hub()
# downloads (if needed) and loads the tokenizer, cached under ~/.cache/huggingface/models--qwen--qwen3-0.6b
tokenizer = hub.load_tokenizer("qwen3_0.6b")
This loads the tokenizer and its associated vocabulary for the specified model.
Using TokenizerHub¶
TokenizerHub provides more advanced operations for working with tokenizers.
It is helpful if you want to implement your own tokenizer family and configuration.
Here’s how to use it with Qwen tokenizers (you can adapt this for your own tokenizer family):
from fairseq2.data.tokenizers.hub import TokenizerHubAccessor
from fairseq2.models.qwen import QwenTokenizer, QwenTokenizerConfig
from pathlib import Path
# when implementing your own tokenizer family, you can create a similar
# helper function to load the hub for that family.
# behind the scenes, get_qwen_tokenizer_hub is implemented like this:
get_qwen_tokenizer_hub = TokenizerHubAccessor(
"qwen", # tokenizer family name
QwenTokenizer, # concrete tokenizer class
QwenTokenizerConfig, # concrete tokenizer config class
)
hub = get_qwen_tokenizer_hub()
# downloads (if needed) and loads the tokenizer, cached under ~/.cache/huggingface/models--qwen--qwen3-0.6b
tokenizer = hub.load_tokenizer("qwen3_0.6b")
# load a tokenizer configuration
config = hub.get_tokenizer_config("qwen3_0.6b")
# load a custom tokenizer from a path
# hf download Qwen/Qwen3-0.6B --local-dir /data/pretrained_llms/qwen3_0.6b
custom_path = Path("/data/pretrained_llms/qwen3_0.6b")
custom_tokenizer = hub.load_custom_tokenizer(custom_path, config)
# Encode some text
text = "The future of AI is"
encoder = custom_tokenizer.create_encoder()
encoded = encoder(text)
# Decode the text
decoder = custom_tokenizer.create_decoder()
decoded = decoder(encoded)
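The create_decoder call above also accepts skip_special_tokens (see the Tokenizer reference). A toy illustration, not the fairseq2 implementation, of what that flag typically controls, i.e. whether control symbols such as BOS/EOS appear in the decoded string:

```python
# Hypothetical vocabulary with two control symbols.
inverse = {0: "<bos>", 1: "<eos>", 2: "hello", 3: "world"}
special = {0, 1}

def make_decoder(skip_special_tokens=False):
    def decode(indices):
        if skip_special_tokens:
            # Drop control-symbol indices before detokenizing.
            indices = [i for i in indices if i not in special]
        return " ".join(inverse[i] for i in indices)
    return decode

make_decoder()([0, 2, 3, 1])                          # "<bos> hello world <eos>"
make_decoder(skip_special_tokens=True)([0, 2, 3, 1])  # "hello world"
```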
Listing Available Tokenizers¶
You can list all available tokenizers using the list command from the command line:
# List tokenizers from command line
python -m fairseq2.assets list --kind tokenizer
Or, it can be done programmatically:
from fairseq2.models.qwen import get_qwen_tokenizer_hub
hub = get_qwen_tokenizer_hub()
for card in hub.iter_cards():
    print(f"Found tokenizer: {card.name}")
See also: references/fairseq2.data.tokenizers.hub