fairseq2.data.tokenizers¶
fairseq2 provides multiple concrete tokenizer implementations for different tokenization algorithms. The main Tokenizer interface defines the contract for creating encoders and decoders, while concrete implementations handle specific tokenization methods such as SentencePiece and tiktoken.
Base Classes¶
- class fairseq2.data.tokenizers.Tokenizer[source]¶
Bases:
ABC
Represents a tokenizer to encode and decode text.
- abstract create_encoder(*, task=None, lang=None, mode=None, device=None, pin_memory=False)[source]¶
Constructs a token encoder.
The valid arguments for the task, lang, and mode parameters are implementation specific. Refer to concrete Tokenizer subclasses for more information.
- Parameters:
task (str | None) – The task for which to generate token indices. Typically, task is used to distinguish between different tasks such as ‘translation’ or ‘transcription’.
lang (str | None) – The language of generated token indices. Typically, multilingual translation tasks use lang to distinguish between different languages such as ‘en-US’ or ‘de-DE’.
mode (str | None) – The mode in which to generate token indices. Typically, translation tasks use mode to distinguish between different modes such as ‘source’ or ‘target’.
device (device | None) – The device on which to construct tensors.
pin_memory (bool) – If True, uses pinned memory while constructing tensors.
- Return type:
TokenEncoder
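Because the task, lang, and mode arguments are implementation specific, their effect is easiest to see with a sketch. The following is a toy illustration in plain Python — the function and token names here are hypothetical, not part of fairseq2 — showing how a concrete subclass might map those keywords to control tokens prepended to the encoded output:

```python
# Illustrative only: a hypothetical encoder factory mimicking how a concrete
# Tokenizer subclass might interpret create_encoder()'s keyword arguments.
def make_encoder(task=None, lang=None, mode=None):
    # Collect control tokens implied by the keyword arguments.
    prefix = []
    if task is not None:
        prefix.append(f"<{task}>")
    if lang is not None:
        prefix.append(f"<{lang}>")
    if mode == "target":
        prefix.append("<bos>")

    def encode(text):
        # A real encoder would return token indices; string tokens are used
        # here to keep the sketch dependency-free.
        return prefix + text.split()

    return encode

encoder = make_encoder(task="translation", lang="de-DE", mode="target")
print(encoder("Hallo Welt"))
# ['<translation>', '<de-DE>', '<bos>', 'Hallo', 'Welt']
```

Passing no arguments would yield an encoder that adds no control tokens, which is the behavior create_raw_encoder() guarantees in the real interface.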
- abstract create_raw_encoder(*, device=None, pin_memory=False)[source]¶
Constructs a raw token encoder with no control symbols.
- Parameters:
device (device | None) – The device on which to construct tensors.
pin_memory (bool) – If True, uses pinned memory while constructing tensors.
- Return type:
TokenEncoder
- abstract create_decoder(*, skip_special_tokens=False)[source]¶
Constructs a token decoder.
- Return type:
TokenDecoder
- abstract property vocab_info: VocabularyInfo¶
The vocabulary information associated with the tokenizer.
- class fairseq2.data.tokenizers.TokenEncoder[source]¶
Bases:
ABC
Encodes text into tokens or token indices.
- class fairseq2.data.tokenizers.TokenDecoder[source]¶
Bases:
ABC
Decodes text from tokens or token indices.
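To make the encoder/decoder contract concrete, here is a minimal toy mirror of it in plain Python. The classes below are stand-ins written for illustration — they are not the fairseq2 base classes, use naive whitespace tokenization, and skip device/pin_memory handling:

```python
from abc import ABC, abstractmethod

# Toy stand-in for the Tokenizer contract: create_encoder() returns a
# callable mapping text -> token indices, create_decoder() the inverse.
class ToyTokenizer(ABC):
    @abstractmethod
    def create_encoder(self): ...

    @abstractmethod
    def create_decoder(self): ...

class WhitespaceTokenizer(ToyTokenizer):
    def __init__(self, vocab):
        self._vocab = vocab  # token -> index
        self._inverse = {i: t for t, i in vocab.items()}

    def create_encoder(self):
        # Unknown tokens map to index 0 ("<unk>").
        return lambda text: [self._vocab.get(t, 0) for t in text.split()]

    def create_decoder(self):
        return lambda indices: " ".join(self._inverse[i] for i in indices)

tokenizer = WhitespaceTokenizer({"<unk>": 0, "the": 1, "future": 2, "of": 3, "AI": 4})
encoder = tokenizer.create_encoder()
decoder = tokenizer.create_decoder()
indices = encoder("the future of AI")
print(indices)           # [1, 2, 3, 4]
print(decoder(indices))  # the future of AI
```

Real implementations return tensors rather than lists and carry vocabulary metadata via vocab_info, but the round-trip shape — encoder and decoder constructed from one tokenizer, sharing one vocabulary — is the same.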
Quick Start¶
Loading a Tokenizer¶
from fairseq2.data.tokenizers import load_tokenizer
tokenizer = load_tokenizer("qwen3_0.6b")
Loading a Specific Model’s Tokenizer¶
from fairseq2.models.qwen import get_qwen_tokenizer_hub
hub = get_qwen_tokenizer_hub()
# download (if needed) and load the tokenizer; cached under ~/.cache/huggingface/models--qwen--qwen3-0.6b
tokenizer = hub.load_tokenizer("qwen3_0.6b")
This loads the tokenizer and its associated vocabulary for the specified model.
Using TokenizerHub¶
TokenizerHub
provides more advanced operations for working with tokenizers, such as loading tokenizer configurations and custom tokenizers from local paths.
This is helpful if you want to implement your own tokenizer family and configuration.
Here’s how to use it with Qwen tokenizers (you can adapt this for your own tokenizer family):
from fairseq2.data.tokenizers.hub import TokenizerHubAccessor
from fairseq2.models.qwen import QwenTokenizer, QwenTokenizerConfig
from pathlib import Path
# when implementing your own tokenizer family, you can create a similar
# helper function to load the hub for that family. Behind the scenes,
# get_qwen_tokenizer_hub is implemented like this:
get_qwen_tokenizer_hub = TokenizerHubAccessor(
"qwen", # tokenizer family name
QwenTokenizer, # concrete tokenizer class
QwenTokenizerConfig, # concrete tokenizer config class
)
hub = get_qwen_tokenizer_hub()
# download (if needed) and load the tokenizer; cached under ~/.cache/huggingface/models--qwen--qwen3-0.6b
tokenizer = hub.load_tokenizer("qwen3_0.6b")
# load a tokenizer configuration
config = hub.get_tokenizer_config("qwen3_0.6b")
# load a custom tokenizer from a path
# hf download Qwen/Qwen3-0.6B --local-dir /data/pretrained_llms/qwen3_0.6b
custom_path = Path("/data/pretrained_llms/qwen3_0.6b")
custom_tokenizer = hub.load_custom_tokenizer(custom_path, config)
# Encode some text
text = "The future of AI is"
encoder = custom_tokenizer.create_encoder()
encoded = encoder(text)
# Decode the text
decoder = custom_tokenizer.create_decoder()
decoded = decoder(encoded)
Listing Available Tokenizers¶
You can list all available tokenizers from the command line:
# List tokenizers from command line
python -m fairseq2.assets list --kind tokenizer
Or, it can be done programmatically:
from fairseq2.models.qwen import get_qwen_tokenizer_hub
hub = get_qwen_tokenizer_hub()
for card in hub.iter_cards():
    print(f"Found tokenizer: {card.name}")