fairseq2.data.tokenizers¶
The tokenizer API has multiple concrete implementations for different tokenization algorithms.
The main Tokenizer interface defines the contract for creating encoders and decoders, while concrete implementations
handle specific tokenization methods such as SentencePiece and tiktoken.
Base Classes¶
- class fairseq2.data.tokenizers.Tokenizer[source]¶
Bases: ABC
Represents a tokenizer to encode and decode text.
- abstract create_encoder(*, task: str | None = None, lang: str | None = None, mode: str | None = None, device: device | None = None, pin_memory: bool = False) TokenEncoder[source]¶
Constructs a token encoder.
The valid arguments for the task, lang, and mode parameters are implementation specific. Refer to concrete Tokenizer subclasses for more information.
- Parameters:
task – The task for which to generate token indices. Typically, task is used to distinguish between different tasks such as ‘translation’ or ‘transcription’.
lang – The language of generated token indices. Typically, multilingual translation tasks use lang to distinguish between different languages such as ‘en-US’ or ‘de-DE’.
mode – The mode in which to generate token indices. Typically, translation tasks use mode to distinguish between different modes such as ‘source’ or ‘target’.
device – The device on which to construct tensors.
pin_memory – If True, uses pinned memory while constructing tensors.
- abstract create_raw_encoder(*, device: device | None = None, pin_memory: bool = False) TokenEncoder[source]¶
Constructs a raw token encoder with no control symbols.
- Parameters:
device – The device on which to construct tensors.
pin_memory – If True, uses pinned memory while constructing tensors.
- abstract create_decoder(*, skip_special_tokens: bool = False) TokenDecoder[source]¶
Constructs a token decoder.
- abstract property vocab_info: VocabularyInfo¶
The vocabulary information associated with the tokenizer.
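The contract above can be sketched with a toy, torch-free stand-in. The WhitespaceTokenizer below is hypothetical and not part of fairseq2: it returns plain Python lists instead of tensors and omits the device and pin_memory parameters, which require torch.

```python
# A minimal, hypothetical stand-in for the Tokenizer contract. It splits on
# whitespace, maps tokens to indices via a fixed vocabulary, and returns
# plain lists of ints instead of tensors.
class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                               # token -> index
        self.inverse = {i: t for t, i in vocab.items()}  # index -> token

    def create_encoder(self, *, task=None, lang=None, mode=None):
        # task/lang/mode are accepted but unused in this toy implementation;
        # real tokenizers use them to select control symbols.
        def encode(text):
            return [self.vocab[t] for t in text.split()]
        return encode

    def create_decoder(self, *, skip_special_tokens=False):
        def decode(indices):
            return " ".join(self.inverse[i] for i in indices)
        return decode

vocab = {"the": 0, "future": 1, "of": 2, "AI": 3}
tok = WhitespaceTokenizer(vocab)
indices = tok.create_encoder()("the future of AI")  # [0, 1, 2, 3]
text = tok.create_decoder()(indices)                # "the future of AI"
```

Note that, as in the real interface, the tokenizer itself is a factory: encoding and decoding are done by the callables it creates, not by the tokenizer object directly.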
- class fairseq2.data.tokenizers.TokenEncoder[source]¶
Bases: ABC
Encodes text into tokens or token indices.
- class fairseq2.data.tokenizers.TokenDecoder[source]¶
Bases: ABC
Decodes text from tokens or token indices.
- class fairseq2.data.tokenizers.VocabularyInfo(size: int, unk_idx: int | None, bos_idx: int | None, eos_idx: int | None, pad_idx: int | None, boh_idx: int | None = None, eoh_idx: int | None = None)[source]¶
Bases: object
Describes the vocabulary used by a tokenizer.
Quick Start¶
Loading a Tokenizer¶
from fairseq2.data.tokenizers import load_tokenizer
tokenizer = load_tokenizer("qwen3_0.6b")
Loading a Specific Model’s Tokenizer¶
from fairseq2.models.qwen import get_qwen_tokenizer_hub
hub = get_qwen_tokenizer_hub()
# downloads (if needed) and loads the tokenizer, cached under ~/.cache/huggingface/models--qwen--qwen3-0.6b
tokenizer = hub.load_tokenizer("qwen3_0.6b")
This loads the tokenizer and its associated vocabulary for the specified model.
Using TokenizerHub¶
TokenizerHub provides more advanced operations for working with tokenizers.
It is helpful if you want to implement your own tokenizer family and configuration.
Here’s how to use it with Qwen tokenizers (you can adapt this for your own tokenizer family):
from fairseq2.data.tokenizers.hub import TokenizerHubAccessor
from fairseq2.models.qwen import QwenTokenizer, QwenTokenizerConfig
from pathlib import Path
# when implementing your own tokenizer family, you can create a similar
# helper function to load the hub for that family.
# behind the scenes, get_qwen_tokenizer_hub is implemented like this:
get_qwen_tokenizer_hub = TokenizerHubAccessor(
"qwen", # tokenizer family name
QwenTokenizer, # concrete tokenizer class
QwenTokenizerConfig, # concrete tokenizer config class
)
hub = get_qwen_tokenizer_hub()
# downloads (if needed) and loads the tokenizer, cached under ~/.cache/huggingface/models--qwen--qwen3-0.6b
tokenizer = hub.load_tokenizer("qwen3_0.6b")
# load a tokenizer configuration
config = hub.get_tokenizer_config("qwen3_0.6b")
# load a custom tokenizer from a path
# hf download Qwen/Qwen3-0.6B --local-dir /data/pretrained_llms/qwen3_0.6b
custom_path = Path("/data/pretrained_llms/qwen3_0.6b")
custom_tokenizer = hub.load_custom_tokenizer(custom_path, config)
# Encode some text
text = "The future of AI is"
encoder = custom_tokenizer.create_encoder()
encoded = encoder(text)
# Decode the text
decoder = custom_tokenizer.create_decoder()
decoded = decoder(encoded)
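The create_decoder call above also accepts skip_special_tokens (see the Tokenizer reference). A toy illustration, not the fairseq2 implementation, of what that flag typically controls, i.e. whether control symbols such as BOS/EOS appear in the decoded string:

```python
# Hypothetical vocabulary with two control symbols.
inverse = {0: "<bos>", 1: "<eos>", 2: "hello", 3: "world"}
special = {0, 1}

def make_decoder(skip_special_tokens=False):
    def decode(indices):
        if skip_special_tokens:
            # Drop control-symbol indices before detokenizing.
            indices = [i for i in indices if i not in special]
        return " ".join(inverse[i] for i in indices)
    return decode

make_decoder()([0, 2, 3, 1])                          # "<bos> hello world <eos>"
make_decoder(skip_special_tokens=True)([0, 2, 3, 1])  # "hello world"
```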
Listing Available Tokenizers¶
You can list all available tokenizers using the list command from the command line:
# List tokenizers from command line
python -m fairseq2.assets list --kind tokenizer
Or, it can be done programmatically:
from fairseq2.models.qwen import get_qwen_tokenizer_hub
hub = get_qwen_tokenizer_hub()
for card in hub.iter_cards():
    print(f"Found tokenizer: {card.name}")
See also: references/fairseq2.data.tokenizers.hub