fairseq2.data.tokenizers.hub
The tokenizer hub provides a centralized way to manage and access tokenizers in fairseq2. It offers functionality for discovering available tokenizers, loading tokenizers, and working with custom tokenizer configurations.
Quick Reference
See fairseq2.data.tokenizers for detailed usage examples.
Core Classes
TokenizerHub
final class fairseq2.data.tokenizers.hub.TokenizerHub(family, asset_store)
Bases: Generic[TokenizerT, TokenizerConfigT]
The main hub class for managing tokenizers. Provides methods for:
- Listing available tokenizers (iter_cards())
- Loading tokenizer configurations (get_tokenizer_config())
- Loading tokenizers (load_tokenizer())
- Loading custom tokenizers (load_custom_tokenizer())
Example:
    from pathlib import Path

    from fairseq2.models.qwen import get_qwen_tokenizer_hub

    hub = get_qwen_tokenizer_hub()

    # list available tokenizers
    for card in hub.iter_cards():
        print(f"Found tokenizer: {card.name}")

    # load a tokenizer directly (downloaded to ~/.cache/huggingface/models--qwen--qwen3-0.6b)
    tokenizer = hub.load_tokenizer("qwen3_0.6b")

    # load a tokenizer configuration
    config = hub.get_tokenizer_config("qwen3_0.6b")

    # load a custom tokenizer from a local path, fetched beforehand with:
    # hf download Qwen/Qwen3-0.6B --local-dir /data/pretrained_llms/qwen3_0.6b
    custom_path = Path("/data/pretrained_llms/qwen3_0.6b")
    custom_tokenizer = hub.load_custom_tokenizer(custom_path, config)

    # encode some text
    text = "The future of AI is"
    encoder = custom_tokenizer.create_encoder()
    encoded = encoder(text)

    # decode it back to text
    decoder = custom_tokenizer.create_decoder()
    decoded = decoder(encoded)
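The encode and decode steps at the end of the example can be wrapped in a small helper. The sketch below relies only on the `create_encoder()` and `create_decoder()` methods shown above. The names `roundtrip` and `WhitespaceTokenizer` are hypothetical and introduced for illustration; the stand-in tokenizer is not a fairseq2 class and exists only so the snippet runs without downloading any assets.

```python
from typing import Any, Tuple


def roundtrip(tokenizer: Any, text: str) -> Tuple[Any, str]:
    """Encode `text`, decode the result, and return both values."""
    encoder = tokenizer.create_encoder()
    decoder = tokenizer.create_decoder()
    encoded = encoder(text)
    return encoded, decoder(encoded)


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer; NOT part of fairseq2."""

    def __init__(self) -> None:
        self._vocab = []  # word list; a word's index is its token id

    def create_encoder(self):
        def encode(text):
            ids = []
            for word in text.split():
                if word not in self._vocab:
                    self._vocab.append(word)
                ids.append(self._vocab.index(word))
            return ids

        return encode

    def create_decoder(self):
        def decode(ids):
            return " ".join(self._vocab[i] for i in ids)

        return decode


encoded, decoded = roundtrip(WhitespaceTokenizer(), "The future of AI is")
print(encoded)
print(decoded)
```

With a tokenizer loaded from a hub, such as `custom_tokenizer` above, `roundtrip(custom_tokenizer, text)` would apply the same two steps. Whether the decoded text matches the input exactly depends on the tokenizer.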
TokenizerHubAccessor
final class fairseq2.data.tokenizers.hub.TokenizerHubAccessor(family_name, kls, config_kls)
Bases: Generic[TokenizerT, TokenizerConfigT]
Factory class for creating TokenizerHub instances for specific tokenizer families. Can be used by tokenizer implementors to create hub accessors for their tokenizer families, like fairseq2.models.qwen.get_qwen_tokenizer_hub().
Example:
    from fairseq2.data.tokenizers.hub import TokenizerHubAccessor
    from fairseq2.models.qwen import QwenTokenizer, QwenTokenizerConfig

    # the implementation of get_qwen_tokenizer_hub
    get_qwen_tokenizer_hub = TokenizerHubAccessor(
        "qwen",  # tokenizer family name
        QwenTokenizer,  # concrete tokenizer class
        QwenTokenizerConfig,  # concrete tokenizer config class
    )
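To make the accessor pattern concrete, here is a simplified, dependency-free sketch of a callable factory that hands out a hub bound to one tokenizer family. It is not fairseq2's implementation. `ToyHub`, `ToyHubAccessor`, `EchoTokenizer`, and `EchoConfig` are hypothetical names, and the sketch omits the `asset_store` argument that the real `TokenizerHub` constructor takes.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

TokenizerT = TypeVar("TokenizerT")
TokenizerConfigT = TypeVar("TokenizerConfigT")


class ToyHub(Generic[TokenizerT, TokenizerConfigT]):
    """Hypothetical hub bound to a single tokenizer family."""

    def __init__(self, family_name: str, kls: type, config_kls: type) -> None:
        self.family_name = family_name
        self._kls = kls
        self._config_kls = config_kls

    def load_custom_tokenizer(self, path: str, config: TokenizerConfigT) -> TokenizerT:
        # Reject configs that belong to a different family.
        if not isinstance(config, self._config_kls):
            raise TypeError(
                f"expected {self._config_kls.__name__}, got {type(config).__name__}"
            )
        return self._kls(path, config)


class ToyHubAccessor(Generic[TokenizerT, TokenizerConfigT]):
    """Hypothetical accessor: calling it returns a hub for its family."""

    def __init__(self, family_name: str, kls: type, config_kls: type) -> None:
        self._family_name = family_name
        self._kls = kls
        self._config_kls = config_kls

    def __call__(self) -> "ToyHub[TokenizerT, TokenizerConfigT]":
        return ToyHub(self._family_name, self._kls, self._config_kls)


@dataclass
class EchoConfig:
    lowercase: bool = False


class EchoTokenizer:
    def __init__(self, path: str, config: EchoConfig) -> None:
        self.path = path
        self.config = config


# Mirrors the shape of the get_qwen_tokenizer_hub definition above.
get_echo_tokenizer_hub = ToyHubAccessor("echo", EchoTokenizer, EchoConfig)

hub = get_echo_tokenizer_hub()
tokenizer = hub.load_custom_tokenizer("/tmp/echo", EchoConfig(lowercase=True))
print(hub.family_name, type(tokenizer).__name__)
```

The point of the pattern is that a family's tokenizer and config classes are registered once, at module level, and callers obtain a hub by calling the accessor.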
Functions
load_tokenizer
fairseq2.data.tokenizers.hub.load_tokenizer(card, *, config=None, progress=True)
The global, family-agnostic function for loading tokenizers. This is a high-level function that handles all the complexities of tokenizer loading internally (via hub methods).
Example:
    from fairseq2.data.tokenizers import load_tokenizer

    tokenizer = load_tokenizer("qwen3_0.6b")
- Return type:
Exceptions
TokenizerNotKnownError
TokenizerFamilyNotKnownError
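This section does not state when these exceptions are raised. Assuming, as the name suggests, that TokenizerNotKnownError signals a tokenizer name that matches no known card, a caller might try several candidate names in order. The sketch below takes the loader and the exception type as parameters so that it runs without fairseq2 installed. `load_first_known`, `fake_load`, and `FakeNotKnownError` are hypothetical names used only for the demonstration.

```python
from typing import Any, Callable, Iterable, Optional, Tuple, Type


def load_first_known(
    names: Iterable[str],
    load: Callable[[str], Any],
    not_known_error: Type[Exception],
) -> Optional[Tuple[str, Any]]:
    """Return (name, tokenizer) for the first name that loads, else None."""
    for name in names:
        try:
            return name, load(name)
        except not_known_error:
            continue  # unknown name: try the next candidate
    return None


# Stand-ins for the demonstration; NOT part of fairseq2.
class FakeNotKnownError(Exception):
    pass


def fake_load(name: str) -> str:
    if name != "qwen3_0.6b":
        raise FakeNotKnownError(name)
    return f"<tokenizer {name}>"


result = load_first_known(
    ["no_such_tokenizer", "qwen3_0.6b"], fake_load, FakeNotKnownError
)
print(result)
```

Under the stated assumption, the real call would pass `load_tokenizer` and `TokenizerNotKnownError` in place of the stand-ins. Errors of any other type propagate to the caller unchanged.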
See Also
fairseq2.datasets.hub for dataset hub reference documentation.
fairseq2.models.hub for model hub reference documentation.