.. _tokenizer:

fairseq2.data.tokenizers
========================

.. currentmodule:: fairseq2.data.tokenizers

The tokenizer has multiple concrete implementations for different tokenization
algorithms. The main :class:`Tokenizer` interface defines the contract for
creating encoders and decoders, while concrete implementations handle specific
tokenization methods such as SentencePiece and tiktoken.

Base Classes
------------

.. autoclass:: Tokenizer
   :members:
   :undoc-members:
   :show-inheritance:

.. autoclass:: TokenEncoder
   :members:
   :undoc-members:
   :show-inheritance:

.. autoclass:: TokenDecoder
   :members:
   :undoc-members:
   :show-inheritance:

.. autoclass:: VocabularyInfo
   :members:
   :undoc-members:
   :show-inheritance:

Quick Start
-----------

Loading a Tokenizer
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from fairseq2.data.tokenizers import load_tokenizer

    tokenizer = load_tokenizer("qwen3_0.6b")

Loading a Specific Model's Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from fairseq2.models.qwen import get_qwen_tokenizer_hub

    hub = get_qwen_tokenizer_hub()

    # Downloads the tokenizer to ~/.cache/huggingface/models--qwen--qwen3-0.6b
    tokenizer = hub.load_tokenizer("qwen3_0.6b")

This loads the tokenizer and its associated vocabulary for the specified model.

Using TokenizerHub
~~~~~~~~~~~~~~~~~~

:class:`TokenizerHub` provides more advanced, customizable operations for
working with tokenizers. It is helpful when you want to implement your own
tokenizer and its configuration. Here's how to use it with Qwen tokenizers
(you can adapt this for your own tokenizer family):

.. code-block:: python

    from pathlib import Path

    from fairseq2.data.tokenizers.hub import TokenizerHubAccessor
    from fairseq2.models.qwen import QwenTokenizer, QwenTokenizerConfig

    # When implementing your own tokenizer family, you can create a similar
    # helper function to load the hub for that family. Behind the scenes,
    # get_qwen_tokenizer_hub is implemented like this:
    get_qwen_tokenizer_hub = TokenizerHubAccessor(
        "qwen",               # tokenizer family name
        QwenTokenizer,        # concrete tokenizer class
        QwenTokenizerConfig,  # concrete tokenizer config class
    )

    hub = get_qwen_tokenizer_hub()

    # Downloads the tokenizer to ~/.cache/huggingface/models--qwen--qwen3-0.6b
    tokenizer = hub.load_tokenizer("qwen3_0.6b")

    # Load a tokenizer configuration.
    config = hub.get_tokenizer_config("qwen3_0.6b")

    # Load a custom tokenizer from a local path, e.g. one downloaded with:
    #   hf download Qwen/Qwen3-0.6B --local-dir /data/pretrained_llms/qwen3_0.6b
    custom_path = Path("/data/pretrained_llms/qwen3_0.6b")
    custom_tokenizer = hub.load_custom_tokenizer(custom_path, config)

    # Encode some text.
    text = "The future of AI is"
    encoder = custom_tokenizer.create_encoder()
    encoded = encoder(text)

    # Decode the tokens back into text.
    decoder = custom_tokenizer.create_decoder()
    decoded = decoder(encoded)

Listing Available Tokenizers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can list all available tokenizers with the ``list`` command from the
command line:

.. code-block:: bash

    python -m fairseq2.assets list --kind tokenizer

Or programmatically:

.. code-block:: python

    from fairseq2.models.qwen import get_qwen_tokenizer_hub

    hub = get_qwen_tokenizer_hub()

    for card in hub.iter_cards():
        print(f"Found tokenizer: {card.name}")

.. toctree::
    :maxdepth: 1

    hub
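To make the ``create_encoder``/``create_decoder`` contract concrete without
downloading any assets, here is a minimal, self-contained sketch. The
``WhitespaceTokenizer`` class below is hypothetical (it is **not** part of
fairseq2); it merely mirrors the shape of the interface, where an encoder maps
text to token indices and a decoder maps indices back to text:

.. code-block:: python

    # Hypothetical stand-in that mirrors the Tokenizer contract for
    # illustration only: create_encoder() returns a callable mapping
    # text -> token ids, create_decoder() the inverse.
    class WhitespaceTokenizer:
        def __init__(self, corpus: str) -> None:
            # Build a toy vocabulary from the words seen in the corpus.
            words = sorted(set(corpus.split()))
            self._token_to_id = {w: i for i, w in enumerate(words)}
            self._id_to_token = {i: w for w, i in self._token_to_id.items()}

        def create_encoder(self):
            def encode(text: str) -> list[int]:
                return [self._token_to_id[w] for w in text.split()]

            return encode

        def create_decoder(self):
            def decode(ids: list[int]) -> str:
                return " ".join(self._id_to_token[i] for i in ids)

            return decode


    tokenizer = WhitespaceTokenizer("the future of AI is open")

    encoder = tokenizer.create_encoder()
    decoder = tokenizer.create_decoder()

    ids = encoder("the future of AI is")
    assert decoder(ids) == "the future of AI is"

A real :class:`Tokenizer` differs in the details (encoders typically return
tensors and handle subword units and special tokens), but the round-trip shape
of the API is the same.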