.. _tokenizer:

fairseq2.data.tokenizers
========================

.. currentmodule:: fairseq2.data.tokenizers

The tokenizer has multiple concrete implementations for different tokenization
algorithms. The main :class:`Tokenizer` interface defines the contract for
creating encoders and decoders, while concrete implementations handle specific
tokenization methods such as SentencePiece and tiktoken.

Base Classes
------------

.. autoclass:: Tokenizer
   :members:
   :undoc-members:
   :show-inheritance:

.. autoclass:: TokenEncoder
   :members:
   :undoc-members:
   :show-inheritance:

.. autoclass:: TokenDecoder
   :members:
   :undoc-members:
   :show-inheritance:

.. autoclass:: VocabularyInfo
   :members:
   :undoc-members:
   :show-inheritance:

Quick Start
-----------

Loading a Tokenizer
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from fairseq2.data.tokenizers import load_tokenizer

    tokenizer = load_tokenizer("qwen3_0.6b")

Loading a Specific Model's Tokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from fairseq2.models.qwen import get_qwen_tokenizer_hub

    hub = get_qwen_tokenizer_hub()

    # Downloads the tokenizer to ~/.cache/huggingface/models--qwen--qwen3-0.6b
    tokenizer = hub.load_tokenizer("qwen3_0.6b")

This loads the tokenizer and its associated vocabulary for the specified model.

Using TokenizerHub
~~~~~~~~~~~~~~~~~~

:class:`TokenizerHub` provides more advanced, customizable operations for
working with tokenizers. It is helpful when you want to implement your own
tokenizer and its configuration. Here's how to use it with Qwen tokenizers
(you can adapt this for your own tokenizer family):

.. code-block:: python

    from pathlib import Path

    from fairseq2.data.tokenizers.hub import TokenizerHubAccessor
    from fairseq2.models.qwen import QwenTokenizer, QwenTokenizerConfig

    # When implementing your own tokenizer family, you can create a similar
    # helper function to load the hub for that family. Behind the scenes,
    # get_qwen_tokenizer_hub is implemented like this:
    get_qwen_tokenizer_hub = TokenizerHubAccessor(
        "qwen",               # tokenizer family name
        QwenTokenizer,        # concrete tokenizer class
        QwenTokenizerConfig,  # concrete tokenizer config class
    )

    hub = get_qwen_tokenizer_hub()

    # Downloads the tokenizer to ~/.cache/huggingface/models--qwen--qwen3-0.6b
    tokenizer = hub.load_tokenizer("qwen3_0.6b")

    # Load a tokenizer configuration.
    config = hub.get_tokenizer_config("qwen3_0.6b")

    # Load a custom tokenizer from a local path, e.g. one downloaded with:
    #   hf download Qwen/Qwen3-0.6B --local-dir /data/pretrained_llms/qwen3_0.6b
    custom_path = Path("/data/pretrained_llms/qwen3_0.6b")
    custom_tokenizer = hub.load_custom_tokenizer(custom_path, config)

    # Encode some text.
    text = "The future of AI is"
    encoder = custom_tokenizer.create_encoder()
    encoded = encoder(text)

    # Decode the tokens back into text.
    decoder = custom_tokenizer.create_decoder()
    decoded = decoder(encoded)

Listing Available Tokenizers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can list all available tokenizers with the ``list`` command from the
command line:

.. code-block:: bash

    python -m fairseq2.assets list --kind tokenizer

Or programmatically:

.. code-block:: python

    from fairseq2.models.qwen import get_qwen_tokenizer_hub

    hub = get_qwen_tokenizer_hub()

    for card in hub.iter_cards():
        print(f"Found tokenizer: {card.name}")

.. toctree::
    :maxdepth: 1

    hub
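To make the ``create_encoder``/``create_decoder`` contract concrete without
downloading any assets, here is a minimal, self-contained sketch. The
``WhitespaceTokenizer`` class below is hypothetical (it is **not** part of
fairseq2); it merely mirrors the shape of the interface, where an encoder maps
text to token indices and a decoder maps indices back to text:

.. code-block:: python

    # Hypothetical stand-in that mirrors the Tokenizer contract for
    # illustration only: create_encoder() returns a callable mapping
    # text -> token ids, create_decoder() the inverse.
    class WhitespaceTokenizer:
        def __init__(self, corpus: str) -> None:
            # Build a toy vocabulary from the words seen in the corpus.
            words = sorted(set(corpus.split()))
            self._token_to_id = {w: i for i, w in enumerate(words)}
            self._id_to_token = {i: w for w, i in self._token_to_id.items()}

        def create_encoder(self):
            def encode(text: str) -> list[int]:
                return [self._token_to_id[w] for w in text.split()]

            return encode

        def create_decoder(self):
            def decode(ids: list[int]) -> str:
                return " ".join(self._id_to_token[i] for i in ids)

            return decode


    tokenizer = WhitespaceTokenizer("the future of AI is open")

    encoder = tokenizer.create_encoder()
    decoder = tokenizer.create_decoder()

    ids = encoder("the future of AI is")
    assert decoder(ids) == "the future of AI is"

A real :class:`Tokenizer` differs in the details (encoders typically return
tensors and handle subword units and special tokens), but the round-trip shape
of the API is the same.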