TextTokenizer

class fairseq2.data.text.TextTokenizer

Bases: ABC

Represents a tokenizer to encode and decode text.

abstract create_decoder()

Create a token decoder.

Return type:

TextTokenDecoder

abstract create_encoder(*, task=None, lang=None, mode=None, device=None, pin_memory=False)

Create a token encoder.

The valid arguments for the task, lang, and mode parameters are implementation-specific. Refer to the concrete TextTokenizer subclasses for the values each one accepts.

Parameters:

task (str | None) – The task for which to generate token indices. Typically, task is used to distinguish between different tasks such as ‘translation’ or ‘transcription’.
lang (str | None) – The language of generated token indices. Typically, multilingual translation tasks use lang to distinguish between different languages such as ‘en-US’ or ‘de-DE’.
mode (str | None) – The mode in which to generate token indices. Typically, translation tasks use mode to distinguish between different modes such as ‘source’ or ‘target’.
device (device | None) – The device on which to construct tensors.
pin_memory (bool) – If True, uses pinned memory while constructing tensors.

Return type:

TextTokenEncoder
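Because the accepted values of task, lang, and mode depend on the subclass, the following is only a minimal sketch of how a concrete tokenizer might interpret them. It does not use fairseq2: ToyTokenizer and ToyTranslationTokenizer are hypothetical stand-ins, the encoder returns a plain list of integers instead of a tensor, the device and pin_memory parameters are omitted, and the control-symbol layout (language tag first for the source, end-of-sequence followed by language tag for the target) is an invented convention for illustration.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional, Sequence


class ToyTokenizer(ABC):
    """Hypothetical stand-in for the abstract interface; not the fairseq2 class."""

    @abstractmethod
    def create_encoder(
        self,
        *,
        task: Optional[str] = None,
        lang: Optional[str] = None,
        mode: Optional[str] = None,
    ) -> Callable[[str], List[int]]:
        ...


class ToyTranslationTokenizer(ToyTokenizer):
    """Whitespace tokenizer that accepts only the values it documents."""

    def __init__(self, langs: Sequence[str], words: Sequence[str]) -> None:
        # Control symbols come first, then one tag per language, then words.
        symbols = ["<eos>", "<unk>"] + [f"<{name}>" for name in langs] + list(words)
        self._index: Dict[str, int] = {s: i for i, s in enumerate(symbols)}
        self._langs = list(langs)

    def create_encoder(self, *, task=None, lang=None, mode=None):
        # Reject values this particular (hypothetical) subclass does not support.
        if task is not None and task != "translation":
            raise ValueError(f"unsupported task: {task!r}")
        if lang is None:
            lang = self._langs[0]
        if lang not in self._langs:
            raise ValueError(f"unsupported lang: {lang!r}")
        if mode is None:
            mode = "source"
        if mode not in ("source", "target"):
            raise ValueError(f"unsupported mode: {mode!r}")

        eos = self._index["<eos>"]
        unk = self._index["<unk>"]
        tag = self._index[f"<{lang}>"]

        # Invented convention: the mode decides which control symbols are prepended.
        prefix = [tag] if mode == "source" else [eos, tag]

        def encode(text: str) -> List[int]:
            body = [self._index.get(word, unk) for word in text.split()]
            return prefix + body + [eos]

        return encode


tokenizer = ToyTranslationTokenizer(
    langs=["en", "de"], words=["hello", "world", "hallo", "welt"]
)

source_encoder = tokenizer.create_encoder(task="translation", lang="en", mode="source")
target_encoder = tokenizer.create_encoder(task="translation", lang="de", mode="target")

source_ids = source_encoder("hello world")
target_ids = target_encoder("hallo welt")
```

In this sketch the same tokenizer yields different index sequences for the same kind of input depending on lang and mode, and it raises ValueError for anything it does not recognize, which is one reasonable way for a subclass to make its implementation-specific contract explicit.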

abstract create_raw_encoder(*, device=None, pin_memory=False)

Create a raw token encoder with no control symbols.

Parameters:

device (device | None) – The device on which to construct tensors.
pin_memory (bool) – If True, uses pinned memory while constructing tensors.

Return type:

TextTokenEncoder
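The difference between a raw encoder and a regular encoder can be illustrated with a small standalone sketch. It does not call fairseq2: VOCAB, make_raw_encoder, and make_encoder are hypothetical names, the outputs are plain lists rather than tensors, and wrapping the sequence in begin- and end-of-sequence symbols is only one assumed example of what "control symbols" might mean for a given tokenizer.

```python
from typing import Callable, Dict, List

# Hypothetical toy vocabulary; the first three entries are control symbols.
VOCAB: Dict[str, int] = {"<bos>": 0, "<eos>": 1, "<unk>": 2, "good": 3, "morning": 4}


def make_raw_encoder(vocab: Dict[str, int]) -> Callable[[str], List[int]]:
    """Return an encoder that maps words to indices and adds nothing else."""
    unk = vocab["<unk>"]

    def encode(text: str) -> List[int]:
        return [vocab.get(word, unk) for word in text.split()]

    return encode


def make_encoder(vocab: Dict[str, int]) -> Callable[[str], List[int]]:
    """Return an encoder that wraps the raw indices in control symbols."""
    raw = make_raw_encoder(vocab)
    bos, eos = vocab["<bos>"], vocab["<eos>"]

    def encode(text: str) -> List[int]:
        return [bos] + raw(text) + [eos]

    return encode


raw_ids = make_raw_encoder(VOCAB)("good morning")
full_ids = make_encoder(VOCAB)("good morning")
```

Here raw_ids contains only the indices of the words themselves, while full_ids carries the extra control symbols. A raw encoder is the natural choice when the caller wants to assemble control symbols itself or needs indices for a text fragment rather than a full sequence.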

abstract property vocab_info: VocabularyInfo

The vocabulary information associated with the tokenizer.