TextTokenizer

class fairseq2.data.text.TextTokenizer(vocab_info)

Bases: ABC

Represents a tokenizer to encode and decode text.

Parameters:

vocab_info (VocabularyInfo) – The vocabulary information associated with the tokenizer.
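
As a rough illustration of the shape of this interface, the sketch below mirrors the three abstract methods with a stand-in base class and a toy whitespace tokenizer. Nothing in it is imported from fairseq2: SketchTokenizer, WhitespaceTokenizer, and the plain-dict vocabulary are hypothetical names used only for illustration, the device and pin_memory parameters are left out, and token indices are returned as Python lists rather than tensors.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional

Encoder = Callable[[str], List[int]]
Decoder = Callable[[List[int]], str]


class SketchTokenizer(ABC):
    """Hypothetical stand-in that only mirrors the documented method names."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        # Stand-in for vocab_info; the documented class takes a VocabularyInfo.
        self.vocab = vocab

    @abstractmethod
    def create_encoder(
        self,
        *,
        task: Optional[str] = None,
        lang: Optional[str] = None,
        mode: Optional[str] = None,
    ) -> Encoder: ...

    @abstractmethod
    def create_raw_encoder(self) -> Encoder: ...

    @abstractmethod
    def create_decoder(self) -> Decoder: ...


class WhitespaceTokenizer(SketchTokenizer):
    """Toy subclass: splits on whitespace and looks words up in a dict."""

    def create_encoder(self, *, task=None, lang=None, mode=None) -> Encoder:
        # This toy implementation ignores task, lang, and mode.
        raw = self.create_raw_encoder()
        eos = self.vocab["</s>"]
        # Assumed convention: append an end-of-sentence control symbol.
        return lambda text: raw(text) + [eos]

    def create_raw_encoder(self) -> Encoder:
        unk = self.vocab["<unk>"]
        # No control symbols; unknown words map to <unk>.
        return lambda text: [self.vocab.get(w, unk) for w in text.split()]

    def create_decoder(self) -> Decoder:
        inverse = {i: w for w, i in self.vocab.items()}
        eos = self.vocab["</s>"]
        # Drop the end-of-sentence symbol when converting back to text.
        return lambda ids: " ".join(inverse[i] for i in ids if i != eos)


vocab = {"<unk>": 0, "</s>": 1, "hello": 2, "world": 3}
tokenizer = WhitespaceTokenizer(vocab)

encoder = tokenizer.create_encoder()
decoder = tokenizer.create_decoder()

indices = encoder("hello world")  # [2, 3, 1]
text = decoder(indices)           # "hello world"
```

Because the base class is abstract, instantiating it directly raises a TypeError; only a subclass that implements all three factory methods can be constructed.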

abstract create_decoder()

Create a token decoder.

Return type:

TextTokenDecoder

abstract create_encoder(*, task=None, lang=None, mode=None, device=None, pin_memory=False)

Create a token encoder.

The valid values for the task, lang, and mode parameters are implementation-specific. Refer to the concrete TextTokenizer subclasses for the values each one accepts.

Parameters:
  • task (str | None) – The task for which to generate token indices. Typically, multi-task jobs use task to distinguish between different tasks such as ‘translation’ or ‘transcription’.

  • lang (str | None) – The language for which to generate token indices. Typically, multilingual translation tasks use lang to distinguish between different languages such as ‘en-US’ or ‘de-DE’.

  • mode (str | None) – The mode in which to generate token indices. Typically, translation tasks use mode to distinguish between different modes such as ‘source’ or ‘target’.

  • device (device | None) – The device on which to construct tensors.

  • pin_memory (bool) – If True, uses pinned memory while constructing tensors.

Return type:

TextTokenEncoder
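
Since the accepted values of task, lang, and mode are left to each implementation, the following sketch shows one way a concrete tokenizer might interpret them. It is not taken from any fairseq2 subclass: the VOCAB table, the "__en__"-style language symbols, the accepted values, and the prefix and suffix layout are all assumptions made for illustration, and device and pin_memory are omitted because the sketch returns plain lists instead of tensors.

```python
from typing import Callable, List, Optional

# Hypothetical vocabulary with two language control symbols.
VOCAB = {
    "<unk>": 0, "</s>": 1, "__en__": 2, "__de__": 3,
    "guten": 4, "tag": 5, "good": 6, "day": 7,
}


def create_encoder(
    *,
    task: Optional[str] = None,
    lang: Optional[str] = None,
    mode: Optional[str] = None,
) -> Callable[[str], List[int]]:
    """Return an encoding function configured by task, lang, and mode."""
    # Assumed rule: this toy tokenizer supports a single task.
    if task is not None and task != "translation":
        raise ValueError(f"unsupported task: {task!r}")

    lang = lang or "en"
    if lang not in ("en", "de"):
        raise ValueError(f"unsupported lang: {lang!r}")

    mode = mode or "source"
    if mode not in ("source", "target"):
        raise ValueError(f"unsupported mode: {mode!r}")

    lang_idx = VOCAB[f"__{lang}__"]
    eos_idx = VOCAB["</s>"]

    # Assumed layout: source sequences start with the language symbol,
    # target sequences start with </s> followed by the language symbol.
    prefix = [lang_idx] if mode == "source" else [eos_idx, lang_idx]
    suffix = [eos_idx]

    def encode(text: str) -> List[int]:
        body = [VOCAB.get(w, VOCAB["<unk>"]) for w in text.lower().split()]
        return prefix + body + suffix

    return encode


source_encoder = create_encoder(task="translation", lang="en", mode="source")
target_encoder = create_encoder(task="translation", lang="de", mode="target")

source_indices = source_encoder("good day")   # [2, 6, 7, 1]
target_indices = target_encoder("guten Tag")  # [1, 3, 4, 5, 1]
```

Validating the three arguments once in the factory, rather than on every call, keeps the returned encoder cheap to invoke and surfaces an unsupported value at construction time.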

abstract create_raw_encoder(*, device=None, pin_memory=False)

Create a raw token encoder with no control symbols.

Parameters:
  • device (device | None) – The device on which to construct tensors.

  • pin_memory (bool) – If True, uses pinned memory while constructing tensors.

Return type:

TextTokenEncoder
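
To make the difference between a raw encoder and a regular encoder concrete, the sketch below compares the two on a toy vocabulary. The raw_encode and encode functions, the VOCAB table, and the choice of "<s>" and "</s>" as control symbols are hypothetical and stand in for whatever a concrete tokenizer returns from create_raw_encoder() and create_encoder(); plain lists replace tensors.

```python
from typing import List

# Hypothetical vocabulary; <s> and </s> play the role of control symbols.
VOCAB = {"<s>": 0, "</s>": 1, "<unk>": 2, "a": 3, "b": 4, "c": 5}


def raw_encode(text: str) -> List[int]:
    """Raw encoding: token indices only, no control symbols."""
    return [VOCAB.get(w, VOCAB["<unk>"]) for w in text.split()]


def encode(text: str) -> List[int]:
    """Regular encoding: raw indices wrapped in <s> ... </s>."""
    return [VOCAB["<s>"]] + raw_encode(text) + [VOCAB["</s>"]]


segments = ["a b", "c"]

# With the raw encoder the caller decides where control symbols go,
# so a multi-segment sequence is wrapped exactly once.
joined_raw = (
    [VOCAB["<s>"]]
    + [i for s in segments for i in raw_encode(s)]
    + [VOCAB["</s>"]]
)  # [0, 3, 4, 5, 1]

# Concatenating regular encodings repeats the control symbols per segment.
joined_full = [i for s in segments for i in encode(s)]  # [0, 3, 4, 1, 0, 5, 1]
```

A raw encoder is therefore the more convenient choice when the caller assembles sequences from several pieces and wants full control over where the control symbols appear.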