TextTokenizer
- class fairseq2.data.text.TextTokenizer
Bases: ABC
Represents a tokenizer to encode and decode text.
- abstract create_encoder(*, task=None, lang=None, mode=None, device=None, pin_memory=False)
Create a token encoder.
The valid arguments for the task, lang, and mode parameters are implementation specific. Refer to concrete TextTokenizer subclasses for more information.
- Parameters:
task (str | None) – The task for which to generate token indices. Typically, task is used to distinguish between different tasks such as ‘translation’ or ‘transcription’.
lang (str | None) – The language of generated token indices. Typically, multilingual translation tasks use lang to distinguish between different languages such as ‘en-US’ or ‘de-DE’.
mode (str | None) – The mode in which to generate token indices. Typically, translation tasks use mode to distinguish between different modes such as ‘source’ or ‘target’.
device (device | None) – The device on which to construct tensors.
pin_memory (bool) – If True, uses pinned memory while constructing tensors.
- Return type:
- abstract create_raw_encoder(*, device=None, pin_memory=False)
Create a raw token encoder with no control symbols.
- Parameters:
- Return type:
- abstract property vocab_info: VocabularyInfo
The vocabulary information associated with the tokenizer.
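Example
The difference between create_encoder and create_raw_encoder can be illustrated with a toy sketch. This is not fairseq2 code: it does not import fairseq2, and ToyTokenizerBase, ToyWhitespaceTokenizer, ToyVocabInfo, and the control symbols <unk>, <s>, and </s> are all hypothetical names chosen for illustration. The sketch returns plain Python lists instead of tensors, so the device and pin_memory parameters are omitted, and the accepted task and mode values are assumptions made for this example only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# In this sketch an "encoder" is just a callable from text to token indices.
Encoder = Callable[[str], List[int]]


@dataclass(frozen=True)
class ToyVocabInfo:
    """Hypothetical stand-in for vocabulary information."""
    size: int
    unk_idx: int
    bos_idx: int
    eos_idx: int


class ToyTokenizerBase(ABC):
    """Hypothetical base class that mirrors the documented member names only."""

    @abstractmethod
    def create_encoder(self, *, task: Optional[str] = None,
                       lang: Optional[str] = None,
                       mode: Optional[str] = None) -> Encoder: ...

    @abstractmethod
    def create_raw_encoder(self) -> Encoder: ...

    @property
    @abstractmethod
    def vocab_info(self) -> ToyVocabInfo: ...


class ToyWhitespaceTokenizer(ToyTokenizerBase):
    # Assumed control symbols; they occupy indices 0, 1, and 2.
    CONTROL = ("<unk>", "<s>", "</s>")

    def __init__(self, words: Sequence[str]) -> None:
        self._index = {w: i for i, w in enumerate([*self.CONTROL, *words])}

    def _lookup(self, text: str) -> List[int]:
        unk = self._index["<unk>"]
        return [self._index.get(w, unk) for w in text.split()]

    def create_raw_encoder(self) -> Encoder:
        # Raw encoding: token indices only, no control symbols added.
        return self._lookup

    def create_encoder(self, *, task: Optional[str] = None,
                       lang: Optional[str] = None,
                       mode: Optional[str] = None) -> Encoder:
        # Valid values are implementation specific; these are toy choices.
        # lang is accepted but ignored by this toy implementation.
        if task not in (None, "translation"):
            raise ValueError(f"unsupported task: {task!r}")
        if mode not in (None, "source", "target"):
            raise ValueError(f"unsupported mode: {mode!r}")
        bos, eos = self._index["<s>"], self._index["</s>"]

        def encode(text: str) -> List[int]:
            # Wrap the raw indices in begin/end control symbols.
            return [bos, *self._lookup(text), eos]

        return encode

    @property
    def vocab_info(self) -> ToyVocabInfo:
        return ToyVocabInfo(size=len(self._index), unk_idx=0, bos_idx=1, eos_idx=2)


tokenizer = ToyWhitespaceTokenizer(["hello", "world"])
print(tokenizer.create_raw_encoder()("hello world"))  # [3, 4]
print(tokenizer.create_encoder(task="translation", mode="source")("hello world"))  # [1, 3, 4, 2]
```

In this sketch the raw encoder returns only the word indices, while the task-aware encoder adds the assumed begin and end symbols around them. How a concrete TextTokenizer subclass actually uses task, lang, and mode may differ, so consult that subclass's documentation.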