VocabularyInfo

class fairseq2.data.VocabularyInfo(size, unk_idx, bos_idx, eos_idx, pad_idx)

Bases: object

Describes the vocabulary used by a tokenizer.

bos_idx: int | None

The index of the symbol that represents the beginning of a sequence (BOS).

eos_idx: int | None

The index of the symbol that represents the end of a sequence (EOS).

pad_idx: int | None

The index of the symbol that is used to pad a sequence (PAD).

size: int

The size of the vocabulary, i.e. the number of symbols it contains.

unk_idx: int | None

The index of the symbol that represents an unknown element (UNK).
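The attributes above can be illustrated with a minimal sketch. To stay self-contained it does not import fairseq2; instead it uses a hypothetical stand-in dataclass, VocabularyInfoSketch, that mirrors only the documented fields and constructor argument order. The index values, the pad_batch helper, and the reading of None as "the vocabulary defines no such symbol" are assumptions made for illustration, not behavior confirmed by this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VocabularyInfoSketch:
    """Hypothetical stand-in that mirrors the documented fields only."""

    size: int
    unk_idx: Optional[int]
    bos_idx: Optional[int]
    eos_idx: Optional[int]
    pad_idx: Optional[int]


def pad_batch(seqs: List[List[int]], info: VocabularyInfoSketch) -> List[List[int]]:
    """Right-pad each sequence to the longest length using info.pad_idx."""
    # Assumption: None means the vocabulary has no PAD symbol.
    if info.pad_idx is None:
        raise ValueError("vocabulary defines no PAD symbol")
    max_len = max((len(s) for s in seqs), default=0)
    return [s + [info.pad_idx] * (max_len - len(s)) for s in seqs]


# Illustrative values; real indices depend on the tokenizer.
info = VocabularyInfoSketch(size=32000, unk_idx=0, bos_idx=1, eos_idx=2, pad_idx=3)
batch = pad_batch([[1, 17, 2], [1, 2]], info)
```

Because every special-symbol index is typed int | None, code that consumes these fields should check for None before using them, as pad_batch does here.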