fairseq2.models.hg.tokenizer

HuggingFace tokenizer integration for fairseq2.

Functions

load_hg_tokenizer(path, config)

Load a HuggingFace tokenizer.

Classes

HgTokenizer(model)

HuggingFace tokenizer adapter for fairseq2.

HgTokenizerConfig(*[, unk_token, bos_token, ...])

Configuration for HuggingFace tokenizers.

class fairseq2.models.hg.tokenizer.HgTokenizerConfig(*, unk_token: str | None = None, bos_token: str | None = None, eos_token: str | None = None, pad_token: str | None = None, boh_token: str | None = None, eoh_token: str | None = None)

Bases: object

Configuration for HuggingFace tokenizers.

unk_token: str | None = None

The unknown token.

bos_token: str | None = None

The beginning-of-sequence token.

eos_token: str | None = None

The end-of-sequence token.

pad_token: str | None = None

The padding token.

boh_token: str | None = None

The beginning-of-head token.

eoh_token: str | None = None

The end-of-head token.

class fairseq2.models.hg.tokenizer.HgTokenizer(model: HuggingFaceTokenModel)

Bases: Tokenizer

HuggingFace tokenizer adapter for fairseq2.

This class wraps a HuggingFace tokenizer to make it compatible with fairseq2’s Tokenizer interface. It provides access to both fairseq2 tokenizer methods and the underlying HuggingFace tokenizer.

Example:

Create a tokenizer from a model:

from fairseq2.models.hg.tokenizer import HgTokenizer

# load_hg_token_model is assumed importable from the same hg package.
model = load_hg_token_model("gpt2")
tokenizer = HgTokenizer(model)

# Use the fairseq2 interface
tokens = tokenizer.encode("Hello world")
text = tokenizer.decode(tokens)

# Access the underlying HuggingFace tokenizer
hf_tokenizer = tokenizer.raw

create_encoder(*, task: str | None = None, lang: str | None = None, mode: str | None = None, device: device | None = None, pin_memory: bool = False) → TokenEncoder

create_raw_encoder(*, device: device | None = None, pin_memory: bool = False) → TokenEncoder

create_decoder(*, skip_special_tokens: bool = False) → TokenDecoder

encode(text: str, *, device: device | None = None, pin_memory: bool = False) → Tensor

decode(token_indices: Tensor, *, skip_special_tokens: bool = False) → str

convert_tokens_to_ids(tokens: list[str] | str) → int | list[int]

property vocab_info: VocabularyInfo

property unk_token: str | None

property bos_token_id: int | None

property bos_token: str | None

property eos_token_id: int | None

property eos_token: str | None

property pad_token_id: int | None

property pad_token: str | None

property boh_token: str | None

property eoh_token: str | None

property chat_template: str | None

property raw: PreTrainedTokenizer | PreTrainedTokenizerFast

property model: HuggingFaceTokenModel
fairseq2.models.hg.tokenizer.load_hg_tokenizer(path: Path, config: HgTokenizerConfig) → HgTokenizer

Load a HuggingFace tokenizer.

Parameters:

path – The path to the tokenizer files.

config – The tokenizer configuration.

Returns:

The loaded HgTokenizer instance.
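
An end-to-end sketch of the loader. The tokenizer directory is a placeholder, and the call is guarded so the snippet degrades gracefully where fairseq2, or a tokenizer at that path, is unavailable:

```python
# Hedged usage sketch for load_hg_tokenizer; names and the import path come
# from this reference, the directory is illustrative.
from pathlib import Path

try:
    from fairseq2.models.hg.tokenizer import HgTokenizerConfig, load_hg_tokenizer

    # Override only the special tokens the checkpoint actually needs.
    config = HgTokenizerConfig(eos_token="</s>")
    # "path/to/tokenizer" is a placeholder; point it at a real tokenizer dir.
    tokenizer = load_hg_tokenizer(Path("path/to/tokenizer"), config)
except Exception:  # fairseq2 missing, or no tokenizer at the placeholder path
    tokenizer = None
```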