fairseq2.models.hg.tokenizer
HuggingFace tokenizer integration for fairseq2.
Functions

- load_hg_tokenizer: Load a HuggingFace tokenizer.

Classes

- HgTokenizer: HuggingFace tokenizer adapter for fairseq2.
- HgTokenizerConfig: Configuration for HuggingFace tokenizers.
- class fairseq2.models.hg.tokenizer.HgTokenizerConfig(*, unk_token: str | None = None, bos_token: str | None = None, eos_token: str | None = None, pad_token: str | None = None, boh_token: str | None = None, eoh_token: str | None = None)

  Bases: object

  Configuration for HuggingFace tokenizers.
- class fairseq2.models.hg.tokenizer.HgTokenizer(model: HuggingFaceTokenModel)

  Bases: Tokenizer

  HuggingFace tokenizer adapter for fairseq2.
This class wraps a HuggingFace tokenizer to make it compatible with fairseq2’s Tokenizer interface. It provides access to both fairseq2 tokenizer methods and the underlying HuggingFace tokenizer.
- Example:

  Create a tokenizer from a model:

      model = load_hg_token_model("gpt2")
      tokenizer = HgTokenizer(model)

      # Use fairseq2 interface
      tokens = tokenizer.encode("Hello world")
      text = tokenizer.decode(tokens)

      # Access underlying HuggingFace tokenizer
      hf_tokenizer = tokenizer.raw
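Internally, wrapping of this kind is a plain adapter pattern: the wrapper forwards calls to the underlying object and exposes it via a property. A self-contained toy sketch (the `ToyHFTokenizer` and `ToyAdapter` names are illustrative stand-ins, not fairseq2 or HuggingFace classes):

```python
class ToyHFTokenizer:
    """Stand-in for a wrapped tokenizer (illustration only)."""

    def __init__(self):
        self.vocab = {"Hello": 0, "world": 1}
        self.inv = {i: t for t, i in self.vocab.items()}

    def encode(self, text):
        return [self.vocab[t] for t in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)


class ToyAdapter:
    """Forwards encode/decode and exposes the wrapped object as ``raw``."""

    def __init__(self, model):
        self._model = model

    def encode(self, text):
        return self._model.encode(text)

    def decode(self, ids):
        return self._model.decode(ids)

    @property
    def raw(self):
        # Direct access to the underlying tokenizer, as with HgTokenizer.raw.
        return self._model


tok = ToyAdapter(ToyHFTokenizer())
ids = tok.encode("Hello world")   # [0, 1]
text = tok.decode(ids)            # "Hello world"
```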
- create_encoder(*, task: str | None = None, lang: str | None = None, mode: str | None = None, device: device | None = None, pin_memory: bool = False) → TokenEncoder
- create_decoder(*, skip_special_tokens: bool = False) → TokenDecoder
- property vocab_info: VocabularyInfo
- property raw: PreTrainedTokenizer | PreTrainedTokenizerFast
- property model: HuggingFaceTokenModel
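The two factory methods above return callable encoder and decoder objects. A hedged sketch of that factory shape, using plain Python closures and a toy vocabulary in place of fairseq2's TokenEncoder/TokenDecoder:

```python
VOCAB = {"hi": 0, "there": 1}
INV = {0: "hi", 1: "there", 2: "<eos>"}
SPECIAL_IDS = {2}  # toy end-of-sequence token


def create_encoder():
    # Returns a callable mapping text to token ids (TokenEncoder analogue).
    def encode(text):
        return [VOCAB[t] for t in text.split()] + [2]  # append eos id
    return encode


def create_decoder(*, skip_special_tokens=False):
    # Returns a callable mapping ids back to text (TokenDecoder analogue);
    # skip_special_tokens drops special ids, mirroring the documented flag.
    def decode(ids):
        kept = [i for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]
        return " ".join(INV[i] for i in kept)
    return decode


encode = create_encoder()
decode = create_decoder(skip_special_tokens=True)
ids = encode("hi there")   # [0, 1, 2]
text = decode(ids)         # "hi there"
```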
- fairseq2.models.hg.tokenizer.load_hg_tokenizer(path: Path, config: HgTokenizerConfig) → HgTokenizer

  Load a HuggingFace tokenizer.

  - Parameters:
    - path – Path to the tokenizer files
    - config – Tokenizer configuration
  - Returns:
    An HgTokenizer instance
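To illustrate the path-plus-config loading shape (a toy sketch, not fairseq2's loader; `toy_load_tokenizer` is a hypothetical name), a loader of this form reads tokenizer data from the given path and then applies configuration overrides:

```python
import json
import tempfile
from pathlib import Path


def toy_load_tokenizer(path: Path, eos_token=None) -> dict:
    # Read a vocab file from ``path``; mirror the (path, config) call
    # shape by applying an optional eos-token override afterwards.
    vocab = json.loads((path / "vocab.json").read_text())
    if eos_token is not None and eos_token not in vocab:
        vocab[eos_token] = max(vocab.values()) + 1
    return vocab


with tempfile.TemporaryDirectory() as d:
    p = Path(d)
    (p / "vocab.json").write_text(json.dumps({"hello": 0}))
    vocab = toy_load_tokenizer(p, eos_token="</s>")
```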