neuralset.extractors.text.HuggingFaceText¶
- class neuralset.extractors.text.HuggingFaceText(*, model_name: str = 'openai-community/gpt2', device: Literal['auto', 'cpu', 'cuda', 'accelerate'] = 'auto', layers: float | list[float] | Literal['all'] = 0.6666666666666666, cache_n_layers: int | None = None, layer_aggregation: Literal['mean', 'sum', 'group_mean'] | None = 'mean', token_aggregation: Literal['first', 'last', 'mean', 'sum', 'max'] | None = 'mean', event_types: Literal['Word', 'Sentence'] = 'Word', aggregation: Literal['single', 'sum', 'mean', 'first', 'middle', 'last', 'cat', 'stack', 'trigger'] = 'single', allow_missing: bool = False, frequency: float = 0.0, infra: MapInfra = MapInfra(folder=None, cluster=None, logs='{folder}/logs/{user}/%j', job_name=None, timeout_min=25, nodes=1, tasks_per_node=1, cpus_per_task=10, gpus_per_node=1, mem_gb=None, max_pickle_size_gb=None, slurm_constraint=None, slurm_partition=None, slurm_account=None, slurm_qos=None, slurm_use_srun=False, slurm_additional_parameters=None, conda_env=None, workdir=None, permissions=511, version='v6', keep_in_ram=True, max_jobs=128, min_samples_per_job=4096, forbid_single_item_computation=False, mode='cached'), batch_size: int = 32, contextualized: bool = True, pretrained: bool | Literal['part-reversal'] = True)[source]¶
Get embeddings from HuggingFace language models. This extractor can be applied to any kind of event that has a text attribute: Word, Sentence, etc.
- Parameters:
batch_size (int) – Batch size for the language model.
contextualized (bool) – If True (the default), the context surrounding the event is used to compute the embeddings.
pretrained (bool or "part-reversal") – Use the pretrained model if True, a randomly initialized (untrained) model if False, or a custom scrambling of the pretrained weights if "part-reversal".
Note
The tokenizer truncates the input to the maximum length supported by the model. An empty context raises an error with the default HuggingFaceText, since contextualized is True by default. To get non-contextualized embeddings, set contextualized to False.
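To illustrate what the layers and token_aggregation parameters control, here is a minimal, hypothetical sketch: a fractional layers value is interpreted as a fraction of model depth, and token_aggregation collapses per-token vectors into one embedding per event. The function names and the exact layer-index mapping below are illustrative assumptions, not the actual neuralset implementation.

```python
def resolve_layer(fraction: float, n_layers: int) -> int:
    """Map a fraction of model depth to a concrete layer index.

    Assumption for illustration: the fraction is scaled over the
    available layer indices and rounded to the nearest one.
    """
    return round(fraction * (n_layers - 1))


def aggregate_tokens(embeddings: list[list[float]], mode: str) -> list[float]:
    """Collapse per-token embeddings into a single vector per event."""
    if mode == "first":
        return embeddings[0]
    if mode == "last":
        return embeddings[-1]
    # Remaining modes operate dimension-wise across tokens.
    dims = list(zip(*embeddings))
    if mode == "mean":
        return [sum(d) / len(d) for d in dims]
    if mode == "sum":
        return [sum(d) for d in dims]
    if mode == "max":
        return [max(d) for d in dims]
    raise ValueError(f"unknown token_aggregation mode: {mode}")


# GPT-2 (the default model) has 12 transformer layers; under this
# sketch the default fraction of 2/3 selects a layer about two
# thirds of the way through the network.
layer = resolve_layer(0.6666666666666666, 12)  # → 7
vec = aggregate_tokens([[1.0, 2.0], [3.0, 4.0]], "mean")  # → [2.0, 3.0]
```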