neuralset.extractors.text.HuggingFaceText

pydantic model neuralset.extractors.text.HuggingFaceText[source]

Get embeddings from HuggingFace language models. This extractor can be applied to any kind of event which has a text attribute: Word, Sentence, etc.

Parameters:
  • batch_size (int) – Batch size for the language model.

  • contextualized (bool) – If True (the default), the context of the event is used to compute the embeddings.

  • pretrained (bool or "part-reversal") – Use the pretrained model if True, an untrained (randomly initialized) model if False, or a custom scrambling of the pretrained weights if "part-reversal".
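The batch_size parameter controls how many events are grouped into each forward pass of the language model. A minimal sketch of that chunking (the texts and function name here are illustrative, not part of neuralset's API):

```python
from typing import Iterator

def batches(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive chunks of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Eight texts with batch_size=3 split into chunks of 3, 3, and 2.
sizes = [len(b) for b in batches([f"text {i}" for i in range(8)], 3)]
print(sizes)  # → [3, 3, 2]
```

Larger batches amortize per-call overhead on the GPU at the cost of memory; the default of 32 is a common middle ground.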

Note

The tokenizer truncates the input to the maximum length supported by the model. Because contextualized is True by default, the default HuggingFaceText raises an error on an empty context. To get non-contextualized embeddings, set contextualized to False.
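The two behaviors in the note can be sketched with toy stand-ins. This is not neuralset's implementation: the whitespace tokenizer below only mimics the truncation the model's real tokenizer applies, and check_context illustrates the empty-context error (both names are hypothetical):

```python
def tokenize(text: str, max_length: int) -> list[str]:
    """Toy whitespace tokenizer that truncates to max_length tokens,
    mimicking the model-specific truncation described in the note."""
    return text.split()[:max_length]

def check_context(context: str, contextualized: bool = True) -> None:
    """With the default contextualized=True, an empty context is an error."""
    if contextualized and not context:
        raise ValueError("empty context with contextualized=True; "
                         "set contextualized=False for context-free embeddings")

print(tokenize("one two three four five", max_length=3))  # → ['one', 'two', 'three']
check_context("", contextualized=False)  # no error: context is ignored
```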

Fields:
field model_name: str = 'openai-community/gpt2'[source]
field event_types: Literal['Word', 'Sentence'] = 'Word'[source]
requirements: ClassVar[tuple[str, ...]] = ('transformers>=4.29.2', 'huggingface_hub>=0.27.0')[source]
field infra: MapInfra = MapInfra(folder=None, cluster=None, logs='{folder}/logs/{user}/%j', job_name=None, timeout_min=25, nodes=1, tasks_per_node=1, cpus_per_task=10, gpus_per_node=1, mem_gb=None, max_pickle_size_gb=None, slurm_constraint=None, slurm_partition=None, slurm_account=None, slurm_qos=None, slurm_use_srun=False, slurm_additional_parameters=None, slurm_setup=None, conda_env=None, workdir=None, permissions=511, version='v7', keep_in_ram=True, max_jobs=128, min_samples_per_job=4096, forbid_single_item_computation=False, mode='cached')[source]
field batch_size: int = 32[source]
field contextualized: bool = True[source]
field pretrained: bool | Literal['part-reversal'] = True[source]
property model: Module[source]
property tokenizer: Any[source]
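A usage sketch tying the fields above together. The field values match the documented defaults, but neuralset may not be installed (or its constructor may differ), so the sketch is guarded and makes no claim about model outputs:

```python
try:
    # neuralset may be absent from this environment; sketch only.
    from neuralset.extractors.text import HuggingFaceText

    extractor = HuggingFaceText(
        model_name="openai-community/gpt2",  # documented default
        batch_size=32,                        # documented default
        contextualized=False,                 # avoid the empty-context error from the note
        pretrained=True,                      # use pretrained weights
    )
except Exception:
    extractor = None  # library missing or API differs: treat as a sketch
```

Setting pretrained=False instead would initialize the same architecture with random weights, which is useful as a baseline when measuring what pretraining contributes to the embeddings.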