neuralset.extractors.video.HuggingFaceVideo

pydantic model neuralset.extractors.video.HuggingFaceVideo[source]

Extract video features using a HuggingFace transformer model.

This feature extractor supports two processing modes (a minimal configuration sketch follows the list):

  1. Image-based processing: When using an image model, videos are sampled at the specified frequency and each frame is processed independently.

  2. Video-based processing: When using a native video model (e.g., VideoMAE, XClip), videos are divided into clips of clip_duration seconds at the specified frequency. Each clip is processed by the video model, and features are aggregated over time.
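
A minimal sketch of both modes, assuming HuggingFaceImage is importable from neuralset.extractors.image (the import path and model names are illustrative, not confirmed by this page):

    from neuralset.extractors.video import HuggingFaceVideo
    from neuralset.extractors.image import HuggingFaceImage  # assumed import path

    # Video-based processing: a native video model consumes whole clips.
    video_mode = HuggingFaceVideo(
        image=HuggingFaceImage(model_name="MCG-NJU/videomae-base"),
        clip_duration=2.0,  # process the video in two-second clips
    )

    # Image-based processing: an image model processes sampled frames independently.
    frame_mode = HuggingFaceVideo(
        image=HuggingFaceImage(model_name="google/vit-base-patch16-224"),
    )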

Parameters:
  • image (HuggingFaceImage, default=HuggingFaceImage(model_name="MCG-NJU/videomae-base")) – Image or video feature extractor configuration. If image.model_name refers to an image model (e.g., ViT), frames are extracted and processed independently. If it’s a video model, clips are processed using the native video architecture.

  • use_audio (bool, default=True) – Whether to include audio alongside video frames during feature extraction. Only applicable for models that support multimodal inputs (e.g., LLaVA-Video).

  • clip_duration (float | None, default=None) – Duration (in seconds) of video sub-clips to process. If None, defaults to one timestep (1 / frequency).

  • max_imsize (int | None, default=None) – Maximum image dimension for downsampling before processing. Useful for memory-constrained scenarios. For example, Phi-4 downsizes to 448×448 before tokenization.

  • layer_type (str, default="") –

    Specific layer extraction mode for certain models. For XClip: Use “mit” to extract from Multi-frame Integration Transformer layers instead of vision backbone layers. For LLaVA models: Must be a prompt string containing the <video> token (e.g., "<|user|><video><|end|><|assistant|>").

    Note

    The pipe characters in the example are literal LLaVA tokens.

  • num_frames (int | None, default=None) – Number of frames to pass to the video model per clip. If None, uses the model’s default frame count (e.g., 16 for VideoMAE, 8 for XClip, 64 for VJepa2).
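
The model-specific parameters above combine as in this sketch; model identifiers and values are illustrative:

    # XClip: extract from the Multi-frame Integration Transformer layers.
    xclip = HuggingFaceVideo(
        image=HuggingFaceImage(model_name="microsoft/xclip-base-patch32"),
        layer_type="mit",
        num_frames=8,       # XClip's default frame count
        clip_duration=1.0,  # one-second clips
    )

    # LLaVA-Video: layer_type carries the prompt, which must contain <video>.
    llava = HuggingFaceVideo(
        image=HuggingFaceImage(model_name="llava-hf/LLaVA-NeXT-Video-7B-hf"),
        layer_type="<|user|><video><|end|><|assistant|>",
        use_audio=True,  # LLaVA-Video supports multimodal input
    )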

Fields:
field event_types: Literal['Video'] = 'Video'[source]
requirements: ClassVar[tuple[str, ...]] = ('torchvision>=0.15.2', 'julius>=0.2.7', 'moviepy>=2.1.2')[source]
field image: HuggingFaceImage = HuggingFaceImage(**{
    'aggregation': 'single',
    'allow_missing': False,
    'batch_size': 32,
    'cache_n_layers': None,
    'device': 'cpu',
    'event_types': 'Image',
    'frequency': 0.0,
    'imsize': None,
    'infra': {'cluster': None,
              'conda_env': None,
              'cpus_per_task': 8,
              'folder': None,
              'forbid_single_item_computation': False,
              'gpus_per_node': 1,
              'job_name': None,
              'keep_in_ram': False,
              'logs': '{folder}/logs/{user}/%j',
              'max_jobs': 128,
              'max_pickle_size_gb': None,
              'mem_gb': None,
              'min_samples_per_job': 4096,
              'mode': 'cached',
              'nodes': 1,
              'permissions': 511,
              'slurm_account': None,
              'slurm_additional_parameters': None,
              'slurm_constraint': None,
              'slurm_partition': None,
              'slurm_qos': None,
              'slurm_setup': None,
              'slurm_use_srun': False,
              'tasks_per_node': 1,
              'timeout_min': 25,
              'version': 'v5',
              'workdir': None},
    'layer_aggregation': 'mean',
    'layers': 0.6666666666666666,
    'model_name': 'MCG-NJU/videomae-base',
    'name': 'HuggingFaceImage',
    'pretrained': True,
    'token_aggregation': 'mean'})[source]
field use_audio: bool = True[source]
field clip_duration: float | None = None[source]
field max_imsize: int | None = None[source]
field layer_type: str = ''[source]
field num_frames: int | None = None[source]
field infra: MapInfra = MapInfra(
    folder=None, cluster=None, logs='{folder}/logs/{user}/%j', job_name=None,
    timeout_min=120, nodes=1, tasks_per_node=1, cpus_per_task=8, gpus_per_node=1,
    mem_gb=None, max_pickle_size_gb=None, slurm_constraint=None,
    slurm_partition=None, slurm_account=None, slurm_qos=None, slurm_use_srun=False,
    slurm_additional_parameters=None, slurm_setup=None, conda_env=None,
    workdir=None, permissions=511, version='v5', keep_in_ram=True, max_jobs=128,
    min_samples_per_job=128, forbid_single_item_computation=False,
    mode='cached')[source]
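
The compute configuration can be overridden at construction time. A sketch, assuming MapInfra is importable from neuralset.infra (an unconfirmed path) and accepts the keyword arguments shown in the default above:

    from neuralset.infra import MapInfra  # assumed import path

    extractor = HuggingFaceVideo(
        infra=MapInfra(
            slurm_partition="gpu",  # illustrative partition name
            gpus_per_node=1,
            timeout_min=240,
            mode="cached",  # reuse features computed in earlier runs
        ),
    )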