neuralset.extractors.video.HuggingFaceVideo
- class neuralset.extractors.video.HuggingFaceVideo(*, event_types: Literal['Video'] = 'Video', aggregation: Literal['single', 'sum', 'mean', 'first', 'middle', 'last', 'cat', 'stack', 'trigger'] = 'single', allow_missing: bool = False, frequency: float | Literal['native'] = 0.0, image: HuggingFaceImage = HuggingFaceImage(None={'aggregation': 'single', 'allow_missing': False, 'batch_size': 32, 'cache_n_layers': None, 'device': 'cpu', 'event_types': 'Image', 'frequency': 0.0, 'imsize': None, 'infra': {'cluster': None, 'conda_env': None, 'cpus_per_task': 8, 'folder': None, 'forbid_single_item_computation': False, 'gpus_per_node': 1, 'job_name': None, 'keep_in_ram': False, 'logs': '{folder}/logs/{user}/%j', 'max_jobs': 128, 'max_pickle_size_gb': None, 'mem_gb': None, 'min_samples_per_job': 4096, 'mode': 'cached', 'nodes': 1, 'permissions': 511, 'slurm_account': None, 'slurm_additional_parameters': None, 'slurm_constraint': None, 'slurm_partition': None, 'slurm_qos': None, 'slurm_use_srun': False, 'tasks_per_node': 1, 'timeout_min': 25, 'version': 'v5', 'workdir': None}, 'layer_aggregation': 'mean', 'layers': 0.6666666666666666, 'model_name': 'MCG-NJU/videomae-base', 'name': 'HuggingFaceImage', 'pretrained': True, 'token_aggregation': 'mean'}), use_audio: bool = True, clip_duration: float | None = None, max_imsize: int | None = None, layer_type: str = '', num_frames: int | None = None, infra: MapInfra = MapInfra(folder=None, cluster=None, logs='{folder}/logs/{user}/%j', job_name=None, timeout_min=120, nodes=1, tasks_per_node=1, cpus_per_task=8, gpus_per_node=1, mem_gb=None, max_pickle_size_gb=None, slurm_constraint=None, slurm_partition=None, slurm_account=None, slurm_qos=None, slurm_use_srun=False, slurm_additional_parameters=None, conda_env=None, workdir=None, permissions=511, version='v5', keep_in_ram=True, max_jobs=128, min_samples_per_job=128, forbid_single_item_computation=False, mode='cached'))
Extract video features using a HuggingFace transformer model.
This feature extractor supports two processing modes:
- Image-based processing: When using an image model, videos are sampled at the specified frequency and each frame is processed independently.
- Video-based processing: When using a native video model (e.g., VideoMAE, XClip), videos are divided into clips of clip_duration seconds at the specified frequency. Each clip is processed by the video model, and features are aggregated over time.
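The timing behind the two modes can be sketched in plain Python. This is only an illustration and not the library's implementation: the helper names sample_times and clip_windows are hypothetical, and the placement of each clip (starting at its timestep and truncated at the end of the video) is an assumption, since the text above does not say how clips are aligned. The sketch also assumes a positive numeric frequency and does not cover 'native' or the 0.0 default.

```python
from typing import List, Optional, Tuple


def sample_times(duration: float, frequency: float) -> List[float]:
    """Timestamps (seconds) at which one feature vector would be produced."""
    if frequency <= 0:
        raise ValueError("this sketch only handles a positive frequency")
    n_steps = int(round(duration * frequency))
    return [i / frequency for i in range(n_steps)]


def clip_windows(
    duration: float, frequency: float, clip_duration: Optional[float] = None
) -> List[Tuple[float, float]]:
    """(start, end) of the clip a video model would see at each timestep.

    Assumption: a clip starts at its timestep and is cut off at the end
    of the video. The real extractor may align clips differently.
    """
    if clip_duration is None:
        # Documented default: one timestep, i.e. 1 / frequency.
        clip_duration = 1.0 / frequency
    return [
        (t, min(t + clip_duration, duration))
        for t in sample_times(duration, frequency)
    ]


# Image-based mode: one frame per timestamp.
frame_times = sample_times(duration=2.0, frequency=2.0)
# Video-based mode: one clip per timestamp; clips overlap when
# clip_duration is longer than 1 / frequency.
default_clips = clip_windows(duration=2.0, frequency=2.0)
long_clips = clip_windows(duration=2.0, frequency=2.0, clip_duration=1.0)
```

With a clip_duration longer than one timestep, consecutive clips overlap, which is why the two settings are kept separate in this sketch.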
- Parameters:
image (HuggingFaceImage, default=HuggingFaceImage(model_name="MCG-NJU/videomae-base")) – Image or video feature extractor configuration. If image.model_name refers to an image model (e.g., ViT), frames are extracted and processed independently. If it’s a video model, clips are processed using the native video architecture.
use_audio (bool, default=True) – Whether to include audio alongside video frames during feature extraction. Only applicable for models that support multimodal inputs (e.g., LLaVA-Video).
clip_duration (float | None, default=None) – Duration (in seconds) of video sub-clips to process. If None, defaults to one timestep (1 / frequency).
max_imsize (int | None, default=None) – Maximum image dimension for downsampling before processing. Useful for memory-constrained scenarios. For example, Phi-4 downsizes to 448×448 before tokenization.
layer_type (str, default="") – Specific layer extraction mode for certain models. For XClip: Use “mit” to extract from Multi-frame Integration Transformer layers instead of vision backbone layers. For LLaVA models: Must be a prompt string containing the <video> token (e.g., "<|user|><video><|end|><|assistant|>").
Note: The pipe characters in the example are literal LLaVA tokens.
num_frames (int | None, default=None) – Number of frames to pass to the video model per clip. If None, uses the model’s default frame count (e.g., 16 for VideoMAE, 8 for XClip, 64 for VJepa2).
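As a rough illustration of what max_imsize and num_frames control, the sketch below shows one plausible way to cap a frame's size and to pick a fixed number of frames from a clip. Both helpers (downscale_size, frame_indices) are hypothetical names, not part of neuralset. Preserving the aspect ratio and spacing the frames evenly are assumptions; the extractor may resize or sample differently.

```python
from typing import List, Optional, Tuple


def downscale_size(
    width: int, height: int, max_imsize: Optional[int]
) -> Tuple[int, int]:
    """Frame size after capping the longest side at max_imsize.

    Assumption: the aspect ratio is preserved and frames are never upscaled.
    """
    longest = max(width, height)
    if max_imsize is None or longest <= max_imsize:
        return width, height
    new_width = max(1, round(width * max_imsize / longest))
    new_height = max(1, round(height * max_imsize / longest))
    return new_width, new_height


def frame_indices(n_available: int, num_frames: int) -> List[int]:
    """Evenly spaced frame indices covering the whole clip.

    Assumption: uniform sampling from first to last frame. Indices repeat
    when the clip holds fewer frames than requested.
    """
    if n_available <= 0 or num_frames <= 0:
        raise ValueError("both arguments must be positive")
    if num_frames == 1:
        return [n_available // 2]  # single frame: take the middle one
    last = n_available - 1
    return [round(i * last / (num_frames - 1)) for i in range(num_frames)]


small = downscale_size(1920, 1080, max_imsize=448)
unchanged = downscale_size(320, 240, max_imsize=448)
picked = frame_indices(n_available=64, num_frames=16)
```

Here a 1920×1080 frame is reduced so that its longest side is 448 pixels, and 16 indices are spread over a 64-frame clip, matching the VideoMAE default frame count mentioned above.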