Benchmarking cross-modal learning from naturalistic egocentric video data

About the challenge

Human infants acquire language from sparse, weakly aligned multimodal input. Yet today's vision–language models (VLMs), trained on curated web data, fail to generalize to the egocentric streams produced by wearable cameras and head-cams. This challenge asks: how far can we close that gap algorithmically, without changing the data?

Participants train a VLM on the BabyView 2025.1 corpus (≈863 h of head-mounted-camera video from children, collected by Stanford University and hosted on NYU Databrary) and nothing else: no extra image, video, text or audio data may be used for any encoder pretraining, fine-tuning or evaluation. Submissions are scored on three families of tasks, each with subgroup aggregates and an overall score:

Cross-modal grounding — Machine-DevBench
- Lexical (2): noun & adjective recognition
- Grammatical (8): subject–verb / subject–adjective binding; negation; word order; prepositions; comparatives; counting; embedded relatives
~3,700 contrastive (image, caption) trials sampled from the model's own training vocabulary across log-frequency bins.
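
To make the idea of a contrastive trial concrete, here is a minimal sketch of how one might be scored. It assumes each trial pairs one image with a matching caption and a foil caption, and that the model exposes embeddings compared by cosine similarity; the trial layout and the names `trial_correct` and `accuracy` are illustrative assumptions, not the official evaluation code.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def trial_correct(image_emb, target_emb, foil_emb) -> bool:
    """Hypothetical rule: a trial is correct when the image is more
    similar to the matching caption than to the foil caption."""
    return cosine(image_emb, target_emb) > cosine(image_emb, foil_emb)


def accuracy(trials) -> float:
    """Fraction of (image, target caption, foil caption) trials scored correct."""
    results = [trial_correct(*trial) for trial in trials]
    return sum(results) / len(results)


# Toy embeddings (made up for illustration, not benchmark data).
trials = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),  # target caption is closer
    ([0.0, 1.0], [0.8, 0.2], [0.2, 0.8]),  # foil caption is closer
]
print(accuracy(trials))
```

Under this two-alternative assumption, a model that cannot tell target from foil would land around 50% accuracy.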
Vision
- Object recognition (6): ImageNet-1k (k-NN, linear, ABX); MNIST (linear, ABX); COCO-Stuff segmentation
- Visual properties (3): NYUv2 depth; CountBench (linear, ABX)
Language
- Syntax (3): Zorro; LongTail-Swap (Inflection, Agreement)
- Semantics (2): LongTail-Swap (Word); Visual-Property Swap (color, material, size, shape)
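
The family, subgroup and task hierarchy above could be rolled up into scores along the following lines. This is only a sketch under stated assumptions: it supposes every task score has already been normalized to a common scale where higher is better, and it uses unweighted means at each level. The actual aggregation rule is not specified here, and the numbers are placeholders, not results.

```python
from statistics import mean


def aggregate(scores):
    """Roll up {family: {subgroup: {task: score}}} into per-subgroup
    aggregates and one overall score per family (unweighted means,
    an assumption made for this sketch)."""
    report = {}
    for family, subgroups in scores.items():
        sub = {
            name: mean(list(tasks.values()))
            for name, tasks in subgroups.items()
        }
        report[family] = {
            "subgroups": sub,
            "overall": mean(list(sub.values())),
        }
    return report


# Placeholder scores for illustration only.
scores = {
    "Language": {
        "Syntax": {
            "Zorro": 0.5,
            "LongTail-Swap (Inflection)": 0.75,
            "LongTail-Swap (Agreement)": 0.25,
        },
        "Semantics": {
            "LongTail-Swap (Word)": 0.5,
            "Visual-Property Swap": 1.0,
        },
    },
}
report = aggregate(scores)
print(report["Language"]["subgroups"])
print(report["Language"]["overall"])
```

Averaging subgroups rather than individual tasks keeps a subgroup with many tasks from dominating the family score; a task-weighted mean would be an equally plausible alternative.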

Baselines trained on naturalistic egocentric input, whether contrastive or generative, score near chance on the cross-modal probes, while baselines trained on curated captions (COCO) approach off-the-shelf CLIP. Beating the baselines will likely require new training objectives, architectures or curricula rather than new data.

Ready to enter? See the submission instructions or browse the current leaderboard.

Citing our work

If you use our benchmark or find our work useful in your research, please consider citing:

@article{lin2026egobabyvlm,
  title   = {EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data},
  author  = {Lin, Dongyan and Rust, Phillip and Villar-Corrales, Angel and Tan, Alvin W. M.
             and Luthra, Mahi and Saint-James, Charles-{\'E}ric and Moritz, Rashel
             and Krogh-Jespersen, Sheila and Stark, Vanessa and Parimi, Surya
             and Shen, Jiayi and Benchekroun, Youssef and Higuchi, Yosuke
             and Gleize, Martin and Fizycki, Tom and Hamilakis, Nicolas
             and Khentout, Manel and Tsuji, Sho and K{\'e}gl, Bal{\'a}zs
             and Pino, Juan and Frank, Michael C. and Dupoux, Emmanuel},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.19130},
}