EgoBabyVLM Challenge

Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Train a vision–language model on the BabyView corpus (≈863 h of head-mounted-camera infant video) and nothing else. Beat the baselines on a fixed benchmark suite, and close the gap to web-scale pretrained models.

Dongyan Lin¹∗†, Phillip Rust¹∗†, Angel Villar Corrales¹†, Alvin W. M. Tan², Mahi Luthra¹, Charles-Éric Saint-James¹, Rashel Moritz¹, Sheila Krogh-Jespersen³, Vanessa Stark¹, Surya Parimi¹, Jiayi Shen¹, Youssef Benchekroun¹, Yosuke Higuchi¹, Martin Gleize¹, Tom Fizycki¹, Nicolas Hamilakis⁴, Manel Khentout⁴, Sho Tsuji⁵, Balázs Kégl¹‡, Juan Pino¹, Michael C. Frank², Emmanuel Dupoux¹⁴
¹Meta Superintelligence Labs, ²Stanford University, ³Meta Reality Labs, ⁴École Normale Supérieure, ⁵The University of Tokyo
∗Equal contribution. †Core contributors. ‡Work done at Meta. Correspondence: dongyanlin@meta.com, philliprust@meta.com
10 cross-modal tasks
9 vision tasks
5 language tasks
863 h of BabyView video

About the Challenge

Human infants acquire language from sparse, weakly aligned multimodal input, yet today's vision–language models (VLMs), trained on curated web data, fail to generalize to the kinds of egocentric streams produced by wearable cameras and head-cams. The EgoBabyVLM Challenge asks: how far can we close that gap algorithmically, without changing the data?

Participants train a VLM on the BabyView 2025.1 corpus (≈863 h of head-mounted-camera video from children) and nothing else: no extra image, video, text, or audio data may be used for any encoder pretraining, fine-tuning, or evaluation. Submissions are scored on three families of tasks (cross-modal, vision, and language), each with subgroup aggregates and an overall score.
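
As a rough illustration of how per-task results could roll up into subgroup aggregates, family scores, and an overall score, here is a minimal sketch. It assumes unweighted means at every level, which may differ from the official aggregation. The subgroup names, task names, and numbers are hypothetical placeholders, not actual benchmark tasks or results.

```python
from statistics import mean

# Hypothetical scores: family -> subgroup -> task -> accuracy in [0, 1].
# Every name and number here is a placeholder for illustration only.
scores = {
    "cross_modal": {
        "subgroup_a": {"task_1": 0.50, "task_2": 0.70},
        "subgroup_b": {"task_3": 0.40},
    },
    "vision": {"subgroup_c": {"task_4": 0.80, "task_5": 0.60}},
    "language": {"subgroup_d": {"task_6": 0.30}},
}


def aggregate(scores):
    """Roll task scores up to subgroup, family, and overall levels.

    Assumption: an unweighted mean is used at each level.
    """
    # Mean over the tasks inside each subgroup.
    subgroup = {
        family: {name: mean(tasks.values()) for name, tasks in subgroups.items()}
        for family, subgroups in scores.items()
    }
    # Mean over the subgroup aggregates inside each family.
    family = {name: mean(aggs.values()) for name, aggs in subgroup.items()}
    # Mean over the three family scores.
    overall = mean(family.values())
    return subgroup, family, overall


subgroup, family, overall = aggregate(scores)
for name, value in family.items():
    print(f"{name}: {value:.3f}")
print(f"overall: {overall:.3f}")
```

Averaging in stages like this keeps a family with many tasks (such as the 10 cross-modal tasks) from dominating the overall score. A flat mean over all tasks would weight the families differently.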

Naturalistic egocentric input drives contrastive and generative baselines to near-chance performance on the cross-modal probes, whereas models trained on curated captions (COCO) approach off-the-shelf CLIP. Beating the baselines will likely require new training objectives, architectures, or curricula, not new data.

Ready to enter? See the submission instructions or browse the current leaderboard.

Citing our work

If you use our benchmark or find our work useful in your research, please consider citing:

@article{lin2026egobabyvlm,
  title   = {EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data},
  author  = {Lin, Dongyan and Rust, Phillip and Villar-Corrales, Angel and Tan, Alvin W. M.
             and Luthra, Mahi and Saint-James, Charles-{\'E}ric and Moritz, Rashel
             and Krogh-Jespersen, Sheila and Stark, Vanessa and Parimi, Surya
             and Shen, Jiayi and Benchekroun, Youssef and Higuchi, Yosuke
             and Gleize, Martin and Fizycki, Tom and Hamilakis, Nicolas
             and Khentout, Manel and Tsuji, Sho and K{\'e}gl, Bal{\'a}zs
             and Pino, Juan and Frank, Michael C. and Dupoux, Emmanuel},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.19130},
}