About the Challenge
Human infants acquire language from sparse, weakly aligned multimodal input, yet today's vision–language models (VLMs), trained on curated web data, fail to generalize to the egocentric streams produced by wearable cameras and head-cams. The EgoBabyVLM Challenge asks: how far can we close that gap algorithmically, without changing the data?
Participants train a VLM on the BabyView 2025.1 corpus (≈863 h of head-mounted-camera video from children) and nothing else: no extra image, video, text, or audio data may be used for any encoder pretraining, fine-tuning, or evaluation. Submissions are scored on three families of tasks, each with subgroup aggregates and an overall score:
- Cross-modal grounding: Machine-DevBench
  - Lexical (2): noun & adjective recognition
  - Grammatical (8): subject–verb / subject–adjective binding; negation; word order; prepositions; comparatives; counting; embedded relatives

  ~3,700 contrastive (image, caption) trials sampled from the model's own training vocabulary across log-frequency bins.
- Vision
  - Object recognition (6): ImageNet-1k (k-NN, linear, ABX); MNIST (linear, ABX); COCO-Stuff segmentation
  - Visual properties (3): NYUv2 depth; CountBench (linear, ABX)
- Language
  - Syntax (3): Zorro; LongTail-Swap (Inflection, Agreement)
  - Semantics (2): LongTail-Swap (Word); Visual-Property Swap (color, material, size, shape)
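To make the scoring structure concrete, here is a minimal sketch of how a contrastive (image, caption) trial could be scored and then rolled up into subgroup aggregates and an overall score. It is not the official evaluation code. The two-alternative trial format, the `Trial` record, the `similarity` callable, the toy data, and the unweighted means over tasks and subgroups are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Trial:
    """Hypothetical record for one contrastive trial (not the official format)."""
    task: str      # e.g. "negation"
    subgroup: str  # e.g. "grammatical"
    image: str     # image identifier (here: a toy textual description)
    target: str    # caption that matches the image
    foil: str      # minimally different caption that does not match


def trial_correct(trial: Trial, similarity: Callable[[str, str], float]) -> bool:
    # A trial counts as correct only if the matching caption scores
    # strictly higher than the foil; ties count as errors.
    return similarity(trial.image, trial.target) > similarity(trial.image, trial.foil)


def aggregate(trials: List[Trial], similarity: Callable[[str, str], float]) -> Dict:
    # Per-task accuracy first, so large tasks do not dominate small ones.
    outcomes = defaultdict(list)
    subgroup_of = {}
    for t in trials:
        outcomes[t.task].append(1.0 if trial_correct(t, similarity) else 0.0)
        subgroup_of[t.task] = t.subgroup
    task_acc = {task: mean(vals) for task, vals in outcomes.items()}

    # Subgroup score: unweighted mean of its tasks (an assumption).
    grouped = defaultdict(list)
    for task, acc in task_acc.items():
        grouped[subgroup_of[task]].append(acc)
    subgroup_acc = {sg: mean(vals) for sg, vals in grouped.items()}

    # Overall score: unweighted mean of subgroup scores (also an assumption).
    overall = mean(list(subgroup_acc.values()))
    return {"task": task_acc, "subgroup": subgroup_acc, "overall": overall}


def toy_similarity(image: str, caption: str) -> float:
    # Stand-in for a real VLM: counts words shared by the toy image
    # description and the caption. It ignores word order and negation.
    return float(len(set(image.split()) & set(caption.split())))


demo_trials = [
    Trial("noun", "lexical", "red ball", "a ball", "a dog"),
    Trial("adjective", "lexical", "red ball", "a red thing", "a blue thing"),
    Trial("negation", "grammatical", "red ball", "a ball", "no ball"),
]
scores = aggregate(demo_trials, toy_similarity)
print(scores)
```

In this toy run the bag-of-words scorer passes both lexical trials but ties on the negation trial, which illustrates why the grammatical tasks are reported separately from the lexical ones.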
Contrastive and generative baselines trained on the naturalistic egocentric input score near chance on the cross-modal probes, while baselines trained on curated captions (COCO) approach off-the-shelf CLIP. Beating the baselines will likely require new training objectives, architectures, or curricula, not new data.
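As a rough guide to what "near chance" means numerically, the sketch below computes the band of accuracies that is statistically indistinguishable from guessing under a normal approximation to the binomial. It assumes two-alternative trials (chance = 0.5) and a 95% band (z = 1.96). Neither assumption comes from the challenge description, and the function names are hypothetical.

```python
import math
from typing import Tuple


def chance_band(n_trials: int, chance: float = 0.5, z: float = 1.96) -> Tuple[float, float]:
    # Standard error of an accuracy estimate when the true rate equals chance.
    se = math.sqrt(chance * (1.0 - chance) / n_trials)
    return chance - z * se, chance + z * se


def is_near_chance(accuracy: float, n_trials: int, chance: float = 0.5, z: float = 1.96) -> bool:
    # True if the observed accuracy falls inside the band around chance.
    low, high = chance_band(n_trials, chance, z)
    return low <= accuracy <= high


# Band for the full set of roughly 3,700 trials.
low, high = chance_band(3700)
print(f"chance band for 3,700 trials: [{low:.4f}, {high:.4f}]")

# The band widens for a single task with fewer trials (370 is an illustrative count).
print(chance_band(370))
```

Because the band shrinks with the square root of the trial count, a small edge over 0.5 can be meaningful on the pooled trials and still be noise within a single task.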
Ready to enter? See the submission instructions or browse the current leaderboard.
Citing our work
If you use our benchmark or find our work useful in your research, please consider citing:
@article{lin2026egobabyvlm,
  title   = {EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data},
  author  = {Lin, Dongyan and Rust, Phillip and Villar-Corrales, Angel and Tan, Alvin W. M.
             and Luthra, Mahi and Saint-James, Charles-{\'E}ric and Moritz, Rashel
             and Krogh-Jespersen, Sheila and Stark, Vanessa and Parimi, Surya
             and Shen, Jiayi and Benchekroun, Youssef and Higuchi, Yosuke
             and Gleize, Martin and Fizycki, Tom and Hamilakis, Nicolas
             and Khentout, Manel and Tsuji, Sho and K{\'e}gl, Bal{\'a}zs
             and Pino, Juan and Frank, Michael C. and Dupoux, Emmanuel},
  journal = {arXiv preprint},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.19130},
}