Leaderboard

Mean ± std across 3 seeds on the EgoBabyVLM benchmark suite.
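As an illustration of how a "mean ± std" cell could be produced from three per-seed scores, here is a minimal sketch. The helper names `summarize_seeds` and `format_cell` and the example scores are hypothetical. The page does not say whether the sample or the population standard deviation is reported, so the sketch assumes the sample version (n - 1 in the denominator) and exposes the choice as a parameter.

```python
import statistics


def summarize_seeds(scores, sample=True):
    """Return (mean, std) for a list of per-seed scores.

    Assumption: sample standard deviation by default; the leaderboard
    does not state which convention it uses.
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores) if sample else statistics.pstdev(scores)
    return mean, std


def format_cell(scores, digits=1, sample=True):
    """Format per-seed scores as a 'mean ± std' leaderboard cell."""
    mean, std = summarize_seeds(scores, sample=sample)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Hypothetical scores from three seeds (not taken from the leaderboard).
seed_scores = [61.0, 62.0, 63.0]
print(format_cell(seed_scores))
```

With only three seeds the two conventions differ noticeably, so it is worth confirming which one a given submission uses before comparing spreads.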

Meta · Stanford University · École Normale Supérieure · The University of Tokyo

off-the-shelf: web-scale pretrained reference models (not part of the challenge).
topline: the curated COCO-MC upper bound, trained under the same recipe.
reference: alternative training corpora (Ego4D, HowTo), included for comparison.
baseline: the official BabyView challenge baselines.


Cross-modal language grounding — Machine-DevBench

Corpus-grounded probe of lexical (noun + adjective recognition) and grammatical (eight sentence-level constructions) competence over ~3,700 trials sampled from each model's own training vocabulary.

# | Model | Architecture | Training Data | Lexical Agg ↑ | Grammatical Agg ↑ | Overall Agg ↑

Unimodal vision tasks

Object recognition aggregate (ImageNet-1k kNN / linear / ABX, MNIST linear / ABX, COCO-Stuff segmentation) and Visual properties aggregate (NYU Depth v2, CountBench linear / ABX) on the trained vision encoder. FT denotes the cross-modal finetuning objective applied to a frozen DINOv2 ViT-B/14 backbone.

Training Data | FT | Object recogn. Agg ↑ | Visual props. Agg ↑ | Overall Agg ↑
Chance | - | 1.9 | 3.7 | 2.8
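The Chance row suggests that the Overall aggregate is the unweighted mean of the two sub-aggregates: (1.9 + 3.7) / 2 = 2.8. The page does not state the formula, so the sketch below treats it as an assumption and checks it only against the Chance values shown here. The function name `overall_aggregate` is hypothetical.

```python
def overall_aggregate(sub_aggregates):
    """Unweighted mean of sub-aggregate scores.

    Assumption: this matches how the Overall Agg column is derived;
    it is consistent with the Chance row but not confirmed by the page.
    """
    if not sub_aggregates:
        raise ValueError("need at least one sub-aggregate")
    return sum(sub_aggregates) / len(sub_aggregates)


# Chance-level values from the unimodal vision table.
object_recognition = 1.9
visual_properties = 3.7
print(round(overall_aggregate([object_recognition, visual_properties]), 1))
```

If the sub-aggregates were instead weighted by the number of underlying tasks, the result would differ, so the check above is a consistency test rather than proof of the formula.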

Unimodal language tasks

Syntax aggregate (Zorro grammatical acceptability + LongTail-Swap InflectionSwap / AgreementSwap variants) and Semantics aggregate (LongTail-Swap WordSwap + Visual Property Swap) on the trained text encoder. FT denotes the cross-modal finetuning objective applied to the encoder.

Training Data | Encoder | FT | Syntax Agg ↑ | Semantics Agg ↑ | Overall Agg ↑
Chance | - | - | 50.0 | 50.0 | 50.0