Leaderboard

off-the-shelf rows are web-scale pretrained references (not part of the challenge).
topline marks the curated COCO-MC upper bound, trained under the same recipe.
reference rows are alternative training corpora (Ego4D, HowTo) included for comparison.
baseline rows are the official BabyView challenge baselines.

Scores are mean ± std across 3 seeds unless noted.
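
As a rough illustration of the "mean ± std across 3 seeds" convention, the sketch below formats a set of per-seed scores. The function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption: the leaderboard does not say whether sample or population std is reported (`statistics.pstdev` would give the latter).

```python
from statistics import mean, stdev


def summarize(seed_scores):
    """Return (mean, std) over per-seed scores.

    Assumption: std is the sample standard deviation (n - 1 denominator).
    A single score has no spread, so std is reported as 0.0 in that case.
    """
    m = mean(seed_scores)
    s = stdev(seed_scores) if len(seed_scores) > 1 else 0.0
    return m, s


def format_score(seed_scores, digits=1):
    """Format scores as 'mean ± std' with a fixed number of decimals."""
    m, s = summarize(seed_scores)
    return f"{m:.{digits}f} ± {s:.{digits}f}"


# Hypothetical scores from 3 seeds (not taken from the leaderboard).
print(format_score([61.2, 60.8, 62.0]))
```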

Cross-modal language grounding — Machine-DevBench

Corpus-grounded probe of lexical (noun + adjective recognition) and grammatical (eight sentence-level constructions) competence over ~3,700 trials sampled from each model's own training vocabulary.

#	Model	Architecture	Training data	Lexical agg ↑	Grammatical agg ↑	Overall agg ↑

Unimodal vision tasks

Object recognition aggregate (ImageNet-1k kNN / linear / ABX, MNIST linear / ABX, COCO-Stuff segmentation) and Visual properties aggregate (NYU Depth v2, CountBench linear / ABX) on the trained vision encoder. FT denotes the cross-modal finetuning objective applied to a frozen DINOv2 ViT-B/14 backbone.

Training data	FT	Object recogn. agg ↑	Visual props. agg ↑	Overall agg ↑
Chance	—	1.9	3.7	2.8

Unimodal language tasks

Syntax aggregate (Zorro grammatical acceptability + LongTail-Swap InflectionSwap / AgreementSwap variants) and Semantics aggregate (LongTail-Swap WordSwap + Visual Property Swap) on the trained text encoder. FT denotes the cross-modal finetuning objective applied to the encoder.

Training data	Encoder	FT	Syntax agg ↑	Semantics agg ↑	Overall agg ↑
Chance	—	—	50.0	50.0	50.0
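
The Chance rows in the two unimodal tables are consistent with the Overall aggregate being the unweighted mean of the two column aggregates: (1.9 + 3.7) / 2 = 2.8 and (50.0 + 50.0) / 2 = 50.0. This is an inference from those rows, not something the leaderboard states. A minimal sketch under that assumption, with a hypothetical function name:

```python
def overall_aggregate(*aggregates):
    """Unweighted mean of the per-category aggregates.

    Assumption: the leaderboard's Overall agg is a plain average of the
    category aggregates; this matches the Chance rows but is not confirmed.
    """
    if not aggregates:
        raise ValueError("at least one aggregate is required")
    return sum(aggregates) / len(aggregates)


# Chance rows from the unimodal vision and language tables.
vision_chance = overall_aggregate(1.9, 3.7)
language_chance = overall_aggregate(50.0, 50.0)
print(round(vision_chance, 1), round(language_chance, 1))
```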