off-the-shelf rows are web-scale pretrained references (not part of the challenge).
topline marks the curated COCO-MC upper bound trained under the same recipe.
reference rows are alternative training corpora (Ego4D, HowTo) included for comparison.
baseline rows are the official BabyView challenge baselines.
Cross-modal language grounding — Machine-DevBench
Corpus-grounded probe of lexical (noun + adjective recognition) and grammatical (eight sentence-level constructions) competence over ~3,700 trials sampled from each model's own training vocabulary.
| # | Model | Architecture | Training Data | Lexical Agg ↑ | Grammatical Agg ↑ | Overall Agg ↑ |
|---|---|---|---|---|---|---|
Unimodal vision tasks
Object recognition aggregate (ImageNet-1k kNN / linear / ABX, MNIST linear / ABX, COCO-Stuff segmentation) and Visual properties aggregate (NYU Depth v2, CountBench linear / ABX) on the trained vision encoder. FT denotes the cross-modal finetuning objective applied to a frozen DINOv2 ViT-B/14 backbone.
| Training Data | FT | Object recognition Agg ↑ | Visual properties Agg ↑ | Overall Agg ↑ |
|---|---|---|---|---|
| Chance | — | 1.9 | 3.7 | 2.8 |
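
The Chance row above is consistent with Overall Agg being the unweighted mean of the two sub-aggregates, since (1.9 + 3.7) / 2 = 2.8. The page does not state the aggregation formula, so the sketch below is an assumption, and the function name `overall_aggregate` is hypothetical.

```python
def overall_aggregate(sub_aggregates):
    """Return the unweighted mean of sub-aggregate scores, rounded to one decimal.

    Assumption: the leaderboard averages its sub-aggregates with equal weight.
    This is inferred from the Chance row only, not from a documented formula.
    """
    if not sub_aggregates:
        raise ValueError("need at least one sub-aggregate score")
    return round(sum(sub_aggregates) / len(sub_aggregates), 1)


# Chance row of the vision table: object recognition 1.9, visual properties 3.7
vision_chance = overall_aggregate([1.9, 3.7])

# Chance row of the language table: syntax 50.0, semantics 50.0
language_chance = overall_aggregate([50.0, 50.0])
```

Under this assumption both Chance rows reproduce the listed Overall Agg values (2.8 and 50.0). Whether the sub-aggregates themselves are equal-weight means over their component tasks is not recoverable from the page.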
Unimodal language tasks
Syntax aggregate (Zorro grammatical acceptability + LongTail-Swap InflectionSwap / AgreementSwap variants) and Semantics aggregate (LongTail-Swap WordSwap + Visual Property Swap) on the trained text encoder. FT denotes the cross-modal finetuning objective applied to the encoder.
| Training Data | Encoder | FT | Syntax Agg ↑ | Semantics Agg ↑ | Overall Agg ↑ |
|---|---|---|---|---|---|
| Chance | — | — | 50.0 | 50.0 | 50.0 |