What a submission is
A submission is leaderboard scores plus, ideally, the training code that produced them. The reference codebase ships installable CLI entry points, Hydra configs, and four trainers that reproduce the paper baselines — useful as starting points, but the challenge encourages alternative architectures, objectives, and curricula.
Cache the eval datasets
See docs/eval_data.md in the repo.
The download scripts fetch the open-access datasets automatically;
only ImageNet and NYUv2 need a one-time manual preparation step, which is documented inline.
Machine-DevBench ships as a single tarball
on the GitHub release. Grab it with
python -m scripts.eval_data.download_machine_devbench
(or click the Machine-DevBench (Eval) button on the
landing page) and point
MACHINE_DEVBENCH_DATA_ROOT at the extracted
directory.
LT-Swap ships per-corpus pair files for the
four training corpora the paper uses (BabyView, Ego4D, HowTo,
COCO-MC). Grab them with
python -m scripts.eval_data.download_ltswap and
point LTSWAP_DATA_ROOT at the per-corpus subdir
matching your training data. These shipped files are mainly
for reproducing / comparing to the paper — if you trained on
a different snapshot, subset, or preprocessing of one of
these corpora (or on something else entirely), regenerate
the pair files from your own corpus via
apps/swapbench/ so the long-tail vocabulary
matches what your model actually saw.
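Before launching any evaluation, it can help to confirm that both data roots are set and point at real directories. The following is a minimal sketch: the two variable names come from the text above, while the helper `check_data_roots` is a hypothetical name and not part of the repo.

```python
import os
from pathlib import Path

# Variable names are taken from the text above; the check itself is illustrative.
REQUIRED_ROOTS = ("MACHINE_DEVBENCH_DATA_ROOT", "LTSWAP_DATA_ROOT")


def check_data_roots(env=None):
    """Return {variable: problem} for every root that is unset or not a directory."""
    env = os.environ if env is None else env
    problems = {}
    for name in REQUIRED_ROOTS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).is_dir():
            problems[name] = f"not a directory: {value}"
    return problems


if __name__ == "__main__":
    # Prints nothing when both roots are usable.
    for name, problem in check_data_roots().items():
        print(f"{name}: {problem}")
```

An empty result only means the directories exist. It does not verify that the extracted contents are complete.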
Validate the setup
Run evaluation/eval_launcher against an off-the-shelf
model (model=dino, model=clip_image,
model=bert_base) to sanity-check that the three eval
families execute on your machine and reproduce numbers in the
ballpark of the leaderboard's off-the-shelf rows.
Train your model on BabyView
Any architecture, any objective, as long as the training data
comes only from BabyView 2025.1 (no extra image / video / text
/ audio data for pretraining, fine-tuning, or evaluation). The
shipped baselines under apps/baselines/ are
optional starting points.
Tip: for validation and method exploration we recommend training on Ego4D first — it's naturalistic egocentric video too (just not developmentally plausible) and is faster to iterate on. Re-train your final submission on BabyView.
Score and open a PR
Re-run eval_launcher against your model over the
three families, then open a pull request with a JSON file
containing your scores and a link to the training
code so the submission can be reproduced. See
"Submitting your results" below for details.
Submissions are marked verified when we can reproduce the scores from the provided code. Leaderboard-only submissions (a paper / report with numbers but no runnable code) are still welcome and will appear on the board, but won't carry the verified mark.
Alternative architectures plug in by implementing one of the
protocols in core/protocols/feature_extractor.py
(ImageFeatureExtractor,
TextFeatureExtractor,
MultiModalFeatureExtractor) and adding an
evaluation/configs/model/<your_name>.yaml
with _target_ pointing at your class.
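To illustrate the plug-in pattern, here is a toy sketch of a class that satisfies a structural protocol. Everything in it is an assumption made for illustration: the real protocols live in core/protocols/feature_extractor.py and their method names and tensor types may differ, so `ImageFeatureExtractorSketch`, `extract_features`, and `MeanPixelExtractor` are hypothetical stand-ins.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class ImageFeatureExtractorSketch(Protocol):
    """Hypothetical stand-in for the repo's ImageFeatureExtractor protocol."""

    def extract_features(self, images: Sequence[Sequence[float]]) -> List[List[float]]:
        ...


class MeanPixelExtractor:
    """Toy extractor: maps each flattened image to a 1-D feature (its mean pixel value)."""

    def extract_features(self, images):
        return [[sum(img) / len(img)] for img in images]


extractor = MeanPixelExtractor()
# Structural check: no inheritance is needed, only a matching method.
assert isinstance(extractor, ImageFeatureExtractorSketch)
print(extractor.extract_features([[0.0, 1.0], [0.5, 0.5]]))
```

In the model YAML, `_target_` would then hold the importable dotted path of the class (for this toy example, something like a hypothetical `my_package.extractors.MeanPixelExtractor`), which is how Hydra locates the class to instantiate.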
Submitting your results
Submissions reuse the shared evaluation/eval_launcher
so all entries are scored on identical task definitions.
1. Install
The full environment is pinned in pixi.toml (Python
3.12, PyTorch 2.8 + CUDA 12.6).
# install pixi: https://pixi.sh/latest/installation/
pixi install -e dev
2. Run the three evaluation families
Download the eval datasets once (see docs/eval_data.md
in the repo), then point the launcher at your trained model. The commands
below use the off-the-shelf model configs as placeholders:
# Vision
python -m evaluation.eval_launcher \
eval=vision/vision_pipeline \
model=dino \
name=my_run
# Cross-modal grounding (Machine-DevBench, realistic + cartoon)
python -m evaluation.eval_launcher \
eval=multimodal/machine_devbench_pipeline \
model=clip_image \
name=my_run
# Language
python -m evaluation.eval_launcher \
eval=text/text_pipeline \
model=bert_base \
name=my_run
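The three commands above can also be assembled programmatically, which is convenient when the same run name is reused across families. This is a sketch only: the eval and model config names are copied from the commands above, while `build_commands` and the `FAMILIES` mapping are hypothetical helpers, not part of the repo.

```python
import shlex
import sys

# Eval config names are taken from the commands above.
FAMILIES = {
    "vision": "vision/vision_pipeline",
    "multimodal": "multimodal/machine_devbench_pipeline",
    "language": "text/text_pipeline",
}


def build_commands(models, name):
    """Build one eval_launcher command per family.

    `models` maps each family name to the model config to evaluate.
    """
    commands = []
    for family, eval_cfg in FAMILIES.items():
        commands.append([
            sys.executable, "-m", "evaluation.eval_launcher",
            f"eval={eval_cfg}",
            f"model={models[family]}",
            f"name={name}",
        ])
    return commands


if __name__ == "__main__":
    models = {"vision": "dino", "multimodal": "clip_image", "language": "bert_base"}
    for cmd in build_commands(models, "my_run"):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```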
Override model=… to swap encoders, or edit any individual
task YAML under evaluation/configs/eval/. Run at least 3 seeds
per condition and report mean ± std for each subgroup and
for the overall score.
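Per-seed scores can be collapsed into the "mean ± std" strings used in the submission JSON with a few lines of Python. This sketch assumes the sample standard deviation (n − 1 denominator) and one decimal place; the page does not state which convention the organizers use, so check before submitting. `format_mean_std` is a hypothetical helper name.

```python
from statistics import mean, stdev


def format_mean_std(scores, digits=1):
    """Format per-seed scores as 'mean ± std'.

    Assumption: sample standard deviation (statistics.stdev).
    """
    if len(scores) < 3:
        raise ValueError("at least 3 seeds per condition are expected")
    return f"{mean(scores):.{digits}f} ± {stdev(scores):.{digits}f}"


# Three seeds for one subgroup.
print(format_mean_std([54.0, 55.0, 56.0]))  # 55.0 ± 1.0
```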
3. Open a pull request
Fork this repository and add a new file at
submissions/<your_run_id>.json (one JSON file per
training run). Then append the filename to the
"submissions" list in
submissions/index.json so the leaderboard picks it up.
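Registering the file in the index can be done by hand or with a small script. The sketch below assumes that index.json is a JSON object whose "submissions" key holds a list of bare filenames, as the text suggests; the exact layout of the real file may differ, and `register_submission` is a hypothetical helper.

```python
import json
from pathlib import Path


def register_submission(index_path, filename):
    """Append `filename` to the "submissions" list in the index, skipping duplicates."""
    index_path = Path(index_path)
    index = json.loads(index_path.read_text(encoding="utf-8"))
    entries = index.setdefault("submissions", [])
    if filename not in entries:
        entries.append(filename)
    index_path.write_text(json.dumps(index, indent=2) + "\n", encoding="utf-8")
    return entries


if __name__ == "__main__":
    register_submission("submissions/index.json", "babyview_my_method.json")
```

Note that rewriting the file with `json.dumps` normalizes its indentation, which may produce a larger diff than a manual edit.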
Each submission JSON has the following shape:
{
"id": "babyview_my_method",
"model": "MyModel",
"team": "My Lab",
"architecture": "contrastive",
"training_data": "BabyView",
"category": "baseline",
"n_seeds": 3,
"multimodal": {
"lexical": "55.1 ± 0.6",
"grammatical": "54.9 ± 1.2",
"overall": "55.0 ± 0.9"
},
"vision": [
{ "ft": "MyMethod",
"object_recognition": "48.0 ± 0.3",
"visual_properties": "35.1 ± 0.4",
"overall": "41.5 ± 0.3" }
],
"language": [
{ "encoder": "BERT", "ft": "MyMethod",
"syntax": "73.0 ± 0.4",
"semantics": "72.1 ± 0.3",
"overall": "72.5 ± 0.3" }
]
}
- id: unique slug, must match the filename (without .json).
- model, team, architecture, training_data, n_seeds: descriptive metadata.
- category: one of baseline / topline / reference / offshelf; controls the row pill and styling.
- multimodal (object): mean ± std for lexical, grammatical, overall on Machine-DevBench. Omit if not run.
- vision (array): one entry per FT condition; each with ft, object_recognition, visual_properties, overall. Omit if not run.
- language (array): one entry per (encoder, ft) condition; each with syntax, semantics, overall. Omit if not run.
- Optional: report_url, code_url.
Organizers will verify the submission with the same launcher and update the leaderboard.
Eligibility: for the official challenge track, no external image, video, text, or audio data may be used at any stage — including encoder pretraining. Reference rows trained on Ego4D / HowTo / COCO are listed for context only.
See the full results on the leaderboard page →