How to submit

What a submission is

A submission is leaderboard scores plus, ideally, the training code that produced them. The reference codebase ships installable CLI entry points, Hydra configs and four trainers that reproduce the paper baselines — useful as starting points, but the challenge encourages alternative architectures, objectives and curricula.

Cache the eval datasets

See docs/eval_data.md in the repo. The download scripts fetch the open-access datasets automatically; only ImageNet and NYUv2 need a one-time manual preparation step, which is documented inline.

Machine-DevBench ships as a single tarball on the GitHub release. Grab it with python -m scripts.eval_data.download_machine_devbench (or click the Machine-DevBench (eval) button on the landing page) and point MACHINE_DEVBENCH_DATA_ROOT at the extracted directory.

LT-Swap ships per-corpus pair files for the four training corpora the paper uses (BabyView, Ego4D, HowTo, COCO-MC). Download them with python -m scripts.eval_data.download_ltswap and point LTSWAP_DATA_ROOT at the per-corpus subdirectory matching your training data. The shipped files are mainly for reproducing or comparing against the paper. If you trained on a different snapshot, subset or preprocessing of one of these corpora (or on something else entirely), regenerate the pair files from your own corpus via apps/swapbench/ so that the long-tail vocabulary matches what your model actually saw.
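Before launching an evaluation, it can help to confirm that both data roots are set. The following is a minimal sketch, not part of the reference codebase: the two environment variable names come from the text above, while the helper name check_data_roots is made up for illustration. It only checks that each variable points at an existing directory, not that the contents are complete.

```python
import os
from pathlib import Path

# Environment variable names taken from this guide.
REQUIRED_ROOTS = ("MACHINE_DEVBENCH_DATA_ROOT", "LTSWAP_DATA_ROOT")


def check_data_roots(env=None):
    """Return {variable: problem} for every root that is unset or not a directory.

    Hypothetical helper: it does not inspect what is inside the directories.
    """
    env = os.environ if env is None else env
    problems = {}
    for name in REQUIRED_ROOTS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).is_dir():
            problems[name] = f"not a directory: {value}"
    return problems


if __name__ == "__main__":
    issues = check_data_roots()
    for name, problem in issues.items():
        print(f"{name}: {problem}")
    if not issues:
        print("Both data roots point at existing directories.")
```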

Validate the setup

Run evaluation/eval_launcher against an off-the-shelf model (model=dino, model=clip_image, model=bert_base) to sanity-check that the three eval families execute on your machine and reproduce numbers in the ballpark of the leaderboard's off-the-shelf rows.
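As a rough sketch, the three sanity-check runs can be assembled programmatically. The eval config names and the eval=/model=/name= overrides are the ones used in the commands later in this guide; the run name sanity_check and the helper build_commands are illustrative only. The script prints the commands rather than running them.

```python
import shlex
import sys

# (eval config, off-the-shelf model config) pairs, as used in this guide.
SANITY_CHECKS = [
    ("vision/vision_pipeline", "dino"),
    ("multimodal/machine_devbench_pipeline", "clip_image"),
    ("text/text_pipeline", "bert_base"),
]


def build_commands(run_name, python=sys.executable):
    """Return one eval_launcher argument list per evaluation family."""
    return [
        [python, "-m", "evaluation.eval_launcher",
         f"eval={eval_cfg}", f"model={model_cfg}", f"name={run_name}"]
        for eval_cfg, model_cfg in SANITY_CHECKS
    ]


if __name__ == "__main__":
    for argv in build_commands("sanity_check"):
        # Print only; to execute from the repo root, pass argv to
        # subprocess.run(argv, check=True) instead.
        print(shlex.join(argv))
```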

Train your model on BabyView

Any architecture, any objective, as long as the training data comes only from BabyView 2025.1 (no extra image / video / text / audio data for pretraining, fine-tuning or evaluation). The shipped baselines under apps/baselines/ are optional starting points.

Tip: for validation and method exploration we recommend training on Ego4D first — it's naturalistic egocentric video too (just not developmentally plausible) and is faster to iterate on. Re-train your final submission on BabyView.

Score and open a PR

Re-run eval_launcher against your model over the three families, then open a pull request with a JSON file containing your scores and a link to the training code so the submission can be reproduced. See "Submitting your results" below for the details.

Submissions linking to runnable training code are marked reproducible. Organizers don't rerun the code — the mark just signals anyone could. Leaderboard-only submissions (paper or report without runnable code) are welcome but unmarked.

Alternative architectures plug in by implementing one of the protocols in core/protocols/feature_extractor.py (ImageFeatureExtractor, TextFeatureExtractor, MultiModalFeatureExtractor) and adding an evaluation/configs/model/<your_name>.yaml with _target_ pointing at your class.
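To illustrate the plug-in pattern, here is a toy sketch of a class that satisfies a structural protocol. Everything in it is an assumption: the real ImageFeatureExtractor in core/protocols/feature_extractor.py may declare different method names and tensor types, so the local stand-in protocol, the extract_features method and the MeanPixelExtractor class are hypothetical and exist only to show the shape of the idea.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ImageFeatureExtractor(Protocol):
    """Local stand-in for illustration; NOT the repo's actual protocol.

    The real definition lives in core/protocols/feature_extractor.py and
    may require different methods and argument types.
    """

    def extract_features(self, images: Sequence[Sequence[float]]) -> list:
        ...


class MeanPixelExtractor:
    """Toy extractor: maps each image (a flat list of pixels) to a 1-d feature."""

    def extract_features(self, images):
        return [[sum(img) / len(img)] for img in images]


extractor = MeanPixelExtractor()
# runtime_checkable protocols only verify that the method exists.
assert isinstance(extractor, ImageFeatureExtractor)
features = extractor.extract_features([[0.0, 1.0], [0.5, 0.5, 0.5]])

# A matching evaluation/configs/model/<your_name>.yaml would then carry a
# _target_ entry pointing at the class, e.g. (hypothetical module path):
#   _target_: my_package.extractors.MeanPixelExtractor
```

Because the check is structural, the class does not need to inherit from the protocol; it only needs to provide the methods the protocol declares.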

Submitting your results

Submissions reuse the shared evaluation/eval_launcher so all entries are scored on identical task definitions.

1. Install

The full environment is pinned in pixi.toml (Python 3.12, PyTorch 2.8 + CUDA 12.6).

# install pixi: https://pixi.sh/latest/installation/
pixi install -e dev

2. Run the three evaluation families

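Download the eval datasets once (see docs/eval_data.md in the repo), then point the launcher at your trained model. The commands below use the off-the-shelf configs dino, clip_image and bert_base as placeholders for model=: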

# Vision
python -m evaluation.eval_launcher \
    eval=vision/vision_pipeline \
    model=dino \
    name=my_run

# Cross-modal grounding (Machine-DevBench, realistic + cartoon)
python -m evaluation.eval_launcher \
    eval=multimodal/machine_devbench_pipeline \
    model=clip_image \
    name=my_run

# Language
python -m evaluation.eval_launcher \
    eval=text/text_pipeline \
    model=bert_base \
    name=my_run

Override model=… to swap encoders, or override any individual task YAML in evaluation/configs/eval/. Run at least 3 seeds per condition and report mean ± std for each subgroup and for the overall score.
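The mean ± std strings can be produced with a few lines of Python. This is a minimal sketch under two assumptions that this guide does not specify: it uses the sample standard deviation (statistics.stdev) and rounds to one decimal place, matching the look of the example scores in the submission JSON. The helper name format_mean_std is illustrative.

```python
from statistics import mean, stdev


def format_mean_std(scores, digits=1):
    """Format per-seed scores as 'mean ± std'.

    Assumption: sample standard deviation (n - 1 in the denominator);
    swap in statistics.pstdev if the population version is wanted.
    """
    if len(scores) < 3:
        raise ValueError("this guide asks for at least 3 seeds per condition")
    return f"{mean(scores):.{digits}f} ± {stdev(scores):.{digits}f}"


if __name__ == "__main__":
    # Three made-up per-seed scores, for illustration only.
    print(format_mean_std([54.0, 55.0, 56.0]))  # prints "55.0 ± 1.0"
```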

3. Open a pull request

Fork this repository and add a new file at submissions/<your_run_id>.json (one JSON file per training run). Then append the filename to the "submissions" list in submissions/index.json so the leaderboard picks it up. Each submission JSON has the following shape:

{
  "id": "babyview_my_method",
  "model": "MyModel",
  "team": "My Lab",
  "architecture": "contrastive",
  "training_data": "BabyView",
  "category": "baseline",
  "n_seeds": 3,
  "multimodal": {
    "lexical": "55.1 ± 0.6",
    "grammatical": "54.9 ± 1.2",
    "overall": "55.0 ± 0.9"
  },
  "vision": [
    { "ft": "MyMethod",
      "object_recognition": "48.0 ± 0.3",
      "visual_properties": "35.1 ± 0.4",
      "overall": "41.5 ± 0.3" }
  ],
  "language": [
    { "encoder": "BERT", "ft": "MyMethod",
      "syntax": "73.0 ± 0.4",
      "semantics": "72.1 ± 0.3",
      "overall": "72.5 ± 0.3" }
  ]
}

id — unique slug, must match the filename (without .json).
model, team, architecture, training_data, n_seeds — descriptive metadata.
category — one of baseline / topline / reference / offshelf; controls the row pill and styling.
multimodal (object) — mean ± std for lexical, grammatical, overall on Machine-DevBench. Omit if not run.
vision (array) — one entry per FT condition; each with ft, object_recognition, visual_properties, overall. Omit if not run.
language (array) — one entry per (encoder, ft) condition; each with syntax, semantics, overall. Omit if not run.
Optional: report_url, code_url.
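The field rules above can be checked locally before opening the pull request. The sketch below is not an official validator. It encodes only what the field list states (id matches the filename, category is one of the four values, each score block carries its listed keys) and additionally assumes that all descriptive metadata fields are required, which the guide does not say explicitly.

```python
import json
from pathlib import Path

ALLOWED_CATEGORIES = {"baseline", "topline", "reference", "offshelf"}
# Assumption: every metadata field is treated as required here.
METADATA_KEYS = ("id", "model", "team", "architecture",
                 "training_data", "category", "n_seeds")
MULTIMODAL_KEYS = ("lexical", "grammatical", "overall")
VISION_KEYS = ("ft", "object_recognition", "visual_properties", "overall")
LANGUAGE_KEYS = ("encoder", "ft", "syntax", "semantics", "overall")


def validate_submission(path):
    """Return a list of problems found; an empty list means none were found."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    problems = [f"missing field: {k}" for k in METADATA_KEYS if k not in data]
    if data.get("id") != path.stem:
        problems.append(f"id {data.get('id')!r} does not match filename {path.name!r}")
    if "category" in data and data["category"] not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {data['category']!r}")
    # multimodal is a single object; vision and language are arrays.
    sections = {
        "multimodal": ([data["multimodal"]] if "multimodal" in data else [], MULTIMODAL_KEYS),
        "vision": (data.get("vision", []), VISION_KEYS),
        "language": (data.get("language", []), LANGUAGE_KEYS),
    }
    for name, (entries, keys) in sections.items():
        for i, entry in enumerate(entries):
            for key in keys:
                if key not in entry:
                    problems.append(f"{name}[{i}] is missing {key}")
    return problems
```

Calling validate_submission("submissions/babyview_my_method.json") on a file shaped like the example above should return an empty list; omitted score blocks are skipped rather than reported.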

Organizers will review the submission and update the leaderboard.
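Before opening the pull request, the index update from step 3 can also be scripted. This is a sketch under assumptions: it presumes submissions/index.json is a JSON object whose "submissions" key holds a list of filenames, as the instructions describe. The rewritten file uses two-space indentation, which may differ from the repository's formatting. The helper name register_submission is illustrative.

```python
import json
from pathlib import Path


def register_submission(index_path, filename):
    """Append filename to the "submissions" list if it is not already present.

    Assumes index.json is a JSON object; other keys are preserved as-is.
    """
    index_path = Path(index_path)
    index = json.loads(index_path.read_text(encoding="utf-8"))
    submissions = index.setdefault("submissions", [])
    if filename not in submissions:
        submissions.append(filename)
        index_path.write_text(
            json.dumps(index, indent=2, ensure_ascii=False) + "\n",
            encoding="utf-8",
        )
    return submissions
```

For example, register_submission("submissions/index.json", "babyview_my_method.json") adds the entry once and leaves the file untouched on repeated calls.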

Eligibility: for the official challenge track, no external image, video, text or audio data may be used at any stage — including encoder pretraining. Reference rows trained on Ego4D / HowTo / COCO are listed for context only.

See the full results on the leaderboard page →