Run, score, and submit

Cache the eval datasets, validate the setup with an off-the-shelf model, score your own model, then open a pull request with your numbers and training code. Verified submissions ship with code we can reproduce.

Meta, Stanford University, École Normale Supérieure, The University of Tokyo

What a submission is

A submission consists of leaderboard scores plus, ideally, the training code that produced them. The reference codebase ships installable CLI entry points, Hydra configs, and four trainers that reproduce the paper baselines. These are useful starting points, but the challenge encourages alternative architectures, objectives, and curricula.

1. Cache the eval datasets

See docs/eval_data.md in the repo. The download scripts fetch the open-access datasets automatically; only ImageNet and NYUv2 need a one-time manual preparation step, documented inline.

Machine-DevBench ships as a single tarball on the GitHub release. Grab it with python -m scripts.eval_data.download_machine_devbench (or click the Machine-DevBench (Eval) button on the landing page) and point MACHINE_DEVBENCH_DATA_ROOT at the extracted directory.

LT-Swap ships per-corpus pair files for the four training corpora the paper uses (BabyView, Ego4D, HowTo, COCO-MC). Grab them with python -m scripts.eval_data.download_ltswap and point LTSWAP_DATA_ROOT at the per-corpus subdirectory matching your training data. The shipped files are mainly for reproducing or comparing against the paper. If you trained on a different snapshot, subset, or preprocessing of one of these corpora (or on something else entirely), regenerate the pair files from your own corpus via apps/swapbench/ so the long-tail vocabulary matches what your model actually saw.
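Before launching an evaluation, it can help to confirm that both data-root variables are set and point at real directories. The sketch below is illustrative only: the variable names come from the text above, while the helper check_data_roots is hypothetical and not part of the reference codebase.

```python
import os
from pathlib import Path

# Variable names are taken from the setup instructions; the helper itself
# is a hypothetical convenience, not something the repo ships.
REQUIRED_ROOTS = ("MACHINE_DEVBENCH_DATA_ROOT", "LTSWAP_DATA_ROOT")


def check_data_roots(env=None):
    """Return a list of problems; an empty list means both roots look usable."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_ROOTS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points at a missing directory: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_data_roots():
        print("WARNING:", problem)
```

The check only verifies that the directories exist; it does not validate their contents against what the launcher expects.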

2. Validate the setup

Run evaluation/eval_launcher against an off-the-shelf model (model=dino, model=clip_image, model=bert_base) to sanity-check that the three eval families execute on your machine and reproduce numbers in the ballpark of the leaderboard's off-the-shelf rows.

3. Train your model on BabyView

Any architecture, any objective, as long as the training data comes only from BabyView 2025.1 (no extra image / video / text / audio data for pretraining, fine-tuning, or evaluation). The shipped baselines under apps/baselines/ are optional starting points.

Tip: for validation and method exploration we recommend training on Ego4D first — it's naturalistic egocentric video too (just not developmentally plausible) and is faster to iterate on. Re-train your final submission on BabyView.

4. Score and open a PR

Re-run eval_launcher against your model over the three families, then open a pull request with a JSON file containing your scores and a link to the training code so the submission can be reproduced. See "Submitting your results" below for the details.

Submissions are marked verified when we can reproduce the scores from the provided code. Leaderboard-only submissions (a paper / report with numbers but no runnable code) are still welcome and will appear on the board, but won't carry the verified mark.

Alternative architectures plug in by implementing one of the protocols in core/protocols/feature_extractor.py (ImageFeatureExtractor, TextFeatureExtractor, MultiModalFeatureExtractor) and adding an evaluation/configs/model/<your_name>.yaml with _target_ pointing at your class.
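As a rough illustration of what plugging in might look like, the sketch below defines a stand-in protocol and a toy class that satisfies it. Everything here is an assumption for illustration: the real ImageFeatureExtractor in core/protocols/feature_extractor.py may require different method names and tensor types, and the names ImageFeatureExtractorLike, MeanPoolExtractor, extract_features, and my_package are hypothetical.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ImageFeatureExtractorLike(Protocol):
    """Stand-in for the repo's ImageFeatureExtractor (method name is assumed)."""

    def extract_features(self, images: Sequence[Sequence[float]]) -> list[list[float]]: ...


class MeanPoolExtractor:
    """Toy extractor: maps each flattened image to a 1-d feature (its scaled mean)."""

    def __init__(self, scale: float = 1.0):
        self.scale = scale

    def extract_features(self, images):
        return [[self.scale * sum(img) / len(img)] for img in images]


# A matching model config could look like this (hypothetical file
# evaluation/configs/model/mean_pool.yaml, hypothetical package path):
#
#   _target_: my_package.extractors.MeanPoolExtractor
#   scale: 2.0
#
# Hydra resolves `_target_` to the class and passes the other keys as kwargs.
model = MeanPoolExtractor(scale=2.0)
print(isinstance(model, ImageFeatureExtractorLike))  # structural check only
```

Note that a runtime_checkable protocol check only confirms the method exists; it does not verify signatures or return shapes, so check the actual protocol definitions in the repo before relying on this.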

Submitting your results

Submissions reuse the shared evaluation/eval_launcher so all entries are scored on identical task definitions.

1. Install

The full environment is pinned in pixi.toml (Python 3.12, PyTorch 2.8 + CUDA 12.6).

# install pixi: https://pixi.sh/latest/installation/
pixi install -e dev

2. Run the three evaluation families

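Download the eval datasets once (see docs/eval_data.md in the repo), then run the launcher once per family. The commands below use the off-the-shelf models; set model=… to your own model config to score your trained model: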

# Vision
python -m evaluation.eval_launcher \
    eval=vision/vision_pipeline \
    model=dino \
    name=my_run

# Cross-modal grounding (Machine-DevBench, realistic + cartoon)
python -m evaluation.eval_launcher \
    eval=multimodal/machine_devbench_pipeline \
    model=clip_image \
    name=my_run

# Language
python -m evaluation.eval_launcher \
    eval=text/text_pipeline \
    model=bert_base \
    name=my_run

Override model=… to swap encoders; any individual task YAML in evaluation/configs/eval/ can be overridden the same way. Run at least 3 seeds per condition and report mean ± std for each subgroup and for the overall score.
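The mean ± std strings can be produced from per-seed scores with a few lines of Python. This is a minimal sketch under stated assumptions: the helper summarize is hypothetical, it uses the sample standard deviation (n - 1 denominator), and it rounds to one decimal place to match the example scores shown on this page. The source does not say which standard deviation convention the organizers use.

```python
from statistics import mean, stdev


def summarize(scores, digits=1):
    """Format per-seed scores as 'mean ± std'."""
    if len(scores) < 3:
        raise ValueError("the challenge asks for at least 3 seeds per condition")
    # Assumption: sample standard deviation (n - 1 denominator).
    return f"{mean(scores):.{digits}f} ± {stdev(scores):.{digits}f}"


# Three hypothetical seed scores for one subgroup.
print(summarize([54.4, 55.0, 55.6]))  # 55.0 ± 0.6
```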

3. Open a pull request

Fork this repository and add a new file at submissions/<your_run_id>.json (one JSON file per training run). Then append the filename to the "submissions" list in submissions/index.json so the leaderboard picks it up. Each submission JSON has the following shape:

{
  "id": "babyview_my_method",
  "model": "MyModel",
  "team": "My Lab",
  "architecture": "contrastive",
  "training_data": "BabyView",
  "category": "baseline",
  "n_seeds": 3,
  "multimodal": {
    "lexical": "55.1 ± 0.6",
    "grammatical": "54.9 ± 1.2",
    "overall": "55.0 ± 0.9"
  },
  "vision": [
    { "ft": "MyMethod",
      "object_recognition": "48.0 ± 0.3",
      "visual_properties": "35.1 ± 0.4",
      "overall": "41.5 ± 0.3" }
  ],
  "language": [
    { "encoder": "BERT", "ft": "MyMethod",
      "syntax": "73.0 ± 0.4",
      "semantics": "72.1 ± 0.3",
      "overall": "72.5 ± 0.3" }
  ]
}
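Writing the file and updating the index can be scripted. The sketch below is one possible way to do it, not a tool the repo provides: register_submission is a hypothetical helper, and it assumes that the run id used for the filename equals the "id" field and that index.json is a JSON object holding a "submissions" list of filenames.

```python
import json
from pathlib import Path


def register_submission(submissions_dir, payload):
    """Write <id>.json and list its filename in index.json (once)."""
    submissions_dir = Path(submissions_dir)
    filename = f"{payload['id']}.json"  # assumption: run id == "id" field
    (submissions_dir / filename).write_text(
        json.dumps(payload, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )

    index_path = submissions_dir / "index.json"
    # Assumption: index.json looks like {"submissions": ["a.json", ...]}.
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {}
    entries = index.setdefault("submissions", [])
    if filename not in entries:
        entries.append(filename)
    index_path.write_text(
        json.dumps(index, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return filename
```

A call such as register_submission("submissions", payload), with payload shaped like the JSON above, would create submissions/babyview_my_method.json and add that filename to the index. ensure_ascii=False keeps the ± character readable in the written file.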

Organizers will verify the submission with the same launcher and update the leaderboard.

Eligibility: for the official challenge track, no external image, video, text, or audio data may be used at any stage — including encoder pretraining. Reference rows trained on Ego4D / HowTo / COCO are listed for context only.

See the full results on the leaderboard page →