Eval Factsheet

Create and explore evaluation benchmarks

Basic Information

The official name of your evaluation or benchmark (e.g., "MMLU", "HumanEval")
A concise description of what your evaluation measures (1-2 sentences)
The authors or organization that created this evaluation
URL to the repository containing your evaluation implementation
URL to the paper or documentation describing your evaluation
Publication or release date of the evaluation
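
Taken together, these fields form a small structured record. Below is a minimal sketch in Python, filled in with the MMLU example named above; the class and field names are hypothetical, not an official factsheet schema, and the sample values are illustrative.

```python
from dataclasses import dataclass

# Hypothetical record for the Basic Information section; the class and
# field names are illustrative, not an official factsheet schema.
@dataclass
class BasicInformation:
    name: str          # official name, e.g. "MMLU"
    description: str   # 1-2 sentence summary of what the evaluation measures
    authors: str       # who made the evaluation
    repo_url: str      # URL of the implementation repository
    paper_url: str     # URL of the paper or documentation
    release_date: str  # publication or release date

basic = BasicInformation(
    name="MMLU",
    description=(
        "Multiple-choice questions spanning 57 subjects that test "
        "broad world knowledge and problem solving."
    ),
    authors="Hendrycks et al.",
    repo_url="https://github.com/hendrycks/test",
    paper_url="https://arxiv.org/abs/2009.03300",
    release_date="2020-09-07",
)
```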

What Does It Evaluate?

Define the purpose and scope of your evaluation.

Select the primary intended use case(s) for this evaluation.
Name the specific capabilities your evaluation targets
Select the high-level properties your evaluation assesses.
Select the type(s) of input data your evaluation uses.
Select the type(s) of output expected from models.
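
Since the scope fields are multi-select, they map naturally to lists. A minimal sketch, continuing the MMLU example; all field names and category labels here are hypothetical rather than a fixed vocabulary:

```python
# Hypothetical scope record; field names and category labels are
# illustrative, not a fixed vocabulary.
scope = {
    "use_cases": ["model comparison", "capability tracking"],
    "capabilities": ["multi-subject knowledge", "multiple-choice reasoning"],
    "properties": ["knowledge", "reasoning"],
    "input_types": ["text"],
    "output_types": ["text"],
}
```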

How Is It Built?

Describe the data sources, size, and structure of your evaluation.

Specify where the input data for your evaluation comes from
Specify how the ground truth or reference outputs were created
Specify the number of evaluation samples in your benchmark
Specify the data splits used for the evaluation (e.g., fine-tuning set, validation, test)
Specify whether the evaluation uses a fixed dataset or adapts based on model responses
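
These fields describe the dataset itself. A minimal sketch with hypothetical field names; the counts loosely follow the MMLU example and are only meant to show the expected shape of the values:

```python
# Hypothetical dataset record; field names are illustrative, and the
# counts only show the expected shape of each value.
dataset = {
    "input_source": "exam questions collected from publicly available tests",
    "ground_truth": "answer keys published with the original exams",
    "num_samples": 14042,                           # evaluation samples
    "splits": {"dev": 285, "validation": 1531, "test": 14042},
    "adaptivity": "fixed",  # fixed dataset vs. adaptive to model responses
}
```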

How Does It Work?

Explain the methodology and execution of your evaluation.

Select who or what evaluates the model outputs
Describe the step-by-step procedure for evaluation, one step per line
Specify what level of model access is required to run this evaluation
Check if your evaluation includes a private held-out test set that is not publicly available
Provide details about your private held-out test set
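
Because the procedure is entered one step per line, it can be stored as an ordered list of steps. A minimal sketch, again with hypothetical field names:

```python
# Hypothetical methodology record; field names are illustrative.
# The procedure field is entered one step per line, so it splits
# cleanly into an ordered list of steps.
procedure_text = """\
Format each question together with its answer choices
Prompt the model and collect the letter it selects
Compare the selected letter against the answer key
Report mean accuracy over all questions"""

methodology = {
    "judge": "exact match against the answer key",  # who/what scores outputs
    "procedure": [s.strip() for s in procedure_text.splitlines() if s.strip()],
    "model_access": "text completion only (no logits or weights required)",
    "has_private_test_set": False,
    "private_test_set_details": None,  # only needed when the flag above is True
}
```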

Alignment

Document the quality assurance measures and known limitations of your evaluation.

Describe validation steps taken to ensure construct validity (e.g., expert review, pilot studies, correlation with known measures)
Select a score indicating whether your evaluation meets some or all of the conditions of the construct validity checklist
List baselines with their performance scores, one per line
Select any robustness testing measures you've applied.
Document any known weaknesses or sensitivities in your evaluation
List related benchmarks and explain how your evaluation differs or improves upon them
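
Baselines listed one per line can be parsed into name/score pairs. A minimal sketch with hypothetical field names, assuming a simple "name: score" line format that is not prescribed by the factsheet itself; the sample figures follow the numbers reported in the MMLU paper:

```python
# Hypothetical quality-assurance record. Baselines are listed one per
# line as "name: score"; the line format and field names are illustrative.
baselines_text = """\
random guessing: 25.0
GPT-3 (few-shot): 43.9"""

def parse_baselines(text: str) -> dict[str, float]:
    """Split 'name: score' lines into a mapping from baseline to score."""
    pairs = (line.rsplit(":", 1) for line in text.splitlines() if line.strip())
    return {name.strip(): float(score) for name, score in pairs}

alignment = {
    "construct_validity": "expert review plus a small pilot study",
    "checklist_score": 3,              # against the construct validity checklist
    "baselines": parse_baselines(baselines_text),
    "robustness_measures": ["prompt paraphrasing", "answer-order shuffling"],
    "known_limitations": "scores are sensitive to answer-choice ordering",
    "related_benchmarks": "broader subject coverage than ARC or OpenBookQA",
}
```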