Eval Factsheet

Create and explore evaluation benchmarks

Basic Information

The official name of your evaluation or benchmark (e.g., "MMLU", "HumanEval")
A concise description of what your evaluation measures (1-2 sentences)
The authors or organization that created this evaluation
URL to the repository containing your evaluation implementation
URL to the paper or documentation describing your evaluation
Publication or release date of the evaluation
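
Taken together, these fields form a small structured record. Below is a minimal sketch in Python, filled in with the MMLU example named above; the class and field names are hypothetical, not an official factsheet schema, and the sample values are illustrative.

```python
from dataclasses import dataclass

# Hypothetical record for the Basic Information section; the class and
# field names are illustrative, not an official factsheet schema.
@dataclass
class BasicInformation:
    name: str          # official name, e.g. "MMLU"
    description: str   # 1-2 sentence summary of what the evaluation measures
    authors: str       # who made the evaluation
    repo_url: str      # URL of the implementation repository
    paper_url: str     # URL of the paper or documentation
    release_date: str  # publication or release date

basic = BasicInformation(
    name="MMLU",
    description=(
        "Multiple-choice questions spanning 57 subjects that test "
        "broad world knowledge and problem solving."
    ),
    authors="Hendrycks et al.",
    repo_url="https://github.com/hendrycks/test",
    paper_url="https://arxiv.org/abs/2009.03300",
    release_date="2020-09-07",
)
```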

What Does It Evaluate?

Define the purpose and scope of your evaluation.

Select the primary intended use case(s) for this evaluation.
Name the specific capabilities your evaluation targets
Select the high-level properties your evaluation assesses.
Select the type(s) of input data your evaluation uses.
Select the type(s) of output expected from models.
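
Since the scope fields are multi-select, they map naturally to lists. A minimal sketch, continuing the MMLU example; all field names and category labels here are hypothetical rather than a fixed vocabulary:

```python
# Hypothetical scope record; field names and category labels are
# illustrative, not a fixed vocabulary.
scope = {
    "use_cases": ["model comparison", "capability tracking"],
    "capabilities": ["multi-subject knowledge", "multiple-choice reasoning"],
    "properties": ["knowledge", "reasoning"],
    "input_types": ["text"],
    "output_types": ["text"],
}
```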

How Is It Built?

Describe the data sources, size, and structure of your evaluation.

Specify where the input data for your evaluation comes from
Specify how the ground truth or reference outputs were created
Specify the number of evaluation samples in your benchmark
Specify the data splits used for the evaluation (e.g., fine-tuning set, validation, test)
Specify whether the evaluation uses a fixed dataset or adapts based on model responses
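
These fields describe the dataset itself. A minimal sketch with hypothetical field names; the counts loosely follow the MMLU example and are only meant to show the expected shape of the values:

```python
# Hypothetical dataset record; field names are illustrative, and the
# counts only show the expected shape of each value.
dataset = {
    "input_source": "exam questions collected from publicly available tests",
    "ground_truth": "answer keys published with the original exams",
    "num_samples": 14042,                           # evaluation samples
    "splits": {"dev": 285, "validation": 1531, "test": 14042},
    "adaptivity": "fixed",  # fixed dataset vs. adaptive to model responses
}
```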

How Does It Work?

Explain the methodology and execution of your evaluation.

Select who or what evaluates the model outputs
Describe the step-by-step procedure for evaluation, one step per line
Specify what level of model access is required to run this evaluation
Check if your evaluation includes a private held-out test set that is not publicly available
Provide details about your private held-out test set
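
Because the procedure is entered one step per line, it can be stored as an ordered list of steps. A minimal sketch, again with hypothetical field names:

```python
# Hypothetical methodology record; field names are illustrative.
# The procedure field is entered one step per line, so it splits
# cleanly into an ordered list of steps.
procedure_text = """\
Format each question together with its answer choices
Prompt the model and collect the letter it selects
Compare the selected letter against the answer key
Report mean accuracy over all questions"""

methodology = {
    "judge": "exact match against the answer key",  # who/what scores outputs
    "procedure": [s.strip() for s in procedure_text.splitlines() if s.strip()],
    "model_access": "text completion only (no logits or weights required)",
    "has_private_test_set": False,
    "private_test_set_details": None,  # only needed when the flag above is True
}
```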

Alignment

Document the quality assurance measures and known limitations of your evaluation.

Describe validation steps taken to ensure construct validity (e.g., expert review, pilot studies, correlation with known measures)
Select a score indicating whether your evaluation meets some or all of the conditions of the construct validity checklist
List baselines with their performance scores, one per line
Select any robustness testing measures you've applied.
Document any known weaknesses or sensitivities in your evaluation
List related benchmarks and explain how your evaluation differs or improves upon them
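
Baselines listed one per line can be parsed into name/score pairs. A minimal sketch with hypothetical field names, assuming a simple "name: score" line format that is not prescribed by the factsheet itself; the sample figures follow the numbers reported in the MMLU paper:

```python
# Hypothetical quality-assurance record. Baselines are listed one per
# line as "name: score"; the line format and field names are illustrative.
baselines_text = """\
random guessing: 25.0
GPT-3 (few-shot): 43.9"""

def parse_baselines(text: str) -> dict[str, float]:
    """Split 'name: score' lines into a mapping from baseline to score."""
    pairs = (line.rsplit(":", 1) for line in text.splitlines() if line.strip())
    return {name.strip(): float(score) for name, score in pairs}

alignment = {
    "construct_validity": "expert review plus a small pilot study",
    "checklist_score": 3,              # against the construct validity checklist
    "baselines": parse_baselines(baselines_text),
    "robustness_measures": ["prompt paraphrasing", "answer-order shuffling"],
    "known_limitations": "scores are sensitive to answer-choice ordering",
    "related_benchmarks": "broader subject coverage than ARC or OpenBookQA",
}
```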