This work develops an LLM evaluation benchmark and training data for reasoning problems whose answers are not numerical values or multiple-choice answers, but full mathematical objects.
The core claim is simple: if you want models that can help with real scientific and mathematical work, you need to train on such data and test whether they can derive things like equations, sets, matrices, intervals, and piecewise functions. We show that training this way also improves the model's reasoning on other task formats, including numerical and MCQ benchmarks.
Benchmark: PrincipiaBench on HuggingFace
Training data: Principia Collection on HuggingFace

Figure: RL training on the Principia Collection improves performance on PrincipiaBench and transfers to numerical and MCQ benchmarks.
A large fraction of current reasoning evaluation still rewards models for producing either a numerical value or a multiple-choice option.
Those formats are convenient to grade, but they hide an important weakness: a model may never learn to manipulate complex objects successfully. For example, it can learn to solve a multiple-choice question (MCQ) by reasoning backward from the options rather than deriving the answer from first principles.
An example mathematical-object answer we work with looks like:
\[\frac{1}{|G|}\left(2 + \sum_{x \in G,\ x \neq 1}\mathrm{Re}(\chi(x))\right)\]
Deriving an expression like this is a very different capability from selecting B in a multiple-choice question or outputting 42!
Our experiments show that once answer options are removed from SuperGPQA questions that really require mathematical objects, performance drops sharply even for frontier models. The reported drop is typically in the 10 to 20 point range.

Figure: removing answer options reveals a substantial gap between MCQ performance and open-ended derivation ability.
This is the motivation for PrincipiaBench, a benchmark of 2,558 problems collected from RealMath, Physics, ARB, and filtered SuperGPQA items. The benchmark is designed so that the model must generate the mathematical object directly.
To train for this harder setting, we also introduce the Principia Collection, a 248K-example synthetic dataset grounded in:
These topic headings are sampled (e.g. “Techniques for Polymers & Soft Matter » Resonance techniques » Nuclear magnetic resonance » Magnetic resonance imaging”) and used to generate challenging questions with sufficient diversity.
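As a rough illustration of sampling a topic heading, the sketch below walks a nested taxonomy from the root to a leaf and joins the headings with " » ". The `TAXONOMY` dictionary, the `sample_topic_path` helper, and the uniform choice at each level are all assumptions made for this example; the post does not specify how the actual sampling is done.

```python
import random

# Hypothetical miniature taxonomy built from the single example heading above.
# The real topic hierarchy is much larger and is not reproduced here.
TAXONOMY = {
    "Techniques for Polymers & Soft Matter": {
        "Resonance techniques": {
            "Nuclear magnetic resonance": {
                "Magnetic resonance imaging": {},
            },
        },
    },
}

def sample_topic_path(tree, rng):
    """Walk from the root to a leaf, picking one child uniformly at each level."""
    path = []
    node = tree
    while node:  # an empty dict marks a leaf
        heading = rng.choice(sorted(node))
        path.append(heading)
        node = node[heading]
    return " » ".join(path)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
topic = sample_topic_path(TAXONOMY, rng)
print(topic)
```

Sampling a full root-to-leaf path, rather than only the leaf, keeps the parent headings available as context for question generation.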
The target outputs span six answer types:
Here are some illustrative examples:

Figure: example instances from the Principia Collection, showing the level of detail and the mathematical-object answer types the dataset targets.
Creating high-quality data involves several steps:
Step 1: Topic mining, as discussed above.
Step 2: Problem Statement Generation. This process consists of three sub-stages:
Step 3: Filtering out Invalid Problem Statements. The filtering is based on three main criteria: (1) the problem statement must consist of only one question and be self-contained, (2) the problem should explicitly require the intended answer type, and (3) extensive hints or the answer itself should not be included in the problem statement. We prompt GPT-OSS-120B to evaluate all three conditions and retain only those instances that are judged as “Yes” for every criterion.
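The filtering rule in Step 3 can be sketched as follows. The criterion keys, the `keep_problem` and `filter_problems` helpers, and the `judge` callable are hypothetical names introduced for this example; in the actual pipeline the judgments come from prompting GPT-OSS-120B, which is replaced here by a canned lookup table.

```python
# Hypothetical keys for the three criteria described in Step 3.
CRITERIA = (
    "single_self_contained",   # (1) one question, self-contained
    "requires_answer_type",    # (2) explicitly requires the intended answer type
    "no_hints_or_answer",      # (3) no extensive hints or leaked answer
)

def keep_problem(judgments):
    """Retain a problem only if every criterion was judged "Yes"."""
    return all(
        judgments.get(name, "").strip().lower() == "yes" for name in CRITERIA
    )

def filter_problems(problems, judge):
    """`judge` is a stand-in callable: problem text -> {criterion: "Yes"/"No"}."""
    return [p for p in problems if keep_problem(judge(p))]

# Canned judgments standing in for the LLM judge (illustrative data only).
canned = {
    "Derive the set of all real x satisfying x^2 < 4.": {
        "single_self_contained": "Yes",
        "requires_answer_type": "Yes",
        "no_hints_or_answer": "Yes",
    },
    "Find f(x). Also, what is f(2)?": {
        "single_self_contained": "No",
        "requires_answer_type": "Yes",
        "no_hints_or_answer": "Yes",
    },
}

kept = filter_problems(list(canned), canned.get)
print(kept)
```

Treating a missing or malformed judgment as a failure is a design choice of this sketch: it errs toward discarding a problem rather than keeping one that was never fully checked.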
Step 4: Response Generation & Majority-Voting for mathematical objects. To obtain labels for the generated problem statements, we prompt GPT-OSS-120B eight times and apply self-consistency, taking the majority vote as the label. See the following figure:

Figure: illustration of the majority-vote procedure used to determine labels when multiple mathematically equivalent answers may be written in different forms.
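A minimal sketch of majority voting over mathematical-object answers is shown below. Because equivalent answers can be written in different forms, votes are grouped by an equivalence check rather than by exact string match. The `majority_vote` helper and the toy `toy_equivalent` check (which only understands reordered sums) are assumptions for illustration, and the eight sample strings are made up; the post does not describe how equivalence is decided in the actual pipeline.

```python
def majority_vote(answers, equivalent):
    """Group answers by equivalence; return (representative, votes) of the largest group."""
    clusters = []  # each cluster holds answers equivalent to its first member
    for ans in answers:
        for cluster in clusters:
            if equivalent(cluster[0], ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no existing cluster matched
    best = max(clusters, key=len)
    return best[0], len(best)

def toy_equivalent(a, b):
    """Toy check (assumption): two sums are equal if they have the same terms in any order."""
    terms = lambda s: sorted(s.replace(" ", "").split("+"))
    return terms(a) == terms(b)

# Eight sampled answers, mirroring the eight generations per problem.
samples = [
    "x^2 + 1", "1 + x^2", "x^2+1", "x^2 - 1",
    "1+x^2", "x^2 + 1", "2x + 1", "x^2+1",
]

label, votes = majority_vote(samples, toy_equivalent)
print(label, votes)
```

With exact string matching, the six equivalent answers above would be split across four different strings, which is why the equivalence check matters for this kind of label.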
Overall, we find this data creation procedure effective in increasing the conceptual depth of the generated problems and ensuring correctness.
Figure: Example of a subject entity (acquired from PhySH), a strategy description (from step 1), an initial problem statement and its corresponding CoT, and a revised problem statement and its corresponding CoT (from step 2).
We conduct detailed experiments comparing various models on PrincipiaBench, including post-training various LM backbones on our training set, the Principia Collection. The main findings of our experiments are:

We find that training on Principia Collection gives superior performance on reasoning problems involving mathematical objects of various types compared to existing training datasets, as measured by PrincipiaBench.
Figure: Training directly on complex mathematical objects yields substantially better transfer than training on datasets requiring only numerical values or simple mathematical objects.
We find that training with a strong model-based verifier, rather than a rule-based verifier, becomes highly important when the data is more challenging to verify, as in our training set. There is little difference in performance when training on DeepScaleR, which consists of numerical-value and simple mathematical-object answers, but performance varies considerably when training on the Principia Collection, which consists of complex mathematical-object answers.
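To make the difference concrete, here is a toy sketch of the two reward styles. Everything in it is an assumption for illustration: `rule_based_reward` is a deliberately naive string match (real rule-based verifiers are more capable), and `stub_judge` is a lookup table standing in for an LLM judge. It is not the verifier used in the experiments.

```python
def rule_based_reward(prediction, reference):
    """Naive rule: exact string match after removing whitespace."""
    norm = lambda s: "".join(s.split())
    return 1.0 if norm(prediction) == norm(reference) else 0.0

def model_based_reward(prediction, reference, judge):
    """`judge` is a hypothetical callable returning True if the answers are equivalent."""
    return 1.0 if judge(prediction, reference) else 0.0

# Stand-in for a model-based verifier: a table of pairs known to be equivalent.
KNOWN_EQUIVALENT = {
    frozenset({"[0, 1) ∪ (1, 2]", "[0, 2] \\ {1}"}),
}

def stub_judge(prediction, reference):
    return prediction == reference or frozenset({prediction, reference}) in KNOWN_EQUIVALENT

reference = "[0, 2] \\ {1}"
prediction = "[0, 1) ∪ (1, 2]"  # same set, written differently

rule_score = rule_based_reward(prediction, reference)
model_score = model_based_reward(prediction, reference, stub_judge)
print(rule_score, model_score)
```

In this toy case the naive rule gives a correct answer zero reward because it is written in a different form. For numerical answers the two rewards would mostly agree, which is consistent with the small gap reported on DeepScaleR.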

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. Likewise, existing RL post-training datasets overrepresent easy-to-verify formats, largely excluding complex mathematical-object answers.
To address these gaps, we introduce PrincipiaBench, a benchmark designed to evaluate an LM’s ability to derive mathematical objects, and the Principia Collection, a synthetic post-training dataset that improves LLMs on both PrincipiaBench and other reasoning tasks.
Together, the Principia suite provides a unified framework for evaluating and improving LM reasoning.
Seungone Kim, Pranjal Aggarwal, Bo Liu, Swarnadeep Saha, Ping Yu, Anaelia Ovalle, Jack Lanchantin, Jing Xu, Weizhe Yuan, Wenting Zhao, Adina Williams, Marjan Ghazvininejad, Graham Neubig, Sean Welleck, Jason Weston, Ilia Kulikov.
More details can be found in the full technical report.
If you use our training data or benchmark in your own work, please cite the following BibTeX entry:
@article{principia2026,
title={Reasoning over mathematical objects: on-policy reward modeling and test time aggregation},
author={Pranjal Aggarwal and Marjan Ghazvininejad and Seungone Kim and Ilia Kulikov and Jack Lanchantin and Xian Li and Tianjian Li and Bo Liu and Graham Neubig and Anaelia Ovalle and Swarnadeep Saha and Sainbayar Sukhbaatar and Sean Welleck and Jason Weston and Chenxi Whitehouse and Adina Williams and Jing Xu and Ping Yu and Weizhe Yuan and Jingyu Zhang and Wenting Zhao},
journal={arXiv preprint arXiv:2603.18886},
year={2026}
}