Benchmarking with Meta Agents Research Environments¶
Meta Agents Research Environments (ARE) provides comprehensive benchmarking capabilities for evaluating AI agent performance across various scenarios. This guide covers how to run benchmarks, analyze results, and integrate with evaluation pipelines.
Overview¶
The Benchmark CLI is a powerful tool for systematic agent evaluation. It supports:
Local and Remote Datasets: Work with local scenario files or Hugging Face datasets
Multiple Model Providers: Connect to various AI models through LiteLLM
Parallel Execution: Run multiple scenarios concurrently for efficiency
Result Management: Automatic result collection and upload to Hugging Face
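If you have uv installed, you can invoke the CLI without a local checkout; printing the help text is a quick way to confirm your setup and see every option documented in the reference below:
# List all commands and options (should mirror the reference below)
uvx --from meta-agents-research-environments are-benchmark --help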
Main Commands¶
The benchmark CLI provides three primary commands:
run¶
Execute scenarios with AI agents and collect performance metrics.
judge¶
Validate scenario runs against a ground truth of expected outcomes in the environment.
gaia2-run¶
Complete Gaia2 leaderboard evaluation pipeline that automatically runs all required configurations and phases for submission.
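At a glance, the three commands share the same shape; the dataset, paths, and omitted model flags below are illustrative placeholders:
# Execute scenarios and collect metrics
are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation --agent default

# Validate traces dumped by a previous run against ground truth
are-benchmark judge -d ./benchmark_results

# Full Gaia2 leaderboard pipeline
are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 --agent default --output_dir ./gaia2_submission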
Command Line Reference¶
For complete parameter documentation with all options and examples, see the auto-generated reference:
are-benchmark¶
Main entry point for the Meta Agents Research Environments benchmark CLI.
This function processes command line arguments and runs the benchmark with the specified configuration. It supports three main commands: “run” for executing scenarios, “judge” for offline validation of scenarios, and “gaia2-run” for complete GAIA2 evaluation.
are-benchmark [OPTIONS] {run|judge|gaia2-run}
Options
- --log-level <log_level>¶
Set the logging level
- Options:
DEBUG | INFO | WARNING | ERROR | CRITICAL
- -a, --agent <agent>¶
Agent to use for running the Scenario
- Options:
default
- --endpoint <endpoint>¶
URL of the endpoint to contact for running the agent’s model
- -mp, --provider, --model_provider <provider>¶
Provider of the model
- Options:
azure | meta | local | llama-api | huggingface | mock | black-forest-labs | cerebras | cohere | fal-ai | featherless-ai | fireworks-ai | groq | hf-inference | hyperbolic | nebius | novita | nscale | openai | replicate | sambanova | together
- -m, --model <model>¶
Model used in the agent
- --max_concurrent_scenarios <max_concurrent_scenarios>¶
Maximum number of concurrent scenarios to run. If not specified, it is set automatically based on the number of CPUs
- --noise¶
Enable noise augmentation with tool augmentation and environment events configs
- --simulated_generation_time_mode <simulated_generation_time_mode>¶
Mode for simulating LLM generation time
- Options:
measured | fixed | random
- -o, --oracle¶
Run the scenario in Oracle mode, where oracle events (i.e., user-defined agent events) are run
- -d, --dataset <dataset>¶
Dataset directory containing scenarios as JSON files, or JSONL file listing scenarios
- --hf-dataset <hf_dataset>¶
HuggingFace dataset path
- --hf-config <hf_config>¶
Dataset config (subset) name for HuggingFace datasets
- --hf-split <hf_split>¶
Dataset split name (e.g., ‘test’, ‘validation’, ‘train’)
- --hf-revision <hf_revision>¶
HuggingFace dataset revision
- -l, --limit <limit>¶
Limit the number of scenarios to run per config
- --enable_caching¶
Enable caching of results.
- --executor_type <executor_type>¶
Type of executor to use for running scenarios.
- Options:
thread | process
- --config <config>¶
Dataset config (subset) name
- --split <split>¶
Dataset split name (e.g., ‘test’, ‘validation’, ‘train’)
- --output_dir, --dump_dir <output_dir>¶
Directory to dump the scenario states and logs
- --trace_dump_format <trace_dump_format>¶
Format in which to dump traces to JSON. ‘hf’ for HuggingFace format, ‘lite’ for lightweight format, ‘both’ for dual export. Must include ‘hf’ for upload to HuggingFace.
- Options:
hf | lite | both
- --hf_upload <hf_upload>¶
Dataset name to upload the traces to HuggingFace. If not specified, the traces are not uploaded.
- --hf_public <hf_public>¶
If true, the dataset is uploaded as a public dataset. If false, the dataset is uploaded as a private dataset.
- --scenario_timeout <scenario_timeout>¶
Timeout for each scenario in seconds. Defaults to 1860 seconds.
- --a2a_app_prop <a2a_app_prop>¶
When set to >0, turns on Agent2Agent mode, spinning up independent agents for an a2a_app_prop fraction of available Apps per scenario.
- --a2a_app_agent <a2a_app_agent>¶
[Agent2Agent mode] Agent used for App agent instances.
- Options:
default_app_agent
- --a2a_model <a2a_model>¶
[Agent2Agent mode] Model used for App agent instances.
- --a2a_model_provider <a2a_model_provider>¶
[Agent2Agent mode] Provider of the App agent model
- Options:
azure | meta | local | llama-api | huggingface | mock | black-forest-labs | cerebras | cohere | fal-ai | featherless-ai | fireworks-ai | groq | hf-inference | hyperbolic | nebius | novita | nscale | openai | replicate | sambanova | together
- --a2a_endpoint <a2a_endpoint>¶
[Agent2Agent mode] URL of the endpoint to contact for running App agent models
- --num_runs <num_runs>¶
Number of times to run each scenario for variance analysis. Defaults to 3.
- --judge_model <judge_model>¶
Model to use for the judge system. Use a capable model for best evaluation quality.
- --judge_provider <judge_provider>¶
Provider for the judge model. If not specified, uses the same provider as the main model.
- Options:
azure | meta | local | llama-api | huggingface | mock | black-forest-labs | cerebras | cohere | fal-ai | featherless-ai | fireworks-ai | groq | hf-inference | hyperbolic | nebius | novita | nscale | openai | replicate | sambanova | together
- --judge_endpoint <judge_endpoint>¶
URL of the endpoint for the judge model. Optional for custom endpoints.
Arguments
- COMMAND¶
Required argument
Key Parameters Overview¶
- Command Selection
  - run: Execute scenarios with AI agents and collect performance metrics
  - judge: Validate scenario runs against ground truth
  - gaia2-run: Complete Gaia2 leaderboard evaluation and submission pipeline
- Local Dataset Configuration
  - --dataset: Local directory containing JSON scenario files or JSONL file listing scenarios
  - --config: Dataset config name (e.g., execution)
  - --split: Dataset split name (e.g., validation)
- Hugging Face Dataset Configuration
  - --hf-dataset: Hugging Face dataset path (e.g., meta-agents-research-environments/gaia2)
  - --hf-split: Dataset split name (e.g., validation)
  - --hf-config: Dataset config/subset name for Hugging Face datasets
  - --hf-revision: Hugging Face dataset revision
- Model and Agent Configuration
  - --model: Model name to use for inference
  - --provider: Model provider (see supported providers below)
  - --endpoint: Custom endpoint URL for model API
  - --agent: Specific agent type to use
- Execution Control
  - --limit: Maximum number of scenarios to run per config
  - --max_concurrent_scenarios: Control parallel execution (auto-detected by default)
  - --num_runs: Number of times to run each scenario for variance analysis (default: 3)
  - --scenario_timeout: Timeout for each scenario in seconds (default: 900)
  - --oracle: Run in oracle mode where oracle events are executed
  - --noise: Enable noise augmentation with tool and environment configs
  - --executor_type: Type of executor to use for running scenarios (thread or process, default: process)
  - --enable_caching: Enable caching of results
- Agent2Agent Configuration
  - --a2a_app_prop: Fraction of Apps to run in Agent2Agent mode (0.0-1.0, default: 0)
  - --a2a_app_agent: Agent used for App agent instances (default: default)
  - --a2a_model: Model used for App agent instances
  - --a2a_model_provider: Provider for the App agent model
  - --a2a_endpoint: Endpoint URL for App agent models
- Output and Trace Management
  - --output_dir: Directory for saving scenario states and logs
  - --trace_dump_format: Format for dumping traces ('hf', 'lite', or 'both', default: both)
  - --hf_upload: Dataset name to upload traces to Hugging Face (if not specified, no upload)
  - --hf_public: Upload as public dataset (default: false)
- Judge System Configuration
  - --judge_model: Model to use for judge system validation (default: "meta-llama/Meta-Llama-3.3-70B-Instruct")
  - --judge_provider: Provider for the judge model (default: uses same provider as main model)
  - --judge_endpoint: Custom endpoint URL for the judge model (optional)
Note
- Reproducible Results: For consistent and reproducible evaluation results, use llama3.3-70B as the judge model.
You can use any provider that offers this model based on your preference, access, and cost considerations.
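For example, a run that pins the judge to the recommended model while leaving the agent configuration unchanged could look like this (the judge provider shown is illustrative; use whichever provider serves the model for you):
are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
    --model your-agent-model --provider your-provider --agent default \
    --judge_model meta-llama/Meta-Llama-3.3-70B-Instruct --judge_provider huggingface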
Basic Usage¶
Simple Benchmark Run¶
Run a basic benchmark with local scenarios:
uvx --from meta-agents-research-environments are-benchmark run --dataset /path/to/scenarios --agent default --limit 10
This command:
- --dataset /path/to/scenarios: Specifies the directory containing scenario files
- --agent default: Uses the Meta OSS agent
- --limit 10: Runs only the first 10 scenarios
With Hugging Face Datasets¶
Run benchmarks using Hugging Face datasets:
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation --agent default
This uses:
- --hf-dataset meta-agents-research-environments/gaia2: Hugging Face dataset path
- --hf-split validation: Specific dataset split
- --agent default: Agent implementation to use for the runs. default is the only agent provided and the one to use for Gaia2 evaluation.
Model Configuration¶
Supported Providers¶
Meta Agents Research Environments supports multiple model providers through LiteLLM:
- API-Based Providers
  - llama-api: Llama models via API
  - anthropic: Claude models
  - openai: GPT-3.5, GPT-4, and variants
  - azure: Azure OpenAI services
- Third-Party Providers
  - huggingface: Models from Hugging Face Hub
  - fireworks-ai: Fireworks AI models
  - together: Together AI models
  - replicate: Replicate models
  - …
- Local Deployments
  - local: Local model deployments with OpenAI-compatible APIs
Provider Examples¶
Llama API
export LLAMA_API_KEY="your-api-key"
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
--model Llama-4-Maverick-17B-128E-Instruct-FP8 --provider llama-api --agent default
OpenAI Models
export OPENAI_API_KEY="your-api-key"
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
--model gpt-4 --provider openai --agent default
Local OpenAI-Compatible Endpoint
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
--model your-local-model --provider local \
--endpoint "http://0.0.0.0:4000" --agent default
Hugging Face Models
export HUGGINGFACE_API_TOKEN="your-token"
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
--model meta-llama/llama3-70b-instruct --provider huggingface --agent default
Complete Benchmark Run with Upload
export model="meta-llama/llama3-70b-instruct"
export provider="huggingface"
uvx --from meta-agents-research-environments are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 \
--agent default --output_dir ./benchmark_results/ \
--model $model --provider $provider \
--hf_upload myhforg/${model////.} --hf_public
Note
Judge System LLM Independence: The judge system uses its own separate LLM engine for validation, which is independent of your agent’s model configuration. The judge’s LLM is used for semantic validation of tool arguments, soft comparison of agent outputs, and context-aware evaluation. Hard validation (exact matching, scripted checks) runs without LLM inference.
Important: The judge system does not use the --model or --provider settings; those configure the agent. To set the judge LLM, use the --judge_model and --judge_provider settings.
Environment Variables¶
Different providers require specific environment variables:
- Llama API
  - LLAMA_API_KEY (required): Your Llama API key
  - LLAMA_API_BASE (optional): Custom base URL, defaults to https://api.llama.com/compat/v1
- OpenAI
  - OPENAI_API_KEY (required): Your OpenAI API key
- Azure OpenAI (see the example after this list)
  - AZURE_API_KEY (required): Your Azure OpenAI API key
  - AZURE_API_BASE (required): Your Azure OpenAI endpoint URL
- Anthropic
  - ANTHROPIC_API_KEY (required): Your Anthropic API key
- Hugging Face
  - HUGGINGFACE_API_TOKEN (required for private models): Your Hugging Face token
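As a concrete example, an Azure OpenAI run might be configured as follows (the resource URL and deployment name are placeholders):
export AZURE_API_KEY="your-azure-key"
export AZURE_API_BASE="https://your-resource.openai.azure.com"
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
    --model your-deployment-name --provider azure --agent default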
Provider-Specific Notes¶
- Llama API Configuration
  The Llama API provider automatically configures the endpoint and authentication:
  - Uses OpenAI-compatible format internally
  - Requires the LLAMA_API_KEY environment variable
  - Supports a custom base URL via LLAMA_API_BASE
  - Get an API token at Llama Developer <https://llama.developer.meta.com/>
- Local Deployments
  For local model deployments (see the sketch after this list):
  - Use --provider local with --endpoint pointing to your server
  - Ensure your local server implements an OpenAI-compatible API
  - Common local deployment tools: vLLM, text-generation-inference, Ollama
- Hugging Face Integration
  The Hugging Face provider supports:
  - Direct model loading from Hugging Face Hub
  - Private model access with authentication tokens
  - Various Hugging Face inference providers
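As one possible local setup, the sketch below serves a model with vLLM's OpenAI-compatible server and points the benchmark at it; the model name and port are placeholders, and the exact vLLM invocation may differ between versions:
# Terminal 1: start an OpenAI-compatible server (vLLM shown as one option)
vllm serve your-local-model --port 4000

# Terminal 2: run the benchmark against the local endpoint
are-benchmark run -d /path/to/scenarios --agent default \
    --model your-local-model --provider local --endpoint "http://0.0.0.0:4000"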
Execution Control Options¶
Concurrency Control
are-benchmark run -d /path/to/scenarios \
--max_concurrent_scenarios 4 -a default
Output Directory
are-benchmark run -d /path/to/scenarios \
--output_dir ./benchmark_results -a default
Scenario Limiting
are-benchmark run -d /path/to/scenarios \
--limit 50 -a default
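These controls can be combined in a single invocation; the values below are illustrative:
are-benchmark run -d /path/to/scenarios -a default \
    --max_concurrent_scenarios 4 --scenario_timeout 900 \
    --num_runs 3 --executor_type process --output_dir ./benchmark_results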
Result Management¶
Save Results Locally
are-benchmark run --hf meta-agents-research-environments/gaia2 --hf-split validation \
--output_dir ./benchmark_results -a default
Upload to Hugging Face
are-benchmark gaia2-run --hf meta-agents-research-environments/gaia2 \
--hf_upload my-org/gaia2-results -a default
Public Dataset Upload
are-benchmark gaia2-run --hf meta-agents-research-environments/gaia2 \
--hf_upload my-org/gaia2-results --hf_public -a default
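Local saving and uploading can also be combined. As noted in the reference, --trace_dump_format must include 'hf' for the upload to work (the default, both, satisfies this):
are-benchmark gaia2-run --hf meta-agents-research-environments/gaia2 -a default \
    --output_dir ./benchmark_results --trace_dump_format both \
    --hf_upload my-org/gaia2-results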
Recommended Workflow¶
Development and Testing¶
Validation Phase
Start with a small validation set to test your setup:
export LLAMA_API_KEY="your-api-key"
are-benchmark run --hf meta-agents-research-environments/gaia2 --hf-split validation \
    --model Llama-3.1-70B-Instruct --provider llama-api \
    --agent default --limit 10 --output_dir ./validation_results
Trace Analysis
Examine the generated traces in ./validation_results (a quick inspection sketch follows the list) to:
- Verify agent behavior matches expectations
- Check for errors or unexpected patterns
- Validate scoring and evaluation metrics
- Debug any issues before full evaluation
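A minimal way to eyeball the dumps from the shell, assuming you kept the ./validation_results directory from the command above (trace file names depend on scenario IDs and run configuration):
# List the dumped trace files
ls ./validation_results/

# Pretty-print one trace for manual inspection (file name is a placeholder)
python -m json.tool ./validation_results/scenario_trace.json | less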
Iterative Improvement
Based on validation results:
- Adjust agent parameters
- Modify scenario selection
- Fine-tune model settings
- Update evaluation criteria
Gaia2 Leaderboard Submission¶
For detailed Gaia2 evaluation guidance, see Gaia2 and Leaderboard Submission.
Scenario Validation¶
Judge Mode¶
Use judge mode to validate scenarios without running agents:
are-benchmark judge -d /path/to/traces
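The directory passed to -d should contain traces dumped by a previous run (via --output_dir). You can also select the judge model explicitly; the provider shown here is illustrative:
are-benchmark judge -d ./benchmark_results \
    --judge_model meta-llama/Meta-Llama-3.3-70B-Instruct --judge_provider huggingface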
Dataset Integration¶
Hugging Face Dataset Format¶
ARE works with Hugging Face datasets that follow this structure:
{
"scenario_id": "unique_identifier",
"data": "json_serialized_scenario",
"metadata": {
"difficulty": "easy|medium|hard",
"domain": "email|calendar|file_system",
"tags": ["tag1", "tag2"]
}
}
Dataset Configuration¶
The benchmark supports multiple dataset configurations (an example invocation follows the list):
- execution: Multi-step planning and state-changing operations
- search: Information gathering and combination from multiple sources
- adaptability: Dynamic adaptation to environmental changes
- time: Temporal reasoning with precise timing constraints
- ambiguity: Recognition and handling of ambiguous or impossible tasks
- mini: A 160-scenario subset of the above configurations
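To target a single configuration, pass it with --hf-config; mini is shown here as an example:
are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 \
    --hf-config mini --hf-split validation --agent default --limit 10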
Dataset Splits¶
Common dataset splits:
- validation: Main evaluation and development set; includes the ground truth needed to run the judge
Result Format¶
Benchmark results are automatically formatted for Hugging Face upload:
{
"scenario_id": "scenario_001",
"success": true,
"execution_time": 45.2,
"steps_taken": 12,
"validation_score": 0.95,
"agent_trace": [...],
"model_info": {
"model": "Llama-3.1-70B-Instruct",
"provider": "llama-api"
}
}
Gaia2 Submission Command¶
The gaia2-run command provides a comprehensive evaluation pipeline specifically designed for Gaia2 leaderboard submissions. This command automates the complex multi-phase evaluation process required for complete Gaia2 assessment.
Key Features¶
- Automated Multi-Phase Evaluation
  The command automatically executes three distinct evaluation phases:
  - Standard Phase: Base agent performance across all capability configurations (execution, search, adaptability, time, ambiguity)
  - Agent2Agent Phase: Multi-agent collaboration scenarios on the mini configuration
  - Noise Phase: Robustness evaluation with environment perturbations and tool augmentation on the mini configuration
- Standardized Evaluation Parameters
  - 3 runs per scenario: Ensures proper variance analysis for leaderboard requirements
  - Hugging Face format: Traces automatically formatted for submission compatibility
  - Comprehensive reporting: Generates validation reports and performance summaries
Submission-Specific Parameters¶
- Hugging Face Upload Configuration
  - --hf_upload: Dataset name for uploading consolidated results (required for submission)
  - --hf_public: Upload as a public dataset (default: false, keeping results private). You can submit to the leaderboard with private datasets.
Example Usage
# Complete Gaia2 submission with public upload
are-benchmark gaia2-run --hf meta-agents-research-environments/gaia2 \
--model meta-llama/llama3-70b-instruct --provider huggingface \
--agent default \
--output_dir ./gaia2_submission \
--hf_upload my-org/gaia2-llama3-70b-results \
--hf_public
# Validation run before full submission
are-benchmark gaia2-run --hf meta-agents-research-environments/gaia2 \
--split validation \
--model your-model --provider your-provider \
--agent default \
--limit 20 \
--output_dir ./gaia2_validation
For comprehensive Gaia2 evaluation guidance and submission process, see Gaia2 and Leaderboard Submission.
Variance Analysis¶
The platform supports running each scenario multiple times to analyze performance variance and improve statistical confidence in results.
Multiple Runs Configuration¶
Use the --num_runs parameter to specify how many times each scenario should be executed:
# Run each scenario 5 times for better variance analysis
are-benchmark run --hf meta-agents-research-environments/gaia2 --hf-split validation \
--model gpt-4 --provider openai -a default --num_runs 5
Note
The gaia2-run command automatically sets --num_runs 3 to meet leaderboard requirements for variance analysis.
Benefits of Multiple Runs¶
- Statistical Confidence
Multiple runs provide more reliable performance metrics by reducing the impact of random variations.
- Variance Analysis
Understand the consistency of your agent’s performance across identical scenarios.
- Robust Evaluation
Identify scenarios where performance is highly variable vs. consistently good/bad.
Result Structure¶
When using multiple runs, the system automatically:
- Creates unique trace files for each run: scenario_123_run_1_[config hash]_[timestamp].json, scenario_123_run_2_[config hash]_[timestamp].json, etc.
- Groups results by base scenario ID for variance calculations
- Provides comprehensive statistics in the final report
Report Structure¶
The benchmark results include detailed reports for each configuration and overall summary:
=== Validation Report ===
Model: gemini-2-5-pro
Provider: unknown
=== Time ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 20.0% ± 0.0% (STD: 0.0%)
- Pass@3: 3 scenarios (30.0%)
- Pass^3: 1 scenarios (10.0%)
- Average run duration: 93.4s (STD: 47.4s)
=== Ambiguity ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 10.0% ± 0.0% (STD: 0.0%)
- Pass@3: 2 scenarios (20.0%)
- Pass^3: 0 scenarios (0.0%)
- Average run duration: 94.9s (STD: 62.3s)
=== Execution ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 53.3% ± 6.7% (STD: 11.5%)
- Pass@3: 6 scenarios (60.0%)
- Pass^3: 4 scenarios (40.0%)
- Average run duration: 124.0s (STD: 78.3s)
=== Adaptability ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 20.0% ± 0.0% (STD: 0.0%)
- Pass@3: 2 scenarios (20.0%)
- Pass^3: 2 scenarios (20.0%)
- Average run duration: 95.1s (STD: 47.4s)
=== Search ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 56.7% ± 3.3% (STD: 5.8%)
- Pass@3: 8 scenarios (80.0%)
- Pass^3: 3 scenarios (30.0%)
- Average run duration: 168.0s (STD: 91.5s)
=== Global Summary ===
- Scenarios: 50 unique (150 total runs)
- Macro success rate: 32.0% ± 1.2% (STD: 2.0%)
- Micro success rate: 32.0% ± 1.2% (STD: 2.0%)
- Pass@3: 21 scenarios (42.0%)
- Pass^3: 10 scenarios (20.0%)
- Average run duration: 115.1s (STD: 72.7s)
- Job duration: 277.5 seconds
Understanding Report Metrics¶
The validation reports provide comprehensive statistical analysis:
- Per-Config Metrics
  - Success rate: Percentage of individual runs that succeeded (counts each run separately)
  - Pass^k: Percentage of scenarios that succeed in all k runs
  - Pass@k: Percentage of scenarios that succeed in at least 1 out of k runs
  - Average run duration: Average time taken per run
- Global Metrics
  - Macro success rate: Average of the per-capability success rates
  - Micro success rate: Success rate computed over all individual runs, pooled across capabilities
  - Job duration: Total time taken for all runs
- Variance Analysis
  - STD: Standard deviation of the metric across runs
  - ±: Standard error (STD / sqrt(n), where n is the number of runs). For example, the Global Summary above reports an STD of 2.0% over 3 runs, giving 2.0% / sqrt(3) ≈ 1.2%.
Results Caching¶
Meta Agents Research Environments includes a caching system that stores scenario execution results to avoid re-running identical scenarios with the same configuration.
When enabled with --enable_caching, the system generates unique cache keys based on both the scenario content and the runner configuration (model, provider, agent settings, etc.).
Results are stored as JSON files in ~/.cache/are/simulation/scenario_results/ by default, or in a custom location specified by the Meta Agents Research Environments_CACHE_DIR environment variable.
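For example, to reuse cached results across repeated invocations of the same configuration, and to clear them when you want a fresh run (assuming the default cache location):
# Re-running the same scenarios and configuration will reuse cached results
are-benchmark run -d /path/to/scenarios -a default --enable_caching

# Clear the default cache to force re-execution
rm -rf ~/.cache/are/simulation/scenario_results/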
Next Steps¶
With benchmarking knowledge:
- Run Systematic Evaluations: Use benchmarks for comprehensive agent testing
- Contribute Results: Share findings with the research community
- Iterate and Improve: Use results to enhance agent capabilities
- Develop Custom Benchmarks: Create domain-specific evaluation suites
Ready to create your own scenarios? Continue to Scenario Development for detailed guidance on scenario creation.