Benchmarking with Meta Agents Research Environments

Meta Agents Research Environments (ARE) provides comprehensive benchmarking capabilities for evaluating AI agent performance across various scenarios. This guide covers how to run benchmarks, analyze results, and integrate with evaluation pipelines.

Overview

The Benchmark CLI is a powerful tool for systematic agent evaluation. It supports:

  • Local and Remote Datasets: Work with local scenario files or Hugging Face datasets

  • Multiple Model Providers: Connect to various AI models through LiteLLM

  • Parallel Execution: Run multiple scenarios concurrently for efficiency

  • Result Management: Automatic result collection and upload to Hugging Face

Main Commands

The benchmark CLI provides three primary commands:

run

Execute scenarios with AI agents and collect performance metrics.

judge

Validate scenario runs against a ground truth of expected outcomes in the environment.

gaia2-run

Complete Gaia2 leaderboard evaluation pipeline that automatically runs all required configurations and phases for submission.
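
As a quick orientation, the three commands are typically invoked as follows (flags abbreviated; full options are documented below):

# execute scenarios with an agent
are-benchmark run -d /path/to/scenarios -a default

# validate previously dumped runs against ground truth
are-benchmark judge -d /path/to/traces

# run the complete Gaia2 leaderboard pipeline
are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 -a default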

Command Line Reference

For complete parameter documentation with all options and examples, see the auto-generated reference:

are-benchmark

Main entry point for the Meta Agents Research Environments benchmark CLI.

This function processes command line arguments and runs the benchmark with the specified configuration. It supports three main commands: “run” for executing scenarios, “judge” for offline validation of scenarios, and “gaia2-run” for complete GAIA2 evaluation.

are-benchmark [OPTIONS] {run|judge|gaia2-run}

Options

--log-level <log_level>

Set the logging level

Options:

DEBUG | INFO | WARNING | ERROR | CRITICAL

-a, --agent <agent>

Agent to use for running the Scenario

Options:

default

--endpoint <endpoint>

URL of the endpoint to contact for running the agent’s model

-mp, --provider, --model_provider <provider>

Provider of the model

Options:

azure | meta | local | llama-api | huggingface | mock | black-forest-labs | cerebras | cohere | fal-ai | featherless-ai | fireworks-ai | groq | hf-inference | hyperbolic | nebius | novita | nscale | openai | replicate | sambanova | together

-m, --model <model>

Model used in the agent

--max_concurrent_scenarios <max_concurrent_scenarios>

Maximum number of concurrent scenarios to run. If not specified, it is set automatically based on the number of CPUs

--noise

Enable noise augmentation with tool augmentation and environment events configs

--simulated_generation_time_mode <simulated_generation_time_mode>

Mode for simulating LLM generation time

Options:

measured | fixed | random

-o, --oracle

Run the scenario in Oracle mode, where oracle events (i.e., user-defined agent events) are run

-d, --dataset <dataset>

Dataset directory containing scenarios as JSON files, or JSONL file listing scenarios

--hf-dataset <hf_dataset>

HuggingFace dataset path

--hf-config <hf_config>

Dataset config (subset) name for HuggingFace datasets

--hf-split <hf_split>

Dataset split name (e.g., ‘test’, ‘validation’, ‘train’)

--hf-revision <hf_revision>

HuggingFace dataset revision

-l, --limit <limit>

Limit the number of scenarios to run per config

--enable_caching

Enable caching of results.

--executor_type <executor_type>

Type of executor to use for running scenarios.

Options:

thread | process

--config <config>

Dataset config (subset) name

--split <split>

Dataset split name (e.g., ‘test’, ‘validation’, ‘train’)

--output_dir, --dump_dir <output_dir>

Directory to dump the scenario states and logs

--trace_dump_format <trace_dump_format>

Format in which to dump traces to JSON. ‘hf’ for HuggingFace format, ‘lite’ for lightweight format, ‘both’ for dual export. Must include ‘hf’ for upload to HuggingFace.

Options:

hf | lite | both

--hf_upload <hf_upload>

Dataset name to upload the traces to HuggingFace. If not specified, the traces are not uploaded.

--hf_public <hf_public>

If true, the dataset is uploaded as a public dataset. If false, the dataset is uploaded as a private dataset.

--scenario_timeout <scenario_timeout>

Timeout for each scenario in seconds. Defaults to 1860 seconds.

--a2a_app_prop <a2a_app_prop>

When set to a value greater than 0, turns on Agent2Agent mode, spinning up independent agents for an a2a_app_prop fraction of the available Apps in each scenario.

--a2a_app_agent <a2a_app_agent>

[Agent2Agent mode] Agent used for App agent instances.

Options:

default_app_agent

--a2a_model <a2a_model>

[Agent2Agent mode] Model used for App agent instances.

--a2a_model_provider <a2a_model_provider>

[Agent2Agent mode] Provider of the App agent model

Options:

azure | meta | local | llama-api | huggingface | mock | black-forest-labs | cerebras | cohere | fal-ai | featherless-ai | fireworks-ai | groq | hf-inference | hyperbolic | nebius | novita | nscale | openai | replicate | sambanova | together

--a2a_endpoint <a2a_endpoint>

[Agent2Agent mode] URL of the endpoint to contact for running App agent models

--num_runs <num_runs>

Number of times to run each scenario for variance analysis. Defaults to 3.

--judge_model <judge_model>

Model to use for the judge system. Use a capable model for best evaluation quality.

--judge_provider <judge_provider>

Provider for the judge model. If not specified, uses the same provider as the main model.

Options:

azure | meta | local | llama-api | huggingface | mock | black-forest-labs | cerebras | cohere | fal-ai | featherless-ai | fireworks-ai | groq | hf-inference | hyperbolic | nebius | novita | nscale | openai | replicate | sambanova | together

--judge_endpoint <judge_endpoint>

URL of the endpoint for the judge model. Optional for custom endpoints.

Arguments

COMMAND

Required argument

Key Parameters Overview

Command Selection
  • run: Execute scenarios with AI agents and collect performance metrics

  • judge: Validate scenario runs against ground truth

  • gaia2-run: Complete Gaia2 leaderboard evaluation and submission pipeline

Local Dataset Configuration
  • --dataset: Local directory containing JSON scenario files or JSONL file listing scenarios

  • --config: Dataset config name (e.g., execution)

  • --split: Dataset split name (e.g., validation)

Hugging Face Dataset Configuration
  • --hf-dataset: Hugging Face dataset path (e.g., meta-agents-research-environments/gaia2)

  • --hf-split: Dataset split name (e.g., validation)

  • --hf-config: Dataset config/subset name for Hugging Face datasets

  • --hf-revision: Hugging Face dataset revision

Model and Agent Configuration
  • --model: Model name to use for inference

  • --provider: Model provider (see supported providers below)

  • --endpoint: Custom endpoint URL for model API

  • --agent: Specific agent type to use

Execution Control
  • --limit: Maximum number of scenarios to run per config

  • --max_concurrent_scenarios: Control parallel execution (auto-detected by default)

  • --num_runs: Number of times to run each scenario for variance analysis (default: 3)

  • --scenario_timeout: Timeout for each scenario in seconds (default: 1860)

  • --oracle: Run in oracle mode where oracle events are executed

  • --noise: Enable noise augmentation with tool and environment configs

  • --executor_type: Type of executor to use for running scenarios (thread or process, default: process)

  • --enable_caching: Enable caching of results
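
For example, a run combining several of these controls might look like the following (paths illustrative):

are-benchmark run -d /path/to/scenarios -a default \
  --limit 25 --max_concurrent_scenarios 8 --num_runs 3 \
  --scenario_timeout 1860 --executor_type process --enable_caching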

Agent2Agent Configuration
  • --a2a_app_prop: Fraction of Apps to run in Agent2Agent mode (0.0-1.0, default: 0)

  • --a2a_app_agent: Agent used for App agent instances (default: default_app_agent)

  • --a2a_model: Model used for App agent instances

  • --a2a_model_provider: Provider for App agent model

  • --a2a_endpoint: Endpoint URL for App agent models
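
As a sketch, an Agent2Agent run that hands half of the available Apps to independent App agents might look like this (model and provider names are placeholders):

are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-config mini --hf-split validation \
  -a default --model your-model --provider your-provider \
  --a2a_app_prop 0.5 --a2a_app_agent default_app_agent \
  --a2a_model your-app-agent-model --a2a_model_provider your-provider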

Output and Trace Management
  • --output_dir: Directory for saving scenario states and logs

  • --trace_dump_format: Format for dumping traces (‘hf’, ‘lite’, or ‘both’, default: both)

  • --hf_upload: Dataset name to upload traces to Hugging Face (if not specified, no upload)

  • --hf_public: Upload as public dataset (default: false)

Judge System Configuration
  • --judge_model: Model to use for judge system validation (default: “meta-llama/Meta-Llama-3.3-70B-Instruct”)

  • --judge_provider: Provider for the judge model (default: uses same provider as main model)

  • --judge_endpoint: Custom endpoint URL for the judge model (optional)
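
For instance, to validate previously dumped traces with an explicitly configured judge (provider choice illustrative):

are-benchmark judge -d ./benchmark_results \
  --judge_model meta-llama/Meta-Llama-3.3-70B-Instruct --judge_provider llama-api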

Note

Reproducible Results: For consistent and reproducible evaluation results, use Llama 3.3 70B as the judge model.

You can use any provider that offers this model based on your preference, access, and cost considerations.

Basic Usage

Simple Benchmark Run

Run a basic benchmark with local scenarios:

uvx --from meta-agents-research-environments are-benchmark run --dataset /path/to/scenarios --agent default --limit 10

This command:

  • --dataset /path/to/scenarios: Specifies the directory containing scenario files

  • --agent default: Uses the Meta OSS agent

  • --limit 10: Runs only the first 10 scenarios

With Hugging Face Datasets

Run benchmarks using Hugging Face datasets:

uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation --agent default

This uses:

  • --hf-dataset meta-agents-research-environments/gaia2: Hugging Face dataset path

  • --hf-split validation: Specific dataset split

  • --agent default: Agent implementation to use for the runs. default is the only agent provided, and it should be used for Gaia2 evaluation.

Model Configuration

Supported Providers

Meta Agents Research Environments supports multiple model providers through LiteLLM:

API-Based Providers
  • llama-api: Llama models via API

  • anthropic: Claude models

  • openai: GPT-3.5, GPT-4, and variants

  • azure: Azure OpenAI services

Third-Party Providers
  • huggingface: Models from Hugging Face Hub

  • fireworks-ai: Fireworks AI models

  • together: Together AI models

  • replicate: Replicate models

Local Deployments
  • local: Local model deployments with OpenAI-compatible APIs

Provider Examples

Llama API

export LLAMA_API_KEY="your-api-key"
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
  --model Llama-4-Maverick-17B-128E-Instruct-FP8 --provider llama-api --agent default

OpenAI Models

export OPENAI_API_KEY="your-api-key"
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
  --model gpt-4 --provider openai --agent default

Local OpenAI-Compatible Endpoint

uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
  --model your-local-model --provider local \
  --endpoint "http://0.0.0.0:4000" --agent default

Hugging Face Models

export HUGGINGFACE_API_TOKEN="your-token"
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
  --model meta-llama/llama3-70b-instruct --provider huggingface --agent default

Complete Benchmark Run with Upload

export model="meta-llama/llama3-70b-instruct"
export provider="huggingface"

# ${model////.} rewrites every "/" in the model name to "." so it forms a valid dataset name
uvx --from meta-agents-research-environments are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 \
  --agent default --output_dir ./benchmark_results/ \
  --model $model --provider $provider \
  --hf_upload myhforg/${model////.} --hf_public

Note

Judge System LLM Independence: The judge system uses its own separate LLM engine for validation, which is independent of your agent’s model configuration. The judge’s LLM is used for semantic validation of tool arguments, soft comparison of agent outputs, and context-aware evaluation. Hard validation (exact matching, scripted checks) runs without LLM inference.

Important: The judge system does not use the --model or --provider settings; these configure the agent. To set the judge LLM, use the --judge_model and --judge_provider settings.

Environment Variables

Different providers require specific environment variables:

Llama API
  • LLAMA_API_KEY (required): Your Llama API key

  • LLAMA_API_BASE (optional): Custom base URL, defaults to https://api.llama.com/compat/v1

OpenAI
  • OPENAI_API_KEY (required): Your OpenAI API key

Azure OpenAI
  • AZURE_API_KEY (required): Your Azure OpenAI API key

  • AZURE_API_BASE (required): Your Azure OpenAI endpoint URL

Anthropic
  • ANTHROPIC_API_KEY (required): Your Anthropic API key

Hugging Face
  • HUGGINGFACE_API_TOKEN (required for private models): Your Hugging Face token

Provider-Specific Notes

Llama API Configuration

The Llama API provider automatically configures the endpoint and authentication:

  • Uses OpenAI-compatible format internally

  • Requires LLAMA_API_KEY environment variable

  • Supports custom base URL via LLAMA_API_BASE

Get an API token at Llama Developer: https://llama.developer.meta.com/

Local Deployments

For local model deployments:

  • Use --provider local with --endpoint pointing to your server

  • Ensure your local server implements OpenAI-compatible API

  • Common local deployment tools: vLLM, text-generation-inference, Ollama
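
As a sketch, serving a model with vLLM and pointing the benchmark at it might look like this (model name and port are illustrative):

# start an OpenAI-compatible server with vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 4000

# point the benchmark at the local endpoint
uvx --from meta-agents-research-environments are-benchmark run -d /path/to/scenarios \
  --model meta-llama/Llama-3.1-8B-Instruct --provider local \
  --endpoint "http://0.0.0.0:4000" --agent default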

Hugging Face Integration

Hugging Face provider supports:

  • Direct model loading from Hugging Face Hub

  • Private model access with authentication tokens

  • Various Hugging Face inference providers

Execution Control Options

Concurrency Control

are-benchmark run -d /path/to/scenarios \
  --max_concurrent_scenarios 4 -a default

Output Directory

are-benchmark run -d /path/to/scenarios \
  --output_dir ./benchmark_results -a default

Scenario Limiting

are-benchmark run -d /path/to/scenarios \
  --limit 50 -a default

Result Management

Save Results Locally

are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
  --output_dir ./benchmark_results -a default

Upload to Hugging Face

are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf_upload my-org/gaia2-results -a default

Public Dataset Upload

are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf_upload my-org/gaia2-results --hf_public -a default

Scenario Validation

Judge Mode

Use judge mode to validate scenarios without running agents:

are-benchmark judge -d /path/to/traces
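
A typical two-step flow first dumps traces with run, then validates them offline (output directory illustrative):

# step 1: run scenarios and dump traces
are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
  -a default --output_dir ./benchmark_results

# step 2: judge the dumped traces against ground truth
are-benchmark judge -d ./benchmark_results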

Dataset Integration

Hugging Face Dataset Format

ARE works with Hugging Face datasets that follow this structure:

{
    "scenario_id": "unique_identifier",
    "data": "json_serialized_scenario",
    "metadata": {
        "difficulty": "easy|medium|hard",
        "domain": "email|calendar|file_system",
        "tags": ["tag1", "tag2"]
    }
}
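
To inspect such a dataset before running, you can load it with the Hugging Face datasets library; a minimal sketch, assuming the mini config and validation split described below:

python -c "from datasets import load_dataset; \
  ds = load_dataset('meta-agents-research-environments/gaia2', 'mini', split='validation'); \
  print(ds[0]['scenario_id'])"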

Dataset Configuration

The benchmark supports multiple dataset configurations:
  • execution: Multi-step planning and state-changing operations

  • search: Information gathering and combination from multiple sources

  • adaptability: Dynamic adaptation to environmental changes

  • time: Temporal reasoning with precise timing constraints

  • ambiguity: Recognition and handling of ambiguous or impossible tasks

  • mini: A 160-scenario subset drawn from the above configurations
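
For example, to run only the execution configuration:

are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf-config execution --hf-split validation -a default --limit 10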

Dataset Splits

Common dataset splits:

  • validation: Main evaluation and development set; includes the ground truth needed to run the judge

Result Format

Benchmark results are automatically formatted for Hugging Face upload:

{
    "scenario_id": "scenario_001",
    "success": true,
    "execution_time": 45.2,
    "steps_taken": 12,
    "validation_score": 0.95,
    "agent_trace": [...],
    "model_info": {
        "model": "Llama-3.1-70B-Instruct",
        "provider": "llama-api"
    }
}

Gaia2 Submission Command

The gaia2-run command provides a comprehensive evaluation pipeline specifically designed for Gaia2 leaderboard submissions. This command automates the complex multi-phase evaluation process required for complete Gaia2 assessment.

Key Features

Automated Multi-Phase Evaluation

The command automatically executes three distinct evaluation phases:

  • Standard Phase: Base agent performance across all capability configurations (execution, search, adaptability, time, ambiguity)

  • Agent2Agent Phase: Multi-agent collaboration scenarios on the mini configuration

  • Noise Phase: Robustness evaluation with environment perturbations and tool augmentation on the mini configuration

Standardized Evaluation Parameters
  • 3 runs per scenario: Ensures proper variance analysis for leaderboard requirements

  • Hugging Face format: Traces automatically formatted for submission compatibility

  • Comprehensive reporting: Generates validation reports and performance summaries

Submission-Specific Parameters

Hugging Face Upload Configuration
  • --hf_upload: Dataset name for uploading consolidated results (required for submission)

  • --hf_public: Upload as a public dataset (default: false, keeping results private). You can submit to the leaderboard with private datasets.

Example Usage

# Complete Gaia2 submission with public upload
are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 \
  --model meta-llama/llama3-70b-instruct --provider huggingface \
  --agent default \
  --output_dir ./gaia2_submission \
  --hf_upload my-org/gaia2-llama3-70b-results \
  --hf_public

# Validation run before full submission
are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf-split validation \
  --model your-model --provider your-provider \
  --agent default \
  --limit 20 \
  --output_dir ./gaia2_validation

For comprehensive Gaia2 evaluation guidance and submission process, see Gaia2 and Leaderboard Submission.

Variance Analysis

The platform supports running each scenario multiple times to analyze performance variance and improve statistical confidence in results.

Multiple Runs Configuration

Use the --num_runs parameter to specify how many times each scenario should be executed:

# Run each scenario 5 times for better variance analysis
are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 --hf-split validation \
  --model gpt-4 --provider openai -a default --num_runs 5

Note

The gaia2-run command automatically sets --num_runs 3 to meet leaderboard requirements for variance analysis.

Benefits of Multiple Runs

Statistical Confidence

Multiple runs provide more reliable performance metrics by reducing the impact of random variations.

Variance Analysis

Understand the consistency of your agent’s performance across identical scenarios.

Robust Evaluation

Identify scenarios where performance is highly variable vs. consistently good/bad.

Result Structure

When using multiple runs, the system automatically:

  • Creates unique trace files for each run: scenario_123_run_1_[config hash]_[timestamp].json, scenario_123_run_2_[config hash]_[timestamp].json, etc.

  • Groups results by base scenario ID for variance calculations

  • Provides comprehensive statistics in the final report

Report Structure

The benchmark results include detailed reports for each configuration and overall summary:

=== Validation Report ===
Model: gemini-2-5-pro
Provider: unknown

=== Time ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 20.0% ± 0.0% (STD: 0.0%)
- Pass@3: 3 scenarios (30.0%)
- Pass^3: 1 scenarios (10.0%)
- Average run duration: 93.4s (STD: 47.4s)

=== Ambiguity ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 10.0% ± 0.0% (STD: 0.0%)
- Pass@3: 2 scenarios (20.0%)
- Pass^3: 0 scenarios (0.0%)
- Average run duration: 94.9s (STD: 62.3s)

=== Execution ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 53.3% ± 6.7% (STD: 11.5%)
- Pass@3: 6 scenarios (60.0%)
- Pass^3: 4 scenarios (40.0%)
- Average run duration: 124.0s (STD: 78.3s)

=== Adaptability ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 20.0% ± 0.0% (STD: 0.0%)
- Pass@3: 2 scenarios (20.0%)
- Pass^3: 2 scenarios (20.0%)
- Average run duration: 95.1s (STD: 47.4s)

=== Search ===
- Scenarios: 10 unique (30 total runs)
- Success rate: 56.7% ± 3.3% (STD: 5.8%)
- Pass@3: 8 scenarios (80.0%)
- Pass^3: 3 scenarios (30.0%)
- Average run duration: 168.0s (STD: 91.5s)

=== Global Summary ===
- Scenarios: 50 unique (150 total runs)
- Macro success rate: 32.0% ± 1.2% (STD: 2.0%)
- Micro success rate: 32.0% ± 1.2% (STD: 2.0%)
- Pass@3: 21 scenarios (42.0%)
- Pass^3: 10 scenarios (20.0%)
- Average run duration: 115.1s (STD: 72.7s)
- Job duration: 277.5 seconds

Understanding Report Metrics

The validation reports provide comprehensive statistical analysis:

Per Config Metrics
  • Success rate: Percentage of individual runs that succeeded (counts each run separately)

  • Pass^k: Percentage of scenarios that succeed in all k runs

  • Pass@k: Percentage of scenarios that succeed in at least 1 out of k runs

  • Average run duration: Average time taken per run

Global Metrics
  • Macro success rate: Average of the per-capability success rates

  • Micro success rate: Success rate computed over all individual runs, pooled across configs

  • Job duration: Total time taken for all runs

Variance Analysis
  • STD: Standard deviation of the metric across runs

  • ±: Standard error (STD / sqrt(n) where n is the number of runs)
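
For example, in the Global Summary above, an STD of 2.0% over n = 3 runs gives a standard error of 2.0% / sqrt(3) ≈ 1.2%, which matches the reported ± value.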

Results Caching

Meta Agents Research Environments includes a caching system that stores scenario execution results to avoid re-running identical scenarios with the same configuration. When enabled with --enable_caching, the system generates unique cache keys based on both the scenario content and the runner configuration (model, provider, agent settings, etc.). Results are stored as JSON files in ~/.cache/are/simulation/scenario_results/ by default, or in a custom location specified by the ARE_CACHE_DIR environment variable.
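
For example, a cached run with a custom cache location (directory illustrative, using the environment variable named above):

# reuse results across repeated invocations with the same configuration
export ARE_CACHE_DIR=/scratch/are_cache
are-benchmark run -d /path/to/scenarios -a default --enable_caching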

Next Steps

With benchmarking knowledge:

  • Run Systematic Evaluations: Use benchmarks for comprehensive agent testing

  • Contribute Results: Share findings with the research community

  • Iterate and Improve: Use results to enhance agent capabilities

  • Develop Custom Benchmarks: Create domain-specific evaluation suites

Ready to create your own scenarios? Continue to Scenario Development for detailed guidance on scenario creation.