The Agents Research Environments judge system provides comprehensive validation capabilities for evaluating agent performance against ground truth oracle scenarios.
The judge system compares agent execution traces against oracle (ground truth) scenarios to determine success or failure. It operates in two main modes:
Online Validation: Real-time validation during scenario execution
Offline Validation: Post-execution validation using the judge command
The system uses a hierarchical approach with multiple types of judges that evaluate different aspects of agent behavior:
Event-level validation: Comparing individual agent actions against oracle actions
Tool-level validation: Verifying correct tool usage and parameters
Temporal validation: Ensuring actions occur within acceptable time windows
Causal validation: Verifying proper dependency ordering between actions
Tool Validation: Uses specialized tool judges for event comparison
Key Features:
Topological ordering of oracle events based on dependencies
Support for extra send_message_to_user calls from agents
Detailed failure reporting with specific mismatch information
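The topological ordering of oracle events can be illustrated with a small self-contained sketch using the standard library's graphlib (the event names here are hypothetical, not part of the ARE API):

```python
from graphlib import TopologicalSorter

# Hypothetical oracle dependency graph: each event maps to the set of
# parent events that must be matched before it.
oracle_deps = {
    "send_email": {"read_file"},
    "send_message_to_user": {"send_email"},
    "read_file": set(),
}

# static_order() yields events so that every parent precedes its children,
# which is the order the judge walks when matching agent events.
order = list(TopologicalSorter(oracle_deps).static_order())
print(order)  # read_file before send_email before send_message_to_user
```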
Usage Example:
# Run judge mode for online validation
uvx --from meta-agents-research-environments are-benchmark run -d /path/to/scenarios --limit 10

# Run judge mode for offline validation
uvx --from meta-agents-research-environments are-benchmark judge -d /path/to/scenarios --limit 10
A specialized implementation of the GraphPerEventJudge that uses purely deterministic, scripted validation without LLM-based soft validation.
This judge is ideal for scenarios where you need predictable, reproducible validation results and have well-defined validation criteria.
Core Characteristics:
LLM-Free Operation: Completely deactivates soft judges and relies only on hard, scripted checkers
Event-Specific Validation: Uses custom validation rules per oracle event
Key Configuration:
The judge requires an event_id_to_checker_params mapping that defines specific validation rules for each oracle event:
# Example configuration for scripted validation
scripted_config = ScriptedGraphPerEventJudgeConfig(
    event_id_to_checker_params={
        "oracle_send_email": [
            ToolCheckerParam(
                arg_name="recipient",
                checker_type=CheckerType.eq_checker,
                tool_name="EmailApp__send_email",
            ),
            ToolCheckerParam(
                arg_name="subject",
                checker_type=CheckerType.contain_any_checker,
                tool_name="EmailApp__send_email",
                checker_args={"targets": ["urgent", "important"]},
            ),
        ],
        "oracle_send_message": [
            ToolCheckerParam(
                arg_name="content",
                checker_type=CheckerType.contain_all_checker,
                tool_name="MessagingApp__send_message",
                checker_args={"targets": ["meeting", "2pm"]},
            ),
        ],
    },
    extra_send_message_to_user_allowed=0,
    pre_event_tolerance_seconds=5.0,
    post_event_tolerance_seconds=20.0,
)
ToolCheckerParam Structure:
Each ToolCheckerParam defines validation rules for specific tool arguments:
arg_name: The argument name to validate (e.g., “content”, “recipient”)
checker_type: The type of validation to perform (see available checkers below)
tool_name: The full tool name including app prefix (e.g., “EmailApp__send_email”)
checker_args: Additional parameters for the checker (optional)
Available Checker Types:
CheckerType.eq_checker: Exact equality comparison
CheckerType.contain_any_checker: Checks if argument contains any of the target strings
CheckerType.contain_all_checker: Checks if argument contains all target strings
CheckerType.unordered_list_checker: Set-based list comparison ignoring order
CheckerType.phone_number_checker: Phone number format validation
CheckerType.datetime_checker: Date/time format validation
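The semantics of the simpler hard checkers can be approximated with a few self-contained helper functions. This is an illustrative sketch only, not the ARE implementation (the real checkers also receive oracle values and checker_args through the ToolCheckerParam machinery):

```python
def eq_checker(agent_val, oracle_val):
    """Exact equality comparison."""
    return agent_val == oracle_val

def contain_any_checker(agent_val, targets):
    """True if the argument contains at least one of the target strings."""
    return any(t in agent_val for t in targets)

def contain_all_checker(agent_val, targets):
    """True if the argument contains every target string."""
    return all(t in agent_val for t in targets)

def unordered_list_checker(agent_list, oracle_list):
    """Set-based list comparison ignoring order (and duplicates)."""
    return set(agent_list) == set(oracle_list)

print(contain_all_checker("meeting moved to 2pm", ["meeting", "2pm"]))  # True
```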
Example Scenario Integration:
class MyScenario(Scenario):
    def __init__(self):
        # Define checker parameters for each oracle event
        self.d_checker_params = {
            "oracle_book_restaurant": [
                ToolCheckerParam(
                    arg_name="restaurant_name",
                    checker_type=CheckerType.contain_any_checker,
                    tool_name="BookingApp__make_reservation",
                    checker_args={"targets": ["Italian", "Chinese"]},
                ),
                ToolCheckerParam(
                    arg_name="party_size",
                    checker_type=CheckerType.eq_checker,
                    tool_name="BookingApp__make_reservation",
                ),
            ]
        }

    def initialize(self, **kwargs):
        super().initialize(**kwargs)
        self.judge = JudgeFactory()(
            ScriptedGraphPerEventJudgeConfig(
                event_id_to_checker_params=self.d_checker_params,
                extra_send_message_to_user_allowed=1,
            )
        )
# Hard validation for exact matches
arg_to_checker_type = {
    "recipient": CheckerType.eq_checker,
    "file_paths": CheckerType.unordered_path_list_checker,
    "phone": CheckerType.phone_number_checker,
}
The SoftToolJudge uses specialized LLM-based validation for semantic comparison of tool arguments.
It employs multiple targeted checkers, each optimized for specific validation scenarios.
Architecture:
The judge operates in a two-phase approach:
Equality Pre-check: Quick comparison to avoid unnecessary LLM calls when arguments are identical
Specialized Soft Checkers: LLM-powered validation using domain-specific checkers
Available Soft Checkers:
content_checker: Validates semantic equivalence of content against oracle and task context
signature_checker: Ensures proper user name/signature usage in communications
sanity_checker: Performs basic reasonableness checks on agent outputs
placeholder_checker: Detects and rejects placeholder text (e.g., “[User’s Name]”, “[Your Name]”)
cab_checker: Validates cab/ride booking details against user address
email_checker: Specialized validation for email compositions
message_checker: Validates message content and formatting
user_message_checker: Validates user-directed messages for appropriateness
event_checker: Validates event details against context and user information
tone_checker: Ensures appropriate communication tone
Key Features:
Equality Pre-check: Avoids LLM calls when agent and oracle arguments are identical
Subtask Extraction: Automatically extracts relevant subtasks from the broader task context
Context-Aware Validation: Uses user details, dates, and task context for validation
Placeholder Detection: Built-in detection for common placeholder text patterns
Validation Process:
Initial Setup: Identifies arguments marked for LLM checking
Equality Check: Compares normalized agent and oracle arguments
Context Preparation: Extracts user details, dates, and subtasks as needed
Checker Execution: Runs configured soft checkers in sequence
Result Aggregation: Returns success only if all checkers pass
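The equality pre-check in step 2 is simple in spirit: normalize both values and skip the LLM entirely when they already match. A hypothetical sketch (the normalization rules here are assumptions for illustration, not the ARE implementation):

```python
def normalize(value: str) -> str:
    # Assumed normalization: collapse whitespace and compare case-insensitively.
    return " ".join(value.split()).lower()

def needs_llm_check(agent_arg: str, oracle_arg: str) -> bool:
    """Return True only when the cheap equality pre-check fails,
    i.e. when an LLM-based semantic comparison is actually needed."""
    return normalize(agent_arg) != normalize(oracle_arg)

print(needs_llm_check("Hello  World", "hello world"))  # False: equal after normalization
```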
Example Configuration:
# Configure split soft judge with multiple checkers
config = SplitSoftToolJudgeConfig(
    tool_name="send_email",
    arg_to_checker_type={
        "subject": CheckerType.llm_checker,
        "body": CheckerType.llm_checker,
        "recipient": CheckerType.eq_checker,  # Hard check for exact match
    },
    soft_checker_types=[
        SoftCheckerType.placeholder_checker,  # Reject placeholder text
        SoftCheckerType.content_checker,      # Semantic content validation
        SoftCheckerType.tone_checker,         # Appropriate communication tone
        SoftCheckerType.signature_checker,    # Proper user signature
    ],
    engine=llm_engine,
)
Context Extraction:
The judge automatically extracts relevant context for validation:
User Details: Name and address from scenario user information
Temporal Context: Event date/time formatted for validation
Task Context: Extracts relevant subtasks using LLM-based extraction
Previous Tasks: Maintains context from prior scenario steps
Placeholder Detection:
Built-in detection for common placeholder patterns:
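A minimal regex-based approximation of placeholder detection (the heuristic below is an assumption for illustration; the real checker's pattern set may differ):

```python
import re

# Assumed heuristic: flag bracketed template tokens such as "[Your Name]".
PLACEHOLDER_RE = re.compile(r"\[[A-Za-z][A-Za-z' ]*\]")

def has_placeholder(text: str) -> bool:
    """True if the text still contains an unfilled placeholder token."""
    return PLACEHOLDER_RE.search(text) is not None

print(has_placeholder("Best regards, [Your Name]"))  # True
print(has_placeholder("Best regards, Alice"))        # False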
Event time comparators determine how the agent event time is compared to the oracle event time:
(default, no comparator): Agent event must occur within the tolerance window around the oracle time
LESS_THAN: Agent event must occur before the oracle time (plus post-tolerance)
GREATER_THAN: Agent event must occur after the oracle time (minus pre-tolerance)
Configuration Options:
# Time validation settings
check_time_threshold_seconds=30.0  # Minimum time gap to check
pre_event_tolerance_seconds=5.0    # Allowed time before oracle
post_event_tolerance_seconds=20.0  # Allowed time after oracle
The judge system enforces proper dependency ordering:
Dependency Graph
Oracle events include parent-child relationships
Causality Rules
All parent events must be matched before child events
Agent events must respect the same ordering constraints
Violations result in validation failure
Example:
Oracle Event Graph:
A → B → D
A → C → D
Valid Agent Sequence: A, B, C, D (or A, C, B, D)
Invalid Agent Sequence: B, A, C, D (B before A violates causality)
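The causality rule from the example can be checked mechanically: walk the matched agent sequence and verify that every event's parents have already been seen. A self-contained sketch using the event graph above:

```python
def respects_causality(sequence, parents):
    """Check that every event appears only after all of its parents."""
    seen = set()
    for event in sequence:
        if not parents.get(event, set()) <= seen:
            return False  # a parent has not been matched yet
        seen.add(event)
    return True

# Oracle graph from the example: A -> B -> D and A -> C -> D
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(respects_causality(["A", "B", "C", "D"], parents))  # True
print(respects_causality(["B", "A", "C", "D"], parents))  # False: B before A
```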
Indicates mismatched tool usage counts between agent and oracle:
Agent and oracle counters do not match for the following tools:
- Tool 'send_email': Agent count 2, Oracle count 1
- Tool 'read_file': Agent count 0, Oracle count 1
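The preliminary counter check behind this message can be sketched with collections.Counter. This is an illustrative approximation (not the ARE implementation), including the allowance for extra send_message_to_user calls:

```python
from collections import Counter

def counter_mismatches(agent_tools, oracle_tools, extra_user_messages_allowed=0):
    """Return (tool, agent_count, oracle_count) for every mismatched tool."""
    agent_counts, oracle_counts = Counter(agent_tools), Counter(oracle_tools)
    mismatches = []
    for tool in sorted(set(agent_counts) | set(oracle_counts)):
        a, o = agent_counts[tool], oracle_counts[tool]
        # Extra user messages are tolerated up to the configured budget.
        if tool == "send_message_to_user" and 0 <= a - o <= extra_user_messages_allowed:
            continue
        if a != o:
            mismatches.append((tool, a, o))
    return mismatches

print(counter_mismatches(["send_email", "send_email"], ["send_email", "read_file"]))
# [('read_file', 0, 1), ('send_email', 2, 1)]
```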
# Run judge on local scenarios
are-benchmark judge -d /path/to/scenarios

# Run judge on Hugging Face dataset
are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation
The judge system uses its own separate LLM engine for soft validation (semantic comparison of tool arguments). This LLM engine is independent of the model configuration you specify for your main agent, and can be customized for cost control and performance optimization.
Default Judge Configuration
# Judge command - uses default judge model configuration
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
    --output_dir ./judge_results
Custom Judge Model Configuration
You can specify custom judge model settings to control costs and performance:
# Use custom judge model and provider
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
    --judge_model custom-judge-model --judge_provider custom-provider \
    --output_dir ./judge_results
# Use different provider for judge vs main agent
uvx --from meta-agents-research-environments are-benchmark run --hf are-benchmark/gaia2 --hf-split test \
    --model custom-model --model_provider custom-model-provider \
    --judge_model custom-judge-model --judge_provider custom-judge-provider
# Use custom endpoint for judge model
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
    --judge_model custom-judge-model --judge_provider custom-provider \
    --judge_endpoint http://localhost:8000
Judge Model Configuration Options
–judge_model
Model to use for the judge system validation. Use a capable model for best evaluation quality.
–judge_provider
Provider for the judge model. If not specified, uses the same provider as the main model.
Supports all LiteLLM providers: openai, anthropic, huggingface, llama-api, etc.
Allows separate billing control from your main agent model
–judge_endpoint
Custom endpoint URL for the judge model (optional).
Useful for local deployments or custom inference servers
Must be OpenAI-compatible API format
Note
Reproducible Results: For consistent and reproducible evaluation results, use llama3.3-70B as the judge model.
Note
Judge LLM Independence: The judge system uses its own configurable LLM engine, which is separate from and independent of the model configuration you specify for your main agent (–model, –model_provider, etc.). The judge’s LLM is used for:
SoftToolJudge: Semantic comparison of tool arguments when exact matching isn’t sufficient
InContextJudge: LLM-based evaluation of entire agent traces
Hard validation (exact matching, scripted checks) does not require LLM inference and runs regardless of any model configuration.
# Configure judge for email scenario validation
judge_config = GraphPerEventJudgeConfig(
    # Time validation
    check_time_threshold_seconds=30.0,
    pre_event_tolerance_seconds=5.0,
    post_event_tolerance_seconds=20.0,
    # Checker types for each tool
    per_tool_arg_to_checker_type={
        "send_email": {
            "recipient": CheckerType.eq_checker,
            "subject": CheckerType.llm_checker,
            "body": CheckerType.llm_checker,
        },
        "read_file": {
            "file_path": CheckerType.path_checker,
        },
    },
    # Soft checkers
    per_tool_soft_checker_types={
        "send_email": [
            SoftCheckerType.placeholder_checker,
            SoftCheckerType.content_checker,
            SoftCheckerType.tone_checker,
            SoftCheckerType.signature_checker,
        ],
    },
    # Allow extra user messages
    extra_send_message_to_user_allowed=1,
)
Verifies the correctness of agent events by comparing them with oracle events.
The judge performs a preliminary check to ensure that the tool call counts are the same in both the oracle and agent events.
If this check passes, it orders the oracle events in topological order based on their dependencies.
Then it attempts to match each one with an agent event using two types of tool judges:
- A hard tool judge for specific arguments
- A soft tool judge (LLM-based) for other arguments.
Once a match is found, the judge verifies the causality by ensuring that all parent events of the oracle event have already been matched with previous agent events.
If all oracle events are successfully matched, the judge returns a success.
Checks whether the agent calls the same tools the same number of times as the oracle,
except for sending a message to the user, where extra calls may be allowed.
A judge that compares a pair of environment/user events from the agent log and the oracle agent log.
The two events match if their event IDs are the same.
agent_event_time (float): The time of the agent event (relative or absolute)
oracle_event_time (float): The time of the oracle event (relative or absolute).
pre_event_tolerance_seconds (float): The allowed time in seconds before the oracle event time.
post_event_tolerance_seconds (float): The allowed time in seconds after the oracle event time.
event_time_comparator (str | None): The type of comparison to perform between the agent and oracle event times. The arg type is str instead of EventTimeComparator for better readability in the tracer.
Returns:
bool: True if the agent event time is within the allowed tolerance range, False otherwise.
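A minimal sketch of this tolerance check, assuming agent and oracle times share the same reference and the comparator is passed as a string (as described above); this is an illustration, not the ARE implementation:

```python
def time_within_tolerance(agent_event_time, oracle_event_time,
                          pre_event_tolerance_seconds, post_event_tolerance_seconds,
                          event_time_comparator=None):
    """Return True if the agent event time satisfies the comparator
    relative to the oracle event time, within the configured tolerances."""
    delta = agent_event_time - oracle_event_time
    if event_time_comparator == "LESS_THAN":
        # Agent must act before the oracle time (plus post-tolerance).
        return delta <= post_event_tolerance_seconds
    if event_time_comparator == "GREATER_THAN":
        # Agent must act after the oracle time (minus pre-tolerance).
        return delta >= -pre_event_tolerance_seconds
    # Default: within the tolerance window around the oracle time.
    return -pre_event_tolerance_seconds <= delta <= post_event_tolerance_seconds

print(time_within_tolerance(103.0, 100.0, 5.0, 20.0))  # True: within [-5, +20]
print(time_within_tolerance(130.0, 100.0, 5.0, 20.0))  # False: 30s after oracle
```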
Base class for a tool judge. A tool judge compares an agent and an oracle event representing a tool call.
It decides if the two actions of the events, representing a tool call match.
A mild judge that combines a hard and soft judge to compare an agent and oracle event representing a tool call.
It first calls the hard judge and, if that passes, then calls the soft judge.
Config for the scripted graph per event judge.
A scripted judge is one where the soft judge is deactivated and scripted checks are instead performed by the hard judge.
The event_id_to_checker_params field is used to specify the scripted checks to perform.