Judge System for Scenario Validation

The Agents Research Environments judge system provides comprehensive validation capabilities for evaluating agent performance against ground truth oracle scenarios.

Overview

The judge system compares agent execution traces against oracle (ground truth) scenarios to determine success or failure. It operates in two main modes:

  • Online Validation: Real-time validation during scenario execution

  • Offline Validation: Post-execution validation using the judge command

The system uses a hierarchical approach with multiple types of judges that evaluate different aspects of agent behavior:

  • Event-level validation: Comparing individual agent actions against oracle actions

  • Tool-level validation: Verifying correct tool usage and parameters

  • Temporal validation: Ensuring actions occur within acceptable time windows

  • Causal validation: Verifying proper dependency ordering between actions

Judge Architecture

Base Judge Classes

BaseJudge

Abstract base class for all judges that compare agent and oracle event logs for a given scenario.

ToolJudge

Base class for judges that compare individual tool calls between agent and oracle events.

EventJudge

Base class for judges that compare agent and oracle events and determine if they match.

Judge State Management

The judge system maintains state throughout validation:

@dataclass
class BaseJudgeState:
    # Initialization flag
    initialized: bool = False

    # Turn tracking
    nb_turns: int = -1
    turn_idx: int = -1
    last_turn_success: bool = True
    last_turn_rationale: str = ""

    # Scenario data
    scenario_start_time: float = 0.0
    scenario_tasks: list[str] = field(default_factory=list)
    user_details: Contact | None = None

    # Oracle events
    turn_to_oracle_events: list[list[CompletedOracleEvent]] = field(
        default_factory=list
    )
    turn_to_oracle_graph: list[dict[str, list[str]]] = field(default_factory=list)
    oracle_event_id_to_turn_idx: dict[str, int] = field(default_factory=dict)

    # Agent events
    turn_to_agent_events: list[list[CompletedEvent]] = field(default_factory=list)

Judge Types

GraphPerEventJudge

The primary judge implementation that performs comprehensive validation by:

  1. Preliminary Checks: Verifies tool call counts match between agent and oracle

  2. Event Matching: Attempts to match each oracle event with an agent event

  3. Causality Verification: Ensures proper dependency ordering

  4. Tool Validation: Uses specialized tool judges for event comparison

Key Features:

  • Topological ordering of oracle events based on dependencies

  • Support for extra send_message_to_user calls from agents

  • Detailed failure reporting with specific mismatch information
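The control flow below is a minimal, illustrative sketch of this matching and causality logic, not the actual implementation. It assumes the oracle events are already topologically sorted by dependency, and that tool_judge_for and parents_of are hypothetical helpers standing in for the judge's internal wiring.

# Illustrative sketch only -- not the actual GraphPerEventJudge implementation.
# Assumes `oracle_events` is already sorted topologically by dependency, and that
# `tool_judge_for(event)` and `parents_of(event)` are hypothetical helpers.
def match_all_oracle_events(oracle_events, agent_events, tool_judge_for, parents_of):
    matched: dict[str, int] = {}  # oracle event id -> index of the matched agent event
    for oracle_event in oracle_events:
        judge = tool_judge_for(oracle_event)  # hard, soft, or mild tool judge
        for idx, agent_event in enumerate(agent_events):
            if idx in matched.values():
                continue  # ALREADY_MATCHED: agent event taken by another oracle event
            if not judge.compare(agent_event, oracle_event):
                continue  # TOOL_JUDGE_REJECT: tool-specific validation failed
            parents_ok = all(
                matched.get(parent_id, len(agent_events)) < idx
                for parent_id in parents_of(oracle_event)
            )
            if not parents_ok:
                continue  # CAUSALITY: a parent is unmatched or matched to a later event
            matched[oracle_event.event_id] = idx
            break
        else:
            return False, f"no agent event matches oracle event {oracle_event.event_id}"
    return True, matched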

Usage Example:

# Run judge mode for online validation
uvx --from meta-agents-research-environments are-benchmark run -d /path/to/scenarios --limit 10

# Run judge mode for offline validation
uvx --from meta-agents-research-environments are-benchmark judge -d /path/to/scenarios --limit 10

ScriptedGraphPerEventJudge

A specialized implementation of the GraphPerEventJudge that uses purely deterministic, scripted validation without LLM-based soft validation. This judge is ideal for scenarios where you need predictable, reproducible validation results and have well-defined validation criteria.

Core Characteristics:

  • LLM-Free Operation: Completely deactivates soft judges and relies only on hard, scripted checkers

  • Event-Specific Validation: Uses custom validation rules per oracle event

Key Configuration:

The judge requires an event_id_to_checker_params mapping that defines specific validation rules for each oracle event:

# Example configuration for scripted validation
scripted_config = ScriptedGraphPerEventJudgeConfig(
    event_id_to_checker_params={
        "oracle_send_email": [
            ToolCheckerParam(
                arg_name="recipient",
                checker_type=CheckerType.eq_checker,
                tool_name="EmailApp__send_email"
            ),
            ToolCheckerParam(
                arg_name="subject",
                checker_type=CheckerType.contain_any_checker,
                tool_name="EmailApp__send_email",
                checker_args={"targets": ["urgent", "important"]}
            )
        ],
        "oracle_send_message": [
            ToolCheckerParam(
                arg_name="content",
                checker_type=CheckerType.contain_all_checker,
                tool_name="MessagingApp__send_message",
                checker_args={"targets": ["meeting", "2pm"]}
            )
        ]
    },
    extra_send_message_to_user_allowed=0,
    pre_event_tolerance_seconds=5.0,
    post_event_tolerance_seconds=20.0
)

ToolCheckerParam Structure:

Each ToolCheckerParam defines validation rules for specific tool arguments:

  • arg_name: The argument name to validate (e.g., “content”, “recipient”)

  • checker_type: The type of validation to perform (see available checkers below)

  • tool_name: The full tool name including app prefix (e.g., “EmailApp__send_email”)

  • checker_args: Additional parameters for the checker (optional)

Available Checker Types:

  • CheckerType.eq_checker: Exact equality comparison

  • CheckerType.contain_any_checker: Checks if argument contains any of the target strings

  • CheckerType.contain_all_checker: Checks if argument contains all target strings

  • CheckerType.unordered_list_checker: Set-based list comparison ignoring order

  • CheckerType.path_checker: Normalized file path comparison

  • CheckerType.phone_number_checker: Phone number format validation

  • CheckerType.datetime_checker: Date/time format validation

Example Scenario Integration:

class MyScenario(Scenario):
    def __init__(self):
        # Define checker parameters for each oracle event
        self.d_checker_params = {
            "oracle_book_restaurant": [
                ToolCheckerParam(
                    arg_name="restaurant_name",
                    checker_type=CheckerType.contain_any_checker,
                    tool_name="BookingApp__make_reservation",
                    checker_args={"targets": ["Italian", "Chinese"]}
                ),
                ToolCheckerParam(
                    arg_name="party_size",
                    checker_type=CheckerType.eq_checker,
                    tool_name="BookingApp__make_reservation"
                )
            ]
        }

    def initialize(self, **kwargs):
        super().initialize(**kwargs)
        self.judge = JudgeFactory()(
            ScriptedGraphPerEventJudgeConfig(
                event_id_to_checker_params=self.d_checker_params,
                extra_send_message_to_user_allowed=1
            )
        )

InContextJudge

A baseline judge that uses LLM-based evaluation by providing all agent and oracle events in context to a model for comparison.

Features:

  • LLM-powered validation using configurable evaluation criteria

  • Support for tool-specific evaluation templates

  • Time-based validation with configurable tolerance windows

Tool Judges

The system includes three types of tool judges for validating individual tool calls:

HardToolJudge

Performs scripted, deterministic checks on tool arguments using predefined checkers:

Available Checkers:

  • eq_checker: Exact equality comparison

  • unordered_list_checker: Set-based list comparison

  • datetime_checker: Date/time format validation

  • phone_number_checker: Phone number format validation

  • path_checker: File path normalization and comparison

  • contain_any_checker: Substring containment validation

  • contain_all_checker: Multiple substring validation

Example Configuration:

# Hard validation for exact matches
arg_to_checker_type = {
    "recipient": CheckerType.eq_checker,
    "file_paths": CheckerType.unordered_path_list_checker,
    "phone": CheckerType.phone_number_checker
}
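For intuition, the sketch below approximates the semantics of a few of these checkers. It is an assumption-laden illustration (e.g., case-insensitive substring matching, simple path normalization), not the library's actual checker code.

import os

def eq_checker(x_agent, x_oracle):
    # Exact equality after string conversion
    return str(x_agent) == str(x_oracle)

def unordered_list_checker(x_agent, x_oracle):
    # Set-based comparison: order and duplicates are ignored
    return set(x_agent) == set(x_oracle)

def path_checker(x_agent, x_oracle):
    # Compare file paths after normalization (e.g., "./a//b" matches "a/b")
    return os.path.normpath(str(x_agent)) == os.path.normpath(str(x_oracle))

def contain_any_checker(x_agent, targets):
    # Agent value must contain at least one of the target substrings
    return any(t.lower() in str(x_agent).lower() for t in targets)

def contain_all_checker(x_agent, targets):
    # Agent value must contain every target substring
    return all(t.lower() in str(x_agent).lower() for t in targets)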

SoftToolJudge

The SoftToolJudge uses specialized LLM-based validation for semantic comparison of tool arguments. It employs multiple targeted checkers, each optimized for specific validation scenarios.

Architecture:

The judge operates in two phases:

  1. Equality Pre-check: Quick comparison to avoid unnecessary LLM calls when arguments are identical

  2. Specialized Soft Checkers: LLM-powered validation using domain-specific checkers
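A minimal sketch of this two-phase flow is shown below; the real SoftToolJudge differs in its details, and soft_checkers here simply stands in for whichever LLM-backed checkers are configured.

# Sketch of the two-phase flow described above (assumed control flow, not the
# actual SoftToolJudge implementation).
def soft_compare(agent_args, oracle_args, soft_checkers, **context):
    def normalize(args):
        return {k: str(v).strip() for k, v in args.items()}

    # Phase 1: equality pre-check -- identical arguments need no LLM call
    if normalize(agent_args) == normalize(oracle_args):
        return True

    # Phase 2: run the configured soft checkers; all must pass
    for checker in soft_checkers:
        verdict = checker(agent_args=agent_args, oracle_args=oracle_args, **context)
        if not verdict:  # False, or None meaning "could not decide"
            return verdict
    return True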

Available Soft Checkers:

  • content_checker: Validates semantic equivalence of content against oracle and task context

  • signature_checker: Ensures proper user name/signature usage in communications

  • sanity_checker: Performs basic reasonableness checks on agent outputs

  • placeholder_checker: Detects and rejects placeholder text (e.g., “[User’s Name]”, “[Your Name]”)

  • cab_checker: Validates cab/ride booking details against user address

  • email_checker: Specialized validation for email compositions

  • message_checker: Validates message content and formatting

  • user_message_checker: Validates user-directed messages for appropriateness

  • event_checker: Validates event details against context and user information

  • tone_checker: Ensures appropriate communication tone

Key Features:

  • Equality Pre-check: Avoids LLM calls when agent and oracle arguments are identical

  • Subtask Extraction: Automatically extracts relevant subtasks from the broader task context

  • Context-Aware Validation: Uses user details, dates, and task context for validation

  • Placeholder Detection: Built-in detection for common placeholder text patterns

Validation Process:

  1. Initial Setup: Identifies arguments marked for LLM checking

  2. Equality Check: Compares normalized agent and oracle arguments

  3. Context Preparation: Extracts user details, dates, and subtasks as needed

  4. Checker Execution: Runs configured soft checkers in sequence

  5. Result Aggregation: Returns success only if all checkers pass

Example Configuration:

# Configure split soft judge with multiple checkers
config = SplitSoftToolJudgeConfig(
    tool_name="send_email",
    arg_to_checker_type={
        "subject": CheckerType.llm_checker,
        "body": CheckerType.llm_checker,
        "recipient": CheckerType.eq_checker  # Hard check for exact match
    },
    soft_checker_types=[
        SoftCheckerType.placeholder_checker,  # Reject placeholder text
        SoftCheckerType.content_checker,      # Semantic content validation
        SoftCheckerType.tone_checker,         # Appropriate communication tone
        SoftCheckerType.signature_checker     # Proper user signature
    ],
    engine=llm_engine
)

Context Extraction:

The judge automatically extracts relevant context for validation:

  • User Details: Name and address from scenario user information

  • Temporal Context: Event date/time formatted for validation

  • Task Context: Extracts relevant subtasks using LLM-based extraction

  • Previous Tasks: Maintains context from prior scenario steps

Placeholder Detection:

Built-in detection for common placeholder patterns:

Detected Placeholders (automatically rejected):
- "[User's Name]", "[User Name]", "[User]"
- "[Your Name]", "[My Name]"
- "Best regards,\nYour Name"
- "Best,\nYour Name"

MildToolJudge

Combines hard and soft validation approaches:

  1. Hard Validation: Performs scripted checks first

  2. Soft Validation: Falls back to LLM validation for remaining arguments

This approach provides both reliability (hard checks) and flexibility (soft checks).
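A sketch of this control flow, under the assumption that the soft judge is only consulted when the hard judge accepts:

# Assumed control flow for the mild judge; hard_judge and soft_judge stand in
# for configured HardToolJudge and SoftToolJudge instances.
def mild_compare(agent_event, oracle_event, hard_judge, soft_judge, **kwargs):
    # Hard (scripted) checks run first; a hard reject never reaches the LLM
    if not hard_judge.compare(agent_event, oracle_event, **kwargs):
        return False
    # Remaining arguments fall through to LLM-based validation
    return soft_judge.compare(agent_event, oracle_event, **kwargs)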

Temporal Validation

The judge system includes sophisticated time-based validation:

Time Comparison Types

EQUAL (default)

Agent event must occur within tolerance window around oracle time

LESS_THAN

Agent event must occur before oracle time (plus post-tolerance)

GREATER_THAN

Agent event must occur after oracle time (minus pre-tolerance)

Configuration Options:

# Time validation settings
check_time_threshold_seconds = 30.0      # Minimum time gap to check
pre_event_tolerance_seconds = 5.0        # Allowed time before oracle
post_event_tolerance_seconds = 20.0      # Allowed time after oracle
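The sketch below illustrates how such a tolerance window can be applied for the three comparison types. It mirrors the documented event_time_checker parameters but is not the actual implementation, and it omits the check_time_threshold_seconds gate that decides whether the time check runs at all.

# Illustrative tolerance-window check, assuming string comparator names.
def event_time_ok(
    agent_event_time: float,
    oracle_event_time: float,
    pre_event_tolerance_seconds: float = 5.0,
    post_event_tolerance_seconds: float = 20.0,
    event_time_comparator: str | None = None,  # None/"equal", "less_than", "greater_than"
) -> bool:
    lower = oracle_event_time - pre_event_tolerance_seconds
    upper = oracle_event_time + post_event_tolerance_seconds
    if event_time_comparator == "less_than":
        return agent_event_time <= upper   # must not be later than oracle + post-tolerance
    if event_time_comparator == "greater_than":
        return agent_event_time >= lower   # must not be earlier than oracle - pre-tolerance
    return lower <= agent_event_time <= upper  # EQUAL: within the tolerance window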

Absolute vs Relative Time

Absolute Time

Direct comparison against wall-clock time

Relative Time

Comparison relative to parent event completion times

Failure Types and Diagnostics

The judge system provides detailed failure information:

ToolCallCountsFailure

Indicates mismatched tool usage counts between agent and oracle:

Agent and oracle counters do not match for the following tools:
- Tool 'send_email': Agent count 2, Oracle count 1
- Tool 'read_file': Agent count 0, Oracle count 1
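The following sketch shows how such a count comparison could work, based on the documented check_tool_call_counts signature. The aui counts are assumed here to track send_message_to_user calls, and the real implementation may differ.

from collections import Counter

def counts_match(
    agent_counter: Counter,
    agent_aui_count: int,
    oracle_counter: Counter,
    oracle_aui_count: int,
    extra_send_message_to_user_allowed: int = 0,
) -> bool:
    if agent_counter != oracle_counter:
        return False  # some tool was called a different number of times
    # Extra user-facing messages are tolerated up to the configured budget
    return oracle_aui_count <= agent_aui_count <= (
        oracle_aui_count + extra_send_message_to_user_allowed
    )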

EventComparisonFailure

Specific event matching failures with detailed context:

  • CAUSALITY: Dependency ordering violation

  • ALREADY_MATCHED: Agent event already matched to another oracle event

  • TOOL_JUDGE_REJECT: Tool-specific validation failed

OracleEventMatchingFailure

Comprehensive failure when no agent event matches an oracle event:

Failure: Agent did not perform the following oracle tool call:
tool name: send_email
tool args:
- recipient: john@example.com
- subject: Meeting Reminder
- body: Don't forget about our 2pm meeting

List of matching attempts:
- Failure matching agent event (ID: evt_123) with oracle event (ID: oracle_456), reason: tool judge reject

Command Line Usage

Basic Judge Command

# Run judge on local scenarios
are-benchmark judge -d /path/to/scenarios

# Run judge on Hugging Face dataset
are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation

Judge System LLM Configuration

The judge system uses its own separate LLM engine for soft validation (semantic comparison of tool arguments). This LLM engine is independent of the model configuration you specify for your main agent, and can be customized for cost control and performance optimization.

Default Judge Configuration

# Judge command - uses default judge model configuration
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
  --output_dir ./judge_results

Custom Judge Model Configuration

You can specify custom judge model settings to control costs and performance:

# Use custom judge model and provider
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
  --judge_model custom-judge-model --judge_provider custom-provider \
  --output_dir ./judge_results

# Use different provider for judge vs main agent
uvx --from meta-agents-research-environments are-benchmark run --hf are-benchmark/gaia2 --hf-split test \
  --model custom-model --model_provider custom-model-provider \
  --judge_model custom-judge-model --judge_provider custom-judge-provider

# Use custom endpoint for judge model
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
  --judge_model custom-judge-model --judge_provider custom-provider \
  --judge_endpoint http://localhost:8000

Judge Model Configuration Options

--judge_model

Model to use for the judge system validation. Use a capable model for best evaluation quality.

  • Default: “meta-llama/Meta-Llama-3.3-70B-Instruct”

  • Examples: “gpt-4”, “claude-3-opus”, “meta-llama/llama3-70b-instruct”

--judge_provider

Provider for the judge model. If not specified, uses the same provider as the main model.

  • Supports all LiteLLM providers: openai, anthropic, huggingface, llama-api, etc.

  • Allows separate billing control from your main agent model

--judge_endpoint

Custom endpoint URL for the judge model (optional).

  • Useful for local deployments or custom inference servers

  • Must be OpenAI-compatible API format

Note

Reproducible Results: For consistent and reproducible evaluation results, use llama3.3-70B as the judge model.

Note

Judge LLM Independence: The judge system uses its own configurable LLM engine, which is separate from and independent of the model configuration you specify for your main agent (--model, --model_provider, etc.). The judge’s LLM is used for:

  • SoftToolJudge: Semantic comparison of tool arguments when exact matching isn’t sufficient

  • InContextJudge: LLM-based evaluation of entire agent traces

Hard validation (exact matching, scripted checks) does not require LLM inference and runs regardless of any model configuration.

Configuration

Judge Configuration Classes

GraphPerEventJudgeConfig

Primary judge configuration with tool-specific validation rules

Note

Reproducible Results: For consistent and reproducible evaluation results, use the default GraphPerEventJudgeConfig configuration.

InContextJudgeConfig

Baseline LLM-based judge configuration with evaluation criteria

AgentEventJudgeConfig

Agent event validation with time tolerance settings

Tool Judge Configs
  • HardToolJudgeConfig: Scripted validation rules

  • SoftToolJudgeConfig: LLM validation settings

  • MildToolJudgeConfig: Combined validation approach

Example Configuration

# Configure judge for email scenario validation
judge_config = GraphPerEventJudgeConfig(
    # Time validation
    check_time_threshold_seconds=30.0,
    pre_event_tolerance_seconds=5.0,
    post_event_tolerance_seconds=20.0,

    # Checker types for each tool
    per_tool_arg_to_checker_type={
        "send_email": {
            "recipient": CheckerType.eq_checker,
            "subject": CheckerType.llm_checker,
            "body": CheckerType.llm_checker
        },
        "read_file": {
            "file_path": CheckerType.path_checker
        }
    },

    # Soft checkers
    per_tool_soft_checker_types={
        "send_email": [
            SoftCheckerType.placeholder_checker,
            SoftCheckerType.content_checker,
            SoftCheckerType.tone_checker,
            SoftCheckerType.signature_checker
         ],
    },

    # Allow extra user messages
    extra_send_message_to_user_allowed=1
)

Next Steps

For comprehensive benchmarking workflows that incorporate judge validation, see Benchmarking with Meta Agents Research Environments. Ready to develop custom scenarios? Continue to Scenario Development for detailed guidance on creating scenarios with proper oracle events for validation.

Judge System Classes

class are.simulation.validation.base.BaseJudge(config)[source]

Bases: ABC

Base class for a judge. A judge compares an agent and oracle event log for a given scenario.

abstract initialize_state(scenario)[source]
Return type:

None

validate(env)[source]
Return type:

ScenarioValidationResult

validate_current_turn(env)[source]
Return type:

ScenarioValidationResult

trigger_condition(env, turn_idx)[source]
Return type:

tuple[bool, dict[str, str]]

class are.simulation.validation.judge.GraphPerEventJudge(config=None)[source]

Bases: BaseJudge

Verifies the correctness of agent events by comparing them with oracle events. The judge performs a preliminary check to ensure that the tool call counts are the same in both the oracle and agent events. If this check passes, it orders the oracle events topologically based on their dependencies. It then attempts to match each one with an agent event using two types of tool judges: a hard tool judge for specific arguments, and a soft tool judge (LLM-based) for other arguments. Once a match is found, the judge verifies causality by ensuring that all parent events of the oracle event have already been matched with previous agent events. If all oracle events are successfully matched, the judge returns success.

get_judge_kwargs(oracle_event)[source]
Return type:

dict[str, Any]

initialize_state(scenario)[source]
update_state(env)[source]
check_tool_call_counts(agent_counter, agent_aui_count, oracle_counter, oracle_aui_count, extra_send_message_to_user_allowed=0)[source]
Return type:

Judgment

preliminary_checks(agent_events, oracle_events)[source]

Check if the agent calls the same tools the same number of times as the oracle, except for sending a message to the user, where extra calls may be allowed.

Return type:

Judgment

match_oracle_event(oracle_event)[source]

Match an oracle event with an agent event.

Return type:

Judgment

inner_call(env)[source]
Return type:

Judgment

class are.simulation.validation.judge.InContextJudge(config)[source]

Bases: BaseJudge

A baseline judge that puts all the agent and oracle events into the context of a model.

initialize_state(scenario)[source]
update_state(env)[source]
get_oracle_graph_str(event_id)[source]
Return type:

str

list_events_str(events, oracle_events=False)[source]
Return type:

str

build_user_prompt(agent_events, oracle_events)[source]
Return type:

str

inner_call(env)[source]
Return type:

bool | None

Judge States

class are.simulation.validation.base.BaseJudgeState(initialized=False, nb_turns=-1, turn_idx=-1, last_turn_success=True, last_turn_rationale='', scenario_start_time=0.0, scenario_tasks=<factory>, user_details=None, turn_to_oracle_events=<factory>, turn_to_oracle_graph=<factory>, oracle_event_id_to_turn_idx=<factory>, turn_to_agent_events=<factory>)[source]

Bases: object

initialized: bool = False
nb_turns: int = -1
turn_idx: int = -1
last_turn_success: bool = True
last_turn_rationale: str = ''
scenario_start_time: float = 0.0
scenario_tasks: list[str]
user_details: Contact | None = None
turn_to_oracle_events: list[list[CompletedOracleEvent]]
turn_to_oracle_graph: list[dict[str, list[str]]]
oracle_event_id_to_turn_idx: dict[str, int]
turn_to_agent_events: list[list[CompletedEvent]]
property agent_events: list[CompletedEvent]
property current_turn_agent_events: list[CompletedEvent]
property current_turn_oracle_events: list[CompletedOracleEvent]
property current_turn_oracle_graph: dict[str, list[str]]
__init__(initialized=False, nb_turns=-1, turn_idx=-1, last_turn_success=True, last_turn_rationale='', scenario_start_time=0.0, scenario_tasks=<factory>, user_details=None, turn_to_oracle_events=<factory>, turn_to_oracle_graph=<factory>, oracle_event_id_to_turn_idx=<factory>, turn_to_agent_events=<factory>)
class are.simulation.validation.judge_states.GraphPerEventJudgeState(initialized=False, nb_turns=-1, turn_idx=-1, last_turn_success=True, last_turn_rationale='', scenario_start_time=0.0, scenario_tasks=<factory>, user_details=None, turn_to_oracle_events=<factory>, turn_to_oracle_graph=<factory>, oracle_event_id_to_turn_idx=<factory>, turn_to_agent_events=<factory>, agent_idx_to_oracle_id=<factory>, oracle_id_to_agent_idx=<factory>, agent_id_to_oracle_id=<factory>)[source]

Bases: BaseJudgeState

agent_idx_to_oracle_id: dict[int, str]
oracle_id_to_agent_idx: dict[str, int]
agent_id_to_oracle_id: dict[str, str]
add_match(agent_idx, oracle_id)[source]
class are.simulation.validation.judge_states.InContextJudgeState(initialized=False, nb_turns=-1, turn_idx=-1, last_turn_success=True, last_turn_rationale='', scenario_start_time=0.0, scenario_tasks=<factory>, user_details=None, turn_to_oracle_events=<factory>, turn_to_oracle_graph=<factory>, oracle_event_id_to_turn_idx=<factory>, turn_to_agent_events=<factory>)[source]

Bases: BaseJudgeState

property agent_id_to_oracle_id: dict[str, str]

There is no one-to-one matching between agent events and oracle events in this judge

property user_name: str

Event Judge Classes

class are.simulation.validation.base.EventJudge(config, event_type, judge_type='')[source]

Bases: ABC

Base class for an event judge. An event judge compares an agent event and an oracle event and decides if the two match.

abstract compare(agent_event, oracle_event, **kwargs)[source]
Return type:

bool | None

class are.simulation.validation.event_judge.EnvUserEventJudge(event_type, config)[source]

Bases: EventJudge

A judge that compares a pair of environment/user events from the agent log and the oracle agent log. The two events match if their event ids are the same.

eq_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

compare(agent_event, oracle_event, **kwargs)[source]
Return type:

bool | None

class are.simulation.validation.event_judge.AgentEventJudge(config)[source]

Bases: EventJudge

A judge that compares a pair of agent events from the agent log and the oracle agent log.

event_time_checker(agent_event_time, oracle_event_time, pre_event_tolerance_seconds=5.0, post_event_tolerance_seconds=20.0, event_time_comparator=None)[source]

Checks if the agent event time is within the allowed tolerance range compared to the oracle event time.

Return type:

bool

Args:

agent_event_time (float): The time of the agent event (relative or absolute).
oracle_event_time (float): The time of the oracle event (relative or absolute).
pre_event_tolerance_seconds (float): The allowed time in seconds before the oracle event time.
post_event_tolerance_seconds (float): The allowed time in seconds after the oracle event time.
event_time_comparator (str | None): The type of comparison to perform between the agent and oracle event times. The arg type is str instead of EventTimeComparator for better readability in the tracer.

Returns:

bool: True if the agent event time is within the allowed tolerance range, False otherwise.

check_time(agent_event, oracle_event, max_parent_oracle_event_time, max_parent_agent_event_time)[source]
Return type:

bool

compare(agent_event, oracle_event, **kwargs)[source]
Return type:

bool | None

Tool Judge Classes

class are.simulation.validation.base.ToolJudge(config, judge_type)[source]

Bases: ABC

Base class for a tool judge. A tool judge compares an agent event and an oracle event that each represent a tool call, and decides whether the two tool calls match.

abstract compare(agent_event, oracle_event, **kwargs)[source]
Return type:

bool | None

class are.simulation.validation.tool_judge.HardToolJudge(config)[source]

Bases: ToolJudge

A judge that performs a scripted check on some action args to compare an agent and oracle event representing a tool call.

eq_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

unordered_list_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

path_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

unordered_path_list_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

list_attendees_checker(x_agent, x_oracle, tolerance_list_str=None, **kwargs)[source]
Return type:

bool

unordered_str_list_with_tolerance_checker(x_agent, x_oracle, tolerance_list_str=None, **kwargs)[source]
Return type:

bool

datetime_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

eq_str_strip_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

phone_number_checker(x_agent, x_oracle, **kwargs)[source]
Return type:

bool

contain_any_checker(x_agent, targets, **kwargs)[source]
Return type:

bool

contain_all_checker(x_agent, targets, **kwargs)[source]
Return type:

bool

compare(agent_event, oracle_event, **kwargs)[source]
Return type:

bool

class are.simulation.validation.tool_judge.SoftToolJudge(config)[source]

Bases: ToolJudge

A soft judge that compares some action args of an agent and oracle event representing a tool call using an LLM.

describe_action_args(args)[source]
Return type:

str

equality_checker(agent_args, oracle_args, **kwargs)[source]
Return type:

bool

placeholder_checker(agent_args, **kwargs)[source]
Return type:

bool

extract_subtask(oracle_action_call, task)[source]
Return type:

str

content_checker(agent_args, oracle_args, today_date, user_address, subtask, **kwargs)[source]
Return type:

bool | None

signature_checker(agent_args, user_name, **kwargs)[source]
Return type:

bool | None

sanity_checker(agent_args, task='', previous_task='', **kwargs)[source]
Return type:

bool | None

cab_checker(agent_args, oracle_args, user_address, **kwargs)[source]
Return type:

bool | None

email_checker(agent_args, oracle_args, today_date, **kwargs)[source]
Return type:

bool | None

message_checker(agent_args, oracle_args, today_date, **kwargs)[source]
Return type:

bool | None

event_checker(agent_args, oracle_args, user_address, subtask, **kwargs)[source]
Return type:

bool | None

user_message_checker(agent_args, oracle_args, subtask, **kwargs)[source]
Return type:

bool | None

tone_checker(agent_args, **kwargs)[source]
Return type:

bool | None

get_checker_kwargs(kwargs, oracle_event, oracle_args)[source]
Return type:

dict[str, Any]

compare(agent_event, oracle_event, **kwargs)[source]
Return type:

bool | None

class are.simulation.validation.tool_judge.MildToolJudge(config)[source]

Bases: ToolJudge

A mild judge that combines a hard and a soft judge to compare an agent and oracle event representing a tool call. It first calls the hard judge and, if it passes, then calls the soft judge.

compare(agent_event, oracle_event, **kwargs)[source]
Return type:

bool | None

Configs Classes

class are.simulation.validation.configs.BaseJudgeConfig(tracer=None)[source]

Bases: object

tracer: Optional[Callable] = None
class are.simulation.validation.configs.GraphPerEventJudgeConfig(tracer=None, check_time_threshold_seconds=1.0, pre_event_tolerance_seconds=10.0, post_event_tolerance_seconds=25.0, per_tool_arg_to_checker_type=<factory>, engine=<factory>, per_tool_soft_checker_types=<factory>, event_id_to_checker_params=None, extra_send_message_to_user_allowed=1)[source]

Bases: BaseJudgeConfig

check_time_threshold_seconds: float = 1.0
pre_event_tolerance_seconds: float = 10.0
post_event_tolerance_seconds: float = 25.0
per_tool_arg_to_checker_type: ToolArgCheckerTypeRegistry
engine: Callable
per_tool_soft_checker_types: ToolSoftCheckerTypeRegistry
event_id_to_checker_params: dict[str, list[ToolCheckerParam]] | None = None
extra_send_message_to_user_allowed: int = 1
class are.simulation.validation.configs.ScriptedGraphPerEventJudgeConfig(tracer=None, check_time_threshold_seconds=1.0, pre_event_tolerance_seconds=10.0, post_event_tolerance_seconds=25.0, per_tool_arg_to_checker_type=<factory>, engine=<factory>, per_tool_soft_checker_types=<factory>, event_id_to_checker_params=<factory>, extra_send_message_to_user_allowed=1)[source]

Bases: GraphPerEventJudgeConfig

Config for the scripted graph per event judge. In the scripted judge, the soft judge is deactivated and scripted checks are instead performed by the hard judge. The event_id_to_checker_params field specifies the scripted checks to perform.

class are.simulation.validation.configs.InContextJudgeConfig(tracer=None, check_time_threshold_seconds=1.0, pre_event_tolerance_seconds=10.0, post_event_tolerance_seconds=25.0, time_system_prompt_template='\\nAll agent actions matching an oracle action with a delay exceeding {{check_time_threshold_seconds}} seconds relative to its parent should be executed within the following time window:\\n[oracle action delay - {{pre_event_tolerance_seconds}} seconds, oracle action delay + {{post_event_tolerance_seconds}} seconds]\\n', per_tool_evaluation_criteria=<factory>, tool_to_selected_args=<factory>, engine=<factory>, system_prompt_template='### Evaluator Task\\nYou are an impartial evaluator responsible for assessing the success of an agent assisting a user within an environment in comparison to an oracle agent.\\nIn this environment, the user communicates with the agent via a UserAgentInterface, and the agent utilizes tools from various apps to fulfill user requests.\\nYou will receive two lists of actions (each action is a tool call): one representing actions taken by the agent and another representing actions performed by a skilled oracle agent that perfectly fulfilled the user\\'s request.\\n\\n###Instructions\\nFirst, you will list the differences and similarities between the actions taken by the agent and those performed by the oracle agent.\\nThen, based on the evaluation criteria below, you will decide if the agent\\'s actions match the oracle agent\\'s actions within acceptable tolerance limits.\\n\\n### Evaluation Criteria\\nThe agent\\'s actions should be executed in an order that does not violate the causal relationships between oracle actions provided by with the parent tool call ids.\\nThe number of calls to each tool should be the same for the agent and the oracle agent actions.\\nThe agent\\'s action call parameters should be free of significant grammatical or spelling errors and maintain an appropriate tone.\\n{{evaluation_criteria}}\\n\\n### Input Format\\nThe input will be provided in the following format:\\n\\nAgent Actions:\\n\\n< List of agent actions in the format:\\n    Tool name: <name of the tool used in the action>\\n    Tool call time: <time of the action>\\n    Arguments:\\n    <tool arguments>\\n>\\n\\nOracle Actions:\\n\\n< List of oracle actions in the format:\\n    Tool call id: <id of the oracle tool call>\\n    Parent tool call ids: <ids of the parent tool calls>\\n    Tool name: <name of the tool used in the action>\\n    Tool call time: <time of the action>\\n    Arguments:\\n    <tool arguments>\\n>\\n\\nTask: <user\\'s task>\\n\\nPrevious task: <previous task solved by the agent>\\n\\nUser name: <name of the user>\\n\\n### Output Format\\nFor the evaluation, first list the differences and similarities between the agent and oracle agent actions.\\nThen give your reasoning as to why the agent\\'s actions match or critically differ from the oracle agent actions.\\nFinally, provide your final evaluation by strictly following this format: "[[success]]" if the agent actions match the oracle agent actions otherwise "[[failure]]".\\nReport your evaluation in the following format:\\n\\n-Similarities and differences: <List the differences and similarities between the agent and oracle agent actions.>\\n-Reasoning: <Detailed explanation of why the agent\\'s actions match or not the oracle agent actions.>\\n-Evaluation: <[[success]] if the agent actions match oracle agent actions [[failure]] otherwise.>\\n\\n### Your Evaluation\\nFor the following input, provide your 
evaluation following the output format specified above.\\n')[source]

Bases: BaseJudgeConfig

check_time_threshold_seconds: float = 1.0
pre_event_tolerance_seconds: float = 10.0
post_event_tolerance_seconds: float = 25.0
time_system_prompt_template: str = '\nAll agent actions matching an oracle action with a delay exceeding {{check_time_threshold_seconds}} seconds relative to its parent should be executed within the following time window:\n[oracle action delay - {{pre_event_tolerance_seconds}} seconds, oracle action delay + {{post_event_tolerance_seconds}} seconds]\n'
per_tool_evaluation_criteria: ToolCriteriaRegistry
tool_to_selected_args: ToolArgCheckerTypeRegistry
engine: Callable
system_prompt_template: str = '### Evaluator Task\nYou are an impartial evaluator responsible for assessing the success of an agent assisting a user within an environment in comparison to an oracle agent.\nIn this environment, the user communicates with the agent via a UserAgentInterface, and the agent utilizes tools from various apps to fulfill user requests.\nYou will receive two lists of actions (each action is a tool call): one representing actions taken by the agent and another representing actions performed by a skilled oracle agent that perfectly fulfilled the user\'s request.\n\n###Instructions\nFirst, you will list the differences and similarities between the actions taken by the agent and those performed by the oracle agent.\nThen, based on the evaluation criteria below, you will decide if the agent\'s actions match the oracle agent\'s actions within acceptable tolerance limits.\n\n### Evaluation Criteria\nThe agent\'s actions should be executed in an order that does not violate the causal relationships between oracle actions provided by with the parent tool call ids.\nThe number of calls to each tool should be the same for the agent and the oracle agent actions.\nThe agent\'s action call parameters should be free of significant grammatical or spelling errors and maintain an appropriate tone.\n{{evaluation_criteria}}\n\n### Input Format\nThe input will be provided in the following format:\n\nAgent Actions:\n\n< List of agent actions in the format:\n    Tool name: <name of the tool used in the action>\n    Tool call time: <time of the action>\n    Arguments:\n    <tool arguments>\n>\n\nOracle Actions:\n\n< List of oracle actions in the format:\n    Tool call id: <id of the oracle tool call>\n    Parent tool call ids: <ids of the parent tool calls>\n    Tool name: <name of the tool used in the action>\n    Tool call time: <time of the action>\n    Arguments:\n    <tool arguments>\n>\n\nTask: <user\'s task>\n\nPrevious task: <previous task solved by the agent>\n\nUser name: <name of the user>\n\n### Output Format\nFor the evaluation, first list the differences and similarities between the agent and oracle agent actions.\nThen give your reasoning as to why the agent\'s actions match or critically differ from the oracle agent actions.\nFinally, provide your final evaluation by strictly following this format: "[[success]]" if the agent actions match the oracle agent actions otherwise "[[failure]]".\nReport your evaluation in the following format:\n\n-Similarities and differences: <List the differences and similarities between the agent and oracle agent actions.>\n-Reasoning: <Detailed explanation of why the agent\'s actions match or not the oracle agent actions.>\n-Evaluation: <[[success]] if the agent actions match oracle agent actions [[failure]] otherwise.>\n\n### Your Evaluation\nFor the following input, provide your evaluation following the output format specified above.\n'

Factory classes

class are.simulation.validation.factory.JudgeFactory[source]

Bases: object

Judgment Classes

class are.simulation.validation.judgment.Failure[source]

Bases: object

class are.simulation.validation.judgment.ToolCallCountsFailure(agent_counter, agent_aui_count, oracle_counter, oracle_aui_count, extra_send_message_to_user_allowed=0)[source]

Bases: Failure

agent_counter: Counter
agent_aui_count: int
oracle_counter: Counter
oracle_aui_count: int
extra_send_message_to_user_allowed: int = 0
class are.simulation.validation.judgment.EventComparisonFailureType(value)[source]

Bases: Enum

An enumeration.

CAUSALITY = 'causality'
ALREADY_MATCHED = 'already matched'
TOOL_JUDGE_REJECT = 'tool judge reject'
class are.simulation.validation.judgment.EventComparisonFailure(agent_tool_name, agent_event_id, oracle_tool_name, oracle_event_id, failure_type)[source]

Bases: Failure

agent_tool_name: str
agent_event_id: str
oracle_tool_name: str
oracle_event_id: str
failure_type: EventComparisonFailureType
class are.simulation.validation.judgment.OracleEventMatchingFailure(oracle_tool_name, oracle_tool_args, comparison_failures)[source]

Bases: Failure

oracle_tool_name: str
oracle_tool_args: dict[str, str]
comparison_failures: list[EventComparisonFailure]
class are.simulation.validation.judgment.EnvOracleMatchingFailure(oracle_event_id)[source]

Bases: Failure

oracle_event_id: str
class are.simulation.validation.judgment.Judgment(success=False, failure=None, agent_event_id_to_oracle_event_id=<factory>)[source]

Bases: object

success: bool | None = False
failure: str | Failure | None = None
agent_event_id_to_oracle_event_id: dict[str, str]