The Agents Research Environments judge system provides comprehensive validation capabilities for evaluating agent performance against ground truth oracle scenarios.
The judge system compares agent execution traces against oracle (ground truth) scenarios to determine success or failure. It operates in two main modes:
Online Validation: Real-time validation during scenario execution
Offline Validation: Post-execution validation using the judge command
The system uses a hierarchical approach with multiple types of judges that evaluate different aspects of agent behavior:
Event-level validation: Comparing individual agent actions against oracle actions
Tool-level validation: Verifying correct tool usage and parameters
Temporal validation: Ensuring actions occur within acceptable time windows
Causal validation: Verifying proper dependency ordering between actions
Tool Validation: Uses specialized tool judges for event comparison
Key Features:
Topological ordering of oracle events based on dependencies
Support for extra send_message_to_user calls from agents
Detailed failure reporting with specific mismatch information
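The topological ordering of oracle events can be illustrated with a small self-contained sketch using the standard library's graphlib (the event names here are hypothetical, not part of the ARE API):

```python
from graphlib import TopologicalSorter

# Hypothetical oracle dependency graph: each event maps to the set of
# parent events that must be matched before it.
oracle_deps = {
    "send_email": {"read_file"},
    "send_message_to_user": {"send_email"},
    "read_file": set(),
}

# static_order() yields events so that every parent precedes its children,
# which is the order the judge walks when matching agent events.
order = list(TopologicalSorter(oracle_deps).static_order())
print(order)  # read_file before send_email before send_message_to_user
```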
Usage Example:
# Run judge mode for online validation
uvx --from meta-agents-research-environments are-benchmark run -d /path/to/scenarios --limit 10

# Run judge mode for offline validation
uvx --from meta-agents-research-environments are-benchmark judge -d /path/to/scenarios --limit 10
A specialized implementation of the GraphPerEventJudge that uses purely deterministic, scripted validation without LLM-based soft validation.
This judge is ideal for scenarios where you need predictable, reproducible validation results and have well-defined validation criteria.
Core Characteristics:
LLM-Free Operation: Completely deactivates soft judges and relies only on hard, scripted checkers
Event-Specific Validation: Uses custom validation rules per oracle event
Key Configuration:
The judge requires an event_id_to_checker_params mapping that defines specific validation rules for each oracle event:
# Example configuration for scripted validation
scripted_config = ScriptedGraphPerEventJudgeConfig(
    event_id_to_checker_params={
        "oracle_send_email": [
            ToolCheckerParam(
                arg_name="recipient",
                checker_type=CheckerType.eq_checker,
                tool_name="EmailApp__send_email",
            ),
            ToolCheckerParam(
                arg_name="subject",
                checker_type=CheckerType.contain_any_checker,
                tool_name="EmailApp__send_email",
                checker_args={"targets": ["urgent", "important"]},
            ),
        ],
        "oracle_send_message": [
            ToolCheckerParam(
                arg_name="content",
                checker_type=CheckerType.contain_all_checker,
                tool_name="MessagingApp__send_message",
                checker_args={"targets": ["meeting", "2pm"]},
            ),
        ],
    },
    extra_send_message_to_user_allowed=0,
    pre_event_tolerance_seconds=5.0,
    post_event_tolerance_seconds=20.0,
)
ToolCheckerParam Structure:
Each ToolCheckerParam defines validation rules for specific tool arguments:
arg_name: The argument name to validate (e.g., “content”, “recipient”)
checker_type: The type of validation to perform (see available checkers below)
tool_name: The full tool name including app prefix (e.g., “EmailApp__send_email”)
checker_args: Additional parameters for the checker (optional)
Available Checker Types:
CheckerType.eq_checker: Exact equality comparison
CheckerType.contain_any_checker: Checks if argument contains any of the target strings
CheckerType.contain_all_checker: Checks if argument contains all target strings
CheckerType.unordered_list_checker: Set-based list comparison ignoring order
CheckerType.phone_number_checker: Phone number format validation
CheckerType.datetime_checker: Date/time format validation
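The semantics of the simpler hard checkers can be approximated with a few self-contained helper functions. This is an illustrative sketch only, not the ARE implementation (the real checkers also receive oracle values and checker_args through the ToolCheckerParam machinery):

```python
def eq_checker(agent_val, oracle_val):
    """Exact equality comparison."""
    return agent_val == oracle_val

def contain_any_checker(agent_val, targets):
    """True if the argument contains at least one of the target strings."""
    return any(t in agent_val for t in targets)

def contain_all_checker(agent_val, targets):
    """True if the argument contains every target string."""
    return all(t in agent_val for t in targets)

def unordered_list_checker(agent_list, oracle_list):
    """Set-based list comparison ignoring order (and duplicates)."""
    return set(agent_list) == set(oracle_list)

print(contain_all_checker("meeting moved to 2pm", ["meeting", "2pm"]))  # True
```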
Example Scenario Integration:
class MyScenario(Scenario):
    def __init__(self):
        # Define checker parameters for each oracle event
        self.d_checker_params = {
            "oracle_book_restaurant": [
                ToolCheckerParam(
                    arg_name="restaurant_name",
                    checker_type=CheckerType.contain_any_checker,
                    tool_name="BookingApp__make_reservation",
                    checker_args={"targets": ["Italian", "Chinese"]},
                ),
                ToolCheckerParam(
                    arg_name="party_size",
                    checker_type=CheckerType.eq_checker,
                    tool_name="BookingApp__make_reservation",
                ),
            ]
        }

    def initialize(self, **kwargs):
        super().initialize(**kwargs)
        self.judge = JudgeFactory()(
            ScriptedGraphPerEventJudgeConfig(
                event_id_to_checker_params=self.d_checker_params,
                extra_send_message_to_user_allowed=1,
            )
        )
# Hard validation for exact matches
arg_to_checker_type = {
    "recipient": CheckerType.eq_checker,
    "file_paths": CheckerType.unordered_path_list_checker,
    "phone": CheckerType.phone_number_checker,
}
The SoftToolJudge uses specialized LLM-based validation for semantic comparison of tool arguments.
It employs multiple targeted checkers, each optimized for specific validation scenarios.
Architecture:
The judge operates in a two-phase approach:
Equality Pre-check: Quick comparison to avoid unnecessary LLM calls when arguments are identical
Specialized Soft Checkers: LLM-powered validation using domain-specific checkers
Available Soft Checkers:
content_checker: Validates semantic equivalence of content against oracle and task context
signature_checker: Ensures proper user name/signature usage in communications
sanity_checker: Performs basic reasonableness checks on agent outputs
placeholder_checker: Detects and rejects placeholder text (e.g., “[User’s Name]”, “[Your Name]”)
cab_checker: Validates cab/ride booking details against user address
email_checker: Specialized validation for email compositions
message_checker: Validates message content and formatting
user_message_checker: Validates user-directed messages for appropriateness
event_checker: Validates event details against context and user information
tone_checker: Ensures appropriate communication tone
Key Features:
Equality Pre-check: Avoids LLM calls when agent and oracle arguments are identical
Subtask Extraction: Automatically extracts relevant subtasks from the broader task context
Context-Aware Validation: Uses user details, dates, and task context for validation
Placeholder Detection: Built-in detection for common placeholder text patterns
Validation Process:
Initial Setup: Identifies arguments marked for LLM checking
Equality Check: Compares normalized agent and oracle arguments
Context Preparation: Extracts user details, dates, and subtasks as needed
Checker Execution: Runs configured soft checkers in sequence
Result Aggregation: Returns success only if all checkers pass
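The equality pre-check in step 2 is simple in spirit: normalize both values and skip the LLM entirely when they already match. A hypothetical sketch (the normalization rules here are assumptions for illustration, not the ARE implementation):

```python
def normalize(value: str) -> str:
    # Assumed normalization: collapse whitespace and compare case-insensitively.
    return " ".join(value.split()).lower()

def needs_llm_check(agent_arg: str, oracle_arg: str) -> bool:
    """Return True only when the cheap equality pre-check fails,
    i.e. when an LLM-based semantic comparison is actually needed."""
    return normalize(agent_arg) != normalize(oracle_arg)

print(needs_llm_check("Hello  World", "hello world"))  # False: equal after normalization
```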
Example Configuration:
# Configure split soft judge with multiple checkers
config = SplitSoftToolJudgeConfig(
    tool_name="send_email",
    arg_to_checker_type={
        "subject": CheckerType.llm_checker,
        "body": CheckerType.llm_checker,
        "recipient": CheckerType.eq_checker,  # Hard check for exact match
    },
    soft_checker_types=[
        SoftCheckerType.placeholder_checker,  # Reject placeholder text
        SoftCheckerType.content_checker,      # Semantic content validation
        SoftCheckerType.tone_checker,         # Appropriate communication tone
        SoftCheckerType.signature_checker,    # Proper user signature
    ],
    engine=llm_engine,
)
Context Extraction:
The judge automatically extracts relevant context for validation:
User Details: Name and address from scenario user information
Temporal Context: Event date/time formatted for validation
Task Context: Extracts relevant subtasks using LLM-based extraction
Previous Tasks: Maintains context from prior scenario steps
Placeholder Detection:
Built-in detection for common placeholder patterns:
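A minimal regex-based approximation of placeholder detection (the heuristic below is an assumption for illustration; the real checker's pattern set may differ):

```python
import re

# Assumed heuristic: flag bracketed template tokens such as "[Your Name]".
PLACEHOLDER_RE = re.compile(r"\[[A-Za-z][A-Za-z' ]*\]")

def has_placeholder(text: str) -> bool:
    """True if the text still contains an unfilled placeholder token."""
    return PLACEHOLDER_RE.search(text) is not None

print(has_placeholder("Best regards, [Your Name]"))  # True
print(has_placeholder("Best regards, Alice"))        # False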
Event time comparators determine how the agent event time is compared to the oracle event time:
(default, no comparator): Agent event must occur within the tolerance window around the oracle time
LESS_THAN: Agent event must occur before the oracle time (plus post-tolerance)
GREATER_THAN: Agent event must occur after the oracle time (minus pre-tolerance)
Configuration Options:
# Time validation settings
check_time_threshold_seconds=30.0  # Minimum time gap to check
pre_event_tolerance_seconds=5.0    # Allowed time before oracle
post_event_tolerance_seconds=20.0  # Allowed time after oracle
The judge system enforces proper dependency ordering:
Dependency Graph
Oracle events include parent-child relationships
Causality Rules
All parent events must be matched before child events
Agent events must respect the same ordering constraints
Violations result in validation failure
Example:
Oracle Event Graph:
A → B → D
A → C → D
Valid Agent Sequence: A, B, C, D (or A, C, B, D)
Invalid Agent Sequence: B, A, C, D (B before A violates causality)
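The causality rule from the example can be checked mechanically: walk the matched agent sequence and verify that every event's parents have already been seen. A self-contained sketch using the event graph above:

```python
def respects_causality(sequence, parents):
    """Check that every event appears only after all of its parents."""
    seen = set()
    for event in sequence:
        if not parents.get(event, set()) <= seen:
            return False  # a parent has not been matched yet
        seen.add(event)
    return True

# Oracle graph from the example: A -> B -> D and A -> C -> D
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(respects_causality(["A", "B", "C", "D"], parents))  # True
print(respects_causality(["B", "A", "C", "D"], parents))  # False: B before A
```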
Indicates mismatched tool usage counts between agent and oracle:
Agent and oracle counters do not match for the following tools:
- Tool 'send_email': Agent count 2, Oracle count 1
- Tool 'read_file': Agent count 0, Oracle count 1
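The preliminary counter check behind this message can be sketched with collections.Counter. This is an illustrative approximation (not the ARE implementation), including the allowance for extra send_message_to_user calls:

```python
from collections import Counter

def counter_mismatches(agent_tools, oracle_tools, extra_user_messages_allowed=0):
    """Return (tool, agent_count, oracle_count) for every mismatched tool."""
    agent_counts, oracle_counts = Counter(agent_tools), Counter(oracle_tools)
    mismatches = []
    for tool in sorted(set(agent_counts) | set(oracle_counts)):
        a, o = agent_counts[tool], oracle_counts[tool]
        # Extra user messages are tolerated up to the configured budget.
        if tool == "send_message_to_user" and 0 <= a - o <= extra_user_messages_allowed:
            continue
        if a != o:
            mismatches.append((tool, a, o))
    return mismatches

print(counter_mismatches(["send_email", "send_email"], ["send_email", "read_file"]))
# [('read_file', 0, 1), ('send_email', 2, 1)]
```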
# Run judge on local scenarios
are-benchmark judge -d /path/to/scenarios

# Run judge on Hugging Face dataset
are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation
The judge system uses its own separate LLM engine for soft validation (semantic comparison of tool arguments). This LLM engine is independent of the model configuration you specify for your main agent, and can be customized for cost control and performance optimization.
Default Judge Configuration
# Judge command - uses default judge model configuration
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
    --output_dir ./judge_results
Custom Judge Model Configuration
You can specify custom judge model settings to control costs and performance:
# Use custom judge model and provider
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
    --judge_model custom-judge-model --judge_provider custom-provider \
    --output_dir ./judge_results
# Use different provider for judge vs main agent
uvx --from meta-agents-research-environments are-benchmark run --hf are-benchmark/gaia2 --hf-split test \
    --model custom-model --model_provider custom-model-provider \
    --judge_model custom-judge-model --judge_provider custom-judge-provider
# Use custom endpoint for judge model
uvx --from meta-agents-research-environments are-benchmark judge --hf are-benchmark/gaia2 --hf-split validation \
    --judge_model custom-judge-model --judge_provider custom-provider \
    --judge_endpoint http://localhost:8000
Judge Model Configuration Options
–judge_model
Model to use for the judge system validation. Use a capable model for best evaluation quality.
–judge_provider
Provider for the judge model. If not specified, uses the same provider as the main model.
Supports all LiteLLM providers: openai, anthropic, huggingface, llama-api, etc.
Allows separate billing control from your main agent model
–judge_endpoint
Custom endpoint URL for the judge model (optional).
Useful for local deployments or custom inference servers
Must be OpenAI-compatible API format
Note
Reproducible Results: For consistent and reproducible evaluation results, use llama3.3-70B as the judge model.
Note
Judge LLM Independence: The judge system uses its own configurable LLM engine, which is separate from and independent of the model configuration you specify for your main agent (–model, –model_provider, etc.). The judge’s LLM is used for:
SoftToolJudge: Semantic comparison of tool arguments when exact matching isn’t sufficient
InContextJudge: LLM-based evaluation of entire agent traces
Hard validation (exact matching, scripted checks) does not require LLM inference and runs regardless of any model configuration.
# Configure judge for email scenario validation
judge_config = GraphPerEventJudgeConfig(
    # Time validation
    check_time_threshold_seconds=30.0,
    pre_event_tolerance_seconds=5.0,
    post_event_tolerance_seconds=20.0,
    # Checker types for each tool
    per_tool_arg_to_checker_type={
        "send_email": {
            "recipient": CheckerType.eq_checker,
            "subject": CheckerType.llm_checker,
            "body": CheckerType.llm_checker,
        },
        "read_file": {
            "file_path": CheckerType.path_checker,
        },
    },
    # Soft checkers
    per_tool_soft_checker_types={
        "send_email": [
            SoftCheckerType.placeholder_checker,
            SoftCheckerType.content_checker,
            SoftCheckerType.tone_checker,
            SoftCheckerType.signature_checker,
        ],
    },
    # Allow extra user messages
    extra_send_message_to_user_allowed=1,
)
Verifies the correctness of agent events by comparing them with oracle events.
The judge performs a preliminary check to ensure that the tool call counts are the same in both the oracle and agent events.
If this check passes, it orders the oracle events in topological order based on their dependencies.
Then it attempts to match each one with an agent event using two types of tool judges:
- A hard tool judge for specific arguments
- A soft tool judge (LLM-based) for other arguments.
Once a match is found, the judge verifies the causality by ensuring that all parent events of the oracle event have already been matched with previous agent events.
If all oracle events are successfully matched, the judge returns a success.
Checks whether the agent calls the same tools the same number of times as the oracle,
except for sending a message to the user, where extra calls may be allowed.
A judge that compares a pair of environment/user events from the agent log and the oracle agent log.
The two events match if their event IDs are the same.
agent_event_time (float): The time of the agent event (relative or absolute)
oracle_event_time (float): The time of the oracle event (relative or absolute).
pre_event_tolerance_seconds (float): The allowed time in seconds before the oracle event time.
post_event_tolerance_seconds (float): The allowed time in seconds after the oracle event time.
event_time_comparator (str | None): The type of comparison to perform between the agent and oracle event times. The arg type is str instead of EventTimeComparator for better readability in the tracer.
Returns:
bool: True if the agent event time is within the allowed tolerance range, False otherwise.
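A minimal sketch of this tolerance check, assuming agent and oracle times share the same reference and the comparator is passed as a string (as described above); this is an illustration, not the ARE implementation:

```python
def time_within_tolerance(agent_event_time, oracle_event_time,
                          pre_event_tolerance_seconds, post_event_tolerance_seconds,
                          event_time_comparator=None):
    """Return True if the agent event time satisfies the comparator
    relative to the oracle event time, within the configured tolerances."""
    delta = agent_event_time - oracle_event_time
    if event_time_comparator == "LESS_THAN":
        # Agent must act before the oracle time (plus post-tolerance).
        return delta <= post_event_tolerance_seconds
    if event_time_comparator == "GREATER_THAN":
        # Agent must act after the oracle time (minus pre-tolerance).
        return delta >= -pre_event_tolerance_seconds
    # Default: within the tolerance window around the oracle time.
    return -pre_event_tolerance_seconds <= delta <= post_event_tolerance_seconds

print(time_within_tolerance(103.0, 100.0, 5.0, 20.0))  # True: within [-5, +20]
print(time_within_tolerance(130.0, 100.0, 5.0, 20.0))  # False: 30s after oracle
```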
Base class for a tool judge. A tool judge compares an agent and an oracle event representing a tool call.
It decides if the two actions of the events, representing a tool call match.
A mild judge that combines a hard and soft judge to compare an agent and oracle event representing a tool call.
It first calls the hard judge and, if that passes, then calls the soft judge.
Config for the scripted graph per event judge.
A scripted judge is one where the soft judge is deactivated and scripted checks are instead performed by the hard judge.
The event_id_to_checker_params field is used to specify the scripted checks to perform.