Tool Order
The Tool Order scorer is an agentic scorer that evaluates whether tools are called in the correct sequence and optionally with the correct parameters. This is particularly useful for evaluating agent workflows where the order of tool calls matters for the overall success of the task.
Scorer Breakdown
The Tool Order scorer offers two distinct scoring modes that can be configured during initialization to match your evaluation needs. Additionally, if tool parameters are included in the expected ordering, they will be checked against the actual parameters used. If parameters are not specified, only the tool order will be evaluated.
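One way to picture the parameter check is as a dictionary comparison gated on whether parameters were specified. The sketch below illustrates plausible semantics only; the library's actual matching logic may differ:

```python
def tool_call_matches(expected: dict, actual: dict) -> bool:
    """Illustrative sketch of how an expected tool call
    might be compared against an actual one."""
    # Tool names must always agree.
    if expected["tool_name"] != actual["tool_name"]:
        return False
    # If the expected entry specifies no parameters,
    # only the tool order is evaluated.
    if "parameters" not in expected:
        return True
    # Otherwise the actual parameters must also match.
    return expected["parameters"] == actual["parameters"]
```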
Ordering Match (Default)
Checks that the expected tools appear in the actual tool-call sequence in the same relative order. Returns a score of 1.0 if they do, otherwise 0.0. This mode is useful when you care about the relative order of the expected tools but are fine with other tool calls occurring in between.
scorer = ToolOrderScorer()
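Conceptually, ordering match is a subsequence check over tool names. This is a minimal sketch of that idea, not the scorer's actual implementation:

```python
def ordering_match(expected: list[str], actual: list[str]) -> float:
    """Return 1.0 if the expected tool names appear in the actual
    call sequence in the same relative order; other calls may be
    interleaved between them."""
    it = iter(actual)
    # `name in it` advances the iterator, so each expected name
    # must be found after the previous one.
    return 1.0 if all(name in it for name in expected) else 0.0

# An extra tool call between the expected ones still passes.
ordering_match(["get_attractions", "get_weather"],
               ["get_attractions", "log_event", "get_weather"])  # 1.0
```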
Exact Match
Checks that the actual tool-call sequence matches the expected ordering exactly, with no extra, missing, or reordered calls. Returns a score of 1.0 on an exact match, otherwise 0.0. This mode is useful when you care about the precise sequence of every tool call.
scorer = ToolOrderScorer(exact_match=True)
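In contrast to ordering match, exact match tolerates no interleaved calls. A minimal sketch of the idea (again, illustrative rather than the library's implementation):

```python
def exact_order_score(expected: list[str], actual: list[str]) -> float:
    """Return 1.0 only if every call appears, in order,
    with nothing extra in between."""
    return 1.0 if expected == actual else 0.0

# An interleaved call that ordering match would allow now fails.
exact_order_score(["get_attractions", "get_weather"],
                  ["get_attractions", "log_event", "get_weather"])  # 0.0
```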
Example Agent Tool Structure
Here's how to structure your tools with the judgment.observe decorator:
from judgeval import Tracer

judgment = Tracer(project_name="my_agent")

class MyAgent:
    @judgment.observe(span_type="tool")
    def get_attractions(self, destination: str) -> str:
        """Get attractions for a destination"""
        pass

    @judgment.observe(span_type="tool")
    def get_weather(self, destination: str, start_date: str, end_date: str) -> str:
        """Get weather forecast for a destination"""
        pass

    @judgment.observe(span_type="function")
    def run_agent(self, prompt: str) -> str:
        """Run the agent with the given prompt"""
        pass
Sample Implementation
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer

client = JudgmentClient()

# Define example with expected tool sequence
example = Example(
    input={"prompt": "What's the attraction and weather in Paris for early June?"},
    expected_tools=[
        {
            "tool_name": "get_attractions",
            "parameters": {
                "destination": "Paris"
            }
        },
        {
            "tool_name": "get_weather",
            "parameters": {
                "destination": "Paris",
                "start_date": "2025-06-01",
                "end_date": "2025-06-02"
            }
        }
    ]
)

scorer = ToolOrderScorer(exact_match=True)
agent = MyAgent()

# Run evaluation
results = client.assert_test(
    examples=[example],
    scorers=[scorer],
    function=agent.run_agent,
)
You can also define your test cases in a YAML file:
# tests.yaml
examples:
  - input:
      prompt: "What's the attraction and weather in Paris for early June?"
    expected_tools:
      - tool_name: "get_attractions"
        parameters:
          destination: "Paris"
      - tool_name: "get_weather"
        parameters:
          destination: "Paris"
          start_date: "2025-06-01"
          end_date: "2025-06-02"
Then run the evaluation using the YAML file:
client.assert_test(
    test_file="tests.yaml",
    scorers=[scorer],
    function=agent.run_agent,
)