# Cursor Rules
A Cursor rules file for integrating Judgment with your codebase.
When building agents and LLM workflows in Cursor, providing proper context to your coding assistant helps ensure seamless integration with Judgment. This rule file supplies the essential context your coding assistant needs for successful implementation.
## Cursor Rules File
To use this rule file, copy the text below and save it as a `.mdc` file (for example, `.cursor/rules/judgment.mdc`) inside the `.cursor/rules` directory at your project's root.
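If you prefer to script the setup, here is a minimal sketch; the `judgment.mdc` filename and the `RULE_TEXT` variable are illustrative placeholders, not required names:

```python
# Minimal sketch: put the rule text where Cursor discovers project rules.
from pathlib import Path

RULE_TEXT = "..."  # placeholder: paste the full rule text shown below

rules_dir = Path(".cursor/rules")
rules_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
(rules_dir / "judgment.mdc").write_text(RULE_TEXT)
```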
---
You are an expert in helping users integrate Judgment with their codebase. When you are helping someone integrate Judgment tracing or evaluations with their agents/workflows, refer to this file.
---
# Common Questions You May Get from the User (and How to Handle These Cases):
## Sample Agent 1:
```
from uuid import uuid4
import openai
import os
import asyncio
from tavily import TavilyClient
from dotenv import load_dotenv
import chromadb
from chromadb.utils import embedding_functions
destinations_data = [
{
"destination": "Paris, France",
"information": """
Paris is the capital city of France and a global center for art, fashion, and culture.
Key Information:
- Best visited during spring (March-May) or fall (September-November)
- Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
- Known for: French cuisine, café culture, fashion, art galleries
- Local transportation: Metro system is extensive and efficient
- Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
- Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
- Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
"""
},
{
"destination": "Tokyo, Japan",
"information": """
Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
Key Information:
- Best visited during spring (cherry blossoms) or fall (autumn colors)
- Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
- Known for: Technology, anime culture, sushi, efficient public transport
- Local transportation: Extensive train and subway network
- Cultural tips: Bow when greeting, remove shoes indoors, no tipping
- Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
- Popular day trips: Mount Fuji, Kamakura, Nikko
"""
},
{
"destination": "New York City, USA",
"information": """
New York City is a global metropolis known for its diversity, culture, and iconic skyline.
Key Information:
- Best visited during spring (April-June) or fall (September-November)
- Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
- Known for: Broadway shows, diverse cuisine, shopping, museums
- Local transportation: Extensive subway system, yellow cabs, ride-sharing
- Popular areas: Manhattan, Brooklyn, Queens
- Cultural tips: Fast-paced environment, tipping expected (15-20%)
- Must-try experiences: Broadway show, High Line walk, food tours
"""
},
{
"destination": "Barcelona, Spain",
"information": """
Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
Key Information:
- Best visited during spring and fall for mild weather
- Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
- Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
- Local transportation: Metro, buses, and walkable city center
- Popular areas: Gothic Quarter, Eixample, La Barceloneta
- Cultural tips: Late dinner times (after 8 PM), siesta tradition
- Must-try experiences: La Rambla walk, tapas crawl, local markets
"""
},
{
"destination": "Bangkok, Thailand",
"information": """
Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
Key Information:
- Best visited during November to February (cool and dry season)
- Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
- Known for: Street food, temples, markets, nightlife
- Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
- Popular areas: Sukhumvit, Old City, Chinatown
- Cultural tips: Dress modestly at temples, respect royal family
- Must-try experiences: Street food tours, river cruises, floating markets
"""
}
]
client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
def populate_vector_db(collection, destinations_data):
"""
Populate the vector DB with travel information.
destinations_data should be a list of dictionaries with 'destination' and 'information' keys
"""
for data in destinations_data:
collection.add(
documents=[data['information']],
metadatas=[{"destination": data['destination']}],
ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
)
def search_tavily(query):
"""Fetch travel data using Tavily API."""
API_KEY = os.getenv("TAVILY_API_KEY")
client = TavilyClient(api_key=API_KEY)
    results = client.search(query, max_results=3)
return results
async def get_attractions(destination):
"""Search for top attractions in the destination."""
prompt = f"Best tourist attractions in {destination}"
attractions_search = search_tavily(prompt)
return attractions_search
async def get_hotels(destination):
"""Search for hotels in the destination."""
prompt = f"Best hotels in {destination}"
hotels_search = search_tavily(prompt)
return hotels_search
async def get_flights(destination):
"""Search for flights to the destination."""
prompt = f"Flights to {destination} from major cities"
flights_search = search_tavily(prompt)
return flights_search
async def get_weather(destination, start_date, end_date):
"""Search for weather information."""
prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
weather_search = search_tavily(prompt)
return weather_search
def initialize_vector_db():
"""Initialize ChromaDB with OpenAI embeddings."""
client = chromadb.Client()
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.getenv("OPENAI_API_KEY"),
model_name="text-embedding-3-small"
)
res = client.get_or_create_collection(
"travel_information",
embedding_function=embedding_fn
)
populate_vector_db(res, destinations_data)
return res
def query_vector_db(collection, destination, k=3):
"""Query the vector database for existing travel information."""
try:
results = collection.query(
query_texts=[destination],
n_results=k
)
return results['documents'][0] if results['documents'] else []
except Exception:
return []
async def research_destination(destination, start_date, end_date):
"""Gather all necessary travel information for a destination."""
# First, check the vector database
collection = initialize_vector_db()
existing_info = query_vector_db(collection, destination)
# Get real-time information from Tavily
tavily_data = {
"attractions": await get_attractions(destination),
"hotels": await get_hotels(destination),
"flights": await get_flights(destination),
"weather": await get_weather(destination, start_date, end_date)
}
return {
"vector_db_results": existing_info,
**tavily_data
}
async def create_travel_plan(destination, start_date, end_date, research_data):
"""Generate a travel itinerary using the researched data."""
vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
prompt = f"""
Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
Pre-stored destination information:
{vector_db_context}
Current travel data:
- Attractions: {research_data['attractions']}
- Hotels: {research_data['hotels']}
- Flights: {research_data['flights']}
- Weather: {research_data['weather']}
"""
response = client.chat.completions.create(
model="gpt-4.1",
messages=[
{"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
{"role": "user", "content": prompt}
]
).choices[0].message.content
return response
async def generate_itinerary(destination, start_date, end_date):
"""Main function to generate a travel itinerary."""
research_data = await research_destination(destination, start_date, end_date)
res = await create_travel_plan(destination, start_date, end_date, research_data)
return res
if __name__ == "__main__":
load_dotenv()
destination = input("Enter your travel destination: ")
start_date = input("Enter start date (YYYY-MM-DD): ")
end_date = input("Enter end date (YYYY-MM-DD): ")
itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
print("\nGenerated Itinerary:\n", itinerary)
```
## Sample Query 1:
Can you add Judgment tracing to my file?
## Example of Modified Code after Query 1:
```
from uuid import uuid4
import openai
import os
import asyncio
from tavily import TavilyClient
from dotenv import load_dotenv
import chromadb
from chromadb.utils import embedding_functions
from judgeval.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
from judgeval.data import Example
destinations_data = [
{
"destination": "Paris, France",
"information": """
Paris is the capital city of France and a global center for art, fashion, and culture.
Key Information:
- Best visited during spring (March-May) or fall (September-November)
- Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
- Known for: French cuisine, café culture, fashion, art galleries
- Local transportation: Metro system is extensive and efficient
- Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
- Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
- Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
"""
},
{
"destination": "Tokyo, Japan",
"information": """
Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
Key Information:
- Best visited during spring (cherry blossoms) or fall (autumn colors)
- Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
- Known for: Technology, anime culture, sushi, efficient public transport
- Local transportation: Extensive train and subway network
- Cultural tips: Bow when greeting, remove shoes indoors, no tipping
- Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
- Popular day trips: Mount Fuji, Kamakura, Nikko
"""
},
{
"destination": "New York City, USA",
"information": """
New York City is a global metropolis known for its diversity, culture, and iconic skyline.
Key Information:
- Best visited during spring (April-June) or fall (September-November)
- Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
- Known for: Broadway shows, diverse cuisine, shopping, museums
- Local transportation: Extensive subway system, yellow cabs, ride-sharing
- Popular areas: Manhattan, Brooklyn, Queens
- Cultural tips: Fast-paced environment, tipping expected (15-20%)
- Must-try experiences: Broadway show, High Line walk, food tours
"""
},
{
"destination": "Barcelona, Spain",
"information": """
Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
Key Information:
- Best visited during spring and fall for mild weather
- Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
- Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
- Local transportation: Metro, buses, and walkable city center
- Popular areas: Gothic Quarter, Eixample, La Barceloneta
- Cultural tips: Late dinner times (after 8 PM), siesta tradition
- Must-try experiences: La Rambla walk, tapas crawl, local markets
"""
},
{
"destination": "Bangkok, Thailand",
"information": """
Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
Key Information:
- Best visited during November to February (cool and dry season)
- Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
- Known for: Street food, temples, markets, nightlife
- Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
- Popular areas: Sukhumvit, Old City, Chinatown
- Cultural tips: Dress modestly at temples, respect royal family
- Must-try experiences: Street food tours, river cruises, floating markets
"""
}
]
client = wrap(openai.Client(api_key=os.getenv("OPENAI_API_KEY")))
judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="travel_agent_demo")
def populate_vector_db(collection, destinations_data):
"""
Populate the vector DB with travel information.
destinations_data should be a list of dictionaries with 'destination' and 'information' keys
"""
for data in destinations_data:
collection.add(
documents=[data['information']],
metadatas=[{"destination": data['destination']}],
ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
)
@judgment.observe(span_type="search_tool")
def search_tavily(query):
"""Fetch travel data using Tavily API."""
API_KEY = os.getenv("TAVILY_API_KEY")
client = TavilyClient(api_key=API_KEY)
    results = client.search(query, max_results=3)
return results
@judgment.observe(span_type="tool")
async def get_attractions(destination):
"""Search for top attractions in the destination."""
prompt = f"Best tourist attractions in {destination}"
attractions_search = search_tavily(prompt)
return attractions_search
@judgment.observe(span_type="tool")
async def get_hotels(destination):
"""Search for hotels in the destination."""
prompt = f"Best hotels in {destination}"
hotels_search = search_tavily(prompt)
return hotels_search
@judgment.observe(span_type="tool")
async def get_flights(destination):
"""Search for flights to the destination."""
prompt = f"Flights to {destination} from major cities"
flights_search = search_tavily(prompt)
example = Example(
input=prompt,
actual_output=str(flights_search["results"])
)
judgment.async_evaluate(
scorers=[AnswerRelevancyScorer(threshold=0.5)],
example=example,
model="gpt-4.1"
)
return flights_search
@judgment.observe(span_type="tool")
async def get_weather(destination, start_date, end_date):
"""Search for weather information."""
prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
weather_search = search_tavily(prompt)
example = Example(
input=prompt,
actual_output=str(weather_search["results"])
)
judgment.async_evaluate(
scorers=[AnswerRelevancyScorer(threshold=0.5)],
example=example,
model="gpt-4.1"
)
return weather_search
def initialize_vector_db():
"""Initialize ChromaDB with OpenAI embeddings."""
client = chromadb.Client()
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.getenv("OPENAI_API_KEY"),
model_name="text-embedding-3-small"
)
res = client.get_or_create_collection(
"travel_information",
embedding_function=embedding_fn
)
populate_vector_db(res, destinations_data)
return res
@judgment.observe(span_type="retriever")
def query_vector_db(collection, destination, k=3):
"""Query the vector database for existing travel information."""
try:
results = collection.query(
query_texts=[destination],
n_results=k
)
return results['documents'][0] if results['documents'] else []
except Exception:
return []
@judgment.observe(span_type="Research")
async def research_destination(destination, start_date, end_date):
"""Gather all necessary travel information for a destination."""
# First, check the vector database
collection = initialize_vector_db()
existing_info = query_vector_db(collection, destination)
# Get real-time information from Tavily
tavily_data = {
"attractions": await get_attractions(destination),
"hotels": await get_hotels(destination),
"flights": await get_flights(destination),
"weather": await get_weather(destination, start_date, end_date)
}
return {
"vector_db_results": existing_info,
**tavily_data
}
@judgment.observe(span_type="function")
async def create_travel_plan(destination, start_date, end_date, research_data):
"""Generate a travel itinerary using the researched data."""
vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
prompt = f"""
Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
Pre-stored destination information:
{vector_db_context}
Current travel data:
- Attractions: {research_data['attractions']}
- Hotels: {research_data['hotels']}
- Flights: {research_data['flights']}
- Weather: {research_data['weather']}
"""
response = client.chat.completions.create(
model="gpt-4.1",
messages=[
{"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
{"role": "user", "content": prompt}
]
).choices[0].message.content
example = Example(
input=prompt,
actual_output=str(response),
retrieval_context=[str(vector_db_context), str(research_data)]
)
judgment.async_evaluate(
scorers=[FaithfulnessScorer(threshold=0.5)],
example=example,
model="gpt-4.1"
)
return response
@judgment.observe(span_type="function")
async def generate_itinerary(destination, start_date, end_date):
"""Main function to generate a travel itinerary."""
research_data = await research_destination(destination, start_date, end_date)
res = await create_travel_plan(destination, start_date, end_date, research_data)
return res
if __name__ == "__main__":
load_dotenv()
destination = input("Enter your travel destination: ")
start_date = input("Enter start date (YYYY-MM-DD): ")
end_date = input("Enter end date (YYYY-MM-DD): ")
itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
print("\nGenerated Itinerary:\n", itinerary)
```
## Sample Agent 2:
```
from langchain_openai import ChatOpenAI
import asyncio
import os
import chromadb
from chromadb.utils import embedding_functions
from vectordbdocs import financial_data
from typing import Optional
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
from typing_extensions import TypedDict
from langgraph.graph import StateGraph
# Define our state type
class AgentState(TypedDict):
messages: list[BaseMessage]
category: Optional[str]
    documents: Optional[list[str]]
def populate_vector_db(collection, raw_data):
"""
Populate the vector DB with financial information.
"""
for data in raw_data:
collection.add(
documents=[data['information']],
metadatas=[{"category": data['category']}],
ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
)
# Define a ChromaDB collection for document storage
client = chromadb.Client()
collection = client.get_or_create_collection(
name="financial_docs",
embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
)
populate_vector_db(collection, financial_data)
def pnl_retriever(state: AgentState) -> AgentState:
query = state["messages"][-1].content
results = collection.query(
query_texts=[query],
where={"category": "pnl"},
n_results=3
)
documents = []
for document in results["documents"]:
documents += document
return {"messages": state["messages"], "documents": documents}
def balance_sheet_retriever(state: AgentState) -> AgentState:
query = state["messages"][-1].content
results = collection.query(
query_texts=[query],
where={"category": "balance_sheets"},
n_results=3
)
documents = []
for document in results["documents"]:
documents += document
return {"messages": state["messages"], "documents": documents}
def stock_retriever(state: AgentState) -> AgentState:
query = state["messages"][-1].content
results = collection.query(
query_texts=[query],
where={"category": "stocks"},
n_results=3
)
documents = []
for document in results["documents"]:
documents += document
return {"messages": state["messages"], "documents": documents}
async def bad_classifier(state: AgentState) -> AgentState:
return {"messages": state["messages"], "category": "stocks"}
async def bad_classify(state: AgentState) -> AgentState:
category = await bad_classifier(state)
return {"messages": state["messages"], "category": category["category"]}
async def bad_sql_generator(state: AgentState) -> AgentState:
ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
# Create the classifier node with a system prompt
async def classify(state: AgentState) -> AgentState:
messages = state["messages"]
input_msg = [
SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
- 'pnl' for Profit and Loss related queries
- 'balance_sheets' for Balance Sheet related queries
- 'stocks' for Stock market related queries
Respond ONLY with the category name in lowercase, nothing else."""),
*messages
]
response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
input=input_msg
)
return {"messages": state["messages"], "category": response.content}
# Add router node to direct flow based on classification
def router(state: AgentState) -> str:
return state["category"]
async def generate_response(state: AgentState) -> AgentState:
messages = state["messages"]
    documents = state.get("documents", [])
OUTPUT = """
SELECT
stock_symbol,
SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
FROM
stock_transactions
WHERE
stock_symbol = 'META'
GROUP BY
stock_symbol;
"""
return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
async def main():
# Initialize the graph
graph_builder = StateGraph(AgentState)
# Add classifier node
# For failure test, pass in bad_classifier
graph_builder.add_node("classifier", classify)
# graph_builder.add_node("classifier", bad_classify)
# Add conditional edges based on classification
graph_builder.add_conditional_edges(
"classifier",
router,
{
"pnl": "pnl_retriever",
"balance_sheets": "balance_sheet_retriever",
"stocks": "stock_retriever"
}
)
# Add retriever nodes (placeholder functions for now)
graph_builder.add_node("pnl_retriever", pnl_retriever)
graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
graph_builder.add_node("stock_retriever", stock_retriever)
# Add edges from retrievers to response generator
graph_builder.add_node("response_generator", generate_response)
# graph_builder.add_node("response_generator", bad_sql_generator)
graph_builder.add_edge("pnl_retriever", "response_generator")
graph_builder.add_edge("balance_sheet_retriever", "response_generator")
graph_builder.add_edge("stock_retriever", "response_generator")
graph_builder.set_entry_point("classifier")
graph_builder.set_finish_point("response_generator")
# Compile the graph
graph = graph_builder.compile()
response = await graph.ainvoke({
"messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
"category": None,
})
print(f"Response: {response['messages'][-1].content}")
if __name__ == "__main__":
asyncio.run(main())
```
## Sample Query 2:
Can you add Judgment tracing to my file?
## Example of Modified Code after Query 2:
```
from langchain_openai import ChatOpenAI
import asyncio
import os
import chromadb
from chromadb.utils import embedding_functions
from vectordbdocs import financial_data
from typing import Optional
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
from typing_extensions import TypedDict
from langgraph.graph import StateGraph
from judgeval.common.tracer import Tracer
from judgeval.integrations.langgraph import JudgevalCallbackHandler
from judgeval.scorers import AnswerCorrectnessScorer, FaithfulnessScorer
from judgeval.data import Example
judgment = Tracer(project_name="FINANCIAL_AGENT")
# Define our state type
class AgentState(TypedDict):
messages: list[BaseMessage]
category: Optional[str]
    documents: Optional[list[str]]
def populate_vector_db(collection, raw_data):
"""
Populate the vector DB with financial information.
"""
for data in raw_data:
collection.add(
documents=[data['information']],
metadatas=[{"category": data['category']}],
ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
)
# Define a ChromaDB collection for document storage
client = chromadb.Client()
collection = client.get_or_create_collection(
name="financial_docs",
embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
)
populate_vector_db(collection, financial_data)
@judgment.observe(name="pnl_retriever", span_type="retriever")
def pnl_retriever(state: AgentState) -> AgentState:
query = state["messages"][-1].content
results = collection.query(
query_texts=[query],
where={"category": "pnl"},
n_results=3
)
documents = []
for document in results["documents"]:
documents += document
return {"messages": state["messages"], "documents": documents}
@judgment.observe(name="balance_sheet_retriever", span_type="retriever")
def balance_sheet_retriever(state: AgentState) -> AgentState:
query = state["messages"][-1].content
results = collection.query(
query_texts=[query],
where={"category": "balance_sheets"},
n_results=3
)
documents = []
for document in results["documents"]:
documents += document
return {"messages": state["messages"], "documents": documents}
@judgment.observe(name="stock_retriever", span_type="retriever")
def stock_retriever(state: AgentState) -> AgentState:
query = state["messages"][-1].content
results = collection.query(
query_texts=[query],
where={"category": "stocks"},
n_results=3
)
documents = []
for document in results["documents"]:
documents += document
return {"messages": state["messages"], "documents": documents}
@judgment.observe(name="bad_classifier", span_type="llm")
async def bad_classifier(state: AgentState) -> AgentState:
return {"messages": state["messages"], "category": "stocks"}
@judgment.observe(name="bad_classify")
async def bad_classify(state: AgentState) -> AgentState:
category = await bad_classifier(state)
example = Example(
input=state["messages"][-1].content,
actual_output=category["category"],
expected_output="pnl"
)
judgment.async_evaluate(
scorers=[AnswerCorrectnessScorer(threshold=1)],
example=example,
model="gpt-4.1"
)
return {"messages": state["messages"], "category": category["category"]}
@judgment.observe(name="bad_sql_generator", span_type="llm")
async def bad_sql_generator(state: AgentState) -> AgentState:
ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
example = Example(
input=state["messages"][-1].content,
actual_output=ACTUAL_OUTPUT,
retrieval_context=state.get("documents", []),
expected_output="""
SELECT
SUM(CASE
WHEN transaction_type = 'sell' THEN (price_per_share - (SELECT price_per_share FROM stock_transactions WHERE stock_symbol = 'GOOGL' AND transaction_type = 'buy' LIMIT 1)) * quantity
ELSE 0
END) AS realized_pnl
FROM
stock_transactions
WHERE
stock_symbol = 'META';
"""
)
judgment.async_evaluate(
scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
example=example,
model="gpt-4.1"
)
return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
# Create the classifier node with a system prompt
@judgment.observe(name="classify")
async def classify(state: AgentState) -> AgentState:
messages = state["messages"]
input_msg = [
SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
- 'pnl' for Profit and Loss related queries
- 'balance_sheets' for Balance Sheet related queries
- 'stocks' for Stock market related queries
Respond ONLY with the category name in lowercase, nothing else."""),
*messages
]
response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
input=input_msg
)
example = Example(
input=str(input_msg),
actual_output=response.content,
expected_output="pnl"
)
judgment.async_evaluate(
scorers=[AnswerCorrectnessScorer(threshold=1)],
example=example,
model="gpt-4.1"
)
return {"messages": state["messages"], "category": response.content}
# Add router node to direct flow based on classification
def router(state: AgentState) -> str:
return state["category"]
@judgment.observe(name="generate_response")
async def generate_response(state: AgentState) -> AgentState:
messages = state["messages"]
    documents = state.get("documents", [])
OUTPUT = """
SELECT
stock_symbol,
SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
FROM
stock_transactions
WHERE
stock_symbol = 'META'
GROUP BY
stock_symbol;
"""
example = Example(
input=messages[-1].content,
actual_output=OUTPUT,
retrieval_context=documents,
expected_output="""
SELECT
stock_symbol,
SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
FROM
stock_transactions
WHERE
stock_symbol = 'META'
GROUP BY
stock_symbol;
"""
)
judgment.async_evaluate(
scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
example=example,
model="gpt-4.1"
)
return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
async def main():
with judgment.trace(
"run_1",
project_name="FINANCIAL_AGENT",
overwrite=True
) as trace:
# Initialize the graph
graph_builder = StateGraph(AgentState)
# Add classifier node
# For failure test, pass in bad_classifier
graph_builder.add_node("classifier", classify)
# graph_builder.add_node("classifier", bad_classify)
# Add conditional edges based on classification
graph_builder.add_conditional_edges(
"classifier",
router,
{
"pnl": "pnl_retriever",
"balance_sheets": "balance_sheet_retriever",
"stocks": "stock_retriever"
}
)
# Add retriever nodes (placeholder functions for now)
graph_builder.add_node("pnl_retriever", pnl_retriever)
graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
graph_builder.add_node("stock_retriever", stock_retriever)
# Add edges from retrievers to response generator
graph_builder.add_node("response_generator", generate_response)
# graph_builder.add_node("response_generator", bad_sql_generator)
graph_builder.add_edge("pnl_retriever", "response_generator")
graph_builder.add_edge("balance_sheet_retriever", "response_generator")
graph_builder.add_edge("stock_retriever", "response_generator")
graph_builder.set_entry_point("classifier")
graph_builder.set_finish_point("response_generator")
# Compile the graph
graph = graph_builder.compile()
handler = JudgevalCallbackHandler(trace)
response = await graph.ainvoke({
"messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
"category": None,
}, config=dict(callbacks=[handler]))
trace.save()
print(f"Response: {response['messages'][-1].content}")
if __name__ == "__main__":
asyncio.run(main())
```
# Official Judgment SDK Documentation
---
title: JudgmentClient
description: Complete reference for the JudgmentClient Python SDK
---
import { APIEndpoint } from '@/components/api';
# JudgmentClient API Reference
The JudgmentClient is your primary interface for interacting with the Judgment platform. It provides methods for running evaluations, managing datasets, handling traces, and more.
## Authentication
Set up your credentials using environment variables:
```bash
export JUDGMENT_API_KEY="your_api_key_here"
export JUDGMENT_ORG_ID="your_organization_id_here"
```
<APIEndpoint
title="Initialize Client"
description="Initialize a JudgmentClient object."
parameters={[
{
name: "judgment_api_key",
type: "str",
required: false,
description: "Recommended - set using the JUDGMENT_API_KEY environment variable",
},
{
name: "judgment_org_id",
type: "str",
required: false,
description: "Recommended - set using the JUDGMENT_ORG_ID environment variable",
},
]}
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval import JudgmentClient
client = JudgmentClient()`
}
]}
/>
<APIEndpoint
title="client.run_evaluation()"
description="Execute an evaluation of examples using one or more scorers to measure performance and quality of your AI models."
parameters={[
{
name: "examples",
type: "List[Example]",
required: true,
description: "The examples to evaluate against your AI model",
example: "[Example(...)]",
},
{
name: "scorers",
type: "List[APIJudgmentScorer]",
required: true,
description: "List of scorers to use for evaluation",
example: "[APIJudgmentScorer(...)]"
},
{
name: "model",
type: "str",
required: false,
description: "Model used as judge when using LLM as a Judge",
example: '"gpt-4o-mini"',
default: "gpt-4.1"
},
{
name: "project_name",
type: "str",
required: false,
description: "Name of the project for organization",
example: '"my_qa_project"',
default: "default_project"
},
{
name: "eval_run_name",
type: "str",
required: false,
description: "Unique name for this evaluation run",
example: '"experiment_v1"',
default: "default_eval_run"
},
{
name: "override",
type: "bool",
required: false,
description: "Whether to override an existing evaluation run with the same name",
default: "False"
},
{
name: "append",
type: "bool",
required: false,
description: "Whether to append to an existing evaluation run with the same name",
default: "False"
},
{
name: "async_execution",
type: "bool",
required: false,
description: "Whether to execute the evaluation asynchronously",
default: "False"
}
]}
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval import JudgmentClient
from judgeval.data import Example
client = JudgmentClient()
examples = [
Example(
input="What is the capital of France?",
actual_output="Paris is the capital of France.",
expected_output="Paris"
)
]
from judgeval.scorers import AnswerRelevancyScorer
results = client.run_evaluation(
examples=examples,
scorers=[AnswerRelevancyScorer(threshold=0.9)],
project_name="geography_qa"
)`
}
]}
responses={[
{
status: 200,
description: "List[ScoringResult]",
example: `[
ScoringResult(
success=False,
scorers_data=[ScorerData(...)],
name=None,
data_object=Example(...),
trace_id=None,
run_duration=None,
evaluation_cost=None
)
]`
}
]}
/>
<APIEndpoint
title="client.run_trace_evaluation()"
description="Execute trace-based evaluation using function calls and tracing to evaluate agent behavior and execution flows."
parameters={[
{
name: "scorers",
type: "List[APIJudgmentScorer]",
required: true,
description: "List of scorers to use for evaluation",
example: "[APIJudgmentScorer(...)]"
},
{
name: "examples",
type: "List[Example]",
required: false,
description: "Examples to run through the function (required if using function)",
example: "[Example(...)]"
},
{
name: "function",
type: "Callable",
required: false,
description: "Function to execute and trace for evaluation"
},
{
name: "tracer",
type: "Union[Tracer, BaseCallbackHandler]",
required: false,
description: "The tracer object used in tracing your agent"
},
{
name: "traces",
type: "List[Trace]",
required: false,
description: "Pre-existing traces to evaluate instead of generating new ones"
},
{
name: "project_name",
type: "str",
required: false,
description: "Name of the project for organization",
default: "default_project",
example: '"agent_evaluation"'
},
{
name: "eval_run_name",
type: "str",
required: false,
description: "Unique name for this trace evaluation run",
default: "default_eval_run",
example: '"agent_trace_v1"'
},
{
name: "override",
type: "bool",
required: false,
description: "Whether to override an existing evaluation run with the same name",
default: "False"
},
{
name: "append",
type: "bool",
required: false,
description: "Whether to append to an existing evaluation run with the same name",
default: "False"
},
]}
note="You either need to provide 'examples', 'function' and 'tracer' OR 'traces'"
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval import JudgmentClient
from judgeval.tracer import Tracer
from judgeval.data import Example
client = JudgmentClient()
tracer = Tracer()
def my_agent_function(query: str) -> str:
"""Your agent function to be traced and evaluated"""
response = f"Processing query: {query}"
return response
examples = [
Example(
input={"query": "What is the weather like?"},
expected_output="I'll help you check the weather."
)
]
from judgeval.scorers import ToolOrderScorer
results = client.run_trace_evaluation(
scorers=[ToolOrderScorer()],
examples=examples,
function=my_agent_function,
tracer=tracer,
project_name="agent_evaluation"
)`
}
]}
responses={[
{
status: 200,
description: "List[ScoringResult]",
example: `[
ScoringResult(
success=False,
scorers_data=[ScorerData(...)],
name=None,
data_object=Example(...),
trace_id=None,
run_duration=None,
evaluation_cost=None
)
]`
}
]}
/>
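For the traces-based form, here is a minimal sketch; `existing_traces` is an assumed placeholder for `Trace` objects you have already collected from earlier runs:

```python
from judgeval import JudgmentClient
from judgeval.scorers import ToolOrderScorer

client = JudgmentClient()

# Placeholder: a List[Trace] collected from prior agent runs
existing_traces = [...]

# Evaluate the pre-existing traces instead of running a function
results = client.run_trace_evaluation(
    scorers=[ToolOrderScorer()],
    traces=existing_traces,
    project_name="agent_evaluation",
)
```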
<APIEndpoint
title="client.create_dataset()"
description="Create a new evaluation dataset for storage and reuse across multiple evaluation runs."
parameters={[
]}
/>
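A minimal sketch of the intended flow, mirroring the `push_dataset` example below (where `create_dataset()` returns an `EvalDataset` that you can add examples to):

```python
from judgeval import JudgmentClient
from judgeval.data import Example

client = JudgmentClient()

dataset = client.create_dataset()  # empty EvalDataset to populate
dataset.add_examples([
    Example(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI...",
        expected_output="Machine learning is a method of data analysis..."
    )
])
```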
<APIEndpoint
title="client.push_dataset()"
description="Upload an evaluation dataset to the Judgment platform for storage and reuse across multiple evaluation runs."
parameters={[
{
name: "alias",
type: "str",
required: true,
description: "Unique name for the dataset within the project",
example: '"qa_dataset_v1"'
},
{
name: "dataset",
type: "EvalDataset",
required: true,
description: "Dataset object containing examples and metadata"
},
{
name: "project_name",
type: "str",
required: true,
description: "Project name where the dataset will be stored",
example: '"question_answering"'
},
{
name: "overwrite",
type: "bool",
required: false,
description: "Whether to overwrite existing dataset with same alias",
default: "False"
}
]}
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval.data import Example
dataset = client.create_dataset()
dataset.add_examples([
Example(
input="What is machine learning?",
actual_output="Machine learning is a subset of AI...",
expected_output="Machine learning is a method of data analysis..."
)
])
success = client.push_dataset(
alias="ml_qa_dataset_v2",
dataset=dataset,
project_name="machine_learning_qa",
overwrite=True
)`
}
]}
responses={[
{
status: 200,
description: "bool",
example: `True`
}
]}
/>
<APIEndpoint
title="client.pull_dataset()"
description="Retrieve a saved dataset from the Judgment platform to use in evaluations or analysis."
parameters={[
{
name: "alias",
type: "str",
required: true,
description: "The alias of the dataset to retrieve",
example: '"qa_dataset_v1"'
},
{
name: "project_name",
type: "str",
required: true,
description: "Project name where the dataset is stored",
example: '"question_answering"'
}
]}
codeExamples={[
{
language: "python",
label: "Python",
code: `dataset = client.pull_dataset(
alias="qa_dataset_v1",
project_name="question_answering"
)
print(f"Dataset has {len(dataset.examples)} examples")
results = client.run_evaluation(
examples=dataset.examples,
scorers=my_scorers,
project_name="question_answering"
)`
}
]}
responses={[
{
status: 200,
description: "EvalDataset",
example: `EvalDataset(
examples=[
Example(
input="What is the capital of France?",
actual_output="Paris",
expected_output="Paris"
)
],
metadata={
"created_at": "2024-01-15T10:30:00Z",
"examples_count": 1
}
)`
}
]}
/>
<APIEndpoint
title="client.append_dataset()"
description="Append examples to an existing dataset."
parameters={[
{
name: "alias",
type: "str",
required: true,
description: "Unique name for the dataset within the project",
example: '"qa_dataset_v1"'
},
{
name: "examples",
type: "List[Example]",
required: true,
description: "List of examples to append to the dataset",
example: "[Example(...)]"
},
{
name: "project_name",
type: "str",
required: true,
description: "Project name where the dataset will be stored",
example: '"question_answering"'
},
]}
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval.data import Example
dataset = client.pull_dataset(
alias="qa_dataset_v1",
project_name="question_answering"
)
examples = [
Example(
input="What is the capital of France?",
actual_output="Paris",
expected_output="Paris"
)
]
results = client.append_dataset(
alias="qa_dataset_v1",
examples=examples,
project_name="question_answering"
)`
}
]}
responses={[
{
status: 200,
description: "bool",
example: `True`
}
]}
/>
<APIEndpoint
title="client.assert_test()"
description="Runs evaluations as unit tests, raising an exception if the score falls below the defined threshold."
parameters={[
{
name: "examples",
type: "List[Example]",
required: true,
description: "The examples to evaluate against your AI model",
example: "[Example(...)]",
},
{
name: "scorers",
type: "List[APIJudgmentScorer]",
required: true,
description: "List of scorers to use for evaluation",
example: "[APIJudgmentScorer(...)]"
},
{
name: "model",
type: "str",
required: false,
description: "Model used as judge when using LLM as a Judge",
example: '"gpt-4o-mini"',
default: "gpt-4.1"
},
{
name: "project_name",
type: "str",
required: false,
description: "Name of the project for organization",
example: '"my_qa_project"',
default: "default_project"
},
{
name: "eval_run_name",
type: "str",
required: false,
description: "Unique name for this evaluation run",
example: '"experiment_v1"',
default: "default_eval_run"
},
{
name: "override",
type: "bool",
required: false,
description: "Whether to override an existing evaluation run with the same name",
default: "False"
},
{
name: "append",
type: "bool",
required: false,
description: "Whether to append to an existing evaluation run with the same name",
default: "False"
},
{
name: "async_execution",
type: "bool",
required: false,
description: "Whether to execute the evaluation asynchronously",
default: "False"
}
]}
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
client = JudgmentClient()
example = Example(
input="What if these shoes don't fit?",
actual_output="We offer a 30-day full refund at no extra cost.",
retrieval_context=["All customers are eligible for a 44 day full refund at no extra cost."],
)
scorer = FaithfulnessScorer(threshold=0.5)
client.assert_test(
examples=[example],
scorers=[scorer],
)`
}
]}
/>
<APIEndpoint
title="client.assert_trace_test()"
description="Runs trace-based evaluations as unit tests, raising an exception if the score falls below the defined threshold."
parameters={[
{
name: "scorers",
type: "List[APIJudgmentScorer]",
required: true,
description: "List of scorers to use for evaluation",
example: "[APIJudgmentScorer(...)]"
},
{
name: "examples",
type: "List[Example]",
required: false,
description: "Examples to run through the function (required if using function)",
example: "[Example(...)]"
},
{
name: "function",
type: "Callable",
required: false,
description: "Function to execute and trace for evaluation"
},
{
name: "tracer",
type: "Union[Tracer, BaseCallbackHandler]",
required: false,
description: "The tracer object used in tracing your agent"
},
{
name: "traces",
type: "List[Trace]",
required: false,
description: "Pre-existing traces to evaluate instead of generating new ones"
},
{
name: "project_name",
type: "str",
required: false,
description: "Name of the project for organization",
default: "default_project",
example: '"agent_evaluation"'
},
{
name: "eval_run_name",
type: "str",
required: false,
description: "Unique name for this trace evaluation run",
default: "default_eval_run",
example: '"agent_trace_v1"'
},
{
name: "override",
type: "bool",
required: false,
description: "Whether to override an existing evaluation run with the same name",
default: "False"
},
{
name: "append",
type: "bool",
required: false,
description: "Whether to append to an existing evaluation run with the same name",
default: "False"
},
]}
note="You either need to provide 'examples', 'function' and 'tracer' OR 'traces'"
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval import JudgmentClient
from judgeval.tracer import Tracer
from judgeval.data import Example
client = JudgmentClient()
tracer = Tracer()
def my_agent_function(query: str) -> str:
"""Your agent function to be traced and evaluated"""
response = f"Processing query: {query}"
return response
examples = [
Example(
input={"query": "What is the weather like?"},
expected_output="I'll help you check the weather."
)
]
from judgeval.scorers import ToolOrderScorer
results = client.assert_trace_test(
scorers=[ToolOrderScorer()],
examples=examples,
function=my_agent_function,
tracer=tracer,
project_name="agent_evaluation"
)`
}
]}
/>
## Error Handling
The JudgmentClient raises specific exceptions for different error conditions:
<div className="overflow-x-auto">
<table className="min-w-full">
<thead>
<tr className="border-b border-gray-200 dark:border-gray-700">
<th className="text-left py-3 text-sm font-medium text-gray-900 dark:text-gray-100">Exception</th>
<th className="text-left py-3 text-sm font-medium text-gray-900 dark:text-gray-100">Description</th>
</tr>
</thead>
<tbody className="divide-y divide-gray-200 dark:divide-gray-700">
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">JudgmentAPIError</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">API request failures or server errors</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">ValueError</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Invalid parameters or configuration</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">FileNotFoundError</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Missing test files or datasets</td>
</tr>
</tbody>
</table>
</div>
```python
from judgeval.common.exceptions import JudgmentAPIError
try:
results = client.run_evaluation(examples, scorers)
except JudgmentAPIError as e:
print(f"API Error: {e}")
except ValueError as e:
print(f"Invalid parameters: {e}")
```
---
title: Tracer
description: Complete reference for the Tracer Python SDK
---
import { APIEndpoint } from '@/components/api';
# Tracer API Reference
The Tracer is your primary interface for adding observability to your AI agents. It provides methods for tracing function execution, evaluating performance, and collecting comprehensive environment interaction data.
<APIEndpoint
title="Initializing Tracer"
description="Initialize a Tracer object."
parameters={[
{
name: "api_key",
type: "str",
required: false,
description: "Recommended - set using the JUDGMENT_API_KEY environment variable",
},
{
name: "organization_id",
type: "str",
required: false,
description: "Recommended - set using the JUDGMENT_ORG_ID environment variable",
},
{
name: "project_name",
type: "str",
required: false,
description: "Optional project name override",
default: "default_project"
},
{
name: "deep_tracing",
type: "bool",
required: false,
description: "Whether to enable deep tracing, which will trace all nested function calls without the need to decorate each function.",
default: "False"
},
{
name: "enable_monitoring",
type: "bool",
required: false,
description: "If you need to toggle monitoring on and off",
default: "True"
},
{
name: "enable_evaluations",
type: "bool",
required: false,
description: "If you need to toggle evaluations on and off for async_evaluate()",
default: "True"
},
{
name: "use_s3",
type: "bool",
required: false,
description: "Whether to use S3 for storage",
default: "False"
},
{
name: "s3_bucket_name",
type: "str",
required: false,
description: "Name of the S3 bucket to use",
default: "None"
},
{
name: "s3_aws_access_key_id",
type: "str",
required: false,
description: "AWS access key ID for S3",
default: "None"
},
{
name: "s3_aws_secret_access_key",
type: "str",
required: false,
description: "AWS secret access key for S3",
default: "None"
},
{
name: "s3_region_name",
type: "str",
required: false,
description: "AWS region name for S3",
default: "None"
},
{
name: "trace_across_async_contexts",
type: "bool",
required: false,
description: "Whether to trace across async contexts",
default: "False"
},
{
name: "span_batch_size",
type: "int",
required: false,
description: "Number of spans to batch before sending",
default: "50"
},
{
name: "span_flush_interval",
type: "float",
required: false,
description: "Time in seconds between automatic flushes",
default: "1.0"
},
{
name: "span_num_workers",
type: "int",
required: false,
description: "Number of worker threads for span processing",
default: "10"
}
]}
codeExamples={[
{
language: "python",
label: "Python",
code: `from judgeval import Tracer
tracer = Tracer()`
}
]}
/>
<APIEndpoint
title="tracer.observe()"
description="Decorator to trace function execution with detailed entry/exit information."
parameters={[
{
name: "func",
type: "Callable",
required: true,
description: "The function to decorate (automatically provided when used as decorator)",
},
{
name: "name",
type: "str",
required: false,
description: "Optional custom name for the span (defaults to function name)",
default: "None",
example: '"custom_span_name"'
},
{
name: "span_type",
type: "str",
required: false,
description: "Label for the span. Use 'tool' for functions that should be tracked and exported as agent tools",
default: '"span"',
example: '"tool"'
},
{
name: "project_name",
type: "str",
required: false,
description: "Optional project name override",
default: "None",
example: '"my_project"'
},
{
name: "overwrite",
type: "bool",
required: false,
description: "Whether to overwrite existing traces",
default: "False",
example: "False"
},
{
name: "deep_tracing",
type: "bool",
required: false,
description: "Whether to enable deep tracing for this function and all nested calls. If None, uses the tracer's default setting.",
default: "False",
example: "True"
}
]}
codeExamples={[
{
language: "python",
label: "Function Decorator",
code: `from openai import OpenAI
from judgeval.common.tracer import Tracer
client = OpenAI()
tracer = Tracer(project_name='simple-agent', deep_tracing=False)
@tracer.observe(span_type="tool")
def search_web(query):
return f"Results for: {query}"
@tracer.observe(span_type="retriever")
def get_database(query):
return f"Database results for: {query}"
@tracer.observe(span_type="function")
def run_agent(user_query):
# Use tools based on query
if "database" in user_query:
info = get_database(user_query)
else:
info = search_web(user_query)
prompt = f"Context: {info}, Question: {user_query}"
# Generate response
response = client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content`
}
]}
/>
<APIEndpoint
title="tracer.observe_tools()"
description="Automatically adds @observe(span_type='tool') to all methods in a class."
parameters={[
{
name: "cls",
type: "type",
required: true,
description: "The class to decorate (automatically provided when used as decorator)",
},
{
name: "exclude_methods",
type: "List[str]",
required: false,
description: "List of method names to skip decorating. Defaults to common magic methods",
default: '["__init__", "__new__", "__del__", "__str__", "__repr__"]',
example: '["__init__", "private_method"]'
},
{
name: "include_private",
type: "bool",
required: false,
description: "Whether to decorate methods starting with underscore. Defaults to False",
default: "False",
example: "False"
},
{
name: "warn_on_double_decoration",
type: "bool",
required: false,
description: "Whether to print warnings when skipping already-decorated methods. Defaults to True",
default: "True",
example: "True"
}
]}
codeExamples={[
{
language: "python",
label: "Class Decorator",
code: `@tracer.observe_tools()
class SearchTool:
def search_web(self, query):
return f"Web results for: {query}"
def search_docs(self, query):
return f"Document results for: {query}"
def _private_helper(self):
# This won't be traced by default
return "helper"
class MyAgent(SearchTool):
@tracer.observe(span_type="function")
def run_agent(self, user_query):
# Use inherited tools
if "docs" in user_query:
info = self.search_docs(user_query)
else:
info = self.search_web(user_query)
return f"Agent response based on: {info}"
# All public methods from SearchTool are automatically traced
agent = MyAgent()
result = agent.run_agent("Find web results") # Both calls are traced`
}
]}
/>
<APIEndpoint
title="wrap()"
description="Wraps an API client to add tracing capabilities. Supports OpenAI, Together, Anthropic, and Google GenAI clients. Patches both '.create' and Anthropic's '.stream' methods using a wrapper class."
parameters={[
{
name: "client",
type: "Any",
required: true,
description: "API client to wrap (OpenAI, Anthropic, Together, Google GenAI)",
example: "OpenAI()"
},
{
name: "trace_across_async_contexts",
type: "bool",
required: false,
description: "Whether to trace across async contexts",
default: "False",
example: "True"
}
]}
codeExamples={[
{
language: "python",
label: "Auto-trace LLM Calls",
code: `from openai import OpenAI
from judgeval import wrap
client = OpenAI()
wrapped_client = wrap(client)
# All API calls are now automatically traced
response = wrapped_client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": "Hello"}]
)
# Streaming calls are also traced
stream = wrapped_client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": "Hello"}],
stream=True
)`
}
]}
/>
## Evaluation & Logging
<APIEndpoint
title="tracer.async_evaluate()"
description="Runs quality evaluations on the current trace/span using specified scorers. You can provide either an Example object or individual evaluation parameters (input, actual_output, etc.)."
parameters={[
{
name: "scorers",
type: "List[Union[APIJudgmentScorer, JudgevalScorer]]",
required: true,
description: "List of evaluation scorers to run",
example: "[FaithfulnessScorer()]"
},
{
name: "example",
type: "Example",
required: false,
description: "Example object containing evaluation data",
default: "None"
},
{
name: "input",
type: "str",
required: false,
description: "Input text to evaluate",
default: "None",
example: '"What is the capital of France?"'
},
{
name: "actual_output",
type: "Union[str, List[str]]",
required: false,
description: "Actual output from your system",
default: "None",
example: '"Paris is the capital of France"'
},
{
name: "expected_output",
type: "Union[str, List[str]]",
required: false,
description: "Expected/reference output",
default: "None",
example: '"Paris"'
},
{
name: "context",
type: "List[str]",
required: false,
description: "Context information for evaluation",
default: "None",
example: '["France is a country in Europe"]'
},
{
name: "retrieval_context",
type: "List[str]",
required: false,
description: "Retrieved documents for RAG evaluation",
default: "None"
},
{
name: "tools_called",
type: "List[str]",
required: false,
description: "Tools that were actually called",
default: "None",
example: '["search", "calculate"]'
},
{
name: "expected_tools",
type: "List[str]",
required: false,
description: "Tools that should have been called",
default: "None",
example: '["search"]'
},
{
name: "additional_metadata",
type: "Dict[str, Any]",
required: false,
description: "Additional metadata for the evaluation",
default: "None"
},
{
name: "model",
type: "str",
required: false,
description: "Model name for evaluation",
default: "None",
example: '"gpt-4.1"'
},
{
name: "span_id",
type: "str",
required: false,
description: "Specific span ID to attach evaluation to",
default: "None"
},
{
name: "log_results",
type: "bool",
required: false,
description: "Whether to log results to the Judgment platform",
default: "True"
}
]}
codeExamples={[
{
language: "python",
label: "Using Example Object",
code: `from judgeval.scorers import FaithfulnessScorer
from judgeval.data import Example
@tracer.observe(span_type="function")
def answer_question(question):
    answer = "Paris is the capital of France"
    # Create example object
    example = Example(
        input=question,
        actual_output=answer,
        expected_output="Paris",
        context=["France is a country in Europe"]
    )
    # Evaluate using Example
    tracer.async_evaluate(
        scorers=[FaithfulnessScorer()],
        example=example
    )
    return answer`
},
{
language: "python",
label: "Individual Parameters",
code: `from judgeval.scorers import FaithfulnessScorer
@tracer.observe(span_type="function")
def answer_question(question):
    answer = "Paris is the capital of France"
    # Evaluate the current span
    tracer.async_evaluate(
        scorers=[FaithfulnessScorer()],
        input=question,
        actual_output=answer,
        expected_output="Paris",
        context=["France is a country in Europe"]
    )
    return answer`
}
]}
/>
<APIEndpoint
title="tracer.log()"
description="Log a message with the current span context"
parameters={[
{
name: "msg",
type: "str",
required: true,
description: "Message to log",
example: '"Starting web search"'
},
{
name: "label",
type: "str",
required: false,
description: "Label/category for the log entry",
default: '"log"',
example: '"debug"'
},
{
name: "score",
type: "int",
required: false,
description: "Numeric score associated with the log",
default: "1",
example: "1"
}
]}
codeExamples={[
{
language: "python",
label: "Logging Within Traced Functions",
code: `def search_process(query):
    tracer.log("Starting search", label="info")
    try:
        results = perform_search(query)
        tracer.log(f"Found {len(results)} results", label="success", score=1)
        return results
    except Exception as e:
        tracer.log(f"Search failed: {e}", label="error", score=0)
        raise`
}
]}
/>
## Metadata & Organization
<APIEndpoint
title="tracer.set_metadata()"
description="Set metadata for the current trace."
parameters={[
{
name: "**kwargs",
type: "Any",
required: true,
description: "Key-value pairs to set as metadata for the current trace. Each keyword argument becomes a metadata field.",
}
]}
codeExamples={[
{
language: "python",
label: "Adding Trace Metadata",
code: `def process_user_request(user_id, request):
    # Add metadata to the current trace
    tracer.set_metadata(
        user_id=user_id,
        environment="production",
        experiment_id="exp_456",
        version="1.2.3"
    )
    return handle_request(request)`
}
]}
/>
<APIEndpoint
title="tracer.set_customer_id()"
description="Set the customer ID for the current trace."
parameters={[
{
name: "customer_id",
type: "str",
required: true,
description: "The customer ID to set",
example: '"customer_123"'
}
]}
codeExamples={[
{
language: "python",
label: "Customer Tracking",
code: `def handle_customer_request(customer_id, request):
    tracer.set_customer_id(customer_id)
    return process_request(request)`
}
]}
/>
<APIEndpoint
title="tracer.set_tags()"
description="Set the tags for the current trace."
parameters={[
{
name: "tags",
type: "List[str]",
required: true,
description: "List of tags to set",
example: '["experiment", "production", "v2"]'
}
]}
codeExamples={[
{
language: "python",
label: "Tagging Traces",
code: `def experimental_feature(data):
    tracer.set_tags(["experiment", "feature_v2", "production"])
    return new_algorithm(data)`
}
]}
/>
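These three calls compose naturally at the entry point of a traced workflow. A minimal sketch combining them (the handler and `run_agent` names are illustrative):
```python
@tracer.observe(span_type="agent")
def handle_request(customer_id, request):
    # Attach organizational context to the current trace up front
    tracer.set_customer_id(customer_id)
    tracer.set_tags(["production", "v2"])
    tracer.set_metadata(
        request_type=request.get("type"),
        environment="production"
    )
    return run_agent(request)  # illustrative downstream call
```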
## Advanced Features
<APIEndpoint
title="tracer.identify()"
description="Class decorator for multi-agent systems that assigns a unique identifier to agent and enables tracking of their internal state variables. Essential for monitoring and debugging complex multi-agent workflows where multiple agents interact and you need to track each agent's behavior and state separately." parameters={[
{
name: "identifier",
type: "str",
required: true,
description: "The identifier to associate with the decorated class. This will be used as the instance name in traces.",
example: '"user_agent"'
},
{
name: "track_state",
type: "bool",
required: false,
description: "Whether to automatically capture the state (attributes) of instances before and after function execution. Defaults to False.",
default: "False",
example: "True"
},
{
name: "track_attributes",
type: "List[str]",
required: false,
description: "Optional list of specific attribute names to track. If None, all non-private attributes (not starting with '_') will be tracked when track_state=True.",
default: "None",
example: '["memory", "goals"]'
},
{
name: "field_mappings",
type: "Dict[str, str]",
required: false,
description: "Optional dictionary mapping internal attribute names to display names in the captured state. For example: {\"system_prompt\": \"instructions\"} will capture the 'instructions' attribute as 'system_prompt' in the state.",
default: "None",
example: '{"system_prompt": "instructions"}'
}
]}
codeExamples={[
{
language: "python",
label: "State Tracking",
code: `@judgment.identify(identifier="name", track_state=True)
class Agent(AgentTools, AgentBase):
    """An AI agent."""
    def __init__(self, name):
        self.name = name
        self.function_map = {
            "func": self.function,
            # ... additional tool mappings
        }

    @judgment.observe(span_type="function")
    def process_request(self, user_request):
        """Process a user request using all available tools."""
        pass`
}
]}
/>
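Neither `track_attributes` nor `field_mappings` appears in the example above. A sketch of how they might combine, using a hypothetical `PlannerAgent` whose `instructions` attribute should appear as `system_prompt` in the captured state:
```python
@judgment.identify(
    identifier="name",
    track_state=True,
    track_attributes=["memory", "goals", "instructions"],  # only these attributes are captured
    field_mappings={"system_prompt": "instructions"}       # 'instructions' shows up as 'system_prompt'
)
class PlannerAgent:
    def __init__(self, name, instructions):
        self.name = name                  # used as the instance name in traces
        self.instructions = instructions
        self.memory = []
        self.goals = []

    @judgment.observe(span_type="function")
    def plan(self, task):
        self.goals.append(task)
        return f"Plan for {task}"
```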
## Current Span Access
<APIEndpoint
title="tracer.get_current_span()"
description="Returns the current span object for direct access to span properties and methods, useful for debugging and inspection."
fullWidth={true}
/>
### Available Span Properties
The current span object provides these properties for inspection and debugging:
<div className="overflow-x-auto">
<table className="min-w-full">
<thead>
<tr className="border-b border-gray-200 dark:border-gray-700">
<th className="text-left py-3 text-sm font-medium text-gray-900 dark:text-gray-100">Property</th>
<th className="text-left py-3 text-sm font-medium text-gray-900 dark:text-gray-100">Type</th>
<th className="text-left py-3 text-sm font-medium text-gray-900 dark:text-gray-100">Description</th>
</tr>
</thead>
<tbody className="divide-y divide-gray-200 dark:divide-gray-700">
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">span_id</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">str</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Unique identifier for this span</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">trace_id</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">str</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">ID of the parent trace</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">function</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">str</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Name of the function being traced</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">span_type</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">str</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Type of span ("span", "tool", "llm", "evaluation", "chain")</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">inputs</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">dict</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Input parameters for this span</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">output</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Any</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Output/result of the span execution</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">duration</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">float</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Execution time in seconds</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">depth</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">int</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Nesting depth in the trace hierarchy</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-400">parent_span_id</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">str | None</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">ID of the parent span (if nested)</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">agent_name</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">str | None</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Name of the agent executing this span</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">has_evaluation</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">bool</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Whether this span has evaluation runs</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">evaluation_runs</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">List[EvaluationRun]</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">List of evaluations run on this span</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">usage</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">TraceUsage | None</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Token usage and cost information</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">error</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Dict[str, Any] | None</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Error information if span failed</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">state_before</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">dict | None</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Agent state before execution</td>
</tr>
<tr>
<td className="py-3 text-sm font-mono text-gray-900 dark:text-gray-100">state_after</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">dict | None</td>
<td className="py-3 text-sm text-gray-600 dark:text-gray-400">Agent state after execution</td>
</tr>
</tbody>
</table>
</div>
### Example Usage
```python
@tracer.observe(span_type="tool")
def debug_tool(query):
    span = tracer.get_current_span()
    if span:
        # Access span properties for debugging
        print(f"🔧 Executing {span.function} (ID: {span.span_id})")
        print(f"📊 Depth: {span.depth}, Type: {span.span_type}")
        print(f"📥 Inputs: {span.inputs}")
        # Check parent relationship
        if span.parent_span_id:
            print(f"👆 Parent span: {span.parent_span_id}")
        # Monitor execution state
        if span.agent_name:
            print(f"🤖 Agent: {span.agent_name}")
    result = perform_search(query)
    # Check span after execution
    if span:
        print(f"📤 Output: {span.output}")
        print(f"⏱️ Duration: {span.duration}s")
        if span.has_evaluation:
            print(f"✅ Has {len(span.evaluation_runs)} evaluations")
        if span.error:
            print(f"❌ Error: {span.error}")
    return result
```
## Getting Started
```python
from judgeval import Tracer

# Initialize tracer
tracer = Tracer(
    api_key="your_api_key",
    project_name="my_agent_project"
)

# Basic function tracing
@tracer.observe(span_type="agent")
def my_agent(query):
    tracer.set_metadata(user_query=query)
    result = process_query(query)
    tracer.log("Processing completed", label="info")
    return result

# Auto-trace LLM calls
from openai import OpenAI
from judgeval import wrap

client = wrap(OpenAI())
response = client.chat.completions.create(...)  # Automatically traced
```
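To close the loop, an online evaluation can run inside the same traced function. A minimal sketch tying the pieces together; `process_query` and the scorer choice are illustrative, and FaithfulnessScorer typically also expects retrieval context:
```python
from judgeval.scorers import FaithfulnessScorer

@tracer.observe(span_type="agent")
def my_agent(query):
    result = process_query(query)  # illustrative downstream call
    # Score this span online as part of the trace
    tracer.async_evaluate(
        scorers=[FaithfulnessScorer()],
        input=query,
        actual_output=result
    )
    tracer.log("Evaluation queued", label="info")
    return result
```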