Online Evals
Run real-time evaluations on your agents in production.
Quickstart
Online evals are exposed as a method on the Tracer class. They can be attached to any trace and are executed asynchronously, adding no latency to your agent's response time.
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from judgeval.data import Example
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="default_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    ).choices[0].message.content

    judgment.async_evaluate(
        scorer=AnswerRelevancyScorer(threshold=0.5),
        example=Example(
            input=task_input,
            actual_output=res
        ),
        model="gpt-4.1"
    )
    return res

if __name__ == "__main__":
    main()
Using Custom Scorers with Online Evals
You can also use custom scorers with online evaluations. Custom scorers currently run locally in your environment: they execute in the same process as your application, but asynchronously, so they do not block your main application flow. When using custom scorers, call judgment.wait_for_completion() to ensure that all evaluations finish before your program exits.
from judgeval.common.tracer import Tracer, wrap
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="default_project")

# Define a custom example class
class CustomerRequest(Example):
    request: str
    response: str

# Define a custom scorer
class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Custom scoring logic
        if "package" in example.response.lower():
            self.reason = "The response addresses the package inquiry"
            return 1.0
        else:
            self.reason = "The response does not address the package inquiry"
            return 0.0

@judgment.observe(span_type="tool")
def get_customer_request():
    return "Where is my package?"

@judgment.observe(span_type="function")
def main():
    customer_request = get_customer_request()

    # Generate response using LLM
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": customer_request}]
    ).choices[0].message.content

    # Run online evaluation with custom scorer
    judgment.async_evaluate(
        scorer=ResolutionScorer(threshold=0.8),
        example=CustomerRequest(
            request=customer_request,
            response=response
        )
    )
    return response

if __name__ == "__main__":
    main()
    # Wait for all evaluations to complete before the program ends
    judgment.wait_for_completion()
You should see the online eval results attached to the relevant trace span on the Judgment platform shortly after the trace is recorded.
Evals can take time to execute, so they may appear slightly delayed on the UI. Once the eval is complete, you should see it attached to your trace like this:

Hosted Scorers
We support hosting your custom evaluation and scoring logic directly on our backend infrastructure. This ensures all scoring computation is fully isolated from your own systems. This approach is ideal once your scoring logic is finalized and you want to offload evaluation workloads, freeing your system’s resources and keeping compute entirely independent.
Move your Custom Scorer
Move your Custom Scorer and any related models or logic into a single Python file.
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

class CustomerRequest(Example):
    request: str
    response: str

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"
    server_hosted: bool = True

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0
Create a requirements file
List any dependencies required to run your scoring logic.
# replace with your own dependencies or leave empty if you don't have any
pydantic
openai
Upload the scorer via CLI
Run the CLI command below to upload your custom scorer logic.
judgeval upload_scorer <path_to_your_scorer> <path_to_your_requirements>
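For example, if the scorer and requirements files from the previous steps were saved as resolution_scorer.py and requirements.txt (hypothetical file names), the upload would look like:

judgeval upload_scorer ./resolution_scorer.py ./requirements.txt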
Use the scorer in production!
# Your function
# (judgment, ResolutionScorer, and CustomerRequest are assumed to be set up as in the previous steps)
@judgment.observe(span_type="tool")
def fake_tool(request: str):
    # resolve_request is a placeholder for your own application logic
    response = resolve_request(request)

    judgment.async_evaluate(
        scorer=ResolutionScorer(),
        example=CustomerRequest(
            request=request,
            response=response
        ),
        model="gpt-4.1"
    )
    return response
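A minimal way to exercise this in a script (a sketch, assuming a hypothetical resolve_request implementation and the setup above):

if __name__ == "__main__":
    # Calling the traced function records the span and queues the hosted eval
    print(fake_tool("Where is my package?"))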