Judges
Judges are LLMs that are used to evaluate a component of your LLM system. judgeval
's LLM judge scorers, such as
AnswerRelevancyScorer
, use judge models to execute evaluations.
A good judge model should be able to evaluate your LLM system performance with high consistency and alignment with human preferences.
judgeval
allows you to pick from a variety of leading judge models, or you can use your own custom judge!
OpenAI Judge Models
Both judgeval
(Python) and judgeval-js
(TypeScript) support OpenAI models (like the GPT family) for evaluations.
In Python, this is handled via LiteLLM integration. In TypeScript, the built-in DefaultJudge
is used.
You simply pass the model name (e.g., "gpt-4o") to the model
parameter in your evaluation call:
from judgeval import JudgmentClient # Added import
from judgeval.data import Example # Added import
from judgeval.scorers import AnswerRelevancyScorer # Added import
client = JudgmentClient()
example1 = Example(input="Q1", actual_output="A1")
results = client.run_evaluation(
examples=[example1],
scorers=[AnswerRelevancyScorer(threshold=0.5)],
model="gpt-4o" # Uses LiteLLM
)
TogetherAI / Open Source Judge Models
judgeval
also supports a variety of popular open-source judge models.
In Python, this uses LiteLLM with TogetherAI inference. In TypeScript, the built-in TogetherJudge
is used.
This includes models like the Llama, Mistral, QWEN, and DeepSeek families available via TogetherAI.
To use one, pass the model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct-Turbo") to the model
parameter:
from judgeval import JudgmentClient # Added import
from judgeval.data import Example # Added import
from judgeval.scorers import AnswerRelevancyScorer # Added import
client = JudgmentClient()
example1 = Example(input="Q1", actual_output="A1")
results = client.run_evaluation(
examples=[example1],
scorers=[AnswerRelevancyScorer(threshold=0.5)],
model="Qwen/Qwen2.5-72B-Instruct-Turbo" # Uses LiteLLM + TogetherAI
)
Use Your Own Judge Model
If you have a custom model or need to integrate with a different API (e.g., Vertex AI), you can implement your own judge.
In Python, this involves inheriting from the judgevalJudge
base class and implementing the required methods. In TypeScript, you implement the Judge
interface.
import vertexai
from vertexai.generative_models import GenerativeModel
from judgeval.judges import judgevalJudge # Assuming import path
from typing import List # Added import
PROJECT_ID = "<YOUR PROJECT ID>"
vertexai.init(project=PROJECT_ID, location="<REGION NAME>")
class VertexAIJudge(judgevalJudge):
def __init__(self, model_name: str = "gemini-1.5-flash-002"):
super().__init__(model_name=model_name) # Call super init
self.model = self.load_model() # Load model in init
def load_model(self):
# It's generally better to load the model once
return GenerativeModel(self.model_name)
def generate(self, prompt: List[dict]) -> str:
# For models that don't support chat history, we need to convert to string
# If your model supports chat history, you can just pass the prompt directly
response = self.model.generate_content(str(prompt))
return response.text
async def a_generate(self, prompt: List[dict]) -> str:
response = await self.model.generate_content_async(str(prompt))
return response.text
def get_model_name(self) -> str:
return self.model_name
# Usage (Example)
# from judgeval import JudgmentClient, Example, AnswerRelevancyScorer
# client = JudgmentClient()
# example1 = Example(input="Q1", actual_output="A1")
# custom_judge = VertexAIJudge()
# results = client.run_evaluation(
# examples=[example1],
# scorers=[AnswerRelevancyScorer(threshold=0.5)],
# model=custom_judge # Pass the custom judge instance
# )
VertexAIJudge
in Python or MyCustomJudge
in TypeScript), pass the instance directly to the model
parameter (Python) or the judge
option (TypeScript) in the evaluation call. The built-in judges (DefaultJudge
, TogetherJudge
) are used automatically when you pass a model name string (like "gpt-4o" or "meta-llama/...") to the model
option in TypeScript.