# Judges
Judges are LLMs used to evaluate a component of your LLM system. `judgeval`'s LLM judge scorers, such as `AnswerRelevancyScorer`, use judge models to execute evaluations.
A good judge model should be able to evaluate your LLM system's performance with high consistency and strong alignment with human preferences.
`judgeval` allows you to pick from a variety of leading judge models, or you can use your own custom judge!
## OpenAI Judge Models
Both `judgeval` (Python) and `judgeval-js` (TypeScript) support OpenAI models (like the GPT family) for evaluations. In Python, this is handled via LiteLLM integration. In TypeScript, the built-in `DefaultJudge` is used.

You simply pass the model name (e.g., `"gpt-4.1"`) to the `model` parameter in your evaluation call:
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

example1 = Example(
    input="What's the weather like in Philly?",
    actual_output="It's sunny and 70 degrees in Philadelphia today."
)

results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1"  # Uses LiteLLM
)
```
```typescript
import { JudgmentClient, ExampleBuilder, AnswerRelevancyScorer, logger } from 'judgeval';

const client = JudgmentClient.getInstance();

const example1 = new ExampleBuilder()
    .input("What's the weather like in Philly?")
    .actualOutput("It's sunny and 70 degrees in Philadelphia today.")
    .build();

async function runOpenAIJudge() {
    const results = await client.evaluate({
        examples: [example1],
        scorers: [new AnswerRelevancyScorer(0.5)],
        model: "gpt-4.1", // Uses DefaultJudge internally
        projectName: "openai-judge-ts-proj",
        evalName: "openai-judge-ts-eval"
    });
    logger.print(results);
}

runOpenAIJudge();
```
## TogetherAI / Open Source Judge Models
`judgeval` also supports a variety of popular open-source judge models. In Python, this uses LiteLLM with TogetherAI inference. In TypeScript, the built-in `TogetherJudge` is used. This includes models from the Llama, Mistral, Qwen, and DeepSeek families available via TogetherAI.

To use one, pass the model name (e.g., `"meta-llama/Meta-Llama-3-8B-Instruct-Turbo"`) to the `model` parameter:
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

example1 = Example(
    input="What's the weather like in Philly?",
    actual_output="It's sunny and 70 degrees in Philadelphia today."
)

results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="Qwen/Qwen2.5-72B-Instruct-Turbo"  # Uses LiteLLM + TogetherAI
)
```
```typescript
import { JudgmentClient, ExampleBuilder, AnswerRelevancyScorer, logger } from 'judgeval';

const client = JudgmentClient.getInstance();

const example1 = new ExampleBuilder()
    .input("What's the weather like in Philly?")
    .actualOutput("It's sunny and 70 degrees in Philadelphia today.")
    .build();

async function runTogetherJudge() {
    const results = await client.evaluate({
        examples: [example1],
        scorers: [new AnswerRelevancyScorer(0.5)],
        // Uses TogetherJudge internally for known OSS prefixes/names
        model: "meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
        projectName: "together-judge-ts-proj",
        evalName: "together-judge-ts-eval"
    });
    logger.print(results);
}

runTogetherJudge();
```
## Use Your Own Judge Model
If you have a custom model or need to integrate with a different API (e.g., Vertex AI), you can implement your own judge. In Python, this involves inheriting from the `judgevalJudge` base class and implementing the required methods. In TypeScript, you implement the `Judge` interface.
```python
import vertexai
from vertexai.generative_models import GenerativeModel

from typing import List

from judgeval.judges import judgevalJudge

PROJECT_ID = "<YOUR PROJECT ID>"
vertexai.init(project=PROJECT_ID, location="<REGION NAME>")

class VertexAIJudge(judgevalJudge):
    def __init__(self, model_name: str = "gemini-1.5-flash-002"):
        super().__init__(model_name=model_name)
        self.model = self.load_model()

    def load_model(self):
        return GenerativeModel(self.model_name)

    def generate(self, prompt: List[dict]) -> str:
        # For models that don't support chat history, we need to convert
        # the message list to a string. If your model supports chat
        # history, you can just pass the prompt directly.
        response = self.model.generate_content(str(prompt))
        return response.text

    async def a_generate(self, prompt: List[dict]) -> str:
        response = await self.model.generate_content_async(str(prompt))
        return response.text

    def get_model_name(self) -> str:
        return self.model_name

# Usage (example)
...
custom_judge = VertexAIJudge()
results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model=custom_judge
)
```
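Note that `str(prompt)` in `generate` serializes the raw message list, which works but yields noisy prompts. Here is a minimal sketch of a cleaner conversion; it assumes the prompt arrives as OpenAI-style `{"role": ..., "content": ...}` dicts (check the shape your scorers actually pass), and the helper name `flatten_chat` is hypothetical:

```python
from typing import List

def flatten_chat(prompt: List[dict]) -> str:
    # Assumption: each message is an OpenAI-style dict with "role" and
    # "content" keys; adjust if your scorers pass a different shape.
    return "\n".join(f"{m['role']}: {m['content']}" for m in prompt)
```

You could then call `self.model.generate_content(flatten_chat(prompt))` in place of `str(prompt)`.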
```typescript
import { Judge, JudgmentClient, ExampleBuilder, AnswerRelevancyScorer, logger } from 'judgeval';

// 1. Implement the Judge interface
class MyCustomJudge implements Judge {
    private modelName: string;

    constructor(modelName: string = "my-custom-model/v1") {
        this.modelName = modelName;
    }

    // Synchronous generation (example using a placeholder)
    generate(prompt: string): string {
        console.warn("MyCustomJudge synchronous generate is just a placeholder.");
        // Replace with an actual synchronous API call if needed
        return `Sync response for: ${prompt} from ${this.modelName}`;
    }

    // Asynchronous generation (example using a placeholder)
    async aGenerate(prompt: string): Promise<string> {
        // Replace with your actual async API call logic
        console.log(`Making async call for prompt: ${prompt} with ${this.modelName}`);
        await new Promise(resolve => setTimeout(resolve, 50)); // Simulate network delay
        return `Async response for: ${prompt} from ${this.modelName}`;
    }

    getModelName(): string {
        return this.modelName;
    }
}

// 2. Use the custom judge instance in evaluation
async function useCustomJudge() {
    const client = JudgmentClient.getInstance();
    const example1 = new ExampleBuilder()
        .input("What's the weather like in Philly?")
        .actualOutput("It's sunny and 70 degrees in Philadelphia today.")
        .build();
    const customJudge = new MyCustomJudge();

    const results = await client.evaluate({
        examples: [example1],
        scorers: [new AnswerRelevancyScorer(0.5)],
        judge: customJudge,
        projectName: "custom-judge-ts-proj",
        evalName: "custom-judge-ts-eval"
    });
    logger.print(results);
}

useCustomJudge();
```
When providing a custom judge instance (like `VertexAIJudge` in Python or `MyCustomJudge` in TypeScript), pass the instance directly to the `model` parameter (Python) or the `judge` option (TypeScript) in the evaluation call. The built-in judges (`DefaultJudge`, `TogetherJudge`) are used automatically when you pass a model name string (like `"gpt-4o"` or `"meta-llama/..."`) to the `model` option in TypeScript.
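To recap, a minimal Python sketch of both options, reusing `client`, `example1`, `AnswerRelevancyScorer`, and the `VertexAIJudge` class from the examples above:

```python
# Option 1: pass a model name string; judgeval resolves it to a
# built-in judge (via LiteLLM in Python).
results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)

# Option 2: pass a custom judge instance directly to `model`.
results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model=VertexAIJudge(),
)
```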