# Judges
Judges are LLMs used to evaluate a component of your LLM system. `judgeval`'s LLM judge scorers, such as `AnswerRelevancyScorer`, use judge models to execute evaluations.
A good judge model should be able to evaluate your LLM system's performance with high consistency and strong alignment with human preferences.
`judgeval` allows you to pick from a variety of leading judge models, or you can use your own custom judge!
## OpenAI Judge Models
Both `judgeval` (Python) and `judgeval-js` (TypeScript) support OpenAI models (like the GPT family) for evaluations. In Python, this is handled via LiteLLM integration. In TypeScript, the built-in `DefaultJudge` is used.

You simply pass the model name (e.g., `"gpt-4.1"`) to the `model` parameter in your evaluation call:
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

example1 = Example(
    input="What's the weather like in Philly?",
    actual_output="It's sunny and 70 degrees in Philadelphia today."
)

results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1"  # Uses LiteLLM
)
```
```typescript
import { JudgmentClient, ExampleBuilder, AnswerRelevancyScorer, logger } from 'judgeval';

const client = JudgmentClient.getInstance();

const example1 = new ExampleBuilder()
    .input("What's the weather like in Philly?")
    .actualOutput("It's sunny and 70 degrees in Philadelphia today.")
    .build();

async function runOpenAIJudge() {
    const results = await client.evaluate({
        examples: [example1],
        scorers: [new AnswerRelevancyScorer(0.5)],
        model: "gpt-4.1", // Uses DefaultJudge internally
        projectName: "openai-judge-ts-proj",
        evalName: "openai-judge-ts-eval"
    });
    logger.print(results);
}

runOpenAIJudge();
```
## TogetherAI / Open Source Judge Models
`judgeval` also supports a variety of popular open-source judge models. In Python, this uses LiteLLM with TogetherAI inference. In TypeScript, the built-in `TogetherJudge` is used. This includes models from the Llama, Mistral, Qwen, and DeepSeek families available via TogetherAI.

To use one, pass the model name (e.g., `"meta-llama/Meta-Llama-3-8B-Instruct-Turbo"`) to the `model` parameter:
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

example1 = Example(
    input="What's the weather like in Philly?",
    actual_output="It's sunny and 70 degrees in Philadelphia today."
)

results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="Qwen/Qwen2.5-72B-Instruct-Turbo"  # Uses LiteLLM + TogetherAI
)
```
```typescript
import { JudgmentClient, ExampleBuilder, AnswerRelevancyScorer, logger } from 'judgeval';

const client = JudgmentClient.getInstance();

const example1 = new ExampleBuilder()
    .input("What's the weather like in Philly?")
    .actualOutput("It's sunny and 70 degrees in Philadelphia today.")
    .build();

async function runTogetherJudge() {
    const results = await client.evaluate({
        examples: [example1],
        scorers: [new AnswerRelevancyScorer(0.5)],
        // Uses TogetherJudge internally for known OSS prefixes/names
        model: "meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
        projectName: "together-judge-ts-proj",
        evalName: "together-judge-ts-eval"
    });
    logger.print(results);
}

runTogetherJudge();
```
## Use Your Own Judge Model
If you have a custom model or need to integrate with a different API (e.g., Vertex AI), you can implement your own judge. In Python, this involves inheriting from the `judgevalJudge` base class and implementing the required methods. In TypeScript, you implement the `Judge` interface.
```python
import vertexai
from vertexai.generative_models import GenerativeModel

from typing import List

from judgeval.judges import judgevalJudge

PROJECT_ID = "<YOUR PROJECT ID>"
vertexai.init(project=PROJECT_ID, location="<REGION NAME>")

class VertexAIJudge(judgevalJudge):
    def __init__(self, model_name: str = "gemini-1.5-flash-002"):
        super().__init__(model_name=model_name)
        self.model = self.load_model()

    def load_model(self):
        return GenerativeModel(self.model_name)

    def generate(self, prompt: List[dict]) -> str:
        # For models that don't support chat history, we need to convert
        # the message list to a string. If your model supports chat
        # history, you can just pass the prompt directly.
        response = self.model.generate_content(str(prompt))
        return response.text

    async def a_generate(self, prompt: List[dict]) -> str:
        response = await self.model.generate_content_async(str(prompt))
        return response.text

    def get_model_name(self) -> str:
        return self.model_name

# Usage (example)
...
custom_judge = VertexAIJudge()
results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model=custom_judge
)
```
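Note that `str(prompt)` in `generate` serializes the raw message list, which works but yields noisy prompts. Here is a minimal sketch of a cleaner conversion; it assumes the prompt arrives as OpenAI-style `{"role": ..., "content": ...}` dicts (check the shape your scorers actually pass), and the helper name `flatten_chat` is hypothetical:

```python
from typing import List

def flatten_chat(prompt: List[dict]) -> str:
    # Assumption: each message is an OpenAI-style dict with "role" and
    # "content" keys; adjust if your scorers pass a different shape.
    return "\n".join(f"{m['role']}: {m['content']}" for m in prompt)
```

You could then call `self.model.generate_content(flatten_chat(prompt))` in place of `str(prompt)`.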
```typescript
import { Judge, JudgmentClient, ExampleBuilder, AnswerRelevancyScorer, logger } from 'judgeval';

// 1. Implement the Judge interface
class MyCustomJudge implements Judge {
    private modelName: string;

    constructor(modelName: string = "my-custom-model/v1") {
        this.modelName = modelName;
    }

    // Synchronous generation (example using a placeholder)
    generate(prompt: string): string {
        console.warn("MyCustomJudge synchronous generate is just a placeholder.");
        // Replace with an actual synchronous API call if needed
        return `Sync response for: ${prompt} from ${this.modelName}`;
    }

    // Asynchronous generation (example using a placeholder)
    async aGenerate(prompt: string): Promise<string> {
        // Replace with your actual async API call logic
        console.log(`Making async call for prompt: ${prompt} with ${this.modelName}`);
        await new Promise(resolve => setTimeout(resolve, 50)); // Simulate network delay
        return `Async response for: ${prompt} from ${this.modelName}`;
    }

    getModelName(): string {
        return this.modelName;
    }
}

// 2. Use the custom judge instance in evaluation
async function useCustomJudge() {
    const client = JudgmentClient.getInstance();
    const example1 = new ExampleBuilder()
        .input("What's the weather like in Philly?")
        .actualOutput("It's sunny and 70 degrees in Philadelphia today.")
        .build();
    const customJudge = new MyCustomJudge();

    const results = await client.evaluate({
        examples: [example1],
        scorers: [new AnswerRelevancyScorer(0.5)],
        judge: customJudge,
        projectName: "custom-judge-ts-proj",
        evalName: "custom-judge-ts-eval"
    });
    logger.print(results);
}

useCustomJudge();
```
When providing a custom judge instance (like `VertexAIJudge` in Python or `MyCustomJudge` in TypeScript), pass the instance directly to the `model` parameter (Python) or the `judge` option (TypeScript) in the evaluation call. The built-in judges (`DefaultJudge`, `TogetherJudge`) are used automatically when you pass a model name string (like `"gpt-4o"` or `"meta-llama/..."`) to the `model` option in TypeScript.
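To recap, a minimal Python sketch of both options, reusing `client`, `example1`, `AnswerRelevancyScorer`, and the `VertexAIJudge` class from the examples above:

```python
# Option 1: pass a model name string; judgeval resolves it to a
# built-in judge (via LiteLLM in Python).
results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)

# Option 2: pass a custom judge instance directly to `model`.
results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model=VertexAIJudge(),
)
```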