
Judges

Judges are LLMs used to evaluate a component of your LLM system. judgeval's LLM judge scorers, such as AnswerRelevancyScorer, use judge models to execute evaluations.

A good judge model should evaluate your LLM system's performance with high consistency and strong alignment with human preferences. judgeval lets you pick from a variety of leading judge models, or you can plug in your own custom judge.

OpenAI Judge Models

Both judgeval (Python) and judgeval-js (TypeScript) support OpenAI models (like the GPT family) for evaluations.

In Python, this is handled via LiteLLM integration. In TypeScript, the built-in DefaultJudge is used.

You simply pass the model name (e.g., "gpt-4o") to the model parameter in your evaluation call:

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()
example1 = Example(input="Q1", actual_output="A1")

results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4o"  # Uses LiteLLM
)

TogetherAI / Open Source Judge Models

judgeval also supports a variety of popular open-source judge models.

In Python, this uses LiteLLM with TogetherAI inference; in TypeScript, the built-in TogetherJudge is used. Supported models include the Llama, Mistral, Qwen, and DeepSeek families available through TogetherAI.

To use one, pass the model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct-Turbo") to the model parameter:

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()
example1 = Example(input="Q1", actual_output="A1")

results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="Qwen/Qwen2.5-72B-Instruct-Turbo"  # Uses LiteLLM + TogetherAI
)

Use Your Own Judge Model

If you have a custom model or need to integrate with a different API (e.g., Vertex AI), you can implement your own judge.

In Python, this involves inheriting from the judgevalJudge base class and implementing the required methods. In TypeScript, you implement the Judge interface.

import vertexai
from vertexai.generative_models import GenerativeModel
from judgeval.judges import judgevalJudge
from typing import List

PROJECT_ID = "<YOUR PROJECT ID>"
vertexai.init(project=PROJECT_ID, location="<REGION NAME>")

class VertexAIJudge(judgevalJudge):

    def __init__(self, model_name: str = "gemini-1.5-flash-002"):
        super().__init__(model_name=model_name)
        self.model = self.load_model()

    def load_model(self):
        # Load the model once and reuse it across calls
        return GenerativeModel(self.model_name)

    def generate(self, prompt: List[dict]) -> str:
        # For models that don't support chat history, convert the
        # message list to a string. If your model supports chat
        # history, you can pass the prompt through directly.
        response = self.model.generate_content(str(prompt))
        return response.text

    async def a_generate(self, prompt: List[dict]) -> str:
        response = await self.model.generate_content_async(str(prompt))
        return response.text

    def get_model_name(self) -> str:
        return self.model_name
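
If your backend does not accept chat-style message lists, you may prefer a more readable flattening than str(prompt). A minimal sketch (flatten_chat_prompt is a hypothetical helper, not part of judgeval):

```python
from typing import List


def flatten_chat_prompt(messages: List[dict]) -> str:
    # Flatten a chat-style prompt (a list of {"role": ..., "content": ...}
    # dicts) into a single plain string, rendering each message as
    # "role: content" on its own line.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```

You could then call self.model.generate_content(flatten_chat_prompt(prompt)) inside generate instead of stringifying the raw list.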

# Usage
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()
example1 = Example(input="Q1", actual_output="A1")
custom_judge = VertexAIJudge()

results = client.run_evaluation(
    examples=[example1],
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model=custom_judge  # Pass the custom judge instance
)

When providing a custom judge instance (like VertexAIJudge in Python or MyCustomJudge in TypeScript), pass the instance directly to the model parameter (Python) or the judge option (TypeScript) in the evaluation call. When you instead pass a model name string (like "gpt-4o" or "meta-llama/...") to the model option in TypeScript, the appropriate built-in judge (DefaultJudge or TogetherJudge) is selected automatically.