*These notes come from / are inspired by this Kaggle notebook.*

We can ask models to evaluate themselves and other models by creating “evaluation agents.” Just as we would when asking humans to evaluate something, we need to provide our evaluation agents with a clear definition of the task to be evaluated as well as an assessment rubric.

So, if we were doing this, we might first construct a prompt that tells the evaluation agent its role, describes the rubric, and provides a template for the input/output:

# Define the evaluation prompt
SUMMARY_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully to analyze the task, and then evaluate the quality of the response based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.
 
# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a summarization task and the context to be summarized are provided in the user prompt. The response should be shorter than the text in the context. The response should not contain information that is not present in the context.
 
## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information and without being too verbose or terse.
Fluency: The response is well-organized and easy to read.
 
## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, is concise, and fluent.
4: (Good). The summary follows instructions, is grounded, and is concise, but it is not very fluent.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.
 
## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, and fluency according to the criteria.
STEP 2: Score based on the rubric.
 
# User Inputs and AI-generated Response
## User Inputs
 
### Prompt
{prompt}
 
## AI-generated Response
{response}
"""

Then we might create an Enum (as described in the Prompting notes) to constrain the output:

import enum

# Define a structured enum class to capture the result.
class SummaryRating(enum.Enum):
  VERY_GOOD = '5'
  GOOD = '4'
  OK = '3'
  BAD = '2'
  VERY_BAD = '1'

And then we might define a function that glues everything together and performs the appropriate request (see this Kaggle notebook for the full example):

from google import genai
from google.genai import types
from IPython.display import Markdown

# Set up the API client (done earlier in the notebook); Client() picks the API
# key up from the environment (e.g. GOOGLE_API_KEY).
client = genai.Client()


def eval_summary(prompt, ai_response):
  """Evaluate the generated summary against the prompt used."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the full text response.
  response = chat.send_message(
      message=SUMMARY_PROMPT.format(prompt=prompt, response=ai_response)
  )
  verbose_eval = response.text

  # Coerce into the desired structure.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=SummaryRating,
  )
  response = chat.send_message(
      message="Convert the final score.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval


# `request`, `document_file`, and `summary` are produced earlier in the notebook.
text_eval, struct_eval = eval_summary(
    prompt=[request, document_file], ai_response=summary
)
Markdown(text_eval)

One thing to keep in mind here is that the evaluation is tied to the prompt. So as we change the prompt, we’re also changing what’s being evaluated. Imagine we have a document describing the technical specs of the Gemini 1.5 model. We could prompt an LLM with

  • “Summarize the model training process.”
  • “ELI5 the training process.”
  • “Describe the model architecture to someone with a degree in psychology.”

The model will produce a different summary for each prompt, and presumably we’d get a different evaluation for each of those summaries.
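
To make that concrete, here is a rough sketch (not verbatim from the notebook; it assumes the client, document_file, and eval_summary defined above) that generates and evaluates a summary for each of those instructions:

# Rough sketch: generate and then evaluate a summary for each instruction.
# Assumes `client`, `document_file`, and `eval_summary` from the code above.
instructions = [
    "Summarize the model training process.",
    "ELI5 the training process.",
    "Describe the model architecture to someone with a degree in psychology.",
]

for instruction in instructions:
  # Produce a summary of the document for this instruction.
  summary = client.models.generate_content(
      model='gemini-2.0-flash',
      contents=[instruction, document_file],
  ).text

  # Evaluate that summary against the same instruction.
  text_eval, struct_eval = eval_summary(
      prompt=[instruction, document_file], ai_response=summary
  )
  print(instruction, struct_eval)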

Pointwise Evaluation

Pointwise evaluation refers to evaluating a single input/output pair against some criteria; the eval_summary function above is an example. This is useful for evaluating outputs in an absolute “is this good, bad, or somewhere in between” sense.
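
Because pointwise ratings are absolute, they can be aggregated directly. As an illustrative sketch (the numeric mapping and the averaging helper below are my own, not part of the notebook), we could map the SummaryRating enum onto numbers and average scores across documents or repeated runs:

# Illustrative helper: map enum ratings onto numbers so they can be averaged.
RATING_VALUE = {
    SummaryRating.VERY_GOOD: 5,
    SummaryRating.GOOD: 4,
    SummaryRating.OK: 3,
    SummaryRating.BAD: 2,
    SummaryRating.VERY_BAD: 1,
}


def average_rating(structured_evals):
  """Average a list of SummaryRating values into a single pointwise score."""
  return sum(RATING_VALUE[r] for r in structured_evals) / len(structured_evals)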

Pairwise Evaluation

Another approach to evaluating responses is to use a pairwise evaluation, which compares two outputs against each other. This approach is useful in that it allows us to rank-order our prompts rather than assigning them a point score.

To construct a pairwise evaluation prompt, we might use the following (see this Kaggle notebook for a complete example of running the code):

QA_PAIRWISE_PROMPT = """\
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user input and a pair of AI-generated responses (Response A and Response B). You should first read the user input carefully to analyze the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.

You will first judge the responses individually, following the Rating Rubric and Evaluation Steps. Then you will give step-by-step explanations for your judgment and compare the results to declare the winner based on the Rating Rubric and Evaluation Steps.
 
# Evaluation
## Metric Definition
You will be assessing question answering quality, which measures the overall quality of the answer to the question in the user prompt. Pay special attention to length constraints, such as in X words or in Y sentences. The instruction for performing a question-answering task is provided in the user prompt. The response should not contain information that is not present in the context (if it is provided).
 
## Criteria
Instruction following: The response demonstrates a clear understanding of the question answering task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context if the context is present in the user prompt. The response does not reference any outside information.
Completeness: The response completely answers the question with sufficient detail.
Fluency: The response is well-organized and easy to read.
 
## Rating Rubric
"A": Response A answers the given question as per the criteria better than response B.
"SAME": Response A and B answers the given question equally well as per the criteria.
"B": Response B answers the given question as per the criteria better than response A.
 
## Evaluation Steps
STEP 1: Analyze Response A based on the question answering quality criteria: Determine how well Response A fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 2: Analyze Response B based on the question answering quality criteria: Determine how well Response B fulfills the user requirements, is grounded in the context, is complete and fluent, and provides assessment according to the criterion.
STEP 3: Compare the overall performance of Response A and Response B based on your analyses and assessment.
STEP 4: Output your preference of "A", "SAME" or "B" to the pairwise_choice field according to the Rating Rubric.
STEP 5: Output your assessment reasoning in the explanation field.
 
# User Inputs and AI-generated Responses
## User Inputs
### Prompt
{prompt}
 
## AI-generated Responses
 
### Response A
{baseline_model_response}
 
### Response B
{response}
"""
 
 
# Structured enum class to capture the pairwise verdict.
class AnswerComparison(enum.Enum):
  A = 'A'
  SAME = 'SAME'
  B = 'B'
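
As with the summarization case, we would then wire this together with a helper that first elicits the verbose judgment and then coerces the verdict into the enum. Here is a sketch modeled on eval_summary above (the name eval_pairwise and its arguments are assumptions, not the notebook’s exact code):

def eval_pairwise(prompt, response_a, response_b):
  """Compare two answers to the same prompt and declare a winner."""

  chat = client.chats.create(model='gemini-2.0-flash')

  # Generate the verbose, step-by-step judgment.
  response = chat.send_message(
      message=QA_PAIRWISE_PROMPT.format(
          prompt=prompt,
          baseline_model_response=response_a,
          response=response_b,
      )
  )
  verbose_eval = response.text

  # Coerce the verdict into the AnswerComparison enum.
  structured_output_config = types.GenerateContentConfig(
      response_mime_type="text/x.enum",
      response_schema=AnswerComparison,
  )
  response = chat.send_message(
      message="Convert the final verdict.",
      config=structured_output_config,
  )
  structured_eval = response.parsed

  return verbose_eval, structured_eval

With a helper like this, candidate prompts (or models) can be rank-ordered by comparing their outputs pairwise and sorting by wins, which is what the pairwise approach buys us over a single pointwise score.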