
How to evaluate LLM outputs: A practical guide to AI Evals

You've shipped an AI feature. Your team keeps tweaking prompts, swapping models, and updating retrieval logic. But here's the question: is it actually getting better? Without evals, you're guessing and hoping. This two-part guide covers the practical stuff: what metrics to track, how to set up different eval types, and how to build an eval pipeline that doesn't suck.

What are AI evals?

Evals are systematic ways to measure how well your AI system performs. Think of them as unit tests for AI outputs, except instead of checking if a function returns true, you're checking if your LLM's response is accurate, factual, helpful, or whatever matters for your use case.

But AI outputs are fuzzy. There's rarely one "correct" answer. That's what makes evals both essential and tricky.

Why you can't skip evals

Without evals → with evals:

  • Ship a prompt change and cross your fingers → confidence that changes actually improve your system
  • Users report issues you can't reproduce consistently → regression detection before users find bugs
  • Can't tell if your RAG system is retrieving useful context → measurable retrieval quality you can track over time
  • Someone asks "is the new model better?" and you shrug → data-driven decisions about model selection
  • No way to know if you're moving forward or backward → a feedback loop that actually works

Types of evals and metrics that matter

1. Exact match and pattern-based evals

Fast, deterministic, and underrated. Use these as your first line of defense: they catch obvious errors before you spend compute on more expensive evals.

  • Exact match: predicted == expected. Works for structured outputs (classification labels, entity extraction, SQL queries)
  • Regex/parsing checks: Validate JSON schema compliance, required field presence, format adherence
  • Substring/keyword matching: Check for required elements in generated text
  • Length constraints: Token counts, character limits
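
A minimal sketch of a few of these checks in Python; the order-ID pattern, required fields, and length limit are placeholders for whatever your output format actually requires:

import json
import re

def check_exact_match(predicted: str, expected: str) -> bool:
    # Strict string equality, useful for labels and other structured outputs
    return predicted.strip() == expected.strip()

def check_json_schema(output: str, required_fields: list[str]) -> bool:
    # Validate that the output parses as JSON and contains the required fields
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

def check_format(output: str, pattern: str = r"^ORD-\d{6}$") -> bool:
    # Regex check for format adherence (the order-ID pattern is illustrative)
    return re.fullmatch(pattern, output.strip()) is not None

def check_length(output: str, max_chars: int = 500) -> bool:
    # Enforce a character limit on generated text
    return len(output) <= max_chars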

2. Semantic similarity metrics

When exact match is too strict but you need objectivity:

  • Embedding cosine similarity: Compare output embeddings to reference. Threshold typically 0.7-0.9 depending on strictness
  • BERTScore: Uses contextual embeddings to compare tokens. F1 variant is most useful. Good for machine translation, summarization
  • BLEU/ROUGE: N-gram overlap metrics. BLEU for generation (originally MT), ROUGE for summarization. Useful but limited because high scores don't guarantee quality
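
A pass/fail check built on embedding cosine similarity might look like the sketch below, assuming the sentence-transformers library; the model name and the 0.8 threshold are assumptions you'd tune for your domain:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative; use your own embedding model

def semantic_match(output: str, reference: str, threshold: float = 0.8) -> bool:
    # Pass if cosine similarity between output and reference embeddings clears the threshold
    embeddings = model.encode([output, reference])
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold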

3. LLM-as-judge evals

Use a strong model to evaluate outputs from your production model:

Single-output scoring: Give the judge an output and rubric, get a score. Here’s a simplified example:

Rate this customer support response on a scale of 1-5 for:
- Empathy (addresses customer emotion)
- Accuracy (correct information)
- Completeness (answers all questions)

Response: {output}
Return JSON: {"empathy": X, "accuracy": Y, "completeness": Z}

For accuracy and completeness, you'd also pass the original query and any ground-truth context; it's simplified here for brevity.
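
Wired up as code, a single-output judge might look like this sketch, assuming the OpenAI Python client; the judge model name and rubric are illustrative and you'd swap in your own:

import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """Rate this customer support response on a scale of 1-5 for:
- Empathy (addresses customer emotion)
- Accuracy (correct information)
- Completeness (answers all questions)

Query: {query}
Response: {output}
Return JSON: {{"empathy": X, "accuracy": Y, "completeness": Z}}"""

def judge_response(query: str, output: str) -> dict:
    # Score one output against the rubric and return parsed JSON scores
    completion = client.chat.completions.create(
        model="gpt-4o",                            # illustrative judge model; use a strong one
        temperature=0,                             # reproducibility
        response_format={"type": "json_object"},   # force parseable output
        messages=[{"role": "user", "content": RUBRIC.format(query=query, output=output)}],
    )
    return json.loads(completion.choices[0].message.content)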

Pairwise comparison: Which output is better? More robust than absolute scoring. A simplified prompt:

Which response better answers the user's question?
A: {outputa}
B: {outputb}
Return: A or B

Critique-based: Ask the model to identify specific issues. For example:

List factual errors in this summary:
Original: {source}
Summary: {output}

Key considerations:

  • Use structured outputs (JSON) for consistent parsing
  • Temperature=0 for reproducibility
  • Claude/GPT-5 as judges correlate ~0.85-0.9 with human ratings on many tasks
  • Validate your judge against human labels on a subset
  • Judge models have biases (position bias in pairwise, verbosity bias, self-preference)
  • In pairwise comparisons, judges often prefer whichever response appears second, so randomize order and run both permutations.
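
Here's a sketch of handling position bias by running both orderings; ask_judge is a hypothetical helper that sends a prompt to your judge model and returns "A" or "B":

PAIRWISE_PROMPT = """Which response better answers the user's question?
Question: {question}
A: {first}
B: {second}
Return exactly one letter: A or B."""

def pairwise_winner(question: str, output_1: str, output_2: str, ask_judge) -> str:
    # Run both orderings; only declare a winner if the judge agrees across them
    verdict_1 = ask_judge(PAIRWISE_PROMPT.format(question=question, first=output_1, second=output_2))
    verdict_2 = ask_judge(PAIRWISE_PROMPT.format(question=question, first=output_2, second=output_1))

    if verdict_1 == "A" and verdict_2 == "B":
        return "output_1"   # preferred in both orderings
    if verdict_1 == "B" and verdict_2 == "A":
        return "output_2"
    return "tie"            # judge flipped with position, so treat as a tie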

4. Retrieval metrics

For RAG systems, you need to evaluate both retrieval and generation:

Retrieval-specific:

  • Precision@K: Of K retrieved docs, what % are relevant?
  • Recall@K: Of all relevant docs, what % did you retrieve?
  • MRR (Mean Reciprocal Rank): 1/rank of first relevant result. Rewards getting relevant docs early
  • NDCG (Normalized Discounted Cumulative Gain): Accounts for position and relevance degree
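
These are simple enough to implement directly. A minimal version, treating doc IDs as strings and "relevant" as the ground-truth set for a query:

import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of the top K retrieved docs, what fraction are relevant?
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of all relevant docs, what fraction appear in the top K?
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant document (0 if none retrieved)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevance: dict[str, float], k: int) -> float:
    # relevance maps doc ID -> graded gain (use 1.0 for binary relevance)
    dcg = sum(relevance.get(doc, 0.0) / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0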

On the generation side:

  • Context relevance: Does the retrieved context help answer the query?
  • Answer faithfulness: Is the generated answer grounded in the retrieved context?

5. Task-specific metrics

Code generation:

  • Pass@K: % of problems where at least 1 of K samples passes tests
  • Unit test pass rates
  • Compilation success
  • Execution time, cyclomatic complexity
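
Pass@K has a standard unbiased estimator: generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). A direct implementation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k sampled solutions passes, given c of n passed
    if n - c < k:
        return 1.0  # not enough failing samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 12 pass -> pass@10
# print(pass_at_k(200, 12, 10))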

Classification:

  • Accuracy, precision, recall, F1
  • Confusion matrix analysis
  • Per-class breakdowns
  • Calibration (are confidence scores meaningful?)
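
scikit-learn covers most of this out of the box; a quick sketch with illustrative labels:

from sklearn.metrics import classification_report, confusion_matrix

y_true = ["billing", "billing", "shipping", "refund", "shipping"]   # expected labels
y_pred = ["billing", "refund",  "shipping", "refund", "billing"]    # model outputs

print(classification_report(y_true, y_pred))   # per-class precision, recall, F1
print(confusion_matrix(y_true, y_pred))        # where the model confuses classes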

Summarisation:

  • Coverage: Does summary include key information?
  • Consistency: Factual alignment with source
  • Coherence: Readability and flow
  • Use a mix of ROUGE + LLM judges + human spot-checks
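
For the ROUGE piece, the rouge-score package is a common choice; a minimal sketch with placeholder strings:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The quarterly report shows revenue grew 12% year over year."
generated = "Revenue grew 12% compared to last year, per the quarterly report."

scores = scorer.score(reference, generated)    # reference first, prediction second
print(scores["rougeL"].fmeasure)               # each entry has precision, recall, fmeasure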

Structured extraction:

  • Exact match on extracted entities
  • Partial credit for partially correct extractions
  • Schema compliance
  • Hallucination rate (extracted things not in source)
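
A sketch of entity-level scoring that covers the first, second, and last of these, treating entities as normalized strings (a simplification; real pipelines often need fuzzy matching):

def extraction_scores(predicted: list[str], expected: list[str], source: str) -> dict:
    pred = {e.strip().lower() for e in predicted}
    gold = {e.strip().lower() for e in expected}
    matched = pred & gold

    # Hallucination rate: extracted entities that never appear in the source text
    hallucinated = [e for e in pred if e not in source.lower()]

    return {
        "precision": len(matched) / len(pred) if pred else 0.0,
        "recall": len(matched) / len(gold) if gold else 0.0,
        "hallucination_rate": len(hallucinated) / len(pred) if pred else 0.0,
    }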

Evals might not feel exciting, but they're essential infrastructure: the difference between "we think this works" and "we measured that it works."

Getting started with AI Evals

So how do you get started?

  1. Start with one critical path.
  2. Build 50 test cases.
  3. Pick 2-3 metrics.
  4. Run them before each deployment.
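
In practice, that can be as small as a versioned JSON file of cases and a loop; a minimal sketch, where the file name, fields, and generate callable are assumptions:

import json

def run_evals(generate, test_file: str = "eval_cases.json") -> float:
    # Run every case through the pipeline and report the pass rate
    with open(test_file) as f:
        cases = json.load(f)   # e.g. [{"input": "...", "expected": "..."}, ...]
    passed = sum(
        1
        for case in cases
        if generate(case["input"]).strip() == case["expected"].strip()   # swap in any metric above
    )
    score = passed / len(cases)
    print(f"{passed}/{len(cases)} cases passed ({score:.0%})")
    return score

# Gate deployment on the result, e.g. fail the CI job if the score drops below your baseline.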

You'll catch regressions, make better decisions, and stop arguing about whether changes helped.

The hardest parts of building evals are:

  • Choosing the right metrics (not just easy-to-measure ones)
  • Keeping test sets fresh and representative
  • Actually using eval results to make decisions
  • Balancing quality, cost, and latency

But once you have evals running, you have a feedback loop. And feedback loops are how systems improve.

What's your eval setup? What metrics have you found useful? Where do you struggle?

In Part 2, we'll walk through setting up an automated eval pipeline with CI integration, versioned test sets, and dashboards.

Get in touch to discuss building your evaluation pipeline ->

Author

  • Aarushi Kansal
    AI Tech Director, UK