
How to evaluate LLM outputs: A practical guide to AI Evals

You've shipped an AI feature. Your team keeps tweaking prompts, swapping models, and updating retrieval logic. But here's the question: is it actually getting better? Without evals, you're guessing and hoping. This two-part guide covers the practical stuff: what metrics to track, how to set up different eval types, and how to build an eval pipeline that doesn't suck.

What are AI evals?

Evals are systematic ways to measure how well your AI system performs. Think of them as unit tests for AI outputs, except instead of checking if a function returns true, you're checking if your LLM's response is accurate, factual, helpful, or whatever matters for your use case.

But AI outputs are fuzzy. There's rarely one "correct" answer. That's what makes evals both essential and tricky.

Why you can't skip evals

Without evals → with evals:

  • Ship a prompt change and cross your fingers → confidence that changes actually improve your system
  • Users report issues you can't reproduce consistently → regression detection before users find bugs
  • Can't tell if your RAG system is retrieving useful context → measurable retrieval quality you can track over time
  • Someone asks "is the new model better?" and you shrug → data-driven decisions about model selection
  • No way to know if you're moving forward or backward → a feedback loop that actually works

Types of evals and metrics that matter

1. Exact match and pattern-based evals

Fast, deterministic, and underrated. Use these as your first line of defense: they catch obvious errors before you spend compute on more expensive evals.

  • Exact match: predicted == expected. Works for structured outputs (classification labels, entity extraction, SQL queries)
  • Regex/parsing checks: Validate JSON schema compliance, required field presence, format adherence
  • Substring/keyword matching: Check for required elements in generated text
  • Length constraints: Token counts, character limits
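
A minimal sketch of a few of these checks in Python; the order-ID pattern, required fields, and length limit are placeholders for whatever your output format actually requires:

import json
import re

def check_exact_match(predicted: str, expected: str) -> bool:
    # Strict string equality, useful for labels and other structured outputs
    return predicted.strip() == expected.strip()

def check_json_schema(output: str, required_fields: list[str]) -> bool:
    # Validate that the output parses as JSON and contains the required fields
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

def check_format(output: str, pattern: str = r"^ORD-\d{6}$") -> bool:
    # Regex check for format adherence (the order-ID pattern is illustrative)
    return re.fullmatch(pattern, output.strip()) is not None

def check_length(output: str, max_chars: int = 500) -> bool:
    # Enforce a character limit on generated text
    return len(output) <= max_chars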

2. Semantic similarity metrics

When exact match is too strict but you need objectivity:

  • Embedding cosine similarity: Compare output embeddings to reference. Threshold typically 0.7-0.9 depending on strictness
  • BERTScore: Uses contextual embeddings to compare tokens. F1 variant is most useful. Good for machine translation, summarization
  • BLEU/ROUGE: N-gram overlap metrics. BLEU for generation (originally MT), ROUGE for summarization. Useful but limited because high scores don't guarantee quality
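
A pass/fail check built on embedding cosine similarity might look like the sketch below, assuming the sentence-transformers library; the model name and the 0.8 threshold are assumptions you'd tune for your domain:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative; use your own embedding model

def semantic_match(output: str, reference: str, threshold: float = 0.8) -> bool:
    # Pass if cosine similarity between output and reference embeddings clears the threshold
    embeddings = model.encode([output, reference])
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold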

3. LLM-as-judge evals

Use a strong model to evaluate outputs from your production model:

Single-output scoring: Give the judge an output and rubric, get a score. Here’s a simplified example:

Rate this customer support response on a scale of 1-5 for:
- Empathy (addresses customer emotion)
- Accuracy (correct information)
- Completeness (answers all questions)

Response: {output}
Return JSON: {"empathy": X, "accuracy": Y, "completeness": Z}

For accuracy and completeness, you'd also pass the original query and any ground-truth context; it's simplified here for brevity.
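
Wired up as code, a single-output judge might look like this sketch, assuming the OpenAI Python client; the judge model name and rubric are illustrative and you'd swap in your own:

import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """Rate this customer support response on a scale of 1-5 for:
- Empathy (addresses customer emotion)
- Accuracy (correct information)
- Completeness (answers all questions)

Query: {query}
Response: {output}
Return JSON: {{"empathy": X, "accuracy": Y, "completeness": Z}}"""

def judge_response(query: str, output: str) -> dict:
    # Score one output against the rubric and return parsed JSON scores
    completion = client.chat.completions.create(
        model="gpt-4o",                            # illustrative judge model; use a strong one
        temperature=0,                             # reproducibility
        response_format={"type": "json_object"},   # force parseable output
        messages=[{"role": "user", "content": RUBRIC.format(query=query, output=output)}],
    )
    return json.loads(completion.choices[0].message.content)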

Pairwise comparison: Which output is better? More robust than absolute scoring. A simplified prompt:

Which response better answers the user's question?
A: {outputa}
B: {outputb}
Return: A or B

Critique-based: Ask the model to identify specific issues. For example:

List factual errors in this summary:
Original: {source}
Summary: {output}

Key considerations:

  • Use structured outputs (JSON) for consistent parsing
  • Temperature=0 for reproducibility
  • Claude/GPT-5 as judges correlate ~0.85-0.9 with human ratings on many tasks
  • Validate your judge against human labels on a subset
  • Judge models have biases (position bias in pairwise, verbosity bias, self-preference)
  • In pairwise comparisons, judges often prefer whichever response appears second, so randomize order and run both permutations.
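
Here's a sketch of handling position bias by running both orderings; ask_judge is a hypothetical helper that sends a prompt to your judge model and returns "A" or "B":

PAIRWISE_PROMPT = """Which response better answers the user's question?
Question: {question}
A: {first}
B: {second}
Return exactly one letter: A or B."""

def pairwise_winner(question: str, output_1: str, output_2: str, ask_judge) -> str:
    # Run both orderings; only declare a winner if the judge agrees across them
    verdict_1 = ask_judge(PAIRWISE_PROMPT.format(question=question, first=output_1, second=output_2))
    verdict_2 = ask_judge(PAIRWISE_PROMPT.format(question=question, first=output_2, second=output_1))

    if verdict_1 == "A" and verdict_2 == "B":
        return "output_1"   # preferred in both orderings
    if verdict_1 == "B" and verdict_2 == "A":
        return "output_2"
    return "tie"            # judge flipped with position, so treat as a tie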

4. Retrieval metrics

For RAG systems, you need to evaluate both retrieval and generation:

Retrieval-specific:

  • Precision@K: Of K retrieved docs, what % are relevant?
  • Recall@K: Of all relevant docs, what % did you retrieve?
  • MRR (Mean Reciprocal Rank): 1/rank of first relevant result. Rewards getting relevant docs early
  • NDCG (Normalized Discounted Cumulative Gain): Accounts for position and relevance degree
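
These are simple enough to implement directly. A minimal version, treating doc IDs as strings and "relevant" as the ground-truth set for a query:

import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of the top K retrieved docs, what fraction are relevant?
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of all relevant docs, what fraction appear in the top K?
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant document (0 if none retrieved)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevance: dict[str, float], k: int) -> float:
    # relevance maps doc ID -> graded gain (use 1.0 for binary relevance)
    dcg = sum(relevance.get(doc, 0.0) / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0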

On the generation side:

  • Context relevance: Does the retrieved context help answer the query?
  • Answer faithfulness: Is the generated answer grounded in the retrieved context?

5. Task-specific metrics

Code generation:

  • Pass@K: % of problems where at least 1 of K samples passes tests
  • Unit test pass rates
  • Compilation success
  • Execution time, cyclomatic complexity
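
Pass@K has a standard unbiased estimator: generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). A direct implementation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k sampled solutions passes, given c of n passed
    if n - c < k:
        return 1.0  # not enough failing samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 12 pass -> pass@10
# print(pass_at_k(200, 12, 10))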

Classification:

  • Accuracy, precision, recall, F1
  • Confusion matrix analysis
  • Per-class breakdowns
  • Calibration (are confidence scores meaningful?)
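
scikit-learn covers most of this out of the box; a quick sketch with illustrative labels:

from sklearn.metrics import classification_report, confusion_matrix

y_true = ["billing", "billing", "shipping", "refund", "shipping"]   # expected labels
y_pred = ["billing", "refund",  "shipping", "refund", "billing"]    # model outputs

print(classification_report(y_true, y_pred))   # per-class precision, recall, F1
print(confusion_matrix(y_true, y_pred))        # where the model confuses classes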

Summarisation:

  • Coverage: Does summary include key information?
  • Consistency: Factual alignment with source
  • Coherence: Readability and flow
  • Use a mix of ROUGE + LLM judges + human spot-checks
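
For the ROUGE piece, the rouge-score package is a common choice; a minimal sketch with placeholder strings:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The quarterly report shows revenue grew 12% year over year."
generated = "Revenue grew 12% compared to last year, per the quarterly report."

scores = scorer.score(reference, generated)    # reference first, prediction second
print(scores["rougeL"].fmeasure)               # each entry has precision, recall, fmeasure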

Structured extraction:

  • Exact match on extracted entities
  • Partial credit for partially correct extractions
  • Schema compliance
  • Hallucination rate (extracted things not in source)
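
A sketch of entity-level scoring that covers the first, second, and last of these, treating entities as normalized strings (a simplification; real pipelines often need fuzzy matching):

def extraction_scores(predicted: list[str], expected: list[str], source: str) -> dict:
    pred = {e.strip().lower() for e in predicted}
    gold = {e.strip().lower() for e in expected}
    matched = pred & gold

    # Hallucination rate: extracted entities that never appear in the source text
    hallucinated = [e for e in pred if e not in source.lower()]

    return {
        "precision": len(matched) / len(pred) if pred else 0.0,
        "recall": len(matched) / len(gold) if gold else 0.0,
        "hallucination_rate": len(hallucinated) / len(pred) if pred else 0.0,
    }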

Evals might not feel exciting, but they're essential infrastructure: the difference between "we think this works" and "we measured that it works."

Getting started with AI Evals

So how do you get started?

  1. Start with one critical path.
  2. Build 50 test cases.
  3. Pick 2-3 metrics.
  4. Run them before each deployment.
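
In practice, that can be as small as a versioned JSON file of cases and a loop; a minimal sketch, where the file name, fields, and generate callable are assumptions:

import json

def run_evals(generate, test_file: str = "eval_cases.json") -> float:
    # Run every case through the pipeline and report the pass rate
    with open(test_file) as f:
        cases = json.load(f)   # e.g. [{"input": "...", "expected": "..."}, ...]
    passed = sum(
        1
        for case in cases
        if generate(case["input"]).strip() == case["expected"].strip()   # swap in any metric above
    )
    score = passed / len(cases)
    print(f"{passed}/{len(cases)} cases passed ({score:.0%})")
    return score

# Gate deployment on the result, e.g. fail the CI job if the score drops below your baseline.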

You'll catch regressions, make better decisions, and stop arguing about whether changes helped.

The hardest parts of building evals are:

  • Choosing the right metrics (not just easy-to-measure ones)
  • Keeping test sets fresh and representative
  • Actually using eval results to make decisions
  • Balancing quality, cost, and latency

But once you have evals running, you have a feedback loop. And feedback loops are how systems improve.

What's your eval setup? What metrics have you found useful? Where do you struggle?

In Part 2, we'll walk through setting up an automated eval pipeline with CI integration, versioned test sets, and dashboards.

Get in touch to discuss building your evaluation pipeline ->

Author

  • Aarushi Kansal
    AI Tech Director, UK