Pattern · Process

Evals & error analysis

Hamel Husain · Eugene Yan

What an eval actually is

An eval is a test for non-deterministic output. Like a unit test: you give it inputs, you run the system, you check the output against a criterion. Unlike a unit test, the same input doesn't always produce the same output, and "correct" isn't always a single value — it can be a quality judgment.

Three things make an eval:

  1. A dataset — a fixed set of inputs (prompts, contexts, scenarios) you run the system against.
  2. A judgment — for each output, did it succeed? Code check, reference comparison, or another LLM acting as judge.
  3. An aggregation — pass rate, average score, distribution of categorized errors.

Evals turn a vibes-based question ("is the agent good?") into a measurable one ("on these 50 representative tickets, the agent's draft PRs merge without human edits 38% of the time, down from 44% last week").
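
The three parts fit in a few lines. A minimal sketch, assuming a keyword match stands in for the real criterion; `Example`, `run_eval`, `keyword_judge`, and `fake_agent` are illustrative names, not part of any library:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """One dataset row: the input plus whatever the judge needs."""
    input_text: str
    expected_keyword: str  # hypothetical criterion, for illustration only


def run_eval(
    dataset: list[Example],
    system: Callable[[str], str],
    judge: Callable[[Example, str], bool],
) -> float:
    """Run the system on every example, judge each output, return the pass rate."""
    if not dataset:
        raise ValueError("dataset is empty")
    passed = sum(judge(ex, system(ex.input_text)) for ex in dataset)
    return passed / len(dataset)


def keyword_judge(example: Example, output: str) -> bool:
    """Judgment: a code check that the output mentions the expected keyword."""
    return example.expected_keyword.lower() in output.lower()


def fake_agent(prompt: str) -> str:
    """Stand-in for the system under test."""
    return f"Draft fix for: {prompt}"


dataset = [
    Example("null pointer in login handler", "login"),
    Example("flaky test in billing module", "billing"),
    Example("typo in README", "changelog"),
]
pass_rate = run_eval(dataset, fake_agent, keyword_judge)
print(f"pass rate: {pass_rate:.0%}")
```

Swapping `fake_agent` for the real agent and `keyword_judge` for a real criterion leaves `run_eval` unchanged, which is the point: the rig stays fixed while the system under test moves.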

Why this matters

Without evals, teams change prompts based on the last bad demo. The next change makes that demo better and silently breaks three other things. There's no signal that progress is happening or regression is creeping in.

Hamel Husain's claim, that the most common root cause of failed LLM products is the absence of robust evals, is consistent with everything we've seen. Eval-less teams stall in pilot.

The anatomy of an eval

[Figure: INPUT (ticket · diff · prompt) → AGENT (system under test) → OUTPUT (PR · comment · plan) → JUDGE (applies criterion) → SCORE (pass / fail / rubric). CRITERION (code check · reference comparison · LLM-as-judge) feeds the JUDGE.]
An eval is a pipeline. The system under test is the agent; everything else (input, judge, criterion, score) is the test rig you build.

Three classes of judgment, with examples

1 · Code / deterministic checks

Cheapest, most reliable. Use when "correct" can be expressed as a rule.

Examples in our scope:

When to use: structural, syntactic, or executable criteria.
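
A minimal sketch of two such checks, assuming the agent emits Python source in one case and a JSON object in the other. The function names and the required keys are hypothetical:

```python
import ast
import json


def is_valid_python(source: str) -> bool:
    """Syntactic check: does the generated code parse?"""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def has_required_keys(raw: str, required: set[str]) -> bool:
    """Structural check: is the output a JSON object with the expected keys?"""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


print(is_valid_python("def f(x):\n    return x + 1\n"))
print(has_required_keys('{"file": "a.py", "line": 3}', {"file", "line"}))
```

Checks like these are fast enough to run on every output, so they make a good first gate before any more expensive judgment.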

2 · Reference-based comparison

You have a gold answer; the question is how close the output is.

Examples:

Caveat: BLEU/ROUGE correlate negatively with human judgment of fluency. Use them as "is this in the ballpark" gates, not for ranking models.
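
A ballpark gate can be sketched with a simplified unigram-overlap F1. This is in the spirit of ROUGE-1 but is not the official implementation; whitespace tokenization and the 0.5 threshold are assumptions picked for illustration:

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a gold answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def in_ballpark(candidate: str, reference: str, threshold: float = 0.5) -> bool:
    """Gate, not a ranking: pass if the overlap clears an arbitrary threshold."""
    return token_f1(candidate, reference) >= threshold


print(round(token_f1("fix login bug", "fix the login bug"), 3))
```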

3 · LLM-as-judge

For open-ended outputs where no single answer is right — code review comments, design rationale, summarization, postmortems.

Examples:

LLM-judge correlation with humans averages around 0.51 — better than n-gram metrics, much cheaper than humans, but biased (see below).
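
A minimal sketch of a pass/fail judge. The model call is injected as `call_model`, a hypothetical callable that takes a prompt and returns text, so no particular provider API is assumed; the prompt wording and `stub_model` are also illustrative:

```python
from typing import Callable, Optional

JUDGE_PROMPT = """You are reviewing a code review comment.
Criterion: the comment is specific and actionable.
Comment:
{comment}
Answer with exactly one word: PASS or FAIL."""


def judge_comment(comment: str, call_model: Callable[[str], str]) -> Optional[bool]:
    """Ask the judge model for a verdict; None means the reply was unparseable."""
    reply = call_model(JUDGE_PROMPT.format(comment=comment)).strip().upper()
    if reply.startswith("PASS"):
        return True
    if reply.startswith("FAIL"):
        return False
    return None  # surface unparseable verdicts instead of guessing


def stub_model(prompt: str) -> str:
    """Stand-in for a real model: passes any comment that cites a line number."""
    comment = prompt.split("Comment:")[1]
    return "PASS" if "line " in comment else "FAIL"


print(judge_comment("Off-by-one on line 42: use range(n + 1).", stub_model))
print(judge_comment("Looks a bit off.", stub_model))
```

Forcing a one-word verdict keeps parsing trivial, and returning `None` for anything else means malformed judge replies show up in the error analysis instead of being silently counted as failures.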

A worked example — evaluating a PR-review agent

The deliverable: an eval that tells us whether our PR-review agent is getting better or worse week-over-week.

  1. Build the dataset. 30 closed PRs from the client's repo, mixed by size and language. For each, capture: the diff, the human reviewer's comments, the eventual merge outcome.
  2. Run the agent on each PR. Capture its comments.
  3. Judge each agent comment with a mix of techniques:
    • Deterministic — did the comment reference a real line of the diff? (regex against file/line numbers)
    • Reference — did the agent flag any issue the human reviewer also flagged? (overlap on file+line — recall)
    • LLM-judge — is each comment specific and actionable (vs vague), with another model as judge?
  4. Aggregate per run:
    • % comments grounded in real diff lines
    • % overlap with human-reviewer concerns (recall)
    • % of comments rated "specific and actionable" (precision-ish)
    • false-positive rate — comments humans dismissed
  5. Watch the trend. Every prompt or model change reruns the suite. The numbers move; you decide whether the change goes to production.
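
The deterministic and reference parts of steps 3 and 4 can be sketched as follows, assuming each comment has already been parsed into a (file, line) pair. `PRCase` and `aggregate` are hypothetical names, and the two cases are made-up data:

```python
from dataclasses import dataclass

Location = tuple[str, int]  # (file path, line number)


@dataclass
class PRCase:
    diff_lines: set[Location]    # lines actually touched by the diff
    human_flags: set[Location]   # locations the human reviewer commented on
    agent_flags: list[Location]  # locations the agent commented on


def aggregate(cases: list[PRCase]) -> dict[str, float]:
    """Micro-averaged grounding rate and recall across all PRs in the run."""
    agent_total = sum(len(c.agent_flags) for c in cases)
    grounded = sum(loc in c.diff_lines for c in cases for loc in c.agent_flags)
    human_total = sum(len(c.human_flags) for c in cases)
    recalled = sum(len(c.human_flags & set(c.agent_flags)) for c in cases)
    return {
        "grounded": grounded / agent_total if agent_total else 0.0,
        "recall": recalled / human_total if human_total else 0.0,
    }


cases = [
    PRCase(
        diff_lines={("app.py", 10), ("app.py", 11), ("db.py", 5)},
        human_flags={("app.py", 10), ("db.py", 5)},
        agent_flags=[("app.py", 10), ("app.py", 99)],  # second one is ungrounded
    ),
    PRCase(
        diff_lines={("ui.ts", 3)},
        human_flags={("ui.ts", 3)},
        agent_flags=[("ui.ts", 3)],
    ),
]
metrics = aggregate(cases)
print({name: round(value, 2) for name, value in metrics.items()})
```

Exact file+line matching is strict; a real suite might allow a small line window, but that tolerance is a design decision to make after reading failures, not before.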

The first version of this can be 200 lines of Python. The trap is over-engineering before the dataset is good. Get 30 real examples first, scoring by hand if necessary.

The error-analysis loop

Husain recommends spending 60–80% of dev time on error analysis. Concretely, this is the loop:

  1. Run the eval. Get failures.
  2. Read the failures. Don't just look at the score.
  3. Categorize them. "Cited wrong line", "vague comment", "missed the bug", "complained about style we don't care about". A spreadsheet, not a dashboard.
  4. Pick the biggest category. Fix the prompt, the harness, or the model to address it.
  5. Re-run. Did the category shrink? Did another one grow?

This loop is the work. Building the dataset and judge enables it.
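
Steps 3 to 5 need nothing more than label counts. A minimal sketch, assuming each failure has been hand-labeled with one category; the function names and the counts are made up for illustration:

```python
from collections import Counter


def biggest_category(labels: list[str]) -> tuple[str, int]:
    """Return the most frequent failure label and its count."""
    if not labels:
        raise ValueError("no failures to categorize")
    return Counter(labels).most_common(1)[0]


def compare_runs(before: list[str], after: list[str]) -> dict[str, int]:
    """Per-category change in failure count between two runs (after minus before)."""
    b, a = Counter(before), Counter(after)
    return {cat: a[cat] - b[cat] for cat in sorted(set(b) | set(a))}


before = ["vague comment"] * 5 + ["cited wrong line"] * 3 + ["missed the bug"] * 2
after = ["vague comment"] * 2 + ["cited wrong line"] * 3 + ["missed the bug"] * 4

print(biggest_category(before))
print(compare_runs(before, after))
```

In this made-up run the targeted category shrinks while "missed the bug" grows, which is exactly the trade the last step of the loop is meant to catch.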

LLM-as-judge biases to defeat

If you're using an LLM as the judge, install these mitigations from day one — they're cheap, and clients won't think of them:
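
As an illustration of what such a mitigation can look like: a commonly reported judge bias is position bias, where a judge comparing two answers tends to favor whichever is shown first. A minimal sketch of one countermeasure, asking in both orders and accepting only consistent verdicts. `pairwise_judge` is a hypothetical callable, and this is an assumption about a typical mitigation, not necessarily one the authors list:

```python
from typing import Callable, Optional


def debiased_preference(
    a: str,
    b: str,
    pairwise_judge: Callable[[str, str], str],
) -> Optional[str]:
    """Return "a" or "b", or None when the verdict flips with presentation order.

    pairwise_judge(first, second) is assumed to return "first" or "second".
    """
    forward = pairwise_judge(a, b)
    backward = pairwise_judge(b, a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return None  # inconsistent: treat as a tie or send to a human


def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def longer_wins(first: str, second: str) -> str:
    """Stub judge whose verdict does not depend on order."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference("short", "much longer answer", always_first))
print(debiased_preference("short", "much longer answer", longer_wins))
```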

Minimum viable eval (start today)

For any agent loop the client wants to ship:

  1. 30 real examples from their tracker, repo, or logs. Real, not synthetic.
  2. One pass/fail criterion per example — write it down, even if it's "I'd accept this PR" / "I'd revise it" by hand.
  3. A spreadsheet with one row per example, columns for the agent's output and your judgment.
  4. A weekly run with the same dataset and criteria.

That's enough to know whether you're improving. Sophistication comes later.
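
The weekly number can come straight from the spreadsheet. A minimal sketch, assuming it is exported as CSV with a `judgment` column holding "accept" or "revise"; the column names and rows are hypothetical:

```python
import csv
import io


def pass_rate(rows: list[dict[str, str]]) -> float:
    """Share of judged rows marked "accept"; rows without a judgment are skipped."""
    verdicts = [r["judgment"].strip().lower() for r in rows]
    judged = [v for v in verdicts if v in {"accept", "revise"}]
    if not judged:
        raise ValueError("no judged rows")
    return judged.count("accept") / len(judged)


# Made-up export; in practice this would be read from the real spreadsheet file.
SHEET = """example_id,agent_output,judgment
PR-1,adds null check,accept
PR-2,renames variable only,revise
PR-3,fixes off-by-one,accept
PR-4,rewrites whole module,revise
"""

rows = list(csv.DictReader(io.StringIO(SHEET)))
rate = pass_rate(rows)
print(f"pass rate: {rate:.0%}")
```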

Hamel Husain · Shreya Shankar · evals FAQ · Yan's Evals pattern · AI Evals for Engineers & PMs (Maven course)

First deliverable in nearly every engagement: a working eval suite with labeled error categories. Start with 30 real examples and a spreadsheet; sophistication comes later. When the eval needs LLM-as-judge (agent PR-review comments are the obvious case), install the four bias mitigations from day one.