What an eval actually is
An eval is a test for non-deterministic output. Like a unit test: you give it inputs, you run the system, you check the output against a criterion. Unlike a unit test, the same input doesn't always produce the same output, and "correct" isn't always a single value — it can be a quality judgment.
Three things make an eval:
- A dataset — a fixed set of inputs (prompts, contexts, scenarios) you run the system against.
- A judgment — for each output, did it succeed? Code check, reference comparison, or another LLM acting as judge.
- An aggregation — pass rate, average score, distribution of categorized errors.
Evals turn a vibes-based question ("is the agent good?") into a measurable one ("on these 50 representative tickets, the agent's draft PRs merge without human edits 38% of the time, down from 44% last week").
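The three parts map onto very little code. Below is a minimal sketch, not a recommended framework: the names Example, run_eval, toy_system and exact_match are all hypothetical, and the "system" is a stand-in for a real agent call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """One dataset row: the input plus whatever the judgment needs."""
    prompt: str
    expected: str


def run_eval(
    dataset: list[Example],
    system: Callable[[str], str],
    judge: Callable[[Example, str], bool],
) -> float:
    """Run the system on every example, judge each output, return the pass rate."""
    if not dataset:
        raise ValueError("dataset is empty")
    passed = sum(judge(ex, system(ex.prompt)) for ex in dataset)
    return passed / len(dataset)


# Toy stand-ins (assumptions for illustration only).
def toy_system(prompt: str) -> str:
    return prompt.upper()


def exact_match(ex: Example, output: str) -> bool:
    return output == ex.expected


dataset = [Example("ok", "OK"), Example("fail", "nope")]
pass_rate = run_eval(dataset, toy_system, exact_match)
print(f"pass rate: {pass_rate:.0%}")
```

Swapping exact_match for a reference comparison or an LLM judge changes the judgment without touching the dataset or the aggregation.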
Why this matters
Without evals, teams change prompts based on the last bad demo. The next change makes that demo better and silently breaks three other things. There's no signal that progress is happening or regression is creeping in.
Hamel Husain's claim is that most failed LLM products share one root cause: no robust evals. That is consistent with everything we've seen. Eval-less teams stall in pilot.
The anatomy of an eval
Three classes of judgment, with examples
1 · Code / deterministic checks
Cheapest, most reliable. Use when "correct" can be expressed as a rule.
Examples in our scope:
- PR diff applies cleanly — git apply --check.
- Tests pass after the agent's change — pytest/npm test.
- Generated JSON parses — schema validation.
- Plan contains a "files to change" section — regex.
- Tool call uses an allowed verb — allowlist check.
When to use: structural, syntactic, or executable criteria.
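Several of the checks listed above fit in a few lines each. This is a sketch under stated assumptions: the allowlist contents, the shape of a tool call (a dict with a "verb" key), and the heading text matched by the regex are all hypothetical. Only git apply --check is taken from the list itself.

```python
import json
import re
import subprocess

# Hypothetical allowlist; replace with the verbs your agent may actually use.
ALLOWED_VERBS = {"read_file", "run_tests", "open_pr"}


def patch_applies(repo_dir: str, patch_path: str) -> bool:
    """True if `git apply --check` accepts the patch (path relative to repo_dir)."""
    result = subprocess.run(
        ["git", "apply", "--check", patch_path],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def json_parses(text: str) -> bool:
    """True if the text is syntactically valid JSON (no schema check here)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def has_files_section(plan: str) -> bool:
    """True if some line starts with 'files to change', optionally as a # heading."""
    pattern = r"^#*\s*files to change\b"
    return re.search(pattern, plan, re.IGNORECASE | re.MULTILINE) is not None


def verb_allowed(tool_call: dict) -> bool:
    """True if the tool call's verb is on the allowlist."""
    return tool_call.get("verb") in ALLOWED_VERBS
```

Each function returns a plain boolean, so the results can be aggregated into a pass rate directly.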
2 · Reference-based comparison
You have a gold answer; the question is how close the output is.
Examples:
- PR matches the human-authored fix — diff distance, or BLEU/ROUGE if textual.
- Bug repro actually reproduces the bug — run it, check exit code.
- Code-review comment cites the same lines a human would — set overlap (recall).
Caveat: BLEU/ROUGE correlate negatively with human judgment of fluency. Use them as "is this in the ballpark" gates, not for ranking models.
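For the set-overlap case, recall is the share of human-cited locations that the agent also cited. A minimal sketch, assuming each citation is reduced to a (file, line) pair; the file names and line numbers are made up, and returning 1.0 for an empty reference is an arbitrary convention.

```python
Location = tuple[str, int]  # (file path, line number)


def line_recall(agent: set[Location], human: set[Location]) -> float:
    """Fraction of human-cited locations that the agent also cited."""
    if not human:
        return 1.0  # assumption: nothing to find counts as full recall
    return len(agent & human) / len(human)


human_cites = {("a.py", 10), ("a.py", 42), ("b.py", 7)}
agent_cites = {("a.py", 10), ("b.py", 7), ("c.py", 1)}
recall = line_recall(agent_cites, human_cites)
print(f"recall: {recall:.2f}")
```

Exact line matching is strict; a real harness might allow a small window around each line, which is a design choice this sketch leaves out.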
3 · LLM-as-judge
For open-ended outputs where no single answer is right — code review comments, design rationale, summarization, postmortems.
Examples:
- "Is this PR-review comment specific and actionable?" Yes/No + reason.
- "Of these two summaries of the failed CI run, which one would help the on-call engineer faster?" Pairwise.
- "Did the agent's plan address all the constraints in the issue?" Rubric scored 1–5.
LLM-judge correlation with humans averages around 0.51 — better than n-gram metrics, much cheaper than humans, but biased (see below).
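The Yes/No + reason pattern can be sketched as a prompt template plus a strict parser. Everything here is illustrative: the prompt wording is an assumption, call_model is a hypothetical callable wrapping whatever model API is in use, and fake_model exists only so the sketch runs offline.

```python
from typing import Callable

# Hypothetical prompt wording; tune it against your own hand-labelled examples.
JUDGE_PROMPT = (
    "You are reviewing a code-review comment.\n"
    "Comment: {comment}\n"
    "Is this comment specific and actionable? "
    "Reply with YES or NO on the first line, then one sentence of reasoning."
)


def parse_verdict(reply: str) -> tuple[bool, str]:
    """Split a judge reply into (passed, reason); reject anything off-format."""
    first_line, _, rest = reply.strip().partition("\n")
    verdict = first_line.strip().upper()
    if verdict not in {"YES", "NO"}:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return verdict == "YES", rest.strip()


def judge_comment(comment: str, call_model: Callable[[str], str]) -> tuple[bool, str]:
    """Ask the judge model about one comment and parse its verdict."""
    return parse_verdict(call_model(JUDGE_PROMPT.format(comment=comment)))


# Stand-in for a real model call (assumption for illustration only).
def fake_model(prompt: str) -> str:
    return "NO\nThe comment does not say what to change."


passed, reason = judge_comment("This looks off.", fake_model)
print(passed, reason)
```

Raising on an off-format reply, rather than guessing, keeps parser failures visible instead of silently counting them as passes or fails.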
A worked example — evaluating a PR-review agent
The deliverable: an eval that tells us whether our PR-review agent is getting better or worse week-over-week.
- Build the dataset. 30 closed PRs from the client's repo, mixed by size and language. For each, capture: the diff, the human reviewer's comments, the eventual merge outcome.
- Run the agent on each PR. Capture its comments.
- Judge each agent comment with a mix of techniques:
  - Deterministic — did the comment reference a real line of the diff? (regex against file/line numbers)
  - Reference — did the agent flag any issue the human reviewer also flagged? (overlap on file+line — recall)
  - LLM-judge — is each comment specific and actionable (vs vague), with another model as judge?
- Aggregate per run:
  - % comments grounded in real diff lines
  - % overlap with human-reviewer concerns (recall)
  - % of comments rated "specific and actionable" (precision-ish)
  - false-positive rate — comments humans dismissed
- Watch the trend. Every prompt or model change reruns the suite. The numbers move; you decide whether the change goes to production.
The first version of this can be 200 lines of Python. The trap is over-engineering before the dataset is good. Get 30 real examples first, scoring by hand if necessary.
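The aggregation step of the worked example might look like the sketch below. The JudgedComment fields and the matched_issue identifiers are hypothetical; they assume the three judgments have already been recorded per comment and that each human-flagged issue has an id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgedComment:
    grounded: bool                # deterministic: cites a real diff line
    matched_issue: Optional[str]  # reference: id of the human issue it matches, if any
    actionable: bool              # LLM judge: specific and actionable
    dismissed: bool               # humans dismissed it (false positive)


def aggregate(comments: list[JudgedComment], human_issue_count: int) -> dict[str, float]:
    """Collapse per-comment judgments into the four per-run numbers."""
    n = len(comments)
    if n == 0 or human_issue_count == 0:
        raise ValueError("need at least one comment and one human issue")
    # Count each human issue once, even if several agent comments match it.
    matched = {c.matched_issue for c in comments if c.matched_issue is not None}
    return {
        "grounded_rate": sum(c.grounded for c in comments) / n,
        "recall": len(matched) / human_issue_count,
        "actionable_rate": sum(c.actionable for c in comments) / n,
        "false_positive_rate": sum(c.dismissed for c in comments) / n,
    }


run = [
    JudgedComment(True, "h1", True, False),
    JudgedComment(True, "h1", False, False),
    JudgedComment(False, None, False, True),
    JudgedComment(True, "h2", True, False),
]
metrics = aggregate(run, human_issue_count=4)
print(metrics)
```

Storing one such dict per run, keyed by date and prompt version, is enough to watch the week-over-week trend.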
The error-analysis loop
Husain recommends spending 60–80% of dev time on error analysis. Concretely, this is the loop:
- Run the eval. Get failures.
- Read the failures. Don't just look at the score.
- Categorize them. "Cited wrong line", "vague comment", "missed the bug", "complained about style we don't care about". A spreadsheet, not a dashboard.
- Pick the biggest category. Fix the prompt, the harness, or the model to address it.
- Re-run. Did the category shrink? Did another one grow?
This loop is the work. Building the dataset and judge enables it.
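The categorize, pick, and re-run steps amount to counting labels and comparing counts between runs. A minimal sketch, assuming each failure has been hand-labelled with one category string; the sample labels reuse the categories named above and the counts are invented.

```python
from collections import Counter


def category_shift(before: list[str], after: list[str]) -> dict[str, int]:
    """Change in failure count per category between two runs (after minus before)."""
    b, a = Counter(before), Counter(after)
    return {cat: a[cat] - b[cat] for cat in sorted(set(b) | set(a))}


# Invented labels for two consecutive runs.
last_week = ["vague comment", "vague comment", "vague comment",
             "cited wrong line", "missed the bug"]
this_week = ["vague comment", "cited wrong line", "cited wrong line",
             "missed the bug"]

biggest = Counter(last_week).most_common(1)[0][0]  # the category to attack first
shift = category_shift(last_week, this_week)
print(biggest, shift)
```

A negative number means the category shrank; a positive one means the fix may have pushed failures elsewhere.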
LLM-as-judge biases to defeat
If you're using an LLM as the judge, install these mitigations from day one — they're cheap, and clients won't think of them:
- Position bias — evaluator prefers whichever option comes first. Fix: swap order and re-evaluate; only count a win if both runs agree.
- Verbosity bias — evaluator prefers longer answers. Fix: equalize length before comparing.
- Self-enhancement bias — evaluator favors its own family's output. Fix: never use the same model family for generation and evaluation.
- Score-vs-comparison noise — direct numeric scoring is noisier than pairwise. Fix: ask for comparisons, not scores.
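The position-bias fix (swap order, count a win only if both runs agree) can be sketched as follows. The judge interface is an assumption: a callable that sees two candidates in order and returns "first" or "second". The two toy judges stand in for a real model.

```python
from typing import Callable, Optional

Judge = Callable[[str, str], str]  # returns "first" or "second"


def debiased_winner(a: str, b: str, judge: Judge) -> Optional[str]:
    """Judge both orderings; return "a" or "b" only if the two runs agree."""
    run1 = judge(a, b)  # a is shown first
    run2 = judge(b, a)  # b is shown first
    for verdict in (run1, run2):
        if verdict not in {"first", "second"}:
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    winner1 = "a" if run1 == "first" else "b"
    winner2 = "b" if run2 == "first" else "a"
    return winner1 if winner1 == winner2 else None


# Toy judges (assumptions for illustration only).
def position_biased(first: str, second: str) -> str:
    return "first"  # always prefers whatever it sees first


def content_based(first: str, second: str) -> str:
    return "first" if "line" in first else "second"


specific = "Rename x on line 42 to retry_count."
vague = "This looks off."
print(debiased_winner(specific, vague, position_biased))
print(debiased_winner(specific, vague, content_based))
```

A purely position-driven judge yields no winner under this scheme, so its votes drop out instead of inflating one side.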
Minimum viable eval (start today)
For any agent loop the client wants to ship:
- 30 real examples from their tracker, repo, or logs. Real, not synthetic.
- One pass/fail criterion per example — write it down, even if it's "I'd accept this PR" / "I'd revise it" by hand.
- A spreadsheet with one row per example, columns for the agent's output and your judgment.
- A weekly run with the same dataset and criteria.
That's enough to know whether you're improving. Sophistication comes later.
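If the spreadsheet is exported as CSV, the weekly number is one small function. This sketch assumes hypothetical column names (example_id, agent_output, judgment) and an "accept" / "revise" label; the rows are invented placeholders.

```python
import csv
import io

# Invented sample export; in practice, read the real file instead.
SHEET = """example_id,agent_output,judgment
T-1,draft PR A,accept
T-2,draft PR B,revise
T-3,draft PR C,accept
T-4,draft PR D,accept
"""


def pass_rate_from_csv(csv_text: str) -> float:
    """Share of rows whose hand judgment is 'accept'."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no examples in sheet")
    accepted = sum(row["judgment"].strip().lower() == "accept" for row in rows)
    return accepted / len(rows)


weekly_rate = pass_rate_from_csv(SHEET)
print(f"accepted without revision: {weekly_rate:.0%}")
```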
Hamel Husain · Shreya Shankar · evals FAQ · Yan's Evals pattern · AI Evals for Engineers & PMs (Maven course)