“LLM-as-judge” scales until you realize the rubric was never written. A practical stack layers cheap deterministic checks, small human-curated goldens for dangerous paths, broader sampling for brittleness, and finally online metrics—because users do not grade your rubric; they abandon, edit, or escalate. That mirrors how velocity and verification negotiate in shipping teams.
Layer 1: Determinism first
JSON schema, tool invocation shape, banned phrases—these belong in CI. They catch an enormous class of regressions without touching model quality semantics. Tie them to the same truth contract rules you use in prompts and policy code.
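These layer-1 checks can live in a few dozen lines of stdlib Python. A minimal sketch, assuming a hypothetical tool-call contract with `tool` and `arguments` keys and an illustrative banned-phrase list (both are placeholders, not the article's actual contract):

```python
import json
import re

# Hypothetical banned phrases; a real list comes from your policy code.
BANNED = re.compile(r"\b(as an ai language model|i'm just an ai)\b", re.IGNORECASE)
REQUIRED_KEYS = {"tool", "arguments"}  # hypothetical tool-invocation shape

def check_output(raw: str) -> list[str]:
    """Deterministic checks: valid JSON, required keys, no banned phrases."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    errors = []
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if BANNED.search(raw):
        errors.append("banned phrase detected")
    return errors
```

Because nothing here calls a model, the whole suite runs on every commit in CI at negligible cost.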
Layer 2: Goldens and failure classes
Curate examples from redacted production failures—otherwise your suite optimizes last quarter’s vocabulary. Group tests by failure class (“wrong tool,” “wrong entity”) to avoid eval inflation, where an aggregate pass rate hides a regression in one class—see the homepage glossary “Terms we use precisely” and the hardening practice.
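Grouping by failure class is mostly a reporting concern. A sketch of per-class pass rates, with hypothetical golden records (the `failure_class` labels come from the taxonomy above; prompts are placeholders):

```python
from collections import defaultdict

# Hypothetical goldens curated from redacted production failures.
GOLDENS = [
    {"id": "g1", "failure_class": "wrong_tool", "prompt": "find the pricing page"},
    {"id": "g2", "failure_class": "wrong_entity", "prompt": "summarize the Acme contract"},
    {"id": "g3", "failure_class": "wrong_tool", "prompt": "convert 5 mi to km"},
]

def report_by_class(results: dict[str, bool]) -> dict[str, float]:
    """Pass rate per failure class, so one inflated class can't hide another's regression."""
    buckets = defaultdict(list)
    for g in GOLDENS:
        buckets[g["failure_class"]].append(results[g["id"]])
    return {cls: sum(passed) / len(passed) for cls, passed in buckets.items()}
```

A single aggregate score over these three goldens would read 67% and look fine; the per-class view shows “wrong entity” at zero.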
Layer 3: Telemetry closes the loop
Postmortems need traces, not vibes. When an incident fires, you should be able to name which layer ought to have caught it—or admit the promise was unrealistic. Cost matters too: heavy judge models belong in offline pipelines, not on every commit—see cost as a requirement.
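The "name the layer" discipline can be made mechanical. A sketch, assuming the four layers named in this article plus an explicit "unrealistic" bucket (the `Incident` shape is illustrative, not a real schema):

```python
from dataclasses import dataclass

LAYERS = ("deterministic", "goldens", "sampling", "online")

@dataclass
class Incident:
    trace_id: str
    description: str
    should_have_caught: str  # one of LAYERS, or "unrealistic"

def postmortem_summary(incidents: list[Incident]) -> dict[str, int]:
    """Count incidents by the layer that should have caught them.

    A pile-up in one bucket tells you which layer to strengthen;
    a pile-up in "unrealistic" tells you the promise was wrong.
    """
    counts = {layer: 0 for layer in (*LAYERS, "unrealistic")}
    for inc in incidents:
        counts[inc.should_have_caught] += 1
    return counts
```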
Rubrics that survive contact with users
LLM-as-judge works when the rubric is explicit enough that two annotators agree. Invest in anchor examples per score band, and refresh when vocabulary drifts—especially after fine-tunes or corpus updates. Otherwise your offline scores diverge from what users actually trust.
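“Explicit enough that two annotators agree” is measurable. One standard statistic is Cohen’s kappa, which corrects raw agreement for chance—a sketch over two annotators’ score-band labels:

```python
from collections import Counter

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Agreement between two annotators beyond chance.

    Near 1.0 means the rubric is explicit; near 0.0 means labels
    agree no more often than random assignment would.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Track this per score band when you refresh anchor examples: a kappa that drops after a fine-tune or corpus update is the drift signal the paragraph above describes.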
Online metrics as the fourth layer
Click-through, edit distance, escalation rate, and task completion time tell you whether the system works in context—not only whether outputs match a reference string. Wire those signals back to production owners and into the same failure taxonomy you use offline, so improvements do not chase the wrong metric.
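Keying online signals to the offline taxonomy can be as simple as aggregating by the same intent labels. A sketch, assuming hypothetical event records with an `intent` field shared with the offline failure classes:

```python
def escalation_rate(events: list[dict]) -> dict[str, float]:
    """Escalation rate per intent, keyed to the taxonomy used offline.

    Each event is assumed to carry an "intent" label and an
    "escalated" boolean; both field names are illustrative.
    """
    totals: dict[str, int] = {}
    escalated: dict[str, int] = {}
    for e in events:
        intent = e["intent"]
        totals[intent] = totals.get(intent, 0) + 1
        escalated[intent] = escalated.get(intent, 0) + int(e["escalated"])
    return {i: escalated[i] / totals[i] for i in totals}
```

Because the keys match the offline failure classes, a spike in one intent’s escalation rate points directly at the golden group that should grow.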
Pairwise comparisons and Elo pitfalls
Ranking models with head-to-head battles is intuitive but can be unstable if the comparison set is skewed toward easy prompts. Stratify battles by difficulty, intent, and risk tier; cap how much any single annotator can shift the leaderboard. Document tie-break rules when judges disagree—otherwise your “winner” is whoever showed up last week.
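The annotator cap is easy to enforce at update time. A minimal Elo sketch with a per-annotator contribution cap (the `K` factor and cap value are illustrative defaults, not recommendations):

```python
from collections import Counter

K = 16               # rating update step (illustrative)
ANNOTATOR_CAP = 50   # max battles any single annotator may contribute

def expected(ra: float, rb: float) -> float:
    """Standard Elo expected score for the first player."""
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def run_battles(battles, ratings):
    """battles: (annotator, model_a, model_b, winner) tuples;
    winner is a model name or "tie"."""
    seen = Counter()
    for annotator, a, b, winner in battles:
        seen[annotator] += 1
        if seen[annotator] > ANNOTATOR_CAP:
            continue  # cap any single annotator's pull on the leaderboard
        score_a = 0.5 if winner == "tie" else float(winner == a)
        ea = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - ea)
        ratings[b] += K * ((1 - score_a) - (1 - ea))
    return ratings
```

Stratification happens upstream—sample the battle prompts by difficulty, intent, and risk tier before they ever reach this loop—and the explicit `"tie"` branch is where your documented tie-break rule lives.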
Negative tests matter
Suites heavy on happy paths miss refusals and safe failures. Add explicit cases where the correct behavior is to say “I cannot” or to call a tool—aligned with your truth contract. Judges that reward verbosity can punish those refusals even when they are correct.
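A refusal golden needs its own pass condition, separate from any similarity metric. A sketch, assuming a hypothetical marker list (a real one would come from your truth contract, not this hard-coded tuple):

```python
# Illustrative refusal markers; a real set comes from your truth contract.
REFUSAL_MARKERS = ("i cannot", "i can't", "unable to")

def is_correct_refusal(output: str, expected_behavior: str) -> bool:
    """A golden whose correct answer is a refusal passes only if the
    model actually refuses; a happy-path golden fails if it refuses."""
    refused = any(m in output.lower() for m in REFUSAL_MARKERS)
    if expected_behavior == "refuse":
        return refused
    return not refused
```

Scoring refusals this way, rather than through a judge model, sidesteps the verbosity bias: a terse, correct “I cannot” passes outright.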
Data leakage and benchmark contamination
If training data touches eval examples, scores lie. Separate holdout sets by time and source; rotate goldens when corpora update. For fine-tunes, treat contamination review as part of the release checklist, not an afterthought.
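A cheap contamination review is exact-match fingerprinting between eval examples and the training corpus. A sketch using normalized hashes (this catches exact and whitespace/case-variant duplicates only, not paraphrases):

```python
import hashlib

def fingerprint(text: str) -> str:
    """Normalize whitespace and case, then hash, so trivial variants collide."""
    norm = " ".join(text.lower().split())
    return hashlib.sha256(norm.encode()).hexdigest()

def contaminated(eval_examples: list[str], training_corpus: list[str]) -> list[str]:
    """Eval examples whose fingerprint also appears in the training corpus."""
    train = {fingerprint(t) for t in training_corpus}
    return [e for e in eval_examples if fingerprint(e) in train]
```

Run this as a release-checklist gate for fine-tunes; anything it flags either leaves the eval set or blocks the release. Paraphrase-level leakage needs fuzzier matching, but the exact-match pass is the floor.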
An eval stack is healthy when every production incident names a missing or weakened test—never when the suite was green and users were still angry.