Eval Pipeline: 3 Tiers That Work

Learn how to build eval pipelines with heuristic, LLM-as-judge, and human review tiers. Cut cost, catch bias, and ship safer evals. Read the full guide.

Ilia Ilinskii
Rephrase · June 7, 2026

Prompt engineering · 8 min read

If your eval stack still treats every sample the same, you're overspending on the easy cases and paying too little attention to the risky ones. The better pattern is a ladder: let deterministic checks catch obvious failures, let an LLM judge handle fuzzy judgment, and reserve humans for the edge cases that matter.

Key Takeaways

A three-tier eval pipeline keeps cheap checks in front and expensive judgment at the back.
Heuristics should handle anything deterministic, like format, schema, length, or simple rule violations.
LLM-as-a-judge works best for rubric-based, subjective, or semi-structured evaluation.
Human annotation should be used for calibration, disputes, and high-stakes edge cases.
The strongest pipelines log every tier so you can audit why a sample escalated.

What is a three-tier eval pipeline?

A three-tier eval pipeline is a staged system for scoring model outputs: deterministic heuristics first, an LLM judge second, and human annotation third. The point is efficiency with control. Research on LLM judges shows they can be effective but are also biased and prompt-sensitive, so you want them inside a larger architecture, not as the whole architecture [1][2].

The key mental model is escalation. Most samples are boring. A small fraction are messy, ambiguous, or high-risk. The pipeline should spend almost nothing on the boring ones and progressively more as uncertainty rises.

Why start with heuristic checks?

Heuristics are your cheapest and most trustworthy layer because they answer questions code can answer exactly. If you need JSON validity, citation presence, forbidden-word filtering, max length, or schema compliance, don't ask an LLM to "judge" that. Just check it. Practical LLM eval writeups keep repeating this because it's where teams waste the most money [3].

I'd put any deterministic failure here. If the sample fails a heuristic, it either fails immediately or gets marked for deeper review. That gives you fast feedback and a clean audit trail. It also prevents the LLM judge from becoming a glorified parser.

What belongs in the heuristic tier?

This layer should handle exact-match or rule-based conditions: schema checks, formatting, length bounds, required keywords, duplicate detection, safety blocklists, citation formatting, and routing metadata. In other words, if you can write a unit test for it, you should.

Here's the part teams often miss: heuristics aren't just "cheap." They also reduce noise upstream. If your judge sees only well-formed candidates, its job gets simpler and more stable.
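
To make that concrete, here is a minimal sketch of a heuristic tier. The limits, blocklist terms, and required keys are all placeholders I made up for illustration, and the function name `run_heuristics` is hypothetical. The point is only that every check is exact and every failure gets a name you can log.

```python
import json
from dataclasses import dataclass, field

# Hypothetical limits and terms; tune these for your own task.
MAX_CHARS = 2000
BLOCKLIST = {"lorem ipsum", "as an ai language model"}
REQUIRED_KEYS = {"answer", "citations"}


@dataclass
class HeuristicResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def run_heuristics(raw_output: str) -> HeuristicResult:
    """Run deterministic checks and collect every failure for the audit log."""
    failures: list[str] = []

    if len(raw_output) > MAX_CHARS:
        failures.append("too_long")

    lowered = raw_output.lower()
    if any(term in lowered for term in BLOCKLIST):
        failures.append("blocklisted_term")

    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        # Without valid JSON the schema checks below cannot run.
        failures.append("invalid_json")
        return HeuristicResult(passed=False, failures=failures)

    if not isinstance(payload, dict):
        failures.append("not_an_object")
    else:
        missing = REQUIRED_KEYS - set(payload)
        if missing:
            failures.append("missing_keys:" + ",".join(sorted(missing)))
        elif not payload["citations"]:
            failures.append("no_citations")

    return HeuristicResult(passed=not failures, failures=failures)
```

Returning a list of named failures instead of a bare boolean is a design choice: it costs nothing and it is what makes the audit trail readable later.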

Where does LLM-as-a-judge fit?

LLM-as-a-judge belongs in the middle tier, where the task is real but not fully reducible to code. That includes helpfulness, groundedness, completeness, tone, adherence to nuanced instructions, and tradeoff-heavy judgments. The recent literature is pretty clear that judges can work well, but their reliability depends on prompt design, aggregation strategy, and bias mitigation [1][2].

My rule is simple: use an LLM judge when you need a rubric, not a binary truth table. If the output needs interpretation, the judge is useful. If it needs proof, use a test. If it needs policy or judgment under uncertainty, you probably still want a human somewhere in the loop.

How should the judge be prompted?

Use a narrow rubric, explicit criteria, and structured output. Studies on judge improvement found that task-specific criteria and ensembling often outperform fancier tricks like calibration context or soft blending [3]. Another reliability study showed that prompt variations can materially change judge behavior, which is exactly why you want the rubric to be as stable and explicit as possible [2].
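
Ensembling can take many forms, and I'm not claiming this is the aggregation used in the cited study. As one simple illustration, here is a majority vote over several judge runs. The function name `majority_verdict` and the choice to turn ties into "uncertain" are my own assumptions.

```python
from collections import Counter


def majority_verdict(verdicts: list[str]) -> tuple[str, float]:
    """Return the most common verdict and the share of votes it received.

    Ties are reported as "uncertain" so a later routing step can escalate them.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")

    counts = Counter(verdicts).most_common()
    top_label, top_count = counts[0]
    share = top_count / len(verdicts)

    # A tie for first place means the ensemble has no clear answer.
    if len(counts) > 1 and counts[1][1] == top_count:
        return "uncertain", share
    return top_label, share
```

The vote share doubles as a rough confidence signal: a 3-of-3 verdict and a 2-of-3 verdict should probably not be treated the same downstream.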

A good judge prompt usually asks for a single decision, brief reasoning, and machine-readable output. Keep it boring. Boring is auditable.
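
Here is roughly what "boring" can look like in code. The rubric text, the pass/fail labels, and both function names are hypothetical, and the actual model call is deliberately left out because it depends on your provider. What the sketch shows is a fixed rubric plus strict parsing, so a chatty or malformed reply never gets silently counted as a verdict.

```python
import json

# Hypothetical rubric; replace with criteria specific to your task.
RUBRIC = """You are grading one answer against one criterion.
Criterion: the answer is fully supported by the provided context.
Reply with JSON only: {"verdict": "pass" or "fail", "reason": "<one sentence>"}"""

ALLOWED_VERDICTS = {"pass", "fail"}


def build_judge_prompt(context: str, answer: str) -> str:
    """Combine the fixed rubric with the sample being graded."""
    return f"{RUBRIC}\n\nContext:\n{context}\n\nAnswer:\n{answer}"


def parse_judge_reply(reply: str) -> dict:
    """Validate the judge's reply; anything malformed becomes 'unparseable'."""
    unparseable = {"verdict": "unparseable", "reason": reply[:200]}
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return unparseable

    verdict = data.get("verdict") if isinstance(data, dict) else None
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        return unparseable
    return {"verdict": verdict, "reason": str(data.get("reason", ""))}
```

Treating "unparseable" as its own outcome, instead of guessing, gives you one more explicit signal to route on.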

Why do humans still matter?

Humans matter because the LLM judge is not the source of truth. It is a scalable proxy. Research shows judges can exhibit style bias, prompt sensitivity, and inconsistent behavior across tasks and model families [1][2]. Humans are how you calibrate the proxy, spot drift, and adjudicate the weird stuff the rubric missed.

I like to use humans for three cases: borderline samples, disagreement between heuristic and judge tiers, and periodic gold-set calibration. That's the difference between "we have a judge" and "we trust our judge."

When should a sample escalate to humans?

Escalate when confidence is low, when tiers disagree, when the sample is high-value, or when the task is sensitive enough that a false positive is costly. In practice, that means you can route only a tiny fraction of traffic to humans and still keep the system honest.

Here's the useful trick: make escalation rules explicit. Don't let "someone should look at this" live in Slack. Put it in the pipeline.
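
A sketch of what "in the pipeline" might mean: the four escalation conditions above written as code that returns the reasons it fired. The field names, the `Sample` class, and the 0.8 threshold are assumptions for illustration, not values from any of the cited work.

```python
from dataclasses import dataclass

# Hypothetical threshold; pick it from your own calibration data.
MIN_JUDGE_AGREEMENT = 0.8


@dataclass
class Sample:
    heuristic_flagged: bool   # a soft heuristic failure marked for deeper review
    judge_verdict: str        # "pass", "fail", or "uncertain"
    judge_agreement: float    # share of judge runs that agreed, 0.0 to 1.0
    high_value: bool = False
    sensitive: bool = False


def escalation_reasons(s: Sample) -> list[str]:
    """Return every rule that fired; an empty list means no human review."""
    reasons = []
    if s.judge_verdict == "uncertain" or s.judge_agreement < MIN_JUDGE_AGREEMENT:
        reasons.append("low_confidence")
    if s.heuristic_flagged and s.judge_verdict == "pass":
        reasons.append("tiers_disagree")
    if s.high_value:
        reasons.append("high_value")
    if s.sensitive:
        reasons.append("sensitive")
    return reasons
```

Because the function returns reasons and not just a yes or no, the log tells you why a sample reached a human, which is what you need when you later tune the rules.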

A practical routing table for all three tiers

The cleanest pipeline is a decision tree, not a blob of scores. Start with heuristics, then judge, then human review if needed. This keeps the architecture explainable and makes it easy to tune thresholds later.

Tier	Best for	Output	Cost	Risk
Heuristic	Schema, format, hard rules	Pass/fail	Lowest	Low
LLM-as-a-judge	Subjective rubric checks	Score / verdict	Medium	Medium
Human annotation	Edge cases, gold labels	Final adjudication	Highest	Lowest bias, highest latency

What I've noticed in practice is that the best teams don't argue about whether the judge is "good enough" in the abstract. They ask whether it's good enough for this tier. That's a much better question.
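
The decision tree can be sketched in a few lines. This is a toy orchestrator under stated assumptions: the heuristic and judge are passed in as plain callables, the judge returns a verdict plus a confidence between 0 and 1, and the 0.8 cutoff is a placeholder. The name `evaluate` and the shape of the result are mine.

```python
from typing import Callable


def evaluate(
    sample: str,
    heuristic: Callable[[str], bool],
    judge: Callable[[str], tuple[str, float]],
    min_confidence: float = 0.8,  # hypothetical threshold
) -> dict:
    """Walk the tiers in order and record each decision for later audit."""
    trail = []

    # Tier 1: deterministic checks. A failure ends evaluation immediately.
    ok = heuristic(sample)
    trail.append({"tier": "heuristic", "passed": ok})
    if not ok:
        return {"verdict": "fail", "decided_by": "heuristic", "trail": trail}

    # Tier 2: LLM judge. A confident verdict is accepted as final.
    verdict, confidence = judge(sample)
    trail.append({"tier": "judge", "verdict": verdict, "confidence": confidence})
    if confidence >= min_confidence:
        return {"verdict": verdict, "decided_by": "judge", "trail": trail}

    # Tier 3: everything else waits for a human.
    trail.append({"tier": "human", "status": "queued"})
    return {"verdict": "pending", "decided_by": "human", "trail": trail}
```

For example, `evaluate("some output", lambda s: True, lambda s: ("fail", 0.5))` comes back as pending with a three-entry trail, one entry per tier it touched.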

How do you keep the judge from becoming the weak link?

You test it like software. Build a gold set, measure agreement with humans, run adversarial cases, and track drift over time. The research is full of reminders that judge quality is not fixed; it changes with prompt wording, model version, and task type [1][2]. So the judge itself needs evaluation.
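
One way to measure agreement with humans on a gold set is raw agreement plus Cohen's kappa, which corrects for the agreement you'd expect by chance. Kappa is my suggestion here, not something the cited papers prescribe, and the function name is hypothetical.

```python
from collections import Counter


def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)

    if expected == 1.0:  # both raters used one identical label throughout
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

Raw agreement alone can flatter a judge when one label dominates the gold set, which is why I'd track both numbers over time.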

This is also where tools like Rephrase can help: if you're writing dozens of eval prompts across tasks, it's easy to introduce accidental wording drift. Rewriting those prompts into cleaner, task-specific versions saves a lot of avoidable variance.

What should you measure?

Measure escalation rates, human agreement, false accepts, false rejects, cost per evaluated sample, and judge stability under prompt changes. If you only track average accuracy, you'll miss the failure modes that actually break pipelines.
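
As a rough sketch, most of these numbers fall out of a single pass over logged results that also have a human label. The `Record` fields and the `summarize` function are hypothetical, and the cost figure assumes you already log a per-sample cost across tiers. Judge stability under prompt changes needs a separate experiment and is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Record:
    pipeline_verdict: str   # "pass" or "fail" as decided by the pipeline
    human_label: str        # "pass" or "fail" from the gold set
    escalated: bool         # True if the sample reached human review
    cost_usd: float         # hypothetical per-sample cost across all tiers


def summarize(records: list[Record]) -> dict:
    """Compute pipeline-level metrics from human-labeled records."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    return {
        "escalation_rate": sum(r.escalated for r in records) / n,
        "human_agreement": sum(r.pipeline_verdict == r.human_label for r in records) / n,
        # Counts, not rates, so the denominator is never ambiguous.
        "false_accepts": sum(
            r.pipeline_verdict == "pass" and r.human_label == "fail" for r in records
        ),
        "false_rejects": sum(
            r.pipeline_verdict == "fail" and r.human_label == "pass" for r in records
        ),
        "cost_per_sample": sum(r.cost_usd for r in records) / n,
    }
```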

I also recommend versioning everything: heuristic rules, judge prompt, model version, and human rubric. Without that, you won't know whether a regression came from the model or from your evaluation system.
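
A lightweight way to do this, sketched under my own assumptions about field names, is to bundle the four versioned pieces into one config object and store a short hash of it next to every result. If two runs have different fingerprints, something in the evaluation system changed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    # All field names here are illustrative, not a standard schema.
    heuristic_rules_version: str
    judge_prompt: str
    judge_model: str
    human_rubric_version: str

    def fingerprint(self) -> str:
        """Stable short hash of the whole config, to store with every result."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Hashing the full judge prompt, not just a version label, catches the accidental one-word edits that nobody remembers to bump a version for.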

What does a good three-tier workflow look like?

A good workflow is simple and opinionated. The heuristic layer rejects obvious failures. The judge tier scores the ambiguous middle. The human tier handles uncertainty, calibration, and high-impact decisions. That structure matches both the practical advice in recent judge papers and the reliability concerns raised by meta-evaluation work [1][2][3].

If you want a quick start, define one deterministic rule set, one judge rubric, and one human escalation policy. Then measure how often each tier fires. You'll usually find that the expensive layers can be much smaller than you thought.

A final note: don't overcomplicate the first version. Ship the pipeline, inspect the failures, then tighten the thresholds. That's the loop.

If you want more articles on prompt engineering and AI workflows, check the Rephrase blog. And if you need to turn rough evaluator notes into tighter prompts fast, the Rephrase homepage is a good place to start.

References

Documentation & Research

[1] Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines. arXiv.
[2] Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory. arXiv.
[3] An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2. arXiv.

Community Examples

Open-source LLM-as-a-Judge pipeline for comparing local models - feedback welcome. r/LocalLLaMA.

Frequently asked

What is a three-tier eval pipeline?

It's an evaluation system that starts with cheap heuristic checks, escalates ambiguous cases to an LLM judge, and sends the hardest or highest-risk samples to humans.

Why not use only human annotation?

Humans are still the gold standard, but they're slow and expensive. A tiered pipeline reserves human time for the samples where it actually changes the decision.

What should go into the heuristic tier?

Anything deterministic: schema validity, regex checks, length limits, citation presence, forbidden terms, and simple business rules. If code can verify it, code should verify it.
