Iterating Prompts with Expert Feedback: A Five-Step Loop

Updated: 2026-03-01 By: Ari Heljakka

Short answer

Iterating a prompt without a measurement loop is guessing. The reliable pattern is a closed loop: define the objective as a measurable target, capture a baseline, collect structured expert feedback against the same objective, revise the prompt in small steps, and run a layered evaluation (deterministic checks, model-as-judge, sampled human review) on every revision. The objective is versioned, the prompt is versioned, the evaluation suite is versioned, and a prompt change ships only when it clears the same gate every prior version had to clear.

Key facts

  • Definition: Expert-feedback prompt iteration is a measurement-driven loop where domain specialists score outputs against an explicit rubric, and the rubric, the prompt, and the calibration dataset all evolve under version control.
  • When to use: Any prompt powering a high-stakes task (medical, legal, financial, customer-facing), any prompt whose failure modes are subtle (tone, factual nuance, policy compliance), and any prompt that will be revised more than once.
  • Limitations: Expert time is expensive and noisy across raters. Without a calibration dataset and inter-rater agreement metrics, expert feedback is just opinion. Without CI gates, "improved" prompts can regress on dimensions no one is watching.
  • Example: A clinical summarization prompt is scored by clinicians on five independent 0 to 1 dimensions (faithfulness, completeness, safety, tone, brevity). Each revision must hold or improve every dimension before deployment; a regression on safety blocks merge even if faithfulness improves.

Key takeaways

  • Prompt iteration is an optimization problem. You cannot optimize what you do not measure.
  • Separate the objective (what good looks like) from the implementation (the prompt text). The objective survives prompt rewrites and model swaps.
  • Decompose quality into independent dimensions, each scored 0 to 1. Composition is a weighted sum the team agrees on, not a vibe.
  • Expert feedback is most valuable when it labels a calibration dataset; raw critique without ground truth degrades into preference noise.
  • Treat prompt revisions like code: semantic versioning, pull requests, automated evaluation gates, and rollback paths.

Definition

A prompt iteration loop is a repeatable process where a prompt is changed, evaluated against an explicit set of measurable objectives, and either promoted or rejected. Expert feedback is structured input from a domain specialist, captured as labeled examples and rubric scores, not as free-form notes. Calibration data is the labeled ground truth against which both the prompt and any automated evaluator are scored.

A serious iteration loop has three artifacts that outlive any single prompt revision:

  1. The objective. A written, dimensional description of what success looks like, independent of the prompt that produces it.
  2. The calibration dataset. A versioned set of inputs paired with expert-labeled scores on each objective dimension.
  3. The evaluation suite. Deterministic checks, a managed LLM judge, and sampled human review, each independently scored and composable into a single release gate.

The prompt itself is the most disposable artifact in the loop. The objective, dataset, and suite carry across model swaps and product pivots.

When this matters

  • High-stakes domains. Medicine, law, finance, and trust and safety cannot ship prompts that "look fine." They need measurable agreement with expert judgment.
  • Prompts that will be revised repeatedly. Any prompt powering a long-lived product accumulates revisions. Without a loop, the tenth revision is no more trustworthy than the first.
  • Cross-team ownership. When the prompt author is not the domain expert, structured feedback is the only mechanism that survives a handoff.
  • Model swaps. When the underlying model changes, the prompt may not, but the evaluation must still pass. A model-agnostic evaluation suite is what lets you swap models with confidence.

How it works

The loop runs in five stages, and each stage produces a versioned artifact.

Stage 1: Define a measurable objective

Write down what the prompt is for, in dimensions a domain expert can score independently. Avoid composite labels like "quality." Decompose into orthogonal dimensions:

  • Faithfulness: does the output reflect the source?
  • Completeness: are required elements present?
  • Safety: does the output avoid disallowed content?
  • Tone: does the output match the audience and brand?
  • Brevity: is the output as short as the task allows?

Each dimension is a 0 to 1 score. The composite is a weighted sum the team agrees on, recorded in the rubric. The rubric is itself a versioned artifact, not a slide.

A concrete operational target gives the loop something it can succeed or fail against. "Improve clarity" is not a target. "Raise faithfulness from 0.72 to 0.85 on the calibration set, without dropping any other dimension below its current floor" is a target.
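As a rough illustration, the rubric and the target could be recorded as data rather than prose. This is a minimal sketch, not a prescribed format: the `Rubric` class, the `meets_target` helper, and any weights or floors passed to them are assumptions made for the example. Only the five dimension names and the shape of the target come from the text above.

```python
from dataclasses import dataclass

DIMENSIONS = ("faithfulness", "completeness", "safety", "tone", "brevity")


@dataclass(frozen=True)
class Rubric:
    """Hypothetical versioned rubric: one weight per dimension."""
    version: str
    weights: dict  # dimension name -> weight; weights must sum to 1.0

    def __post_init__(self):
        if set(self.weights) != set(DIMENSIONS):
            raise ValueError("weights must cover every dimension exactly once")
        if abs(sum(self.weights.values()) - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1.0")

    def composite(self, scores: dict) -> float:
        """Weighted sum of per-dimension scores, each expected in [0, 1]."""
        for dim in DIMENSIONS:
            if not 0.0 <= scores[dim] <= 1.0:
                raise ValueError(f"{dim} score must be between 0 and 1")
        return sum(self.weights[dim] * scores[dim] for dim in DIMENSIONS)


def meets_target(scores: dict, target_dim: str, target_value: float, floors: dict) -> bool:
    """True when the target dimension reaches its goal and no other
    dimension falls below its recorded floor."""
    if scores[target_dim] < target_value:
        return False
    return all(scores[dim] >= floor for dim, floor in floors.items() if dim != target_dim)
```

With this shape, "raise faithfulness to 0.85 without dropping any other dimension below its floor" becomes a call such as `meets_target(scores, "faithfulness", 0.85, floors)`, which either passes or fails.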

Stage 2: Generate a baseline

Run the current prompt across the full calibration dataset and record every output. The baseline is not a single number; it is the per-dimension score distribution, plus the failure transcripts.

Group failures by mode (hallucination, omission, tone mismatch, format break, policy violation). The mode distribution tells you which dimension to attack first. A prompt that fails on faithfulness needs a different intervention than one that fails on tone.
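A baseline summary along these lines could be computed with the standard library alone. The record layout below (a `scores` dict plus an optional `failure_mode` tag) is an assumed format for illustration, not one the loop requires.

```python
from collections import Counter
from statistics import mean


def summarize_baseline(records):
    """Summarize a baseline run.

    records: list of dicts, each with
      "scores": {dimension: float in [0, 1]}
      "failure_mode": a tag such as "hallucination", or None if the output passed.
    Returns (per-dimension stats, failure modes ordered most frequent first).
    """
    dims = sorted({dim for r in records for dim in r["scores"]})
    per_dimension = {
        dim: {
            "mean": mean(r["scores"][dim] for r in records),
            "min": min(r["scores"][dim] for r in records),
        }
        for dim in dims
    }
    modes = Counter(r["failure_mode"] for r in records if r["failure_mode"])
    return per_dimension, modes.most_common()
```

The first entry of the returned mode list is the most frequent failure mode, which is the natural candidate for the next revision.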

Stage 3: Collect expert feedback as structured labels

Domain experts review the failure transcripts and label each output on every dimension. They do not rewrite the prompt; they grade the output. This separation is critical: the prompt is the engineer's responsibility, the rubric and the labels are the expert's.

Structured feedback formats that survive review fatigue:

  • Per-dimension 0 to 1 score with a short justification.
  • Failure tag from a fixed taxonomy (the same taxonomy the engineer uses to triage).
  • Suggested counterfactual: what the output should have said, not how the prompt should be rewritten.

Inter-rater agreement is measured (Cohen's kappa, intra-class correlation, or simple percent agreement). If two experts disagree on a dimension by more than a threshold, the rubric is ambiguous and must be sharpened before the prompt is touched.
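For reference, percent agreement and Cohen's kappa for two raters can be computed in a few lines. Kappa is defined over categorical labels, so this sketch assumes the 0 to 1 scores are first bucketed into pass/fail; the `bucket` helper and its 0.5 cut are illustrative assumptions, not part of the rubric.

```python
from collections import Counter


def bucket(score, cut=0.5):
    """Assumed bucketing of a 0-1 score into a categorical label."""
    return "pass" if score >= cut else "fail"


def percent_agreement(labels_a, labels_b):
    """Share of items on which the two raters gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    # Chance agreement: probability both raters pick the same category at random.
    p_expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Kappa near 0 means the raters agree no more than chance would predict, even when raw percent agreement looks acceptable, which is why it is worth reporting alongside the simpler metric.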

Stage 4: Revise the prompt in small steps

Each revision should change one thing at a time so the effect is attributable. The disciplined moves:

  • Add a constraint that addresses a specific failure mode.
  • Add or refine a few-shot example drawn from the calibration dataset.
  • Restructure the output schema to make the failure mode mechanically impossible (JSON schema, enum, length cap).
  • Decompose the prompt into stages, with an intermediate output that is itself evaluable.
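To illustrate the schema move in the list above, here is a minimal sketch of a parser that rejects any output outside a fixed enum or over a length cap. The field names `lab_status` and `summary`, the enum values, and the 600-character cap are all hypothetical choices for the example.

```python
import json
from enum import Enum


class LabStatus(str, Enum):
    """Hypothetical closed set of allowed values for one output field."""
    REPORTED = "reported"
    NOT_REPORTED = "not_reported"


MAX_SUMMARY_CHARS = 600  # assumed length cap


def parse_summary(raw: str) -> dict:
    """Parse model output and enforce the schema.

    Raises ValueError for malformed JSON, a value outside the enum,
    or a summary over the cap; raises KeyError for a missing field.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    status = LabStatus(data["lab_status"])  # ValueError if not in the enum
    summary = data["summary"]
    if not isinstance(summary, str) or len(summary) > MAX_SUMMARY_CHARS:
        raise ValueError("summary must be a string within the length cap")
    return {"lab_status": status, "summary": summary}
```

An output that invents a third status value never reaches the scorer; it fails at parse time, which turns a judgment call into a deterministic check.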

What does not belong in a single revision: rewriting the entire system prompt, swapping the model, and editing the rubric at the same time. Each of those is its own change, with its own evaluation pass.

Revisions use semantic versioning. A patch change (1.2.3 to 1.2.4) addresses a single failure mode without changing the contract. A minor change (1.2.x to 1.3.0) adds a new dimension to the rubric or changes the output schema. A major change (1.x.y to 2.0.0) is incompatible with downstream consumers.
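The versioning rule can be sketched as two small functions. The names `classify_change` and `bump`, and the two boolean flags, are illustrative assumptions; the mapping from change type to version level follows the paragraph above.

```python
def classify_change(breaks_consumers: bool, changes_contract: bool) -> str:
    """Map a prompt change to a semantic-version level.

    breaks_consumers: downstream consumers cannot use the new output as-is.
    changes_contract: a rubric dimension was added or the output schema changed.
    """
    if breaks_consumers:
        return "major"
    if changes_contract:
        return "minor"
    return "patch"


def bump(version: str, level: str) -> str:
    """Return the next version string for a MAJOR.MINOR.PATCH version."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")
```

For example, a single-failure-mode fix on 1.2.3 would be `bump("1.2.3", classify_change(False, False))`, giving 1.2.4.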

Stage 5: Evaluate in layers and gate the release

A single number is not a release gate. A layered evaluation, each layer independently scored, is. The standard composition:

  1. Deterministic checks. Format, schema, length, presence of required tokens, absence of disallowed tokens. These run in milliseconds and catch the cheap failures.
  2. Managed LLM judge. A versioned judge prompt scores each dimension against the rubric. The judge's agreement with the calibration dataset is itself a metric tracked over time.
  3. Sampled human review. A fraction of outputs, deliberately biased toward the boundary slice (judge score near the threshold), is reviewed by an expert and added to the calibration dataset.
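The boundary-biased sampling in the third layer might look like the sketch below. The function name, the `band` around the threshold, and the default 80 percent boundary share are assumptions; the source only says the sample is deliberately biased toward outputs whose judge score sits near the threshold.

```python
import random


def sample_for_review(judged, threshold, band, budget, boundary_share=0.8, seed=0):
    """Pick outputs for expert review, biased toward the boundary slice.

    judged: list of (output_id, judge_score) pairs.
    threshold, band: an output is "boundary" when |score - threshold| <= band.
    budget: total number of outputs the expert can review.
    boundary_share: assumed fraction of the budget spent on the boundary slice.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    boundary = [item for item in judged if abs(item[1] - threshold) <= band]
    rest = [item for item in judged if abs(item[1] - threshold) > band]
    n_boundary = min(len(boundary), round(budget * boundary_share))
    n_rest = min(len(rest), budget - n_boundary)
    return rng.sample(boundary, n_boundary) + rng.sample(rest, n_rest)
```

Keeping a small share of the budget for outputs far from the threshold is a design choice: it gives the team a chance to catch cases where the judge is confidently wrong.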

The release gate is a composition rule, not a single threshold. A common pattern: every dimension must hold its current floor, the composite score must improve by at least the noise floor of the calibration set, and the deterministic checks must pass at 100 percent. Anything else is rejected by CI and rolls back automatically.
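The common pattern described above can be written as one function that returns a decision plus the reasons for it. This is a sketch under stated assumptions: the function name and argument layout are hypothetical, and how the noise floor is estimated is left to the team.

```python
def release_gate(candidate, floors, composite_before, composite_after,
                 noise_floor, deterministic_pass_rate):
    """Apply the three-part composition rule to one candidate revision.

    candidate: {dimension: score} for the revision under test.
    floors: {dimension: minimum acceptable score}.
    Returns (passed, reasons); reasons is empty when the gate passes.
    """
    reasons = []
    # Rule 1: every dimension holds its current floor.
    for dim, floor in floors.items():
        if candidate[dim] < floor:
            reasons.append(f"{dim} {candidate[dim]:.2f} is below floor {floor:.2f}")
    # Rule 2: the composite improves by at least the noise floor.
    if composite_after - composite_before < noise_floor:
        reasons.append("composite gain is within the noise floor")
    # Rule 3: deterministic checks pass at 100 percent.
    if deterministic_pass_rate < 1.0:
        reasons.append("deterministic checks did not pass at 100 percent")
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean lets CI print exactly which rule blocked the change, which keeps a rejected revision actionable.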

Example

A team building a clinical-summary assistant runs this loop on a five-dimension rubric (faithfulness, completeness, safety, tone, brevity). The calibration dataset starts at 150 clinician-labeled records and grows by roughly twenty per week from sampled human review.

  • Baseline. The first prompt scores 0.71 faithfulness, 0.82 completeness, 0.95 safety, 0.78 tone, 0.69 brevity. Failure tagging shows that most faithfulness misses are over-confident assertions about lab values not present in the source.
  • Revision 1 (patch). Add an explicit constraint: "If a lab value is not present in the source, write 'not reported' and do not infer." Re-run on the calibration set. Faithfulness rises to 0.84; completeness, safety, and tone hold; brevity drops from 0.69 to 0.65 because the model now adds more "not reported" sentences. The composite improves, but the brevity regression, though small, falls below the floor, so the gate rejects the revision.
  • Revision 2 (patch). Constrain "not reported" to a single line at the end of each section. Brevity recovers to 0.72, faithfulness holds at 0.84, no other dimension drops. The change passes CI.
  • Judge calibration. The managed LLM judge for faithfulness has 0.87 agreement with clinicians on the labeled set. When the model behind the judge is updated, agreement drops to 0.79. The judge prompt is recalibrated against the same dataset, and the release gate refuses to score new revisions until agreement is restored.
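The judge calibration check might be sketched as follows. The source does not say how agreement is computed or what minimum is required, so both are assumptions here: agreement is taken as the share of records where the judge and clinician scores differ by at most a tolerance, and the 0.85 minimum is a hypothetical value chosen so that 0.87 passes and 0.79 does not.

```python
def judge_agreement(judge_scores, expert_scores, tolerance=0.1):
    """Share of records where judge and expert scores are within `tolerance`.

    This definition of agreement is an assumption made for the sketch.
    """
    if len(judge_scores) != len(expert_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - e) <= tolerance for j, e in zip(judge_scores, expert_scores))
    return hits / len(judge_scores)


def gate_is_trusted(agreement, min_agreement=0.85):
    """The gate scores new revisions only while the judge tracks the experts.

    min_agreement is a hypothetical threshold, not a value from the source.
    """
    return agreement >= min_agreement
```

Under these assumptions, `gate_is_trusted(0.87)` is true and `gate_is_trusted(0.79)` is false, matching the behavior in the example: the gate stops scoring until the judge prompt is recalibrated.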

The team ships eleven revisions over six weeks. Every revision is attributable to a failure tag, every revision is gated by the same suite, and the calibration dataset grows from 150 records to roughly 270 over the same period (about twenty per week). The prompt at the end of the cycle is unrecognizable from the one at the start; the objective and the gate are identical.
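The per-dimension floor check can be replayed on the example's own numbers. In this sketch the baseline scores serve as the floors, the `floor_violations` helper is a hypothetical name, and dimensions the text describes as holding are assumed to stay at their baseline values.

```python
BASELINE = {"faithfulness": 0.71, "completeness": 0.82, "safety": 0.95,
            "tone": 0.78, "brevity": 0.69}

# Scores after each revision; "holds" is read as "unchanged from baseline".
REVISION_1 = {**BASELINE, "faithfulness": 0.84, "brevity": 0.65}
REVISION_2 = {**BASELINE, "faithfulness": 0.84, "brevity": 0.72}


def floor_violations(candidate, floors):
    """Return the dimensions where the candidate scores below its floor."""
    return [dim for dim, floor in floors.items() if candidate[dim] < floor]


blocked_by = floor_violations(REVISION_1, BASELINE)  # -> ['brevity']
clean = floor_violations(REVISION_2, BASELINE)       # -> []
```

Revision 1 improves faithfulness by 0.13 yet is blocked on brevity alone; revision 2 clears every floor.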

Limitations

  • Expert time is the binding constraint. A loop that requires daily expert review will not run daily. Bias the human review toward the boundary slice and let the judge handle the rest.
  • Calibration data goes stale. Real-world inputs drift. The calibration set must be refreshed against production traces on a cadence; a static set hides distribution shift.
  • The judge can lie. A managed LLM judge is itself a model, and its agreement with experts must be re-measured whenever the judge's model, prompt, or temperature changes.
  • Composite scores can paper over regressions. A composite that goes up while one dimension goes down can ship a worse product. Per-dimension floors are not optional.
  • Inter-rater disagreement is a feature, not a bug. When experts disagree, the rubric is ambiguous. Treat the disagreement as a signal that the rubric needs a counterexample, not as noise to average away.

Evidence and sources

Further numeric figures cited elsewhere on the web (single-digit hallucination rates after N iterations, double-digit accuracy improvements) are typically reported without enough methodological detail to reproduce. Treat them as directional, and measure on your own data.

FAQ

How big should the calibration dataset be to start iterating safely? A useful floor is around 150 expert-labeled examples spanning the failure modes you already know about. Below 50, single-record noise dominates revision-to-revision comparisons. The dataset should grow with every human review pass, with new examples drawn from the boundary slice.
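One way to see why small sets are noisy: if a single record's score on a dimension flips between 0 and 1, the mean over n records moves by 1/n. A short sketch of that arithmetic (the function name is illustrative):

```python
def max_single_record_shift(n_records: int) -> float:
    """Largest change in a dimension's mean caused by one record flipping 0 <-> 1."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    return 1.0 / n_records


shift_at_50 = max_single_record_shift(50)    # 0.02
shift_at_150 = max_single_record_shift(150)  # about 0.0067
```

At 50 records, one flipped label moves a dimension mean by 0.02; at 150, by about 0.007. Whether that is small enough depends on the size of the gains being measured.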

How do I prevent a "fix" on one dimension from regressing another? Set per-dimension floors and refuse any revision that drops a dimension below its floor, even if the composite improves. CI enforces this automatically.

Who owns the rubric, the engineer or the domain expert? The expert owns the rubric, the engineer owns the prompt, and both sign off on the calibration dataset. When the expert and engineer disagree on whether an output is good, the rubric is the tiebreaker; if the rubric does not resolve the disagreement, it needs another counterexample.

Can the managed LLM judge fully replace expert review? Not on its own. The judge's job is to scale expert judgment across the full traffic, but its agreement with experts is itself a metric that has to be measured and re-measured. Sampled human review feeds the dataset that keeps the judge honest.

How often should the prompt change in production? As often as the loop produces a revision that passes the gate. Some teams ship a revision a week; others go months between revisions. Cadence is an output of the loop, not an input.

What changes when the underlying model is swapped? The objective, calibration set, and rubric remain. The prompt may need to be re-tuned, and the judge has to be re-calibrated against the new model's outputs before the gate is trusted. A model swap is a major version bump for the system, even if the prompt text is unchanged.
