O13: How Do AI Evals Work?

What AI evals are, why they matter, and how to run one on your LLM integration

Mar 06, 2026

∙ Paid

You built the feature. You integrated the LLM. Now someone’s asking: How do you know it’s working? How do you test it? How do you make it better?

Those three questions have a four-step answer. It’s called an AI eval.

The Four Steps

DEFINE → RUN → JUDGE → FIX → (repeat)

Step 1 — Define the Golden Set

Define what good looks like. Build your golden set.

Before your AI touches anything, your domain expert — the person with the most credible judgment in your product area — writes down what the right answer is. Not roughly right. Specifically right.

Their labeled examples become your golden set.

How to build it:

Identify your domain expert — your best support agent, most experienced paralegal, or senior engineer
Have them label ~100 real examples: good outputs and bad ones
Their judgment is the standard — no consensus, no committee

⚠️ Skip this step and you’re measuring nothing.

What one golden set entry looks like:

Input: “How do I cancel my subscription?”
Gold standard answer: “Go to Settings → Billing → Cancel. You’ll keep access until the end of your billing period.”
Why it’s good: Direct, correct, no jargon, tells the user what happens next
Failure mode if wrong: Hallucination or incomplete answer

Step 2 — Run the Model

Feed your AI real questions. Not easy ones.

Take the inputs your actual users send and run them through your model. Collect every output.

What “real” means:

Continue reading this post for free, courtesy of Lewis C. Lin.

Or purchase a paid subscription.

Lewis C. Lin’s Newsletter