Lewis C. Lin’s Newsletter

Deep Dives

O13: How Do AI Evals Work?

What AI evals are, why they matter, and how to run one on your LLM integration

Mar 06, 2026

You built the feature. You integrated the LLM. Now someone’s asking: How do you know it’s working? How do you test it? How do you make it better?

Those three questions have a four-step answer. It’s called an AI eval.

The Four Steps

DEFINE → RUN → JUDGE → FIX → (repeat)
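To make the shape of the loop concrete, here is a rough Python sketch, assuming a golden set of input/gold pairs and a `model_fn` callable — all names are illustrative placeholders, not a specific eval framework's API, and the exact-match check is just a stand-in for the judging step covered later.

```python
# Minimal sketch of the DEFINE → RUN → JUDGE → FIX loop.
# All names are illustrative placeholders, not a real eval framework.

def run_eval(golden_set, model_fn):
    """golden_set: list of {"input", "gold"} dicts (DEFINE happens before this).
    model_fn: any callable that takes a prompt string and returns the model's answer."""
    results = []
    for example in golden_set:
        output = model_fn(example["input"])                 # RUN the model on a real input
        passed = output.strip() == example["gold"].strip()  # JUDGE (exact match as a stand-in)
        results.append({**example, "output": output, "passed": passed})
    failures = [r for r in results if not r["passed"]]      # FIX starts from the failure list
    return results, failures
```

Each of the four steps below fills in one piece of this skeleton.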

Step 1 — Define the Golden Set

Define what good looks like. Build your golden set.

Before your AI touches anything, your domain expert — the person with the most credible judgment in your product area — writes down what the right answer is. Not roughly right. Specifically right.

Their labeled examples become your golden set.

How to build it:

  • Identify your domain expert — your best support agent, most experienced paralegal, or senior engineer

  • Have them label ~100 real examples: good outputs and bad ones

  • Their judgment is the standard — no consensus, no committee

⚠️ Skip this step and you’re measuring nothing.

What one golden set entry looks like (a code sketch of this entry follows the list):

  • Input: “How do I cancel my subscription?”

  • Gold standard answer: “Go to Settings → Billing → Cancel. You’ll keep access until the end of your billing period.”

  • Why it’s good: Direct, correct, no jargon, tells the user what happens next

  • Failure mode if wrong: Hallucination or incomplete answer
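Here is the same entry as data, one way you might store the golden set so the rest of the eval can read it. The schema, field names, and `golden_set.jsonl` file name are assumptions for illustration, not a required format.

```python
# One way to store golden set entries, mirroring the fields above.
# The schema and file name are illustrative, not a required format.
import json
from dataclasses import dataclass, asdict

@dataclass
class GoldenExample:
    input: str         # real user question
    gold_answer: str   # the domain expert's "specifically right" answer
    why_good: str      # what makes the gold answer good
    failure_mode: str  # what a bad output would look like here

entry = GoldenExample(
    input="How do I cancel my subscription?",
    gold_answer="Go to Settings → Billing → Cancel. You'll keep access "
                "until the end of your billing period.",
    why_good="Direct, correct, no jargon, tells the user what happens next",
    failure_mode="Hallucination or incomplete answer",
)

# Append each labeled example as the expert works through ~100 of them.
with open("golden_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")
```

Any format works. The point is that every entry carries the expert's answer and the reason it's right, so the judging step has something concrete to compare against.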

Step 2 — Run the Model

Feed your AI real questions. Not easy ones.

Take the inputs your actual users send and run them through your model. Collect every output.
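As a sketch of this step, assuming the `golden_set.jsonl` format above: replay the inputs through the model and write down every output. `call_model` here is a stub standing in for whatever LLM client you actually use — swap in your own call.

```python
# Sketch of the RUN step: replay real user inputs through the model, keep every output.
# call_model is a stub standing in for your actual LLM client.
import json

def call_model(prompt: str) -> str:
    # Placeholder. In practice this wraps your provider's chat/completions call.
    return "stub answer for: " + prompt

def run_step(golden_set_path="golden_set.jsonl", outputs_path="model_outputs.jsonl"):
    with open(golden_set_path, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    with open(outputs_path, "w", encoding="utf-8") as out:
        for ex in examples:
            output = call_model(ex["input"])  # same inputs real users sent
            record = {"input": ex["input"], "output": output}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")  # collect every output, no cherry-picking
```

The only hard requirement: the inputs come from real traffic, and nothing gets filtered out before judging.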

What “real” means:
