AI · 5 min

A short list of evals every LLM product should have

You don’t need 1,000 examples. You need 30 well-chosen ones.

The minimum bar

For every LLM-powered feature we ship, we set up the same six eval categories before the feature goes live:

  1. Happy path. Does it work on the most common inputs?
  2. Edge cases. What happens with empty, extreme, or adversarial inputs?
  3. Format adherence. Does the structured output stay structured?
  4. Refusal cases. Does it correctly refuse out-of-scope requests?
  5. Regression cases. Do inputs that were known to break before we deployed stay fixed?
  6. Domain golden set. 10 to 30 hand-picked examples your team has labeled as correct or incorrect.
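
As a rough sketch, the six categories can sit in one small suite. Everything named here is an assumption for illustration: fake_model is a hypothetical stand-in for the real LLM call, the JSON shape with a "refused" field is invented, and the invoice prompts are placeholders rather than real data.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    category: str                  # one of the six categories above
    prompt: str                    # input sent to the model
    check: Callable[[str], bool]   # True when the output is acceptable


def is_json_object(output: str) -> bool:
    """Format adherence: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def refused(output: str) -> bool:
    """True when the output is a JSON object with "refused": true."""
    return is_json_object(output) and json.loads(output).get("refused") is True


def answered(output: str) -> bool:
    """True when the output is a JSON object that did not refuse."""
    return is_json_object(output) and json.loads(output).get("refused") is False


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for the real LLM call; swap in your own client.
    if "weather" in prompt.lower():
        return json.dumps({"refused": True, "answer": None})
    return json.dumps({"refused": False, "answer": "summary"})


# One placeholder case per category; a real suite would hold a few of each.
CASES = [
    EvalCase("happy_path", "Summarize invoice 1: 3 widgets, $30", answered),
    EvalCase("edge_case", "", is_json_object),
    EvalCase("format", "Summarize invoice 2: 1 widget, $10", is_json_object),
    EvalCase("refusal", "What's the weather in Paris?", refused),
    EvalCase("regression", "Summarize invoice 3: total missing", answered),
    EvalCase("golden", "Summarize invoice 4: 2 gadgets, $8", answered),
]


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Run every case through the model and return the pass rate per category."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for case in cases:
        total[case.category] = total.get(case.category, 0) + 1
        ok = case.check(model(case.prompt))
        passed[case.category] = passed.get(case.category, 0) + int(ok)
    return {category: passed[category] / total[category] for category in total}
```

Calling run_suite(fake_model, CASES) returns one pass rate per category. Keeping each check a plain function makes it cheap to add a new case whenever a bug turns up.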

What you don’t need

You don’t need a research-paper-grade eval suite. You don’t need 10,000 examples. You need enough examples that no prompt change can ship without measurable evidence of its effect.
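
One way to make "can’t ship without evidence" concrete is a gate that compares per-category pass rates against minimums. This is a sketch under assumptions: the threshold values and the gate function are hypothetical, not a recommendation for any particular feature.

```python
# Hypothetical minimum pass rates per category; pick your own per feature.
THRESHOLDS = {"happy_path": 0.95, "format": 1.0, "refusal": 0.9}


def gate(pass_rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per category that blocks the change (empty list = ship)."""
    failures = []
    for category, minimum in thresholds.items():
        rate = pass_rates.get(category)
        if rate is None:
            # A category with no results counts as a failure, not a pass.
            failures.append(f"{category}: no results")
        elif rate < minimum:
            failures.append(f"{category}: {rate:.0%} < {minimum:.0%}")
    return failures
```

In CI, a non-empty list from gate would fail the build, for example by exiting with a non-zero status. Treating a missing category as a failure keeps a deleted eval from quietly passing.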

What changes when you have evals

You stop arguing about whether the model got worse. You can measure it. You ship faster, with less fear.
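
To settle "did it get worse?" with numbers, one option is to diff two runs case by case. The sketch below assumes each run is stored as a mapping from a case ID to pass/fail; the compare_runs name and the result shape are illustrative only.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two eval runs on the case IDs they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    # Regressions passed before and fail now; fixes are the reverse.
    regressions = [cid for cid in shared if baseline[cid] and not candidate[cid]]
    fixes = [cid for cid in shared if not baseline[cid] and candidate[cid]]
    count = len(shared) or 1  # avoid dividing by zero when nothing is shared
    return {
        "regressions": regressions,
        "fixes": fixes,
        "baseline_pass_rate": sum(baseline[cid] for cid in shared) / count,
        "candidate_pass_rate": sum(candidate[cid] for cid in shared) / count,
    }
```

Listing regressions by ID matters as much as the overall rate: two runs can score the same while failing on different cases.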