The boring parts of an AI agent that actually matter
Tool-calling and prompts are the fun part. Evals, observability, and rate-limiting are the parts that decide whether the agent makes it past month two.
The 10x cost
When we audit failed agent projects, the cause is almost never the agent’s brain. It’s everything around it.
Every shipped agent we run in production has more code dedicated to observability, retry logic, and eval scaffolding than to the agent itself. That ratio doesn’t reverse; it just gets bigger as the agent gets useful.
What you actually need
- Per-step observability. Every tool call, every reasoning trace, every output: stored, queryable, and replayable.
- A real eval harness. Not vibes. Twenty to two hundred labeled examples per task, run every time you change the prompt.
- Idempotent tools. If your agent retries, your tool needs to handle the retry. Most don’t, and you find out at 2am.
- Rate-limit budgets per task. An agent can run away, so capping cost and tokens per execution is non-negotiable.
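
To make the idempotent-tools point concrete, here is a minimal sketch in Python. It assumes each agent task carries a stable ID and that a tool's arguments are JSON-serializable. The names `IdempotentTool`, `send_email`, and `task-123` are hypothetical, and the in-memory dict stands in for whatever durable store a real deployment would need.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class IdempotentTool:
    """Wraps a side-effecting function so a repeated call with the same key runs once."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self._fn = fn
        # Assumption: in-memory cache for illustration only. It does not
        # survive a restart and is not safe across processes.
        self._results: Dict[str, Any] = {}

    @staticmethod
    def _key(task_id: str, kwargs: Dict[str, Any]) -> str:
        # Derive the idempotency key from the task ID plus the arguments.
        payload = json.dumps({"task": task_id, "args": kwargs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, task_id: str, **kwargs: Any) -> Any:
        key = self._key(task_id, kwargs)
        if key in self._results:
            # A retry: return the stored result instead of repeating the side effect.
            return self._results[key]
        result = self._fn(**kwargs)
        self._results[key] = result
        return result


sent = []  # stands in for a real outbox


def send_email(to: str, body: str) -> str:
    # Hypothetical side-effecting tool.
    sent.append((to, body))
    return f"sent:{len(sent)}"


tool = IdempotentTool(send_email)

# Simulate an agent retrying the same step many times.
for _ in range(40):
    receipt = tool.call("task-123", to="a@example.com", body="Hello")

print(len(sent), receipt)
```

The design choice here is that the tool, not the agent, owns deduplication, so a misbehaving retry loop cannot repeat the side effect no matter how many times it calls.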
A small example
We had an outreach agent that produced great drafts 99% of the time, until one Tuesday a retry loop made it send the same draft 40 times. The lesson is the asymmetry: two hours of debugging versus the five lines of rate-limit code that would have prevented it.
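
The post doesn't show those lines, so what follows is only a sketch of what a per-execution cap might look like, not the code we actually shipped. `TaskBudget`, `BudgetExceeded`, `run_with_budget`, and the limits are all hypothetical, and the 200-token charge per step is an arbitrary placeholder.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a task tries to spend past its cap."""


@dataclass
class TaskBudget:
    max_calls: int
    max_tokens: int
    calls: int = 0
    tokens: int = 0

    def charge(self, tokens: int = 0) -> None:
        # Check before spending, so the cap is never overshot.
        if self.calls + 1 > self.max_calls or self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"budget exhausted after {self.calls} calls and {self.tokens} tokens"
            )
        self.calls += 1
        self.tokens += tokens


def run_with_budget(budget: TaskBudget, attempts: int) -> int:
    """Simulate a runaway loop; return how many side effects actually ran."""
    completed = 0
    try:
        for _ in range(attempts):
            budget.charge(tokens=200)  # placeholder cost per step
            completed += 1             # stands in for "send the draft"
    except BudgetExceeded:
        pass  # a real agent would log this and alert here
    return completed


# A loop that wants to run 40 times is stopped at the cap of 3.
completed = run_with_budget(TaskBudget(max_calls=3, max_tokens=10_000), attempts=40)
print(completed)
```

The part doing the work is the check inside `charge`; everything else is scaffolding to show the loop being cut off.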