/// AI · 2026-02-02 · 9 min read

Shipping LLM features without shipping a demo

Most LLM features look great in a demo and fall apart in production. Here's the observability, eval and rollback stack we wrap around every AI feature.

There's a pattern that has cost teams real money over the last two years. It looks like this: someone hacks together an LLM feature on a Friday, the demo is magical, leadership greenlights it, the team ships it Monday — and within a week the support inbox is full of hallucinations, jailbreaks and quietly inflated bills.

The feature wasn't bad. The wrapper around it was missing. Production LLM work is 20% prompt and model selection, 80% the operational layer that makes it survive contact with real users. Here's the layer we wrap around every AI feature we ship.

1. Offline evals before any user sees it

Before a feature goes near a real user, we build an eval set. Fifty to two hundred input/output pairs that represent the work the feature is supposed to do. We run the model against the eval set on every prompt change and every model swap, and we score outputs with a combination of deterministic checks (regex, JSON validity, did-it-call-the-right-tool) and LLM judges that score against a reference rubric.

Tools we reach for: LangSmith, Braintrust, or a homegrown harness in pytest. The brand doesn't matter. The discipline does.
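
Concretely, the homegrown version can be as small as a parametrised pytest file. The sketch below is illustrative only: the eval cases, call_feature() and llm_judge() are stand-ins you'd replace with your own feature, judge model and thresholds.

```python
# test_evals.py -- a minimal offline eval harness sketch. The eval cases,
# call_feature() and llm_judge() are illustrative stand-ins, not our real code.
import json
import pytest

EVAL_SET = [
    {"input": "Summarise ticket #4821 ...", "expected_tool": "fetch_ticket",
     "rubric": "Mentions the refund amount and the customer's plan."},
    # ... fifty to two hundred cases, usually loaded from a JSON file
]

def call_feature(user_input: str) -> dict:
    """Stand-in for the feature under test; returns {'text': str, 'tool_calls': [str]}."""
    raise NotImplementedError("wire this up to the real feature")

def llm_judge(output: str, rubric: str) -> float:
    """Stand-in for an LLM judge that returns a 0.0-1.0 score against the rubric."""
    raise NotImplementedError("wire this up to your judge model")

@pytest.mark.parametrize("case", EVAL_SET)
def test_deterministic_checks(case):
    out = call_feature(case["input"])
    json.loads(out["text"])                               # output must be valid JSON
    assert case["expected_tool"] in out["tool_calls"]     # did it call the right tool?

@pytest.mark.parametrize("case", EVAL_SET)
def test_judge_score(case):
    out = call_feature(case["input"])
    assert llm_judge(out["text"], case["rubric"]) >= 0.8  # threshold is an example value
```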

2. Online evals once it ships

Offline evals catch regression. Online evals catch reality. We sample real production traffic — anonymised — and grade it the same way, on a rolling basis. If the score on a metric drops by more than a set threshold week-over-week, an alert fires.

This is the single biggest difference between an AI product that ages well and one that quietly degrades. Models drift. User behaviour drifts. If you don't measure both, you'll find out from the support inbox.
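
The rolling check itself is small. The sketch below assumes you already have a way to sample anonymised traffic and grade it with the same graders as the offline evals; sample_traffic(), grade() and alert() are placeholders for your own stack, and the 5% threshold is just an example.

```python
# drift_check.py -- rolling week-over-week comparison of online eval scores (sketch).
# sample_traffic(), grade() and alert() are placeholders for your own stack.
from statistics import mean

DROP_THRESHOLD = 0.05   # example: alert if a metric falls more than 5% week-over-week

def weekly_score(metric: str, weeks_ago: int, sample_traffic, grade) -> float:
    samples = sample_traffic(weeks_ago=weeks_ago, n=500)      # anonymised production requests
    return mean(grade(s, metric=metric) for s in samples)     # same graders as the offline evals

def check_drift(metric: str, sample_traffic, grade, alert) -> None:
    current = weekly_score(metric, 0, sample_traffic, grade)
    previous = weekly_score(metric, 1, sample_traffic, grade)
    if previous > 0 and (previous - current) / previous > DROP_THRESHOLD:
        alert(f"{metric}: {previous:.2f} -> {current:.2f}, past the week-over-week threshold")
```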

3. Cost ceilings, per feature

Every LLM feature has a per-request token budget and a per-day spend ceiling. If a request exceeds its budget, it gets truncated or downgraded to a cheaper model. If we hit the daily ceiling, we degrade gracefully — the feature shows a fallback message instead of silently burning cash. We've seen too many post-mortems about LLM bills to skip this step.
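
In code, the guard is a thin wrapper around the model call. Everything in the sketch below — the budgets, the crude token estimate, the fallback copy — is illustrative, not a drop-in.

```python
# cost_guard.py -- per-request token budget and per-day spend ceiling (illustrative sketch).
import datetime

MAX_PROMPT_TOKENS = 4_000      # per-request budget (example value)
DAILY_CEILING_USD = 50.0       # per-feature daily ceiling (example value)
FALLBACK_MESSAGE = "This feature is temporarily unavailable. Please try again later."
_spend = {"date": None, "usd": 0.0}

def _rough_tokens(text: str) -> int:
    return len(text) // 4      # crude proxy; swap in a real tokenizer

def guarded_call(prompt: str, call_model) -> str:
    """call_model(prompt) should return (response_text, estimated_cost_usd)."""
    today = datetime.date.today()
    if _spend["date"] != today:                       # reset the counter each day
        _spend.update(date=today, usd=0.0)

    if _spend["usd"] >= DAILY_CEILING_USD:            # ceiling hit: degrade gracefully
        return FALLBACK_MESSAGE

    if _rough_tokens(prompt) > MAX_PROMPT_TOKENS:     # over budget: truncate (or downgrade model)
        prompt = prompt[: MAX_PROMPT_TOKENS * 4]

    response, cost_usd = call_model(prompt)
    _spend["usd"] += cost_usd
    return response
```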

4. Tracing on every request

We trace every LLM call with OpenTelemetry: which model, how many tokens, how long it took, which tools it called, what the upstream user did before and after. If a customer complains about an output six weeks later, we can pull the exact trace.
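With the OpenTelemetry Python API, that amounts to one span per call with the fields we care about as attributes. The attribute names and the call_model callable below are an illustrative sketch, not a standard convention.

```python
# llm_tracing.py -- one span per LLM call via OpenTelemetry (sketch; the attribute
# names and the call_model callable are illustrative, not a standard convention).
import time
from opentelemetry import trace

tracer = trace.get_tracer("llm.feature")

def traced_completion(model: str, prompt: str, call_model) -> str:
    """call_model(model, prompt) should return (text, prompt_tokens, completion_tokens)."""
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("llm.model", model)
        start = time.monotonic()
        text, prompt_tokens, completion_tokens = call_model(model, prompt)
        span.set_attribute("llm.prompt_tokens", prompt_tokens)
        span.set_attribute("llm.completion_tokens", completion_tokens)
        span.set_attribute("llm.latency_ms", int((time.monotonic() - start) * 1000))
        return text
```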

5. A rollback that actually works

Prompts and models go through the same release path as code. They're versioned, gated behind feature flags, and can be rolled back with a single switch. We never edit a production prompt by hand — every change is a PR.
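
A stripped-down version of the idea, with the flag store reduced to a dict so the mechanics are visible. The feature names, prompt versions and flag keys are made up for illustration.

```python
# prompt_registry.py -- versioned prompts behind a flag, so rollback is one switch
# (sketch; the feature names, versions and flag store are made up for illustration).
PROMPTS = {
    ("summarise_ticket", "v3"): "Summarise the support ticket below in three bullet points.",
    ("summarise_ticket", "v4"): "You are a support analyst. Summarise the ticket below, citing ticket fields.",
}

FLAGS = {"summarise_ticket.prompt_version": "v4"}   # flip back to "v3" to roll back

def active_prompt(feature: str) -> str:
    version = FLAGS.get(f"{feature}.prompt_version", "v3")
    return PROMPTS[(feature, version)]
```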

6. PII and safety as code, not policy

PII redaction happens at the boundary, not in the prompt. Safety classifiers run before and after generation. Refusals are logged and reviewed weekly. The compliance answer to "what happens if the model says X" should never be "we hope it doesn't".
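A sketch of what "as code" means here. The regexes below are nowhere near exhaustive, and classify(), call_model() and log_refusal() are placeholders for your own classifiers and logging.

```python
# safety_boundary.py -- PII redaction at the boundary plus safety checks around generation
# (sketch; the regexes are not exhaustive and the callables are placeholders).
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} redacted]", text)
    return text

def safe_generate(user_input: str, call_model, classify, log_refusal) -> str:
    prompt = redact(user_input)                      # PII never reaches the prompt
    if classify(prompt) == "unsafe":                 # classifier before generation
        log_refusal(prompt, stage="input")
        return "Sorry, I can't help with that."
    output = call_model(prompt)
    if classify(output) == "unsafe":                 # classifier after generation
        log_refusal(output, stage="output")
        return "Sorry, I can't help with that."
    return output
```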

What this gets you

You get an LLM feature that ships on a Tuesday, doesn't melt down by Friday, and gets quietly better month-over-month — instead of one that demos beautifully and requires a war room to keep alive. That's the difference between AI as a marketing line and AI as a product.
