A demo is a demo. A production system has different requirements, and most AI projects we see are missing them on day one. This is the checklist we run on every system we ship. Use it to audit your own.
What “production-ready” actually means
Production-ready does not mean the model is accurate. Models are accurate enough for most tasks already. Production-ready means the system can run unattended in front of real users, with normal failure modes handled, normal abuse rejected, normal cost controls in place, and normal operational visibility.
Most AI projects fail one of these seven checks. The rest of this post walks through each one, what it looks like in practice, and how to add it if you are missing it.
1. Eval pipeline with regression tests
You have a versioned set of input/output pairs. Every change to the prompt, the model, or the retrieval pipeline runs against the set. If any test drops below threshold, the deploy is blocked. The set is in version control. The threshold is documented. The on-call engineer can run the eval locally in under five minutes.
The most common failure mode here is a team that says “we test it manually before deploys.” Manual testing does not catch the regression where you change the system prompt to fix one customer’s complaint and silently break twelve other customer workflows.
Build the eval set early. Keep it under 200 examples until you have real production traffic, then promote real traffic into the eval set when you find interesting failure cases.
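As a rough sketch, the gate can be a single pytest file that loads the versioned eval set, scores every example, and fails the build when the pass rate dips below the documented threshold. The eval_set.jsonl path, run_pipeline, and grade below are hypothetical stand-ins for whatever your pipeline and grader actually are.

```python
# eval_regression_test.py -- minimal sketch; eval_set.jsonl, run_pipeline,
# and grade() are placeholders for your own pipeline and grading logic.
import json
import pathlib

from my_app.pipeline import run_pipeline   # hypothetical: your prompt + retrieval + model call
from my_app.grading import grade           # hypothetical: returns a 0.0-1.0 score per example

EVAL_SET = pathlib.Path("evals/eval_set.jsonl")  # versioned alongside the code
THRESHOLD = 0.90                                  # documented, reviewed, deploy-blocking

def test_eval_set_meets_threshold():
    examples = [json.loads(line) for line in EVAL_SET.read_text().splitlines() if line]
    scores = [grade(run_pipeline(ex["input"]), ex["expected"]) for ex in examples]
    pass_rate = sum(scores) / len(scores)
    assert pass_rate >= THRESHOLD, (
        f"Eval pass rate {pass_rate:.2%} fell below {THRESHOLD:.0%}; deploy blocked."
    )
```

Wire the test into CI so a failing eval blocks the merge the same way a failing unit test does.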
2. Observability of latency, cost, and errors
Every request logs latency (p50 and p95), token cost, model used, retrieval hit count, and outcome. The data lands in a dashboard your on-call engineer actually looks at. The dashboard has alerts wired to a pager, not just an email.
If you cannot answer “what was our p95 latency over the last hour?” in under thirty seconds, you do not have observability. Buy a vendor tool (LangSmith, Helicone, Langfuse, Arize) or build a thin wrapper, but ship something.
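If you go the thin-wrapper route, the minimum viable version is one structured log record per model call. The field names, the retrieval_hits parameter, and the logging sink below are illustrative assumptions, not any vendor’s schema.

```python
# request_log.py -- minimal sketch of a logging wrapper around the model call;
# field names and the sink are assumptions, not a specific vendor's API.
import json
import logging
import time
import uuid

logger = logging.getLogger("llm.requests")

def logged_call(call_model, *, model, prompt_version, tenant_id, retrieval_hits=None, **kwargs):
    """Wrap a model call and emit latency, cost, and outcome as one log record."""
    request_id = str(uuid.uuid4())
    start = time.monotonic()
    response, outcome = None, "error"
    try:
        response = call_model(model=model, **kwargs)
        outcome = "ok"
        return response
    finally:
        logger.info(json.dumps({
            "request_id": request_id,
            "model": model,
            "prompt_version": prompt_version,
            "tenant_id": tenant_id,
            "latency_ms": round((time.monotonic() - start) * 1000),
            "tokens": getattr(getattr(response, "usage", None), "total_tokens", None),
            "retrieval_hits": retrieval_hits,
            "outcome": outcome,
        }))
```

The p50/p95 rollups and the pager alerts live in whatever consumes these records; the point is that every request emits one.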
3. Prompt and model versioning
Every prompt has a version. Every request log includes the prompt version. When a prompt change goes bad, rolling back is a one-command operation, not a redeploy.
The simplest version of this is a prompts/ directory in your git repo with one file per prompt and the file’s git SHA stamped into the request. Plenty of vendors offer fancier versioning UIs. The requirement is the same either way: when something goes wrong at 2am, the on-call has to know which prompt version was running and how to swap it.
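A minimal sketch of the prompts/ directory approach, assuming a plain-text prompt file committed to the repo and git available at runtime; the directory layout and field names are placeholders for your own setup.

```python
# prompt_loader.py -- sketch of the prompts/ directory approach; the layout
# and the version scheme are assumptions about your repo, not a standard.
import subprocess
from pathlib import Path

PROMPTS_DIR = Path("prompts")

def load_prompt(name: str) -> tuple[str, str]:
    """Return (prompt_text, version), where version is the last commit that touched the file."""
    path = PROMPTS_DIR / f"{name}.txt"
    version = subprocess.run(
        ["git", "log", "-n", "1", "--pretty=format:%h", "--", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return path.read_text(), version

# Stamp prompt_version into every request log (see the logging wrapper above)
# so rollback is a git revert plus a config flip, not a guessing game at 2am.
prompt_text, prompt_version = load_prompt("summarize")
```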
4. Cost cap with alerting
You have a daily ceiling and a per-tenant ceiling on token spend. Crossing either one pages the on-call. The thresholds are reviewed quarterly.
Without this, three things can happen and at least one of them will. A prompt-injection attack consumes 100,000 tokens in a single conversation. An agent loop fails to terminate and burns $400 of compute overnight. A new feature ships, 10x your normal volume hits the API for legitimate reasons, and no one notices until the invoice arrives.
The cap is not optional. The threshold can be high. Just have one.
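A minimal sketch of what the guard can look like, assuming example ceilings and an in-memory counter; in production the counters live in a shared store (Redis, a database) and page_oncall() is your real paging hook.

```python
# cost_guard.py -- sketch of daily and per-tenant spend ceilings.
# The in-memory counters, example limits, and page_oncall() are illustrative.
from collections import defaultdict
from datetime import date

DAILY_TOKEN_CEILING = 5_000_000        # example numbers -- set your own, review quarterly
PER_TENANT_TOKEN_CEILING = 250_000

_spend = defaultdict(int)              # keyed by (day, tenant_id) and (day, "*")

class CostCapExceeded(RuntimeError):
    pass

def charge_tokens(tenant_id: str, tokens: int, page_oncall=print) -> None:
    """Record spend and refuse (and page) once either ceiling is crossed."""
    today = date.today().isoformat()
    _spend[(today, tenant_id)] += tokens
    _spend[(today, "*")] += tokens
    if _spend[(today, tenant_id)] > PER_TENANT_TOKEN_CEILING:
        page_oncall(f"Tenant {tenant_id} crossed its daily token ceiling")
        raise CostCapExceeded(tenant_id)
    if _spend[(today, "*")] > DAILY_TOKEN_CEILING:
        page_oncall("Global daily token ceiling crossed")
        raise CostCapExceeded("global")
```

Call charge_tokens after every model response, before any retry loop gets another turn.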
5. Documented fallback path
When the model fails, the system has a tested, documented behavior. Common patterns: serve a cached response, escalate to a human, return a graceful error message that the user understands. Whatever you choose, it is tested in CI, not assumed.
The bar here is low: pull the network during a test and verify the system does the right thing. Test it once. Document the outcome. Move on.
The teams that skip this end up with a try/except block that returns “Sorry, something went wrong” and a customer who never comes back.
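A sketch of the pattern plus the CI test that “pulls the network”; call_model, cache, and the message text are placeholders for your own components, and the cached-response path assumes you have one.

```python
# fallback.py -- sketch of a documented fallback plus the CI test that proves it.
FALLBACK_MESSAGE = "We couldn't generate an answer right now. A teammate has been notified."

def answer(question: str, call_model, cache) -> str:
    try:
        return call_model(question)
    except Exception:
        cached = cache.get(question)
        return cached if cached is not None else FALLBACK_MESSAGE

# test_fallback.py -- run in CI on every deploy, not once on a laptop.
def test_answer_degrades_gracefully():
    def dead_model(_):
        raise ConnectionError("network unreachable")
    assert answer("any question", dead_model, cache={}) == FALLBACK_MESSAGE
```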
6. Audit trail of every model call
Every input, output, model version, prompt version, retrieval context, and downstream action is stored. For HIPAA-covered or finance-regulated work this is non-negotiable. For everyone else, it is the difference between debugging in 20 minutes and debugging in a week.
Storage is cheap. Retention policy can be 90 days for non-regulated systems. The cost of NOT having this is invisible until the day you need it.
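A sketch of what an audit record can look like, assuming an append-only JSONL sink; the field names and example values are illustrative, and a database table with a 90-day retention job works just as well.

```python
# audit.py -- sketch of an append-only audit record per model call.
# Field names, example values, and the JSONL sink are assumptions.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit/model_calls.jsonl")

@dataclass
class AuditRecord:
    request_id: str
    timestamp: str
    model_version: str
    prompt_version: str
    retrieval_context: list[str]
    input_text: str
    output_text: str
    downstream_action: str | None

def write_audit(record: AuditRecord) -> None:
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

write_audit(AuditRecord(
    request_id="req-123",
    timestamp=datetime.now(timezone.utc).isoformat(),
    model_version="example-model-v1",
    prompt_version="a1b2c3d",
    retrieval_context=["doc-17, chunk 4"],
    input_text="...",
    output_text="...",
    downstream_action=None,
))
```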
7. Runbook for the on-call engineer
There is a markdown page somewhere your engineers can find. It lists the top five failure modes, what each one looks like in the dashboard, and the first three steps to take for each. It was written by someone who knows the system. It was tested by someone who does not.
If your runbook does not exist, you do not have on-call. You have hope.
How to use this checklist
If you are about to ship an AI system, run through these seven items. Anything missing, add before launch. If launch is too soon to add them all, ship anyway, but add them in the first thirty days as planned tech debt. Track the items in your project plan. Do not let them slide past day 30.
If you are running a system that has been “in production” for a while and is missing several of these, you do not have a production system. You have a demo with a public URL. Add the missing pieces before the next thing breaks, not after.
Production AI is mostly engineering, not modeling. The checklist is short on purpose. Get it right and the rest takes care of itself.
Ready to scope something?
The first call is free. The quote is fixed. The team is senior.
Start a scoping call →