Most “AI Agents” in Production Are Demos With Extra Steps

Part 1 of a series on what actually goes into production agentic systems.


Most AI agents running in production right now aren’t production systems. They’re demos that happen to be internet-facing.

They have a model. They have some tools. They have a system prompt that grew to four thousand tokens because nobody wanted to delete anything. And they have a Slack channel full of “can you check why it did that?” messages that nobody can actually answer.

The teams running them aren’t dumb. They built exactly what the tutorials said to build. The tutorials are the problem. Tutorials teach you the agent. Production is about everything around the agent.

This article is about what production actually looks like — the shape of the system, not the cleverness of the prompt.

Let’s say you’re building an email-processing agent. A real one. It reads incoming email, extracts structured information, queries historical context, writes back to the CRM. Maybe routes to a human for anything high-stakes.

Five Planes, One Spine

Strip a production agentic application down to its bones and you get five horizontal planes stacked on top of each other, with one vertical spine running through all of them.

The Business Plane is where work enters and leaves the system. An inbox. A reviewer dashboard. A prompt console where someone who isn’t a developer tweaks behavior. The humans, and the systems the humans care about.

The Gateway is the security perimeter. Authentication. Authorization. PII redaction. Input guardrails that reject prompt injection attempts before the model ever sees them. Your agent never touches raw untrusted input. Ever. This is not optional. Email bodies, support tickets, retrieved documents — these are data, not instructions, and the gateway is where that boundary gets enforced.
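That data-versus-instructions boundary can be made concrete in code. Here is a minimal sketch of a gateway check, assuming hypothetical pattern lists and a simple wrapper convention; real guardrails would use far more robust detection than these illustrative regexes:

```python
import re

# Illustrative patterns only -- a real gateway uses dedicated
# injection classifiers and PII detectors, not two regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now (a|an) ", re.IGNORECASE),
]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def gateway_check(raw_text: str) -> str:
    """Reject likely injection attempts, redact PII, and wrap the
    result so downstream code treats it as data, not instructions."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(raw_text):
            raise ValueError("rejected: possible prompt injection")
    redacted = EMAIL_RE.sub("[EMAIL]", raw_text)
    # The wrapper makes the boundary explicit to everything downstream:
    # anything inside these tags is content to analyze, never to obey.
    return f"<untrusted_input>{redacted}</untrusted_input>"
```

The point is not the specific patterns; it is that sanitization happens in one place, before the runtime, rather than being sprinkled through prompts.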

The Runtime is what most people think of as “the agent.” The planner that decomposes a task, the executor that runs the tool-calling loop, the memory that tracks state across turns, the critic that checks proposed actions before they’re taken. This is the smallest part of the system and the part everyone obsesses over.
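The executor-plus-critic shape is simple enough to sketch. This is a toy version under stated assumptions (a planner has already produced a list of tool names; the critic is a plain predicate); real runtimes interleave model calls, but the control flow is the same:

```python
from typing import Callable

def run_step_loop(plan: list[str],
                  tools: dict[str, Callable[[str], str]],
                  critic: Callable[[str], bool],
                  max_steps: int = 10) -> list[str]:
    """Walk a planner's step list, let a critic veto each proposed
    tool call, and execute the rest. All names here are illustrative."""
    results: list[str] = []
    for tool_name in plan[:max_steps]:   # bound the loop: agents must terminate
        if tool_name not in tools:
            raise KeyError(f"unknown tool: {tool_name}")
        if not critic(tool_name):        # critic runs BEFORE the action
            results.append(f"vetoed:{tool_name}")
            continue
        results.append(tools[tool_name](tool_name))
    return results
```

Notice the two properties that matter in production: the loop is bounded, and the critic sees the proposed action before any side effect happens.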

The Tool Layer is the functions the agent can actually invoke. A retriever. A database query. A write to the CRM. An email-sender. Each one typed, each one permissioned, each one individually rate-limited. Tools are where least-privilege lives — not at the agent level, at the tool level. If your email-reader token can also send emails, you’ve already lost.
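"Typed, permissioned, rate-limited" can live in a single wrapper. The sketch below is one way to enforce least privilege at the tool level, with illustrative scope names and an in-memory rate limiter; a real system would back this with an external token service:

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    """A tool carries its own required scopes and its own rate limit,
    both enforced at call time. Scope names are illustrative."""
    name: str
    fn: Callable[..., object]
    scopes: frozenset          # permissions this specific tool requires
    calls_per_minute: int = 60
    _timestamps: list = field(default_factory=list)

    def call(self, granted: frozenset, *args, **kwargs):
        if not self.scopes <= granted:
            raise PermissionError(f"{self.name}: missing {self.scopes - granted}")
        now = time.monotonic()
        # Sliding one-minute window, enforced per tool, not per agent.
        self._timestamps = [t for t in self._timestamps if now - t < 60]
        if len(self._timestamps) >= self.calls_per_minute:
            raise RuntimeError(f"{self.name}: rate limit exceeded")
        self._timestamps.append(now)
        return self.fn(*args, **kwargs)
```

With this shape, an email-reader tool built with `scopes=frozenset({"email:read"})` physically cannot send, no matter what the model asks for.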

The Data Plane is everything the system reads from and writes to. Vector store. Document store. Warehouse. Audit log. Golden datasets for evaluation. This is where the long-term memory of your agent lives, and it’s where compliance will ask to look first.

Those five planes get you a demo. They do not get you a production system.

The thing that separates the two is the spine.

The Spine Is What You Can’t See Until It’s Missing

The spine runs vertically through every plane. It has four components:

Tracing captures every model call, every tool call, every retrieval hit, every intermediate reasoning step, every token count, every latency measurement — for every single request. The trace is the primary debugging artifact. When something goes wrong, you pull the trace, not a log file.
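The mechanics are mundane: every call gets wrapped, every span gets recorded under one request id. A minimal sketch, assuming an in-memory store (a real system would ship spans to a tracing backend such as an OpenTelemetry collector):

```python
import functools
import time

TRACES: dict = {}   # request_id -> list of spans; in-memory for illustration

def traced(request_id: str):
    """Decorator sketch: record each call's name, latency, and output
    size as a span under the given request id."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            TRACES.setdefault(request_id, []).append({
                "span": fn.__name__,
                "latency_s": time.monotonic() - start,
                "output_chars": len(str(result)),
            })
            return result
        return wrapper
    return decorator
```

Because every model call and tool call passes through the same wrapper, "pull the trace" means fetching one list keyed by request id, not grepping three log files.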

Online evaluation runs on live traffic, sampling real requests and scoring them asynchronously. LLM-as-judge on a percentage of outputs. Implicit signals — did the user accept the result? Did they edit it? Did the downstream system reject the write? Explicit signals — thumbs up, thumbs down, the correction a human made when they fixed what the agent got wrong.
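The two moving parts here are a sampler and a signal-to-score mapping. The sketch below uses illustrative weights, not a recommendation; the point is that implicit signals get turned into numbers you can chart:

```python
import random

def sample_for_eval(rate: float, rng: random.Random) -> bool:
    """Decide whether this request's output gets scored asynchronously."""
    return rng.random() < rate

def implicit_score(accepted: bool, edited: bool, downstream_ok: bool) -> float:
    """Map implicit user signals to a rough quality score.
    The weights here are illustrative."""
    if not downstream_ok:
        return 0.0          # the downstream system rejected the write
    if accepted and not edited:
        return 1.0          # used as-is
    if accepted:
        return 0.5          # used, but a human had to fix it first
    return 0.0              # rejected outright
```

LLM-as-judge scoring slots in the same way: on the sampled fraction, a second model grades the output asynchronously, and both score streams land in the same dashboard.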

Offline evaluation runs in CI, gating every prompt change and every model upgrade against a golden dataset. Nothing ships unless the numbers say it’s at least as good as what’s already running.
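The gate itself is a few lines; the hard part is maintaining the golden dataset. A sketch of the CI-side logic, with a hypothetical exact-match scorer standing in for whatever metric your task actually needs:

```python
from typing import Callable

def score_on_golden(predict: Callable[[str], str],
                    golden: list) -> float:
    """Fraction of golden (input, expected) pairs the candidate gets
    exactly right. Exact match is a stand-in for your real metric."""
    correct = sum(1 for x, expected in golden if predict(x) == expected)
    return correct / len(golden)

def ci_gate(candidate_score: float, baseline_score: float,
            tolerance: float = 0.0) -> bool:
    """Ship only if the candidate is at least as good as what's live,
    within an explicit, reviewed tolerance."""
    return candidate_score >= baseline_score - tolerance
```

The gate runs on every prompt change and every model upgrade; a failing gate blocks the merge the same way a failing unit test does.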

The prompt registry versions every prompt as a first-class artifact. Prompts are not string literals in code. They have semantic versions, owners, linked eval runs, and rollout configurations.
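"First-class artifact" means there is a data model behind it. A minimal registry sketch, assuming in-memory storage and the fields named above; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    """A registry entry: the prompt text plus the metadata that makes
    it an artifact rather than a string literal."""
    name: str
    version: str           # semantic version, e.g. "1.2.0"
    text: str
    owner: str
    eval_run_id: str = ""  # the eval run that approved this version

class PromptRegistry:
    def __init__(self):
        self._store: dict = {}   # (name, version) -> PromptVersion
        self._live: dict = {}    # name -> currently deployed version

    def register(self, prompt: PromptVersion) -> None:
        self._store[(prompt.name, prompt.version)] = prompt

    def promote(self, name: str, version: str) -> None:
        if (name, version) not in self._store:
            raise KeyError(f"{name}@{version} was never registered")
        self._live[name] = version

    def live(self, name: str) -> PromptVersion:
        return self._store[(name, self._live[name])]
```

With this in place, "which prompt version is live?" is a one-line query instead of an archaeology project through last sprint's PRs.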

With only the five planes, you can ship a demo. It will work in the first week. It will work in the second week. And then, somewhere around week six, something will shift. Accuracy will drop on Tuesday. A particular client will complain. A downstream system will start rejecting writes at a higher rate than it did last month.

Without the spine, here’s what you can actually answer about that drop:

Nothing. You can answer nothing about it.

You don’t know which prompt version is live, because prompts were strings in code and got edited in three different PRs last sprint. You don’t know what changed in the retrieval corpus, because indexing runs on a cron that nobody monitors. You don’t know if the model’s behavior shifted, because you have no baseline metrics to compare against. You don’t know if the drop is real or a sampling artifact, because you don’t sample.

You will end up in a meeting where someone asks “is it getting worse?” and the honest answer is “I don’t know.”

This is the conversation that separates demos from production systems. And the only way to avoid it is to build the spine before you need it.

Why the Spine Gets Skipped

The spine gets skipped for one reason: it doesn’t demo well.

A stakeholder watching a demo wants to see the agent do the thing. They want to see an email come in, an extraction happen, a record created in the CRM. They do not want to see the tracing dashboard. They do not care about the golden dataset. Nobody ever got promoted for shipping a prompt registry.

So teams under pressure ship the agent and defer the spine. The agent works. The demo goes well. Leadership is happy. Two quarters later the team is in permanent firefight mode, debugging in production, answering “why did it do that?” with hand-waves.

The senior move is to build the spine first. Not second. Not “once we have v1 working.” First.

Here’s a rough breakdown of where the work actually lives in a production-ready agentic system:

[Chart: effort distribution across components — the spine is bigger than the agent]

The agent is 15% of the work. The spine alone is 25%. And the spine is disproportionately responsible for whether the system survives contact with real users and real data.

What Seniors Get Right

The pattern I see across teams that ship agents successfully is that they invert the usual build order.

They build tracing before they build the agent. They define the golden dataset before they write the first prompt. They set up the prompt registry the week they start, even if it has two entries. They put the gateway in front of a stub agent that just echoes, so the security model is tested before there’s anything worth securing.

Then they build the agent.

This looks like slower progress for the first two weeks. It is not slower progress. It’s the only way the progress compounds. Every prompt change after week two gets validated automatically. Every production issue gets a trace. Every new team member can see what’s running and why.

Teams that don’t do this build faster for a month and then spend three months unwinding the mess.

The Part Nobody Mentions

The thing that makes this hard isn’t technical. It’s social.

Explaining to a non-technical stakeholder why week one is spent building observability instead of “the AI part” is a conversation most engineers lose. The ones who win it have usually already lost it once, on a previous project, and are not willing to lose it again.

If you are building an agent right now and you don’t have a trace dashboard, a golden dataset, and a versioned prompt, you are building a demo. That’s fine if demos are what you want. It is not fine if you expect the thing to still be working in six months.

Coming Up in This Series

This was the mental model. The rest of the series goes deep on each layer.

Next up: Workflow vs. Agent — when do you actually need an agent? Because most systems called “agents” right now shouldn’t be agents at all. They should be workflows with one LLM call in them, and they’d be cheaper, faster, and more reliable.


If this resonated and you’re building production AI systems, follow along. The series covers the 21 things I think senior AI engineers and architects need to reason about — from RAG pipelines and tool design through security, evaluation, cost, and the operational patterns that separate demos from systems you can actually run.
