Part 3 – Most Agentic Projects Can’t Defend Their Own ROI
Part 3 of a series on what actually goes into production agentic systems.
Part 1 – Most “AI Agents” in Production Are Demos With Extra Steps
Part 2 – Most “Agents” Are Workflows That Got Ideas
Ask a team that just shipped an agentic feature what it saves the business, and watch the sentence get longer. “Well, it reduces the time analysts spend on…” “It lets us scale…” “It unlocks…” By the fourth clause you know the answer. Nobody did the math.
This is the conversation that kills most AI projects in their second year. The first year is vibes and demos. The second year is someone in finance asking what the line item is actually buying. Teams that can’t answer lose the budget. Teams that can’t even structure the question lose the headcount.
The strange thing is how consistently the question gets skipped. A team will spend three months evaluating vector databases and zero minutes asking whether the problem even needed an LLM. By the time a senior leader asks “what’s the ROI on this?”, the only honest answer is “we never framed it that way.”
Six shapes, ascending in cost and operational weight. The senior move is to pick the lowest rung that actually solves the problem.
The Cheapest Shape That Could Work
Before the workflow-vs-agent question from the last article, there’s an earlier question most teams skip. Should there be an LLM in the loop at all?
A feature can take at least six shapes, and they compound in cost and operational weight as you climb. A well-designed form captures structured input with zero inference cost and no failure modes beyond validation. A rules engine encodes explicit business logic that any analyst can read. Classical ML handles pattern recognition on structured data, with training cost up front and low inference cost forever after. A single LLM call handles unstructured text transformations with one API round-trip per request. A workflow chains a handful of those calls together with code holding the orchestration. An agent hands the orchestration itself to the model.
Every step up that ladder multiplies the cost, the latency, and the blast radius of a failure. The senior move is to pick the lowest rung that actually solves the problem, and to defend that choice against the gravity pulling every conversation toward the top rung.
The gravity is real. “We built a form” doesn’t make the investor deck. “We built an agent” does. That is why so many agent projects exist: the economics were never the point.
Let’s Say You’re Building an Email Agent
Take the same email-processing agent from the last two articles. Before writing the first line of code, try to sketch the ROI.
Start with what exists today. An analyst handles 200 emails a day, averaging four minutes each. That’s 13 hours of analyst time per day, or roughly two full-time headcount. At a loaded cost of $80k per person, that’s $160k a year in pure processing time, setting aside everything else the analysts do.
Now sketch the agent side. Assume the agent handles 70% of emails end-to-end and routes 30% to a human. Each autonomously handled email costs 12 cents in tokens and model inference; each routed one costs 4 cents, because the agent stops early. At those rates, 200 emails a day is roughly $20 a day, or $7,000 a year in runtime. Add infrastructure, tracing, and evaluation tooling, and call it $15,000 a year. On the human side, you’ve cut the 200 emails down to 60, leaving roughly a third of the analyst time and saving about $105k a year in labor.
Net: ~$90k a year saved, against an annual run cost of ~$15k, plus a build cost of maybe four engineer-months. On paper, a good investment.
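The sketch above is just arithmetic, so it’s worth writing it down somewhere a finance reviewer can poke at it. A minimal version in Python, using the article’s illustrative figures; every constant here is an assumption, not measured data:

```python
# Back-of-envelope ROI sketch for the email agent.
# All constants are illustrative assumptions from the article, not measurements.

EMAILS_PER_DAY = 200
MINUTES_PER_EMAIL = 4
AUTONOMY_RATE = 0.70        # fraction the agent handles end-to-end
COST_AUTONOMOUS = 0.12      # USD per email: tokens + inference
COST_ROUTED = 0.04          # USD per email that stops early
INFRA_PER_YEAR = 15_000     # tracing, evaluation tooling, hosting
BASELINE_LABOR = 160_000    # ~2 FTE at an $80k loaded cost

# Baseline: analyst hours consumed per day.
analyst_hours_per_day = EMAILS_PER_DAY * MINUTES_PER_EMAIL / 60

# Agent runtime: blended cost per email, scaled to a year.
daily_runtime = EMAILS_PER_DAY * (
    AUTONOMY_RATE * COST_AUTONOMOUS + (1 - AUTONOMY_RATE) * COST_ROUTED
)
annual_runtime = daily_runtime * 365

# Labor after the agent: analysts keep roughly a third of the work.
labor_saved = BASELINE_LABOR * (2 / 3)

net_annual = labor_saved - INFRA_PER_YEAR
print(f"analyst hours/day:  {analyst_hours_per_day:.1f}")
print(f"agent runtime/year: ${annual_runtime:,.0f}")
print(f"labor saved/year:   ${labor_saved:,.0f}")
print(f"net/year:           ${net_annual:,.0f}")
```

Change any one constant and the net moves. Keeping the sketch executable is what makes the next conversation, defending the numbers, possible at all.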
But sketching the ROI is not the same as defending it. Most of those numbers were optimistic. The token cost was estimated from a happy-path prompt, not the three-retry worst case. The 70% autonomy rate came from a POC on a curated test set, not live traffic. The analyst time saved assumed the analyst could immediately do higher-value work with the freed hours, which they usually can’t for the first six months. Swap those for honest numbers, and the save often shrinks from $90k to $30k. Sometimes it flips negative.
The point isn’t that the project shouldn’t ship. It’s that the real ROI conversation starts when you write these numbers down. Teams that skip this step find themselves two years in, being asked to justify the spend, with no reference point to argue from.
The POC-to-Production Cliff
The second reason most ROI math falls apart is that POC economics and production economics are not the same system.
A POC handles the 50 clean test cases the team curated. It runs at low volume, so cost doesn’t matter. It runs without guardrails, so latency is whatever the model wants. It doesn’t log much, because nothing’s scaled yet. Under those conditions, an agent looks cheap, fast, and accurate.
Production is a different animal. Volume is fifty times higher, so the token bill becomes real. Inputs are not clean; 15% of emails are forwarded threads, auto-replies, or multi-language, and each one adds retries. The guardrails you didn’t have in the POC add 300ms of latency and a second model call for content safety. The tracing you added to debug production adds another 100ms and another few cents. The golden dataset you built for evaluation doubled your build cost. The p95 latency, which was 2 seconds in the POC, is 9 seconds in production, and now the UX team is complaining.
Cost per successful transaction in a POC is almost always 2x to 4x off from cost per successful transaction in production. Latency is 3x to 5x off. Failure rates, once you count real edge cases, can be 5x higher. A team that budgets off POC numbers is building a financial case that will come apart the day it hits real traffic.
The senior move is to build the ROI case with production-honest numbers from the start, even if you have to estimate. Assume the token cost will be 3x the POC. Assume 20% of cases will fail and need retries or human routing. Assume the first six months of analyst time “saved” will not actually be redirected to revenue-generating work, because reorganizations take that long. If the ROI still works under those assumptions, the project is real. If it only works under POC-clean conditions, it isn’t.
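Those production-honest assumptions fold into the same sketch. Here is a hedged version with the 3x token multiplier, 20% retry-and-routing rate, and six-month ramp treated as explicit, adjustable constants; the specific values are the article’s illustrative ones, not benchmarks:

```python
# Stress-test the happy-path ROI under production-honest assumptions.
# All figures are illustrative, carried over from the optimistic sketch.

poc_runtime = 7_000         # USD/year, happy-path estimate
infra = 15_000              # USD/year
labor_saved_full = 105_000  # USD/year, optimistic labor savings

TOKEN_MULTIPLIER = 3.0      # assume tokens cost 3x the POC estimate
FAILURE_RATE = 0.20         # retries and human routing add ~20% volume
RAMP_MONTHS = 6             # freed analyst time not redirected yet

# Runtime grows with both the token multiplier and the retry overhead.
runtime = poc_runtime * TOKEN_MULTIPLIER * (1 + FAILURE_RATE)

# Only the post-ramp fraction of labor savings counts in year one.
labor_saved_year_one = labor_saved_full * (12 - RAMP_MONTHS) / 12

net_year_one = labor_saved_year_one - runtime - infra
print(f"honest runtime/year: ${runtime:,.0f}")
print(f"labor saved, year 1: ${labor_saved_year_one:,.0f}")
print(f"net, year 1:         ${net_year_one:,.0f}")
```

Under these particular assumptions the first-year net shrinks to a fraction of the happy-path figure; push any constant a little further and it goes negative, which is exactly the scenario the article warns about.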
Internal vs. Customer-Facing Flips Everything
The same agent, built twice, has completely different ROI math depending on who’s on the other end of it.
An internal tool used by trained analysts can tolerate a 15% failure rate, because the analyst catches the failure, corrects it, and moves on. The agent is a productivity multiplier, not an oracle. The ROI calculation is clean: time saved times loaded hourly rate, minus runtime cost. Failure doesn’t kill the business case; it just reduces the multiplier.
The same agent pointed at customers behaves differently. A 15% failure rate is now a brand problem, a churn problem, and sometimes a legal problem. To hit a failure rate customers will actually tolerate (usually under 2% for anything high-stakes), you need guardrails, human review, fallback logic, and a far more expensive evaluation pipeline. All of that compounds the cost. And the failures you do ship are more expensive per occurrence: a bad internal extraction costs someone five minutes; a bad customer-facing response costs a ticket, a refund, or a public tweet.
Two different projects. Two different ROI shapes. The common mistake is to prove ROI on the internal version and then assume the customer-facing version has the same math. It doesn’t. The customer-facing version is typically 3x to 10x more expensive to build to production quality, and the per-failure cost is 20x to 100x higher. Those numbers flip the shape of what’s worth building.
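The per-failure asymmetry is simple enough to put in numbers. A sketch under stated assumptions: an internal failure costs about five analyst-minutes at the loaded rate, and a customer-facing failure is assumed at $100 per incident (a ticket or partial refund), a hypothetical value that lands at the low end of the 20x-to-100x range:

```python
# Compare the cost of a single failure, internal vs customer-facing.
# Both per-failure costs are illustrative assumptions.

LOADED_SALARY = 80_000
WORK_MINUTES_PER_YEAR = 250 * 8 * 60          # 250 days, 8-hour days
analyst_rate_per_min = LOADED_SALARY / WORK_MINUTES_PER_YEAR

# Internal: a bad extraction costs an analyst ~5 minutes to catch and fix.
internal_cost_per_failure = 5 * analyst_rate_per_min

# Customer-facing: assumed ticket/refund cost per shipped failure.
customer_cost_per_failure = 100.0

ratio = customer_cost_per_failure / internal_cost_per_failure
print(f"internal failure:  ${internal_cost_per_failure:.2f}")
print(f"customer failure:  ${customer_cost_per_failure:.2f}")
print(f"cost ratio:        {ratio:.0f}x")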
A lot of teams should ship the internal version, ship it well, and leave the customer-facing version for a year later when the unit economics improve. Very few actually do.
The Part Nobody Mentions
The hardest part of the ROI conversation is not the math. It’s the political cost of running the math honestly.
An engineer who sketches an ROI and concludes “this shouldn’t be an agent, it should be a form” has just argued themselves out of a high-visibility project. An engineering lead who concludes the customer-facing version doesn’t pencil has just deflated a roadmap item leadership was already excited about. The honest answer is usually the career-unfriendly answer, which is why it so rarely gets voiced.
What senior practitioners do instead is reframe the conversation around what would make the project worth doing, not whether it’s worth doing right now. “At current token prices, the internal version pays back in nine months. The customer-facing version never does at current prices. Here’s what would have to change for the customer version to pencil: a 40% drop in inference cost, or a 10x increase in volume, or a willingness to accept a specific failure mode.” That reframe preserves optionality without pretending the math works when it doesn’t.
The Question That Ends Projects
“What does this save us, and how do you know?” is the question that ends AI projects that shouldn’t have started. The teams that survive it are the ones that answered it before the question got asked.
Agentic is not the default answer. It’s one of six shapes, and usually not the cheapest one that works. Before the architecture conversation, before the framework conversation, before the vendor conversation, run the ROI sketch with production-honest numbers. If it pencils, build. If it doesn’t, the most senior move in the room is to say so.
Coming Up in This Series
Next up: Single-agent vs. multi-agent architectures. Once you’ve decided the problem genuinely needs an agent, the next question is how many. Multi-agent sounds modular and clean on a whiteboard. It usually isn’t, once you’re watching agents argue with each other in a trace viewer at 3am.
If this resonated and you’re building production AI systems, follow along. The series covers the 21 things I think senior AI engineers and architects need to reason about: RAG pipelines, tool design, security, evaluation, cost, and the operational patterns that separate demos from systems you can actually run.
