Part 4 – Most Multi-Agent Systems Shouldn’t Be
Part 4 of a series on what actually goes into production agentic systems.
Part 1 – Most “AI Agents” in Production Are Demos With Extra Steps
Part 2 – Most “Agents” Are Workflows That Got Ideas
Part 3 – Most Agentic Projects Can’t Defend Their Own ROI
Open a recent AI architecture diagram and count the agents. You’ll usually find three or five, arranged around a “supervisor” or “router” with arrows connecting them. A researcher agent. A writer agent. A critic agent. A CRM agent. Each one has its own prompt, its own set of tools, and its own box on the slide. It looks modular. It looks like software engineering.
It usually isn’t. Most multi-agent systems shipped in the last eighteen months are a single agent that someone split into pieces for reasons that had more to do with whiteboards and team structure than with any technical argument. The cost shows up later, in context fragmentation, coordination overhead, and traces nobody can read.
A single agent with tools on the left. A multi-agent system on the right: the same tools, but each one now lives behind another agent, and a supervisor routes between them. Notice how much more context has to move.
One Context or Many
The defining question for this architecture choice is not “how many agents do you need?” It’s “how many contexts do you need?”
A context is the conversation state an agent reasons over: the goal, the history, the intermediate findings, the tool results. A single-agent system has one context. Everything the system has seen, tried, and concluded so far is in one place, and the model reasons over it every turn.
A multi-agent system has many contexts. The supervisor has one. Each sub-agent has its own. When the supervisor hands work to a sub-agent, you now have a choice. Pass the full context, and the token bill compounds, because the same history is paid for by every agent that touches it. Pass a summary, and you lose fidelity, and the sub-agent reasons over an impoverished view of the problem.
Multi-agent systems are not modular in the way microservices are modular. Microservices compose because their contracts are narrow and their state is explicit. Agents compose poorly because their contracts are prompts and their state is conversational. Every agent boundary is a place where context has to be serialized, summarized, or duplicated, and each of those options has a real cost.
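The choice at each boundary can be made concrete. A toy sketch, where `Context`, `handoff`, and the word-count `tokens` function are all invented stand-ins (a real system would use the model's tokenizer and message format):

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Conversation state an agent reasons over."""
    goal: str
    history: list[str]  # messages, tool results, intermediate findings

def tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per word.
    return len(text.split())

def handoff(ctx: Context, mode: str) -> Context:
    """Build the context a sub-agent receives at the boundary."""
    if mode == "full":
        # Full fidelity: the sub-agent re-pays for the entire history.
        return Context(ctx.goal, list(ctx.history))
    # Summary: cheap, but the sub-agent loses detail it may need.
    return Context(ctx.goal, [f"summary of {len(ctx.history)} prior steps"])

ctx = Context("reply to customer", ["email body " * 200, "CRM lookup " * 100])
full = handoff(ctx, "full")
brief = handoff(ctx, "summary")

# Every agent that receives the full context pays its token cost again;
# the summary is cheap but has discarded the raw findings.
cost_full = sum(tokens(m) for m in full.history)
cost_brief = sum(tokens(m) for m in brief.history)
```

There is no third option in that function, and there is no third option at a real agent boundary either: every hand-off picks a point on the duplication-versus-fidelity line.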
Let’s Say You’re Building an Email Agent
Take the email-processing system from the last three articles. A senior engineer new to the project will often draw it as a multi-agent system before anyone stops them. A Triage Agent decides what kind of email it is. A Research Agent pulls history from the CRM and the knowledge base. A Draft Agent writes the reply. A Critic Agent reviews the draft. A Supervisor orchestrates all four. The diagram looks clean. It looks professional. It survives the design review.
Now watch what happens at runtime. An email comes in. The supervisor passes it to Triage, which classifies it and hands back a label. The supervisor now has the label. It passes the email plus the label to Research, which pulls five records and hands back a summary. The supervisor now has the label, the email, and the summary, and passes all three to Draft. Draft writes a reply. The supervisor hands the reply, plus the label, plus the summary, plus the original email to Critic. Critic returns notes. The supervisor maybe loops once. Maybe twice.
Count the LLM calls. You now have at least five, possibly eight. Count the tokens. The supervisor is holding the accumulating conversation. Each sub-agent is receiving a hydrated context that includes the original email, prior findings, and its specific instructions. The same email body is tokenized and re-tokenized four or five times. The supervisor’s context grows with every hop.
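A back-of-the-envelope version of that accounting. All the sizes are invented round numbers; the shape of the bill matters, not the values:

```python
# Invented payload sizes, in tokens.
EMAIL, LABEL, RESEARCH, DRAFT = 500, 10, 400, 300
INSTRUCTIONS = 200  # per-agent system prompt

# Input tokens each sub-agent receives: its own prompt plus everything
# the supervisor re-sends at that hop.
sub_agent_inputs = [
    INSTRUCTIONS + EMAIL,                             # triage
    INSTRUCTIONS + EMAIL + LABEL,                     # research
    INSTRUCTIONS + EMAIL + LABEL + RESEARCH,          # draft
    INSTRUCTIONS + EMAIL + LABEL + RESEARCH + DRAFT,  # critic
]

# The supervisor also re-reads its accumulating state before each
# routing decision, so (in this toy model) it pays the same bill again.
supervisor_inputs = list(sub_agent_inputs)

multi_agent_input = sum(sub_agent_inputs) + sum(supervisor_inputs)

# A single agent pays the growing context once per turn, with one
# prompt and no routing turns: the sub-agent column alone.
single_agent_input = sum(sub_agent_inputs)
```

Under these assumptions the multi-agent version pays for its inputs twice over, before counting retries, re-parses, or critic loops.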
A single agent with the same four tools (classify, research, draft, critic-review) does the same work with one context and usually fewer total calls. The model reasons over the full history every turn. No hand-offs. No re-hydration. No summaries that lose fidelity. The “agents” in the multi-agent version were never really specialists. They were prompts. Prompts compose inside one model context at a fraction of the cost.
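A minimal sketch of that single-agent loop. `call_model` is a placeholder for your LLM client, not any real API, and the tool bodies are stubs:

```python
# One context, four tools, no hand-offs.
def classify(email): ...
def research(email): ...
def draft(email, findings): ...
def review(reply): ...

TOOLS = {"classify": classify, "research": research,
         "draft": draft, "review": review}

def run(email: str, call_model) -> str:
    messages = [{"role": "user", "content": email}]  # the one context
    while True:
        # call_model is assumed to return either a tool call,
        # e.g. {"type": "tool", "tool": "classify", "args": {...}},
        # or a final answer, {"type": "final", "content": "..."}.
        action = call_model(messages)
        if action["type"] == "final":
            return action["content"]
        result = TOOLS[action["tool"]](**action["args"])
        # Tool results accumulate in the same context; every later
        # decision sees the full history, no summarization needed.
        messages.append({"role": "tool", "name": action["tool"],
                         "content": str(result)})
```

The "specialists" from the multi-agent version survive intact, just demoted to the `TOOLS` table, where they belong.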
The Coordination Tax
The hidden cost of multi-agent isn’t tokens. It’s coordination.
Every agent boundary is a place where something can go wrong that single-agent systems don’t have to worry about. The supervisor can misroute. Two sub-agents can reach contradictory conclusions and nobody notices until the last step. A sub-agent can return a response in a format the supervisor can’t parse, and the recovery is another LLM call. The critic can decide the draft is fine when it isn’t, because the critic is looking at the draft without the full trail of assumptions that produced it.
Debugging compounds these problems. A trace of a single-agent run is a single conversation: goal, tool calls, intermediate reasoning, final result. A trace of a five-agent run is five interleaved conversations with hand-off boundaries between them. You’re not reading a story; you’re reading a system of stories that share half their state.
Reliability compounds in the wrong direction. If each agent succeeds 95% of the time independently, a five-agent pipeline succeeds about 77% of the time, assuming clean hand-offs. It doesn’t have clean hand-offs. Real production numbers come in lower.
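The arithmetic behind that figure: independent per-stage success rates multiply across a pipeline.

```python
def pipeline_success(per_agent: float, agents: int) -> float:
    # Assumes each agent succeeds or fails independently and that
    # every hand-off between them is lossless -- the optimistic case.
    return per_agent ** agents

p = pipeline_success(0.95, 5)  # 0.95 ** 5, about 0.77
```

Fold in even a 98%-reliable hand-off between stages and the product drops further, which is why the observed numbers undershoot the naive calculation.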
Latency compounds too. Sub-agent calls are usually sequential because the supervisor needs each result before deciding the next hop. A single-agent loop pays one round-trip per iteration; a multi-agent run pays one round-trip per sub-agent invocation, plus supervisor turns in between. The same work that took four seconds in one agent takes fifteen across five.
There is a class of multi-agent failure that is unique to this architecture: agents that argue.
Put a draft agent and a critic agent in a loop and they will disagree. The loop runs until one agent capitulates, which usually means until the token budget runs out. You didn’t build a system. You built a committee.
When Multi-Agent Is Actually Earned
There are cases where splitting into multiple agents is the right move. They share properties a single agent cannot replicate.
Genuine parallelism is the first. If the task decomposes into independent sub-tasks whose results don’t influence each other, multiple agents can run in parallel and you win on wall-clock latency. “Research these five companies and return a summary of each” is parallelizable. “Draft a reply based on prior research” is not.
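The parallel case is the one place where fan-out pays for itself in wall-clock time. A sketch with `asyncio`, where `research_one` is a stand-in for one sub-agent invocation:

```python
import asyncio

async def research_one(company: str) -> str:
    # Stand-in for a sub-agent call; pretend this is an LLM round-trip.
    await asyncio.sleep(0.1)
    return f"summary of {company}"

async def research_all(companies: list[str]) -> list[str]:
    # All five run concurrently: wall-clock cost is roughly one
    # round-trip, not one per company.
    return await asyncio.gather(*(research_one(c) for c in companies))

summaries = asyncio.run(research_all(["a", "b", "c", "d", "e"]))
```

Note what makes this legal: no call reads another call's result. The moment one sub-task's output feeds the next, you are back to a sequential chain wearing a parallel costume.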
Different trust boundaries is the second. If one part of the task needs access to customer data and another doesn’t, splitting the agents lets you scope tools and context appropriately. The research agent never sees the draft. The draft agent never sees the raw CRM payload. That’s a security argument, not a modularity argument, and it’s a real one.
Different models is the third. A cheap model can triage. An expensive model can draft. A fine-tuned model can judge. If the economics only work when you route by capability, you need agent boundaries because you need model boundaries, and the easiest way to switch models cleanly is at an agent seam.
Context compaction is the fourth, and the subtlest. On very long-horizon tasks, a single context eventually blows past the window or degrades in quality. Multi-agent lets you compact between hops: the sub-agent does its work, returns a compressed result, and its working context is discarded. The supervisor sees the summary, not the full trail. This is the one case where multi-agent actually wins on tokens, and it’s the reason the pattern exists at all for sufficiently large tasks.
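A sketch of that compaction pattern, with hypothetical `run_subagent` and `summarize` stand-ins in place of real agent and summarization calls:

```python
def run_subagent(task: str) -> list[str]:
    # Stand-in: does many tool calls, so its working trail grows large.
    return [f"step {i} detail" for i in range(50)]

def summarize(trail: list[str]) -> str:
    # Stand-in for a compression step over the sub-agent's trail.
    return f"{len(trail)} steps, key finding: ..."

supervisor_context: list[str] = []
for task in ["map the codebase", "find call sites", "propose patch"]:
    trail = run_subagent(task)                   # large working context
    supervisor_context.append(summarize(trail))  # only this survives
    del trail  # the full trail is discarded, never re-sent upstream

# The supervisor's context stays bounded at one summary per hop,
# no matter how long each sub-agent's trail was.
```

This is the same summarization loss described earlier, but deployed deliberately: on tasks too long for one window, paying fidelity for boundedness is the correct trade.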
Notice what doesn’t qualify. “Roles” don’t qualify. Naming one agent “Drafter” and another “Critic” is not a technical argument. Neither is “modularity,” in most cases. A well-structured single agent with well-named tools is already modular. You just drew the modules as functions instead of as agents.
The Part Nobody Mentions
Multi-agent diagrams map suspiciously well to team org charts.
The team has a retrieval specialist, a prompt engineer, a safety engineer, and an ML lead. The system has a retrieval agent, a drafting agent, a critic agent, and a supervisor. Each team member owns their agent. Each agent has its own prompt file, its own eval set, and its own promotion path. The boundaries in the architecture exist so the boundaries in the org chart can exist.
This is a variant of Conway’s Law, and it’s doing real damage. Teams ship multi-agent systems because it’s the architecture that lets four people each own a clear thing, not because four agents is the right answer. The single-agent version would require one person owning the whole prompt, which is politically harder.
The senior move here is uncomfortable. It is to say, in a design review, “we don’t need four agents, we need one agent with four tools, and we need one person owning the behavior.” That sentence costs the other three people their clean ownership boundary. Nobody wants to say it. A lot of multi-agent systems exist because nobody did.
The Committee Test
The test I keep coming back to is simple. If you removed the agent boundaries and put all the prompts into one agent with all the tools, would the system work? If the answer is yes, and the prompts are short enough to fit in one model’s context, you don’t have a multi-agent system. You have a committee you accidentally built.
A single well-designed agent with well-designed tools beats a multi-agent system on almost every metric that matters. Fewer calls. Lower token cost. Lower latency. Cleaner traces. Fewer failure modes. One place to change behavior.
Save multi-agent for the cases that demand it: genuine parallelism, hard trust boundaries, model routing, or long-horizon context compaction. Everything else is a single agent you haven’t finished writing.
Coming Up in This Series
Next up: ReAct vs. Plan-and-Execute vs. Graph orchestration. Once you’ve decided whether it’s one agent or many, the next question is how the agent (or the supervisor) actually runs the loop. ReAct feels like the default. Plan-and-execute claims to be smarter. Graph-based orchestration says both are wrong. The answer isn’t a framework choice; it’s a judgment about how much structure the problem will tolerate.
If this resonated and you’re building production AI systems, follow along. The series covers the 21 things I think senior AI engineers and architects need to reason about: RAG pipelines, tool design, security, evaluation, cost, and the operational patterns that separate demos from systems you can actually run.
