Part 6 – Most Production RAG Is Quietly Wrong
Part 6 of a series on what actually goes into production agentic systems.
Part 1 – Most “AI Agents” in Production Are Demos With Extra Steps
Part 2 – Most “Agents” Are Workflows That Got Ideas
Part 3 – Most Agentic Projects Can’t Defend Their Own ROI
Part 4 – Most Multi-Agent Systems Shouldn’t Be
Part 5 – ReAct Is Where You Start, Not Where You Ship
Most RAG systems in production look fine and are not fine. They embed a query. They hit a vector index. They return the top five chunks. Someone stuffs them into a prompt. The answer that comes back is coherent, sounds right, and is sometimes wrong in ways nobody notices, because nobody is checking.
When the team finally admits quality is off, the reflex is to swap the model. Sonnet to Opus. Opus to something newer. The quality needle does not move, because the problem was not the model. The right passage was not in the context.
Caption: the three-stage production RAG pipeline
The production RAG pipeline is three stages, not one. Candidate generation, re-ranking, and context assembly each have different failure modes. Most teams build stage one and call it a system.
The Three-Stage Pipeline
Demo RAG is one stage: embed the query, fetch the top-k, hand it to the model. That pipeline works in the demo because the corpus is small, the queries are friendly, and nobody is grading the retrieval.
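In code, the demo version is barely a pipeline at all. A minimal sketch, where embed, vector_index, build_prompt, and llm stand in for whatever your stack provides:

# demo RAG: one stage, no re-ranking, no assembly logic
chunks = vector_index.search(embed(query), k=5)
answer = llm.complete(build_prompt(query, chunks))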
Production RAG is three stages.
Candidate generation casts a wide net: pull the top fifty or a hundred plausibly relevant chunks, using more than one retrieval method.
Re-ranking applies a slower, smarter model to that set and picks the best five.
Context assembly decides what actually goes into the prompt: which chunks, in what order, with what metadata, under what token budget. Each stage has a job.
Candidate generation is about recall: did the right chunk even make it into the pool?
Re-ranking is about precision: of the plausible chunks, which ones actually answer the query?
Assembly is about fit: can the model use what you gave it, given the other things competing for context?
When RAG quality is bad, it is almost always a recall problem dressed up as a model problem. The right chunk was never retrieved, so no amount of model upgrade, prompt engineering, or temperature tuning will bring it back.
Single-Vector Search Is Not Retrieval
Most production RAG systems use one embedding model and cosine similarity. This works until a query contains anything a dense vector cannot represent cleanly: exact identifiers, acronyms, negation, rare terms, codes.
Let’s say you’re building an email-processing agent for a B2B support function. A customer writes in: “refund not processed for order A-7821, second request.” Dense retrieval on this query will do a perfectly good job of finding passages semantically similar to refund complaints. It will not reliably find the passage that contains the string A-7821. The embedding model was never trained to preserve that identifier; it just compresses it into a vector that clusters near “refund email from customer.”
The fix is not a better embedding model. The fix is hybrid retrieval. Run the query through BM25 or a similar lexical index in parallel with the dense vector search. Merge the result sets with a score-fusion method like reciprocal rank fusion (RRF). Add structured filters on top: tenant ID, date range, document type. Dense search finds semantic neighbors. Lexical search finds exact matches. Filters eliminate the rest. None of them is sufficient alone.
# hybrid retrieval, merged via reciprocal rank fusion
from collections import defaultdict

candidates_dense = vector_index.search(embed(query), k=50, filters=tenant_scope)
candidates_lexical = bm25_index.search(query, k=50, filters=tenant_scope)

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc.id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

candidates = rrf([candidates_dense, candidates_lexical])[:50]  # ranked (doc.id, score) pairs

Caption: Single-vector vs. hybrid retrieval
The teams that skip hybrid usually do so because one of two things happened. Either they adopted a managed vector database that only did dense retrieval and did not want to run a second system. Or they read a benchmark showing dense beating BM25 on a clean academic dataset and generalized it to messy real-world queries.
In production, nearly every query has at least one exact-match anchor: a name, an ID, a product code, an error string. Losing those anchors is how RAG quietly stops working.
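One cheap way to protect those anchors is to extract them from the query explicitly and hand them to the lexical side as hard constraints. A sketch, assuming identifiers follow known formats; must_match is a stand-in for whatever exact-term constraint your lexical index supports:

# pull exact-match anchors out of the query and force them into the lexical leg
import re

ANCHOR_PATTERNS = [
    r"\b[A-Z]-\d{3,}\b",       # order IDs like A-7821 (assumed format)
    r"\b[A-Z]{3,}-?\d{2,}\b",  # product and error codes (assumed format)
]

def extract_anchors(query: str) -> list[str]:
    return [m for pattern in ANCHOR_PATTERNS for m in re.findall(pattern, query)]

anchors = extract_anchors("refund not processed for order A-7821, second request")
candidates_lexical = bm25_index.search(query, k=50, must_match=anchors, filters=tenant_scope)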
Chunking Is a Modeling Decision
The other place RAG silently degrades is chunking. Teams treat it as preprocessing: pick a chunk size, pick an overlap, run the script, move on. It is not preprocessing. It is a modeling decision that determines what the retrieval layer is even capable of returning.
Fixed-size chunking with no structural awareness destroys meaning at boundaries. A 512-token chunk that splits a support policy halfway through a conditional sentence produces two chunks, neither of which can answer a question about the condition. A table chopped in half is neither table nor prose. A code block split across chunks is line noise.
Going back to the email agent: suppose a passage of a support contract reads, “Refunds are available within 30 days of purchase, except for enterprise accounts, for which the window extends to 90 days.” If the chunker splits that sentence at the comma, the first chunk says refunds are available within 30 days. Full stop. Retrieve that chunk for an enterprise customer, stuff it into the prompt, and the model will confidently state the wrong policy. The model did not hallucinate. You gave it a truncated source and it trusted you.
The baseline that works is structural chunking with semantic overlap. Respect document structure: headers, sections, paragraphs, list items, tables, code blocks. Chunk boundaries follow the document’s own joints, not a fixed token count. Add overlap at boundaries so information near a split appears in both neighbors. For long structured elements (a long table, a long code block), keep them whole, or chunk them with metadata that says “this is part 2 of 3 of table X.”
# chunk on document joints, not token counts
def chunk(doc):
    for section in doc.sections:
        for block in section.blocks:  # paragraph, list, table, code
            if block.kind in {"table", "code"} and block.tokens <= MAX:
                # small structured elements stay whole
                yield Chunk(block, meta={"section": section.title, "kind": block.kind})
            elif block.kind in {"table", "code"}:
                # oversized table/code: label the pieces as parts of one element
                # (split_structured is a stand-in for a row/line-aware splitter)
                parts = split_structured(block, max_tokens=MAX)
                for i, part in enumerate(parts, start=1):
                    yield Chunk(part, meta={"section": section.title, "kind": block.kind,
                                            "part": f"{i} of {len(parts)}"})
            else:
                for span in split_on_sentences(block, target=TARGET, overlap=OVERLAP):
                    yield Chunk(span, meta={"section": section.title})

The chunks are not the text. The chunks are your unit of retrieval. Choose the unit to match what you want to retrieve.

Caption: Chunking strategies compared
The Cheapest Quality Win You Aren’t Using
Re-ranking is the stage most teams skip, and it is almost always the highest-leverage change you can make to an underperforming RAG system.
The logic is simple. First-stage retrieval (dense, lexical, hybrid) is optimized for speed over a large corpus. It returns a ranked list, but the ranking is noisy. The top chunk is often not the most relevant. The most relevant chunk is frequently somewhere in positions 5 to 30. Take the top fifty candidates, pass them through a cross-encoder re-ranker, and let a slower, more discriminating model pick the real top five.
# first stage casts a wide net; re-ranker picks the real top 5
candidates = hybrid_retrieve(query, k=50)    # recall stage
pairs = [(query, c.text) for c in candidates]
scores = cross_encoder.score(pairs)          # precision stage
top5 = [c for c, _ in sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]]
context = assemble(top5, token_budget=6000)  # fit stage

Caption: Re-ranker promotes the buried chunk
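The assemble call is doing real work of its own. A minimal sketch of the fit stage, assuming each chunk carries the section metadata from the chunker and count_tokens is whatever tokenizer you already use:

# context assembly: order, label, and trim to the token budget
def assemble(chunks, token_budget):
    parts, used = [], 0
    for c in chunks:  # already best-first from the re-ranker
        header = f"[{c.meta.get('section', 'source')}]"
        cost = count_tokens(header) + count_tokens(c.text)
        if used + cost > token_budget:
            break  # drop the tail rather than truncate a chunk mid-sentence
        parts.append(f"{header}\n{c.text}")
        used += cost
    return "\n\n".join(parts)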
Cross-encoders are not expensive in the way people assume. They run on fifty chunks per query, not the whole corpus. The added latency is measurable and small, usually under 200 milliseconds for modern rerankers. Cost is measurable and small. The quality gain is not small. On most real-world corpora, adding a reranker moves top-5 precision from “usable” to “reliable.”
The reason teams do not add one is that the first-stage system looked like it was already working. Top-5 returned plausible chunks. The answers were coherent. Nobody measured whether the single best chunk was actually in position one, or whether it had been displaced by a noisier match. Re-rankers are the cleanup stage you do not know you need until you measure.
Why RAG Quality Gets Under-Invested
There is a structural reason production RAG stays at “fine.” Retrieval quality is invisible in the output. The model paraphrases whatever context it was handed. If the context is subtly wrong, the answer is subtly wrong in the same way, and there is no visible seam. Compared with a broken tool call or a crashed agent, bad retrieval produces a beautifully formatted wrong answer. Nobody escalates a beautifully formatted wrong answer.
The other structural reason is that retrieval work does not look like AI work. Tuning a BM25 index, adding structural chunking, fitting a reranker, setting up hybrid fusion. This is information retrieval, a discipline that existed for thirty years before the current AI wave. It is not glamorous. It does not demo well. It does not fit cleanly into the “we’re an AI team” story.
So it gets under-resourced, and the team spends the same effort on prompt-engineering the symptom instead of fixing the cause.
The senior move is to treat the retrieval layer as a first-class component with its own metrics: recall at k, precision at k, mean reciprocal rank on a labeled eval set. Not just end-to-end answer quality. End-to-end metrics hide which stage is failing. Stage-level metrics tell you where to put the next hour of work.
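Those metrics are small functions, not a platform. A sketch, assuming a labeled eval set of (query, relevant chunk IDs) pairs; eval_set is a placeholder for your own:

# stage-level retrieval metrics over a labeled eval set
def recall_at_k(ranked_ids, relevant_ids, k):
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

def precision_at_k(ranked_ids, relevant_ids, k):
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / k

def mrr(ranked_ids, relevant_ids):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0

for query, relevant in eval_set:
    ranked = [c.id for c in hybrid_retrieve(query, k=50)]  # measure the stage you changed
    print(recall_at_k(ranked, relevant, k=50), precision_at_k(ranked, relevant, k=5), mrr(ranked, relevant))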
The Punchline
When an agentic system gives a wrong answer, the easy story is that the model got it wrong. The more useful story is that retrieval put the wrong passage in the context, or left the right one out, and the model did exactly what you would want it to do with the evidence you provided.
Fix the retrieval layer first. Hybrid search, structural chunking, a reranker on top, and stage-level metrics. The model is not the bottleneck. The bottleneck is what you are feeding it.
Coming Up in This Series
Next up: Data pipelines for AI: the unsexy foundation. RAG assumes you have clean, current, well-structured text to retrieve against. Most teams do not. The documents are stale, the formats are inconsistent, the same fact lives in three systems with three answers, and the index is built off whichever extract ran last Tuesday. The pipeline that feeds the retriever is where most RAG systems actually fail, and it has almost nothing to do with AI.
If this resonated and you’re building production AI systems, follow along. The series covers the 21 things I think senior AI engineers and architects need to reason about: RAG pipelines, tool design, security, evaluation, cost, and the operational patterns that separate demos from systems you can actually run.
