# RAG Pipeline Architecture: Beyond the Tutorial
Every RAG tutorial looks the same. Load documents, split into chunks, embed them, throw them into a vector database, query with an LLM. Ship it. Done.
And then production happens.
The gap between a working RAG demo and a production RAG system is not incremental. It is structural. We have built and operated RAG pipelines for clients across Europe, the Middle East, and India --- from legal document retrieval systems in Frankfurt to multilingual knowledge bases in Dubai to internal tooling for engineering teams in Bangalore. The pattern is consistent: 90% of production RAG failures are retrieval failures, not model failures.
Your LLM is almost certainly good enough. Your retrieval almost certainly is not.
This post is not a tutorial. It is a field report on what changes when you move RAG from a Jupyter notebook to a system that handles 50,000 queries a day with real users, real compliance requirements, and a real budget.
---
## The Tutorial Trap
The standard RAG pattern --- fixed-size chunking, cosine similarity over dense embeddings, stuff everything into a prompt --- works beautifully on demo datasets. It works because demo datasets are clean, homogeneous, and small.
Production data is none of those things. Your documents have tables, headers, nested lists, code blocks, and metadata that matters. Your users ask vague questions, follow-up questions, and questions that require synthesizing information across multiple documents. Your compliance team wants to know exactly which document a claim came from, and your finance team wants to know why the LLM bill tripled last month.
Companies like Freshworks and Zoho learned this when building AI features into their SaaS platforms. Razorpay discovered it when adding intelligent document processing to their payment infrastructure. The tutorial gets you to a demo. Production requires six fundamental changes.
---
## 6 Things That Change in Production
### 1. Semantic Chunking vs. Fixed-Size Splitting
Fixed-size chunking (500 tokens with 50-token overlap) is the default in every tutorial. It is also the single biggest source of retrieval degradation in production.
The problem is simple: fixed-size chunks routinely split semantic units. A paragraph explaining a policy gets cut in half. A table loses its header. A code example gets separated from its explanation.
Semantic chunking preserves meaning boundaries. We use a two-pass approach: first, split on structural markers (headings, paragraph breaks, list boundaries), then merge small chunks and split oversized ones while respecting sentence boundaries.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

def semantic_chunk(text: str, max_tokens: int = 512, similarity_threshold: float = 0.72):
    """Split text into semantically coherent chunks.

    Instead of fixed windows, we split on paragraph boundaries
    and merge adjacent segments while they remain semantically similar.
    """
    sentences = text.replace("\n\n", "\n\n<SPLIT>").split("<SPLIT>")
    sentences = [s.strip() for s in sentences if s.strip()]
    if not sentences:
        return []

    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks = []
    current_chunk = [sentences[0]]
    current_embedding = embeddings[0]

    for i in range(1, len(sentences)):
        similarity = np.dot(current_embedding, embeddings[i])
        # Length budget approximated by word count rather than true token count
        combined_length = sum(len(s.split()) for s in current_chunk) + len(sentences[i].split())
        if similarity >= similarity_threshold and combined_length <= max_tokens:
            current_chunk.append(sentences[i])
            # Running average of embeddings for the chunk, renormalized to unit length
            current_embedding = np.mean([current_embedding, embeddings[i]], axis=0)
            current_embedding /= np.linalg.norm(current_embedding)
        else:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [sentences[i]]
            current_embedding = embeddings[i]

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks
```

In our benchmarks across three client deployments, semantic chunking improved retrieval accuracy by 23% compared to fixed-size splitting, measured by the percentage of queries where the correct answer appeared in the top-5 retrieved chunks.
### 2. Hybrid Retrieval: Vector + BM25
Pure vector search has a well-known weakness: it struggles with exact keyword matches, entity names, and technical terms. Ask a vector database for documents mentioning "GDPR Article 17" and it might return documents about data privacy generally --- close in meaning, wrong in specificity.
BM25 (the algorithm behind Elasticsearch) excels at exact matches. Dense vectors excel at semantic similarity. Combining them is not optional in production.
```python
from qdrant_client import QdrantClient
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    """Combines dense vector search with BM25 sparse retrieval.

    Uses Reciprocal Rank Fusion (RRF) to merge results from
    both retrieval methods into a single ranked list.
    """

    def __init__(self, qdrant_client: QdrantClient, collection: str, corpus: list[dict]):
        self.qdrant = qdrant_client
        self.collection = collection
        self.corpus = corpus
        # Build BM25 index over the same corpus
        tokenized = [doc["text"].lower().split() for doc in corpus]
        self.bm25 = BM25Okapi(tokenized)
        self.doc_ids = [doc["id"] for doc in corpus]

    def search(self, query: str, query_embedding: list[float], top_k: int = 20,
               alpha: float = 0.6) -> list[dict]:
        """Hybrid search with RRF fusion.

        Args:
            alpha: Weight for vector search (1 - alpha for BM25).
                0.6 works well for most use cases.
        """
        # Dense vector search
        vector_results = self.qdrant.search(
            collection_name=self.collection,
            query_vector=query_embedding,
            limit=top_k * 2,
        )
        # BM25 sparse search
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_top = np.argsort(bm25_scores)[::-1][:top_k * 2]

        # Reciprocal Rank Fusion: each retriever contributes a weighted 1 / (k + rank),
        # so documents ranked highly by both methods rise to the top
        rrf_scores = {}
        k = 60  # RRF constant
        for rank, result in enumerate(vector_results):
            doc_id = result.id
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + alpha / (k + rank + 1)
        for rank, idx in enumerate(bm25_top):
            doc_id = self.doc_ids[idx]
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + (1 - alpha) / (k + rank + 1)

        # Sort by fused score
        ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
        return [{"id": doc_id, "score": score} for doc_id, score in ranked[:top_k]]
```

Hybrid retrieval delivered a 31% improvement in answer accuracy over pure vector search across our production deployments. The gains are most pronounced for queries containing specific identifiers, product names, or regulatory references --- exactly the queries that matter most in enterprise settings.
### 3. Cross-Encoder Reranking
Bi-encoder embeddings (what vector databases use) are fast but approximate. They encode query and document independently, which means they miss fine-grained interactions between query terms and document content.
Cross-encoders process the query and document together, producing a much more accurate relevance score --- but they are 100x slower. The solution is a two-stage pipeline: retrieve 50-100 candidates cheaply with hybrid search, then rerank the top candidates with a cross-encoder.
We use cross-encoder/ms-marco-MiniLM-L-12-v2 for English and jeffwan/mmarco-mMiniLMv2-L12-H384-v1 for multilingual deployments. This adds 50-100ms of latency but delivers an 18% quality improvement on our evaluation benchmarks. For CTO-level decision-making: this is the single highest ROI optimization after hybrid retrieval.
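A minimal sketch of that second stage, assuming the sentence-transformers CrossEncoder API and candidates shaped like the hybrid retriever's output joined back to chunk text (the `rerank` helper and dict fields are illustrative):

```python
from sentence_transformers import CrossEncoder

# Stage two of the pipeline: rescore hybrid-search candidates with a cross-encoder.
# Swap in the multilingual model mentioned above for non-English deployments.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512)

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    """Rerank retrieved candidates by cross-encoder relevance score.

    `candidates` is assumed to be a list of {"id": ..., "text": ...} dicts,
    e.g. the hybrid search output joined back to the stored chunk text.
    """
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, document) pair
    for candidate, score in zip(candidates, scores):
        candidate["rerank_score"] = float(score)
    return sorted(candidates, key=lambda c: c["rerank_score"], reverse=True)[:top_n]
```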
### 4. Automated Evaluation with RAGAS
Without automated evaluation, you are guessing whether your pipeline is improving or degrading. Every change --- a new chunking strategy, a different embedding model, a prompt tweak --- needs to be measured.
We run nightly evaluation pipelines using the RAGAS framework against a curated set of 200+ question-answer pairs. Four metrics matter:

- Faithfulness: is every claim in the answer supported by the retrieved context?
- Answer relevancy: does the answer actually address the question that was asked?
- Context precision: how much of the retrieved context is relevant to the question?
- Context recall: did retrieval surface the context needed to answer at all?
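A minimal sketch of such a run, assuming the ragas evaluate API and the column names used by recent ragas releases; the example record is illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# Each record holds a question, the chunks retrieved for it, the generated
# answer, and the expert-written reference answer (ground_truth).
eval_data = Dataset.from_list([
    {
        "question": "What is the data retention period for inactive accounts?",
        "contexts": ["Accounts inactive for 24 months are anonymized..."],
        "answer": "Inactive accounts are anonymized after 24 months.",
        "ground_truth": "Data from accounts inactive for 24 months is anonymized.",
    },
    # ... the remaining curated QA pairs
])

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # aggregate scores, pushed to the metrics store feeding the dashboards
```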
These metrics feed into Grafana dashboards. When any metric drops below threshold, an alert fires. This is not optional infrastructure --- it is the only way to maintain quality over time as your document corpus changes, user behavior shifts, and models get updated.
### 5. GDPR-Compliant Access Control in Vector Databases
For any deployment touching European user data --- and increasingly for Middle Eastern and Indian deployments with their own data protection regulations --- you cannot have a flat vector database where every query searches every document.
Both Weaviate and Qdrant support metadata filtering that enables document-level access control. Every chunk gets tagged with access metadata at ingestion time: tenant ID, classification level, data residency region. Every query includes a mandatory filter that restricts search to authorized documents.
This is not just a compliance checkbox. A healthcare client in Germany needed patient data to be retrievable only by the treating physician's team. A financial services client in Dubai required that regional data never left the GCC jurisdiction. Metadata filtering in the vector database is the enforcement layer.
The key architectural decision: enforce access control at the retrieval layer, not the application layer. If unauthorized chunks never enter the context window, they cannot leak into the response. Defense in depth.
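One way that enforcement can look with Qdrant's payload filtering; the collection name, payload fields, and `authorized_search` helper are illustrative rather than an exact schema:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny, MatchValue

client = QdrantClient(url="http://localhost:6333")

def authorized_search(query_embedding: list[float], tenant_id: str,
                      user_groups: list[str], region: str, top_k: int = 20):
    """Vector search restricted to chunks the caller is allowed to see.

    The filter is applied inside the vector database, so unauthorized
    chunks never reach the reranker or the LLM context window.
    """
    access_filter = Filter(must=[
        FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id)),
        FieldCondition(key="allowed_groups", match=MatchAny(any=user_groups)),
        FieldCondition(key="residency_region", match=MatchValue(value=region)),
    ])
    return client.search(
        collection_name="documents",
        query_vector=query_embedding,
        query_filter=access_filter,
        limit=top_k,
    )
```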
### 6. Cost Management: Caching, Batching, Model Routing
At 50,000 queries per day, costs add up fast. Three techniques keep them manageable:
Semantic caching: Hash the query embedding (not the text) and cache responses for similar queries. A cosine similarity threshold of 0.95 catches 15-25% of queries as cache hits without returning stale answers.
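A minimal in-memory sketch of the idea; production deployments typically back this with Redis or a small vector index, and the `SemanticCache` class here is illustrative:

```python
import numpy as np

class SemanticCache:
    """Cache responses keyed by query embedding rather than query text.

    A new query reuses a cached response when its embedding is within the
    cosine-similarity threshold of a previously answered query.
    """

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def get(self, query_embedding: np.ndarray) -> str | None:
        q = query_embedding / np.linalg.norm(query_embedding)
        for emb, response in zip(self.embeddings, self.responses):
            if float(np.dot(q, emb)) >= self.threshold:
                return response  # cache hit: skip retrieval and generation entirely
        return None

    def put(self, query_embedding: np.ndarray, response: str) -> None:
        q = query_embedding / np.linalg.norm(query_embedding)
        self.embeddings.append(q)
        self.responses.append(response)
```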
Embedding batching: Instead of embedding queries one at a time, batch them in windows of 32-64. This reduces per-query embedding costs by 40% and latency by 60% for async workloads.
Model routing: Not every query needs GPT-4. Route simple factual lookups to a smaller model (GPT-4o-mini, Claude Haiku, or a fine-tuned open-source model). Reserve the expensive model for complex reasoning queries. A simple classifier based on query length, entity count, and retrieval confidence handles the routing. This alone cuts LLM costs by 50-60%.
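A rough sketch of such a classifier; the thresholds and model labels are placeholders rather than tuned values:

```python
def route_model(query: str, entity_count: int, retrieval_confidence: float) -> str:
    """Pick a model tier from cheap signals available before generation.

    entity_count: number of named entities detected in the query.
    retrieval_confidence: top reranker score for the retrieved chunks.
    """
    is_short = len(query.split()) <= 12
    few_entities = entity_count <= 2
    confident_retrieval = retrieval_confidence >= 0.7

    if is_short and few_entities and confident_retrieval:
        return "small-model"    # e.g. GPT-4o-mini or Claude Haiku
    return "premium-model"      # complex or low-confidence queries
```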
Realistic monthly cost at 50K queries/day: $500--$2,500 depending on model mix and caching efficiency.
---
## Production Architecture
Here is the architecture that emerges from these six changes:
```
                      User Query
                          |
                          v
    Query Rewriter (expand abbreviations, resolve coreferences)
                          |
                          v
    +-------------------+        +-------------------+
    | Vector DB Search  |        | BM25 Index Search |
    | (Qdrant/Weaviate) |        |  (Elasticsearch)  |
    +-------------------+        +-------------------+
               \                       /
                v                     v
             Reciprocal Rank Fusion (RRF)
                          |
                          v
              Cross-Encoder Reranker
                 (top 50 -> top 5)
                          |
                          v
                 Context Assembly
        (chunk dedup, ordering, token budget)
                          |
                          v
                  LLM Generation
            (with citation extraction)
                          |
                          v
          Response + Source Citations
                          |
                          v
             Eval Logging (async)
    (faithfulness, relevancy -> RAGAS nightly)
```

Every component is independently scalable. The vector database and BM25 index can be sharded. The reranker runs on GPU. The LLM call goes through a routing layer. Eval logging is async and never blocks the response path.
---
## The Evaluation Problem Nobody Solves
Most teams skip evaluation because it is hard to build and maintain. This is a mistake that compounds over time.
The core challenge: you need ground-truth question-answer pairs that reflect real user queries. Synthetic QA generation (using an LLM to generate questions from your documents) gets you 70% of the way there. The remaining 30% must come from real user queries, manually annotated by domain experts.
Our nightly eval pipeline:
1. Run 200+ QA pairs through the full retrieval + generation pipeline
2. Compute RAGAS metrics for each pair
3. Aggregate scores and compare against the previous night's baseline
4. Flag any metric that dropped by more than 2 percentage points
5. Store results in a time-series database for trend analysis
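Steps 3 and 4 boil down to a small comparison against the stored baseline; a sketch, with the time-series storage left out:

```python
def flag_regressions(current: dict[str, float], baseline: dict[str, float],
                     max_drop: float = 0.02) -> list[str]:
    """Return the metrics that dropped more than `max_drop` (2 percentage points)
    versus the previous night's baseline scores."""
    return [
        metric for metric, score in current.items()
        if metric in baseline and baseline[metric] - score > max_drop
    ]

# Example: flag_regressions({"faithfulness": 0.89}, {"faithfulness": 0.93}) -> ["faithfulness"]
```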
The investment is significant --- roughly 2 engineer-weeks to set up, plus ongoing curation of the QA set. The alternative is flying blind while your pipeline silently degrades as documents change and user patterns shift.
---
## Cost Breakdown at Scale
Here is what each component costs per query, and how it scales:
These numbers assume a mix of 70% routed to a cheaper model and 30% to a premium model, with 20% cache hit rate. Your mileage will vary based on document corpus size, average query complexity, and chosen providers.
The biggest lever is model routing. Without it, the LLM line item alone would be $450/mo at 10K queries/day and $4,500/mo at 100K. Routing is not an optimization --- it is a requirement for economic viability.
---
## Mistakes We Made (So You Do Not Have To)
Mistake 1: Embedding the entire document corpus on day one. We embedded 2 million chunks before testing retrieval quality. The embedding model was wrong for the domain. We re-embedded everything, burning $800 in compute and 3 days of pipeline time. Solution: Embed a 5% sample first. Run your eval suite. Only scale when metrics confirm the model works.
Mistake 2: Ignoring chunk metadata. Early versions stored raw text chunks without preserving document structure --- no source file, no section heading, no page number. When users asked "where did this come from?" we could not answer. Solution: Treat metadata as a first-class citizen. Every chunk carries: source document, section path, page/paragraph number, ingestion timestamp, and access control tags.
Mistake 3: Using the same prompt for all query types. A factual lookup ("What is our refund policy?") and an analytical question ("How has our refund policy changed over the last 3 years?") require fundamentally different prompts and retrieval strategies. One-size-fits-all prompting degraded quality for both. Solution: Classify queries into types (factual, analytical, comparative, procedural) and use type-specific prompts and retrieval parameters.
Mistake 4: Not monitoring retrieval latency separately from generation latency. When users complained about slow responses, we assumed the LLM was the bottleneck. It was actually the vector database --- a missing index on a metadata filter was causing full scans. Solution: Instrument every pipeline stage independently. Track p50, p95, and p99 latencies for embedding, retrieval, reranking, and generation separately.
---
## FAQ
### Should we use OpenAI embeddings or open-source?
It depends on your data residency requirements and cost tolerance. OpenAI's text-embedding-3-large is excellent but sends your data to OpenAI's servers --- a non-starter for many European clients under GDPR. Open-source models like BAAI/bge-large-en-v1.5 or intfloat/multilingual-e5-large run on your own infrastructure, cost nothing per query after the GPU investment, and perform within 2-5% of OpenAI on most benchmarks. For clients in Germany, the Netherlands, or the UAE where data sovereignty matters, self-hosted embeddings are the only viable path.
### How do we handle multi-language documents for EU clients?
Multilingual embedding models are the foundation. intfloat/multilingual-e5-large supports 100+ languages and produces embeddings in a shared vector space --- meaning a German query will retrieve relevant French documents. For generation, use a multilingual LLM (GPT-4, Claude, or Mixtral) with explicit language instructions in the system prompt. Critical: your chunking strategy must be language-aware. German compound words and Arabic right-to-left text require different tokenization than English. Test retrieval quality per language pair, not just in aggregate.
### What vector database should we choose?
For most production deployments: Qdrant if you need strong metadata filtering and GDPR-compliant self-hosting, Weaviate if you want a more opinionated framework with built-in modules, pgvector if your team already runs PostgreSQL and your corpus is under 5 million chunks. Avoid managed-only solutions if data residency is a constraint. All three support the hybrid search and metadata filtering patterns described above.
### How long does it take to go from prototype to production?
For a team that has built one before: 6-8 weeks. For a team doing it the first time: 12-16 weeks. The biggest time sinks are evaluation pipeline setup (2-3 weeks), data ingestion pipeline with proper chunking and metadata (2-3 weeks), and access control implementation (1-2 weeks). The LLM integration itself is typically done in the first week --- everything else is the infrastructure that makes it reliable.
---
## Build It Right the First Time
The difference between a RAG demo and a RAG system is not sophistication --- it is discipline. Semantic chunking, hybrid retrieval, reranking, automated evaluation, access control, and cost management are not advanced features. They are baseline requirements for production.
If your team is planning a RAG deployment or struggling with retrieval quality in an existing one, [we can help](https://www.techsaas.cloud/services/). TechSaaS has designed and operated production RAG systems for clients across Europe, the Middle East, and India --- from architecture review to full implementation and ongoing optimization.
[Talk to our team about your RAG pipeline ->](https://www.techsaas.cloud/services/)