Build the Best RAG Pipeline for Your GenAI Apps: A Hands-On, Expert Guide

1. Introduction

“LLMs are brilliant… until they’re confidently wrong.”
That’s a sentence I’ve had to explain more times than I can count.

I’ve worked on multiple GenAI apps over the past year, and one thing became painfully clear: pure LLMs just don’t cut it for production.

They’re great at general language understanding, sure—but if your app needs to reference dynamic knowledge, proprietary data, or anything outside the model’s pretraining, you need Retrieval-Augmented Generation (RAG). No way around it.

I’ve spent a lot of time testing what works—and more importantly, what doesn’t. From latency issues with bloated pipelines to hallucinations caused by poor chunking, I’ve hit enough walls to know how to build a system that doesn’t fall apart under real-world usage.

In this guide, I’ll walk you through everything I wish I had when I was building my first serious RAG system—from choosing the right embedding model to plugging in a reranker when your retriever starts pulling garbage.

You’ll walk away with:

  • A production-grade, modular RAG pipeline you can customize
  • Real code for each step (not vague placeholders)
  • Tips from actual experience using tools like Weaviate, Mistral, BGE, and LangChain

I’m keeping it open-source, performance-first, and extensible—because that’s how I’ve had the most success scaling these pipelines in real apps.

Let’s get into it.


2. Architecture Overview (Brief but Clear)

Before we dive into code, here’s the high-level flow I’ve personally used across several projects:

User Query → Embed + Retrieve → (Optional: Re-rank) → Inject into Prompt → LLM → Response

Each part can be swapped or fine-tuned depending on your stack and use case. I’ve had setups where just dense retrieval was enough—and others where not having a reranker was the reason the answers started going off the rails.
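
If it helps to see that flow as code, here's a bird's-eye sketch. Every function here is a swappable placeholder passed in as a callable, not a real API; the later sections fill in concrete versions.

def answer(query, embed, retrieve, rerank, build_prompt, llm):
    # Each argument is a swappable component supplied by you.
    query_vec = embed(query)              # embedding model (Section 4)
    chunks = retrieve(query_vec)          # vector store search (Sections 5-6)
    chunks = rerank(query, chunks)        # optional reranker (Section 6)
    prompt = build_prompt(query, chunks)  # prompt template (Section 7)
    return llm(prompt)                    # local or API-hosted model (Section 7)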

Here’s the exact stack I’ve used in production RAG pipelines:

Vector Stores:

  • Weaviate: When I needed built-in hybrid search and GraphQL filtering
  • Qdrant: Great if you want fast, local, Dockerized indexing
  • FAISS: Still the fastest for small-to-medium local projects, but not ideal for anything dynamic

Embedding Models:

  • BGE (BAAI General Embeddings): My go-to for high-quality dense embeddings, especially with bge-base-en-v1.5
  • InstructorXL: Excellent if your documents are long and you want instruction-tuned behavior
  • OpenAI Embeddings: Expensive at scale, but solid baseline

Language Models:

  • Mistral 7B: Surprisingly solid out of the box
  • Llama 2 (13B/70B): Used in private deployments, quantized with GGUF for performance
  • OpenChat / Mixtral: For when I needed a solid open-weight model that could handle reasoning

Backend Glue:

  • LangChain: Fast to prototype, though I’ve had to write custom wrappers to keep it sane at scale
  • LlamaIndex: Strong when you need better document routing and retriever chaining
  • Custom Python scripts: Honestly, sometimes you just need to bypass the abstraction and handle it yourself

This modular layout means you can plug and play depending on your infra, latency budget, and use case.

Next, I’ll walk you through how I prepare documents so the retrieval step doesn’t silently ruin everything (been there too many times).


3. Data Preparation That Doesn’t Suck

“Garbage in, garbage out” hits very differently when your LLM is hallucinating because you fed it a PDF full of headers, footers, and copyright blurbs.

I’ve learned (sometimes the hard way) that data prep is where most RAG pipelines quietly fail. If your chunks are misaligned, too long, too short, or stripped of useful metadata, no fancy LLM is going to save you.

Let’s break this down.

Real-World Parsing (PDFs, HTML, Markdown)

When I’m loading documents, I always start with the simplest goal: preserve structure without injecting noise. Here’s what’s worked best for me:

  • PDFs → unstructured package or pdfminer.six + my own layout logic
  • HTML → BeautifulSoup, obviously—but I’ve built custom filters for tags like <nav>, <footer>, and <script>, which I almost always skip (see the sketch right after this list)
  • Markdown / Notebooks → I treat these as structured text, but I aggressively tag headers and code blocks
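
For HTML, the cleanup is mostly about dropping boilerplate tags before chunking. Here's a minimal sketch with BeautifulSoup; the skip list is just the set I usually start with, so treat it as a starting point rather than a rule.

from bs4 import BeautifulSoup

# Tags I almost always drop before chunking; extend this list for your own sources.
SKIP_TAGS = ["nav", "footer", "script", "style", "aside"]

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(SKIP_TAGS):
        tag.decompose()  # remove the element and its children entirely
    blocks = soup.find_all(["h1", "h2", "h3", "p", "li", "pre"])
    return "\n\n".join(b.get_text(" ", strip=True) for b in blocks)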

Chunking Strategy: Recursive vs Fixed

Fixed-size chunking sounds nice. It’s predictable. It’s also where half my search failures came from.

I’ve personally had better results with Recursive Character Text Splitting (LangChain’s RecursiveCharacterTextSplitter or a custom tokenizer-based version). It respects semantic boundaries—especially when paired with a chunk overlap of 20-50 tokens.

That said, for HTML-heavy or transcript-style data, I’ve sometimes written custom logic that chunks by heading tags or timestamps. Whatever helps keep context within a single chunk.

Here’s a working chunker I’ve used in production:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ".", "!", "?", ",", " "]
)

chunks = text_splitter.split_documents(documents)

You can adjust separators based on your domain. For transcripts, I’ll even add "[Timestamp]" as a boundary.

Metadata Design: What You Should Actually Store

Here’s the deal: if you’re not tagging documents with meaningful metadata, your filtering’s going to suck—and your reranker (if you use one) will be working overtime.

What I store:

{
  "title": "Deep Learning with PyTorch",
  "section": "Chapter 4: Training Loops",
  "source": "book",
  "filepath": "/docs/pytorch_book/ch4.md",
  "timestamp": "00:05:32",  # only for media
  "tags": ["training", "optimization", "loops"]
}

You’ll thank yourself later when you need to:

  • Filter results by source
  • Pre-rank based on sections or chapters
  • Display human-readable search results

4. Choosing the Right Embedding Model

“You don’t need the biggest model—you need the one that gets your data.”

This might surprise you: I’ve had better semantic search performance using BGE-small than with OpenAI’s ada-002 in certain niche domains.

Let me walk you through the trade-offs based on real usage.

My Top Embedding Models (and When I Use Them)

  • BGE-base / BGE-small: My go-to for English, general-purpose data. Super efficient, open-source, and works well with rerankers.
  • Instructor XL: Great for instructional content, FAQs, and knowledge bases. Performs especially well when paired with prompt-style queries.
  • OpenAI ada-002: Reliable but expensive at scale. Latency is a concern. I only use this when I don’t want to host models.
  • E5 / GTE / MPNet: Solid options, especially for multilingual or domain-specific contexts.

For domain-specific tasks (e.g., medical or legal), I’ve had to train or fine-tune embeddings myself—but that’s another story.

Accuracy vs Latency vs Vector Size

Here’s a quick take from my experience:

Model           Accuracy    Latency (local)   Vector Size
BGE-small       Good        Fast              384
BGE-base        Great       Fast-ish          768
Instructor-XL   Excellent   Medium            768
ada-002         Great       High (API)        1536

I personally prefer smaller vectors unless I really need fine semantic nuance. Smaller vectors = faster retrieval.

Embedding Code Example

Here’s a pipeline I’ve used with sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def embed_texts(texts):
    return model.encode(texts, normalize_embeddings=True)

And if you’re storing these in Qdrant:

import qdrant_client
from qdrant_client.http.models import PointStruct

client = qdrant_client.QdrantClient("localhost")

client.upsert(
    collection_name="rag_chunks",
    points=[
        PointStruct(
            id=i,
            vector=vec.tolist(),
            payload=metadata[i],
        )
        for i, vec in enumerate(embeddings)
    ]
)

That’s the stack I’ve personally used on multiple live systems. Fast, easy to update, and portable.


5. Indexing Into a Vector Store (Without Losing Your Mind)

“You can’t build a library of knowledge if you don’t index your books properly.”

I’ve gone through more vector DB setups than I care to admit—Weaviate, Qdrant, FAISS, Pinecone—and I’ll be blunt: your choice doesn’t matter as much as your schema and how you feed it.

Let’s walk through what’s worked for me in production, where latency, filtering, and update patterns actually matter.

Which Vector DB? (And Why)

If I had to pick one for most of my use cases, it’d be Qdrant. Why?

  • Fast setup, especially with Docker
  • Filterable metadata support out of the box
  • Native support for hybrid search (vector + keyword)
  • Local or cloud, your call
  • Clean REST and gRPC APIs

That said, I’ve used Weaviate for more semantically rich projects because of its hybrid search and schema management via GraphQL. FAISS? It’s great, but I wouldn’t touch it unless I need pure vector math and zero infra overhead.

Schema Design That Doesn’t Bite You Later

Here’s the deal: your index schema is not just some technicality. It defines what your retriever can filter on, how you re-rank, and whether you can scale your system.

Here’s what I typically include in my Qdrant schema payload:

{
  "id": "uuid",
  "vector": [...],
  "payload": {
    "text": "actual chunk content",
    "title": "RAG Tutorial",
    "section": "5. Indexing",
    "source": "blog",
    "tags": ["rag", "indexing", "vectorstore"],
    "timestamp": "00:14:25"
  }
}

Personally, I always tag chunks with source and section-level info. It’s helped me:

  • Filter irrelevant matches
  • Run section-aware retrieval
  • Build explainable chains

If your DB supports filters (like Qdrant or Weaviate), use them. You don’t want your search returning glossary definitions when the user asks about implementation details.

Index Update Strategy (Batch vs Streaming vs Real-Time)

You might be wondering: how often should I update my index?

Here’s how I handle it:

  • Batch: For books, manuals, and static content—use nightly jobs or retrain pipelines
  • Streaming: When ingesting chat logs, user tickets, or dynamic docs—use Kafka or a pub/sub system
  • Real-time: Honestly? Rare. But when I had to push updates as users edited docs in real-time, I built a lightweight webhook to trigger chunking + upserts within seconds

Avoid premature optimization. Start with batch and move to streaming only when your content changes faster than your job schedule.

Code: Indexing Into Qdrant (with Retry Logic)

Let me show you the pattern I use to push vectors + metadata into Qdrant—with batching and retries baked in.

import time
import uuid
import qdrant_client
from qdrant_client.http.models import PointStruct

client = qdrant_client.QdrantClient("localhost", port=6333)

def index_chunks(vectors, metadatas, batch_size=100):
    points = []
    for i, (vec, meta) in enumerate(zip(vectors, metadatas)):
        point = PointStruct(
            id=str(uuid.uuid4()),
            vector=vec.tolist(),
            payload=meta
        )
        points.append(point)

        if len(points) == batch_size:
            _upsert_with_retry(points)
            points = []

    if points:
        _upsert_with_retry(points)

def _upsert_with_retry(points, retries=3):
    for attempt in range(retries):
        try:
            client.upsert(collection_name="rag_chunks", points=points)
            return
        except Exception as e:
            print(f"Upsert failed: {e}")
            time.sleep(2 ** attempt)
    raise RuntimeError("Failed to upsert after multiple attempts.")

If you’re using Weaviate, you’ll use client.batch.add_data_object() instead—but the batching + retry principle is the same.

This setup has saved me so many times during large corpus loads or flaky local test runs.
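
And for the rare real-time path I mentioned earlier, the webhook was not much more than this. A hypothetical sketch assuming FastAPI; the /doc-updated route and payload fields are made up, and it reuses text_splitter, embed_texts, and index_chunks from the earlier sections.

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/doc-updated")
async def doc_updated(request: Request):
    body = await request.json()
    chunks = text_splitter.split_text(body["content"])   # splitter from Section 3
    vectors = embed_texts(chunks)                         # embedder from Section 4
    metadatas = [{"text": c, "source": body.get("source", "live_doc")} for c in chunks]
    index_chunks(vectors, metadatas)                      # batched upsert with retries (above)
    return {"indexed_chunks": len(chunks)}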


6. Query Pipeline: Retriever Logic That Doesn’t Choke

“The moment users start typing real questions, your whole stack either shows up… or shows cracks.”

Once I had my chunks embedded and indexed, I figured I was close to done. But honestly? This is where things started breaking.

The retrieval pipeline is where theory collides with reality—real queries, real ambiguity, and a lot of trial and error. I’ve tried everything: dense retrieval, sparse (BM25), hybrid setups, and rerankers. And here’s what I learned the hard way.

What I Tried: BM25, Dense, and Hybrid

BM25: Surprisingly Solid (For Keyword-Heavy Domains)

If your data is jargon-heavy (e.g., legal docs, medical records, or source code), BM25 actually punches above its weight. I used Elasticsearch and Weaviate’s sparse capabilities to test this.

Pro: Lightning-fast and brutally literal.
Con: Falls flat when the query is phrased in a way the document never used.

Dense: My Default Starting Point

I’ve mostly used BGE and Instructor models for embeddings. Dense retrieval gives me strong semantic matches, especially for abstract or instructional queries.

But here’s the thing—dense models can get overconfident and hallucinate semantic matches that aren’t actually useful. I’ve had cases where it pulled “vibes” instead of facts.

Hybrid: The Sweet Spot (Most of the Time)

Once I started blending BM25 and dense scores, the results got way more stable. For example, a BM25 match would catch the right section by keyword, and then dense embeddings would help find the most semantically relevant chunk within that section.

I’ve used hybrid setups in Weaviate, Qdrant with keyword payload filters, and even custom reranking logic where I manually weight scores. Not every vector DB supports it natively—but trust me, it’s worth hacking together.
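
If your store doesn't blend scores for you, the manual version is simple enough. Here's a rough sketch of how I weight the two signals; it assumes you already have BM25 and dense scores keyed by chunk ID, and alpha is something I tune per corpus.

def hybrid_scores(bm25_scores, dense_scores, alpha=0.5):
    # bm25_scores / dense_scores: dicts mapping chunk_id -> raw score.
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    bm25_n = normalize(bm25_scores)
    dense_n = normalize(dense_scores)
    return {
        cid: alpha * dense_n.get(cid, 0.0) + (1 - alpha) * bm25_n.get(cid, 0.0)
        for cid in set(bm25_n) | set(dense_n)
    }

# Usage: top 5 chunk IDs by blended score
# top5 = sorted(hybrid_scores(bm25, dense, alpha=0.6).items(), key=lambda kv: -kv[1])[:5]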

How I Actually Evaluated Retrieval Quality

“Precision isn’t enough. I wanted confidence that I’d trust the chunk being returned.”

I built a small QA set of real questions based on the dataset and manually reviewed the top 3-5 retrieved chunks for each retriever strategy.

What I looked for:

  • Did the chunk directly answer the question?
  • Was it specific or just a vague semantic match?
  • Was any important context missing?

I even wrote a script to dump the top-5 results for a batch of questions side-by-side from each retriever—dense, BM25, hybrid—and just eyeballed the deltas. Painful, yes. Worth it? Also yes.
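
The script itself was nothing fancy; something along these lines, where retrievers is a dict of name to callable that returns chunk texts for a query (the output path is arbitrary).

import json

def dump_retrieval_comparison(questions, retrievers, top_k=5, path="retrieval_comparison.jsonl"):
    # One JSON line per question, with the top-k chunks from each retriever side by side.
    with open(path, "w") as f:
        for q in questions:
            row = {"question": q}
            for name, retrieve in retrievers.items():
                row[name] = retrieve(q)[:top_k]
            f.write(json.dumps(row) + "\n")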

Should You Filter by Metadata? Absolutely.

Filtering by metadata isn’t optional—it’s how you keep your retriever focused. I’ve used metadata filters to:

  • Restrict by source (e.g., only return from docs, not blog posts)
  • Target specific sections (e.g., only Chapter 3 or FAQ)
  • Narrow time windows (super useful in log-based RAG setups)

Here’s a simple Qdrant filter I’ve used to restrict retrieval:

from qdrant_client.http import models

source_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="source",
            match=models.MatchValue(value="technical_docs")
        )
    ]
)

Even if you’re not using Qdrant, almost every vector DB with metadata payloads supports something similar.

Optional: Reranker Integration (Cohere Rerank / ColBERT)

Once I had decent first-pass retrieval, reranking made everything feel sharper.

I’ve had solid results using Cohere Rerank—especially with their English rerank models on instruction-style data. Just pass the query and the retrieved documents, and it scores them.

Sample rerank with Cohere:

import cohere

co = cohere.Client("COHERE_API_KEY")

results = co.rerank(
    query="How do I fine-tune a transformer on tabular data?",
    documents=[doc["text"] for doc in retrieved_docs],
    top_n=5
)

reranked = [retrieved_docs[i.index] for i in results.results]

ColBERTv2 is great too, especially if you’re running your own stack and want control over late interaction. But it takes some serious infra and GPU.

Full Retriever Pipeline (Hybrid + Filter + Rerank)

Here’s what a full retriever setup might look like in a Qdrant + Cohere stack:

# 1. Dense retrieval
query_vector = embedding_model.encode(["your query here"])[0]

# 2. Apply filter + get dense results
hits = client.search(
    collection_name="rag_chunks",
    query_vector=query_vector,
    limit=10,
    with_payload=True,
    score_threshold=0.2,
    query_filter=source_filter
)

# 3. Rerank
texts = [hit.payload["text"] for hit in hits]

results = co.rerank(
    query="your query here",
    documents=texts,
    top_n=5
)

final_chunks = [texts[i.index] for i in results.results]

This combo gives you precision, context relevance, and filtering—all tuned from hands-on frustration.


7. Connecting to the LLM

“A model is only as smart as the mess you hand it.”

I’ve connected to LLMs in every way imaginable—local, remote, hot-swapped between APIs mid-session—you name it. What I’ve learned is that how you connect matters just as much as what you’re connecting to.

Let’s break it down.

Local vs API-Hosted LLMs: Trade-offs I’ve Run Into

I’ve hosted local models like LLaMA, Mistral, and Mixtral using llama-cpp-python, text-generation-webui, and even spun up GGUF versions on bare metal with 48GB VRAM.

Why go local?

  • Latency is unbeatable once it’s warm—sub-300ms with 7B models.
  • No rate limits or token caps from third-party APIs.
  • Full control over model weights, quantization, and batching.

But here’s the kicker: local hosting burns time. You’re suddenly managing:

  • Model loading optimizations (trust me, --n-gpu-layers can make or break you)
  • Memory issues (even GGUF 13B struggles on 24GB cards)
  • Threading quirks and llama-cpp’s tokenizer weirdness

For most production RAG setups, I still reach for API-hosted models like OpenAI or Groq when I want:

  • Scale instantly
  • Access to larger models like GPT-4 or Claude
  • Lower ops overhead

That said, if you’re running queries 24/7, local starts to make a lot more sense financially. I’ve run small quantized models for under $50/month on a dedicated box.

Prompt Templates: Plain vs Structured

Early on, I just threw raw strings into the prompt:

prompt = f"Answer the question using context:\n\n{context}\n\nQuestion: {question}"

It worked. Sort of. Until it didn’t.

Once the prompts got bigger—and context windows started bloating—I switched to structured templates using LangChain’s PromptTemplate and ChatPromptTemplate. Why? Because you need consistent formatting if you’re ever going to test or tweak anything.

Example with LangChain:

from langchain.prompts import ChatPromptTemplate

template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that only answers using provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

prompt = template.format_messages(
    context=context_snippet,
    question=user_query
)

This gives you clean abstraction, especially when you start A/B testing prompt phrasing or injecting metadata dynamically.

Passing Context: Inline vs JSON vs… Weird Hacks

Here’s the deal: how you inject context has a huge impact on downstream answer quality.

What worked for me:

  • Inline raw context works well when you trust the LLM to reason.
  • Structured JSON helped when I needed the model to reference multiple docs or chunk sources (e.g., [{"section": "2.3", "text": "..."}, ...]); see the sketch right after this list
  • Context priming hacks, like adding "Based on the following knowledge base..." to anchor the prompt, helped reduce hallucination in complex cases.
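
Here's a hedged sketch of that structured-JSON injection. The field names are just the ones I tend to carry in chunk metadata, and context_chunks / user_query are assumed to come from the retrieval step.

import json

def format_context_as_json(chunks):
    # Keep only the fields the model needs to reference sources.
    payload = [
        {
            "section": c["metadata"].get("section", ""),
            "source": c["metadata"].get("source", ""),
            "text": c["text"],
        }
        for c in chunks
    ]
    return json.dumps(payload, indent=2)

prompt_text = (
    "Based on the following knowledge base entries (JSON), answer the question.\n\n"
    f"{format_context_as_json(context_chunks)}\n\n"
    f"Question: {user_query}"
)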

One trick I’ve used: adding a pseudo memory line at the top like:

“You are answering based on the company’s internal documentation from 2023.”

It subtly nudges the model into staying grounded—even with fuzzy questions.

Code: LLM Call Logic (LangChain + Raw Python)

Here’s how I set up a typical API call using OpenAI + LangChain:

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0.3)

response = llm(prompt)

For local models via llama-cpp-python:

from llama_cpp import Llama

llm = Llama(model_path="mistral-7b.gguf", n_gpu_layers=20)

output = llm(
    f"Answer using context:\n\n{context}\n\nQuestion: {question}",
    max_tokens=512,
    temperature=0.3,
    stop=["User:", "System:"]
)

Bonus Tip: Streaming Responses

If latency is killing UX, I’ve used streaming tokens with LangChain to pipe responses to the UI as they’re generated. It feels way more interactive, even with slower models.

llm = ChatOpenAI(streaming=True)

for chunk in llm.stream(prompt):
    print(chunk.content, end="", flush=True)

So if your LLM connection feels slow, janky, or unreliable—chances are, it’s not just the model. It’s your pipeline.


8. Building the Full Pipeline (End-to-End Glue Code)

“A pipeline isn’t just a sequence—it’s a conversation between brittle components pretending to cooperate.”

When I built my first production-grade RAG pipeline, I naively thought the hard part was the model. Nope. The real battle was making the pieces talk to each other—cleanly, modularly, and without choking on edge cases.

Let me walk you through what I’ve settled on after way too many “why is this NoneType” errors in logs.

The Core Pipeline Structure

At a high level, the steps are:

  1. Take in a query.
  2. Retrieve the most relevant context chunks.
  3. Format the prompt with that context.
  4. Call the LLM.
  5. Return the response (with sources, optionally).

I like to write it as a class with modular components so I can easily swap out vector DBs or models:

class RAGPipeline:
    def __init__(self, retriever, llm, prompt_fn, cache=None):
        self.retriever = retriever
        self.llm = llm
        self.prompt_fn = prompt_fn
        self.cache = cache or {}

    def run(self, query):
        if query in self.cache:
            return self.cache[query]

        context_chunks = self.retriever.get_relevant_docs(query)
        prompt = self.prompt_fn(query, context_chunks)
        response = self.llm(prompt)

        self.cache[query] = response
        return response

This is simplified, but the key is encapsulation. You want the retriever, prompt logic, and LLM to be swappable pieces—not buried in spaghetti if/else logic.

Error Handling & Fallbacks

This might surprise you: the LLM will fail in production—timeouts, 502s, bad JSON. If you’re not wrapping your calls with proper error handling, you’re gambling.

Here’s a simple retry wrapper I’ve used around the LLM call:

import time

def safe_llm_call(llm, prompt, retries=3):
    for attempt in range(retries):
        try:
            return llm(prompt)
        except Exception as e:
            if attempt < retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise RuntimeError(f"LLM failed after {retries} attempts: {e}")

And I always include fallback prompts—shorter, less context-heavy versions—just in case the full one blows the token limit or times out.
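
The fallback itself is nothing clever. A sketch of the idea, assuming prompt_fn is the same callable the RAGPipeline uses and max_chars is a crude stand-in for your real token budget:

def build_prompt_with_fallback(query, chunks, prompt_fn, max_chars=8000):
    full = prompt_fn(query, chunks)
    if len(full) <= max_chars:
        return full
    # Shorter, less context-heavy version: keep only the single best chunk.
    return prompt_fn(query, chunks[:1])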

Logging That’s Actually Useful

One thing I’ve learned the hard way: logging the final output is useless without seeing the context that went in.

Here’s the log format I usually go with:

from datetime import datetime

log = {
    "query": query,
    "context_doc_ids": [doc.id for doc in context_chunks],
    "prompt_chars": len(prompt),  # character count; swap in a tokenizer if you need true token counts
    "llm_response": response,
    "timestamp": datetime.now().isoformat()
}

You’d be shocked how often a “bad answer” is really a bad context pull. Good logs let you trace that.

Caching That Actually Saves You

If you’re not caching at multiple levels, you’re wasting compute. Here’s what I cache and why:

  • Query → Context Chunks
    Avoid hitting the vector store repeatedly for the same input.
  • Query → LLM Response
    Huge if you’re dealing with expensive models like GPT-4.

I typically use an in-memory dict for dev, but in production? Redis or something like DuckDB with a simple hash index does the trick.

Example plug-in swap:

from redis import Redis

class RedisCache:
    def __init__(self, client: Redis):
        self.client = client

    def __contains__(self, key):
        return self.client.exists(key)

    def __getitem__(self, key):
        value = self.client.get(key)
        return value.decode() if isinstance(value, bytes) else value  # redis-py returns bytes by default

    def __setitem__(self, key, value):
        self.client.set(key, value)

Just slot this into the RAGPipeline constructor.

Modular Config Support

You might be wondering: how do I keep all these components flexible?

Personally, I like to load everything from a config file or dict. Here’s a stripped example:

config = {
    "retriever": "bm25",  # or "dense" or "hybrid"
    "llm": "gpt-4",        # or "mistral-local"
    "cache": True
}

Then I just write a factory that builds the pipeline:

def build_pipeline(config):
    retriever = load_retriever(config["retriever"])
    llm = load_llm(config["llm"])
    prompt_fn = select_prompt_template(config["llm"])
    cache = RedisCache(Redis()) if config["cache"] else {}

    return RAGPipeline(retriever, llm, prompt_fn, cache)

This gives you total control with minimal branching in your main code.

So yeah—this is the real engine room of your RAG stack. If your glue is brittle, nothing else will hold. Build it modular, log everything, and treat your fallbacks like first-class citizens.


9. Evaluation: Does It Even Work?

“The model gave an answer” is not the same as “the model gave a useful answer.”

After building out the full pipeline, I hit this wall: some responses looked great, others were way off—but there was no obvious metric to catch the bad ones. I had to come up with an evaluation setup that didn’t just chase accuracy like a classification task. This was about usefulness, relevance, and yes, latency.

Here’s what actually worked for me.

Two Layers of Evaluation: Retrieval and Response

I split my eval into two parts:

  1. Retrieval Quality — Did we pull in the right context?
  2. Final Response Quality — Did the model answer the query well?

If you’re only measuring the output, you’re missing half the story. I’ve had retrieval bugs where the LLM still produced a confident answer—just to the wrong question.

Retrieval Eval: Vector Similarity & Embedding-Based Scoring

You might be wondering: how do I know if the retrieved docs were relevant?

What worked for me was using embedding-based similarity between the query and each retrieved chunk, and comparing that to the ground-truth answer (if available).

Here’s a quick example using sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def score_retrieval(query, docs, expected_answer):
    query_vec = model.encode(query, convert_to_tensor=True)
    doc_vecs = model.encode(docs, convert_to_tensor=True)
    answer_vec = model.encode(expected_answer, convert_to_tensor=True)

    # How close are docs to the query
    query_scores = util.cos_sim(query_vec, doc_vecs)

    # How close are docs to the ground-truth answer
    answer_scores = util.cos_sim(answer_vec, doc_vecs)

    return {
        "query_similarity": query_scores.tolist(),
        "answer_similarity": answer_scores.tolist()
    }

This helped me catch when the retriever was bringing in off-topic content—even if the LLM faked its way through it.

Response Eval: BLEU, ROUGE? Meh. Go Embedding.

BLEU and ROUGE felt… off. I mean, they’re fine for exact match tasks, but RAG answers are often phrased differently from the ground truth. That’s why I leaned into embedding-based eval for responses too.

def score_response_embedding_based(predicted_answer, reference_answer):
    pred_vec = model.encode(predicted_answer, convert_to_tensor=True)
    ref_vec = model.encode(reference_answer, convert_to_tensor=True)
    similarity = util.cos_sim(pred_vec, ref_vec)
    return similarity.item()

This metric tracked much better with what I saw qualitatively. If you’ve got multiple reference answers per question (e.g., human annotators), take the max score across them.
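
In code, that's just a max over the per-reference scores:

def score_against_references(predicted_answer, reference_answers):
    # Best match across multiple human-written references.
    return max(
        score_response_embedding_based(predicted_answer, ref)
        for ref in reference_answers
    )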

Manual Review Still Matters (Unfortunately)

Let me be blunt: you can’t fully automate this.

At some point, I had to sit down and read the responses—especially for edge cases. I built a tiny Streamlit app to show:

  • Query
  • Retrieved docs
  • Final LLM answer
  • Expected answer (if available)
  • Eval score (for correlation sanity checks)

Super low effort, but saved me hours of trying to interpret raw metrics in a CSV.
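
For reference, the app was little more than this skeleton. load_eval_records and the field names are hypothetical, so adapt them to however you log your runs.

import streamlit as st

records = load_eval_records()  # hypothetical loader for your eval dump (JSONL, CSV, whatever)

idx = int(st.number_input("Record", min_value=0, max_value=len(records) - 1, value=0))
rec = records[idx]

st.subheader("Query")
st.write(rec["query"])

st.subheader("Retrieved docs")
for doc in rec["retrieved_docs"]:
    st.text(doc)

st.subheader("Final LLM answer")
st.write(rec["llm_response"])

st.subheader("Expected answer")
st.write(rec.get("expected_answer", "n/a"))

st.metric("Eval score", round(rec.get("eval_score", 0.0), 3))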

Metrics That Actually Mattered

In production, here’s what I tracked:

  • Mean Retrieval Similarity
  • Mean Answer Similarity
  • Token Count (Input/Output) — for cost
  • LLM Latency
  • Cache Hit Rate
  • User Feedback Tags (thumbs up/down or feedback form)

Relevance mattered more than “correctness.” And latency? If you’re pulling context from a slow disk-based vector store and calling an API LLM, users will feel it.

That’s the gist. No eval metric is perfect, but embedding-based similarity got me 80% of the way there. The rest? Just sit down and look at the responses.


10. Optimization Tips from the Trenches

“The model worked… until it didn’t. Then I had to get clever.”

At first, everything felt fine during dev. But once I started pushing real traffic—multiple users, varying queries, and longer context chunks—things broke in subtle (and expensive) ways. Here’s what saved me.

How I Reduced Hallucinations

This might surprise you: the LLM hallucinates less when you give it less freedom. Not fewer tokens, but less ambiguity.

Here’s what helped:

  • Force it to cite: I started adding phrases like “based on the context above” or even “answer only using the provided documents” to the prompt. That nudged the LLM to ground itself.
  • Inject source snippets tactically: Instead of dumping entire documents, I passed only the top 2–3 chunks with the highest similarity scores. Quality over quantity mattered.
  • Fallbacks: If the retrieval score dropped below a threshold, I skipped the LLM call entirely and returned a default response like:
    “No relevant information found. Try rephrasing?”

This fallback logic saved me from generating nonsense when the vector store couldn’t find anything decent.

Latency Tips That Actually Helped

Latency crept up fast as I added context injection, prompt formatting, and multiple model calls. Here’s what helped:

  • Async everywhere: I moved all LLM and embedding calls to asyncio. In LangChain or raw Python, this alone shaved hundreds of ms.
import asyncio
import httpx

async def get_response(prompt):
    async with httpx.AsyncClient() as client:
        res = await client.post("http://localhost:8000/generate", json={"prompt": prompt})
        return res.json()
  • Local embeddings: I switched from OpenAI’s embedding API to a local sentence-transformers model. Yes, it used more RAM—but inference time was <50ms, and I avoided outbound calls.
  • Pre-tokenization caching: This was subtle—caching not just chunk embeddings, but also pre-tokenized documents for faster context assembly. It helped more than I expected.

Cost-Cutting Without Cutting Corners

You might be wondering: how do I make this cheaper without going dumb?

Here’s what worked:

  • Quantized models for retrieval: I ran a 4-bit quantized MiniLM model for embedding generation on a modest GPU. Throughput went up, cost went down.
  • Chunk smarter, not more: I tuned chunk overlap aggressively. Instead of blindly overlapping 20–30 tokens, I calculated semantic boundaries (e.g., between sections). Less bloat, fewer tokens sent downstream.
  • Batching: Especially for retrieval, I batched embedding queries when processing multiple user inputs. It’s easy to overlook, but it can halve your costs if you’re not doing it (one-liner below).
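
With sentence-transformers that batching is essentially a one-liner: encode the whole list in one call instead of looping query by query.

# `model` is the sentence-transformers model from Section 4; batch_size is worth tuning per GPU.
query_vectors = model.encode(user_queries, batch_size=64, normalize_embeddings=True)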

A Final Bit of Glue Code (Optimization Edition)

Here’s a slimmed-down version of the optimization layer I used in production:

class RAGPipeline:
    def __init__(self, embedder, retriever, llm, cache, threshold=0.4):
        self.embedder = embedder
        self.retriever = retriever
        self.llm = llm
        self.cache = cache
        self.threshold = threshold

    async def run(self, query):
        if self.cache.contains(query):
            return self.cache.get(query)

        query_vec = self.embedder.encode(query)
        docs, scores = self.retriever.retrieve(query_vec, top_k=3)

        if max(scores) < self.threshold:
            return "No relevant documents found. Please try again."

        prompt = self.construct_prompt(query, docs)
        response = await self.llm.generate(prompt)
        self.cache.set(query, response)
        return response

    def construct_prompt(self, query, docs):
        context = "\n\n".join(docs)
        return f"Based on the following context, answer the question:\n\n{context}\n\nQuestion: {query}"

This version bakes in caching, retrieval thresholds, and async LLM calls—all of which made a serious difference when I scaled.


11. Final Thoughts (No Fluff)

If I had to start over, I’d do two things differently:

  1. Invest in eval earlier. I waited too long to build a feedback loop. I should’ve had an eval dashboard from day one—even if manual.
  2. Decouple components. Early on, I glued everything together tightly. Later, I wished I had split it all into modular parts: prompt templates, vector DB adapters, retriever classes, etc.

When RAG Doesn’t Work (Yes, It Happens)

RAG isn’t a silver bullet. I’ve had situations where:

  • The data was too sparse (e.g., support logs that didn’t capture context well)
  • The queries were too abstract (“What’s the best strategy?” type)
  • The users just wanted a chatbot—not document-grounded answers

In those cases, a fine-tuned model worked better. Sometimes, it’s not about clever prompting—it’s about choosing the right architecture altogether.

Fork It, Hack It, Make It Yours

Here’s the deal: I built this pipeline to be modular and swappable. Try it with your own vector DB, your own LLM, or your own prompt structure.

Fork it, run it, stress-test it—and let me know what breaks. That’s the fun part.
