Building an Open Source RAG Application: A Hands-On Guide

1. Introduction: Why I Built This and What You’ll Get

“A tool is only as good as the friction it removes.”

I didn’t set out to build yet another RAG demo. I built this because I was tired of the disconnect between flashy LLM prototypes and what it actually takes to ship something usable.

You’ve probably seen the same thing — half-baked notebooks on GitHub, missing glue code, no retriever logic, and vague promises that “LangChain handles everything.” It doesn’t.

What I needed was a production-practical, open-source RAG pipeline I could fully control — something I could spin up locally, plug into different models, and actually trust in a real application.

Tools I Chose (and Why)

I’ll say this upfront: I didn’t go with LangChain for this project. Personally, I found it too opinionated and too slow for debugging when things break (and they will).

I’ve used LangChain in previous builds — it’s great for prototyping, but this time I wanted full control over my chain logic and a cleaner mental model. So I went with LlamaIndex, and I’m glad I did.

For the vector store, I picked ChromaDB. Here’s the deal: it’s lightweight, embeddable, and makes local dev stupidly fast.

I’ve worked with Weaviate and Qdrant before — they’re great when you’re dealing with more complex metadata filtering at scale — but for a local-first setup, Chroma was a no-brainer.

And for the LLM? I went with Ollama, running mistral locally. I wanted full control over latency, tokens, and cost. I’ve used OpenAI plenty — and yes, I’ll show how to plug that in too — but for this guide, I’m focusing on a 100% open-source stack.

What You’ll Get from This Guide

This guide walks you through how I actually built the whole thing — from ingestion to embeddings, from retrieval to a working RAG query API. Every step includes working, modular code. No copy-paste dumps — just clean, reusable Python you can plug into your own projects.

It’s opinionated, yes — because I’ve been burned by vague abstractions and I want to show you what actually works. If something gave me trouble, I’ll show you what went wrong and how I fixed it. This isn’t just code — this is experience.

Scope of the Build

  • Stack: Fully open-source (LlamaIndex + ChromaDB + Ollama)
  • LLM: Local-first, but you can easily switch to OpenAI if needed
  • Interface: FastAPI-based API, no frontend bloat
  • Deployment: Dockerized, reproducible, minimal dependencies
  • Goal: Reusable RAG engine that runs on your machine or inside a Docker container with minimal fuss

2. Architecture Overview

Before we dive into code, let’s look at the moving parts.

Here’s the minimal system architecture I ended up with:

                     +--------------------------+
                     |      Local Documents     |
                     |  (PDFs, Markdown, etc.)  |
                     +------------+-------------+
                                  |
                          [Data Ingestion]
                                  |
                                  v
                     +------------+-------------+
                     |   Chunking + Embeddings  |
                     | (HuggingFace embeddings) |
                     +------------+-------------+
                                  |
                                  v
                     +------------+-------------+
                     |    Chroma Vector Store   |
                     +------------+-------------+
                                  ^
                                  |
                          [Retriever Logic]
                                  |
                                  v
                     +------------+-------------+
                     |    Prompt + LLM Query    |
                     |   (via Ollama: Mistral)  |
                     +------------+-------------+
                                  |
                                  v
                     +------------+-------------+
                     |    FastAPI (Query API)   |
                     +------------+-------------+
                                  |
                                  v
                          [Optional Frontend]

Key Components (And What I Customized)

Data Ingestion

I wrote a modular loader for PDFs and Markdown. Not just file reading — it handles cleaning, splitting, and metadata tagging upfront. You’ll see that code soon.

Chunking & Embeddings

This is where most RAG guides fall apart. I’ll show you how I chunked the docs using LlamaIndex’s SentenceSplitter, and embedded them using nomic-embed-text-v1 (via HuggingFace). If you’re using your own model or OpenAI, I’ll show you how to swap it out.

Vector Store

ChromaDB, running locally with persistence enabled. Works great out of the box — but I had to tune the collection and metadata setup a bit to make retrieval fast and relevant.

Retriever

I wrote a custom retriever to support filtered search and Maximal Marginal Relevance (MMR). I’ll walk you through the logic — no black-boxing here.

LLM (Local via Ollama)

I used mistral — quick to load, handles decent context windows, and works fine for my needs. The entire pipeline is model-agnostic though.

API Layer

FastAPI powers the interface. It exposes a /query endpoint that takes in your question and returns the answer + source context. Supports both sync and async calls.

Frontend (Optional)

Personally, I didn’t need a frontend for this — but I’ll drop a basic Gradio script in case you want a playground-style interface.

Deployment

This whole thing runs inside Docker, with GPU passthrough if you have it. I’ll show you how I set that up, including the base image tweaks I made to keep build time low.


3. Setting Up the Stack (with Justified Choices)

“If the foundation is shaky, don’t expect the pipeline to stand.”

You might be tempted to rush past setup and get to retrieval — I get it. I’ve done that too. But I’ll be honest: 80% of the long-term pain I’ve felt in RAG projects came from lazy environment configs and fragile dependencies. So I’ve made it a habit to lock things down early.

Let me walk you through how I built the stack from the ground up — clean, reproducible, and portable.

Environment Setup: Keep It Dockerized

I didn’t want to pollute my local Python setup (I never do), so I built the whole thing around a Dockerfile + docker-compose.yml combo.

Here’s the Dockerfile I used: minimal bloat, with the Ollama CLI installed inside the image. GPU passthrough, if you have it, happens at runtime rather than in the image (see Section 9):

FROM python:3.10-slim

# Install basic dependencies
RUN apt-get update && apt-get install -y \
    git curl build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Ollama
RUN curl -fsSL https://ollama.com/install.sh | bash

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app code
COPY . /app
WORKDIR /app

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Here’s the docker-compose.yml, which starts Ollama and your FastAPI app together:

version: '3.8'

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  rag-app:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - ollama
    environment:
      - MODEL_NAME=mistral
      - OLLAMA_BASE_URL=http://ollama:11434  # reach the Ollama service by name from inside this container

volumes:
  ollama_data:

My requirements.txt was intentionally lightweight:

fastapi
uvicorn
requests
llama-index
chromadb
sentence-transformers
PyPDF2
markdown

No LangChain, no unnecessary wrappers. I wanted control — and fewer things breaking silently.

Vector DB: Why I Picked Chroma

I’ve worked with Qdrant and Weaviate on larger systems, and they’re great — especially when you’re running heavy workloads and need structured metadata filtering. But for a local-first, zero-hassle setup, nothing beats ChromaDB.

  • It runs in-process — no external server.
  • Blazing-fast startup and queries.
  • Easy to persist across sessions.

Here’s how I initialized it:

import chromadb
from chromadb.config import Settings

# PersistentClient keeps the index on disk across restarts (Chroma 0.4+ API)
chroma = chromadb.PersistentClient(
    path="./chroma_store",
    settings=Settings(anonymized_telemetry=False)
)

collection = chroma.get_or_create_collection(name="docs")

And for metadata filtering?

results = collection.query(
    query_embeddings=[embedding],
    n_results=5,
    where={"source": "whitepaper"}  # Filter by doc type
)

Super lightweight, zero ceremony. For most use cases under 100K docs, this has been more than enough.

LLM Setup: Why I Went Local (and How I Did It)

Here’s the deal: I didn’t want to burn through tokens or depend on OpenAI uptime for local development. So I ran Mistral 7B locally via Ollama, which exposes a simple REST API on http://localhost:11434.

You can spin it up like this:

ollama run mistral

Then call it like this from Python:

import os
import requests

# Inside docker-compose, point this at the service name: http://ollama:11434
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

def query_llm(prompt):
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        # stream=False makes Ollama return a single JSON object instead of JSON lines
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]

I’ve tested this pipeline with llama2, gemma, and even phi — swapping models is as simple as changing a string.

That said, if you’re deploying to prod and need reliability, just plug in OpenAI’s gpt-4-turbo or gpt-3.5-turbo. LlamaIndex abstracts it cleanly.
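For reference, here’s roughly what that swap looks like on recent LlamaIndex releases (0.10+), where the active LLM is set once on the global Settings object. Treat the exact import paths as version-dependent; older releases expose these classes elsewhere.

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama        # pip install llama-index-llms-ollama
# from llama_index.llms.openai import OpenAI      # pip install llama-index-llms-openai

# Local-first default for this guide
Settings.llm = Ollama(model="mistral", base_url="http://localhost:11434")

# Hosted fallback if you need the reliability:
# Settings.llm = OpenAI(model="gpt-3.5-turbo")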

Tooling: Why I Picked LlamaIndex (and Skipped the Others)

I’ve used LangChain in enough projects to say this bluntly: it’s powerful, but also unpredictable.

Every time I needed to debug custom retrieval logic or inject tracing, I found myself fighting the abstractions. So for this project, I went with LlamaIndex — and I haven’t looked back.

  • It’s modular.
  • The API surface is predictable.
  • You get full control over each component — retrievers, chunkers, node parsers, etc.

And because I was running everything locally, I skipped Streamlit/Gradio for this build. Personally, I find a clean FastAPI interface more flexible, especially if you’re planning to integrate it with something downstream (like a Slack bot or internal dashboard).

Here’s a basic FastAPI endpoint:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/query")
def query_rag(query: Query):
    answer = my_rag_pipeline(query.question)
    return {"answer": answer}

4. Data Ingestion and Chunking (Where Most People Screw Up)

“Garbage in, garbage out” might sound cliché, but it hits hard when your RAG app returns garbage for a well-formed query.

Let’s be real — chunking is where most people shoot themselves in the foot. I’ve been there. At one point, I thought I had everything wired up perfectly… until my app started hallucinating facts that weren’t in any of my documents. Turned out, my chunks were so poorly scoped they might as well have been abstract poetry.

Here’s exactly how I handle data ingestion and chunking now — from file loading to preprocessing to chunk strategies that actually work.

Ingesting Raw Data

I needed to load a mix of PDFs, Markdown, and structured JSON. Some had metadata, some didn’t. So I wrote custom loaders that handle all three gracefully, instead of relying on over-abstracted tools.

Here’s a minimal (and extensible) data loader I wrote:

from pathlib import Path
from PyPDF2 import PdfReader
import json
import markdown

def load_documents(directory):
    docs = []
    for file_path in Path(directory).glob("**/*"):
        if file_path.suffix == ".pdf":
            docs.append(load_pdf(file_path))
        elif file_path.suffix == ".md":
            docs.append(load_markdown(file_path))
        elif file_path.suffix == ".json":
            docs.extend(load_json(file_path))
    return docs

def load_pdf(path):
    reader = PdfReader(str(path))
    text = "\n".join(page.extract_text() for page in reader.pages if page.extract_text())
    return {"content": text, "metadata": {"source": str(path)}}

def load_markdown(path):
    with open(path, "r") as f:
        html = markdown.markdown(f.read())
    return {"content": html, "metadata": {"source": str(path)}}

def load_json(path):
    with open(path, "r") as f:
        data = json.load(f)
    return [{"content": item["body"], "metadata": item.get("meta", {})} for item in data]

This loader gives me full control — no silent failures, no unnecessary libraries.

Chunking Strategy: What Actually Works

You might be wondering: “Should I just use fixed-size chunks and call it a day?”

That’s what I did… until I ran into weird context fragmentation issues — like splitting a single sentence across two chunks.

What I learned (painfully):
Chunking is not just about size — it’s about preserving semantic boundaries. Once I switched to recursive, sentence-aware chunking with overlap, my retrieval quality improved noticeably.

Here’s how I implemented it using LlamaIndex’s SentenceSplitter:

# On llama-index >= 0.10 this import lives at: from llama_index.core.node_parser import SentenceSplitter
from llama_index.text_splitter import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=64
)

chunks = []
for doc in load_documents("data/"):
    chunked = splitter.split_text(doc["content"])
    for chunk in chunked:
        chunks.append({
            "content": chunk,
            "metadata": doc["metadata"]
        })

This setup gave me:

  • Better semantic coherence.
  • Fewer hallucinations.
  • Smoother answers from the LLM.

And because I preserved metadata in each chunk, I could easily track sources during retrieval — super helpful for debugging and citations.
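To make that concrete, here’s a tiny sketch of how the preserved metadata surfaces at answer time. It reuses query_llm from earlier and the build_prompt helper that appears in Section 7; the field names mirror the loader above, so adjust them to your own schema.

def answer_with_sources(question, retrieved_chunks):
    # retrieved_chunks: list of {"content": ..., "metadata": {"source": ...}} dicts
    answer = query_llm(build_prompt([c["content"] for c in retrieved_chunks], question))
    sources = sorted({c["metadata"].get("source", "unknown") for c in retrieved_chunks})
    return {"answer": answer, "sources": sources}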

Preprocessing: The Stuff Nobody Talks About

This might surprise you: cleaning your text is often more valuable than tweaking your model.

Some of the documents I loaded were full of:

  • Boilerplate headers/footers from PDFs
  • Inline code snippets from Markdown
  • Encoding artifacts (\xa0, weird ligatures, etc.)

So I added a preprocessing step that strips noise but keeps structure. Here’s a small helper I still use:

import re

def clean_text(text):
    text = re.sub(r'\n{2,}', '\n', text)  # collapse multiple newlines
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # remove non-ASCII
    text = re.sub(r'\s+', ' ', text).strip()  # normalize whitespace
    return text

Before chunking, I run every document through this cleaner:

doc["content"] = clean_text(doc["content"])

The effect? Cleaner embeddings, tighter similarity scores, and way fewer surprises in retrieval.

Bonus: Domain-Specific Adjustments That Paid Off

In one of my use cases — financial regulation documents — I found that even sentence-based chunking wasn’t enough. Paragraphs carried meaning across bullet lists and inline tables. So I added a heuristic:

if "§" in chunk or "Act" in chunk:
    chunk = chunk.replace("\n", " ")  # flatten legally dense sections

It’s not elegant, but it worked. My point is: don’t treat chunking like a one-size-fits-all problem. Measure what works for your documents.


5. Embedding and Vector Store Indexing

“An embedding model is only as good as the way you pipeline it.”

If there’s one area I underestimated early on, it was the embedding pipeline. Not the model — I knew which ones were good — but the plumbing. Batch processing, retries, deduplication, memory constraints, metadata injection… the stuff most blog posts conveniently skip over.

Here’s how I actually do it.

Choosing the Right Embedding Model

I’ve used three in production:

  • sentence-transformers/all-MiniLM-L6-v2: Fast, solid baseline, great for CPU inference.
  • BAAI/bge-small-en: Better alignment with dense retrieval — especially useful with MMR.
  • OpenAIEmbeddings: High-quality, but not local and not cheap at scale.

When latency and cost were critical, I went local. For zero-setup demos, I used OpenAI. I also ran side-by-side evals — OpenAI models typically gave +5-8% better MRR in QA tasks, but with 5x the latency and $$$.

You’ve got to balance this yourself. Personally, I default to bge-small for local RAGs.

Batch Embedding with Metadata and Retries

This is a pipeline I built to handle chunk embedding with retries, batching, and full metadata tracking.

from sentence_transformers import SentenceTransformer
import chromadb
import uuid
import time

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="my_rag_collection")

def embed_documents(docs, batch_size=32, max_retries=3):
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        texts = [doc["content"] for doc in batch]
        ids = [str(uuid.uuid4()) for _ in batch]
        metadatas = [doc["metadata"] for doc in batch]

        for attempt in range(1, max_retries + 1):
            try:
                # normalize_embeddings=True gives unit vectors, so cosine ranking stays stable
                embeddings = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
                collection.add(
                    documents=texts,
                    embeddings=embeddings.tolist(),
                    ids=ids,
                    metadatas=metadatas
                )
                break
            except Exception as e:
                print(f"Batch {i} failed (attempt {attempt}/{max_retries}): {e}")
                time.sleep(2)

This setup:

  • Batches for speed.
  • Normalizes embeddings for cosine distance.
  • Tags each vector with full metadata — not just title and source, but timestamps, tags, doc IDs.

Float16 vs. Float32? Yes, It Matters.

This might seem like an implementation detail, but it’s not.

If you store embeddings in Float16, you cut memory use in half — great for local vector DBs like Chroma or Qdrant.

But be careful: some similarity metrics get noisy with low precision vectors, especially with cosine. I use Float32 during inference, then cast down only if I know my retrieval threshold has a safe margin.
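A quick way to see the trade-off, reusing the SentenceTransformer model defined above (bge-small outputs 384-dimensional vectors, float32 by default):

import numpy as np

emb32 = model.encode(["example sentence"], convert_to_numpy=True)  # float32 by default
emb16 = emb32.astype(np.float16)                                   # half the memory per vector

print(emb32.nbytes, emb16.nbytes)  # 1536 vs 768 bytes for a 384-dim embedding
# Cast back up before precision-sensitive similarity math
score = float(emb16.astype(np.float32)[0] @ emb32[0])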

Deduplication Logic That Caught Me Off Guard

One lesson I learned the hard way: documents often repeat across sources (especially scraped docs). That bloated my index and killed precision.

Here’s a simple hash-based deduplication I now run pre-index:

import hashlib

def hash_text(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

seen_hashes = set()
clean_docs = []

for doc in docs:
    h = hash_text(doc["content"])
    if h not in seen_hashes:
        seen_hashes.add(h)
        clean_docs.append(doc)

It’s not bulletproof, but it caught ~15% duplicates in one of my corpora.


6. Retriever Logic: Getting More than Just Top-K

“If your retriever is dumb, your RAG is dead.”

Too many RAG tutorials treat retrieval like this:

results = vector_store.similarity_search(query, k=3)

That’s fine for toy demos. In production? You’ll want more.

MMR + Filtering + Custom Scoring

I found the default top-K retrieval too brittle — especially when chunks were redundant or contextually adjacent. That’s where Maximal Marginal Relevance (MMR) gave me a clear advantage.

Here’s a stripped-down custom retriever I used:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def mmr(query_emb, docs_emb, docs_texts, k=5, lambda_param=0.5):
    selected = []
    remaining = list(range(len(docs_emb)))

    query_emb = query_emb.reshape(1, -1)
    docs_emb = np.array(docs_emb)

    while len(selected) < k and remaining:
        mmr_score = []
        for i in remaining:
            sim_to_query = cosine_similarity(query_emb, docs_emb[i].reshape(1, -1))[0][0]
            sim_to_selected = max([cosine_similarity(docs_emb[i].reshape(1, -1), docs_emb[j].reshape(1, -1))[0][0] for j in selected] + [0])
            score = lambda_param * sim_to_query - (1 - lambda_param) * sim_to_selected
            mmr_score.append(score)

        selected_idx = remaining[np.argmax(mmr_score)]
        selected.append(selected_idx)
        remaining.remove(selected_idx)

    return [docs_texts[i] for i in selected]

I use this with a pre-filtered set of candidate chunks, sometimes narrowed by tags or timestamps.
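Here’s a rough sketch of that flow, wiring the Chroma collection and SentenceTransformer model from the previous section into the mmr() function above. This is my glue code, not a library API, so rename things to fit your setup.

query = "how do we rotate API keys?"
query_emb = model.encode([query], normalize_embeddings=True)[0]

candidates = collection.query(
    query_embeddings=[query_emb.tolist()],
    n_results=20,                        # over-fetch, then let MMR diversify
    where={"source": "whitepaper"},      # tag/timestamp filters go here
    include=["documents", "embeddings"],
)

top_chunks = mmr(
    query_emb,
    candidates["embeddings"][0],
    candidates["documents"][0],
    k=5,
    lambda_param=0.5,
)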

Custom Retriever Class

If you’re using LangChain or LlamaIndex, I highly recommend abstracting this into a custom retriever. Here’s a simplified one I wrote with LlamaIndex:

import numpy as np
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever

class CustomMMRRetriever(BaseRetriever):
    def __init__(self, vector_index, lambda_param=0.5):
        super().__init__()
        # Over-fetch candidates, then let MMR prune them to a diverse top-5
        self._base_retriever = vector_index.as_retriever(similarity_top_k=20)
        self.lambda_param = lambda_param

    def _retrieve(self, query_bundle: QueryBundle):
        candidates = self._base_retriever.retrieve(query_bundle)
        return mmr(
            np.asarray(query_bundle.embedding),       # assumes the query embedding is populated upstream
            [c.node.embedding for c in candidates],   # and that node embeddings are stored/returned
            [c.node.get_content() for c in candidates],
            k=5,
            lambda_param=self.lambda_param,
        )

It’s just more flexible than out-of-the-box retrievers, especially when you want filtering, deduplication, and fallback strategies.

Caching: It’s Not Optional

You might be tempted to skip caching, especially when developing locally. Don’t.

I cache:

  • Embeddings for known queries
  • Final responses for repeated questions
  • Even filtered retriever results if the query is exact match

This shaved 30–40% off latency in one of my apps.
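None of this needs fancy infrastructure. A dictionary keyed on a normalized hash of the query covers the first two bullets; swap in Redis or diskcache once you outgrow a single process. A minimal sketch (the names here are mine, not a library API):

import hashlib

_embedding_cache = {}
_response_cache = {}

def _key(text):
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

def cached_embed(query, embed_fn):
    k = _key(query)
    if k not in _embedding_cache:
        _embedding_cache[k] = embed_fn(query)
    return _embedding_cache[k]

def cached_answer(query, pipeline_fn):
    k = _key(query)
    if k not in _response_cache:          # exact-match hit skips retrieval + LLM entirely
        _response_cache[k] = pipeline_fn(query)
    return _response_cache[k]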


7. RAG Pipeline: Putting It All Together

“Anyone can build components. It’s the orchestration that makes it production-grade.”

After stitching together all the moving parts — loaders, chunkers, embedding models, retrievers — I finally hit the part where everything needs to talk to everything else without falling apart. This is where the real RAG architecture comes to life.

I’ve iterated through a few versions of my RAG stack. Here’s what I landed on for most of my use cases — reliable, fast, and extensible.

The Pipeline Layout

I’ll break it down by stage:

  1. Receive user query
  2. Embed the query
  3. Run retriever (MMR + filters)
  4. Select top chunks based on token budget
  5. Inject context into prompt
  6. Run LLM (streaming if enabled)
  7. Post-process response (e.g., citations, cleanup)
  8. Return the final answer

Here’s a trimmed version of the actual code I use:

def build_prompt(context_chunks, query):
    context_text = "\n\n".join([f"[{i+1}] {chunk}" for i, chunk in enumerate(context_chunks)])
    prompt = f"""You are an expert assistant. Use the context below to answer the question.

Context:
{context_text}

Question:
{query}

Answer:"""
    return prompt

def rag_pipeline(query, retriever, llm, tokenizer, max_tokens=3500):
    embedded_query = embed_query(query)
    retrieved_chunks = retriever.retrieve(embedded_query)

    # Rank and trim context to fit token limits
    token_count = 0
    context = []
    for chunk in retrieved_chunks:
        tokens = len(tokenizer.encode(chunk))
        if token_count + tokens > max_tokens:
            break
        context.append(chunk)
        token_count += tokens

    prompt = build_prompt(context, query)
    response = llm.generate(prompt)

    return postprocess_response(response, context)

Avoiding Hallucinations & Repetition

You might be wondering: what’s the trick to keeping LLMs grounded in the context?

Here’s what worked for me:

  • I always inject chunk numbers and structure into the prompt — so the model “sees” boundaries and sources.
  • I add a sentence in the prompt saying: “If the answer is not found in the context, say ‘I don’t know.’”
  • When the model responds, I match sentences back to retrieved chunks using fuzzy matching (a rough sketch of this check follows below). If something seems out-of-place, I flag it.

Is this bulletproof? No. But it reduced hallucinations significantly — especially when I tuned my retriever to return more diverse context snippets.
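That last check is easy to sketch with the standard library. This is the rough shape of the idea; the 0.6 threshold is a placeholder to tune per corpus, not a recommended value:

from difflib import SequenceMatcher
import re

def _sentences(text):
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def flag_ungrounded(answer, context_chunks, threshold=0.6):
    """Flag answer sentences that don't fuzzily match any sentence in the retrieved context."""
    context_sentences = [s.lower() for chunk in context_chunks for s in _sentences(chunk)]
    flagged = []
    for sentence in _sentences(answer):
        best = max(
            (SequenceMatcher(None, sentence.lower(), ctx).ratio() for ctx in context_sentences),
            default=0.0,
        )
        if best < threshold:
            flagged.append(sentence)
    return flagged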

My Evaluation Loop

I didn’t rely on ROUGE or BLEU. For my domain (enterprise support documents), those metrics were useless.

Instead, I built a feedback loop using this structure:

{
  "query": "...",
  "response": "...",
  "expected_keywords": ["X", "Y"],
  "flags": {
    "hallucination": false,
    "answer_quality": "good" | "incomplete" | "irrelevant",
  }
}

Every week, I’d review a sample batch manually, then update retriever settings or prompt format.

Eventually, I built a small Streamlit app for internal reviewers to click and grade results in real time.
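The reviewer app doesn’t need to be fancy. Here’s a rough reconstruction of the idea (my sketch, assuming the logged samples live in a JSONL file that follows the structure above):

import json
import streamlit as st

records = [json.loads(line) for line in open("eval_samples.jsonl")]  # assumed log file

idx = st.number_input("Sample #", min_value=0, max_value=len(records) - 1, value=0)
rec = records[int(idx)]

st.write("**Query:**", rec["query"])
st.write("**Response:**", rec["response"])

quality = st.radio("Answer quality", ["good", "incomplete", "irrelevant"])
hallucination = st.checkbox("Hallucination?")

if st.button("Save grade"):
    rec["flags"] = {"hallucination": hallucination, "answer_quality": quality}
    with open("graded.jsonl", "a") as f:
        f.write(json.dumps(rec) + "\n")
    st.success("Saved")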


8. Serving the Application

“If you can’t expose it as an API, it’s not a product — it’s a script.”

Once I had a stable RAG pipeline, exposing it via an API was the next step. I wanted something lightweight, async, and easy to test.

FastAPI Endpoint (Streaming + Non-Streaming)

Personally, I prefer FastAPI — clean, modern, and plays well with async LLMs.

Here’s how I serve the core pipeline:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, JSONResponse

app = FastAPI()

@app.post("/query")
async def query_rag(request: Request):
    body = await request.json()
    user_query = body.get("query", "")

    async def stream_response():
        async for chunk in rag_stream(user_query):  # generator from your LLM wrapper
            yield chunk

    if body.get("stream", False):
        return StreamingResponse(stream_response(), media_type="text/plain")
    else:
        response = await run_rag(user_query)
        return JSONResponse({"response": response})

This handles both streaming and non-streaming responses. For local LLMs like Llama2/Ollama, I pipe stdout directly into the stream. For OpenAI, I use their async API with stream=True.
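The rag_stream generator itself is just a thin wrapper around whichever backend you’re streaming from. Here’s a rough sketch against Ollama’s newline-delimited JSON streaming, using httpx so the event loop isn’t blocked; build_streaming_prompt is a hypothetical stand-in for your own retrieval + prompt assembly step.

import json
import httpx

async def rag_stream(query: str):
    prompt = build_streaming_prompt(query)  # hypothetical: retrieval + prompt assembly
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST",
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": prompt, "stream": True},
        ) as resp:
            async for line in resp.aiter_lines():
                if not line:
                    continue
                piece = json.loads(line)
                if not piece.get("done"):
                    yield piece.get("response", "")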

Optional UI: Gradio + Real-Time Debugging

Sometimes, I just need to inspect the full query → context → prompt → answer path. Here’s a minimal Gradio setup I used internally:

import gradio as gr

def rag_gradio_interface(query):
    context_chunks, prompt, answer = run_debuggable_rag(query)
    return f"Prompt:\n\n{prompt}\n\nAnswer:\n\n{answer}"

gr.Interface(fn=rag_gradio_interface, inputs="text", outputs="text").launch()

Bonus: I added a debug toggle to expose token counts, selected chunk IDs, and latency breakdowns.

Auth, Logging, and Error Handling

This might surprise you: my first RAG app had no logging. Debugging a failed query felt like blindfolded surgery.

Now I log:

  • Incoming queries
  • Chunk IDs and token lengths
  • Prompt length and response time
  • Errors with traceback

For auth, I just use an API key in headers — simple enough unless you’re deploying publicly.
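Both concerns fit in a single FastAPI middleware hooked onto the app defined above. This is a minimal sketch, assuming the key arrives in an x-api-key header and is configured via an environment variable of my own naming:

import logging
import os
import time

from fastapi import Request
from fastapi.responses import JSONResponse

logger = logging.getLogger("rag_api")
API_KEY = os.getenv("RAG_API_KEY", "change-me")   # assumption: key supplied via env

@app.middleware("http")
async def log_and_authenticate(request: Request, call_next):
    if request.headers.get("x-api-key") != API_KEY:
        return JSONResponse({"error": "unauthorized"}, status_code=401)
    start = time.perf_counter()
    response = await call_next(request)
    logger.info("%s %s -> %s in %.0fms", request.method, request.url.path,
                response.status_code, (time.perf_counter() - start) * 1000)
    return response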


9. Deploying Locally or on a VM

“Your code doesn’t matter if it’s just sitting in a Jupyter notebook.”

Once I had the RAG system working end-to-end, the next challenge was shipping it to an environment that could run without babysitting. I’ve deployed it on both my local dev machine and a mid-tier VM (think: 2 vCPU, 8GB RAM). Here’s how I made it all work, without needing Kubernetes or hugging a GPU all day.

Docker Setup: My Optimized Config

I’m not a fan of bloated Docker images. I’ve seen people ship 10GB containers just to run a Flask app. Here’s the one I built for my RAG stack — under 2GB with full CPU support.

FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Use uvicorn if FastAPI is used
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "7860"]

And here’s the trick: I used sentence-transformers with quantized models (float16 or int8 using onnxruntime) to keep memory usage sane.

If I needed GPU (e.g., running Llama.cpp), I built a separate Dockerfile using NVIDIA’s base image and mounted the GPU with:

docker run --gpus all -it my-rag-app

But for CPU-only deployment? No sweat. It runs fine with quantized embeddings + streaming output.

Deployment Script (docker-compose)

For local or single-node VM deployments, I’ve used docker-compose with a tiny shell bootstrapper. Here’s the docker-compose.yml I’ve used:

version: '3.9'

services:
  rag_api:
    build: .
    ports:
      - "7860:7860"
    environment:
      - MODEL_PATH=sentence-transformers/all-MiniLM-L6-v2
    volumes:
      - ./data:/app/data
    restart: unless-stopped

With a simple shell script to rebuild and launch:

#!/bin/bash
docker-compose down
docker-compose build
docker-compose up -d

This setup has run for weeks at a time on a $10/month VM with no issues.

Resource Management: RAM and CPU

You might be wondering: how does this thing perform with just 8GB RAM and no GPU?

Here’s what I found in real-world use:

Component                          Peak RAM   Notes
Embedding Model (MiniLM)           ~900MB     Using onnxruntime and float16
Vector Store (FAISS, 100k docs)    ~1.2GB     In-memory, mmap helps
FastAPI + app                      ~300MB     Steady, even under load

Total peak usage: ~2.5GB, leaving enough room for overhead.

What I offload:

  • Any GPU-dependent generation (e.g., long-form answers) → sent to OpenAI.
  • Heavy batch ingestion → run offline, not as part of the live service.

For most cases, local deployment is totally viable — especially if you’re not doing in-place training or chunking on the fly.


10. Performance and Cost Benchmarks

Let’s talk numbers — because eventually, someone’s going to ask “How fast is it?” or “How much does this cost us per user?”

Latency: Cold vs. Warm Queries

Measured on a clean VM using FastAPI with local embeddings + OpenAI (GPT-3.5):

Operation                  Cold Start   Warm
Embed Query                120ms        40ms
Vector Search (FAISS)      10ms         5ms
LLM Completion (OpenAI)    1200ms       800ms

Total Avg Response Time (Warm): ~900ms
Cold Start: Just under 1.5s with caching disabled.

I cached:

  • Embedding vectors for frequent queries.
  • LLM completions if query was an exact match (rare but useful for demos).

Cost (Using OpenAI)

If you’re using gpt-3.5-turbo:

  • Prompt tokens: ~300–600 per request
  • Response tokens: ~200–300
  • Average cost per 1K queries: $0.60–$1.20

On gpt-4, obviously more expensive — closer to $30–$50 per 1K queries depending on your context size.

What I did to keep costs sane:

  • Use smaller LLMs for pre-processing and re-ranking (e.g., Cohere, Claude, or even DistilBERT)
  • Stream only the final generation step with GPT
  • Monitor average prompt length and keep it under control

Vector DB Size vs. Ingestion Time

For FAISS + 100K documents (each ~300 tokens):

  • Embedding time (MiniLM): ~8 seconds per 1K docs
  • Total ingestion time: ~14 minutes on CPU (parallelized in batches)
  • Final index size: ~450MB

If you’re using OpenAI embeddings, expect:

  • ~20 seconds per 1K docs (API roundtrip)
  • ~$0.10 per 1K docs (for text-embedding-ada-002)

Lessons From Scaling Up

This might surprise you: My first scale-up to 100K docs broke because I didn’t dedupe content. Embeddings exploded in size and retrieval quality dropped.

What I fixed:

  • Deduplication: 15%+ of chunks were redundant
  • Token-bounding chunks: Chunks over 512 tokens were skipped by retrievers — I added a pre-check.
  • Parallel ingestion: Switched to ThreadPoolExecutor for embedding batches
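The parallel ingestion change is small. A rough sketch, reusing the model and collection objects from Section 5; batch size and worker count here are placeholders, not tuned values:

import uuid
from concurrent.futures import ThreadPoolExecutor

def embed_batch(batch):
    texts = [doc["content"] for doc in batch]
    return batch, model.encode(texts, normalize_embeddings=True)

batches = [docs[i:i + 64] for i in range(0, len(docs), 64)]

# Encode in worker threads; write to the vector store from the main thread
with ThreadPoolExecutor(max_workers=4) as pool:
    for batch, embeddings in pool.map(embed_batch, batches):
        collection.add(
            documents=[d["content"] for d in batch],
            embeddings=embeddings.tolist(),
            ids=[str(uuid.uuid4()) for _ in batch],
            metadatas=[d["metadata"] for d in batch],
        )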

11. What I’d Do Differently Next Time

“Every system looks clean—until you try scaling it.”

I’m all for shipping fast and iterating, but looking back, there are a few choices I’d rethink if I were starting over. Some of these were hard lessons, others just friction points I tolerated for too long.

Bottlenecks I Hit (and How I Eventually Fixed Them)

1. Over-relying on sentence-transformers in early iterations

I used all-MiniLM-L6-v2 out of habit. It’s fast, sure. But for technical content like code snippets, it just didn’t capture meaning well enough. Retrieval quality suffered — especially on nuanced queries.

What fixed it:
I switched to BAAI/bge-base-en-v1.5 and saw instant gains in relevance. Later I even experimented with co-embedding strategies where the query and document embeddings used different models — that helped niche domains.

2. Naive chunking wrecked semantic coherence

Early on, I was doing fixed-size chunking with basic newline splits. It worked for logs and raw text dumps, but totally tanked performance on structured docs like PDFs.

What fixed it:
I moved to RecursiveCharacterTextSplitter + metadata tagging (page number, section headers), then re-ranked chunks post-retrieval using cosine similarity + token overlap heuristics. Big win.
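The re-ranking step is only a few lines. Here’s a rough sketch of that blend; the 0.8/0.2 weights are placeholders, not the values I shipped with:

import numpy as np

def rerank(query_emb, chunks, chunk_embs, query_tokens, w_cos=0.8, w_overlap=0.2):
    scored = []
    for chunk, emb in zip(chunks, chunk_embs):
        cos = float(np.dot(query_emb, emb) /
                    (np.linalg.norm(query_emb) * np.linalg.norm(emb)))
        overlap = len(set(query_tokens) & set(chunk.lower().split())) / max(len(query_tokens), 1)
        scored.append((w_cos * cos + w_overlap * overlap, chunk))
    return [chunk for _, chunk in sorted(scored, key=lambda pair: pair[0], reverse=True)]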

3. Index bloat

FAISS is fast, but without smart deduplication, my index size ballooned. At 100K+ chunks, retrieval latency started creeping up.

What fixed it:
I hashed content fingerprints with SHA-256 and dropped near-dupes before embedding. Also compressed float32 to float16 in-memory.

Limitations of Open-Source LLMs for RAG (Being Honest Here)

Look, I wanted to go full OSS. I tested Vicuna, Mistral, LLaMA2-7B, even Orca. But here’s the deal:

LLM                 Worked For                Failed At
LLaMA2-7B           Small talk, rephrasing    Dense retrieval QA, multi-hop
Mistral             Short completions         Longform citations
OpenChat / Zephyr   Formatted outputs         Accurate multi-turn logic

They’re getting better, but if your app needs consistently accurate, citation-grounded answers, OSS models still lag — especially under constrained hardware.

My middle ground: OSS for preprocessing, reranking, and metadata extraction. Kept GPT-3.5 for final generation (pay-per-call, cache-heavy).

My RAG Wishlist: What I Wish I Had

Here’s what would’ve made this 10x easier:

  • Tokenizer-aware chunker that respects semantic breaks + token limits
    Current tools are either too naive or too slow.
  • FAISS with built-in deduplication and tag-based filtering
    I had to roll my own hybrid search with metadata filtering — would love native support.
  • Cheap fine-tuning pipeline on small LLMs (without needing 4 A100s)
    Sometimes you just need to teach the model your domain. We’re not there yet — even LoRA setups get hairy fast.
  • Built-in eval framework with retrieval metrics + user feedback integration
    I hacked together my own feedback loop, but there’s still no clean, open-source standard for measuring real RAG performance.

If I were rebuilding this today, I’d spend more time upfront designing for observability, deduplication, and eval tooling — not just the retrieval pipeline. You only realize how critical those things are once real users start typing unpredictable stuff.
