1. Intro: What You’re Going to Build
“There’s no such thing as ‘just prompt engineering’ anymore. If your model can’t retrieve the right info, the output’s already doomed.”
In this guide, I’ll walk you through how I built a production-grade RAG pipeline using only open-source tools. I’m talking about tools I’ve used myself—LlamaIndex, FAISS, LangChain, and open models via HuggingFace Transformers.
Personally, I built this for a document-based Q&A system—something that could pull answers from 10K+ PDFs and respond like someone who actually read them. The whole setup runs locally and avoids closed APIs completely.
When I say production-grade, here’s what I mean: fast enough to be used interactively, accurate enough not to embarrass you, and built in a way that you can scale or ship without duct tape.
And no, we’re not doing a theory lesson here. Just code, decisions, gotchas, and lessons learned.
2. Set Up the Environment
You’ll need a handful of tools to make all the moving parts of this RAG system talk to each other. Here’s what I used and why:
- llama-index: Handles document loading, chunking, and retrieval. I’ve used LangChain too, but I keep coming back to LlamaIndex when I want more control over chunking and embeddings.
- faiss-cpu: Lightweight, fast, and dead-simple to set up for vector storage. If you’re not running massive-scale queries, FAISS just works.
- transformers + sentence-transformers: To embed documents using solid open-source models like BAAI/bge-base-en-v1.5—great quality without needing a GPU.
- langchain (optional): Only if you prefer it for chaining logic or interfacing with other components. I used it in a few side experiments but stuck with direct Python scripts for most of this setup.
Installation
Here’s the minimal setup to get started:
pip install llama-index langchain transformers faiss-cpu sentence-transformers
Quick Tip (from experience): If you’re using quantized models (especially anything using bitsandbytes or GGUF), watch out for PyTorch version mismatches. I’ve had builds break just because the pip-installed version didn’t play nice with CUDA. For most setups, sticking to torch==2.1.0 has saved me the headache.
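If you want to bake that pin into the install step itself, the one-liner from above just grows a version constraint (pin the rest once your stack stabilizes):
pip install llama-index langchain transformers sentence-transformers faiss-cpu torch==2.1.0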
Also, if you’re working in a Jupyter environment, I recommend doing the heavy lifting (like embedding or loading large models) outside the notebook in a separate script—you’ll save yourself from random kernel crashes.
3. Step 1: Load and Chunk Your Data
“Garbage in, garbage retrieved.”
This might sound obvious, but how you load and chunk your data completely determines the quality of answers your RAG system will give. I’ve learned this the hard way—especially when working with multi-page PDFs and docs full of bullet points or nested sections.
Personally, I’ve had better results using LlamaIndex for loading and chunking. The DocumentLoaders in LangChain are decent too, but LlamaIndex gave me more flexibility around metadata injection and custom preprocessing logic.
Here’s a simple script I use to load a folder full of PDFs. You can swap in Markdown, .txt, or even HTML pages without much change.
from llama_index.core import SimpleDirectoryReader
from pathlib import Path
# 👇 Point this to your folder of documents
docs_path = Path("data/") # e.g., data/*.pdf or .md, .txt, etc.
# Load files recursively
documents = SimpleDirectoryReader(docs_path, recursive=True).load_data()
print(f"Loaded {len(documents)} documents")
Once loaded, the next step is chunking—this is where most people get it wrong.
You might be wondering: What chunk size should I use?
Here’s what I’ve found in practice:
- For dense academic PDFs or legal docs: chunk_size=512 tokens with chunk_overlap=50 works better.
- For conversational content (blogs, FAQs, support docs): chunk_size=256 gives more precise matches with fewer false positives.
Why not 1024? In my experience, longer chunks often blur together unrelated concepts. You end up retrieving almost-relevant passages, and that hurts answer quality more than it helps.
Here’s the full chunking logic I use:
from llama_index.core.node_parser import SentenceSplitter
# Tweak this based on your task
chunk_size = 512
chunk_overlap = 50
parser = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
nodes = parser.get_nodes_from_documents(documents)
print(f"Chunked into {len(nodes)} nodes")
Quick tip: Always print and manually inspect a few chunks. I’ve caught broken formatting, weird encodings, and entire tables getting crammed into single nodes—things no retrieval algorithm can fix.
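Here’s the kind of spot-check I mean: print a handful of nodes along with their source file so broken formatting jumps out early (the file_name key assumes your loader populates it in node metadata, which SimpleDirectoryReader normally does):
# Eyeball a few chunks before embedding anything
for node in nodes[:3]:
    print("=" * 60)
    print("source:", node.metadata.get("file_name", "unknown"))
    print(node.get_content()[:300])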
4. Step 2: Embed Your Documents
“The retrieval engine is only as smart as your embeddings.”
Once your docs are chunked, the next critical step is embedding them into a vector space that actually preserves meaning. I’ve played with several models, but the one I keep coming back to is this:
BAAI/bge-base-en-v1.5
This model just works. It’s fast, sentence-level accurate, and has outperformed some of the newer flashy models in actual RAG tasks—at least in my own evaluations.
Here’s how I embed with it using sentence-transformers. This runs fine on CPU for small datasets, but expect it to be slow once you scale.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
# Initialize model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Get texts from nodes
texts = [node.get_content() for node in nodes]
# Generate embeddings
embeddings = model.encode(texts, show_progress_bar=True)
# Build FAISS index
dim = embeddings[0].shape[0]
index = faiss.IndexFlatL2(dim)
index.add(np.array(embeddings))
print(f"Added {len(embeddings)} vectors to FAISS index")
Want it faster? If you’ve got a GPU, you can load the model with device="cuda" and encode in larger batches. For large-scale indexing, I’ve also tried faiss.IndexIVFFlat, but unless you’ve got millions of docs, the flat index is simpler and solid.
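For reference, the GPU version is only a couple of argument changes (batch_size here is a starting point to tune against your VRAM):
# Same model, but on GPU and encoding in larger batches
model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")
embeddings = model.encode(
    texts,
    batch_size=64,          # raise until you run out of VRAM
    show_progress_bar=True,
)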
Why I didn’t use OpenAI or Cohere:
I’ve used OpenAI’s embedding API (text-embedding-ada-002) before, and it’s great—until latency and token limits kick in. Also, for local or airgapped deployments, it’s a no-go. That’s why I kept it open-source here.
Plus, running locally gives you full control over:
- Model versioning
- Cost (zero)
- Latency (no network calls)
- Privacy
5. Step 3: Set Up Vector Store
“Retrieval is where half the magic happens. Don’t skimp on your vector store.”
I’ll keep it real—I’ve tried a few different vector stores over the past year: FAISS, Chroma, Weaviate, even Qdrant. But when I need something local and minimal that just works with 100K+ documents, FAISS is the one I reach for.
It’s C++ under the hood, battle-tested by Meta, and doesn’t introduce network or container overhead. That matters a lot when you’re running experiments on a laptop or deploying to a simple cloud box.
Here’s how I typically set it up with persistence logic baked in:
import faiss
import numpy as np
import pickle
from pathlib import Path
# Build FAISS index
dim = 768 # Make sure this matches your embedding size
index = faiss.IndexFlatL2(dim)
# Add embeddings
index.add(np.array(embeddings))
# Optional: Save metadata for mapping later
with open("vectorstore_metadata.pkl", "wb") as f:
pickle.dump(texts, f)
# Save FAISS index to disk
faiss.write_index(index, "vectorstore.index")
Heads-up on distance metrics:
This might surprise you: FAISS defaults to L2 distance, not cosine. That can throw off your similarity results if you’re using sentence embeddings that assume cosine similarity. You can normalize your vectors manually before adding them:
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
Or switch to IndexFlatIP (inner product) if that aligns better with your use case. I’ve personally run into weird retrieval mismatches before realizing this, so it’s now something I double-check early on.
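Putting those two fixes together, here’s a minimal cosine-style setup (normalize once, index with inner product, and remember to normalize query vectors the same way before calling search):
# Cosine similarity = inner product over unit-length vectors
emb = np.asarray(embeddings, dtype="float32")
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
index = faiss.IndexFlatIP(dim)
index.add(emb)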
6. Step 4: Load an Open-Source LLM
“LLMs are like sports cars—you’ve got to pick the right one for the track you’re on.”
Let’s talk models.
When it comes to open-source LLMs, I’ve cycled through a lot of them: Mistral, Nous-Hermes, TinyLlama, and even some GGUF-quantized variants of LLaMA-2.
For this project, I leaned toward Mistral 7B running via Ollama—mainly because it gave me solid response quality with fast startup and zero drama during setup. No CUDA nightmares, no memory issues. Just pull and go.
Option 1: Using Ollama (fastest for local)
# Pull Mistral locally
ollama pull mistral
Then in Python:
import requests

def query_ollama(prompt, model="mistral"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    return response.json()["response"]

output = query_ollama("Explain how transformers work in simple terms.")
print(output)
I like Ollama for quick iterations, especially when testing prompt formats locally without wiring up full pipelines.
Option 2: Transformers with a Quantized Model (for GPUs)
If you’ve got a decent GPU, here’s how I load a GPTQ-quantized Mistral using transformers (the GPTQ route needs optimum and auto-gptq installed; bitsandbytes 4-bit loading is the other common option for non-GPTQ checkpoints):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# GPTQ-quantized Mistral 7B Instruct; loads through transformers' GPTQ integration
model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place layers on the available GPU(s) automatically
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

input_text = "Explain how vector databases work."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Gotcha I hit early on: If you’re using a quantized GGUF model, it won’t work with transformers. You’ll need llama.cpp or tools built on it, like Ollama or KoboldCpp.
Also, don’t forget to format your prompt to match the model’s instruction style. Mistral Instruct expects its [INST] ... [/INST] wrapping, while other models follow ChatML or plain ### User: / ### Assistant: sections. If your responses feel “off,” 90% of the time it’s a prompt formatting issue.
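If you’d rather not hand-maintain these formats, recent transformers versions let the tokenizer build the prompt from the model’s own chat template (assuming the checkpoint ships one, which most instruct models now do). Using the tokenizer from Option 2:
# Build the prompt in the model's native instruction format
messages = [{"role": "user", "content": "Explain how vector databases work."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # for Mistral Instruct, the message gets wrapped in [INST] ... [/INST]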
7. Step 5: Build the Retrieval-Augmented Pipeline
“RAG is like plumbing—if your pipes don’t connect cleanly, the whole system leaks.”
This is the part where everything clicks. You’ve got chunked docs. You’ve embedded them. You’ve picked a solid model. Now it’s time to wire up retrieval and generation into a single, working pipeline.
I’ve built this both manually and using frameworks like LangChain and LlamaIndex. Personally? When I’m prototyping or debugging, I stick to manual—it gives me full control. But if I need to integrate with a larger system or expose an API, LangChain’s modules save a lot of boilerplate.
Let’s walk through the manual pipeline first. Then I’ll show how to hook it up with LangChain.
Here’s how the pipeline flows:
- User asks a question
- We embed the question using the same model we used for documents
- Retrieve top-K similar chunks from the FAISS index
- (Optional) Rerank if you’re using something like Cohere’s reranker or bge-reranker
- Inject those chunks into a prompt for the LLM
- Generate the final response
Here’s the exact code I’ve used in production:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pickle
import requests
# Load embedding model
embed_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Load FAISS index and metadata
index = faiss.read_index("vectorstore.index")
with open("vectorstore_metadata.pkl", "rb") as f:
texts = pickle.load(f)
# Retrieve top_k relevant docs
def retrieve_relevant_chunks(query, top_k=5):
query_embedding = embed_model.encode([query])
D, I = index.search(np.array(query_embedding), top_k)
return [texts[i] for i in I[0]]
# Construct the prompt
def build_prompt(context_chunks, question):
    context = "\n\n".join(context_chunks)
    prompt = f"""### Context:\n{context}\n\n### Question:\n{question}\n\n### Answer:"""
    return prompt

# Send to LLM (Ollama here)
def query_llm(prompt):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False}
    )
    return response.json()["response"]

# 🔁 Full pipeline
def run_rag_pipeline(query, top_k=5):
    retrieved_chunks = retrieve_relevant_chunks(query, top_k)
    prompt = build_prompt(retrieved_chunks, query)
    response = query_llm(prompt)
    return response

# 🧪 Test it
query = "What are the best practices for fine-tuning a transformer on domain-specific data?"
response = run_rag_pipeline(query)
print(response)
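For the LangChain route I mentioned earlier, the same flow maps onto its embedding, vector store, and retriever abstractions. Here’s a rough sketch (imports have moved between langchain and langchain_community across versions, so adjust to whatever you have installed):
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Build a LangChain-managed FAISS store from the same chunk texts
lc_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = FAISS.from_texts(texts, lc_embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Retrieve, then reuse the prompt and LLM helpers from above
docs = retriever.get_relevant_documents(query)
print(query_llm(build_prompt([d.page_content for d in docs], query)))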
A few things I’ve learned the hard way:
- Top_k isn’t one-size-fits-all. I usually start with 5, but for dense academic content, I go up to 10. You don’t always want more: adding irrelevant context just confuses the LLM.
- Prompt formatting matters more than you’d think. Some models like Mistral or Zephyr respond better to labeled sections like ### Context: and ### Question:. GPT-style models prefer role-based formats (system, user, assistant).
- Context order matters. Put the most relevant chunks first. FAISS returns hits ordered by raw vector distance, which isn’t always the best order for the prompt; consider reranking based on question overlap, cosine similarity, or position in the original doc.
- If you’re doing reranking, try bge-reranker-large. It’s surprisingly good and fast enough to run locally for top-20 → top-5 filtering (see the sketch just below).
9. Making It Production-Ready
There’s a huge gap between a working RAG notebook and something you’d actually run in production. I’ve been down that road, and here’s how I’ve approached it when I needed something solid but still flexible.
Containerization: Docker, or Just a Script?
When I’m iterating quickly, I don’t always jump to Docker. A simple .sh script that installs dependencies and spins up the service is usually faster to test and deploy. But if I’m sharing the setup across environments or running it on a remote GPU box, I wrap it up in a Dockerfile like this:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
I’ve also had success using uvicorn and FastAPI together inside Docker for lightweight deployments.
Speed Hacks: Caching, Pre-fetching, and Batch Inference
One thing I learned early: if you’re embedding queries on the fly every single time, you’re burning cycles unnecessarily. For low-latency RAG setups, I cache embeddings for frequent questions, or I batch embed similar queries.
Here’s a quick example of how I pre-cache:
from sentence_transformers import SentenceTransformer
import pickle
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
frequent_questions = [
"How do I fine-tune LLaMA?",
"What are the limitations of retrieval-augmented generation?"
]
embeddings = model.encode(frequent_questions)
with open("cached_query_embeddings.pkl", "wb") as f:
pickle.dump(embeddings, f)
You can load this on startup and speed things up dramatically.
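On startup, I load that pickle into a dict and check it before hitting the encoder. Something like this (the exact-match lookup is deliberately simple; you could also key on a normalized form of the question):
import pickle
import numpy as np

# Map each frequent question to its precomputed embedding
with open("cached_query_embeddings.pkl", "rb") as f:
    query_cache = dict(zip(frequent_questions, pickle.load(f)))

def embed_query(query):
    # Reuse the cached vector when we've seen this exact question before
    if query in query_cache:
        return np.array([query_cache[query]])
    return model.encode([query])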
API and App Layer
I usually expose the RAG system using FastAPI, like this:
from fastapi import FastAPI, Request
import uvicorn
app = FastAPI()
@app.post("/query")
async def query_api(req: Request):
    body = await req.json()
    query = body.get("query")
    response = run_rag_pipeline(query)
    return {"response": response}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
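Once it’s running, hitting the endpoint is a single POST (assuming the service is on localhost:8000):
import requests

resp = requests.post(
    "http://localhost:8000/query",
    json={"query": "What are the limitations of retrieval-augmented generation?"},
)
print(resp.json()["response"])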
If I just need a quick UI for demos or internal users, I go with Streamlit. But for anything meant to scale, I containerize it and deploy using Docker + FastAPI behind Nginx.
CPU vs GPU: What I’ve Actually Used
- CPU-only (FAISS + GGUF): Works well for smaller loads. I’ve deployed Mistral-7B.gguf with llama-cpp-python on a CPU-only machine, and it handled 2–3 concurrent requests fine (see the sketch after this list).
- GPU (ExLLaMA or vLLM): If latency’s critical and I’m on A100s or 3090s, I spin up models using ExLLaMA or vLLM for better batching. Just know that managing memory fragmentation on consumer GPUs is a thing.
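For that CPU-only path, the llama-cpp-python side is only a few lines. A sketch (the GGUF filename is a placeholder for whichever quantization you actually downloaded):
from llama_cpp import Llama

# Load a quantized GGUF checkpoint on CPU
llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

out = llm("[INST] Explain how vector databases work. [/INST]", max_tokens=200)
print(out["choices"][0]["text"])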
In short: you don’t need a $20K server to run a useful RAG system—but you do need to tune the deployment path based on your actual traffic.
10. Bonus: What I’d Do Differently
Here’s where things get real. I’ve hit roadblocks in every part of this stack—and I’ve learned a few things the hard way.
What broke
- I once used a model that required torch==2.0.0, and forgot to pin it. The result? A downstream dependency silently broke the quantized model loader… and it took me hours to trace back.
- FAISS indexes can get corrupted during CI/CD if you’re syncing them across systems with different hardware. I now always rebuild locally post-deployment.
What saved me time
- Caching prompt responses during eval runs. When you’re tweaking the retrieval step and re-running hundreds of prompts, saving generations to disk can save hours.
- Embedding reranker candidates offline rather than re-scoring live. Cuts response time from ~4s to ~1.2s in most cases.
What I’d avoid next time
- Avoid over-relying on LangChain’s default chains without inspecting what they actually do. I’ve had situations where prompts were bloated with metadata I didn’t need—and performance tanked.
- Don’t blindly use massive models. For most use cases, a fine-tuned 7B is plenty. Mistral or TinyLLaMA variants can outperform GPT-3.5 when your retrieval is strong.
11. Final Thoughts + GitHub Link
If you’ve followed along, you’ve now got a fully working Retrieval-Augmented Generation (RAG) system—built entirely with open-source tools, from scratch. Not a toy example, but something you can actually deploy, scale, and improve over time.
Here’s the full working code I used in this guide:
👉 [GitHub Repo / Colab Link]
I’ve kept the code modular so you can easily swap out models, chunking strategies, or vector stores based on your specific use case.
Where to Go Next?
Here’s what I’d suggest exploring next—based on where I hit scaling pain points myself:
- Scaling RAG: Try vLLM or TGI for serving larger models efficiently. I saw ~30% latency reduction just by switching to vLLM with batching enabled.
- Streaming Data: If your data changes frequently, look into periodic re-chunking and re-indexing pipelines. I used Apache Airflow + MinIO for one such setup and it worked great.
- Multi-modal RAG: I’m currently experimenting with image-text retrieval (CLIP + BLIP + LLM). If your domain has screenshots, diagrams, or scanned docs, this is worth looking into.
That’s it for now — if this helped, feel free to fork the repo and make it your own. And if you’re building something cool on top of it, I’d love to hear about it.
