1. Introduction
“In theory, theory and practice are the same. In practice, they’re not.” – Yogi Berra
When I first started playing around with Retrieval-Augmented Generation (RAG), I ran into the same trap I’ve seen others fall into: bolting an LLM onto a vector store and expecting magic. Spoiler: it doesn’t work like that.
This guide walks you through how I actually build RAG pipelines using LlamaIndex. I’m not just repeating docs — this is based on real stuff I’ve implemented, tweaked, broken, and fixed.
I’ve tested it with multiple data sources, swapped out vector stores, tried different chunking strategies, and pushed it far enough to know where the cracks start to show.
I’m keeping this code-first, no fluff. If you’ve built NLP pipelines before and you’re ready to take RAG seriously, you’re in the right place.
2. Project Overview: What We’re Building
Let me show you what we’re building, then I’ll show you exactly how to build it.
I’m going to walk you through a complete RAG pipeline using LlamaIndex, wired up with a vector store (FAISS for local dev, MyScale for production) and backed by an LLM (OpenAI or a local model). You’ll go from raw documents to real-time Q&A.
Here’s the flow:
```text
[Document Sources]
        ↓
    [Loader]
        ↓
[LlamaIndex Document Abstraction]
        ↓
[Embedding & Vector Store (FAISS/MyScale)]
        ↓
   [Retriever]
        ↓
[Query Engine → LLM]
```
This isn’t a toy pipeline — I’m building it the same way I would for something customer-facing.
Tech stack I’m using:
- LLM: You can use OpenAI or plug in a local model. I’ve done both — I’ll show you how to swap.
- Vector Store: I’ve used both FAISS for local testing and MyScale for production. I’ll show code for both.
- LlamaIndex: This is the core orchestrator — I’m using it for loading, chunking, embedding, and querying.
- LangChain (Optional): Only if you want more control over prompt chains or external APIs. I personally keep it minimal unless necessary.
3. Setting Up the Environment
“Before anything else, preparation is the key to success.” – Alexander Graham Bell
I’ve learned the hard way that half the battle in building production RAG pipelines is managing the environment. You don’t want to debug dependency hell when you’re trying to fine-tune chunking strategies.
Here’s a clean starting point I use in most of my LlamaIndex projects:
requirements.txt

```text
llama-index==0.10.14
llama-index-vector-stores-faiss
openai
faiss-cpu
tqdm
python-dotenv
unstructured
PyMuPDF
```

You can install everything in one go:

```bash
pip install -r requirements.txt
```
Pro tip: with llama-index 0.10.x, vector store integrations ship as separate packages (that’s why `llama-index-vector-stores-faiss` is in the list above). If you’re using MyScale as the vector store, you’ll need the `llama-index-vector-stores-myscale` integration plus the `clickhouse-connect` package too. I usually pin the latter to a stable version:

```text
clickhouse-connect==0.6.5
```
Project Structure
I personally like to keep things modular and clean. Here’s a project layout that’s worked well for me:
```text
rag-llamaindex/
├── data/                  # Raw documents (PDFs, markdown, etc.)
├── scripts/
│   ├── ingest.py          # Load and preprocess documents
│   ├── build_index.py     # Create embeddings & index
│   └── query.py           # Query engine setup
├── .env                   # API keys and config
├── config.yaml            # Optional: chunking params, vector db config
└── requirements.txt
```
Trust me: this kind of structure pays off when you start testing different data sources or want to modularize query logic.
4. Ingesting and Preprocessing Data
This might surprise you: document loading is where most people screw things up in a RAG pipeline. Either they over-process and lose context, or under-process and clutter the embeddings with junk.
I’ve tried a bunch of approaches, but here’s what I’ve found works best when using LlamaIndex loaders.
Loading Documents
Let’s say you’re dropping PDFs and Markdown files into a folder called `./data`. Here’s how I load them:

```python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir="./data").load_data()
```
This automatically pulls in PDFs, text, and markdown — but don’t stop there. Raw files usually contain headers, footers, timestamps, or other junk that’ll pollute your embeddings.
Custom Preprocessing
Here’s how I typically clean things up:
```python
def clean_text(text):
    lines = text.splitlines()
    cleaned = [
        line.strip() for line in lines
        if line.strip() and not line.lower().startswith("page ")  # drop page-number lines
    ]
    return "\n".join(cleaned)
```

Now apply it to each document:

```python
for doc in documents:
    doc.text = clean_text(doc.text)
```
Personally, I also like to normalize unicode characters and fix common OCR artifacts if I’m working with scanned PDFs. Don’t skip this — garbage in, garbage out.
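Here’s a minimal sketch of that normalization pass, layered on top of `clean_text` above. The replacement map is just an example of the kind of artifacts I patch up; extend it for whatever your own scans produce:

```python
import re
import unicodedata

def normalize_text(text):
    # NFKC folds ligatures (e.g. ﬁ -> fi), full-width chars, and non-breaking spaces
    text = unicodedata.normalize("NFKC", text)
    # Typography/OCR artifacts I see most often (extend for your corpus)
    for bad, good in {"’": "'", "‘": "'", "“": '"', "”": '"'}.items():
        text = text.replace(bad, good)
    # Collapse whitespace runs left behind by stripped headers/footers
    return re.sub(r"[ \t]{2,}", " ", text)

for doc in documents:
    doc.text = normalize_text(clean_text(doc.text))
```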
You might be wondering: what if I need more structured loaders (e.g., HTML parsing with BeautifulSoup or JSON mapping)? I’ve had to write custom LlamaIndex loaders for that too — I’ll cover that in a later section when we deal with hybrid sources.
5. Chunking: Why and How to Do It Right
“The difference between something good and something great is attention to detail.”
This might sound trivial, but in my experience, chunking is where most RAG pipelines quietly fail. You can load documents, vectorize them, and still get garbage answers — because the LLM can’t recover context that got sliced in the wrong place.
I’ve tested everything from naive newline splits to semantically aware chunkers. Here’s what’s worked best for me.
Naive Splitting Breaks Context
You might be tempted to do this:
```python
chunks = document.text.split("\n\n")  # Don't do this
```

But in real-world documents — especially longform PDFs or scraped HTML — that almost always creates incomplete thoughts. I’ve ended up with embeddings of half-sentences and stray section headers. It’s messy, and retrieval suffers.
Sentence-Based Chunking
For most general-purpose use cases, SentenceSplitter does the job well:
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document.text)
```
This ensures your context windows aren’t cut mid-sentence, and the overlap lets context carry across chunk boundaries.
Personally, I’ve found 512 tokens with ~50 token overlap to be a good sweet spot for GPT-4 and OpenAI’s embedding models.
Semantic Splitter (If You Need Precision)
When I’m working on sensitive domains like legal or medical, I sometimes use SemanticSplitter to keep conceptually coherent chunks. It costs more (since it uses embeddings), but the quality boost is real.
You can plug it into LlamaIndex by overriding the default text splitter during document ingestion.
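Here’s a sketch of what that looks like with the `SemanticSplitterNodeParser` that ships with recent LlamaIndex releases; the threshold and buffer values are just the defaults I start from, not tuned numbers:

```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Splits where embedding similarity between adjacent sentence groups drops sharply,
# so chunk boundaries follow concepts instead of a fixed token count.
splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences per comparison window
    breakpoint_percentile_threshold=95,  # higher = fewer, larger chunks
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(documents)
```

You then build the index from `nodes` instead of raw documents (e.g. `VectorStoreIndex(nodes, ...)`), which is what overriding the default splitter amounts to in practice.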
6. Embedding and Indexing the Documents
This is where the rubber meets the road: transforming your cleaned, well-chunked documents into vectors your LLM can retrieve against.
Pick an Embedding Model
I’ve used several — OpenAI for production-grade performance, HuggingFace for local setups, and even fine-tuned BGE variants when I needed domain-specific accuracy.
Here’s how I set up OpenAI:
```python
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
```
If you’re using a local model:
```python
# Requires the llama-index-embeddings-huggingface integration package
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
```
I usually set a caching layer on top of the embedding call (like joblib or SQLite) to avoid recomputation — especially if I’m iterating over chunking logic.
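For experimentation, here’s a minimal sketch of that caching idea with joblib (not in the requirements list above, so add it if you go this route). It caches direct embedding calls while you iterate on chunking; wiring a cache into the full index build would take a custom embedding wrapper, which I’m skipping here:

```python
from joblib import Memory
from llama_index.embeddings.openai import OpenAIEmbedding

# Disk-backed cache: re-running over the same chunks costs nothing
memory = Memory("./embedding_cache", verbose=0)
embed_model = OpenAIEmbedding()

@memory.cache
def embed_text(text):
    return embed_model.get_text_embedding(text)

vectors = [embed_text(chunk) for chunk in chunks]
```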
Vector Store: MyScale Example
I’ve used both FAISS (for fast local dev) and MyScale (for production). Here’s how I wired up MyScale:
```python
import clickhouse_connect

from llama_index.core import StorageContext, VectorStoreIndex
# Requires the llama-index-vector-stores-myscale integration package
from llama_index.vector_stores.myscale import MyScaleVectorStore

client = clickhouse_connect.get_client(
    host="your-host",
    port=8443,
    username="user",
    password="pass",
    database="default",
)

vector_store = MyScaleVectorStore(myscale_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)
```
This setup gives you a persistent, scalable vector index you can build once and query many times. I’ve had good performance with millions of vectors using this.
If you’re using FAISS for testing:
```python
import faiss

# Requires the llama-index-vector-stores-faiss integration package
from llama_index.vector_stores.faiss import FaissVectorStore

# Dimension must match your embedding model (1536 for OpenAI's text-embedding-ada-002)
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatL2(1536))
```
Bonus Tip: Index Persistence
If I’m running iterative experiments, I always persist my index to disk. With LlamaIndex, it’s a one-liner:
```python
index.storage_context.persist("./storage")
```
Saves you from rerunning embedding every time — especially handy when you’re fine-tuning retriever configs or prompt formats.
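The flip side is loading it back on the next run instead of re-embedding. Here’s the pattern I use with the default (in-memory) vector store; a store-specific backend like FAISS needs its vector store re-attached, which I’ve noted in the comments as a sketch to adapt:

```python
from llama_index.core import Settings, StorageContext, load_index_from_storage

Settings.embed_model = embed_model  # same embedding model you indexed with

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# FAISS variant (sketch): rebuild the store first, then load
# vector_store = FaissVectorStore.from_persist_dir("./storage")
# storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir="./storage")
# index = load_index_from_storage(storage_context)
```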
7. Query Pipeline: Retrieval + LLM
“Data is not information, information is not knowledge.” — Clifford Stoll
Here’s the deal: embedding your documents is only half the battle. The real magic happens when you wire up a retrieval layer that can intelligently surface relevant chunks — not just “similar” ones — and route them into your LLM for response generation.
This part is where I’ve spent the most time tuning. You can’t just throw a retriever and an LLM together and expect coherent answers. You’ve got to tune the retrieval logic, set the right `top_k`, and pass the right context to your model.
Let me show you how I typically do it.
Setting Up a Custom Retriever
I start by wrapping my index with a retriever. This lets you customize the search behavior — filters, metadata fields, hybrid search (if your vector store supports it), and more.
Here’s a minimal setup using LlamaIndex’s built-in retriever:
```python
retriever = index.as_retriever(similarity_top_k=5)
```
I usually experiment with `top_k` values depending on how verbose or sparse my docs are. For technical content, 3–5 is usually enough. For chatty or vague content, you might need 8–10.
If your vector store supports metadata filtering (like MyScale or Qdrant), you can take this up a notch:
```python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(filters=[ExactMatchFilter(key="doc_type", value="policy"),
                                   ExactMatchFilter(key="lang", value="en")])
retriever = index.as_retriever(similarity_top_k=5, filters=filters)
```
Plugging In the LLM
Once the retriever’s in place, it’s time to connect the LLM. I’ve used OpenAI’s `gpt-4` in production, but this setup works with local models too — as long as they implement the LlamaIndex LLM interface.
Here’s the cleanest way I’ve found to wire everything together:
```python
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import RetrieverQueryEngine

llm = OpenAI(model="gpt-4")  # Swap in a local model if needed

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=llm,
)
```
I prefer `RetrieverQueryEngine` over `index.as_query_engine()` because it gives me more flexibility — like plugging in custom prompt templates or modifying the response synthesis behavior.
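And since I promised to show the local-model swap: here’s a minimal sketch using the Ollama integration. It assumes you’ve installed the `llama-index-llms-ollama` package and have an Ollama server running with a model pulled; `mistral` below is just a placeholder:

```python
from llama_index.llms.ollama import Ollama
from llama_index.core.query_engine import RetrieverQueryEngine

# Same pipeline, different backend -- nothing else changes
local_llm = Ollama(model="mistral", request_timeout=120.0)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=local_llm,
)
```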
Run a Sample Query
Now you’re ready to ask questions against your own corpus. Here’s a quick test run:
```python
response = query_engine.query("What is the refund policy on international orders?")
print(response)
```
I always run a few smoke tests before integrating into an app — this helps you spot retrieval misses or formatting issues early.
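My smoke tests are nothing fancy; roughly this, with the questions swapped for ones your corpus can actually answer (the ones below are hypothetical):

```python
smoke_questions = [
    "What is the refund policy on international orders?",
    "Which carriers do we ship with?",
]

for q in smoke_questions:
    response = query_engine.query(q)
    print(f"\nQ: {q}\nA: {response}")
    # Eyeball which chunks were retrieved -- the fastest way to spot retrieval misses.
    # 'file_name' is set by SimpleDirectoryReader; adjust if your loader differs.
    for node in response.source_nodes:
        print(f"  [{node.get_score():.3f}] {node.node.metadata.get('file_name', 'unknown')}")
```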
Optional: Add a Custom Prompt or Postprocessor
Sometimes, I need to format answers in a specific way (like structured JSON for apps). LlamaIndex lets you inject a custom prompt template or even rewrite the response before returning it.
If you’re interested in this, I can walk you through it in the next section.
8. Advanced Customization (Production-Grade RAG)
“In theory, theory and practice are the same. In practice, they’re not.” — Yogi Berra
Once you’ve got your basic RAG pipeline working, it’s tempting to stop there. But from my experience, the difference between a weekend prototype and a real product boils down to customization. You need to tune it for your data, your use case, and your users.
Let me walk you through some of the things I’ve personally found critical in getting production-grade performance.
Custom Prompt Templates
One thing I tweak early: prompt templates. LlamaIndex lets you override the system prompt and query format to match your domain tone or extract structured answers.
```python
from llama_index.core import PromptTemplate
from llama_index.core.query_engine import RetrieverQueryEngine

custom_prompt = PromptTemplate(
    "You are an expert support agent. Answer clearly and include product names when relevant.\n\n"
    "Context: {context_str}\n\n"
    "Query: {query_str}\n"
    "Answer:"
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=llm,
    text_qa_template=custom_prompt,
)
```
I’ve used this approach for e-commerce use cases where generic prompts weren’t cutting it. It makes a big difference in tone and specificity.
Metadata Filtering
In many of my projects, docs span multiple languages, versions, or teams. Filtering by metadata helps route questions to the right context.
If your vector store supports it (MyScale, Weaviate, Qdrant), you can apply filters at retrieval time:
```python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(filters=[ExactMatchFilter(key="lang", value="en"),
                                   ExactMatchFilter(key="source", value="internal_docs")])
retriever = index.as_retriever(similarity_top_k=5, filters=filters)
```
Don’t sleep on this — it massively cuts down on irrelevant context.
Choosing a Response Synthesis Mode
You’ve got a few modes when it comes to LlamaIndex’s response synthesis. Each one feels a little different in how it shapes the final answer.
- `"tree_summarize"` – builds a hierarchical answer (good for summaries)
- `"refine"` – iteratively updates the response (my go-to for long-form)
- `"compact"` – merges context into one pass (fast, good for chat)
Here’s how I configure it:
```python
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=llm,
    response_mode="refine",
)
```
I usually experiment between `refine` and `compact`. `tree_summarize` is slower, but occasionally useful for deep-dive reports.
Add Logging & Callbacks
This part’s a lifesaver. I always hook into LlamaIndex’s callback system to see what chunks are being retrieved and why.
```python
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

debug_handler = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([debug_handler])

query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=llm,
    callback_manager=callback_manager,
)
```
This helps debug weird outputs fast — especially when the LLM starts hallucinating off-topic content.
9. Evaluation and Debugging
Building is the easy part. Evaluating? That’s where most folks hit a wall.
Here’s how I’ve approached RAG eval in production — without boiling the ocean.
Log Retrievals Aggressively
You need to log both the retrieved chunks and the final answer. I usually log:
- Query string
- Retrieved document IDs + source
- Final answer
- Timestamps
This helps you retroactively debug failures and tune retrieval quality.
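A minimal sketch of what that looks like in my projects: JSONL appended to disk, nothing fancy. Swap in your logging stack of choice; the file path is arbitrary:

```python
import json
import time

def log_interaction(query, response, path="rag_logs.jsonl"):
    """Append one query/answer record so failures can be replayed later."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "retrieved": [
            {
                "node_id": node.node.node_id,
                "source": node.node.metadata.get("file_name"),
                "score": node.get_score(),
            }
            for node in response.source_nodes
        ],
        "answer": str(response),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

question = "What is the refund policy on international orders?"
log_interaction(question, query_engine.query(question))
```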
Use LlamaIndex Debug Tools
LlamaIndex offers a neat tool called `QueryPipelineTool` — you can use it to visualize and trace the entire flow from query to generation.

```python
from llama_index.tools.query_pipeline_tool import QueryPipelineTool

pipeline_tool = QueryPipelineTool.from_defaults(query_engine=query_engine)
response = pipeline_tool.query("Where is the API rate limit documented?")
```
For interactive debugging, this saves a ton of time. Especially if you’re integrating into a UI or feedback loop.
Add Unit Tests and Grounding Checks
This might surprise you, but yes — I’ve written unit tests for my RAG pipelines.
At minimum, I recommend:
- Grounding checks: is the answer actually in the source documents?
- Consistency checks: same question, same answer?
- Latency tests: make sure response time doesn’t spike with large docs.
If you want to go further, libraries like Ragas or custom eval scripts can help automate this.
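For the grounding check specifically, here’s the crude version I start with: a keyword-overlap heuristic wrapped in a test. It’s a tripwire, not a real metric, and the threshold is a guess you’ll want to tune:

```python
def is_grounded(response, min_overlap=0.3):
    """Rough heuristic: enough of the answer's words should appear in the retrieved chunks."""
    answer_words = set(str(response).lower().split())
    source_text = " ".join(n.node.get_content() for n in response.source_nodes).lower()
    if not answer_words:
        return False
    hits = sum(1 for w in answer_words if w in source_text)
    return hits / len(answer_words) >= min_overlap

def test_refund_question_is_grounded():
    response = query_engine.query("What is the refund policy on international orders?")
    assert response.source_nodes, "retriever returned nothing"
    assert is_grounded(response), "answer not supported by retrieved chunks"
```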
10. Deploying the Pipeline
“An idea is worthless until you get it into the hands of users.” — Reid Hoffman (or maybe every engineering manager ever)
I’ll be honest — I’ve seen too many smart pipelines rot in Jupyter notebooks. Getting your RAG system into production is where it really starts earning its keep.
Here’s how I’ve deployed these setups in the real world.
Serve It with FastAPI
I usually reach for FastAPI — it’s fast, async, and plays well with modern backends. Here’s how I expose my LlamaIndex-powered RAG as a real-time API:
```python
from fastapi import FastAPI
from pydantic import BaseModel

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.llms.openai import OpenAI

app = FastAPI()

# Load the index persisted in Section 6 (from disk or cache)
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4"))

class QueryRequest(BaseModel):
    query: str

@app.post("/query")
async def query_llm(request: QueryRequest):
    response = query_engine.query(request.query)
    return {"response": str(response)}
```
This API is lightweight enough for internal tools or even Slackbots. In one project, I had this running behind an NGINX reverse proxy for a private doc assistant.
Dockerizing It (Optional but Recommended)
For reproducibility, I always package the whole pipeline into a Docker image.
```dockerfile
# Dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY . /app
RUN pip install -r requirements.txt

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
In most cases, I build and push this to a registry, then deploy using Modal, Lambda, or even a simple EC2 instance depending on the latency budget.
Tips for Cloud Deployments
- Modal is great if you don’t want to deal with infra at all — I’ve shipped entire FastAPI+LLM pipelines this way in <10 minutes.
- Lambda can work, but cold starts + LLM latency make it tricky.
- GPU-backed HuggingFace Spaces is a nice hack for quick demos, especially if you’re running local models.
If latency matters and scale isn’t wild, Modal or a reserved EC2 instance has worked best for me.
11. What to Watch Out For (Pitfalls & Lessons Learned)
“Experience is what you get when you didn’t get what you wanted.” — Randy Pausch
This might sound harsh, but: most RAG pipelines fail not because of bad models — but because of bad plumbing. I’ve personally run into every one of these pitfalls, so here’s what to avoid.
Common Issues
- Token limits: If you dump too many chunks into the prompt, the LLM will choke. Trust me, keep context under control.
- Low-precision retrieval: If your top-k results aren’t good, the LLM will hallucinate beautifully wrong answers.
- Latency: Vector DBs like Pinecone or Qdrant can be surprisingly slow under load if not tuned.
I’ve learned to batch embedding jobs, precompute where possible, and always monitor response latency.
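The batching point is worth making concrete. A sketch, assuming the `chunks` list from Section 5 and the `embed_model` from Section 6:

```python
# One batched call during ingestion instead of one request per chunk
embeddings = embed_model.get_text_embedding_batch(chunks, show_progress=True)
```

When you build an index through LlamaIndex the embedding model already batches requests for you; the explicit call above is what I reach for in custom ingestion scripts.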
Mistakes I’ve Made (So You Don’t Have To)
- Bad chunking: If you split mid-sentence or paragraph, the retriever will lose semantic cohesion.
- Generic prompts: Not customizing prompts early on was a huge miss. It’s where your domain expertise shines.
- Skipping eval: This one hurt. I once deployed a pipeline without grounding checks. Users quickly found out it was “confidently wrong.”
Scaling Lessons
- Batch Pre-Embedding: Always embed offline during ingestion, not at query time.
- Async LLM Calls: If you’re handling multiple queries in a web app, go async. The OpenAI and HuggingFace clients both support it.
- Cache Aggressively: I cache both query responses and embeddings. It cuts cost and improves latency dramatically.
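Here’s the shape of the response cache from that last point: a minimal in-process sketch keyed on the normalized query. In production I’d back this with Redis or similar, but the idea is the same:

```python
import hashlib

_response_cache = {}

def cached_query(question):
    # Normalize so trivially different phrasings of the same query share a key
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = str(query_engine.query(question))
    return _response_cache[key]
```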
Final Thoughts and What’s Next
“Build systems that learn. Then learn from those systems.” — Something I’ve come to live by when working with LLMs.
If you’ve made it this far, you’re not just tinkering — you’re building something real. And honestly, that’s where it gets fun.
Once I had my first RAG pipeline working end-to-end, the obvious question hit me: what now? Here’s where I usually go from there.
Keep It Current: Follow the Changelogs
LlamaIndex evolves fast. I’ve seen entire APIs change overnight — sometimes for good reasons, sometimes not so much. If you want your system to stay alive and relevant, keep the official changelog and release notes bookmarked.
Personally, I subscribe to their GitHub releases feed. Saved me more than once.
Deploying at Scale? Look Into LLMOps
If you’re thinking about uptime, monitoring, CI/CD for prompt templates, or A/B testing different retrievers, welcome to the world of LLMOps.
Some of the tools I’ve either used or am actively exploring:
- Weights & Biases — great for logging LLM behavior, even in prod.
- Helicone — observability layer for LLM APIs.
- Traceloop — for inspecting pipeline steps in RAG setups.
- PromptLayer — versioning and analytics for prompt calls.
LangChain vs. LlamaIndex: A Quick Word
You might be wondering: Should I have used LangChain instead?
I’ve used both. Here’s my take:
| Feature | LlamaIndex | LangChain |
|---|---|---|
| Ease of Use | ✅ Simpler | 🧩 Modular but verbose |
| Abstractions | Clean | Sometimes too many layers |
| Retrieval Customization | Top-notch | Decent |
| Community Recipes | Fewer but growing | Huge ecosystem |
For fast prototyping with deep retrieval logic, I stick with LlamaIndex. For building orchestration-heavy workflows (think agents, tools), LangChain has the edge. Pick what fits the shape of your problem.
What You Can Add Next
Once your RAG system is live, don’t let it go stale. Here are some extensions I’ve either built or seen work really well in production:
- Chat UI (with session memory) — great for internal tools and support assistants.
- Feedback loops — let users rate or flag answers. You’ll build a goldmine of eval data.
- Reranking layers — sometimes, a second-pass LLM rerank of the top-k chunks boosts performance a lot more than tuning your retriever (see the sketch after this list).
- Active learning — retrain or fine-tune your embedder using real user queries and failures.
- Guardrails — use tools like GuardrailsAI or custom logic to enforce response formats or policies.
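To make the reranking idea concrete, here’s a minimal sketch using LlamaIndex’s `LLMRerank` postprocessor; the retrieval width and `top_n` are illustrative, and a dedicated cross-encoder reranker is often cheaper at scale:

```python
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_engine import RetrieverQueryEngine

# Retrieve wide, then let the LLM rerank and keep only the best few chunks
reranker = LLMRerank(choice_batch_size=5, top_n=3)

query_engine = RetrieverQueryEngine.from_args(
    retriever=index.as_retriever(similarity_top_k=10),
    llm=llm,
    node_postprocessors=[reranker],
)
```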
Wrapping It Up
I’ll leave you with this:
Building RAG pipelines isn’t about connecting tools — it’s about understanding what the user really needs, and designing a system that can find and express that information — clearly, quickly, and reliably.
If that’s what you’re aiming for, you’re already ahead of 90% of the people playing with LLMs right now.
