Build an Advanced RAG Application Using MyScaleDB and LlamaIndex

1. Introduction

“You don’t really understand a tool until you’ve tried to scale it.”

That’s been true for me every time I’ve built a Retrieval-Augmented Generation (RAG) system in production.

I’ve worked with a bunch of vector databases, embedded thousands of documents, and dealt with all the painful bottlenecks that show up once your prototype needs to actually work—for real users, on real data, at speed.

In this guide, I’m walking you through how I personally built an advanced RAG pipeline using LlamaIndex and MyScaleDB.

If you’re already using OpenAI or other LLMs and you’re at the point where you want real-time retrieval and production-level indexing, this guide is for you.

By the end, you’ll have:

  • A scalable, production-ready RAG stack
  • Clean code that plugs right into your own projects
  • Tips that I’ve picked up while debugging and scaling in the wild

I’ll skip the 101-level theory—if you’re here, I’m guessing you don’t need to hear what an embedding is. Let’s build.


2. Project Architecture Overview (With Visual + Code Map)

Here’s the structure I went with. It’s lean, fast, and built to scale without falling apart the moment you hit 10,000 documents.

Architecture Flow

[Ingestion]
   ↓
[Chunking + Embedding]
   ↓
[Indexing to MyScaleDB]
   ↓
[Retrieval via LlamaIndex Query Engine]
   ↓
[LLM Response Generation + Optional Rerankers]
   ↓
[Output + Eval + Tracing (Optional)]

Why This Setup?
I needed something that could:

  • Handle large-scale ingestion
  • Support fast semantic queries
  • Be modular enough to plug in rerankers or eval without breaking the flow

Code Roadmap
Here’s how this will unfold:

🔹 Setup & ingestion: You’ll see how I loaded unstructured content and handled chunking—because good chunking makes or breaks retrieval quality.
🔹 Embedding & MyScale index creation: I’ll show the full code to vectorize your content and push it into MyScaleDB.
🔹 Query routing & response generation: You’ll wire up a real-time query pipeline using LlamaIndex’s query engines.
🔹 Evaluation/debugging hooks: I’ll show you how I added tracing to inspect retrieval, cost, and answer quality.
🔹 Optimization for latency/cost: Some tweaks saved me thousands of tokens per day—I’ll share those too.

This isn’t just about making it work. It’s about making it sharp—fast, scalable, and robust under pressure.


3. Prerequisites and Stack Setup

“The best setups are the ones you only need to do once—because you got them right the first time.”

This section is exactly that. Here’s the setup I personally use when I’m working with MyScale and LlamaIndex. It’s tuned, lean, and production-minded. I’m skipping the basics like conda environments and code editors—if you’re reading this, you already have your dev setup dialed in.

Python Version & Dependencies

I used Python 3.10 for everything below. If you’re mixing packages or running multiple RAG projects in parallel, I strongly suggest isolating this in a fresh venv. Some LlamaIndex submodules have fast-evolving dependencies.

Here’s the pip install I always start with:

pip install llama-index openai myscale clickhouse-connect python-dotenv

You’ll also need the MyScale-specific vector store from LlamaIndex:

pip install llama-index-vector-stores-myscale

Note: If you skip that last one, LlamaIndex will throw a vague import error when you try to connect to MyScale—this one tripped me up early on.

API Keys and Secrets

I don’t hardcode secrets—ever. Here’s how I set things up locally using .env:

OPENAI_API_KEY=your-key
MYSCALE_HOST=https://your-cluster.us-east-1.aws.myscale.com
MYSCALE_USERNAME=your-user
MYSCALE_PASSWORD=your-password
MYSCALE_DATABASE=rag_db

And in code:

from dotenv import load_dotenv
import os
import openai

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

You’ll use those same values later when wiring up the MyScale connection. Trust me—keeping it consistent saves a lot of back-and-forth debugging.

MyScaleDB Setup: What Actually Matters

You might be wondering: “Do I really need to touch the schema if LlamaIndex handles it?”

Yes, you do—at least once.

Here’s a stripped-down schema I used for storing vectors:

CREATE TABLE IF NOT EXISTS rag_chunks (
    id UUID,
    content String,
    metadata JSON,
    embedding Array(Float32),
    PRIMARY KEY id
)
ENGINE = MergeTree()
ORDER BY id

What matters here:

  • embedding should be stored as Array(Float32)—MyScale expects this format for cosine similarity.
  • metadata lets you do filtering later (by source, type, timestamp, etc.)
  • Use UUID for IDs—it saves you from painful key collisions.

Also: set your region carefully. I’m based in the US, so I stick to us-east-1. Latency becomes noticeable if you’re pushing or querying across regions, especially with embedding-heavy pipelines.

A Few Pro Tips That Saved Me

  • Connection pooling: MyScale uses ClickHouse under the hood—create a persistent connection using clickhouse-connect instead of reconnecting for every query (see the sketch right after this list).
  • Indexing latency: If you’re uploading tens of thousands of chunks, batch insert your data (I’ll show you how in the next section). Single inserts = pain.
  • Permissions: If you’re setting this up on a shared workspace or team account, make sure the user has INSERT, SELECT, and CREATE TABLE privileges.
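
Here’s a minimal sketch of that persistent connection, reusing the .env values from above (clickhouse-connect wants the bare hostname, so the https:// prefix gets stripped):

import os
import clickhouse_connect
from dotenv import load_dotenv

load_dotenv()

# One long-lived client, reused for every query instead of reconnecting each time
client = clickhouse_connect.get_client(
    host=os.getenv("MYSCALE_HOST", "").replace("https://", ""),
    port=443,
    username=os.getenv("MYSCALE_USERNAME"),
    password=os.getenv("MYSCALE_PASSWORD"),
    database=os.getenv("MYSCALE_DATABASE", "rag_db"),
)

# Quick sanity check that the cluster is reachable
print(client.command("SELECT version()"))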

That’s your foundation. Everything else builds on this—don’t skip it.


4. Data Ingestion & Chunking Logic (Real-world Examples)

“RAG is only as smart as the chunks you feed it.”
I learned this the hard way when I fed my LLM 6,000-token markdown dumps and wondered why the responses were incoherent.

Let me walk you through how I ingest and chunk unstructured data now—this includes PDFs, Markdown, HTML, and scraped web pages.

Ingesting Docs — the Right Way

If your docs live locally, SimpleDirectoryReader does the job. For more complex formats (like HTML or PDFs), I prefer using custom readers—especially if you want to preserve metadata.

Here’s what I used recently to load a mix of files:

from llama_index.readers import SimpleDirectoryReader, BeautifulSoupWebReader
from pathlib import Path

docs_dir = Path("./data")

# Basic ingestion from local folder
documents = SimpleDirectoryReader(docs_dir).load_data()

# In case you’re scraping web content
web_docs = BeautifulSoupWebReader().load_data(urls=[
    "https://example.com/blog-post",
])

Advanced Chunking — Where Most People Go Wrong

This might surprise you: The chunking defaults in LlamaIndex are not optimized for retrieval.

I used to just pass documents directly into the service context, thinking it’d just work. It didn’t.

Instead, here’s how I chunk now—token-aware, recursive, and metadata-preserving:

from llama_index.text_splitter import TokenTextSplitter
from llama_index.schema import Document

# Custom chunker — respects token count and overlaps
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=64
)

# Apply chunking with metadata carried forward
chunked_docs = []
for doc in documents:
    chunks = splitter.split_text(doc.text)
    for chunk in chunks:
        chunked_docs.append(Document(
            text=chunk,
            metadata={
                "source": doc.metadata.get("file_path", "unknown"),
                "doc_type": "markdown",
                "tags": ["ingested", "test"]
            }
        ))

The chunk_overlap is critical. Without it, your context windows get hard breaks mid-thought, and that wrecks retrieval quality.

Bad Chunking = Bad Retrieval (Proof From the Field)

Let me show you what I mean.

Here’s a test I ran using bad chunking (no overlap, 1,024-token chunks):

# Retrieval scores tanked with semantic drift
Result: "The document discusses tax policy..." (it didn’t)

Now with overlap + tighter chunks (512 + 64):

# Same query, same dataset
Result: "The article explains how fiscal policy in 2008 impacted..." 

That difference? All chunking. No model change. No prompt change. Just better context windows.

Pro Tip: Add Metadata Now or Regret It Later

Adding metadata at this stage saves so much debugging later. Personally, I always tag:

  • source (where the doc came from)
  • type (md/pdf/html)
  • ingestion_time (timestamp)
  • tags (optional, like topics or content type)

This becomes useful when you start doing filtered search, tracing queries, or debugging weird results that trace back to one bad doc.
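
For reference, this is roughly the dict I attach to each chunk inside the ingestion loop (a sketch; ingestion_time gets filled at load time):

from datetime import datetime, timezone

chunk_metadata = {
    "source": doc.metadata.get("file_path", "unknown"),      # where the doc came from
    "type": "md",                                            # md / pdf / html
    "ingestion_time": datetime.now(timezone.utc).isoformat(),
    "tags": ["finance", "policy"],                           # optional topic tags
}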


5. Embedding & Indexing in MyScaleDB

“Garbage in, garbage out.”
I used to think embedding was the easy part. But once I scaled past a few thousand chunks, everything from latency to search quality started to fall apart—because I hadn’t structured things properly in MyScale. Here’s how I fixed it.

Step 1: Choose Your Embedding Model (OpenAI or Custom)

Personally, I use OpenAI’s text-embedding-3-small when I want speed and good-enough quality. But for domains like legal or healthcare, I’ve swapped it out for a custom embedding model via HuggingFace + Langchain wrappers.

Here’s the standard setup using OpenAI:

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    dimensions=1536  # Optional but good to be explicit
)

If you’re using a local model with InstructorEmbedding or HuggingFaceEmbedding, just drop it in here—it works out of the box with LlamaIndex.

Step 2: Normalize and Prepare Data

Before embedding, I always normalize the text: remove weird characters, standardize punctuation, and strip boilerplate. You’d be surprised how much cleaner your vector space becomes.

This is my typical transform function:

import re

def preprocess(text):
    # Collapse newlines and repeated whitespace into single spaces;
    # layer your own boilerplate and character cleanup on top of this
    return re.sub(r"\s+", " ", text).strip()

Then apply it to your chunks before embedding.
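
Applying it to the chunked documents from the last section is a one-liner; I rebuild each Document so the metadata rides along:

chunked_docs = [
    Document(text=preprocess(doc.text), metadata=doc.metadata)
    for doc in chunked_docs
]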

Step 3: Set Up MyScale Vector Store

This is where things get serious.

You’ll use MyScaleVectorStore from llama-index-vector-stores-myscale. If you haven’t installed it yet:

pip install llama-index-vector-stores-myscale

Now let’s configure it:

import clickhouse_connect
from llama_index.vector_stores.myscale import MyScaleVectorStore

# MyScaleVectorStore expects a clickhouse-connect client rather than raw connection params
client = clickhouse_connect.get_client(
    host="your-myscale-endpoint.region.aws.myscale.com",
    port=443,
    username="your_user",
    password="your_password",
)

vector_store = MyScaleVectorStore(
    myscale_client=client,
    database="rag_db",
    table="docs_index",
    index_type="IVFFLAT",  # Optional: use HNSW or MSTG for large scale
    metric="cosine",
)

Real-World Schema Design (Trust Me, This Matters)

If you’re indexing tens of thousands of chunks, you must get your schema right from day one. Here’s what’s worked well for me:

CREATE TABLE IF NOT EXISTS docs_index (
    id UUID,
    text String,
    metadata JSON,
    embedding Array(Float32),
    CONSTRAINT check_dim CHECK length(embedding) = 1536,
    PRIMARY KEY (id)
) ENGINE = MergeTree()
ORDER BY id;

Why this works:

  • UUID lets you reindex selectively.
  • metadata stays flexible—you can filter on source, tags, etc.
  • embedding column must match the dimension of your model (1536 for text-embedding-3-small).
  • Using MergeTree with ORDER BY id optimizes point lookups and batch inserts.

Step 4: Upsert Your Vectors

Here’s how I batch everything into MyScale—something I fine-tuned after hitting some weird throttling issues:

from llama_index import VectorStoreIndex
from llama_index.storage.storage_context import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents=chunked_docs,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True  # Optional, but great for visibility
)

That’s it. If you’re doing large-scale ingestion, consider batching in groups of 500–1000 chunks with a sleep in between to avoid timeouts.
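
Here’s the batching pattern I mean, as a rough sketch: start from an empty index bound to the same storage context, then insert page by page (the batch size and sleep are knobs to tune, not gospel):

import time

BATCH_SIZE = 500

# Empty index on the MyScale-backed storage context; chunks get added in pages
index = VectorStoreIndex.from_documents(
    [], storage_context=storage_context, embed_model=embed_model
)

for start in range(0, len(chunked_docs), BATCH_SIZE):
    batch = chunked_docs[start:start + BATCH_SIZE]
    index.insert_nodes(batch)      # chunks are already split, so insert them as nodes
    print(f"Indexed {start + len(batch)} / {len(chunked_docs)} chunks")
    time.sleep(1)                  # brief pause between pages to avoid timeouts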

Pro Tip: Metadata Is Your Secret Weapon for Filtering & Re-ranking

Here’s how I structure metadata inside each Document before indexing:

Document(
    text=chunk,
    metadata={
        "source": "https://example.com/article.pdf",
        "tags": ["finance", "policy"],
        "timestamp": "2024-04-15",
        "doc_id": "report_0423"
    }
)

Later, this lets me:

  • Run filtered vector searches (WHERE metadata->'tags' HAS 'finance'); see the sketch after this list
  • Re-rank based on recency (ORDER BY timestamp DESC)
  • Debug outliers by tracking doc_id
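
On the LlamaIndex side, a filtered search looks roughly like this (a sketch; exact filter support varies by vector store, so check MyScale’s metadata filtering behavior before leaning on it):

from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter

# Only retrieve chunks tagged "finance"
filters = MetadataFilters(filters=[ExactMatchFilter(key="tags", value="finance")])

filtered_retriever = index.as_retriever(similarity_top_k=5, filters=filters)
nodes = filtered_retriever.retrieve("How did fiscal policy change in 2008?")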

This might seem like overkill now, but trust me—it pays off when you’re debugging a rogue LLM output at 2AM.


6. Query Pipeline Using LlamaIndex Query Engines

“A good answer starts with a great question—but a smart pipeline decides what happens in between.”

I’ve spent an unhealthy amount of time tweaking query engines in production RAG systems. And if there’s one thing I’ve learned, it’s this: retrieval isn’t just about matching vectors—it’s about shaping context. Let’s break it down the way I’ve built this in real-world deployments.

Step 1: Set Up the Retriever Engine

You’ll usually want to use RetrieverQueryEngine—it gives you full control over the retrieval step before passing chunks to the LLM.

Here’s the barebones setup:

from llama_index.query_engine import RetrieverQueryEngine
from llama_index.llms import OpenAI
from llama_index.indices.vector_store import VectorStoreIndex

# Assume you’ve already built your index
query_engine = RetrieverQueryEngine.from_args(
    retriever=index.as_retriever(similarity_top_k=5),
    llm=OpenAI(model="gpt-4")
)

This will give you a clean end-to-end retrieval + generation pipeline. But in practice, I almost never stop there.
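
A quick usage example; query() returns a response object whose source nodes you can inspect:

response = query_engine.query("How did fiscal policy in 2008 impact lending?")
print(str(response))                                  # the generated answer
for node in response.source_nodes:                    # the retrieved chunks behind it
    print(node.score, node.node.metadata.get("source"))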

Real Tip: Custom Prompts Make or Break Your Outputs

Depending on whether I want a summary, an answer, or a follow-up question, I swap out prompts dynamically.

You can do this by building the response synthesizer yourself with get_response_synthesizer, like this:

from llama_index.response_synthesizers import get_response_synthesizer
from llama_index.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    "You are an expert answering user queries.\n"
    "Given the context below, answer the question.\n\n"
    "{context_str}\n\nQuestion: {query_str}\nAnswer:"
)

synthesizer = get_response_synthesizer(
    llm=OpenAI(model="gpt-4"),
    text_qa_template=qa_prompt
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=index.as_retriever(similarity_top_k=8),
    response_synthesizer=synthesizer
)

When building internal tools, I’ve wired this to a toggle in the frontend UI: summarization, QA, brainstorm—each with its own prompt set.

Bonus: Add Rerankers (It Makes a Bigger Difference Than You Think)

You might be wondering: What if the top chunks aren’t the most relevant?

That’s where rerankers saved me in a few client projects. You can plug in Cohere or one of LlamaIndex’s built-in rerankers as a node postprocessor on the query engine. Here’s how to use CohereRerank:

from llama_index.retrievers import VectorIndexRetriever
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Cast a wide net with the retriever, then let the reranker keep the best 5
base_retriever = VectorIndexRetriever(index=index, similarity_top_k=20)

reranker = CohereRerank(api_key="your-cohere-api-key", top_n=5)

query_engine = RetrieverQueryEngine.from_args(
    retriever=base_retriever,
    node_postprocessors=[reranker],
    llm=OpenAI(model="gpt-4")
)

If you’re working with sensitive or long documents, rerankers dramatically improve factual accuracy—less fluff, more focus.

Hybrid Retrieval (Keyword + Vector)

Sometimes vector search alone just isn’t enough—especially if you’re dealing with exact terms, names, or IDs. In those cases, I combine keyword (BM25) and vector retrieval and fuse the results with LlamaIndex’s QueryFusionRetriever.

from llama_index.retrievers import QueryFusionRetriever, BM25Retriever

# BM25 needs the chunks locally, so reuse the chunked_docs from the ingestion step
bm25_retriever = BM25Retriever.from_defaults(nodes=chunked_docs, similarity_top_k=5)

hybrid_retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=10), bm25_retriever],
    similarity_top_k=10,
    num_queries=1,               # no query rewriting, just fuse the two result sets
    mode="reciprocal_rerank",    # reciprocal rank fusion of keyword + vector hits
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    llm=OpenAI(model="gpt-4")
)

This saved me when working with legal docs where names had to be matched exactly—but I still wanted semantic recall for surrounding context.

Putting It All Together

Here’s the final pattern I end up using in most projects (combined into a single sketch after the list):

  • RetrieverQueryEngine with hybrid + reranking
  • ResponseSynthesizer with dynamic prompt routing
  • OpenAI GPT-4 or Anthropic Claude depending on token budget
  • Chunk metadata passed all the way for custom scoring in post-processing
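
Assuming the hybrid retriever, reranker, and synthesizer from the snippets above are already defined, the combined wiring is a short sketch:

# Hybrid retrieval -> Cohere rerank -> prompt-controlled synthesis
query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,              # keyword + vector fusion
    node_postprocessors=[reranker],          # trims the candidate set to the best 5
    response_synthesizer=synthesizer,        # carries the custom QA prompt and LLM
)

response = query_engine.query("Which clauses mention indemnification?")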

7. Tracing, Debugging, and Evaluation

“If you can’t see it, you can’t fix it.”

I learned that the hard way—back when a single query started hitting multiple retrievers for no reason, doubling the latency and tripling costs. The issue? A misconfigured reranker. I wouldn’t have caught it without proper tracing.

Observability Stack

Here’s what I use (depending on the project):

  • CallbackManager (LlamaIndex-native, fast to set up)
  • LangSmith (for full tracing across LLM chains)
  • OpenTelemetry (when I need to integrate with existing tracing infra)

In most cases, I start with CallbackManager—it’s built-in and works great for lightweight introspection.

from llama_index.callbacks import CallbackManager, LlamaDebugHandler

debug_handler = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([debug_handler])

query_engine = RetrieverQueryEngine.from_args(
    retriever=index.as_retriever(),
    callback_manager=callback_manager,
    llm=OpenAI()
)

Once that’s in place, you get full visibility into the following (the snippet after the list shows how to pull it out of the handler):

  • Which chunks were retrieved
  • Prompt templates used
  • LLM response times
  • Embedding/token usage
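
To pull that information back out of the handler programmatically, I poke at it like this (a sketch; the helpers come from LlamaDebugHandler, so double-check them against your LlamaIndex version):

from llama_index.callbacks import CBEventType

# Prompt/response event pairs for every LLM call behind the last query
llm_calls = debug_handler.get_llm_inputs_outputs()

# Retrieval events, handy for checking which chunks actually came back
retrieve_events = debug_handler.get_event_pairs(CBEventType.RETRIEVE)

print(f"{len(llm_calls)} LLM calls, {len(retrieve_events)} retrieval steps")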

Real Tip: Traces Will Expose Silly Mistakes

There was this one case where every query hit two different filters… one of which had no matching chunks. Basically, the reranker was wasting cycles on dead data. A simple trace showed the filter conditions didn’t line up with how metadata was stored.

Fixing that saved ~30% latency per call.

Build Custom Evaluators

You might be wondering: How do I know if my retrieval is even good?

Here’s how I evaluate real-world RAG:

1. Precision@k

# Example: basic P@k evaluator (manual logic)
def precision_at_k(retrieved_docs, relevant_docs, k=5):
    hits = sum([1 for doc in retrieved_docs[:k] if doc in relevant_docs])
    return hits / k

I run this offline on labeled queries using golden data, usually after modifying chunking logic.
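
And here’s roughly how I run it over a labeled set (the queries and doc_id labels below are placeholders; report_0423 just reuses the doc_id format from the metadata example earlier):

# Golden queries mapped to the doc_ids that should come back
golden_set = {
    "How did fiscal policy in 2008 impact lending?": ["report_0423"],
    "What are the quarterly reporting requirements?": ["report_0117"],
}

retriever = index.as_retriever(similarity_top_k=5)
scores = []
for query, relevant_ids in golden_set.items():
    retrieved = retriever.retrieve(query)
    retrieved_ids = [n.node.metadata.get("doc_id") for n in retrieved]
    scores.append(precision_at_k(retrieved_ids, relevant_ids, k=5))

print(f"Mean P@5: {sum(scores) / len(scores):.2f}")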

2. Query Match Quality

Here, I compare how well the generated answer overlaps with gold responses. Sometimes I use ROUGE/Token overlap. More often, I just use gpt-4 as a judge, like this:

from llama_index.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))
# contexts = list of retrieved chunk texts; response = the generated answer string
result = evaluator.evaluate(query=query, response=generated_answer, contexts=retrieved_context)
score = result.score

3. Latency + Cost Breakdown

Log all of this—OpenAI tokens, time-to-first-byte, reranker delays. If you’re not logging it, you’re guessing.
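
One lightweight way to get those numbers is LlamaIndex’s TokenCountingHandler plus a plain stopwatch; a sketch (the query string is just an example):

import time
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode
)
callback_manager = CallbackManager([token_counter])

query_engine = RetrieverQueryEngine.from_args(
    retriever=index.as_retriever(),
    callback_manager=callback_manager,
    llm=OpenAI(model="gpt-4"),
)

start = time.perf_counter()
response = query_engine.query("What changed in the 2019 policy update?")
elapsed = time.perf_counter() - start

print(f"latency: {elapsed:.2f}s")
print(f"prompt tokens: {token_counter.prompt_llm_token_count}")
print(f"completion tokens: {token_counter.completion_llm_token_count}")
print(f"embedding tokens: {token_counter.total_embedding_token_count}")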


8. Optimization for Latency and Cost

“Latency is your silent killer; cost is the loud one.”

Honestly, once the pipeline works, this is where I spend most of my time. Tiny changes here compound fast—especially when your app scales.

Reduce Token Usage (Without Killing Accuracy)

1. Prompt Trimming

If your prompts include boilerplate context or overly long instructions—cut it. I use string length + token budget checks before submitting to the LLM.

import tiktoken

# gpt-4 and the text-embedding-3 models use the cl100k_base tokenizer
encoding = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text):
    return len(encoding.encode(text))

2. Chunk Size Tuning

This might surprise you: Too small = noisy; too big = bloated context. I’ve had the best luck with 200–400 token chunks, but it depends on your domain.

3. Embedding Frequency

Don’t embed every time. Cache and reuse.
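
Here’s the kind of cache I mean; a minimal sketch backed by a local JSON file (swap in Redis or SQLite for anything serious), reusing the embed_model from Section 5:

import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("embedding_cache.json")
cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def embed_with_cache(text: str):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:                          # only hit the API on a cache miss
        cache[key] = embed_model.get_text_embedding(text)
    return cache[key]

# ... embed your chunks via embed_with_cache(...) ...
CACHE_PATH.write_text(json.dumps(cache))          # persist the cache between runs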

Real Tip: Use QueryEngineTool to Route Smartly

Not every query needs a vector search. Some are static FAQs. I use QueryEngineTool to route simple questions to pre-written answers and keep RAG for the rest.

from llama_index.tools.query_engine import QueryEngineTool
from llama_index.agent import OpenAIAgent

simple_query_engine = RetrieverQueryEngine.from_args(...)
rag_query_engine = RetrieverQueryEngine.from_args(...)

agent = OpenAIAgent.from_tools([
    QueryEngineTool.from_defaults(query_engine=simple_query_engine, description="Static FAQ"),
    QueryEngineTool.from_defaults(query_engine=rag_query_engine, description="RAG for custom answers")
])

This cut unnecessary retrievals by ~40% for one of my internal tools.

MyScale-Specific Optimizations

If you’re using MyScale (and you should—it’s damn fast), make sure you’re doing this:

1. Cluster Sizing

Start small, scale on demand. I’ve seen idle clusters eating up $$$.

2. Vector Caching

MyScale has caching for repeated queries. Enable it if you haven’t.

3. Index Maintenance

Don’t forget the occasional merge (MyScale is ClickHouse-based, so OPTIMIZE ... FINAL is the maintenance statement you want):

OPTIMIZE TABLE your_table FINAL;

You’d be shocked how much bloat builds up over time—especially with constant upserts during ingestion tests.


9. Optional: Build a Simple Frontend or API

“A model that can’t be accessed is a model that doesn’t exist.”

Once I had the RAG pipeline humming, the next logical step was exposing it—either to internal tools or external users. Personally, I’ve used FastAPI for endpoints and Streamlit for quick demos. Here’s what worked best for me.

FastAPI: Clean, Lightweight RAG Endpoint

Let’s keep this minimal but production-friendly. You send a query, it runs through LlamaIndex, and returns the final response.

# fastapi_rag_app.py
from fastapi import FastAPI, Request
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.llms import OpenAI
import uvicorn

# Assume index is already built + saved
query_engine = RetrieverQueryEngine.from_args(
    retriever=index.as_retriever(),
    llm=OpenAI()
)

app = FastAPI()

@app.post("/query")
async def query_rag(request: Request):
    body = await request.json()
    user_query = body.get("query", "")
    response = query_engine.query(user_query)
    return {"answer": str(response)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)

I’ve wrapped this with an API gateway and deployed on a small container to handle spikes—nothing fancy, but effective.

Real Tip: Prompt Template Control Matters

One of the sneaky problems I ran into was different query types needing slightly different phrasing. For example, summarization vs QA. So I started dynamically injecting prompt templates based on query type—something like:

from llama_index.prompts import PromptTemplate

qa_prompt = PromptTemplate("Answer this question: {query_str}\n\nContext:\n{context_str}")
summary_prompt = PromptTemplate("Summarize this:\n\n{context_str}")

# Pick the template dynamically; update_prompts takes a dict keyed by prompt name
if "summary" in user_query.lower():
    query_engine.update_prompts({"response_synthesizer:text_qa_template": summary_prompt})
else:
    query_engine.update_prompts({"response_synthesizer:text_qa_template": qa_prompt})

This little tweak made the responses feel way more aligned with intent.

Streamlit: Instant UI for Demos

If you’re just showing this off to your team or stakeholders:

# streamlit_app.py
import streamlit as st

# Assumes query_engine is built (or imported) the same way as in fastapi_rag_app.py

st.title("RAG Assistant")

user_input = st.text_input("Ask a question:")

if user_input:
    response = query_engine.query(user_input)
    st.write("### Answer")
    st.write(str(response))

Two files, one Dockerfile, and you’ve got something tangible.


10. Final Thoughts + Real-World Advice

This might sound cliché, but building this end-to-end changed how I approach LLM systems. What starts as a fun experiment gets complex fast—especially when latency, relevance, and cost all start fighting for priority.

Here’s what stood out the most from my build:

What Worked Well

  • LlamaIndex’s modularity was a life-saver. I could swap retrievers, rerankers, and LLMs without rebuilding the whole stack.
  • MyScale’s speed genuinely impressed me. Especially once I got caching and indexing tuned—it handled 1M+ chunks with almost no slowdown.
  • Tool abstraction via QueryEngineTool made multi-query routing a breeze. Underused but powerful.

What Surprised Me

  • MyScale’s latency spikes during index rebuilds or massive upserts. Always keep ingestion and retrieval separate.
  • LlamaIndex’s callback traces revealed more junk queries than I expected. Many came from unclear prompt boundaries.
  • OpenAI token usage balloons when you’re not paying attention to chunk size or context injection logic.

Tips for Scaling to Millions of Docs

  • Compact metadata—don’t store raw content in JSON. Instead, use lightweight identifiers + embed only what’s needed.
  • Chunk intelligently. Use sentence-boundary-aware tokenization. I wrote a custom splitter using spaCy + token budget logic (a simplified sketch follows this list).
  • Compress + batch upserts. MyScale handles big chunks better when you insert in pages, not record-by-record.
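
Here’s the shape of that splitter, heavily simplified: spaCy handles sentence boundaries, tiktoken enforces the token budget. Treat it as a sketch, not the exact production code.

import spacy
import tiktoken

nlp = spacy.load("en_core_web_sm")                  # sentence segmentation
encoding = tiktoken.encoding_for_model("gpt-4")     # token counting

def sentence_aware_chunks(text, max_tokens=400):
    chunks, current, current_tokens = [], [], 0
    for sent in nlp(text).sents:
        n_tokens = len(encoding.encode(sent.text))
        if current and current_tokens + n_tokens > max_tokens:
            chunks.append(" ".join(current))        # close the chunk at a sentence boundary
            current, current_tokens = [], 0
        current.append(sent.text)
        current_tokens += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks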

Self-Hosted Models? Here’s What to Watch For

I’ve swapped OpenAI for models like Mixtral, Gemma, and LLaMA3 via Ollama and vLLM (a minimal drop-in swap is sketched after the list). Here’s the deal:

  • Local models work—but prompt tuning becomes 10x more critical.
  • Make sure your vector search latency doesn’t get overshadowed by generation lag.
  • Add guardrails, especially with chat-style models. Some default responses are way too verbose for production.
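
For reference, the LLM swap itself is usually the easy part. Assuming Ollama is running locally with a pulled model, the drop-in looks something like this (the exact import path depends on your LlamaIndex version):

from llama_index.llms import Ollama

# Local Mixtral served by Ollama; generation is slower, so give it a generous timeout
local_llm = Ollama(model="mixtral", request_timeout=120.0)

query_engine = RetrieverQueryEngine.from_args(
    retriever=index.as_retriever(similarity_top_k=5),
    llm=local_llm,
)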

“You don’t need a bigger model. You need better wiring.”

If there’s one thing I learned, it’s that clever prompt design, fast filtering, and tight retrieval logic beat raw LLM horsepower in most real-world cases.
