1. Why OpenAI + MyScale Is a Killer Stack for Recommendations
“Simple systems scale better. Smart systems personalize better. The sweet spot? You want both.”
Over the past few months, I’ve built and tested a handful of recommendation systems using everything from pure Faiss + Flask setups to vector support in PostgreSQL, and even a few heavier ML pipelines using feature stores.
Here’s the thing I’ve learned the hard way: if you want fast, intelligent recommendations without drowning in infra, OpenAI + MyScale is one of the cleanest stacks I’ve worked with.
It’s not just about getting embeddings and stuffing them in a vector DB. This combo actually plays well in production, and more importantly—it doesn’t fight you when you scale up.
Why this stack works (and works well):
- Blazing-fast vector search at scale (MyScale): MyScale is built on ClickHouse, but it’s not “just ClickHouse.” It has native support for ANN search and can handle real-time workloads without needing extra engineering scaffolding. I’ve personally pushed 10M+ records into it and got sub-100ms retrieval, no exotic setup required.
- Powerful text understanding (OpenAI Embeddings/Completions): I’ve used OpenAI’s text-embedding-3-large to encode everything from product catalogs to job descriptions, and the quality of semantic matching blows traditional TF-IDF or even BERT out of the water. You feed it raw text, and it just gets the context.
- Schema flexibility + low ops overhead: MyScale lets you toss in structured and unstructured data side-by-side. You can filter by price, category, date, or location—and do vector search on top of that. This kind of hybrid search is not just a “nice-to-have”—it’s become a standard in my pipelines.
When This Stack Might Not Be Your Best Bet
I’ll be real with you—it’s not perfect for every scenario. If you’re doing pure offline batch recommendations (like nightly updates for a billion-item catalog) and latency isn’t an issue, you might be better off with a Spark pipeline and a local ANN index.
Or if you’re working with super-specialized data (like medical records or legal docs), you may need custom domain-trained embeddings. In those edge cases, OpenAI’s general-purpose models may not cut it.
But for 90% of the use cases I’ve worked on—personalized feeds, search ranking, job matching, product suggestions—this stack delivers the right blend of simplicity, power, and speed.
2. System Architecture Overview
“Any fool can make a system work once. The real test is: can you scale it, debug it, and deploy it on a Monday morning without losing your mind?”
Here’s the architecture I personally use when I’m building a recommendation system with OpenAI and MyScale. It’s clean, modular, and—most importantly—it works without needing a DevOps team on standby.
I’ve broken it down into the exact components I use, whether I’m testing on my laptop or scaling up in the cloud.
Core Components
1. Data Ingestion Pipeline
This is your foundation. You need a pipeline that can reliably pull, clean, and normalize your data—whether it’s clickstream logs, user interactions, or product metadata.
Personally, I use a simple pandas-based loader when I’m prototyping:
import pandas as pd
df = pd.read_csv("user_events.csv")
df = df.drop_duplicates(subset=["user_id", "item_id"])
df = df[df["event_type"].isin(["click", "view", "purchase"])]
When I’m scaling, I switch over to a scheduled job that pulls from our data warehouse (ClickHouse, BigQuery, whatever the project is using). But the shape of the data always ends up the same: user_id, item_id, timestamp, metadata.
2. Embedding Layer (OpenAI)
This is where the magic happens. I use OpenAI’s text-embedding-3-large
model to generate embeddings for item descriptions, user search queries, and sometimes even support tickets or product reviews.
import openai

openai.api_key = "..."

def embed_text(text: str) -> list[float]:
    response = openai.embeddings.create(
        model="text-embedding-3-large",
        input=text,
    )
    return response.data[0].embedding
I usually wrap this in a caching layer using SQLite or Redis. Embeddings don’t change often, so there’s no point paying for the same call twice.
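Here’s a minimal sketch of what that Redis wrapper can look like, assuming a local Redis instance and the embed_text helper above (the key scheme and expiry policy are yours to tune):
import json
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def embed_text_cached(text: str) -> list[float]:
    # Key on a hash of the text so long inputs don't blow up the key size
    key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    embedding = embed_text(text)  # the OpenAI call defined above
    r.set(key, json.dumps(embedding))
    return embedding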
3. Storage & Querying Layer (MyScale)
This might surprise you: MyScale isn’t just fast, it’s developer-friendly. I use it to store both metadata and vector embeddings, and I query them together using hybrid search.
Here’s how I define the schema:
CREATE TABLE items (
item_id String,
title String,
category String,
embedding Array(Float32)
) ENGINE = VectorEngine('cosine', 3072); -- 3072 = default output dimension of text-embedding-3-large
That VectorEngine syntax handles all the ANN indexing. No external Faiss server. No Kubernetes magic. Just straight SQL.
4. Retrieval & Ranking Logic
This is the part you customize based on your use case. I usually start with a cosine similarity search on the embedding, then apply a rerank based on metadata filters (category match, freshness score, etc.).
SELECT item_id, title
FROM items
WHERE category = 'books'
ORDER BY cosine_distance(embedding, your_query_embedding)
LIMIT 10;
Sometimes I rerank with OpenAI’s gpt-4
by sending a prompt that includes the user context + top retrieved results. Not always necessary, but it helps for edge cases where semantics matter more than proximity.
5. Deployment Context
Now for the real-world stuff:
- Local Dev: I run everything in Docker. MyScale gives you a local dev image that’s solid for experimentation.
- Cloud Scale: For production, I use MyScale Cloud. It has a SQL interface, auto-scaling, and integrates cleanly with API gateways.
- OpenAI Rate Limits: Don’t ignore this. I’ve hit the rate limits more times than I’d like to admit. Batch your requests, cache aggressively, and always have a retry policy.
Architecture Diagram
If you’re sharing this with your team or planning to productionize, drop in a quick diagram here. Tools like Excalidraw or dbdiagram.io are great for this. Keep it high-level: data in → embedding → store → search → rank → serve.
3. Preparing the Dataset
“Your model is only as smart as your data is clean.”
Let me be real with you—every time I’ve built a recommender system, the hardest part hasn’t been the model, the vector search, or even the deployment. It’s always the dataset prep. And not because it’s technically hard—it’s because the devil lives in the corner cases.
This section shows you how I usually approach it when I’m working with raw interaction logs and item metadata. No theory, just code and decisions you’ll need to make.
Load & Format: Real Data, Not Toy Examples
Personally, I almost always start with CSVs or Parquet files. In production, it’s often OLAP sources like ClickHouse or Snowflake, but for now, here’s what I do to get started:
import pandas as pd
# Load interaction data
df = pd.read_csv("interactions.csv")
# Drop malformed or partial rows
df = df.dropna(subset=["user_id", "item_id", "event_type"])
# Normalize item IDs and timestamps
df["item_id"] = df["item_id"].astype(str)
df["timestamp"] = pd.to_datetime(df["timestamp"])
Here’s the deal: you want clean interaction logs that reflect real user behavior. That means filtering out spammy sessions, bots, or accidental page reloads.
Real-World Preprocessing: What I Actually Do
This might surprise you, but I always apply a couple of battle-tested heuristics:
# Remove users with very few interactions
user_counts = df["user_id"].value_counts()
df = df[df["user_id"].isin(user_counts[user_counts >= 5].index)]
# De-duplicate rows where users interact with the same item multiple times in a row
df = df.sort_values(["user_id", "timestamp"])
df = df.drop_duplicates(subset=["user_id", "item_id"], keep="last")
Why these steps?
- Users with fewer than 5 interactions give almost no signal.
- Repeated views in short succession? Usually noise. I drop them unless I’m modeling dwell time separately.
Merge With Metadata (Hybrid FTW)
I’m a big fan of hybrid recommenders. Behavior tells you what users did, but metadata tells you why. I always merge in item features before generating embeddings:
items_df = pd.read_csv("items.csv")
# Join metadata into the interaction set
df = df.merge(items_df, on="item_id", how="left")
# Optional: filter out items missing descriptions
df = df[df["description"].notnull()]
If you’re embedding item descriptions, titles, or categories, now’s the time to validate that every row has that info. No point generating embeddings for NULLs.
Other Cleanup You Shouldn’t Skip
With my experience, here are the things that have saved me more than once:
- Sessionization: If you’re building session-based recommenders, group by user and break sessions by inactivity (30+ min gap usually works). There’s a quick sketch of this right after the list.
- Frequency cutoffs: Drop items that only appear once or twice. They’re not learnable.
- Cold start flags: Add a boolean is_cold_start column so you can test fallback strategies later.
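Here’s roughly how I’d do the sessionization step in pandas, assuming the cleaned df from above with user_id and timestamp columns (the 30-minute gap is a starting point, not gospel):
import pandas as pd

# Sort so gaps are computed within each user's own timeline
df = df.sort_values(["user_id", "timestamp"])

# Time since the user's previous event
gap = df.groupby("user_id")["timestamp"].diff()

# Start a new session on the first event or after a 30+ minute gap
new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))

# Cumulative count of session starts per user gives a session index
df["session_id"] = new_session.groupby(df["user_id"]).cumsum()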
4. Generating Embeddings with OpenAI
“You don’t feed the machine raw text—you feed it meaning, compressed into vectors.”
Let’s get real: OpenAI’s embedding models are fast, consistent, and honestly kind of addictive once you get into the groove. I’ve used text-embedding-3-large
in multiple projects now, and it’s become my default for anything semantic—product titles, article descriptions, user bios, you name it.
Here’s how I usually set up my embedding pipeline.
Embedding Function: Clean, Simple, Reusable
import openai
import time
openai.api_key = "your-api-key"

def get_embedding(text, model="text-embedding-3-large"):
    try:
        response = openai.embeddings.create(
            model=model,
            input=text,
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Embedding failed: {e}")
        return None
A couple of notes here from my experience:
- Stick to text strings under 8192 tokens for this model.
- If you’re embedding structured objects (like JSON metadata), flatten them first—OpenAI works best with well-formatted, readable text.
Batch Handling + Rate Limits
This might surprise you, but rate limits will bite you if you don’t batch smartly. I’ve been there.
Here’s how I handle it:
from tqdm import tqdm
def embed_batch(texts, sleep_time=1.0):
    embeddings = []
    for text in tqdm(texts):
        emb = get_embedding(text)
        embeddings.append(emb)
        time.sleep(sleep_time)  # crude rate-limit management
    return embeddings
Is sleep_time=1.0 ideal? Not really. In production, I recommend using exponential backoff and OpenAI’s Retry-After headers. But for prototyping, this gets the job done.
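If you want something closer to production-grade, here’s a rough backoff wrapper around the embedding call above (a sketch; tune the retry count and base delay to your own rate limits):
import random
import time

import openai

def get_embedding_with_backoff(text, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            response = openai.embeddings.create(
                model="text-embedding-3-large",
                input=text,
            )
            return response.data[0].embedding
        except openai.RateLimitError:
            # Exponential backoff with a little jitter: ~1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
    raise RuntimeError("Embedding failed after repeated rate-limit errors")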
Caching Is Not Optional
Personally, I never call OpenAI twice for the same text. I cache everything locally—either as a pickle file or in something like SQLite if I want searchability.
Here’s a quick pattern I use:
import hashlib
import pickle
import os
def hash_text(text):
    return hashlib.sha256(text.encode()).hexdigest()

def get_cached_embedding(text, cache_dir="embeddings/"):
    os.makedirs(cache_dir, exist_ok=True)
    file_path = os.path.join(cache_dir, f"{hash_text(text)}.pkl")
    if os.path.exists(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(f)
    emb = get_embedding(text)
    with open(file_path, "wb") as f:
        pickle.dump(emb, f)
    return emb
It’s not fancy, but it works—especially when you’re iterating over large catalogs or long-form documents.
Optional Tweaks I’ve Actually Used
Let’s say you want to push retrieval quality a little further. Two things I’ve tried (and sometimes recommend):
- Dimensionality reduction using PCA or UMAP—helps when you’re embedding large metadata blobs.
- L2 normalization—especially useful when using cosine distance in MyScale.
from sklearn.preprocessing import normalize
import numpy as np
# Normalize a list of vectors for cosine similarity
normalized_embeddings = normalize(np.array(embeddings), norm='l2')
I usually apply this just before storing embeddings to MyScale, since it ensures cosine distance behaves as expected.
5. Ingesting Embeddings into MyScale
“It doesn’t matter how good your vectors are if you can’t query them fast enough.”
At this stage, you’ve got your embeddings—cleaned, cached, and (ideally) normalized. Now comes the part where I’ve seen a lot of folks overcomplicate things: loading those vectors into a vector database.
Personally, I’ve found MyScale to be one of the more straightforward options, especially if you’re already comfortable with ClickHouse-style SQL.
Let me walk you through how I typically set this up.
Table Schema: Define Once, Reuse Everywhere
Here’s the schema I’ve used in production for a hybrid recsys setup. It’s flexible enough to store not just the embedding, but also item metadata you’ll likely need during ranking.
CREATE TABLE items (
id String,
title String,
category String,
tags Array(String),
embedding Array(Float32)
) ENGINE = VectorEngine('cosine', 3072) -- 3072 = default output dimension of text-embedding-3-large
A few notes from my own setup:
- Array(Float32) is mandatory—don’t use Float64 here or you’ll get weird type coercion issues at query time.
- I always align the embedding dimension with the model I used (e.g., 3072 for text-embedding-3-large, or whatever you passed as the dimensions parameter if you shrank the vectors).
- VectorEngine('cosine', ...) is ideal when you’re planning to query via cosine similarity. I’ve tested dot product too, but cosine tends to win out for most semantic use cases.
Python Ingestion Script: Keep It Simple
Here’s how I push data into MyScale using their Python driver. I usually wrap this in a batch uploader, especially when I’m dealing with hundreds of thousands of records.
from clickhouse_driver import Client
import pandas as pd
client = Client('your-myscale-host')
# Example row: {"id": "123", "title": "iPhone 15", "category": "Electronics", "tags": ["apple", "phone"], "embedding": [...]}
def insert_items_to_myscale(df):
    data = [
        (
            row["id"],
            row["title"],
            row["category"],
            row["tags"],
            row["embedding"]
        )
        for _, row in df.iterrows()
    ]
    client.execute("""
        INSERT INTO items (id, title, category, tags, embedding)
        VALUES
    """, data)
This might look straightforward, but I’ve learned a few things the hard way:
- Ensure the order and types match exactly with your table schema—MyScale is strict.
- If you’re using nested or complex types like Array(String), make sure they’re properly formatted Python lists.
- For large uploads, I chunk the inserts—around 1,000 rows per batch is a good start for performance without overloading the driver (see the sketch below).
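The chunking itself is nothing fancy. Here’s the kind of wrapper I’d put around insert_items_to_myscale (a sketch; 1,000 is a starting point, not a magic number):
def insert_in_chunks(df, chunk_size=1000):
    # Slice the DataFrame into fixed-size batches and insert each one
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        insert_items_to_myscale(chunk)
        print(f"Inserted rows {start}-{start + len(chunk) - 1}")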
Final Tip: Test Before You Trust
Before you go all-in, always query back a few records to make sure the embeddings are actually there and valid. I’ve once had a whole batch silently fail because of a malformed float in one embedding array.
SELECT id, length(embedding) FROM items LIMIT 5;
If your lengths are inconsistent or null, stop there and fix your ingestion logic first. Debugging this after you’ve indexed 10M items is a pain I wouldn’t wish on anyone.
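If you’d rather run that check from Python as part of the ingestion script, a small assertion does the job (reusing the clickhouse_driver client from above; 3072 is the dimension assumed in the schema earlier):
rows = client.execute("SELECT id, length(embedding) FROM items LIMIT 5")
for item_id, dim in rows:
    # Every embedding should match the dimension declared in the table schema
    assert dim == 3072, f"Item {item_id} has a {dim}-dim embedding, expected 3072"
print("Sample embeddings look consistent:", rows)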
6. Building the Retrieval System
“An embedding is only as good as the system that knows what to do with it.”
At this point, you’ve got your vectors sitting in MyScale, but they’re just raw potential. The real magic starts when you can query them intelligently. This is where I’ve spent a good chunk of my time—making retrieval feel instant, accurate, and personalized, especially in high-traffic environments.
Let me walk you through the stack I typically use to turn those static vectors into responsive, hybrid retrieval systems.
Querying with Vector Similarity + Metadata Filtering
This might surprise you, but raw vector similarity alone rarely cuts it. In almost every real system I’ve worked on, I had to blend it with metadata constraints: language, category, availability, whatever mattered for that domain.
Here’s the kind of query I actually run in MyScale when doing hybrid search:
SELECT id, title
FROM items
WHERE category = 'electronics'
ORDER BY cosine_distance(embedding, {target_embedding}) ASC
LIMIT 10
You can plug in any metadata dimension here—I’ve used everything from user segments to regional filters.
One tip from experience: make sure you index the metadata fields you’ll filter on. Otherwise, even a fast vector search slows down when it’s dragging unindexed filters along.
ANN Optimization for Top-K Search
Now, on large-scale systems (think millions of items), brute-force cosine comparison gets expensive. Personally, I rely on MyScale’s built-in approximate nearest neighbor engine when latency becomes a bottleneck.
No need to reinvent the ANN wheel here. Just enable the engine at table creation:
ENGINE = VectorEngine('cosine', 3072, 'hnsw')
From there, you don’t have to change your query—it runs via ANN under the hood. But in my experience, I still sanity-check ANN results against brute-force on a random sample. Especially if you’re tuning recall vs latency.
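A quick way to do that sanity check: measure the overlap between ANN top-K and exact top-K on a sample of query vectors. Here’s a rough sketch; ann_search and exact_search are placeholders for however you issue the two query variants, not real MyScale client calls:
import random

def topk_overlap(ann_ids, exact_ids, k=10):
    # Fraction of the exact top-k that the ANN search also returned
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

def ann_sanity_check(query_embeddings, k=10, sample_size=100):
    sample = random.sample(query_embeddings, min(sample_size, len(query_embeddings)))
    scores = []
    for emb in sample:
        ann_ids = ann_search(emb, k)      # ANN-backed query (placeholder)
        exact_ids = exact_search(emb, k)  # brute-force cosine query (placeholder)
        scores.append(topk_overlap(ann_ids, exact_ids, k))
    return sum(scores) / len(scores)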
Cold-Start Handling (Yes, It Still Matters)
You might be wondering: What about when there’s no history? Cold-start is still one of those annoyances that creeps back in every project.
What I’ve found works best is having a fallback query baked into the pipeline—usually something like this:
SELECT id, title
FROM items
WHERE category = 'electronics'
ORDER BY created_at DESC
LIMIT 10
I’ve also experimented with boosting popular items or recent trends (think click-throughs, trending tags). But honestly, fallback logic should be tuned to your product’s UX, not just the data.
Here’s how I usually do it in code—fallback logic wrapped into the retrieval layer:
def retrieve_items(embedding, category):
    # Hybrid ANN + metadata query (the SQL above)
    results = vector_search(embedding, category)
    if not results:
        # Fallback: e.g., newest or most popular items in the category
        results = popularity_fallback(category)
    return results
This isn’t just about coverage—it’s about confidence. Your system should never return nothing. I’ve seen that kill user trust more than bad recommendations ever could.
Real Example: Search in Action
Let me give you a real-world-style input/output flow I’ve run in testbeds.
Input:
{
"embedding": [0.02, -0.013, ..., 0.005],
"category": "electronics"
}
Returned Top-3 Results:
[
{"id": "abc123", "title": "Sony WH-1000XM5 Noise Cancelling Headphones"},
{"id": "def456", "title": "Apple iPhone 15 Pro"},
{"id": "ghi789", "title": "Samsung 49\" UltraWide Monitor"}
]
In one of my past projects, just switching from naive keyword filtering to this hybrid search bumped engagement up by ~14%. Why? Because the results started making sense to the user, not just to the data.
7. Optional: Ranking with OpenAI Completions
“Vector search gets you close. Language models help you pick the winner.”
There have been a few projects where I needed that extra layer of nuance—cases where a basic Top-K vector search felt too mechanical. That’s when I started layering in GPT for semantic re-ranking. Think of it like having a smart assistant who reads all your candidates and says, “Here’s the one that actually makes sense for what you’re asking.”
Let me show you the exact way I do it.
Prompt Engineering for Re-Ranking
You might be wondering how I prompt GPT to act like a ranking engine. I’ve experimented with a bunch of formats, but this one’s been the most stable:
import openai
openai.api_key = "..."
def rerank_items(query, candidates):
    prompt = f"""You're an intelligent search assistant. A user searched for: "{query}"
Here are some items:
{chr(10).join([f"{i+1}. {item['title']}" for i, item in enumerate(candidates)])}
Rank these items by how well they answer the user's query. Respond with the numbers in ranked order, like: 3, 1, 2
"""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=20
    )
    return response.choices[0].message.content
I’ve used this same strategy with product search, Q&A systems, and even internal knowledge bases. When I added GPT re-ranking on top of vector retrieval, click-throughs jumped noticeably—especially in long-tail queries.
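One practical note: the model replies with a plain string like "3, 1, 2", so you need a small parsing step before you can reorder the candidates. Here’s the kind of defensive parser I’d pair with it (a sketch; it falls back to the original vector-search order if the reply doesn’t parse cleanly):
def apply_ranking(candidates, ranking_text):
    try:
        # "3, 1, 2" -> [2, 0, 1] (zero-based indices into candidates)
        order = [int(tok.strip()) - 1 for tok in ranking_text.split(",")]
        if sorted(order) != list(range(len(candidates))):
            return candidates  # model skipped or invented an index
        return [candidates[i] for i in order]
    except ValueError:
        return candidates  # unparseable reply, keep the original order

# Usage:
# ranked_text = rerank_items(query, candidates)
# candidates = apply_ranking(candidates, ranked_text)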
Bonus: Summary Generation for Result Explanation
Sometimes, you don’t just want to show a list—you want to explain why these are good results. GPT can handle that too, with a slight prompt switch:
def summarize_match(query, top_item):
    prompt = f"""A user searched for: "{query}".
You retrieved the following item: "{top_item['title']}".
Explain in one sentence why this is a good match."""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=60
    )
    return response.choices[0].message.content
I’ve used that for UX overlays—like little “why we recommended this” blurbs. Adds a layer of transparency, which users actually seem to trust.
Now, let’s wrap things up with the serving stack.
8. API Layer and Deployment Considerations
“If a rec system doesn’t have an API, does it even exist?”
I’ve shipped a few of these systems into production—small teams, fast turnarounds—and here’s the backend setup I keep coming back to: FastAPI + MyScale + OpenAI + caching layer, wrapped with Docker for sanity.
Let’s break it down.
Minimal FastAPI App for Recs
Here’s a lean version of the API I’ve actually deployed:
from fastapi import FastAPI, Request
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class Query(BaseModel):
    query: str
    category: str

@app.post("/recommend")
def recommend(q: Query):
    # Step 1: Embed query
    query_embedding = get_embedding(q.query)
    # Step 2: Retrieve candidates
    candidates = retrieve_items(query_embedding, q.category)
    # Step 3: Optional re-rank
    ranked = rerank_items(q.query, candidates)
    return {"results": ranked}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
You can wire this up to whatever frontend or batch job needs real-time results.
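For a quick smoke test, I just hit the endpoint from a script (this assumes the app is running locally on port 8000):
import requests

payload = {"query": "noise cancelling headphones", "category": "electronics"}
resp = requests.post("http://localhost:8000/recommend", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["results"])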
Caching Embeddings + Rate Limiting
This might seem obvious, but trust me—cache your embeddings. You don’t want to hit OpenAI every time someone searches “iPhone 15”.
Here’s how I’ve handled it with a dead-simple in-memory cache (just swap for Redis in prod):
embedding_cache = {}

def get_embedding(text):
    if text in embedding_cache:
        return embedding_cache[text]
    response = openai.embeddings.create(
        model="text-embedding-3-large",
        input=text
    )
    embedding = response.data[0].embedding
    embedding_cache[text] = embedding
    return embedding
Add retries, exponential backoff, and you’re good for 99% of edge cases.
Deployment Stack: Docker + Gunicorn
Here’s a trimmed Dockerfile I’ve used to deploy these on AWS and GCP:
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "app:app", "--bind", "0.0.0.0:8000"]
And don’t forget HTTPS setup if you’re exposing this externally—I usually throw in Caddy or use a managed reverse proxy.
9. Monitoring, Evaluation, and Logging
“If you can’t measure it, you’re just guessing.”
Once I had my first version of the recommendation engine running, I learned the hard way that solid retrieval wasn’t enough. Without evaluation and feedback loops, you’re flying blind. And when it comes to production systems, I don’t like flying blind.
Let me walk you through how I’ve built this out in real systems.
Offline Evaluation: Precision@k and Recall@k
Early on, before we ever went live, I used precision@k and recall@k for offline evaluation. These are the go-to metrics when you’re working with ground truth sets.
Here’s a quick snippet I’ve reused across projects:
def precision_at_k(predictions, ground_truth, k):
    return sum([1 for item in predictions[:k] if item in ground_truth]) / k

def recall_at_k(predictions, ground_truth, k):
    return sum([1 for item in predictions[:k] if item in ground_truth]) / len(ground_truth)
I usually evaluate these across a few thousand historical queries. The trick is to always benchmark against a strong baseline—like a popularity-based model—so you’re not optimizing in a vacuum.
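If you don’t have a baseline handy, a popularity recommender is a few lines of pandas and is usually surprisingly hard to beat. A sketch, where train_df is your historical interactions and heldout_items_per_user is whatever ground-truth split you evaluate against (both names are assumptions here):
def popularity_baseline(train_df, k=10):
    # Most-interacted items across the whole log, recommended to every user
    return train_df["item_id"].value_counts().head(k).index.tolist()

# Score it with the same metrics you use for the real pipeline
baseline_preds = popularity_baseline(train_df, k=10)
scores = [precision_at_k(baseline_preds, truth, 10) for truth in heldout_items_per_user]
print("Popularity baseline precision@10:", sum(scores) / len(scores))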
Real-Time Logging: Closing the Loop
This might surprise you: some of my best ranking insights didn’t come from offline metrics—they came from real-world user behavior.
Here’s how I log interactions:
from datetime import datetime

@app.middleware("http")
async def log_requests(request: Request, call_next):
    response = await call_next(request)
    log = {
        "path": request.url.path,
        "query": (await request.json()) if request.method == "POST" else {},
        "status_code": response.status_code,
        "timestamp": datetime.utcnow().isoformat()
    }
    send_to_logging_service(log)  # whatever logging sink you use
    return response
And yes, I’ve actually wired this into a user clickstream pipeline. I log impressions, clicks, dwell time, and even skipped items. You’d be surprised how often the second item is clicked more than the first—that’s a re-ranking signal you can’t ignore.
Observability with MyScale, Grafana, Prometheus
If you’re using MyScale, it already exposes query performance and metadata you can hook into. What I usually do is connect that into Prometheus for metrics collection, and then visualize with Grafana.
You can set up alerting like:
- Query latency > 500ms
- Retrieval count < expected
- Embedding cache miss rate > threshold
These aren’t just fancy dashboards—they’ve helped me catch model drift, caching bugs, and even accidental zero-result retrievals before users did.
10. Conclusion + What’s Next
Let’s wrap this up.
You’ve seen how to go from raw text to embeddings, hybrid retrieval, GPT-based re-ranking, and full-stack deployment. This isn’t just theory—it’s the kind of setup I’ve shipped, tuned, broken, and rebuilt across multiple products.
But here’s the deal: this is just the foundation. The real fun starts after launch.
What’s Next?
Here are a few next steps I’ve taken on projects that needed to level up:
- Real-Time Personalization: Use session data or user history to re-weight your Top-K candidates before re-ranking.
- Feedback Loops: Build systems that learn from user actions over time—clicks, skips, saves.
- Massive Scale: If you’re dealing with millions (or billions) of items, look into distributed vector stores, approximate reranking, and asynchronous retrieval pipelines.
“You don’t build a recommendation system—you grow one.”
And that’s been my experience. What starts as a basic Top-K engine slowly evolves into a multi-layered learning loop.
Ready to build your next layer?
