How to Prepare Data for AI?

1. Introduction

“The model’s only as smart as the data you feed it.”

I’ve seen this play out over and over — teams pouring months into model tuning, only to get mediocre results because the dataset was a mess. The real bottleneck? It’s almost never the model architecture. It’s always the data — its structure, consistency, and contextual relevance.

In this guide, I’m not going to walk you through the basics. If you’re here, you already know how to remove nulls or do a quick .fillna(0).

I won’t waste your time with toy datasets either — this is about real-world data, the kind that breaks pipelines and sneaks in subtle leakage if you’re not careful.

Everything I’m going to cover comes from projects I’ve worked on personally — production-grade systems, data pipelines that needed to scale, and models that couldn’t afford to get things “mostly right.” If I share a method, it’s because I’ve used it and know it works.

Here’s what I will cover:

  • How to shape your data to fit your model’s needs (and not the other way around).
  • How to catch issues early — the kind that silently kill model performance.
  • And how to prepare data with deployment in mind, not just training.

Let’s get into it.


2. Aligning Data Preparation with Model Objectives

Before touching a line of transformation code, I always ask: what is the model actually trying to do?

You’d be surprised how often I’ve seen pipelines where the feature engineering seems completely disconnected from the objective. It’s easy to fall into that trap — I’ve been there myself.

Here’s the deal: the way you prepare your data should be tightly coupled with the type of problem you’re solving and the behavior you want from the model.

Let me give you two examples from my own work:

Case 1: Classification with Class Imbalance

I was working on a customer churn model where only ~3% of users ever churned. At first glance, the model looked decent — 92% accuracy. But you and I both know that’s meaningless when 97% of the samples are negative.

What helped? Visualizing the label distribution early and planning resampling strategies before training.

import seaborn as sns
sns.countplot(x=df['target_label'])  # Simple but often skipped

This helped me catch the imbalance upfront. Depending on the case, I’d either:

  • Downsample the majority class using stratified sampling (quick sketch below).
  • Or use more advanced approaches like target-aware augmentation or generative oversampling with CTGAN.
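
For the first option, here’s a minimal sketch of what I mean — assuming a binary target column and a hypothetical region segment to stratify on (the 5:1 ratio is a judgment call):

import pandas as pd

majority = df[df['target'] == 0]
minority = df[df['target'] == 1]

# Downsample the majority class to roughly 5x the minority size
frac = min(1.0, (len(minority) * 5) / len(majority))

majority_down = (
    majority.groupby('region')
            .sample(frac=frac, random_state=42)  # sample within each region to preserve its mix
)

df_balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)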

Case 2: NLP with Long-Context Inputs

In one NLP project — a summarization model with customer support transcripts — token length distribution became critical. If 60% of your samples get truncated, you’re not training on the content you think you are.

Here’s a quick diagnostic trick I use:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
df['token_count'] = df['text'].apply(lambda x: len(tokenizer.tokenize(x)))
df['token_count'].describe()

From there, I might redesign the truncation logic, resegment the data, or switch to models with longer context windows (like Longformer or GPT-Neo), depending on the distribution.
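
If I go the resegmentation route, a rough sketch looks like this — reusing the tokenizer from above, with window and stride sizes that are purely illustrative:

def split_into_windows(text, max_tokens=512, stride=128):
    tokens = tokenizer.tokenize(text)
    windows = []
    step = max_tokens - stride  # consecutive windows overlap by `stride` tokens
    for start in range(0, max(len(tokens), 1), step):
        chunk = tokens[start:start + max_tokens]
        windows.append(tokenizer.convert_tokens_to_string(chunk))
        if start + max_tokens >= len(tokens):
            break
    return windows

df['segments'] = df['text'].apply(split_into_windows)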


3. Schema Validation and Data Contracts

“Trust, but verify. Especially when the data’s coming from a microservice you didn’t build.”

I’ve been burned enough times by silent data drift that I now treat schema validation as non-negotiable — especially when the data flows through multiple hands before it hits your model.

Let me be blunt: if you’re not enforcing schema contracts, you’re not production-ready. I’ve had models crash in prod because someone upstream decided to send null instead of an empty string — and that was on me for not locking it down.

When I want automated, reusable validation, I reach for tools like pandera, pydantic, or great_expectations. They each shine in different parts of the stack. Pandera’s been my go-to when working directly with DataFrames.

Here’s a basic, yet real-world, schema I’ve used:

import pandera as pa
from pandera import Column, DataFrameSchema

schema = DataFrameSchema({
    "customer_id": Column(pa.String),
    "event_timestamp": Column(pa.DateTime),
    "purchase_amount": Column(pa.Float, checks=pa.Check.gt(0)),
})

# Validates or raises with clear error logs
schema.validate(df)

This catches:

  • Wrong dtypes (e.g., someone sneaking in an int ID instead of a string UUID).
  • Invalid values (like negative purchases).
  • Missing or renamed columns (which happen more than you’d like to admit).

CI/CD Integration — Breaking the Build on Schema Drift

I always wire this into the pipeline — either in a CI/CD step or as a standalone data_contracts check. Think of it like a unit test for your data inputs.

If a new data file breaks schema expectations, I want the pipeline to fail. Silently accepting garbage only leads to more work later — especially when it’s subtle things like float64 vs float32.

In production pipelines, I’ve used Great Expectations to validate batch loads before ingestion. For APIs, pydantic works beautifully when coupled with FastAPI.
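
For context, here’s roughly what that pydantic + FastAPI combination looks like — a minimal sketch with a hypothetical /predict endpoint that mirrors the pandera schema above:

from datetime import datetime

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class PredictionRequest(BaseModel):
    customer_id: str
    event_timestamp: datetime
    purchase_amount: float = Field(gt=0)  # same rule as the pandera check

@app.post("/predict")
def predict(payload: PredictionRequest):
    # Malformed payloads get rejected with a 422 before they ever reach the model
    return {"customer_id": payload.customer_id, "score": 0.42}  # placeholder score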


4. Data Leakage and Temporal Leakage Prevention

“If your model’s accuracy is too good to be true, it probably is.”

I learned this the hard way. A few years ago, I shipped a model that showed 97% F1 score in validation… but flopped in production. Turned out, one of my time-based aggregations was leaking future data.

This might sound obvious, but the subtle ways leakage creeps in will surprise you. I’ve seen it happen through:

  • Joins where lookup tables include future knowledge.
  • Forward-filling time series without respecting causality.
  • Rolling windows that peek ahead instead of lagging properly.

Let me show you how I approach this in code — this is straight from a recommender pipeline I worked on:

# Step 1: Sort your data properly
df = df.sort_values(['user_id', 'event_time'])

# Step 2: Shift before rolling so the window never includes the current event
df['rolling_mean'] = (
    df.groupby('user_id')['purchase_amount']
    .transform(lambda s: s.shift(1).rolling(3).mean())  # shift FIRST, and stay inside each user's own history
)

By shifting first, I’m making sure the rolling mean only uses past events. That shift is crucial. Miss it, and you’re effectively using the current (or even future) data to create your feature.

Rule of thumb I follow:

If you couldn’t have had that data at prediction time, don’t let it into your features.

You can even wrap this logic into a test — I write simple guards that verify no label or target variables are included in the feature matrix. Tools like Deepchecks or WhyLabs can help automate this at scale, but even basic assert statements have saved me from public embarrassment.
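
To make that concrete, here’s the kind of guard I mean — a minimal sketch, assuming a feature_matrix DataFrame and the rolling_mean feature from above (column names are obviously project-specific):

# Guard 1: no label-like columns in the feature matrix
LEAKY_COLUMNS = {'target', 'churned', 'label'}
overlap = LEAKY_COLUMNS & set(feature_matrix.columns)
assert not overlap, f"Leaky columns found in features: {overlap}"

# Guard 2: lagged features must be empty on a user's first event — otherwise something peeked ahead
first_events = df.sort_values('event_time').groupby('user_id').head(1)
assert first_events['rolling_mean'].isna().all(), \
    "rolling_mean is populated on first events — check the shift logic"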


5. Feature Engineering at Scale

“At small scale, it’s art. At production scale, it’s architecture.”

There’s a moment in every project where you realize your carefully crafted sklearn.Pipeline just doesn’t scale. I’ve been there. What worked locally starts breaking down once you hit tens of millions of rows or need to recompute features daily across distributed systems.

That’s when I started reaching for tools like Featuretools, Polars, and dbt — not for the buzzwords, but because I had to. Manually recreating joins and lags across 10+ entities? I’ve done it, and I’m not going back.

Here’s a real example from a sales prediction pipeline I built where we had to aggregate behavioral data across customers, products, and transactions. I used featuretools to build deep, entity-based features in a repeatable, testable way.

import featuretools as ft

# Set up entity set
es = ft.EntitySet(id="sales_data")

es = es.add_dataframe(
    dataframe_name="transactions",
    dataframe=df,
    index="transaction_id",
    time_index="event_time"
)

# Autogenerate deep features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="transactions"
)

This gave us features like:

  • Mean transaction amount in the last 30 days
  • Time since last purchase per user
  • Unique items bought by user in past N sessions

And yes, they were deployable. We stored them in a Feast feature store, versioned by model, so the training-serving skew was zero. I can’t emphasize this enough — if your features during inference don’t exactly match training, your model is going to fail silently.

When I needed tighter control over performance, I also leaned into Polars for its speed. For declarative logic and cross-team collaboration, dbt has become a go-to — especially when working directly in warehouses like BigQuery or Snowflake.
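
To give you a feel for the Polars side, here’s a minimal sketch of the kind of aggregation I’d push into its lazy API (file and column names are illustrative; recent Polars versions spell it group_by, older ones groupby):

import polars as pl

user_features = (
    pl.scan_parquet("data/transactions.parquet")  # lazy — nothing is read yet
      .group_by("user_id")
      .agg([
          pl.col("purchase_amount").mean().alias("avg_purchase"),
          pl.col("purchase_amount").count().alias("n_transactions"),
          pl.col("event_time").max().alias("last_seen"),
      ])
      .collect()  # executes the optimized query plan
)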

You might be wondering: Isn’t this overkill for some models? Maybe — but the moment you have retraining cycles, feature drift monitoring, or multiple consumers of the same features, trust me, you’ll want that foundation.


6. Handling Imbalanced and Rare Classes (Beyond SMOTE)

“SMOTE is fine… until it’s not.”

In my experience, traditional oversampling techniques like SMOTE start to fall apart when:

  • Your feature space is high-dimensional,
  • The class distribution is very skewed (say <2% minority),
  • Or you’re dealing with structured time-based data.

I’ve faced all three in fraud detection and medical event prediction projects. In those cases, I had to get creative — and generative oversampling changed the game for me.

CTGAN + TabDDPM — Synthetic Data That Respects the Distribution

CTGAN has been solid when I need to generate realistic samples that mimic the true feature distribution. TabDDPM is newer but handles continuous variables and multimodality better.

Here’s a simple workflow I’ve used:

import pandas as pd
from ctgan import CTGANSynthesizer  # newer ctgan releases expose this class as CTGAN

X_minority = df[df['target'] == 1].drop('target', axis=1)
ctgan = CTGANSynthesizer(epochs=300)
ctgan.fit(X_minority)  # pass discrete_columns=[...] here if you have categorical features

# Generate synthetic minority rows and fold them back into the training set
synthetic = ctgan.sample(500)
synthetic['target'] = 1
df_balanced = pd.concat([df, synthetic], ignore_index=True)

This gave me a much better separation in latent space than SMOTE ever did — and more stable validation scores.

“But what if you don’t want to generate data at all?”

Glad you asked. In a few cases, I’ve used prior-preserving reweighting instead. It’s a great trick when you trust your data but want the model to treat rare classes more seriously.
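
The exact scheme varies, but a minimal sketch of the inverse-frequency (“balanced”) variant looks like this, assuming a y_train label array — weights derived from the empirical class priors, then handed to the model instead of resampled data:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))

# Most sklearn estimators accept this directly, e.g.:
#   LogisticRegression(class_weight=class_weights)
# Gradient boosting libraries usually take per-sample weights via sample_weight instead.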

Rare Class Evaluation — Stop Using Just F1

Let me be blunt: accuracy and F1 are trash for rare class tasks. I rely on:

  • Precision@K (especially for ranking tasks like lead scoring)
  • PR-AUC (gives a better signal when positive class is sparse)
  • Cost-sensitive evaluation, when misclassification comes with real-world risk.

Here’s how I tracked Precision@10 during one project:

import numpy as np

# Indices of the 10 highest-scoring samples
top_k_idx = np.argsort(y_pred_proba)[-10:]

# Precision@10: what fraction of those top 10 are actual positives?
precision_at_10 = np.asarray(y_true)[top_k_idx].mean()
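
And for PR-AUC, sklearn’s average precision is the drop-in I usually reach for:

from sklearn.metrics import average_precision_score

# Average precision ≈ area under the precision-recall curve
pr_auc = average_precision_score(y_true, y_pred_proba)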

7. Missing Data: Structural vs Random

“Not all missing values are created equal. Some are bugs — others are signals.”

I’ve had projects where missing data told a better story than the actual values. One that stuck with me was a churn model — the fact that a user didn’t update their profile photo told me more than any numeric feature ever could.

The first thing I always do? Ask whether the missingness is:

  • MCAR (Missing Completely At Random),
  • MAR (Missing At Random, conditioned on observed data),
  • or MNAR (Missing Not At Random — aka the spicy one).

This might surprise you: I don’t always impute. In production, I often model the missingness directly — especially when I suspect it’s MNAR.

Injecting Missing Indicators

Here’s a move I use in 80% of my pipelines. It’s simple but crazy effective:

# Flag the missingness as its own binary feature
df['feature1_missing'] = df['feature1'].isnull().astype(int)

This has helped me catch patterns like:

  • Users who don’t provide income → more likely to default
  • Customers who skip feedback forms → churn within 30 days

Imputation: Use With Caution

For quick baseline modeling, sure — I use sklearn.impute.SimpleImputer. But when I want something smarter, IterativeImputer (sklearn’s MICE-style chained imputation, covering most of what I used fancyimpute for) gives me better signal preservation.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — still required to unlock the estimator
from sklearn.impute import IterativeImputer

imputer = IterativeImputer()
df_imputed = imputer.fit_transform(df)

That said — I only use this if I must fill values. Otherwise, I prefer encoding the absence directly and letting the model decide what it means.

Bonus: Missing Embeddings in TabNet

When I’m working with deep tabular models like TabNet, I lean on their ability to treat missingness as part of the signal — very little manual flagging needed. But you still need to prep inputs so that missing values are recognized in the first place.
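
What that prep looks like is my own convention rather than anything TabNet-specific — a sketch, assuming numeric features and a feature_cols list:

import numpy as np

for col in ['income', 'last_login_days']:  # hypothetical numeric features
    df[f'{col}_missing'] = df[col].isna().astype(int)  # keep the absence as explicit signal
    df[col] = df[col].fillna(-999)                     # sentinel outside the real value range

X = df[feature_cols].to_numpy(dtype=np.float32)  # pytorch-tabnet expects a plain numeric array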


8. Encoding Complex, Mixed-Type Data

“One-hot is a last resort. Not a default.”

This part gets tricky fast. I’ve worked with datasets where category_id had over 50,000 unique values — and using one-hot would’ve been an act of self-sabotage.

Let’s break down how I handle the mess.

High-Cardinality Categorical Features

For features like merchant_id, product_id, or device_type, I almost always use target encoding. But I’ve learned — painfully — to protect against leakage.

Here’s how I encode safely using K-folds:

import category_encoders as ce
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

df = df.reset_index(drop=True)  # so the positional fold indices line up with .loc
df['merchant_encoded'] = 0.0

for train_idx, val_idx in kf.split(df):
    df_train, df_val = df.iloc[train_idx], df.iloc[val_idx]
    encoder = ce.TargetEncoder(cols=['merchant_id'])  # fresh encoder per fold
    encoder.fit(df_train['merchant_id'], df_train['target'])
    # transform() returns a DataFrame; pull out the encoded column's values
    df.loc[val_idx, 'merchant_encoded'] = (
        encoder.transform(df_val['merchant_id'])['merchant_id'].values
    )

This avoids peeking at the target during training — which is where most target encoders get people into trouble.

Mixed-Type Columns (Text + Categorical)

I had a client with a product_description column that was half templated text, half dirty category tags. Regex + tokenization got me halfway, but the real win came from combining text vectorization + category embeddings.

What worked best was treating the text part like a mini NLP problem (TF-IDF + SVD), and the structured labels with learned embeddings. You’d be surprised how well this hybrid setup performs.
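
For reference, the text half of that hybrid is only a few lines — a sketch, with feature counts and dimensions that are judgment calls:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=20000, ngram_range=(1, 2))),
    ('svd', TruncatedSVD(n_components=64, random_state=42)),  # dense, low-dimensional text features
])

text_features = text_pipeline.fit_transform(df['product_description'].fillna(''))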

Multi-Hot Sequences

When you’ve got something like genres, tags, or skills stored as lists — avoid MultiLabelBinarizer if you care about scale or sparsity.

I’ve switched to token sequence embedding:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Let's say we have: ['comedy|drama', 'action|thriller']
tokenizer = Tokenizer(split='|', filters='')  # empty filters so punctuation inside tags isn't stripped
tokenizer.fit_on_texts(df['tags'])

sequences = tokenizer.texts_to_sequences(df['tags'])
X_seq = pad_sequences(sequences, padding='post', maxlen=5)

This gives me a dense, learnable representation of multi-hot info — especially useful for feeding into deep models.

Embedding Initialization — My Shortcut

In NLP-tabular hybrids, I’ve even preloaded GloVe or FastText embeddings when working with domain-specific vocab like medical codes or financial tickers.
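
Mechanically, that just means building an embedding matrix from whatever vectors you have and using it to initialize the layer — a sketch, assuming a pretrained dict of token vectors (e.g. loaded via gensim) and the tokenizer from the previous snippet:

import numpy as np
import tensorflow as tf

embedding_dim = 100
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for token, idx in tokenizer.word_index.items():
    vector = pretrained.get(token)
    if vector is not None:
        embedding_matrix[idx] = vector  # tokens without a pretrained vector stay at zero

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=True,  # set False to freeze the pretrained vectors
)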

“If the feature has semantics, consider letting it learn them — not encode them manually.”


9. Data Versioning, Lineage, and Auditability

“You can’t fix what you can’t trace — and good luck debugging a model trained on ghost data.”

I learned this the hard way. A few years back, I had a model suddenly lose performance by 12 points overnight. Nothing in the code changed. Turned out, a teammate silently updated the training data with an extra month of logs — which shifted the label distribution. That’s when I swore I’d never skip data versioning again.

Since then, I’ve used tools like DVC, lakeFS, and Delta Lake in different workflows — depending on whether I’m working locally, on S3, or in a full-blown Spark environment.

DVC — Git for Data

DVC fits like a glove if your data lives in folders, and you want Git-style control without pushing gigabytes to GitHub. I’ve used it to track everything from raw dumps to preprocessed folds — all versioned, all reproducible.

# Initialize DVC
dvc init

# Start tracking a dataset
dvc add data/training_set.csv

# Commit metadata to Git
git add data/.gitignore data/training_set.csv.dvc
git commit -m "Track training set with DVC"

Later, I could reproduce an exact experiment by running dvc checkout. The best part? It plays nicely with CI/CD — so your models don’t train on mystery data anymore.

lakeFS — Git for Your Data Lake

For large-scale projects where everything lives on object storage, I’ve leaned on lakeFS. You get branching, commit history, even rollback — all inside your S3 bucket. I once built a branching strategy where each ML experiment got its own isolated data snapshot. Rollbacks took seconds.

If you’re already neck-deep in a Spark ecosystem, Delta Lake might be the cleaner choice. Time travel + schema enforcement — no more guessing what the table looked like two weeks ago.
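
The time-travel part is genuinely one line — a sketch, assuming a Spark session with the Delta extension configured and a hypothetical table path:

# Read the table exactly as it looked at version 5 (or use "timestampAsOf" for a date)
df_v5 = (
    spark.read.format("delta")
         .option("versionAsOf", 5)
         .load("s3://my-bucket/tables/transactions")
)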

Lineage Tracking: mlflow + Dagster/Prefect

Versioning is only half the story. You also need to trace what created the data.

I usually plug in mlflow to track model inputs/outputs, and pair it with Dagster or Prefect for pipeline orchestration. What I like about Dagster is how it forces you to declare inputs and outputs explicitly — which means data lineage becomes automatic.

import pandas as pd
from dagster import job, op

@op
def load_training_data() -> pd.DataFrame:
    return pd.read_csv("data/training_set.csv")

@op
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    return custom_cleaning(df)  # custom_cleaning: your own cleaning logic

@job
def training_pipeline():
    preprocess_data(load_training_data())

Every time this runs, I know exactly what data version, what function version, and what config produced the result.

Reproducibility & Rollback

Personally, I treat data the same way I treat code: version everything, tag what works, and never train on untracked inputs. With DVC and mlflow combined, I can reproduce a result from 9 months ago — model, data, config, the whole deal.

And when something breaks in prod? I’ve got the audit trail to roll back with confidence — no finger-pointing, no guessing games.


10. Preprocessing Pipelines for Production

“Training is a moment. Inference is forever.”

I’ve made this mistake before — crafting a beautiful preprocessing pipeline in a Jupyter notebook, only to realize it couldn’t be reused cleanly in production. That’s when I started building everything as if it was heading to prod — even exploratory prototypes.

Here’s the deal: if your training pipeline isn’t serializable, modular, and mirrored at inference time, you’re setting yourself up for pain later.

Let me show you how I structure my pipelines now — and what I use to make them bulletproof.

Modular Pipelines with scikit-learn

scikit-learn’s Pipeline and ColumnTransformer are deceptively powerful. They don’t just keep your code clean — they ensure every transformation is applied exactly the same way at inference.

Here’s a pattern I’ve reused across dozens of projects:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

What I love about this structure is how portable it is. Once the pipeline is trained, I just pickle or joblib it, and I’m ready to deploy.

Serialization for Deployment

This might surprise you: I rarely export models without their preprocessing baked in. Too risky. One forgotten StandardScaler, and your whole inference goes sideways.

Here’s how I tie it all together using Pipeline and joblib:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import joblib

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)

# Save the full pipeline
joblib.dump(clf, 'rf_pipeline.joblib')

# Load it during inference
model = joblib.load('rf_pipeline.joblib')
preds = model.predict(X_test)

No missing steps. No mismatched scaling. The same logic runs at train and inference — that’s what makes it production-safe.

When I Need More Than scikit-learn

There’ve been times where I needed something scikit-learn couldn’t give me: like serializing PyTorch prepro logic, or packaging a full API with preprocessing + prediction in one artifact. That’s where I reach for:

  • bentoML – for packaging models as APIs with baked-in logic.
  • skops – great for sharing scikit-learn models securely and with clear inspection.
  • ONNX / TorchScript – when I need cross-platform compatibility (especially for edge or mobile).

Here’s how I wrapped a full sklearn pipeline with bentoML for a client-facing service:

import bentoml

bentoml.sklearn.save_model("rf_pipeline", clf)

This gives me a deployable bundle I can ship to any inference server with full dependency isolation. Clean, versioned, repeatable.

A Word on Drift Detection

I’ve also started embedding drift detectors right into my pipelines. Tools like evidently or custom-stat checks between train and serve distributions are lifesavers.

When a feature starts drifting, I want alerts before performance tanks — not after.
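
The custom-stat variant can be as simple as a two-sample KS test per feature — a sketch, assuming train_df and serving_df share the same numeric columns:

from scipy.stats import ks_2samp

drifted = []
for col in ['age', 'income', 'purchase_amount']:  # hypothetical numeric features
    stat, p_value = ks_2samp(train_df[col].dropna(), serving_df[col].dropna())
    if p_value < 0.01:  # the threshold is a judgment call
        drifted.append((col, round(stat, 3), p_value))

if drifted:
    print("Potential drift detected:", drifted)  # in practice this would fire an alert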

Bottom line: preprocessing is not something I “do before modeling” anymore. It’s something I design for deployment from day one. Think of it as part of the model contract — if you break it, nothing else matters.


11. Unit Testing for Data Transformations

“Trust, but verify.” — a lesson that applies just as much to data pipelines as to geopolitics.

There was a time when I thought unit tests were overkill for data transformations — I mean, the data looked fine… until it didn’t. One malformed string, one unexpected encoding, and the whole pipeline silently broke. Ever since, I write unit tests for my key transformations before shipping anything downstream.

What I Test

Let’s skip the obvious and talk about what I actually test in real-world projects:

  • Datetime parsing — because you’ll always get at least one "01/13/2024" that breaks everything.
  • Handling of nulls/extreme values — especially when building log features or ratio-based transformations.
  • Edge casing categorical logic — this one bites during inference if left unchecked.

Here’s a quick example I use when working with time-sensitive features:

from datetime import datetime
import pytest

def parse_date(s):
    return datetime.strptime(s, "%Y-%m-%d")

def test_parse_date_valid():
    assert parse_date('2021-01-01') == datetime(2021, 1, 1)

def test_parse_date_invalid_format():
    with pytest.raises(ValueError):
        parse_date('01/01/2021')

You might be thinking: “This looks basic.” And you’re right — but catching these early means fewer silent errors during retraining or inference. Trust me, it’s saved me more than once.

Using Synthetic Fixtures

In my own workflow, I keep a separate test_fixtures/ directory filled with synthetic edge-case datasets:

import numpy as np
import pandas as pd
import pytest

@pytest.fixture
def edge_case_df():
    # Keep all columns the same length — four edge-case rows each
    return pd.DataFrame({
        'income': [np.nan, -1000000, 999999999, 0],
        'signup_date': ['not_a_date', '2022-03-15', None, ''],
        'category': ['A', 'B', 'Unknown', '']
    })

What I like about this is: I can stress-test pipelines without polluting production data or depending on an external source.

Pro Tip: Test Your Custom Transformers

If you’re using custom classes that inherit from TransformerMixin, write unit tests around fit_transform and transform. Catching a bug here means you don’t learn about it when your model starts returning NaNs.
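
Something like this — the transformer name is hypothetical, but the two checks (no NaNs/infs on edge cases, transform consistent with fit_transform) carry over to almost any custom class:

import numpy as np
import pandas as pd

from transformations.custom import LogRatioTransformer  # hypothetical custom transformer

def test_handles_zero_denominators():
    df = pd.DataFrame({'numerator': [0, 10, 5], 'denominator': [1, 0, 5]})
    out = LogRatioTransformer().fit_transform(df)
    assert np.isfinite(out).all()  # no NaNs or infs, even for zero denominators

def test_transform_matches_fit_transform():
    df = pd.DataFrame({'numerator': [1, 2, 3], 'denominator': [4, 5, 6]})
    transformer = LogRatioTransformer()
    fitted = transformer.fit_transform(df)
    assert np.allclose(fitted, transformer.transform(df))  # inference must mirror training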


12. Final Sanity Checks Before Training

“All models are wrong, but some are wrong before they even start training.”

This is the part where I’ve seen the most hidden failures — not in code, but in assumptions.

Distribution Shifts (Train vs. Validation)

One sanity check I always run before hitting .fit() is distributional divergence across splits. If train and validation look like they came from different planets, your cross-validation scores are lying to you.

Here’s a quick way I use JS Divergence between distributions:

import numpy as np
from scipy.stats import entropy

def js_divergence(p, q):
    p = np.asarray(p, dtype=float) + 1e-12  # avoid zero bins
    q = np.asarray(q, dtype=float) + 1e-12
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    return 0.5 * (entropy(p, m) + entropy(q, m))

# Use the same bin edges for both splits so the histograms are comparable
bins = np.histogram_bin_edges(train_df['age'], bins=20)
train_dist = np.histogram(train_df['age'], bins=bins, density=True)[0]
val_dist = np.histogram(val_df['age'], bins=bins, density=True)[0]

div = js_divergence(train_dist, val_dist)
print(f"JS Divergence for age: {div:.4f}")

Personally, I flag anything above 0.1 for further inspection. It’s a cheap but powerful trick that’s helped me spot split leakage and stratification issues more than once.

Target Leakage via Permutation Importance

This might sound obvious, but I’ve had real-world projects where target leakage was caught only after model performance started degrading in production. Now, I always run permutation importance to catch any suspiciously predictive features.

from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=42)

for i in result.importances_mean.argsort()[::-1]:
    print(f"{X_val.columns[i]}: {result.importances_mean[i]:.4f}")

When a user_id_hash shows up as the top feature — well, you know you’ve got a problem.

Dimensionality Sanity Checks (PCA or t-SNE)

Sometimes I use PCA or t-SNE to sanity-check the geometry of my features. If high-cardinality features collapse into a few dominant components or cluster strangely — it’s a red flag.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X_embedded = PCA(n_components=2).fit_transform(X_num)

plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.title('PCA Projection - Last Check Before Training')
plt.show()

This gives me a quick visual cue if something went wrong during feature engineering — and trust me, I’ve had np.log(x + 1) blow up features more than once.

These checks might seem small, but they’re like tightening the bolts on a race car before a track day. You don’t want to find out something’s loose at 200 mph.


13. Wrapping It All Together

“The model is the last mile. Everything else is infrastructure.”

If there’s one thing I’ve learned the hard way, it’s that how you organize your ML project often matters more than the model you pick.

You can have the cleanest code in the world — but if your data, transformations, and tests are scattered across random scripts, maintaining and scaling that pipeline will become a nightmare.

So let me walk you through a structure that I’ve personally used in production-grade projects — one that’s helped me keep track of datasets, schemas, versioning, transformations, and all the moving parts.

Project Folder Structure That’s Saved Me More Than Once

Here’s what a well-organized repo looks like on my end. I’ve tweaked and iterated on this structure over time to keep things testable, debuggable, and version-safe.

/data/
│
├── raw/                 # Immutable source dumps (drop here, never edit)
├── processed/           # Intermediate clean data used for training
├── versioned/           # Saved snapshots of training-ready datasets
├── schemas/             # JSON or YAML schema files for validation
├── transformations/     # All preprocessing functions + pipeline classes
├── tests/               # Unit tests for custom transformations and features
├── pipeline.py          # Master pipeline that connects everything

Here’s how I think about each part:

  • raw/ is sacred. I never touch files here — no edits, no renaming. It’s the original dump from the source system.
  • processed/ is where most of the cleaning happens: null imputation, outlier clipping, etc. Temporary, disposable.
  • versioned/ is the goldmine. When I run an experiment, I snapshot the dataset here with a hash. That way, I can always reproduce results even months later.
  • schemas/ holds validation logic — column types, allowed values, ranges. I use this both for unit testing and runtime checks.
  • transformations/ includes all the building blocks — think custom sklearn transformers, tokenizers, feature engineers.
  • tests/ has tests for every custom transformation. If a change breaks something, I catch it here — not three hours into training.
  • pipeline.py wires everything together. I treat it like a DAG builder: it defines how raw turns into versioned, and keeps the logic declarative.

Sample pipeline.py

You might be wondering how I wire it all together. Here’s a minimal sketch I’ve used to tie schema checks, transformations, and serialization into one reproducible flow:

from transformations.preprocessing import build_pipeline
from utils.validation import validate_schema
import pandas as pd
import joblib
import hashlib

def load_data(path):
    df = pd.read_csv(path)
    validate_schema(df, "schemas/input_schema.yaml")
    return df

def save_versioned_data(df, base_path):
    hash_id = hashlib.md5(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
    df.to_parquet(f"{base_path}/versioned/data_{hash_id}.parquet")
    return hash_id

if __name__ == "__main__":
    df = load_data("data/raw/dump.csv")
    pipeline = build_pipeline()
    # build_pipeline() is assumed to return DataFrame output (e.g. via set_output(transform="pandas"))
    # so the result can be hashed and written to parquet below
    X_transformed = pipeline.fit_transform(df)

    joblib.dump(pipeline, "models/feature_pipeline.joblib")
    hash_id = save_versioned_data(X_transformed, "data")

    print(f"Versioned dataset saved with hash: {hash_id}")

This pattern’s helped me make the whole pipeline traceable — from raw to trained model — with auditability built in.
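
And in case you’re wondering about validate_schema — it’s just a thin wrapper. A minimal version, assuming the YAML file maps column names to pandas dtype strings, might look like:

import yaml
import pandas as pd

def validate_schema(df: pd.DataFrame, schema_path: str) -> None:
    with open(schema_path) as f:
        spec = yaml.safe_load(f)  # e.g. {"columns": {"customer_id": "object", "purchase_amount": "float64"}}

    missing = set(spec["columns"]) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    for col, expected in spec["columns"].items():
        actual = str(df[col].dtype)
        if actual != expected:
            raise TypeError(f"{col}: expected dtype {expected}, got {actual}")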

Version Everything That Matters

I personally version:

  • Datasets (input + final)
  • Preprocessing pipeline code
  • Schema definitions
  • Model artifacts
  • Test results (for new data or new logic)

This isn’t just about reproducibility — it’s about confidence. When someone asks “Why did this model behave differently in March?”, I can trace it down to a schema change or data drift in minutes.

That’s the gist. In my experience, once you structure things like this, onboarding new team members gets faster, productionizing becomes cleaner, and debugging turns into a process — not a guessing game.
