Creating and Deploying a Machine Learning Pipeline with Kubeflow

1. Introduction

“Scaling machine learning workflows is a little like building a castle out of sand — looks easy until you actually try it.”

I’ve been through enough production ML rollouts to tell you: scaling isn’t just about training bigger models. It’s about stitching together data ingestion, feature engineering, training, validation, deployment, and monitoring — without everything falling apart halfway through.

When I first stumbled into Kubeflow, I’ll be honest — it felt like overkill. But after setting up pipeline after pipeline manually and hitting wall after wall, I realized Kubeflow isn’t just for fancy demos. It’s the real deal when you’re serious about ML Ops at scale.

If you’re tired of piecing things together with scripts and hope, this guide is for you.
I’ll walk you through — from setting up Kubeflow to deploying a real ML pipeline — without drowning you in theory.

Let’s get hands-on.


2. Prerequisites (Environment Setup)

Before we jump into the code, let me quickly walk you through what you’ll need.
I’m not going to list every obvious tool like Python or Docker — you already know that if you’re here.
Instead, here’s exactly what I use when setting up my Kubeflow pipelines:

Core Requirements:

  • Kubeflow Version: 1.8 (I’ve found it’s stable enough for production use, but still modern.)
  • Kubernetes Version: 1.25 (Works well with Kubeflow 1.8 — newer versions sometimes introduce breaking changes.)
  • Python Version: 3.9 (Safe bet with most Kubeflow components and SDKs.)
  • CLI Tools:
    • kubectl (v1.25+)
    • kfp SDK (Kubeflow Pipelines SDK)
    • Optional: minikube if you’re running locally.

Quick Install Commands:

Here’s what I usually run right after spinning up a fresh environment:

# Install Kubeflow Pipelines SDK
pip install kfp==2.0.0

# Verify kubectl installation
kubectl version --client

# (Optional) Install Minikube for local Kubernetes
brew install minikube

Pro Tip from my own headaches:
If you’re setting this up on GKE, EKS, or AKS, double-check your IAM roles and Kubernetes RBAC permissions before deploying Kubeflow.
Trust me — finding out your pipeline can’t create pods halfway through a run is not how you want to spend your afternoon.
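
A quick way I sanity-check this is with the official Kubernetes Python client. The sketch below only checks what your own kubeconfig credentials can do (not the pipeline's ServiceAccount), and the "kubeflow" namespace is just the usual default, but it catches the most common misconfiguration before you go any further:

from kubernetes import client, config

# Load your local kubeconfig and ask the API server whether these credentials
# are allowed to create pods in the kubeflow namespace.
config.load_kube_config()

review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="kubeflow", verb="create", resource="pods"
        )
    )
)
resp = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("Can create pods in 'kubeflow':", resp.status.allowed)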


3. Setting Up the Kubeflow Environment

“As they say, building the ship before you sail it is a smart idea — unless you enjoy sinking.”

Before you can run any pipelines, you’ll need a working Kubeflow setup. I’ve tried both local and cloud options, and honestly, each comes with its own set of quirks.

Here’s exactly how I approach it:

(A) Setting up Locally with MiniKF

When I’m prototyping or just testing new pipeline designs, I usually prefer setting up MiniKF on a local machine.
It’s quick, lightweight, and honestly saves me from burning cloud credits unnecessarily.

Here’s the exact way I get MiniKF running:

# Initialize Vagrant with the MiniKF box
vagrant init arrikto/minikf

# Start the VM
vagrant up

Heads-up:
The first boot can take a while since it needs to download a full VM image. I usually grab a coffee at this point.

Once it’s up, you’ll get a URL for the Kubeflow Dashboard (something like http://192.168.x.x).
Use the default credentials unless you specifically configured auth.

(B) Setting up in the Cloud (GCP, AWS)

For real-world production pipelines, I personally lean toward GCP. (AWS is great too, but GKE tends to play a bit nicer with Kubeflow, from my experience.)

If you’re setting up on GKE:

  • Spin up a Kubernetes cluster manually or use the GCP Marketplace Kubeflow deployment.
  • Make sure you enable:
    • Workload Identity
    • Node auto-scaling
    • Proper IAM roles (especially for storage access)

For production-grade reliability, I usually deploy Kubeflow Pipelines standalone straight from the official kustomize manifests (the project doesn't ship an official Helm chart):

# Example: Installing Kubeflow Pipelines Standalone (lightweight version)
export PIPELINE_VERSION=2.0.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"

This might surprise you:
Even if you’re on managed Kubernetes, you’ll still need to configure ingress controllers manually if you want SSL/TLS with Kubeflow.

Gotcha Box (Learned the Hard Way):

  • Dashboard won’t load? Check if your Ingress Controller or LoadBalancer is properly forwarding traffic.
  • Pipelines stuck at “Pending”? Nine times out of ten, it’s a ServiceAccount permission issue.
  • Kubeflow pods crashing? Always double-check your cluster’s node size and attached GPUs (if any).

“Kubeflow doesn’t care how many YAMLs you throw at it — if the RBAC isn’t right, nothing will work.”
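
Once the dashboard is reachable, I like to confirm the Pipelines API itself is healthy from Python before writing any pipeline code. A minimal check, assuming you swap the placeholder host URL for whatever your ingress or port-forward actually exposes:

import kfp

# Point the client at your Kubeflow Pipelines endpoint (placeholder URL).
client = kfp.Client(host="http://<your-kubeflow-host>/pipeline")

# If networking and RBAC are wired up correctly, this returns without errors.
print(client.list_experiments())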


4. Designing the ML Pipeline (High-level Architecture)

When I’m building out a real pipeline, I always design it modularly first — no matter how tempting it is to jam everything into one messy DAG.
Think of it like wiring a racecar: every part should be independent but tightly integrated when needed.

Here’s the basic architecture I’m going to show you:

[Data Ingestion] -> [Preprocessing] -> [Model Training] -> [Evaluation] -> [Deployment (optional)]

Every block you see here will be its own Kubeflow Component, packaged neatly.

Step-by-Step Breakdown:

  • Data Ingestion:
    Pull data from an external source (GCS, S3, database).
    I usually write ingestion components that either pull nightly batches or can be manually triggered.
  • Data Preprocessing:
    Clean up nulls, engineer features, normalize distributions — all the good stuff.
    I always keep preprocessing separate because you will end up needing to reprocess new data without retraining the model.
  • Model Training:
    Standard training phase. (Later, you can easily plug in distributed training or hyperparameter tuning if you modularize properly.)
  • Model Evaluation:
    Evaluate accuracy, ROC-AUC, or whatever metric matters.
    I personally always write evaluation outputs as metadata artifacts — helps a lot with model governance later.
  • Model Deployment (Optional, but Valuable):
    Push the trained model into a serving system — KFServing, Seldon Core, or even a custom Flask API.

Here’s the deal:
If you build your pipeline like this, adding steps later (like bias detection or model explainability) becomes trivial.
If you cram everything into a monolithic component upfront… well, good luck refactoring it three months from now.


5. Writing Pipeline Components (Code Section 1)

“There’s an old saying: ‘You don’t rise to the level of your goals, you fall to the level of your systems.’
In Kubeflow, your components are your system — so it pays to design them carefully.”

When I first started building pipelines, I made the mistake of stuffing too much logic into single steps.
Trust me — you’ll save yourself a world of pain by keeping components simple, modular, and transparent.

Let’s dive into real examples, not theory.

Example 1: Preprocessing Component

Here’s a real, production-like preprocessing step that I actually use when cleaning tabular data.

from kfp.dsl import component, Input, Output, Dataset, Model

@component(
    base_image="python:3.9",
    packages_to_install=["pandas", "scikit-learn", "gcsfs"]
)
def preprocess_op(input_path: str, cleaned_data: Output[Dataset]):
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Load the raw data (gcsfs lets pandas read gs:// paths directly)
    df = pd.read_csv(input_path)

    # Basic cleaning
    df = df.dropna()

    # Feature scaling (important for models like SVM, KMeans, etc.)
    numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    # Save the processed data where Kubeflow expects this output artifact
    df.to_csv(cleaned_data.path, index=False)

Notice:

  • I didn't hardcode file paths inside the function. The input path arrives as a parameter, and the output is an Output[Dataset] artifact whose location Kubeflow manages for me.
  • I made sure any package I need (pandas, scikit-learn, gcsfs) is declared in packages_to_install.
  • The component is stateless: it only touches the inputs and outputs it's given.

Quick Note: Packaging Components (Containerization)

You might be wondering: How does this Python function turn into something Kubeflow can actually run?

For lightweight Python components like the one above, Kubeflow runs your function inside the base_image you specified and installs packages_to_install at runtime, so you don't build an image yourself. But when you're building custom components, you sometimes want to manually containerize and push them.

The minimal way I do it:

  1. Write a Dockerfile
    (You’ll almost never need this unless doing something very custom.)
FROM python:3.9-slim
RUN pip install pandas scikit-learn
COPY preprocess.py /preprocess.py
ENTRYPOINT ["python", "/preprocess.py"]

  2. Build and Push to a Registry

docker build -t gcr.io/your-project-id/preprocess-component:latest .
docker push gcr.io/your-project-id/preprocess-component:latest

But again, if you stick to @component the way I showed you above, Kubeflow Pipelines will handle this for you under the hood when you compile the pipeline.
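
For completeness, if you do go the custom-image route, here's roughly how you'd wrap that pushed image as a KFP v2 container component. The image URI is the one from the example above, and the --input-path/--output-path flags are an assumption about what preprocess.py accepts, so adjust them to your actual entrypoint:

from kfp import dsl
from kfp.dsl import Output, Dataset

@dsl.container_component
def preprocess_container_op(input_path: str, cleaned_data: Output[Dataset]):
    # Runs the image built above; Kubeflow injects the output artifact path for us.
    return dsl.ContainerSpec(
        image="gcr.io/your-project-id/preprocess-component:latest",
        command=["python", "/preprocess.py"],
        args=["--input-path", input_path, "--output-path", cleaned_data.path],
    )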

Example 2: Training Component

Here’s a minimal but real model training step I often use:

@component(
    base_image="python:3.9",
    packages_to_install=["pandas", "scikit-learn", "joblib"]
)
def train_op(input_data: Input[Dataset], model: Output[Model]):
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    import joblib

    # Load preprocessed data from the upstream artifact
    df = pd.read_csv(input_data.path)
    X = df.drop("target", axis=1)
    y = df["target"]

    # Train the model
    clf = LogisticRegression()
    clf.fit(X, y)

    # Save the model where Kubeflow expects this output artifact
    joblib.dump(clf, model.path)

Here’s the deal:

When you write Kubeflow components, treat them like LEGO blocks, not like glued-together junk.

If you:

  • Avoid hardcoding: Always pass paths/params explicitly.
  • Handle exceptions: Add try/except blocks if your code might fail on weird input.
  • Write clean imports: Import libraries inside the component functions (not globally) to make packaging easier.

Then your pipelines will be 10x easier to maintain when business needs inevitably shift.
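
The evaluation step from the architecture in section 4 follows exactly the same pattern. Here's a minimal sketch of the kind of evaluation component I'd plug in, assuming the same "target" column as the training step; it logs metrics through a Metrics output artifact so they show up in the UI and in lineage (in a real pipeline you'd evaluate on a held-out split, not the training data):

from kfp.dsl import component, Input, Output, Dataset, Model, Metrics

@component(
    base_image="python:3.9",
    packages_to_install=["pandas", "scikit-learn", "joblib"]
)
def evaluate_op(input_data: Input[Dataset], model: Input[Model], metrics: Output[Metrics]):
    import pandas as pd
    import joblib
    from sklearn.metrics import accuracy_score, roc_auc_score

    # Load the data and the trained model artifact
    df = pd.read_csv(input_data.path)
    X = df.drop("target", axis=1)
    y = df["target"]
    clf = joblib.load(model.path)

    # Log metrics as artifacts for governance and comparison across runs
    preds = clf.predict(X)
    metrics.log_metric("accuracy", float(accuracy_score(y, preds)))
    metrics.log_metric("roc_auc", float(roc_auc_score(y, clf.predict_proba(X)[:, 1])))

In the pipeline, you'd wire it up with evaluate_op(input_data=preprocess_task.outputs["cleaned_data"], model=train_task.outputs["model"]).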


6. Building the Pipeline (Code Section 2)

“A good pipeline isn’t just a bunch of tasks strung together — it’s a living map of your ML system.”

When I first started wiring components together, I’ll be honest: I underestimated how much small mistakes early on could create huge bottlenecks later.

Things like hardcoded parameters, non-reusable paths, or overcomplicated dependencies — they always came back to haunt me.

Let me show you how I personally structure my pipelines now, based on lessons learned the hard way.

Composing the Pipeline

Here's how I typically build a clean, modular Kubeflow pipeline using the @dsl.pipeline decorator.

from kfp.dsl import pipeline

@pipeline(
    name="example-training-pipeline",
    description="An example ML pipeline with preprocessing, training, and evaluation."
)
def training_pipeline(
    data_path: str = "gs://my-bucket/raw_data.csv"
):
    # Step 1: Preprocessing task
    preprocess_task = preprocess_op(input_path=data_path)

    # Step 2: Training task, wired to the preprocessing output artifact
    train_task = train_op(
        input_data=preprocess_task.outputs["cleaned_data"]
    )

Some expert notes here:

  • Linking tasks:
    Notice how I didn't manually specify the output file from preprocessing when passing it to training?
    Instead, I'm dynamically pulling preprocess_task.outputs["cleaned_data"], the artifact declared by the preprocessing component.
    This keeps things tightly connected without hardcoding.
  • Parameterizing everything:
    I'm using function arguments like data_path.
    This means I can reuse this pipeline across datasets, models, even across teams, without rewriting it.
  • Keeping steps modular:
    Each step only knows about its own inputs and outputs.
    Zero knowledge of downstream or upstream tasks: pure, clean modularity.

Pro Tip: Let Kubeflow Manage Intermediate Paths

You might be wondering: "Why don't I pass an explicit path for the cleaned data anywhere?"

Here's the deal:

Every step runs in its own pod, so a file written to /tmp/ by preprocessing simply isn't visible to training. By declaring outputs as artifacts (Output[Dataset], Output[Model]), you let Kubeflow stage them in the pipeline root (object storage) and hand the right path to each component automatically.

My rule of thumb?

  • Intermediate artifacts ➔ Declare them as component outputs and let Kubeflow decide where they live.
  • Final outputs ➔ Copy them to a bucket you control if systems outside the pipeline need them.
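
Where do those managed artifacts actually end up? In the pipeline root. If the installation default doesn't suit you, you can point it at a bucket you control. A minimal sketch, assuming your kfp release supports pipeline_root as a decorator argument (the bucket path below is a placeholder):

from kfp.dsl import pipeline

# All artifact outputs for runs of this pipeline get staged under this prefix.
@pipeline(
    name="example-training-pipeline",
    pipeline_root="gs://my-bucket/pipeline-root"
)
def training_pipeline(data_path: str = "gs://my-bucket/raw_data.csv"):
    ...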

7. Compiling and Uploading the Pipeline

“Code that isn’t deployed is just a very elaborate draft.”

After wiring up the pipeline, the next step is making it real.
I’ve personally tripped here more times than I care to admit — so I always double-check this part carefully.

Compiling the Pipeline

First things first: you need to compile your Python pipeline into a .yaml file that Kubeflow can understand.

Here’s exactly how I usually do it:

from kfp import compiler

if __name__ == "__main__":
    compiler.Compiler().compile(
        pipeline_func=training_pipeline,
        package_path="training_pipeline.yaml"
    )

Now, just run your script:

python3 pipeline.py

This generates a training_pipeline.yaml that you can upload into Kubeflow.

Uploading the Pipeline

You’ve got two options — and I’ve used both depending on the situation:

Manual Upload (through the UI)

  1. Open your Kubeflow Pipelines UI.
  2. Click on Pipelines ➔ Upload Pipeline.
  3. Upload your training_pipeline.yaml.

Simple, but can get tedious if you’re iterating a lot.

Programmatic Upload (my preferred method)

When I’m working fast, I prefer using the kfp.Client() method:

import kfp

client = kfp.Client()  # Assumes you're already authenticated

# Upload a new pipeline
pipeline = client.upload_pipeline(
    pipeline_package_path="training_pipeline.yaml",
    pipeline_name="Training Pipeline Example"
)

This way, you can upload, version, and even update existing pipelines without touching the UI.
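
When I'm iterating on the same pipeline, I push new versions under the existing pipeline instead of piling up near-duplicates. A rough sketch (the version name is made up, and the exact keyword arguments can vary a bit between kfp releases, so check your client's reference):

import kfp

client = kfp.Client()

# Upload a new version under the pipeline registered above
version = client.upload_pipeline_version(
    pipeline_package_path="training_pipeline.yaml",
    pipeline_version_name="v2-tweaked-preprocessing",
    pipeline_name="Training Pipeline Example"
)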


8. Running the Pipeline (Manual and Automated)

“Launching a pipeline manually feels like flying a plane with training wheels. Automation? That’s flying a jet.”

Alright — now that your pipeline is uploaded, it’s go time.

Manual Run (via UI)

If you’re just testing, you can kick off a run directly from the UI:

  1. Go to the Pipelines section.
  2. Select your pipeline.
  3. Click Create Run ➔ Fill in parameters like data_path.
  4. Hit Start.

Personally, I only do this for initial sanity checks.

Programmatic Run (Production Style)

Here’s how I usually trigger pipelines programmatically — much cleaner for scheduled jobs or production workflows:

from kfp import Client

client = Client()

run = client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={
        'data_path': 'gs://your-bucket/path/to/data.csv'
    }
)

Notice how you pass parameters cleanly — no hardcoded magic inside the pipeline itself.

Monitoring Runs and Accessing Logs

Once the pipeline run kicks off, monitoring becomes critical.
In my experience, these are the top places to keep an eye on:

  • Dashboard UI: Watch the run status change from Pending ➔ Running ➔ Succeeded/Failed.
  • Logs: Click on any component step to drill into real-time logs.
  • Artifacts: Output files, models, and metrics are automatically saved and linked.

👉 Pro Tip from my own burns:
Always set up alerts on failures (Slack, PagerDuty, etc.) if you’re moving to production.
You will thank yourself later when things break at 3 AM.
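
For the alerting piece, the pattern I lean on is simple: submit the run, block until it reaches a terminal state, then inspect it and fan out to whatever notifier you use. A minimal sketch; the exact shape of the returned run object differs a little between kfp releases, so treat the last lines as a starting point:

from kfp import Client

client = Client()

run = client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={'data_path': 'gs://your-bucket/path/to/data.csv'}
)

# Block until the run finishes (raises on timeout), then inspect the result.
finished = client.wait_for_run_completion(run.run_id, timeout=3600)
print(finished)  # terminal state, timings, and error details live on this object

# From here, post to Slack/PagerDuty/etc. if the terminal state isn't a success.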


9. Handling Model Deployment (Bonus Advanced Section)

“Training a model is just half the battle. Real heroes ship it.”

Now, based on my own experience, deploying models inside Kubeflow felt like stepping into a different world compared to just running pipelines.
If you're serious about production (and I know you are), you'll eventually run into KFServing (now maintained as KServe) or Seldon Core.

Personally, I’ve leaned toward KFServing for most use cases — it’s pretty streamlined for common ML frameworks.

Quick Example: Deploying a Scikit-learn Model with KFServing

Here’s an actual YAML I’ve used in real-world deployments:

apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    sklearn:
      storageUri: "gs://my-bucket/sklearn-model/"

This YAML does a few magical things:

  • It pulls your model directly from a GCS bucket.
  • It spins up a server using KFServing’s built-in Scikit-learn server.
  • It manages scaling, logging, and retries without you writing any serving code.

Exposing the Endpoint

This might surprise you: Kubeflow doesn’t automatically make your model accessible from the outside world.
You have to configure an Ingress or expose a LoadBalancer service.

In one of my projects, I usually did it like this (basic LoadBalancer):

apiVersion: v1
kind: Service
metadata:
  name: my-model-external
spec:
  type: LoadBalancer
  selector:
    serving.kubeflow.org/inferenceservice: my-model
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

And boom — your model gets an external IP you can hit with REST calls.
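
To sanity-check the endpoint, I usually just hit the predictor's REST API with the V1 prediction protocol the sklearn server speaks. The external IP and the feature values below are placeholders:

import requests

# Placeholder IP from the LoadBalancer above; the model name matches the
# InferenceService metadata.name ("my-model").
url = "http://<EXTERNAL_IP>/v1/models/my-model:predict"
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}

resp = requests.post(url, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # {"predictions": [...]}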


10. Best Practices and Common Pitfalls

“Experience is what you get when you didn’t get what you wanted.”

Believe me, I’ve learned these the hard way.
If you’re going to build real, maintainable Kubeflow systems, here’s what’s saved me (and my sanity):

Design for Modularity and Versioning

You might be tempted to cram everything into one giant pipeline.
I’ve been there — and regretted it.

Instead:

  • Treat every component as a standalone microservice.
  • Keep your inputs/outputs super clean and version-controlled.

This way, updating just the preprocessing logic doesn’t mean rewriting the entire pipeline.

Avoiding Artifact Caching Headaches

This might seem small… until it burns you:
Kubeflow caches pipeline step outputs by default.
If your component logic changes but the inputs stay the same, Kubeflow might just reuse old outputs.

Personally, I always do one of these after major logic changes:

  • Bump the @component version (even if unofficially via a tag in the image or name).
  • Clear previous runs manually when necessary.
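
If you'd rather be explicit than rely on version bumps, KFP v2 also lets you switch caching off, either for a whole run or for a single task. A short sketch using the pipeline and tasks from earlier (double-check the argument names against your kfp release):

from kfp import Client

client = Client()

# Option 1: disable caching for an entire run at submission time
client.create_run_from_pipeline_func(
    training_pipeline,
    arguments={'data_path': 'gs://your-bucket/path/to/data.csv'},
    enable_caching=False
)

# Option 2: disable caching for one step, inside the pipeline definition
# train_task = train_op(input_data=preprocess_task.outputs["cleaned_data"])
# train_task.set_caching_options(False)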

Debugging Pipeline Failures (Without Losing Your Mind)

You might be wondering: “How do I actually debug failures without getting lost in a maze of logs?”

Here’s what worked for me:

  • Always print outputs explicitly at every major component boundary.
  • Use structured logging if possible (e.g., JSON logs).
  • Make heavy use of retry policies inside your components for flaky jobs.

And trust me, set timeouts — don’t let rogue steps eat up your cluster.
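
For the retry point specifically, KFP v2 lets you attach a retry policy to individual tasks right inside the pipeline function. A minimal sketch using the components from earlier (the retry count is arbitrary, and backoff options exist too if you need them):

from kfp.dsl import pipeline

@pipeline(name="example-training-pipeline")
def training_pipeline(data_path: str = "gs://my-bucket/raw_data.csv"):
    preprocess_task = preprocess_op(input_path=data_path)

    # Retry the flaky step a few times before failing the whole run
    train_task = train_op(input_data=preprocess_task.outputs["cleaned_data"])
    train_task.set_retry(num_retries=3)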

Bonus Tools Worth Bookmarking

I personally keep these in my quick-access bookmarks:

  • The Kubeflow Pipelines documentation and the kfp SDK reference
  • Katib, for hyperparameter tuning
  • KServe (formerly KFServing) and Seldon Core, for model serving


11. Final Thoughts (Short and Actionable)

“In the world of machine learning, ideas are cheap. It’s execution that scales.”

Looking back at my own journey, I’ve realized that Kubeflow Pipelines aren’t just a “nice-to-have” for flashy demos — they’re absolutely critical once you start operating at serious scale.

You might be thinking: “Alright, I’ve got a working pipeline now — what’s next?”
Here’s the deal: this is just the beginning.

Tweak and Expand Your Pipeline
Personally, one of the first things I always recommend is to modularly slot in hyperparameter tuning.
Tools like Katib integrate nicely and let you optimize without rewriting your pipeline.

Experiment with AutoML Components
I’ve plugged in AutoML systems where needed — sometimes to fast-track early model versions, sometimes to beat baselines quickly.

Make It YOURS
Seriously — the fastest way I leveled up was by breaking the pipeline, fixing it, and customizing it for weird, real-world datasets.
Don’t just clone — own your pipeline.

One last thing I’d leave you with:
Kubeflow shines when you stop treating it like a static framework and start treating it like an evolving ecosystem you’re actively building.

And from my experience… once you see your first few models automatically trained, deployed, and monitored without manual babysitting —
you won’t want to go back.
