Fine-Tuning VGG16 for Custom Image Classification

1. When to Use Transfer Learning with VGG16

“Not every hammer is meant for every nail — but when you’ve got VGG16 in your toolkit, some jobs become a lot simpler.”

I’ll be honest — VGG16 isn’t the latest or flashiest model out there. It’s been outpaced by more efficient architectures like EfficientNet or ConvNeXt in most benchmarks.

But here’s the deal: when I’m working with a mid-sized dataset (say, 5k to 50k images) and don’t have unlimited compute, I still reach for VGG16 more often than you’d think.

Why? Because the architecture is dead simple. No fancy tricks. Just clean, predictable convolutional blocks. That makes it incredibly easy to fine-tune — and I’ve found it tends to converge faster than deeper models, especially when you don’t have tens of millions of labeled samples.

I’ve used it in cases like:

  • Niche classification tasks (e.g., industrial defect detection, satellite image tagging) where pretrained features from ImageNet still carry over surprisingly well.
  • Projects where inference speed isn’t critical, but stability and interpretability matter.
  • Training under constraints — tight budgets or restricted GPU access, where I just need something solid and proven.

If you’re building a pipeline where reproducibility matters, or you want something you can train, deploy, and debug without surprises — VGG16 still holds its ground.


2. Project Setup and Dataset Prep

This is where most people try to cut corners. But I’ve learned the hard way that a clean project structure saves you hours later — especially when you’re running multiple experiments or collaborating with teammates.

2.1 Directory Structure

Here’s how I typically organize a VGG16 fine-tuning project:

project/
│
├── config.yaml
├── train.py
├── evaluate.py
├── models/
│   └── vgg16_custom.py
├── data/
│   ├── train/
│   ├── val/
│   └── test/
├── outputs/
│   ├── checkpoints/
│   ├── logs/
│   └── predictions/

Some quick things I always set up from day one:

  • config.yaml — All hyperparameters, paths, and training options in one place.
  • Seed setting — I fix seeds across random, numpy, and tensorflow/torch for reproducibility (see the sketch after this list).
  • Logging — Whether it’s TensorBoard, W&B, or plain CSV logs, tracking training history is a must.
  • Checkpoints — I always use ModelCheckpoint with both val_loss and val_accuracy monitored.
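
For the seed setting, here's the minimal helper I drop into train.py. It's a sketch for the TensorFlow side; in a PyTorch project, swap the last line for torch.manual_seed(seed).

import os
import random
import numpy as np
import tensorflow as tf

def set_seed(seed: int = 42):
    # pin every RNG the pipeline touches
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_seed(42)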

2.2 Dataset Loading

I usually avoid toy datasets like CIFAR or MNIST in serious work — they're too small and too clean to tell you how a pipeline will behave on real data. Here's how I load a real-world custom dataset structured into folders by class:

With TensorFlow (Keras):

from tensorflow.keras.utils import image_dataset_from_directory

train_ds = image_dataset_from_directory(
    "data/train",
    image_size=(224, 224),
    label_mode="categorical",
    batch_size=32,
    seed=42,
    shuffle=True
)

val_ds = image_dataset_from_directory(
    "data/val",
    image_size=(224, 224),
    label_mode="categorical",
    batch_size=32,
    seed=42,
    shuffle=False
)

With PyTorch:

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

train_dataset = datasets.ImageFolder("data/train", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

val_dataset = datasets.ImageFolder("data/val", transform=transform)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

Handling Class Imbalance:

One time, I had a dataset where one class made up 80% of the samples. It completely threw off training. Since then, I usually either:

  • Use class_weight in Keras:
class_weights = {0: 1.0, 1: 3.0, 2: 2.5}
model.fit(..., class_weight=class_weights)
  • Or in PyTorch, I use WeightedRandomSampler to balance the mini-batches (sketch below).
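
For the PyTorch side, here's a minimal WeightedRandomSampler sketch. It assumes the ImageFolder dataset from Section 2.2 and weights each sample by inverse class frequency:

from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

# weight each sample by the inverse frequency of its class
targets = [label for _, label in train_dataset.samples]
class_counts = Counter(targets)
sample_weights = [1.0 / class_counts[t] for t in targets]

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
# note: a sampler replaces shuffle=True — don't pass both
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)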

2.3 Preprocessing

You might be tempted to just resize and call it a day — don’t. VGG16 was trained on ImageNet with very specific preprocessing. If you skip this, you’ll start your fine-tuning on the wrong foot.

For Keras:

from tensorflow.keras.applications.vgg16 import preprocess_input

train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))
val_ds = val_ds.map(lambda x, y: (preprocess_input(x), y))  # apply the same preprocessing to every split

For PyTorch, I already included it in the transform:

transforms.Normalize(mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225])

Tip: I apply data augmentation only on the training set. Stuff like random flips, small rotations, and brightness tweaks — but nothing too aggressive that might alter class semantics.
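
Here's one way I'd wire that up with Keras preprocessing layers (a sketch; RandomBrightness needs TF 2.9+). Apply it before the preprocess_input map so the brightness tweak operates on raw pixel values:

import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),     # factor is a fraction of a full turn, so ~±18°
    layers.RandomBrightness(0.1),
])

# training set only — val/test stay untouched
train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))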


3. Loading and Modifying VGG16

“Start with a strong foundation, or everything you build on top will crack — same goes for your model.”

When I’m fine-tuning with VGG16, I always start by loading the base model with pretrained weights and stripping off the top layer. That top part was trained to classify 1,000 ImageNet classes — which, in most real-world cases, are completely irrelevant to your task.

3.1 Load the Base Model (Pretrained on ImageNet)

Here’s how I typically do it in Keras:

from tensorflow.keras.applications import VGG16

base_model = VGG16(
    weights="imagenet", 
    include_top=False, 
    input_shape=(224, 224, 3)
)

base_model.trainable = False  # freeze all layers initially

You might be wondering: Why freeze it at first?
Well, from my experience, jumping straight into training all layers usually hurts more than it helps. You’re essentially asking the model to forget everything it learned on ImageNet — which defeats the point of transfer learning.

By freezing the base, you let your custom head learn first. Then once it stabilizes, you can selectively unfreeze and fine-tune deeper layers.

3.2 Build the New Model on Top

I rarely use Flatten() here. Instead, I go with GlobalAveragePooling2D() — it reduces parameters and tends to generalize better, especially when data is limited.

Here’s the custom classifier head I often use:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import GlobalAveragePooling2D, Dropout, Dense, BatchNormalization

num_classes = 10  # set this to the number of classes in your dataset

x = base_model.output
x = GlobalAveragePooling2D()(x)

# Helps avoid overfitting, especially on small datasets
x = Dropout(0.5)(x)

# I've found BatchNorm to really help stabilize training
x = BatchNormalization()(x)

# A dense layer to project features into a space the classifier can use
x = Dense(256, activation='relu')(x)

# Final classifier
predictions = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=predictions)

Here’s what this model looks like when summarized (actual output will depend on your input shape and class count):

Model: "functional"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
vgg16 (Functional)           (None, 7, 7, 512)         14714688
global_average_pooling2d     (None, 512)               0
dropout                      (None, 512)               0
batch_normalization          (None, 512)               2048
dense                        (None, 256)               131328
dense_1                      (None, num_classes)       257 × num_classes
=================================================================

And yes — I almost always inspect this summary before training. One time I forgot and ended up with a final output of shape (None, 1000) because I forgot to override the top layer. Cost me a full day of debugging.

💡 Quick tip: If your model ends up being too heavy, try reducing the size of the dense layer or using GlobalMaxPooling2D() instead — it’s slightly more aggressive but still works in many cases.


4. Phase 1: Feature Extraction (Train Only the Head)

“Don’t touch the foundation before you know the roof fits — same rule applies here.”

I always start by freezing every single layer in the base model. At this stage, VGG16 is just acting as a feature extractor — like a very smart filter bank trained on millions of images.

Here’s the code I use to do that:

for layer in base_model.layers:
    layer.trainable = False

Once that’s done, I compile the full model. I like to keep things simple but precise here:

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

model.compile(
    optimizer=Adam(learning_rate=1e-4),  # go slow, especially with a new head
    loss=CategoricalCrossentropy(),
    metrics=['accuracy']
)

And here’s the thing: I always add callbacks. Skipping them is like training without a seatbelt.

callbacks = [
    EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    ModelCheckpoint('model_feature_extract.h5', monitor='val_loss', save_best_only=True)
]

Now you’re ready to train — but don’t go overboard. This is just to get the classifier head in shape. I usually stop around 10 epochs or even earlier if validation plateaus:

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=callbacks
)

Here’s a sample from one of my logs to show you what to expect:

Epoch 3/10
val_accuracy: 0.8421 - val_loss: 0.4978
Epoch 4/10
val_accuracy: 0.8549 - val_loss: 0.4760

Note: If your validation accuracy is stagnant while training accuracy climbs, it’s probably time to stop early and move to fine-tuning.


5. Phase 2: Fine-Tuning (Unfreeze Selective Layers)

“This is where you stop treating the model like a black box — and start collaborating with it.”

Once your classifier is decent, that’s your cue to unlock more layers and help the model adapt better to your domain.

5.1 How Many Layers to Unfreeze?

Here’s a rule of thumb I’ve picked up after doing this across multiple domains:

  • Start with the last 4–6 layers, which is roughly VGG16's final convolutional block (the network only has five blocks in total).
  • If your dataset is small or noisy, stay conservative.
  • If your dataset is larger and closer to ImageNet-style domains (e.g. dogs, cars, nature), you can unfreeze a bit more.

You might be wondering:
Why not just unfreeze everything?
From what I’ve seen, doing that too early wrecks performance. You end up overwriting pretrained weights before your head has even stabilized.
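
Before picking a cutoff, I print the tail of the network so I know exactly which layers a slice like [-6:] will touch. In Keras, VGG16's layers are named block1_conv1 through block5_pool:

# inspect the layer list before deciding where to cut
for i, layer in enumerate(base_model.layers):
    print(i, layer.name, layer.trainable)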

5.2 Code: Selective Unfreezing

Here’s how I typically unfreeze just the tail end of VGG16:

for layer in base_model.layers[-6:]:
    layer.trainable = True
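
If you'd rather be explicit than count from the end, you can also unfreeze by block name (a small sketch that opens up just block 5):

# alternative: unfreeze by block name instead of slicing
for layer in base_model.layers:
    layer.trainable = layer.name.startswith("block5")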

Once you unfreeze layers, always recompile the model. The new trainable params need to be picked up:

model.compile(
    optimizer=Adam(learning_rate=1e-5),  # slower LR to avoid destructive updates
    loss=CategoricalCrossentropy(),
    metrics=['accuracy']
)

Notice the learning rate is even lower now — that’s intentional. You want to gently nudge the pretrained weights instead of jamming gradients through them.

Then I train again, this time for longer:

history_finetune = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,  # or 20, based on early stopping
    callbacks=callbacks
)

Here’s what a good fine-tuning curve might look like from one of my past projects:

Epoch 7/15
val_accuracy: 0.9021 - val_loss: 0.3228
Epoch 9/15
val_accuracy: 0.9136 - val_loss: 0.2874

It’s subtle, but you’ll start noticing small gains in accuracy — especially on tricky validation samples.

💡 Pro tip: I’ve found that reducing the batch size during fine-tuning sometimes leads to better generalization — especially when memory isn’t a bottleneck.


6. Model Evaluation

“A model is only as good as what it gets wrong — that’s where I always start looking.”

Once I’m done training, I evaluate the model only on the test set — untouched till now. One thing that matters here: the test split has to be loaded with shuffle=False, or the predictions below won’t line up with the labels.
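
If you haven’t built the test split yet, it’s loaded the same way as the other splits. Note shuffle=False, and that class_names has to be captured before the preprocess_input map (a mapped dataset loses that attribute):

from tensorflow.keras.utils import image_dataset_from_directory
from tensorflow.keras.applications.vgg16 import preprocess_input

test_ds = image_dataset_from_directory(
    "data/test",
    image_size=(224, 224),
    label_mode="categorical",
    batch_size=32,
    shuffle=False          # keep order stable so predictions align with labels
)
class_names = test_ds.class_names   # capture before .map()
test_ds = test_ds.map(lambda x, y: (preprocess_input(x), y))

With that in place, here’s my go-to setup using sklearn.metrics: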

from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Predict on the test set (built with shuffle=False above, so order lines up with the labels)
y_pred = model.predict(test_ds)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = np.concatenate([y for x, y in test_ds], axis=0)
y_true_classes = np.argmax(y_true, axis=1)

# Confusion matrix
print(confusion_matrix(y_true_classes, y_pred_classes))

# Classification report
print(classification_report(y_true_classes, y_pred_classes))

This might surprise you, but I often learn more from misclassifications than from the accuracy number itself. In one project, I had near-perfect accuracy — until I visualized the failures and realized the model was overfitting on watermark patterns, not object shape.

So I recommend: always plot your misclassified images.

import matplotlib.pyplot as plt

# tf.data datasets aren't indexable — iterate over batches instead
shown = 0
for images, labels in test_ds:
    preds = model.predict(images, verbose=0)
    for i in range(len(images)):
        pred_label = np.argmax(preds[i])
        true_label = np.argmax(labels[i])
        if pred_label == true_label:
            continue
        # note: after preprocess_input these pixels are mean-subtracted,
        # so colors will look off — add the ImageNet means back for a faithful view
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(f"True: {class_names[true_label]} | Pred: {class_names[pred_label]}")
        plt.axis('off')
        plt.show()
        shown += 1
        if shown >= 10:
            break
    if shown >= 10:
        break

This isn’t just cosmetic. I’ve used these plots to justify augmentations, rebalancing classes, and even cleaning mislabeled samples.


7. Optimization Tips

“It’s the boring tuning that separates a decent model from a production-ready one.”

Over the years, here are a few tricks I’ve leaned on — not theory, just what actually worked in my runs.

Regularization That Actually Helped

I’ve seen Dropout and L2 regularization work wonders — especially when the classifier is a bit too confident too early.

from tensorflow.keras.regularizers import l2

x = Dense(256, activation='relu', kernel_regularizer=l2(1e-4))(x)
x = Dropout(0.4)(x)

It’s subtle, but this often helped reduce variance in my validation scores.

Data Augmentation: Real-World Impact

Everyone talks about data augmentation — but here’s the deal:

  • Random rotation + flips? Usually helps.
  • Color jitter or contrast shifts? Worked for natural images, but made things worse for medical imaging in my case.

Keep it domain-aware. Here’s my setup that worked well for product catalog images:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=10,
    horizontal_flip=True,
    width_shift_range=0.1,
    height_shift_range=0.1
)

Mixed Precision = Free Speed-Up (if you’re using GPUs)

Once I started using mixed precision, training speed went up almost 2x on supported GPUs. Here’s the minimal setup I used:

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

You might be wondering: Does it affect accuracy?
In my tests, not at all — but always monitor the stability when you’re fine-tuning pretrained models.
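
One detail from the Keras mixed-precision guide worth repeating: keep the final classifier layer in float32 so the softmax doesn’t run in half precision. In the Section 3 head, that’s a one-argument change:

from tensorflow.keras.layers import Dense

# with the mixed_float16 policy active, keep the output layer in float32
predictions = Dense(num_classes, activation='softmax', dtype='float32')(x)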

Learning Rate Schedulers

I’ve personally had the most luck with ReduceLROnPlateau:

from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_scheduler = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    verbose=1
)

For longer runs or bigger datasets, I sometimes swap in CosineDecay from the tf.keras.optimizers.schedules API — especially when using warm-up strategies.
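
As a sketch of what that looks like (the steps_per_epoch figure here is an assumption; derive yours from dataset size and batch size):

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import CosineDecay

steps_per_epoch = 500                    # assumption: ~16k images / batch size 32
lr_schedule = CosineDecay(
    initial_learning_rate=1e-4,
    decay_steps=steps_per_epoch * 15     # decay across the full fine-tuning run
)

model.compile(optimizer=Adam(learning_rate=lr_schedule),
              loss='categorical_crossentropy',
              metrics=['accuracy'])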

Speed Tricks That Actually Matter

  1. Cache + Prefetch in tf.data pipelines? It’s not just fluff — my training loop time dropped by ~30%.
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
  2. Use model.evaluate() instead of custom loops when doing fast evals — it’s optimized under the hood.

8. Exporting and Using the Model

“A model not deployed is just an expensive curve-fitting hobby.”

Once the model performs well, I don’t wait — I immediately export it. I’ve had scenarios where unexpected crashes during experimentation caused loss of weights. So saving early and often is now my habit.

Here’s how I typically save the model:

# Save full model (architecture + weights + optimizer state)
model.save('vgg16_custom.h5')  # on newer Keras, the native 'vgg16_custom.keras' format also works

And to bring it back later:

from tensorflow.keras.models import load_model

model = load_model('vgg16_custom.h5')

You might be wondering: What about inference on new images?
Here’s a minimal setup I’ve used for image inference, wrapped neatly into a function:

from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input
import numpy as np

def predict_image(model, img_path, class_names):
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    # preprocessing must match training exactly: preprocess_input mirrors Section 2.3
    # (if your whole pipeline used rescale=1./255 instead, swap in x = x / 255.0)
    x = preprocess_input(x)
    preds = model.predict(x)
    return class_names[np.argmax(preds)]

Personally, I always validate this function on a few hand-picked examples before wrapping it in an API.

Bonus: Real-time Inference with FastAPI

For quick demos, I often spin up a FastAPI app to serve predictions. Here’s a basic FastAPI route I’ve reused across multiple projects:

from fastapi import FastAPI, File, UploadFile
from io import BytesIO
from PIL import Image
from tensorflow.keras.models import load_model
from tensorflow.keras.applications.vgg16 import preprocess_input
import numpy as np
import uvicorn

app = FastAPI()
model = load_model('vgg16_custom.h5')
class_names = ["classA", "classB"]  # placeholder — same order as your training folders

@app.post("/predict/")
async def predict(file: UploadFile = File(...)):
    contents = await file.read()
    img = Image.open(BytesIO(contents)).convert("RGB").resize((224, 224))
    x = np.expand_dims(np.array(img, dtype="float32"), axis=0)
    x = preprocess_input(x)  # must match the training preprocessing
    preds = model.predict(x)
    class_idx = int(np.argmax(preds))
    return {"class": class_names[class_idx]}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Trust me — being able to test your model in real-time from a browser helps catch issues you’d never notice in notebooks.


9. Common Pitfalls (And How I Avoided Them)

“These are the mistakes I’ve actually made — and still double-check for.”

Here’s a short list of things that tripped me up, even after years of working with transfer learning:

Layers Not Actually Frozen

This one burned me more than once. I thought the base layers were frozen… until I noticed suspiciously high GPU usage during Phase 1.

Always double-check with:

for layer in base_model.layers:
    print(layer.name, layer.trainable)
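
Another quick sanity check: compare the trainable parameter count against what you expect. During Phase 1 it should cover only the head.

import tensorflow as tf

# frozen Phase 1 model should report only the head's parameters here
trainable = sum(int(tf.size(w)) for w in model.trainable_weights)
print(f"Trainable params: {trainable:,}")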

Input Not Normalized Like VGG Expects

VGG models are sensitive to input scale. I learned this the hard way: performance was unstable until the inference-time preprocessing matched training exactly. In that particular project the whole pipeline used 0-1 scaling; if you followed Section 2.3, the same rule means applying preprocess_input everywhere.

Overfitting the Head

In Phase 1, I’ve seen the head start overfitting within 5–6 epochs. You’ll notice it when the training accuracy jumps but val accuracy plateaus or dips.

Solution? Smaller dense layers, heavier dropout, early stopping.
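
Concretely, the leaner head I fall back to looks something like this (a sketch using the same imports as Section 3: halve the dense layer, bump the dropout):

x = GlobalAveragePooling2D()(base_model.output)
x = Dropout(0.6)(x)                     # heavier dropout than the default head
x = Dense(128, activation='relu')(x)    # smaller projection layer
predictions = Dense(num_classes, activation='softmax')(x)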

Unfreezing Too Many Layers Too Soon

Back when I was new to this, I used to unfreeze the entire base model right after Phase 1. The result? Catastrophic forgetting — the pretrained weights were wiped out quickly.

Now, I always unfreeze gradually — usually the last 4 to 6 layers, about one convolutional block — and train with a very low learning rate.

LR Too High After Unfreezing

If you leave the learning rate unchanged after unfreezing, good luck. The pretrained layers will destabilize fast.

Here’s what I do:

from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Conclusion

So here’s what we pulled off:

  • Loaded a pre-trained VGG16 model, stripped off the top, and bolted on a custom classifier.
  • Ran feature extraction first (freezing the base), then fine-tuned selectively for better generalization.
  • Trained with best practices — dropout, batch norm, callbacks, proper learning rate schedules.
  • Evaluated with real metrics, exported the model cleanly, and even prepped it for real-time inference.

But here’s the question I often get from peers:
“Is VGG16 still worth it?”

When to Use This vs EfficientNet or ResNet?

Let me be straight with you:

  • Use VGG16 when you want something quick, interpretable, and relatively lightweight. It’s great for educational projects, prototyping, or low-compute settings. Also, the uniform structure of VGG makes debugging cleaner.
  • Switch to EfficientNet or ResNet if:
    • You’re dealing with noisy data or want state-of-the-art performance.
    • You need better parameter-efficiency (VGG is big and shallow).
    • You’re deploying to production environments where every FLOP counts.

Personally, I still keep VGG16 in my toolbox because it’s reliable and does the job surprisingly well on mid-sized datasets — especially if you’re not chasing leaderboard scores.

A Real Story from the Trenches

Let me leave you with a quick story.

I used this exact fine-tuning pipeline on a client project where we had ~10,000 labeled chest X-rays for multi-class classification. The catch? The data was highly imbalanced, and the client didn’t have the resources for massive training runs.

Instead of going with a heavy backbone like EfficientNet-B7, I went with VGG16 — and with just careful phase-wise training, augmentation, and learning rate tuning, we hit over 92% accuracy on the minority classes. That model is still running behind their internal dashboard, serving doctors in real time.
