1. Introduction
“When it comes to training custom object detectors, YOLOv8 makes the process feel deceptively simple—but fine-tuning it properly is where things get interesting.”
In this guide, I’ll walk you through how I personally fine-tuned YOLOv8 on a custom industrial inspection dataset—something with tiny defects, overlapping parts, and inconsistent lighting. These weren’t textbook-perfect images, and that’s exactly why I had to get hands-on with every part of the pipeline.
If you’re working with a custom dataset—whether it’s traffic surveillance, wildlife monitoring, or product quality control—this guide is meant to get you from “it runs” to “it performs”.
I’m not going to waste your time on the basics. This is all based on what I’ve done, what’s worked, and where I’ve hit walls. From setting up the environment properly to avoiding silent failures mid-training, I’ll share everything that matters—and nothing that doesn’t.
2. Setup & Environment (With No Room for Errors)
2.1 Python Environment + Packages
Let me start by saying: if your setup isn’t airtight, your training will break in ways that make zero sense.
Personally, I prefer using a virtual environment (either venv or conda) just to keep dependencies isolated. Here’s the exact setup I’ve used:
pip install ultralytics==8.1.0 torch==2.0.1 opencv-python==4.8.0.76
This combo has worked reliably for me on both local GPU machines and cloud-based setups (like Paperspace or Colab Pro). torch 2.0.1 ships CUDA 11.7/11.8 builds, so it pairs well with a CUDA 11.8 toolchain. Just be careful mixing torch and CUDA versions—if they don’t align, you’ll end up with errors that look unrelated.
Also, be careful with torchvision: don’t pin it separately unless you actually need to. The ultralytics package pulls in a compatible build on its own, and I’ve seen weird conflicts on machines where a manually installed torchvision didn’t match the torch build.
2.2 GPU Check and Compatibility
Before anything else, make sure your GPU is being picked up by PyTorch. This sounds obvious, but I’ve personally wasted hours debugging slow training speeds only to realize PyTorch fell back to CPU because CUDA wasn’t set up correctly.
Here’s a simple sanity check:
import torch

if torch.cuda.is_available():
    print(f"CUDA is available. Device: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA NOT available. You're training on CPU.")
If you see CPU here and you were sure CUDA was installed, check your driver version. I’ve had cases where torch.cuda silently failed due to a mismatch between the driver and the installed CUDA toolkit.
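A quick cross-check I run: print the CUDA version torch was compiled against and compare it with what nvidia-smi reports for the driver:

import torch

# CUDA version the installed torch wheel was compiled against
print("torch:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)

# Compare against the CUDA version shown by `nvidia-smi` in a shell.
# If the driver supports an older CUDA than torch.version.cuda,
# torch.cuda.is_available() will quietly return False.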
2.3 Folder Structure for Custom Projects
This might sound like a small thing, but a clean folder structure can save your sanity later when you start experimenting with multiple models, datasets, and training configs.
Here’s the project structure I personally stick to:
project_root/
│
├── data/
│ ├── images/
│ │ ├── train/
│ │ └── val/
│ ├── labels/
│ │ ├── train/
│ │ └── val/
│ └── data.yaml
│
├── models/
│ └── yolov8_custom.yaml
│
├── weights/
│ └── best.pt # exported model weights
│
├── scripts/
│ ├── convert_labels.py
│ ├── visualize_annotations.py
│ └── infer_custom.py
I’ve made the mistake early on of dumping everything into one folder—logs, weights, scripts, you name it. It worked for a day or two… until I had to rerun experiments or retrain from scratch. Now, I treat my folders like code repos: modular, clean, and reproducible.
3. Dataset Preparation
3.1 Label Format (And Why Getting This Wrong Wastes Hours)
Let me be real—this part is where I’ve personally made the most mistakes early on. Even now, I triple-check the format before training because YOLO doesn’t throw loud errors for wrong annotations; it just silently learns garbage.
YOLOv8 expects annotations in .txt files, one per image, with each line in this format:
<class_id> <x_center> <y_center> <width> <height>
All coordinates are normalized (i.e., values between 0 and 1 relative to image size). Here’s a real example from my dataset:
2 0.512 0.348 0.210 0.160
0 0.752 0.602 0.188 0.294
In this case:
- 2 and 0 are the class IDs
- The rest are the bounding box coordinates: [x_center, y_center, width, height]
🔧 Tip from experience: If your dataset comes with bounding boxes in pixel format (which many do), you must convert them—YOLO will not do it for you.
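For reference, the conversion itself is just normalization by image width and height. A minimal sketch (the box here is hypothetical, given as (xmin, ymin, xmax, ymax) in pixels):

def pixels_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    # Center point and box size in pixels, then normalize to [0, 1]
    x_c = ((xmin + xmax) / 2) / img_w
    y_c = ((ymin + ymax) / 2) / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return round(x_c, 6), round(y_c, 6), round(w, 6), round(h, 6)

# Example: a 200x150 px box at (320, 180)-(520, 330) in a 1280x720 image
print(pixels_to_yolo(320, 180, 520, 330, 1280, 720))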
3.2 Dataset Conversion Scripts (No One Talks About These Edge Cases)
Now here’s the deal—most datasets aren’t in YOLO format. I’ve had to convert from COCO JSON, Pascal VOC XML, and even raw CSVs. The trick is to not just write a converter—but write a reliable one that handles odd cases like:
- Negative coordinates
- Rotated or flipped boxes
- Class name mismatches
Here’s a minimal example I’ve used to convert VOC XML to YOLO format:
import os
import xml.etree.ElementTree as ET
from PIL import Image

classes = ["cat", "dog", "person"]  # must match your data.yaml

def convert_bbox(size, box):
    # size = (img_width, img_height), box = (xmin, xmax, ymin, ymax) in pixels
    dw = 1. / size[0]
    dh = 1. / size[1]
    x = (box[0] + box[1]) / 2.0
    y = (box[2] + box[3]) / 2.0
    w = box[1] - box[0]
    h = box[3] - box[2]
    return (x * dw, y * dh, w * dw, h * dh)

def convert_annotation(xml_path, output_path):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    image_path = root.find("path").text
    img = Image.open(image_path)
    w, h = img.size
    with open(output_path, 'w') as out_file:
        for obj in root.iter("object"):
            cls = obj.find("name").text
            if cls not in classes:
                continue  # skip classes we don't train on
            cls_id = classes.index(cls)
            xmlbox = obj.find("bndbox")
            b = (
                float(xmlbox.find("xmin").text),
                float(xmlbox.find("xmax").text),
                float(xmlbox.find("ymin").text),
                float(xmlbox.find("ymax").text)
            )
            bb = convert_bbox((w, h), b)
            # Write coordinates rounded to 6 decimals (see the tip below)
            out_file.write(f"{cls_id} {' '.join(f'{v:.6f}' for v in bb)}\n")
Pro tip: Always round YOLO coords to 6 decimals to avoid floating-point issues. And keep a few samples manually reviewed to sanity check your script output.
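For completeness, here is roughly how I drive that converter over a folder of VOC files (the paths are placeholders for your own layout):

import glob, os

xml_dir = "raw_annotations/"       # wherever your VOC .xml files live
out_dir = "dataset/labels/train/"  # YOLO .txt output
os.makedirs(out_dir, exist_ok=True)

for xml_path in glob.glob(os.path.join(xml_dir, "*.xml")):
    txt_name = os.path.splitext(os.path.basename(xml_path))[0] + ".txt"
    convert_annotation(xml_path, os.path.join(out_dir, txt_name))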
3.3 Directory Layout for YOLOv8 (This One’s Non-Negotiable)
YOLOv8 expects a very specific folder layout. If it’s even slightly off, you’ll get silent failures—or worse, training runs but results are nonsense.
Here’s the structure I stick to:
dataset/
├── images/
│ ├── train/
│ └── val/
├── labels/
│ ├── train/
│ └── val/
└── data.yaml
And here’s a quick Python snippet I wrote to auto-structure my raw dataset:
import shutil, os, random

def organize_yolo_format(raw_img_dir, raw_label_dir, dest_dir, split_ratio=0.8):
    # Create the YOLO directory skeleton
    os.makedirs(f"{dest_dir}/images/train", exist_ok=True)
    os.makedirs(f"{dest_dir}/images/val", exist_ok=True)
    os.makedirs(f"{dest_dir}/labels/train", exist_ok=True)
    os.makedirs(f"{dest_dir}/labels/val", exist_ok=True)

    images = [f for f in os.listdir(raw_img_dir) if f.endswith(".jpg")]
    random.shuffle(images)
    split = int(len(images) * split_ratio)

    for i, img_name in enumerate(images):
        base = os.path.splitext(img_name)[0]  # safer than split(".") for names with dots
        label_name = base + ".txt"
        set_type = "train" if i < split else "val"
        shutil.copy(f"{raw_img_dir}/{img_name}", f"{dest_dir}/images/{set_type}/{img_name}")
        shutil.copy(f"{raw_label_dir}/{label_name}", f"{dest_dir}/labels/{set_type}/{label_name}")
Heads up: Make sure your label files have the exact same filename (minus extension) as the image. YOLO won’t warn you if a label is missing—it’ll just skip that image.
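Since YOLO skips unlabeled images silently, I run a quick pairing check right after organizing. A small sketch assuming the layout above:

import os

def report_missing_labels(img_dir, label_dir):
    # Flag images that have no matching .txt label file
    missing = []
    for img_name in os.listdir(img_dir):
        if not img_name.lower().endswith((".jpg", ".png")):
            continue
        label_name = os.path.splitext(img_name)[0] + ".txt"
        if not os.path.exists(os.path.join(label_dir, label_name)):
            missing.append(img_name)
    print(f"{len(missing)} images without labels in {img_dir}")
    return missing

for split in ("train", "val"):
    report_missing_labels(f"dataset/images/{split}", f"dataset/labels/{split}")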
3.4 Verifying Annotations Visually (Trust, Don’t Assume)
If there’s one thing you should never skip, it’s verifying the labels visually. I’ve run entire experiments with misaligned bounding boxes just because I skipped this step.
Here’s a simple visualization script I use with OpenCV:
import cv2

def draw_yolo_label(image_path, label_path, class_names):
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    with open(label_path, "r") as file:
        for line in file:
            cls_id, x_c, y_c, bw, bh = map(float, line.strip().split())
            # Convert normalized center/size back to pixel corner coordinates
            x1 = int((x_c - bw / 2) * w)
            y1 = int((y_c - bh / 2) * h)
            x2 = int((x_c + bw / 2) * w)
            y2 = int((y_c + bh / 2) * h)
            cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(img, class_names[int(cls_id)], (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("Annotated", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
My habit: I sample 20–30 random images from both train and val and run them through this script before touching the training step. It’s saved me more than once from training on trash labels.
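To make that habit painless, I wrap the visualizer in a small sampling loop. A sketch assuming the layout from 3.3 and the draw_yolo_label function above:

import os, random

class_names = ["cat", "dog", "person"]  # same order as data.yaml

for split in ("train", "val"):
    img_dir = f"dataset/images/{split}"
    lbl_dir = f"dataset/labels/{split}"
    files = [f for f in os.listdir(img_dir) if f.endswith(".jpg")]
    for img_name in random.sample(files, k=min(20, len(files))):
        lbl_path = os.path.join(lbl_dir, os.path.splitext(img_name)[0] + ".txt")
        if os.path.exists(lbl_path):  # skip images without labels
            draw_yolo_label(os.path.join(img_dir, img_name), lbl_path, class_names)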
4. Model Configuration
4.1 Selecting the Right YOLOv8 Variant (Don’t Just Go for the Biggest One)
Here’s the deal: when I first got started with YOLOv8, I was like, “Why not just go with yolov8x? Bigger = better, right?” Big mistake. Not because it’s a bad model—it’s a beast—but unless you’ve got a top-tier GPU and a dataset with thousands of high-quality images, it’ll just bottleneck your workflow.
So here’s how I choose now—based on actual use:
| Variant | Use Case |
|---|---|
| yolov8n | Fast testing, embedded devices, very small datasets |
| yolov8s | Great starting point for small-to-medium custom datasets |
| yolov8m | Balanced for most real-world projects, my personal go-to |
| yolov8l | Use when you’ve got 16GB+ GPU VRAM and high-resolution inputs |
| yolov8x | Only worth it for large-scale, production-grade datasets and top-tier GPUs |
My take: I usually start with yolov8s just to validate that the pipeline is working. Once I’m confident, I move to yolov8m for most production-quality experiments. If I can’t train a full epoch in under 5 minutes, I downsize.
Command example:
yolo task=detect mode=train model=yolov8m.pt data=dataset/data.yaml epochs=100 imgsz=640
4.2 Customizing data.yaml (This Tiny File Can Break Everything)
This might surprise you, but 99% of the time I see someone stuck in training, it’s because of this one file. It’s deceptively simple, but if the paths or class definitions are even slightly off, YOLOv8 either crashes or trains garbage silently.
Here’s a working example I’ve used:
path: /home/user/datasets/my_project/
train: images/train
val: images/val
nc: 3
names: ["cat", "dog", "person"]
Breakdown:
- path is the root path to your dataset folder
- train and val are relative to path
- nc is the number of classes
- names must be in index order matching your labels
Real mistake I made once: I accidentally set nc: 2 but had 3 class names in names:. YOLO didn’t complain—it just skipped training for the third class. Took me hours to catch.
Here’s a quick sanity check in Python:
from ultralytics import YOLO
model = YOLO("yolov8m.pt")
model.train(data="dataset/data.yaml", epochs=1, imgsz=640)
If this runs fine for one epoch, you’re structurally good to go.
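An even cheaper check I run before that single epoch: load data.yaml and make sure nc, names, and the label files actually agree. A sketch assuming PyYAML is installed:

import glob, os, yaml

with open("dataset/data.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["nc"] == len(cfg["names"]), "nc does not match the number of names"

# Every class ID in the label files must be < nc
for split in ("train", "val"):
    for txt in glob.glob(os.path.join(cfg["path"], "labels", split, "*.txt")):
        with open(txt) as lf:
            for line in lf:
                cls_id = int(line.split()[0])
                assert cls_id < cfg["nc"], f"bad class id {cls_id} in {txt}"
print("data.yaml and labels are consistent")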
4.3 Hyperparameter Tuning Strategy (This Is Where Performance Lives)
Let me be blunt: if you’re using default hyperparameters, you’re leaving performance on the table. I’ve seen mAP improvements of 5–10% just from tuning LR, augmentation, and image size.
Here’s the approach I follow:
Start Simple:
Start with the command below to train on your config:
yolo task=detect mode=train model=yolov8m.pt data=dataset/data.yaml epochs=100 imgsz=640
Then tweak based on what your dataset is doing.
Adjust the following first:
lr0     # Initial learning rate
epochs  # More if your loss plateaus late
imgsz   # Larger = more detail, but more GPU needed
batch   # Adjust based on your VRAM
Example:
yolo task=detect mode=train model=yolov8m.pt data=dataset/data.yaml \
epochs=200 imgsz=768 lr0=0.005 batch=16
Want more control? Use a custom hyp YAML:
lr0: 0.005
lrf: 0.1
momentum: 0.937
weight_decay: 0.0005
warmup_epochs: 3.0
warmup_momentum: 0.8
box: 7.5
cls: 0.5
dfl: 1.5
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
flipud: 0.0
fliplr: 0.5
mosaic: 1.0
mixup: 0.2
copy_paste: 0.0
Run it like this: my usual route is the Python API, unpacking the YAML into train() as keyword overrides (a sketch assuming the file above is saved as hyp_custom.yaml and PyYAML is installed):
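import yaml
from ultralytics import YOLO

with open("hyp_custom.yaml") as f:
    hyp = yaml.safe_load(f)   # lr0, mosaic, mixup, etc. from the YAML above

model = YOLO("yolov8m.pt")
model.train(data="dataset/data.yaml", epochs=200, imgsz=640, **hyp)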
From my own tuning cycles: Lowering mixup and copy_paste helped me stabilize training on a noisy dataset. I also found that using imgsz=768 gave sharper boxes, but required slicing my batch size in half.
5. Fine-Tuning the Model
5.1 Training Command + Full Explanation (No Guesswork Here)
Let’s not sugarcoat it: I’ve lost days trying to debug training runs just because I forgot to double-check a flag or misconfigured a name. Over time, I’ve settled into a reliable training command that just works.
Here’s my go-to command:
yolo task=detect mode=train \
model=yolov8m.pt \
data=dataset/data.yaml \
epochs=100 \
imgsz=640 \
batch=16 \
name=wildlife_detector
Let me break it down based on how I actually use it:
- task=detect – You’re doing object detection. Obvious, but I once left this out when running a segmentation task, and YOLOv8 defaulted weirdly.
- mode=train – This kicks off training.
- model=yolov8m.pt – Start from a pretrained model. I often use yolov8s.pt for quick iterations, then bump up to m or l if it’s worth it.
- data=... – The YAML file you prepped earlier.
- epochs=100 – I tend to start with 100, but I rarely go with round numbers blindly. If loss is still improving, I just extend.
- imgsz=640 – Go with 768 or 896 if your GPU allows—it often gives better localization.
- batch=16 – You’ll want to experiment here based on VRAM. I once had silent crashing on batch 32, fixed by cutting it to 12.
- name=wildlife_detector – This is crucial. Every run gets its own folder in runs/detect/, and naming saves you from that dreaded exp2, exp3, exp4 chaos.
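If you’d rather stay in Python (I do this inside notebooks), the equivalent call with the ultralytics API looks roughly like this:

from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # start from pretrained weights
model.train(
    data="dataset/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    name="wildlife_detector",  # run folder under runs/detect/
)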
5.2 Checkpoints and Resume Training (Don’t Start from Scratch)
This might surprise you, but I’ve seen people rerun full training from scratch after their laptop rebooted. You don’t have to live that life.
Here’s how I resume:
Using the best checkpoint:
yolo task=detect mode=train \
model=runs/detect/wildlife_detector/weights/best.pt \
data=dataset/data.yaml \
epochs=200 \
imgsz=640 \
name=wildlife_detector_v2
This continues training with the best-performing weights from the previous run. I usually do this when I want to push performance without starting from zero.
Pro tip: If you want to resume exactly where training left off (including optimizer state), point model= at last.pt and set resume=True:
yolo task=detect mode=train model=runs/detect/wildlife_detector/weights/last.pt resume=True
YOLOv8 handles this pretty well, and it saves me when my Colab runtime times out halfway through.
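The equivalent from Python, if your training lives in a script or notebook:

from ultralytics import YOLO

model = YOLO("runs/detect/wildlife_detector/weights/last.pt")
model.train(resume=True)  # continues the interrupted run with the same settings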
5.3 Dealing with Overfitting or Underfitting (Read the Signs)
Here’s the deal: training logs and the results.png graph aren’t just pretty charts—they’re diagnostics. I learned to read them the hard way.
Let’s break it down like I do when reviewing a run:
Signs of overfitting (Been there, seen it):
- Training loss keeps going down, but validation mAP stagnates or drops
- Big gap between box_loss and val/box_loss
- Precision improves, recall tanks
Fix it with (quick example right after this list):
- Stronger augmentations (mosaic, hsv_h, fliplr)
- Early stopping or fewer epochs
- Reduce model size (yes, that helps)
- More regularization: bump weight_decay or lower the cls loss gain in your hyp file
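In practice, the two knobs I reach for first are early stopping (patience=) and dialing down augmentation and loss gains. A hedged example, numbers to taste:

yolo task=detect mode=train model=yolov8m.pt data=dataset/data.yaml \
    epochs=200 imgsz=640 patience=20 mosaic=0.8 mixup=0.0 cls=0.3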
Signs of underfitting (Had this too):
- Both train and val losses stay high
- mAP is flatlined near zero
- The model is painfully slow to learn anything
Tweak this:
- Raise lr0 slightly (e.g., 0.01 → 0.015)
- Increase epochs (maybe 200–300)
- Improve label quality (seriously, check 10 random samples manually)
- Train longer with cosine LR decay (cos_lr=True)
Example: Reading results.png
Let me show you what I look at:
from matplotlib import pyplot as plt
import cv2
img = cv2.imread("runs/detect/wildlife_detector/results.png")
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.title("Training Metrics Overview")
plt.axis("off")
plt.show()
If I see val/box_loss diverging from train/box_loss—I know something’s cooking.
6. Evaluating the Model
6.1 Metrics That Actually Matter
Let’s cut through the noise. I’ve seen plenty of people obsess over mAP without understanding which mAP they’re quoting. Personally, I focus on both mAP@0.5 and mAP@0.5:0.95—but for different reasons.
- mAP@0.5: Great for a quick sanity check. Think of it as the low-hanging fruit.
- mAP@0.5:0.95: This is what separates average from state-of-the-art. It’s stricter and gives a more honest view of how well your model localizes across IoU thresholds.
I’ve had models scoring 90+ on mAP@0.5 but barely crossing 60 at mAP@0.5:0.95. Trust me, the latter exposes weaknesses—especially on edge cases.
Precision-Recall Curves
This might surprise you, but I find more insight from PR curves than a single mAP score. Look for sharp drop-offs—that usually signals inconsistency in how confidently your model predicts certain classes.
When I saw a PR curve flatten out early for a specific class, it led me to realize the model wasn’t confident enough. Turns out, I had class imbalance issues in my dataset.
6.2 Using yolo val Like You Mean It
Here’s how I validate:
yolo task=detect mode=val model=best.pt data=dataset/data.yaml
And here’s what you get out of it:
- mAP@0.5 and mAP@0.5:0.95
- Precision, Recall
- Confusion matrix
- Per-class performance
But don’t just look at numbers—tune your confidence threshold. YOLOv8 predicts at conf=0.25 by default (val itself scores mAP at a much lower threshold). I usually sweep from 0.1 to 0.5 depending on the use case. For safety-critical domains, I’ve used 0.6+ just to avoid false positives.
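Here’s the kind of sweep I run to pick that threshold. A sketch with the Python val API (conf is a supported val argument; exact metric attribute names can vary slightly across ultralytics versions):

from ultralytics import YOLO

model = YOLO("runs/detect/wildlife_detector/weights/best.pt")

for conf in (0.1, 0.25, 0.4, 0.5):
    metrics = model.val(data="dataset/data.yaml", conf=conf)
    print(f"conf={conf}: mAP50={metrics.box.map50:.3f}  mAP50-95={metrics.box.map:.3f}")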
6.3 Confusion Matrix & Class-Level Deep Dives
This is where it gets real. That matrix isn’t just decoration.
For one wildlife project, I noticed the model kept confusing “bobcat” with “lynx.” The confusion matrix made it painfully obvious. Both had similar patterns, but one class was underrepresented.
Here’s what I did:
- Boosted samples of “lynx” using synthetic augmentation
- Added a new background class to reduce noise
- Increased image size from 640 to 896 for better feature capture
That one change? Took my per-class mAP from 61 → 79 in a single run.
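A quick way to confirm that kind of imbalance before retraining is to count boxes per class straight from the label files:

import glob
from collections import Counter

counts = Counter()
for txt in glob.glob("dataset/labels/train/*.txt"):
    with open(txt) as f:
        for line in f:
            counts[int(line.split()[0])] += 1

for cls_id, n in sorted(counts.items()):
    print(f"class {cls_id}: {n} boxes")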
7. Inference & Deployment
7.1 Inference on Single or Batch Images
Once I’ve got my best.pt, I always validate inference on sample images before even thinking about deployment.
from ultralytics import YOLO
model = YOLO("runs/detect/wildlife_detector/weights/best.pt")
# Single image inference
results = model("test_images/lion.jpg")
results[0].show()
For batch inference:
import os

image_folder = "test_images/"
os.makedirs("inferred", exist_ok=True)  # output folder for annotated images

for file in os.listdir(image_folder):
    if file.endswith(".jpg"):
        path = os.path.join(image_folder, file)
        results = model(path)
        results[0].save(filename=f"inferred/{file}")
7.2 Performance Benchmarking (Know Before You Deploy)
You might be wondering: Is this model fast enough for real-time deployment?
Here’s how I quickly benchmark on different devices:
import time
import torch
from ultralytics import YOLO

model = YOLO("best.pt")

# Warm-up run so model loading / CUDA init doesn't skew the timing
_ = model("test_images/lion.jpg")

start = time.time()
_ = model("test_images/lion.jpg")
end = time.time()

print("Inference Time (Single Image):", round(end - start, 3), "seconds")
print("Running on:", "GPU" if torch.cuda.is_available() else "CPU")
On my RTX 3060, I get ~23ms per frame at 640×640. On CPU? Closer to 300ms. That’s your bottleneck if deploying to edge.
7.3 Exporting the Model (The Gotchas)
Exporting your model for deployment isn’t always a one-liner. But YOLOv8 makes it as painless as it gets.
yolo export model=best.pt format=onnx
YOLO also supports:
- torchscript
- openvino
- coreml
- engine (TensorRT)
I personally prefer ONNX for cloud deployment. But be warned—ONNX opset mismatches can silently kill performance or throw cryptic errors.
Here’s one I hit:
RuntimeError: Exporting the operator 'aten::meshgrid' to ONNX opset version 11 is not supported.
Fix? Just bump the opset:
yolo export model=best.pt format=onnx opset=12
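Once the export succeeds, I load the ONNX file and push a dummy tensor through it before shipping. A sketch assuming onnxruntime is installed and a 640×640 export:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("best.onnx", providers=["CPUExecutionProvider"])
inp = session.get_inputs()[0]
print("input:", inp.name, inp.shape)

# Dummy NCHW float32 image, just to confirm the graph runs end to end
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = session.run(None, {inp.name: dummy})
print("output shapes:", [o.shape for o in outputs])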
8. Advanced Tips (Hard-earned Lessons)
“In theory, there is no difference between theory and practice. In practice, there is.”
— Yogi Berra
That quote hits home when you start pushing YOLOv8 to its limits. What I’m about to share here are the kinds of things you learn the hard way—after multiple experiments, a few failed models, and way too many TensorBoard sessions.
Custom Anchor Boxes: When You Actually Need Them
YOLOv8 is anchor-free by default, so most folks ignore anchors entirely. But here’s the catch: if you’re still using a legacy anchor-based version (like YOLOv5), or your project depends on custom anchor behavior, you can’t afford to overlook this.
I had a dataset once with mostly tiny objects—like 12×12 pixels in 640×640 images. Default anchors were useless.
What worked for me (using the autoanchor utility from the YOLOv5 repo, run from inside that repo):
from utils.autoanchor import kmean_anchors
kmean_anchors(dataset='data.yaml', n=9, img_size=640)
This computes optimized anchor boxes from your dataset’s label statistics. After that, retraining made a huge difference in recall—especially for small-object detection.
If your objects vary wildly in size, don’t even bother with custom anchors. Let YOLOv8’s anchor-free detection head handle it.
Mosaic and HSV Augmentation—But Not Blindly
This might surprise you: overusing Mosaic can mess up localization on tightly packed objects. I once trained a traffic detection model and noticed weird bounding box jitters during inference. Turned out, Mosaic augmentation was distorting context too much.
Here’s a more surgical approach (these keys sit at the top level of the override YAML from section 4.3):
mosaic: 0.7   # default is 1.0
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
My rule of thumb: dial Mosaic down to 0.6–0.8 if your dataset has dense or overlapping objects. For HSV, I tweak saturation more than hue—it tends to generalize better.
Model Ensembling: When One YOLO Isn’t Enough
There was a time I had two models—yolov8m trained on clean, curated data and yolov8s trained on noisy, diverse real-world samples. Neither was perfect. But together? That’s where ensembling came in.
I used Weighted Boxes Fusion (WBF), which works like a softer version of NMS voting, to merge the post-processed results from both models:
from ensemble_boxes import weighted_boxes_fusion

# Combine predictions from two models.
# Note: weighted_boxes_fusion expects box coordinates normalized to [0, 1].
boxes_list = [boxes_model1, boxes_model2]
scores_list = [scores_model1, scores_model2]
labels_list = [labels_model1, labels_model2]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list, iou_thr=0.55, skip_box_thr=0.4
)
It boosted mAP by 3 points in production. Not massive, but when you’re squeezing out every last bit of performance, it matters.
TTA (Test Time Augmentation)
I don’t use TTA on every project—but when I do, it’s for final eval or deployment edge-cases. It’s like giving your model a second (or third) opinion on the same image.
With YOLOv8, you can hack in TTA by flipping, resizing, or rotating inputs during inference and averaging the predictions.
Here’s a rough structure I use:
# `image` is a loaded BGR frame; `model` is the YOLO model from earlier
flipped = cv2.flip(image, 1)
resized = cv2.resize(image, (720, 720))
original = image.copy()

results = []
for img in [original, flipped, resized]:
    r = model(img)
    results.append(r[0].boxes.xyxy.cpu().numpy())

# Merge results with NMS — but first map each set of boxes back to the
# original image frame (un-flip the x-coordinates, rescale the resized ones),
# otherwise the merged boxes won't line up.
Pro tip: If you’re using TTA, make sure your NMS thresholds are tuned accordingly. You’ll get way more overlapping boxes.
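Also worth knowing: ultralytics ships a built-in augmented-inference flag, which I try before hand-rolling anything. Under the hood it runs scaled and flipped passes and merges them, at some latency cost:

results = model("test_images/lion.jpg", augment=True)  # built-in test-time augmentation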
Final Thoughts
What I’ve Learned (The Hard Way)
When I first started working with YOLOv8, I thought it would just work out of the box. It did—for basic stuff. But the moment I moved into real-world data (imperfect, noisy, imbalanced), I hit a wall.
What changed the game for me was:
- Understanding how to balance augmentation vs. overfitting
- Reading PR curves instead of just chasing mAP
- Learning when to stop training early (thanks, TensorBoard)
The biggest shift? I stopped treating it like a plug-and-play tool and started treating it like a system I had to understand.
When Not to Use YOLOv8
You might be wondering: Is YOLOv8 always the right call?
Here’s the deal:
- Not great for extreme precision: If you’re in medical imaging where false positives are unacceptable, YOLO’s confidence thresholding can be too relaxed.
- Not ideal for ultra-tiny objects in huge images: You’ll be better off with DETR or hybrid transformer backbones that preserve spatial resolution.
- Long training cycles: Large-scale datasets might need more efficient distributed training than YOLOv8 currently supports natively.
Bonus: My GitHub & Colab
I always like to leave something tangible. If you want to dive deeper or test a few of these ideas in your own project, here’s a sample Colab and GitHub I put together:
