1. Why Fine-Tune SD 3.5 Large?
Let me be direct: there comes a point where prompt engineering just doesn't cut it. I've been there.
I was trying to get Stable Diffusion to generate a specific style of technical illustration for a robotics use case.
After 50+ prompt variations and CLIP interrogator tweaks, the results were still off-brand and visually inconsistent.
That’s when I realized: if you want truly controllable outputs—something that looks like your data, your aesthetic, your domain—you’re not getting there without fine-tuning.
Personally, I’ve fine-tuned 3.5 Large on product catalog photos, cinematic environments, and even protein structure visualizations.
The moment you inject domain-specific visual priors into the model, the results stop being generic and start being yours.
2. Environment Setup (No Fluff, Just What You’ll Actually Use)
If you’ve tried messing with 3.5 Large without a solid setup, you already know—this model is a memory hog. I’ve tested this on an A100 80GB, 2x A6000s, and even tried a single 3090 just to see how far I could push it (spoiler: not far).
What worked for me consistently:
- GPU: A100 80GB is ideal. You can get away with two A6000s using gradient checkpointing.
- CUDA: 12.1 worked best in my tests. Lower versions led to weird runtime errors.
- PyTorch: 2.1.0 with torchvision 0.16
- Diffusers: Use the latest version from the Hugging Face repo, not PyPI. I've run into multiple mismatches otherwise.
- Accelerate: Essential if you’re working with multiple GPUs or want mixed precision.
Here’s the exact environment I personally use when starting fresh:
# Environment setup that just works
conda create -n sd3.5-ft python=3.10 -y
conda activate sd3.5-ft
# Install PyTorch with GPU support
pip install torch==2.1.0 torchvision --index-url https://download.pytorch.org/whl/cu121
# Install core packages
pip install diffusers[torch] accelerate transformers datasets peft bitsandbytes
Pro tip: If you’re using A100s with bf16 support, pass
--mixed_precision=bf16
to accelerate config. It’s significantly faster, and I’ve noticed fewer stability issues in long training runs.
Make sure you have at least 200GB of free space before downloading SD 3.5 Large and your datasets. I’ve filled up SSDs mid-run and had to redo everything more times than I’d like to admit.
3. Getting the Base Model: Stable Diffusion 3.5 Large
Before you do anything, get ready for a hefty download. The model itself takes up serious disk space, and if you’re on a slow connection, it’s going to test your patience.
I personally use the diffusers library with transformers for model loading. I've also run into my fair share of odd bugs—everything from pickle issues to models failing to load halfway through training. Here's what's worked for me consistently.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    use_safetensors=True,  # important — prevents a lot of weird load-time crashes
)
pipe.to("cuda")
Heads-up: You’ll need to be logged in to Hugging Face with an access token to pull this model. It’s gated behind a terms-of-use wall. I’ve had it silently fail on remote servers because the token wasn’t set correctly—make sure you run huggingface-cli login on the right machine.
Also, use use_safetensors=True. I learned the hard way that standard PyTorch .bin weights can randomly trigger deserialization issues, especially with diffusers+accelerate. This small flag saves hours of debugging.
Finally, make sure your disk has at least 30-40GB free—the model, cache, optimizer states, and any temp files will pile up quick.
4. Dataset Preparation (Crucial Section)
If there’s one part of the pipeline where I spend the most time—it’s this one. Getting the dataset right is the difference between a model that looks stunning and one that outputs chaos.
I’ve worked with several dataset formats over time. Here’s what I’ve found works best:
Supported Formats I’ve Used
- Folder of images + prompts in a CSV or JSON → Great for quick tests and small custom datasets.
- WebDataset (tar shards + caption txt) → What I use for large-scale fine-tuning. Efficient for streaming, and integrates nicely with diffusers.
- LAION-style JSON (image_url + caption) → Perfect when scraping new domains with img2dataset.
Cleaning & Preprocessing (What Actually Matters)
You can fine-tune on junk, or you can fine-tune on gold—it all comes down to how you prep your data.
Here’s my usual checklist:
- Resize all images to 1024×1024 (or the model's native size) using PIL.Image.LANCZOS. I avoid center-cropping unless the composition demands it (see the resize sketch just after this list).
- Prompt deduplication: Remove near-identical captions. Repetitive prompts kill variance during training.
- Caption enhancement: If your dataset has weak prompts, use BLIP or CLIP Interrogator to auto-generate better ones.
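Here's roughly what that resize step looks like in practice; the folder names, file pattern, and target size below are placeholders you'd swap for your own dataset:
from pathlib import Path
from PIL import Image

SRC, DST = Path("raw_images"), Path("clean_images")  # placeholder paths
DST.mkdir(exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    # LANCZOS resampling preserves fine detail better than the default filter
    img = img.resize((1024, 1024), Image.LANCZOS)
    img.save(DST / img_path.name, quality=95)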
I wrote a quick script to re-caption an entire dataset with BLIP2—it made a noticeable difference in final outputs.
# img2dataset is only needed if you go the LAION-style scraping route mentioned above
pip install img2dataset

# Example: re-caption with BLIP (swap in Blip2Processor / Blip2ForConditionalGeneration for BLIP-2)
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")
image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt").to("cuda")
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
Tip from experience: Tokenize prompts before training. Use the tokenizer from your model checkpoint—not the default CLIP one. It reduces CPU bottlenecks during training, especially on multi-GPU setups.
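To make that concrete, here's a minimal pre-tokenization sketch. It assumes the pipeline is already loaded as pipe and that you have a list of caption strings; SD 3.5's pipeline actually carries several tokenizers (tokenizer, tokenizer_2, tokenizer_3), so repeat this for each one your training loop consumes:
captions = ["isometric technical illustration of a robotic arm", "exploded-view diagram of a gripper"]  # example captions

tokenized = pipe.tokenizer(
    captions,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
print(tokenized.input_ids.shape)  # (batch_size, sequence_length)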
5. Choosing the Right Fine-Tuning Strategy
There’s a lot of noise out there about which strategy to use. I’ve tested full fine-tuning, LoRA, QLoRA, even layer-wise freezing. Here’s what’s actually worked in my projects, especially when balancing compute cost with output quality.
5.1 Full Fine-Tuning (Rarely Recommended)
Let me be blunt—I’ve only used full fine-tuning when I had access to A100s and a reason to care about every last bit of model behavior. Even then, the memory overhead, training time, and instability made it a last resort.
You’re looking at:
- >150GB VRAM if you don’t use optimization tricks
- Long convergence cycles
- Higher risk of overfitting or catastrophic forgetting unless your dataset is massive and diverse
Unless you’re doing a full domain shift (e.g., MRI scans or infrared satellite imagery), I’d skip this.
5.2 LoRA / QLoRA on UNet + Text Encoder (What I Actually Use)
Here’s the deal: for 90% of my fine-tuning jobs, I use LoRA on the UNet and text encoder—and freeze everything else, including the VAE. (Strictly speaking, SD 3.5's backbone is an MMDiT transformer, exposed as pipe.transformer in diffusers, rather than a classic UNet; the same LoRA recipe applies.)
You could fine-tune the VAE, but in my experience, it barely moves the needle unless you’re changing image resolutions or domain drastically. Freezing it saves a ton of VRAM and compute.
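A minimal sketch of that freezing step, assuming the components are already loaded in a diffusers pipeline object called pipe (SD 3.5 exposes a transformer backbone and three text encoders):
# Freeze the base weights; peft's LoRA layers add their own trainable parameters on top
pipe.vae.requires_grad_(False)          # VAE stays frozen entirely
pipe.transformer.requires_grad_(False)  # SD 3.5's denoising backbone
for enc in (pipe.text_encoder, pipe.text_encoder_2, pipe.text_encoder_3):
    if enc is not None:                 # text_encoder_3 (T5) can be dropped at load time
        enc.requires_grad_(False)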
Here’s the exact config I’ve used:
from peft import get_peft_model, LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Attention projections in the transformer blocks; customize for your model's layer names
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    lora_dropout=0.1,
    bias="none",
    # Note: no task_type here; peft's task types target language models and aren't needed for diffusion backbones
)
model = get_peft_model(model, config)
Pro tip: If you’re using a custom UNet or a heavily modified text encoder, double-check the target_modules. I once used a config that silently didn’t apply LoRA at all—no warnings, just zero effect.
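One sanity check worth running right after get_peft_model to catch exactly that failure mode (peft exposes this on the wrapped model):
# Should report a small but non-zero trainable fraction if LoRA actually attached
model.print_trainable_parameters()

lora_params = [n for n, p in model.named_parameters() if "lora_" in n and p.requires_grad]
assert lora_params, "No LoRA parameters found; check target_modules"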
Also, if you’re constrained on memory, look into QLoRA—I’ve used it with bitsandbytes
to train on a single 3090. Not ideal, but it got the job done for a quick concept art prototype.
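I won't reproduce the full QLoRA setup here, but the cheapest related win is bitsandbytes' 8-bit AdamW, which shrinks optimizer-state memory substantially on its own; a minimal sketch:
import bitsandbytes as bnb

# 8-bit optimizer states instead of fp32 ones; only the trainable LoRA params are passed in
optimizer = bnb.optim.AdamW8bit(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
    weight_decay=1e-2,
)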
6. Training Loop (Real Code, Not Pseudo)
This is the part where most tutorials start to go vague. So let me show you what I actually use in production when training SD 3.5 Large with LoRA on multiple GPUs.
I use 🤗 accelerate for orchestrating mixed precision and multi-GPU training. Here’s a trimmed-down version of the loop I use in most of my runs:
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, dataloader, lr_scheduler
)

model.train()
for step, batch in enumerate(dataloader):
    with accelerator.autocast():
        outputs = model(**batch)
        loss = outputs.loss

    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()

    # Logging
    if step % 10 == 0:
        print(f"Step {step} | Loss: {loss.item():.4f}")

    # Save every 500 steps
    if step % 500 == 0:
        accelerator.save_state(f"./checkpoints/step-{step}")
Real-world tip: Always run with gradient accumulation if your batch size is tiny due to VRAM limits. I’ve used --gradient_accumulation_steps=4 on A6000s to simulate batch size 8.
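If you prefer setting it in code instead of on the accelerate launch command line, the same idea looks roughly like this, reusing the objects prepared in the loop above:
# Effective batch size = per-device batch × 4 accumulation steps × number of GPUs
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=4)

for step, batch in enumerate(dataloader):
    # accumulate() defers the gradient sync and optimizer step until enough micro-batches are in
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()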
Also—don’t skip logging loss or saving partial checkpoints. I once had a 16-hour run interrupted because of a power glitch, and I hadn’t saved anything since step 0. Never again.
7. Saving and Loading Fine-Tuned Models
You might be wondering: Do I save the whole model or just the LoRA adapters? Here’s what I usually do—and why.
If You’re Using LoRA (Which I Almost Always Am)
There’s no need to save the full UNet or text encoder weights if you’re doing LoRA. What matters is saving the trained LoRA adapters, and reapplying them during inference.
Here’s exactly how I save and load them in my own workflows:
# Save trained LoRA adapters
model.save_pretrained("sd3.5-custom-lora")
Then when I want to run inference later:
import torch
from diffusers import DiffusionPipeline
from peft import PeftModel

# Load the original SD 3.5 pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.to("cuda")

# Inject LoRA into the denoising backbone (pipe.transformer on SD 3.5; pipe.unet on earlier SD pipelines)
pipe.transformer = PeftModel.from_pretrained(pipe.transformer, "sd3.5-custom-lora")
Heads-up: If your LoRA adapters are targeting both the UNet and text encoder, you’ll need to load adapters into both components. In one of my early runs, I forgot the text encoder—prompt conditioning was way off until I fixed it.
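For completeness, here's the shape of that double load; the text-encoder adapter directory below is a hypothetical name, so match it to whatever you passed to save_pretrained:
# Backbone adapters (as above)
pipe.transformer = PeftModel.from_pretrained(pipe.transformer, "sd3.5-custom-lora")

# Text-encoder adapters; hypothetical directory name
pipe.text_encoder = PeftModel.from_pretrained(pipe.text_encoder, "sd3.5-custom-lora-text-encoder")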
If You Actually Did Full Fine-Tuning
In rare cases where I fine-tuned the whole model (and not just adapters), here’s how I save everything:
# Save entire pipeline (UNet, text encoder, VAE, scheduler)
pipe.save_pretrained("./sd3.5-finetuned-full")
Just make sure to keep your tokenizer and scheduler versions aligned when reloading. I’ve had mismatched versions break output consistency across sessions.
8. Validation and Evaluation
Alright, you’ve trained the model—now how do you know it worked?
I’ve learned to rely on a mix of visual grids, statistical metrics, and some brutally honest prompt testing.
Visual Inspection (Still My #1 Check)
What I usually do is fix a prompt and a few seed values, then generate a before vs. after grid. Something like this:
import torch
from diffusers.utils import make_image_grid

prompt = "isometric technical illustration of a robotic arm"  # fix one prompt for the comparison
seeds = [42, 123, 456]

images = [
    pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
    for seed in seeds
]
grid = make_image_grid(images, rows=1, cols=3)
grid.save("comparison.png")
Real-world example: I once trained a model to generate stylized product shots for a niche brand. Without a grid, I thought the fine-tuned model was better—but once I laid out the original vs. fine-tuned outputs, I realized the style had collapsed into near-identical compositions. Huge red flag.
Quantitative Checks: CLIPScore / FID
If you’re comparing distributions—say, before vs. after fine-tuning across a validation set—I use CLIPScore and sometimes FID for broader visual diversity checks.
torchmetrics ships a CLIPScore implementation that scores each generated image directly against the prompt that produced it, which is the comparison you actually want here. Basic usage:
from torchmetrics.functional.multimodal import clip_score

# images: generated images as a uint8 tensor of shape (N, 3, H, W), values in 0–255
# prompts: the N prompts used to generate those images
score = clip_score(images, prompts, model_name_or_path="openai/clip-vit-base-patch16")
print(float(score))
Overfitting Signs to Watch For
With SD models, overfitting can show up in weird ways. Here’s what I’ve personally seen:
- Prompt memorization: Model parrots exact training prompts even when you try variations
- Style collapse: Everything starts looking the same (same angles, same lighting, etc.)
- Loss keeps dropping, but visual quality degrades
If you’re seeing any of that, it’s usually time to:
- Add more training variation
- Lower learning rate
- Stop training earlier (I often early-stop around 2000 steps if it plateaus visually)
9. Inference Pipeline After Fine-Tuning
“A model that can’t generate on-demand is just a fancy checkpoint collecting dust.”
I’ve learned that integrating a fine-tuned LoRA into the original pipeline isn’t just about loading weights—you need to tune inference parameters carefully to get consistently good results.
Injecting LoRA into the Original Pipeline
Once you’ve trained your LoRA adapters and saved them, here’s how I load them into the DiffusionPipeline and start generating.
import torch
from diffusers import DiffusionPipeline
from peft import PeftModel

# Load base model
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.to("cuda")

# Inject LoRA weights into the denoising backbone (pipe.transformer on SD 3.5)
pipe.transformer = PeftModel.from_pretrained(pipe.transformer, "sd3.5-custom-lora")

# Optional: enable attention slicing if you're on a GPU with <16GB
pipe.enable_attention_slicing()
Best Generation Settings (That I’ve Found Useful)
Here’s what I’ve personally found to be reliable defaults post fine-tuning, though you’ll want to tweak per use case:
prompt = "a fantasy castle in sunset, golden hour, ultra detailed"
output = pipe(
    prompt,
    num_inference_steps=30,  # I usually stay between 25–50
    guidance_scale=7.5,      # 6.5–8.5 is my usual sweet spot
)
output.images[0].save("result.png")
Pro tip: If your outputs are overly literal or noisy, try switching from DDIM to Euler a or DPM++ 2M Karras via a custom scheduler. In one of my portrait LoRA runs, switching to Euler gave a huge bump in texture fidelity without changing the prompt at all.
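The general diffusers pattern for swapping schedulers is shown below; keep in mind that SD 3.5 ships with a flow-matching scheduler by default, so treat this as the mechanism rather than an endorsement of any particular sampler for this model, and verify compatibility on a short run first:
from diffusers import EulerAncestralDiscreteScheduler

# Rebuild a scheduler from the pipeline's existing config, then swap it in
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)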
10. Deployment Considerations
You’ve trained it. You’ve tested it. Now comes the “make it usable for others” step. Personally, I’ve shipped these models into everything from internal tools to public demo apps—and here’s what’s worked best.
Exporting to ONNX or TensorRT (Optional, Advanced)
If you’re planning to serve models at scale or on edge devices, exporting to ONNX or TensorRT can shave off latency. That said, I only go this route if I’m optimizing for throughput.
I’ll be honest—ONNX export for Diffusers isn’t trivial and tends to break with custom LoRA injections, especially on non-standard schedulers. Unless you’re doing hardcore deployment, I usually just skip this.
Serving: FastAPI / Gradio / Streamlit
For quick demos, I’ve built out lightweight Gradio interfaces like this:
import gradio as gr

def generate(prompt):
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    return image

gr.Interface(fn=generate, inputs="text", outputs="image").launch()
But when I’m packaging this into a backend service for real users, I reach for FastAPI:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate_image(data: PromptRequest):
    image = pipe(data.prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    image.save("latest_output.png")
    return {"status": "generated"}
Memory-Saving Tricks (PyTorch 2.1 and Beyond)
One trick I’ve personally found helpful—especially on A100s and H100s where inference memory is at a premium—is using torch.compile() in PyTorch 2.1+ to JIT optimize the pipeline.
import torch

# Compile the denoising backbone (pipe.transformer on SD 3.5; pipe.unet on earlier SD versions)
pipe.transformer = torch.compile(pipe.transformer)
I’ve seen up to 10–15% faster generation this way, especially when batching multiple prompts.
If you’re generating multiple images per request, batch them. Diffusers handles batched prompts well, and you’ll reduce redundant compute.
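Batching is as simple as passing a list of prompts in a single call:
prompts = [
    "a fantasy castle in sunset, golden hour, ultra detailed",
    "a fantasy castle in a thunderstorm, dramatic lighting",
]

# One pass through the pipeline covers the whole batch
result = pipe(prompts, num_inference_steps=30, guidance_scale=7.5)
for i, image in enumerate(result.images):
    image.save(f"batch_{i}.png")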
11. Common Pitfalls and Debugging
“Experience is the name everyone gives to their mistakes.” – Oscar Wilde
Let me walk you through the kind of mistakes I’ve either made myself—or helped others debug—so you don’t have to repeat them.
“CUDA out of memory” Errors: The Real Fixes
Everyone hits this wall eventually. Even on A100s, memory gets tight fast, especially if you forgot to enable 16-bit training or didn’t batch properly.
Here’s what’s actually helped me, ranked by impact:
- Use accelerate with mixed_precision="fp16" (or "bf16" if your GPU supports it)
- Set gradient_accumulation_steps > 1 — spreads the batch across steps
- Enable attention slicing and memory-efficient attention in the pipeline (see the sketch after this list)
- Reduce image size (e.g., 768×768 → 512×512) during training
- Patch the training loop to call torch.cuda.empty_cache() after .step() if memory leaks
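For the attention-related items, the relevant diffusers switches are one-liners (the second one needs the xformers package installed):
pipe.enable_attention_slicing()                    # trades a bit of speed for a lot of peak memory
pipe.enable_xformers_memory_efficient_attention()  # optional; requires xformers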
I also make it a habit to monitor GPU memory live using nvidia-smi dmon -s u. It’s the best way to spot fragmentation creeping in during long runs.
UNet Freezing by Mistake
This one can silently ruin a run—everything looks fine, except the model learns absolutely nothing. I’ve run into this mostly when integrating LoRA adapters manually.
What to check:
- If using LoRA, make sure requires_grad=True only for LoRA-injected parameters, not the entire UNet.
- If you’re freezing the VAE or other modules, confirm with:
for name, param in model.named_parameters():
    if "unet" in name:  # swap the substring for whichever module you're auditing (e.g. "transformer", "vae")
        print(name, param.requires_grad)
- Also double check that the optimizer is only getting trainable parameters, like:
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4
)
Tokenizer Mismatch
This one is sneaky and usually happens when you swap out or re-caption prompts using a different tokenizer than what the model expects.
Stable Diffusion 3.5 uses a specific tokenizer aligned with its text encoder (often a CLIP variant). Here’s what I now do religiously:
tokenizer = pipe.tokenizer
inputs = tokenizer(["a tiger in space", "a castle on fire"], padding="max_length", return_tensors="pt")
You don’t want to tokenize with GPT-2 or something incompatible. If you’re preprocessing prompts outside the training loop (say, in a JSON preprocessor), make sure it’s using the same tokenizer object.
I once pre-tokenized captions using transformers.AutoTokenizer with a default model—big mistake. Everything looked okay until fine-tuned generations came out garbled.
Broken Scheduler or Loss Not Going Down
If your training loss flatlines or bounces around wildly, here’s what I’d recommend checking:
Learning Rate Scheduler
Sometimes, lr_scheduler.step() is called incorrectly or too often. Use accelerate’s built-in scheduler tracking, or double-check your step logic:
# Step only once per actual optimizer update
if (step + 1) % gradient_accumulation_steps == 0:
    lr_scheduler.step()
Bad LoRA Config
If your loss never moves at all, your LoRA config might be too conservative.
- Try lowering the rank (e.g., r=8 instead of r=16)
- Increase lora_alpha
- Disable dropout (just to test learning capacity)
LoraConfig(
    r=8,
    lora_alpha=64,
    lora_dropout=0.0,
    ...
)
Sanity Prompt Test
I always run a before/after comparison using the same seed + prompt every 1000 steps. If the output doesn’t change, something’s definitely broken.
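A tiny helper along those lines; the prompt, seed, and output directory are placeholders:
import os
import torch

SANITY_PROMPT = "isometric technical illustration of a robotic arm"  # any fixed prompt works
SANITY_SEED = 42

def sanity_check(pipe, step, out_dir="sanity"):
    # Same prompt + same seed every run: the only thing left to change the output is the weights
    os.makedirs(out_dir, exist_ok=True)
    generator = torch.Generator("cuda").manual_seed(SANITY_SEED)
    image = pipe(SANITY_PROMPT, generator=generator, num_inference_steps=30).images[0]
    image.save(f"{out_dir}/step_{step}.png")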
