1. Introduction
“The moment you add vision to language models, everything breaks — preprocessing, formatting, memory requirements, even your idea of what ‘fine-tuning’ means.”
I wish someone had told me that earlier.
This post is not a gentle introduction to LLMs, vision transformers, or multimodal learning. I’m assuming you’ve already been in the trenches with LLMs like LLaMA, BERT, or Mistral. You’ve probably fine-tuned a few on text. Now you’re looking to level up — adding images into the mix using LLaVA.
Here’s the deal: fine-tuning LLaVA is a different beast. It’s not just “plug in some images and go.” I’ve gone through the headaches myself — bad formats, mismatched encoders, and memory crashes at the worst moments. That’s why I’m writing this.
What you’ll get in this post:
- How to structure your own multimodal dataset for LLaVA.
- How to fine-tune LLaVA using either full training or LoRA/QLoRA (I’ll show both).
- How to run inference and sanity check your outputs.
- And yes — you’ll get working code, config examples, and personal insights from my own setup. Nothing vague.
Let’s get into it.
2. Setup and Environment
“Before you train LLaVA, make sure your environment doesn’t sabotage you.”
I’ve personally spent hours debugging things that turned out to be minor version mismatches or invisible CUDA errors. So here’s what actually works — not what the README says, but what I’ve tested myself.
Hardware I Used:
- 2× A100 80GB for full fine-tune
- 1× A100 or 3090 is enough for LoRA/QLoRA
(If you’re just prototyping, I’ve even gotten LoRA training started on a 24GB 3090 — though you’ll need to reduce batch size and enable gradient checkpointing.)
CUDA/cuDNN Notes:
- CUDA 11.8 + cuDNN 8.6 worked fine for me.
- If you’re using a different setup, double-check torch.version.cuda before moving forward.
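Once you’ve installed PyTorch (next step), a quick check confirms which CUDA build torch actually sees:

import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

If that last value comes back False, fix the driver/CUDA mismatch before touching anything LLaVA-specific.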
Environment Setup
I used conda, and I’d recommend it unless you’re doing containerized training.
conda create -n llava-finetune python=3.10 -y
conda activate llava-finetune
Install PyTorch with CUDA:
# Adjust according to your GPU
pip install torch==2.1.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Then install LLaVA dependencies:
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
Pin exact versions — I’ve had incompatibilities with newer releases, especially with transformers.
pip install transformers==4.39.1 peft==0.8.2 bitsandbytes==0.42.0 accelerate==0.28.0
This will save you from a lot of headaches down the road — trust me on this.
3. Cloning and Modifying the LLaVA Repository
“If you treat the LLaVA repo like a plug-and-play library, it will fight back.”
I learned this the hard way — you do need to peek under the hood if you’re serious about fine-tuning. Especially if you’re adapting it to your own data or trying out LoRA.
First things first: clone the repo I used (yes, this exact one — avoid random forks unless you know what’s different):
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
Installing in editable mode (-e) is crucial if you’re modifying code — and trust me, you will.
Relevant Files to Actually Look At
You might be wondering, “Which files do I actually need to care about?”
Here’s what mattered in my own workflow:
- train/train_mem.py or train/train.py: Main entrypoints for training. If you’re using LoRA, go with train_mem.py — it’s optimized for memory-efficient training.
- conversation.py: This handles the prompt formatting logic — the way human and assistant messages are stitched together.
- llava/model/llava_llama.py: Core multimodal model definition — where the vision encoder and language model get fused.

When I was tweaking the conversation template to match my dataset (more on that below), most of the changes happened in conversation.py. If you’re customizing prompts or building new instruction formats, that’s where you’ll likely spend time.
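One more thing: if all you need is a tweaked system prompt or different role wording, you often don’t have to rewrite conversation.py at all. Here’s a minimal sketch of the approach I mean, assuming the llava_v1 template and the attributes it exposes in conversation.py:

# sketch: derive a custom template from llava_v1 instead of editing it in place
from llava.conversation import conv_templates

custom_conv = conv_templates["llava_v1"].copy()
custom_conv.system = (
    "You are an assistant that answers questions about product catalog images."
)
# make it available wherever the code does conv_templates[...].copy()
conv_templates["llava_custom"] = custom_conv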
4. Preparing a Custom Multimodal Dataset
“This is where 90% of the pain comes from — get this part wrong, and your model will spit nonsense.”
I’ve gone through a few dataset formats that just looked right but silently broke training. Eventually, I reverse-engineered the structure from llava_instruct_150k.json and tailored my own script around it.
Dataset Format
Each sample is a JSON object with:
- An image path
- A list of conversation turns (from: human / from: gpt)
- The <image> token as the first user message (critical for alignment)
Here’s an example that worked for me:
{
  "id": "sample_001",
  "image": "images/sample_001.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\nWhat's happening in this image?"},
    {"from": "gpt", "value": "A cat is sitting on a robotic vacuum cleaner."}
  ]
}
You might be wondering: “What if my dataset is in another format?”
Same here. I had to write a quick Python script to convert mine — it wasn’t clean, but it did the job:
Dataset Converter Script (Rough but Works)
import os
import json

def convert_to_llava_format(image_dir, annotations):
    dataset = []
    for idx, ann in enumerate(annotations):
        image_file = os.path.join(image_dir, ann["image_filename"])
        question = ann["question"]
        answer = ann["answer"]
        item = {
            "id": f"sample_{idx:05d}",
            "image": image_file,
            "conversations": [
                {"from": "human", "value": f"<image>\n{question}"},
                {"from": "gpt", "value": answer}
            ]
        }
        dataset.append(item)
    return dataset

# Example usage
if __name__ == "__main__":
    with open("raw_annotations.json") as f:
        annotations = json.load(f)
    output = convert_to_llava_format("images/", annotations)
    with open("llava_custom_dataset.json", "w") as f:
        json.dump(output, f, indent=2)
A Quick Note on Image Preprocessing
You don’t need to resize or normalize the images beforehand — LLaVA does that internally via the vision encoder’s transform pipeline. But you do need to make sure the image paths in the JSON are valid relative to your training script.
I personally keep them under a flat folder structure like:
/project-root/
├── images/
│ ├── sample_001.jpg
│ ├── sample_002.jpg
└── data/
└── llava_custom_dataset.json
This helps avoid broken path issues, especially if you’re training across multiple nodes or moving between environments.
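Before kicking off a multi-hour run, I’d also suggest a quick validation pass over the JSON. Here’s a throwaway check (paths assume the layout above) that catches missing files and malformed samples early:

# validate_dataset.py — sanity-check the dataset before training
import json
import os

with open("data/llava_custom_dataset.json") as f:
    samples = json.load(f)

missing = []
for sample in samples:
    # every sample should start with a human turn containing the <image> token
    assert sample["conversations"][0]["from"] == "human"
    assert sample["conversations"][0]["value"].startswith("<image>")
    if not os.path.exists(sample["image"]):
        missing.append(sample["image"])

print(f"{len(samples)} samples checked, {len(missing)} missing images")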
5. Preprocessing Images
“Don’t just throw raw JPEGs at your model and hope for the best — that’s how you end up with junk gradients.”
Here’s the deal: LLaVA uses a CLIP-based vision encoder under the hood (usually openai/clip-vit-large-patch14). It’s designed to take in a 224×224 RGB image that’s already been normalized using CLIP’s mean and std.
If you’ve used CLIP before, this should ring a bell. But if you’re building a large dataset, doing this preprocessing during training is a waste of compute.
What LLaVA Does Internally
When training, LLaVA applies a standard CLIPImageProcessor transformation — resizing, center cropping, then normalizing with:
from torchvision import transforms

normalize = transforms.Normalize(
    mean=[0.48145466, 0.4578275, 0.40821073],
    std=[0.26862954, 0.26130258, 0.27577711]
)
Every image goes through this pipeline on the fly. It works — but once your dataset grows beyond a few thousand samples, you’ll start to feel the slowdown.
My Solution: Pre-encode and Cache with CLIP
I’ve personally had smoother training runs by pre-encoding images using the same CLIP model LLaVA uses — and saving the outputs as .npy files. That way, you skip image processing entirely during training.
Here’s a script I wrote to do exactly that:
# preprocess_images.py
import os
import torch
import numpy as np
from PIL import Image
from tqdm import tqdm
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def encode_image(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = clip_processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        embeddings = clip_model.get_image_features(**inputs)
    return embeddings.squeeze().cpu().numpy()

def cache_dataset_images(image_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for filename in tqdm(os.listdir(image_dir)):
        if filename.lower().endswith((".jpg", ".png", ".jpeg")):
            image_path = os.path.join(image_dir, filename)
            output_path = os.path.join(output_dir, f"{filename}.npy")
            if not os.path.exists(output_path):
                embedding = encode_image(image_path)
                np.save(output_path, embedding)

# Example usage
cache_dataset_images("images/", "encoded/")
This script will save you hours during training. The .npy files can then be loaded directly during your dataset construction — just be sure your training script knows to skip image preprocessing and use the cached embeddings instead.
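The stock LLaVA dataloader still expects raw pixel values, so wiring the cache in means adapting your dataset class. Here’s a minimal sketch of the loading side; the paths match the script above, and it assumes your modified training loop accepts precomputed image features instead of pixel values:

# sketch: dataset that returns cached CLIP features instead of raw images
import json
import os
import numpy as np
import torch
from torch.utils.data import Dataset

class CachedImageQADataset(Dataset):
    def __init__(self, json_path, cache_dir="encoded/"):
        with open(json_path) as f:
            self.samples = json.load(f)
        self.cache_dir = cache_dir

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # cache files were written as "<original filename>.npy" by preprocess_images.py
        cache_file = os.path.basename(sample["image"]) + ".npy"
        features = np.load(os.path.join(self.cache_dir, cache_file))
        return {
            "image_features": torch.from_numpy(features),
            "conversations": sample["conversations"],
        }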
6. Choosing the Right Base Model
“Choosing a base model is like choosing a sparring partner — pick one that pushes you, not one that knocks you out.”
llava-7b vs llava-13b vs Qwen-VL: What I’ve Found
Here’s how I think about it — if you’re running training on a single A100 or a 3090, stick to llava-7b or a LoRA version of it. You’ll thank yourself later.
If you’ve got the GPU headroom — 2× A100s or more — llava-13b gives better reasoning, especially on multimodal tasks where image-text grounding matters. But it’s heavy, no two ways about it.
Now, if you’re exploring outside the LLaVA family, Qwen-VL is the one I’ve experimented with that has real potential — stronger multilingual support, cleaner image understanding. But it’s also more finicky with formatting and tokenizer handling.
Pretrained Checkpoints I Recommend
For most practical use cases, I’ve had the best results with these:
"liuhaotian/llava-llama-2-7b-chat-lightning-preview"
"liuhaotian/llava-llama-2-13b-chat"
"Qwen/Qwen-VL-Chat"
(if you’re fine stepping outside the LLaVA ecosystem)
You might be wondering: “How do I actually load these with the right tokenizer and vision encoder?”
Here’s the working code snippet I used for loading the 7B Lightning preview:
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_name_or_path = "liuhaotian/llava-llama-2-7b-chat-lightning-preview"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_name_or_path,
    model_base=None,  # Only needed for LoRA
    model_name=get_model_name_from_path(model_name_or_path),
    load_8bit=False,  # Or True, if you want quantized
    device_map="auto"
)
If you’re doing LoRA or QLoRA, pass the model_base and adapter_name_or_path accordingly — I’ll show that when we get to the training section.
7. Training Strategy
“If fine-tuning is your hammer, knowing when to use LoRA is what keeps you from smashing screws.”
Let me get straight to the point — I’ve tried full fine-tuning, LoRA, and QLoRA on LLaVA-style models. Here’s what worked, what didn’t, and when you should use each one.
Full Fine-Tuning vs. LoRA vs. QLoRA — What I Actually Use (and Why)
I’ll be honest — I rarely go for full fine-tuning unless I’m working with a custom model on serious hardware (think 4× A100s or more). It eats memory, takes forever, and frankly, it’s just overkill most of the time.
Instead, I’ve had great results with LoRA when fine-tuning LLaVA for vision-language tasks. It gives you 90% of the performance at a fraction of the cost. QLoRA adds even more memory savings — but it comes with a bit more complexity during setup.
TL;DR:
- Use full fine-tuning only if you’re changing everything, including the vision encoder.
- Use LoRA if you want fast iteration, low VRAM usage, and great performance.
- Use QLoRA when memory is tight — but be ready to juggle quantization configs.
Adding LoRA with PEFT — My Setup
When I used peft to integrate LoRA into the LLaVA training script, here’s how I configured it:
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
I’ve tested a few combinations, but this config consistently gave me the best tradeoff between convergence and speed. You can tweak r and alpha based on your dataset size — for small datasets, I usually lower them to prevent overfitting.
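One quick check I always run right after get_peft_model: confirm that only the adapter weights are trainable.

# should report a trainable fraction well under 1% of the total parameter count
model.print_trainable_parameters()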
My Actual Training Command (With All the Flags That Matter)
When fine-tuning llava-llama-2-7b-chat on a custom dataset, here’s the exact command I used with LoRA enabled:
python llava/train/train_mem.py \
--model_name_or_path liuhaotian/llava-llama-2-7b-chat \
--data_path data/my_custom_data.json \
--vision_tower openai/clip-vit-large-patch14 \
--image_folder data/images \
--output_dir output/my_llava_lora \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--save_strategy epoch \
--save_total_limit 2 \
--learning_rate 5e-5 \
--lr_scheduler_type cosine \
--weight_decay 0.1 \
--logging_steps 10 \
--report_to wandb \
--fp16 \
--lora_r 16 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_target q_proj,v_proj,k_proj,o_proj,gate_proj,up_proj,down_proj \
--gradient_checkpointing \
--logging_dir logs/ \
--model_max_length 2048
Let me break that down for you real quick:
- gradient_accumulation_steps helps if you’re tight on GPU memory — I’ve used 8 effectively on a single A100.
- --gradient_checkpointing saves even more memory at the cost of speed.
- --lora_target is crucial. If you skip this or use the wrong layers, your model won’t learn a thing. I had to inspect the model’s state_dict() to confirm which layers were safe to target (see the snippet after this list).
- I like wandb for tracking metrics, but tensorboard works fine too — just plug in --report_to tensorboard.
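If you’re not sure which names to pass to --lora_target, listing the model’s Linear modules is quicker than digging through state_dict(). A small sketch (the "vision_tower" filter assumes the repo’s usual module naming, so verify it on your checkpoint):

# list candidate LoRA target modules by leaf name
import torch.nn as nn

linear_names = set()
for name, module in model.named_modules():
    if isinstance(module, nn.Linear) and "vision_tower" not in name:
        linear_names.add(name.split(".")[-1])  # e.g. "q_proj", "down_proj"

print(sorted(linear_names))  # note: lm_head shows up too and usually isn't a LoRA target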
Pro Tips That Saved Me Hours
- Checkpoint often — especially if you’re training on spot instances. I once lost 2 epochs of progress because I forgot --save_strategy epoch.
- Use --fp16 or --bf16 depending on your GPU — it cuts memory usage without tanking performance.
- If using QLoRA, don’t forget to set quantization properly (bnb_4bit, bnb_4bit_use_double_quant, etc.) — there’s a quick sketch right below, and I’ll walk through it in the QLoRA-specific section later.
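As a preview, here’s roughly the quantization config I mean: a minimal sketch using transformers’ BitsAndBytesConfig, with dtypes you may need to adjust for your GPU.

# sketch: 4-bit (QLoRA-style) quantization config via bitsandbytes
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bf16
)
# pass this as quantization_config when loading the base model,
# then attach the LoRA adapters with peft as shown earlier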
9. Running Inference
“Fine-tuning is only half the battle. If inference is slow or flaky, nobody’s using your model twice.”
This is where everything comes together. After training, I always run a few sanity checks to make sure the model behaves the way I expect — not just on benchmarks, but on edge cases I know from experience tend to trip up weaker models.
Let me walk you through the exact setup I use for running inference on a fine-tuned LLaVA model.
Loading Your Fine-Tuned Model
I’ve been using the load_pretrained_model utility from the LLaVA repo — it wraps the vision encoder, tokenizer, and language model into a clean, ready-to-use pipeline. Here’s a minimal script I wrote to test image + question pairs:
# inference.py
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, tokenizer_image_token
from llava.model.builder import load_pretrained_model

# Load your fine-tuned model
model_path = "output/my_llava_lora"  # Or wherever your model checkpoint is
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path=model_path,
    model_base=None,  # Set this to the base model if you're applying LoRA
    model_name=get_model_name_from_path(model_path),
    device_map="auto"
)

# Preprocess input
image = Image.open("examples/dog.jpg").convert("RGB")
image_tensor = image_processor(image, return_tensors="pt")["pixel_values"].to(
    model.device, dtype=torch.float16
)

# Prompt setup: the <image> token has to be part of the first user turn
prompt = DEFAULT_IMAGE_TOKEN + "\nWhat breed is this dog?"
conv = conv_templates["llava_v1"].copy()
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

# Run inference
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        temperature=0.2,  # only used if do_sample=True
        max_new_tokens=64
    )

output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Model response:", output)
You’ll need to adjust model_base only if you’re using LoRA — in that case, make sure the base model path matches what you trained on (liuhaotian/llava-llama-2-7b-chat, for example).
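For completeness, here’s roughly what that looks like, assuming your adapter checkpoint lives in output/my_llava_lora and that your LLaVA version’s builder merges LoRA weights when it sees a LoRA-style checkpoint (worth double-checking in llava/model/builder.py):

# sketch: loading a LoRA checkpoint on top of its base model
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path="output/my_llava_lora",
    model_base="liuhaotian/llava-llama-2-7b-chat",
    model_name="llava-llama-2-7b-chat-lora",  # a "lora"-style model_name is what the builder keys off
    device_map="auto"
)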
Example Output
Here’s what I got when I tested a few of my finetuned models on basic image-question pairs:
> Prompt: What breed is this dog?
> Model: This looks like a Golden Retriever.
> Prompt: What is the man doing in the image?
> Model: The man is playing guitar on stage in front of a crowd.
Nothing too fancy — but this is the kind of qualitative pass I always run before running metrics. If the model can’t answer basic prompts cleanly, it’s not ready for harder downstream tasks.
Real-World Tips
- I’ve found do_sample=False (greedy decoding) gives the most stable outputs for evaluation; if you do enable sampling, keep the temperature low (around 0.2).
- Always test with unseen images — using training images is a trap I’ve seen even experienced folks fall into when debugging.
- If the model gives null or short outputs, double-check your image processor and prompt formatting — these are usually the culprits.
10. Fine-Tuning Tips from Experience
“The devil’s not just in the details — it’s usually hiding in the training script.”
I’ve fine-tuned LLaVA enough times now to know where it trips up — and where small tweaks make a big difference. Below are a few field-tested tips I wish someone had handed me earlier. No fluff, just what’s actually helped me save compute, fix bugs, and squeeze better performance out of smaller setups.
Which Layers to LoRA?
If you’re using LoRA (which I usually do unless I have access to obscene amounts of compute), here’s what’s worked for me:
- LoRA on the language model only: I typically apply LoRA to the q_proj, k_proj, and v_proj layers inside the transformer blocks. For LLaVA-7B or 13B, that’s enough to steer the model without touching the vision side.
- Leave the vision tower alone: Personally, I avoid putting LoRA on the CLIP vision encoder unless I’m doing something drastically different from LLaVA’s original setup. It slows down training and — in most cases — the gain isn’t worth it.
You might be wondering: “What if I really need visual grounding improvements?” — In that case, you can unfreeze and LoRA the last 2–3 CLIP layers, but keep it scoped and don’t do it unless necessary.
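If you do go down that road, here’s a rough sketch of what “keep it scoped” means in practice: selectively unfreezing the last few vision-tower blocks by parameter name. The exact module names depend on your LLaVA version, so verify them against model.named_parameters() first.

# sketch: unfreeze only the last few CLIP encoder blocks, keep the rest frozen
UNFREEZE = ["layers.21", "layers.22", "layers.23"]  # last 3 blocks of CLIP ViT-L/14

for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = any(layer in name for layer in UNFREEZE)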
Speeding Things Up (Without Wrecking Performance)
This might surprise you: You can cut down training time by half without sacrificing much, if you:
- Use mixed precision (bf16) – I’ve seen 20–30% speedups on A100s.
- Reduce image resolution – For many datasets, resizing images from 448 to 336 doesn’t hurt as much as you think.
- Start from a checkpoint – If you’re iterating on prompts or training for new tasks, it’s often faster to fine-tune an already fine-tuned LLaVA checkpoint instead of starting from scratch.
Here’s a quick config I used that strikes a good balance:
--fp16 True \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--mm_hidden_size 2048 \
--mm_use_im_start_end True
When to Freeze the Vision Encoder
If your visual domain is similar to the pretraining data (e.g., photos, screenshots, everyday scenes), freeze it.
Personally, I only unfreeze CLIP if:
- My dataset has highly non-standard images (e.g., medical scans, diagrams, satellite).
- I’m fine-tuning on very small data and need to squeeze every ounce of adaptability.
Otherwise, keep the vision encoder frozen — it reduces overfitting and saves a chunk of GPU memory.
Debugging Outputs That Make You Go “Huh?”
This one’s important. Sometimes you’ll get hallucinations like:
Prompt: What’s in the image?
Output: There is a red balloon floating near the ceiling.
[But the image has no balloon.]
Here’s what I do when that happens:
- Check your prompt format. The model is sensitive to the conversation template. I’ve had entire runs go sideways because I forgot to add ### User: and ### Assistant: markers.
- Inspect the image tensor. Use image_tensor.mean() and image_tensor.std() to confirm the image wasn’t corrupted or scaled badly (see the snippet after this list).
- Look at training loss curves. If the loss is too flat early on, something’s not training — double-check that gradients are flowing to the LoRA layers.
- Validate your tokenizer + image processor are aligned with the training phase. A mismatch there will silently ruin your outputs.
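Here’s the kind of throwaway check I run for the tensor and gradient items above. image_tensor and loss are whatever your own step produces, so treat this as a notebook sketch rather than drop-in code:

# 1) image tensor stats: after CLIP normalization, expect roughly zero mean and std near 1
print("image tensor:", image_tensor.mean().item(), image_tensor.std().item())

# 2) gradient flow: after one backward pass, LoRA params should have non-zero grad norms
loss.backward()
for name, param in model.named_parameters():
    if "lora" in name and param.requires_grad:
        grad_norm = 0.0 if param.grad is None else param.grad.norm().item()
        print(name, grad_norm)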
When you’ve done this a few times, these kinds of checks become second nature. But if it’s your first or second LLaVA fine-tune, trust me — these are the bugs you’ll likely hit.
11. Conclusion
So, if you’ve followed everything up to this point — congrats. You’re now sitting on a fine-tuned LLaVA model tailored to your own multimodal dataset. And no, it’s not just “another checkpoint.” You’ve got a system that understands vision-language cues specific to your domain — not just generic COCO-style captions.
I’ve used this setup myself in production-grade use cases, and it’s been a game-changer. Whether it’s understanding dense diagrams, UI screenshots, or product catalog images — once you fine-tune, the model finally gets it.
What’s Next?
Now that your model’s trained and validated, here’s where I usually go from here:
Quantization
Unless you’ve got racks of GPUs lying around, quantizing your model is the next logical step. I’ve had success using:
- bitsandbytes for 4-bit inference on consumer GPUs
- auto-gptq for cleaner quantization-ready outputs (especially if you’re planning to export)
You’ll shave off 60–70% of the memory footprint and still keep respectable accuracy, especially if your prompts are short-form.
Deployment (Triton, ONNX)
For production, I’ve deployed LLaVA through:
- Triton Inference Server – Great if you’re managing scale or want to batch across modalities.
- ONNX Runtime – Bit trickier due to multimodal input, but once you get it right, inference speed is unmatched.
Quick tip: Wrap the preprocessor and model in a simple FastAPI app first — this helps you iron out any quirks before you jump into containerized deployment.
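To make that concrete, here’s a bare-bones sketch. It assumes you’ve already loaded the model as in the inference script from section 9 and wrapped the generate() call in a hypothetical answer_question(image, prompt) helper:

# sketch: minimal FastAPI wrapper around the section 9 inference code
import io
from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image

app = FastAPI()

@app.post("/ask")
async def ask(image: UploadFile = File(...), prompt: str = Form(...)):
    pil_image = Image.open(io.BytesIO(await image.read())).convert("RGB")
    response = answer_question(pil_image, prompt)  # hypothetical wrapper around model.generate()
    return {"response": response}

# run with: uvicorn app:app --host 0.0.0.0 --port 8000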
Pipeline Integration
This might be the part where things get fun. Once your model is ready, you can plug it into:
- Human-in-the-loop tools — Like tagging UI elements or annotating charts
- RAG pipelines — Where the model “sees” a document and answers questions about it
- Voice assistants with vision — I’ve wired mine to take screenshots and ask follow-up questions with GPT-4-style clarity
Honestly, once it’s fine-tuned and fast, the use cases open up quickly.
Let’s call it here — but you and I both know this is just the beginning. If you’ve made it this far, you’ve already done the hardest part. The next steps? Pure creativity.
