How to Fine-tune LLaMA 3 and Export It to Ollama

1. Intro

“The only real way to understand these models is to break them, fine-tune them, and then run them yourself.”

I’ve worked with a lot of open-weight models, but when I started experimenting with LLaMA 3, I realized there wasn’t a solid, no-nonsense guide that covered everything—from fine-tuning to exporting it in a format that actually runs fast on local machines.

That’s what this guide is about. No broad theory, no generic summaries. Just what I did, what worked, and the exact steps you can follow to:

  • Fine-tune LLaMA 3 using LoRA with Hugging Face tools
  • Export it to GGUF format for compatibility with llama.cpp
  • Deploy it locally with Ollama, so you can run it in seconds without spinning up GPUs or relying on the cloud

By the end of this, you’ll have your own fine-tuned LLaMA 3 model running natively on your machine—lightweight, customized, and fully under your control.


2. Pre-Requisites (Only the Essentials)

This isn’t a beginner walkthrough, so I’m skipping the obvious stuff. But there are a few hard requirements I ran into while setting this up.

Hardware

Personally, I tested this on two setups:

  • Local dev box: 1x NVIDIA RTX 4090 (24GB VRAM) — worked well for LoRA fine-tuning with bnb/int8
  • Cloud instance: A100 (40GB+) — useful when training on larger instruction datasets

If you’ve got less than 16GB VRAM, I’d strongly recommend sticking with 4-bit quantization and using LoRA. Full fine-tuning is out of the question without serious hardware.

Software Versions (That Actually Worked)

After a bit of trial and error, here’s the stack that gave me the most stability:

  • transformers: v4.39.1
  • peft: v0.10.0
  • datasets: v2.18.0
  • bitsandbytes: v0.42.0
  • accelerate: v0.27.2
  • torch: v2.2.1 (with CUDA 11.8)

Make sure you install xformers if you want memory-efficient attention—especially on smaller GPUs.

pip install transformers==4.39.1 peft==0.10.0 bitsandbytes==0.42.0 datasets==2.18.0 accelerate xformers

Tip: If you’re running into cryptic CUDA errors, mismatched versions of torch and your local CUDA install are usually the culprit. I had to explicitly install torch with the correct CUDA tag:

pip install torch==2.2.1+cu118 --index-url https://download.pytorch.org/whl/cu118

Model Access (No Skipping This Step)

You’ll need access to LLaMA 3 weights via Meta. When I requested access, the approval took around 48 hours. If you haven’t done this yet:

  1. Head to Meta’s official request page.
  2. Submit your use case details (they care more about safety than your credentials).
  3. Once approved, accept the license on the model’s Hugging Face page; your own Hugging Face access token (or Meta’s direct download link) then unlocks the weights.

After that, you can pull the model via transformers or download it manually and convert it to HF format. I’ll walk through that step in the next section.
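If you’d rather pull the full snapshot up front instead of letting transformers download it lazily, here’s a minimal sketch using huggingface_hub (it comes along with transformers; the local_dir is just my choice):

from huggingface_hub import snapshot_download

# Downloads the gated repo once your access request is approved.
# Assumes you've already run `huggingface-cli login` (or pass token="hf_...").
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    local_dir="llama-3-8b",
)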


3. Environment Setup

“Set your environment up like it’s going to break—because it probably will if you don’t lock things down properly.”

I’ve gone through this enough times to know that a sloppy setup will cost you hours down the line. So let me walk you through the exact environment setup that worked for me—not a general checklist, but what I actually used when fine-tuning LLaMA 3 with LoRA.

Use a Clean Virtual Environment

I used virtualenv, but if you’re more comfortable with conda, go for it. The key is isolation. Don’t install anything globally—you’ll end up chasing ghost dependencies later.

python3 -m venv llama3-env
source llama3-env/bin/activate

Or with conda:

conda create -n llama3-env python=3.10 -y
conda activate llama3-env

Install the Right Packages (And the Right Versions)

Here’s what worked for me after multiple rounds of dependency conflicts:

pip install torch==2.2.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.39.1 peft==0.10.0 bitsandbytes==0.42.0 datasets==2.18.0 accelerate xformers

Note: xformers isn’t optional if you’re on lower VRAM. I found it made a noticeable difference in memory usage when attention layers kicked in.

Also, install scipy early. I’ve seen bitsandbytes fail silently if it’s missing.

pip install scipy

CUDA Compatibility — Don’t Ignore This

I’ve personally tested this setup with:

  • CUDA: 11.8
  • cuDNN: 8.9
  • GPU drivers: 535+

If you mismatch torch and CUDA, things will break in weird ways—usually during forward passes. You’ll see errors like:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED
That’s almost always a version issue.

Here’s how I double-check compatibility:

nvcc --version
nvidia-smi

Then compare that against PyTorch’s compatibility table. I recommend locking all versioning in a requirements.txt or even better—freezing with pip freeze once it’s stable.
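Beyond the CLI checks, here’s the quick sanity check I’d run from inside Python before touching any training code:

import torch

# Confirm torch was built against the CUDA you think it was
print(torch.__version__)              # e.g. 2.2.1+cu118
print(torch.version.cuda)             # should report 11.8
print(torch.cuda.is_available())      # must be True
print(torch.cuda.get_device_name(0))  # your actual GPU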

Optional but Useful: Docker Setup

If you’re deploying this repeatedly or collaborating across machines, a Dockerfile will save your sanity. Here’s one I put together for LLaMA 3 + LoRA:

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip git && \
    pip3 install --upgrade pip

RUN pip install torch==2.2.1+cu118 --index-url https://download.pytorch.org/whl/cu118
RUN pip install transformers==4.39.1 peft==0.10.0 bitsandbytes==0.42.0 datasets==2.18.0 accelerate xformers scipy

WORKDIR /workspace

I keep all my model files and scripts mounted into /workspace so I don’t rebuild every time I update a training script.


4. Getting the LLaMA 3 Weights

“Getting the model itself is probably the only part you can’t hack your way through.”

Here’s the deal: Meta still gates LLaMA 3 access, and there’s no legal workaround. If you haven’t done it yet, go to Meta’s request form and submit your use case. I had to fill in organization details, but my approval came within two days.

Once you have access, you’ll typically get a Hugging Face link. If you’re already set up with the huggingface-cli, it’s even easier to pull down.

Authenticate with Hugging Face

If your model access is tied to your Hugging Face account, log in like this:

huggingface-cli login

After that, you can use the transformers interface to load the model directly:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

Folder Structure You’ll Need Later

When you download the model, make sure your directory looks something like this. Trust me—if you’re exporting to GGUF or fine-tuning with PEFT, a clean structure saves a lot of confusion.

llama-3-8b/
├── config.json
├── generation_config.json
├── tokenizer_config.json
├── tokenizer.json
├── special_tokens_map.json
├── model-00001-of-00004.safetensors
├── ...
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json   # (a manual conversion may give you pytorch_model-*.bin shards instead)

This will become your model_path in the training and export scripts later.

Optional: Manual Conversion

In my case, I downloaded the raw weights and used Meta’s conversion script to get it into Hugging Face format. If you’re doing the same, be aware: the conversion script expects a very specific folder hierarchy, and tokenizers can silently mismatch if you’re not careful. I’ll go over those edge cases in the troubleshooting section later on.


5. Dataset Preparation (Advanced Use)

“Garbage in, garbage out” has never been more literal when fine-tuning LLMs.

I’ve tried a few different dataset formats when tuning LLaMA models, but what worked best for me—especially with instruction tuning using LoRA—was a clean instruction–input–output JSONL format, similar to the Alpaca or OpenAssistant structure.

Dataset Format I Used

Here’s the basic format I’ve personally used, and it works well with Hugging Face tokenizers + LoRA-based supervised fine-tuning (SFT):

{"instruction": "Translate to French", "input": "I love apples", "output": "J'aime les pommes"}

Stick to JSONL or datasets.DatasetDict if you’re planning to use Hugging Face’s datasets library.

Important: Always include both instruction and input. LLaMA models don’t inherently know what part of the text is a task vs. a user message unless you structure the prompt cleanly.

Preprocessing Script: What I Actually Use

This is a simplified version of the tokenizer + formatting script I built. It handles prompt structuring, truncation, and tokenization:

from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA 3 ships no pad token; padding="max_length" below fails without this

def format_prompt(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt}

def tokenize(sample):
    return tokenizer(sample["text"], truncation=True, padding="max_length", max_length=1024)

dataset = load_dataset("json", data_files="your_dataset.jsonl", split="train")
dataset = dataset.map(format_prompt)
dataset = dataset.map(tokenize, batched=True)

Splitting the Dataset — No Randomness

You want deterministic splits, especially for repeatable experiments. Here’s what I used:

split_dataset = dataset.train_test_split(test_size=0.05, seed=42)
train_data = split_dataset["train"]
eval_data = split_dataset["test"]

I’ve made the mistake of using random splits without a seed and ended up chasing performance differences I couldn’t explain later. Lock it down.


6. Fine-Tuning LLaMA 3 with LoRA (Minimal RAM/GPU Footprint)

“You don’t need a DGX box to train LLMs—just a focused config and LoRA done right.”

I’ll say this upfront: unless you’ve got 8 A100s lying around (and if you do, good for you), full fine-tuning of LLaMA 3 isn’t worth the cost. That’s why I stuck with LoRA (Low-Rank Adaptation). It gave me solid results without melting my local GPU.

Why LoRA?

LoRA trains only a small set of parameters by injecting trainable rank-decomposed matrices into the model. It’s fast, memory-efficient, and most importantly—it actually works if you tune it right.

I personally fine-tuned LLaMA 3 (8B) with LoRA on a single 24GB 4090 using 4-bit quantized weights and never hit OOM.
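For reference, this is the explicit spelling of that 4-bit load via BitsAndBytesConfig. The code later in this section uses the shorter load_in_4bit=True flag; nf4 and double quantization here are just my usual defaults, not something the rest of the guide depends on.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit 4-bit setup: nf4 quantization, bf16 compute, double quantization to shave a bit more VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)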

LoRA Config That Worked for Me

Here’s a LoRA config I settled on after a few rounds of experimentation:

lora_config = {
    "r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ["q_proj", "v_proj"],
    "task_type": "CAUSAL_LM"
}

You might be tempted to LoRA every layer, but I found that targeting just q_proj and v_proj layers gave me the best trade-off between performance and speed.

Setting Up the PEFT Model

This is what I used with Hugging Face’s peft and transformers libraries:

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    load_in_4bit=True,
    device_map="auto",
    torch_dtype="auto"
)

base_model = prepare_model_for_kbit_training(base_model)

peft_config = LoraConfig(**lora_config)
model = get_peft_model(base_model, peft_config)
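Before kicking off training, I like to print how little is actually trainable. This is a built-in peft method on the wrapped model:

model.print_trainable_parameters()
# Reports trainable vs. total parameter counts; with q_proj/v_proj only, it's a tiny fraction of the 8B total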

Training Script (Modular, Minimal)

This isn’t a bloated training pipeline. I used Hugging Face’s Trainer with gradient checkpointing, mixed precision, and warmup:

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    logging_steps=20,
    save_steps=100,
    save_total_limit=2,
    num_train_epochs=3,
    learning_rate=2e-4,
    warmup_steps=50,
    fp16=True,
    logging_dir="./logs",
    output_dir="./output"
)

from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    tokenizer=tokenizer,
    # Causal-LM collator: copies input_ids into labels so the Trainer actually gets a loss to optimize
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()
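The Trainer drops checkpoints under output_dir on its own, but I also save the final adapter and tokenizer explicitly so the evaluation and merge sections later have stable paths to load from (the directory names here are simply the ones those sections use):

# Save the final LoRA adapter + tokenizer; the adapter files are tiny compared to the 8B base
trainer.save_model("./output")
tokenizer.save_pretrained("./output")
model.save_pretrained("./output/checkpoint-lora")   # same adapter, at the path the merge step loads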

Optional: Deepspeed for Multi-GPU

If you’re training on multiple GPUs, I’d recommend this Deepspeed config (ds_config.json):

{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  },
  "train_batch_size": 16,
  "gradient_accumulation_steps": 4
}

Add this to your TrainingArguments:

deepspeed="ds_config.json"

7. Evaluating the Fine-tuned Model

“The model’s only as good as the questions you throw at it.”

When I reached this stage, I didn’t want to just rely on metrics that look good on paper but fall apart in the real world. I evaluated the fine-tuned LLaMA 3 model by throwing actual use-case prompts at it—things like data science FAQs, technical reasoning tasks, and some open-domain generation.

Inference Script I Used

Before worrying about ROUGE or BLEU, I ran this script to get a feel for the model’s output. The goal? Gut-check it for tone, structure, hallucination, and instruction alignment.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./output")
model = AutoModelForCausalLM.from_pretrained("./output", device_map="auto")

prompt = """### Instruction:
Translate this to German

### Input:
I like solving hard problems

### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This kind of hands-on testing gave me a much better feel than staring at evaluation metrics alone.

Metric-Based Evaluation

That said, I did run metrics to make the evaluation reproducible. For that, I used a stripped-down version of BLEU and ROUGE scoring via evaluate. Here’s how I ran it:

from evaluate import load
from datasets import load_dataset

bleu = load("bleu")
rouge = load("rouge")

dataset = load_dataset("json", data_files="eval_data.jsonl")["train"]

predictions = []
references = []

for item in dataset:
    prompt = f"### Instruction:\n{item['instruction']}\n\n### Input:\n{item['input']}\n\n### Response:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=100)
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    predictions.append(decoded.strip())
    references.append(item["output"].strip())

bleu_score = bleu.compute(predictions=predictions, references=[[ref] for ref in references])
rouge_score = rouge.compute(predictions=predictions, references=references)

print("BLEU:", bleu_score)
print("ROUGE:", rouge_score)

One tip from my experience: don’t blindly trust BLEU for generative tasks. It’s sensitive to phrasing. I found ROUGE-L and manual reviews far more helpful.

Optional: Logging with WandB

If you’re planning to track multiple experiments (which I usually do), WandB integrates cleanly with Hugging Face Trainer. All you need is this:

from transformers.integrations import WandbCallback

trainer.add_callback(WandbCallback)

Then log in with:

wandb login

Done. Now you’ve got visual comparisons across runs.


8. Merging LoRA Weights (If Necessary)

“Training with LoRA is great. Deploying with LoRA? Not always ideal.”

Once you’re happy with the model, there’s a high chance you’ll want to merge the LoRA adapters back into the base weights. Why? Running inference with the merged model is lighter, simpler, and more portable—especially if you plan to quantize it to GGUF later for Ollama.

Personally, I only merge once I’ve finalized everything. During development, I keep the LoRA layers separate so I can experiment without polluting the base model.

When Should You Merge?

Here’s the deal:

  • ✅ If you’re deploying the model or quantizing it → merge.
  • ❌ If you’re still iterating or fine-tuning → don’t merge yet.

Code to Merge LoRA into Base Model

Here’s how I merged the LoRA weights:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base in fp16 (not 4-bit) so the merge produces full-precision weights for the GGUF export
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base_model, "./output/checkpoint-lora")
model = model.merge_and_unload()

model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

Once merged, the model behaves like a regular Hugging Face AutoModelForCausalLM. No PEFT dependency required.

Heads-up: After merging, make sure to test the output again. I’ve seen edge cases where generation changes slightly due to numerical precision shifts.
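Here’s the quick post-merge check I’d run, reusing the same prompt format as the eval section; note there is no peft import anywhere:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the merged folder the way a downstream user would
tokenizer = AutoTokenizer.from_pretrained("./merged_model")
model = AutoModelForCausalLM.from_pretrained("./merged_model", device_map="auto")

prompt = "### Instruction:\nTranslate this to German\n\n### Input:\nI like solving hard problems\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))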

Model Size After Merge

This is something that tripped me up the first time: merging doesn’t make the model smaller. It just folds the LoRA weights back into the full model. If you want smaller files, you’ll need to quantize it to 4-bit (more on that in the Ollama section).


9. Exporting to GGUF Format (Ollama-Compatible)

“Compatibility is a hidden tax. If you’re deploying LLMs locally, GGUF is the currency that actually works.”

Let me be honest—when I first tried getting LLaMA 3 to work with Ollama, the most confusing part wasn’t fine-tuning or LoRA. It was figuring out this whole GGUF format. If you haven’t dealt with it before, think of it as a streamlined, quantized, inference-friendly format used by llama.cpp (which Ollama relies on under the hood).

Why GGUF?

You might be wondering: Why not just stick with the Hugging Face format? Well, Ollama doesn’t speak Transformers natively. It runs on llama.cpp, and GGUF is what it expects—fully packed and optimized.

Conversion: From HF to GGUF

Conversion is a two-step process with llama.cpp. First, convert the merged FP16 weights to an FP16 GGUF with the repo’s conversion script (convert-hf-to-gguf.py in current checkouts; older ones ship it as convert.py). Then quantize that file down to the level you want with the quantize binary.

python3 convert-hf-to-gguf.py ./merged_model \
  --outfile llama-3-gguf/llama3-f16.gguf \
  --outtype f16

./quantize llama-3-gguf/llama3-f16.gguf llama-3-gguf/llama3-q4_K_M.gguf Q4_K_M

  • The quantization level is chosen in the second step. Personally, I found q4_K_M to be the best balance of speed and quality for local inference.
  • The conversion script reads the tokenizer straight out of the model directory, so make sure the tokenizer files in ./merged_model are the ones you actually fine-tuned with.

Pro tip: LLaMA 3 uses a BPE tokenizer (tokenizer.json), not a SentencePiece tokenizer.model. Older llama.cpp builds only understood .model files, so use a reasonably recent checkout (where the quantize binary is also named llama-quantize).

What You Need for Ollama

Once the conversion’s done, the GGUF file is the only artifact Ollama strictly needs: the tokenizer and model metadata are baked into it during conversion. I still keep the Hugging Face tokenizer and config next to it for debugging, so my folder looks like this:

llama-3-gguf/
├── llama3-q4_K_M.gguf         # Core model (tokenizer + metadata embedded)
├── tokenizer.json             # Kept for reference/debugging
├── config.json                # Kept for reference/debugging

The .gguf file is what the Modelfile points at when you run ollama create in the next section.
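Before handing it to Ollama, I sometimes sanity-check the GGUF directly with llama-cpp-python (an extra pip install llama-cpp-python, not something the rest of this guide needs). A minimal sketch:

from llama_cpp import Llama

# Load the quantized GGUF straight off disk; if this works, Ollama will be happy with it too
llm = Llama(model_path="llama-3-gguf/llama3-q4_K_M.gguf", n_ctx=2048)
out = llm("### Instruction:\nSay hello in French.\n\n### Response:\n", max_tokens=32)
print(out["choices"][0]["text"])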


10. Creating a Custom Model in Ollama

“Think of the Modelfile like a Dockerfile—for LLMs.”

Once you’ve got your GGUF model ready, the next step is wrapping it into a custom Ollama model. I’ll walk you through exactly how I set mine up.

Folder Structure Ollama Expects

Here’s how I organize the custom model directory:

my-llama3-model/
├── Modelfile
├── llama3-q4_K_M.gguf
├── tokenizer.json
├── config.json

Pretty barebones—but it has everything Ollama needs.

Writing the Modelfile (Annotated)

Here’s the actual Modelfile I used, with inline notes:

# Point at the local GGUF file. (Don't use "FROM llama3" here; that would pull
# the stock model from the Ollama registry instead of your fine-tune.)
FROM ./llama3-q4_K_M.gguf

# If you want to modify system prompts or templates, do it here
TEMPLATE """{{ .Prompt }}"""

You can tweak the TEMPLATE section if you want custom system prompts or chat formatting. I kept mine minimal to avoid side effects. Note that the model’s name isn’t set in the Modelfile at all; it comes from the ollama create command below.

Building the Model

This part’s straightforward. Run the following from the folder containing the Modelfile:

ollama create my-llama3 -f Modelfile

It’ll build and register the model under that name locally.

Running and Testing

You can now test it interactively:

ollama run my-llama3

Or programmatically using a curl request or Python client:

curl http://localhost:11434/api/generate -d '{
  "model": "my-llama3",
  "prompt": "Explain how LoRA works in simple terms."
}'

Personally, I used this to run batch evals and side-by-side comparisons with other models I had on Ollama.


11. Deployment + Usage Tips

“A model that only runs in a notebook is a science experiment. You want it in production? Make it usable—fast, light, and local.”

Now that the model’s fine-tuned, quantized, and wrapped in Ollama, let’s talk deployment. I’ve run LLaMA 3 models on both high-end GPUs and modest machines, and let me tell you—local inference is absolutely doable if you’re smart about it.

Running Inference with Ollama (Locally)

Once you’ve done the ollama create, inference is dead simple. Here’s a Python request I often use when testing response behavior after fine-tuning:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-llama3",
        "prompt": "Write a short tweet about LoRA fine-tuning.",
        "stream": False
    }
)
print(response.json()["response"])

Or if you’re into curl, just:

curl http://localhost:11434/api/generate -d '{
  "model": "my-llama3",
  "prompt": "Explain gradient checkpointing in 2 lines."
}'

These scripts are my go-to when validating prompt formatting, especially right after fine-tuning.

Performance Benchmarking (What Actually Matters)

Let me be real here: most people just look at whether the model runs. That’s not enough. I wanted low latency and stable memory consumption. Here’s how I track that:

  • Latency: Add timestamps in your inference script (see the timing snippet below).
  • Memory usage: I use nvtop and htop while running live inference. You can easily spot memory spikes or leaks.
  • Ollama metrics (CLI):
ollama run --verbose my-llama3

This prints timing stats after each response: prompt eval rate and generation speed in tokens/sec.

With q4_K_M, I consistently hit ~20-30 tokens/sec on a 3090 and stay under 8GB VRAM.
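For latency, this is the kind of throwaway timing loop I mean (same local endpoint as above, nothing fancy):

import time
import requests

# Rough wall-clock latency per request against the local Ollama endpoint
for prompt in ["Explain LoRA in one sentence.", "What is GGUF?"]:
    start = time.perf_counter()
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "my-llama3", "prompt": prompt, "stream": False},
    )
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.2f}s  ->  {r.json()['response'][:60]}")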

Hosting Ollama via REST (Behind a Proxy or Not)

You can absolutely put Ollama behind a lightweight FastAPI wrapper if you’re integrating it into a bigger app. But the built-in REST endpoint at localhost:11434 is solid for local experimentation.
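If you do go the wrapper route, this is roughly the shape of it. FastAPI and uvicorn are extra dependencies here, and the route name is just my choice:

import requests
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    # Forward to the local Ollama REST endpoint and return just the text
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "my-llama3", "prompt": req.prompt, "stream": False},
    )
    return {"response": r.json()["response"]}

# Run with: uvicorn app:app --port 8000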

Prompting Tips (Post-Fine-tuning)

Here’s what I learned the hard way: prompt formatting makes or breaks your model performance.

What works best for me post-finetune:

### Instruction:
Summarize the following paragraph in 2 sentences.

### Input:
{Your text here}

### Response:

This consistent formatting stabilizes outputs. I usually encode this directly into the tokenizer preprocessing script as a static template.
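Concretely, "encode it as a static template" just means something like this on the inference side (the helper name is mine):

def build_prompt(instruction: str, input_text: str = "") -> str:
    # Mirror the exact format used during fine-tuning; leave Response empty so the model fills it in
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    return prompt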


12. Troubleshooting & Edge Cases (Very Important)

“Your code will break. It’s how you debug it that defines your workflow.”

This section is for the stuff that actually burned me—and how I worked around it.

CUDA OOM Errors

If you’re seeing this:

CUDA out of memory. Tried to allocate ...

…don’t just bump batch size down. I fixed this by:

  • Reducing r and alpha in LoRA config.
  • Enabling gradient checkpointing (gradient_checkpointing=True) during training.
  • And forcing torch_dtype=torch.bfloat16 where supported.
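Putting those three knobs together, the deltas look roughly like this; the values are illustrative, not a full rerun of the training script:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig

# Smaller rank/alpha -> fewer trainable params and less optimizer state
lora_config = LoraConfig(r=16, lora_alpha=8, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

# bfloat16 at load time, where the GPU supports it
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    load_in_4bit=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Gradient checkpointing trades a little compute for a big drop in activation memory
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
)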

Also, ensure nothing else is hogging GPU memory (check with nvidia-smi).

LoRA Mismatch During Load

One of the nastiest bugs I hit:

RuntimeError: Mismatched LoRA configuration...

This usually happens when the target modules don’t match between training and inference. Make sure your target_modules list is identical in both training and loading scripts.

GGUF Conversion Issues

One common pitfall: the tokenizer doesn’t match during conversion. llama.cpp expects a specific tokenizer structure.

  • LLaMA 3 uses a BPE tokenizer (tokenizer.json), not SentencePiece. If the conversion script complains about an unsupported or missing vocab, update your llama.cpp checkout rather than forcing a SentencePiece vocab type, which won’t match.

Also, check the tokenizer.json aligns with the one used during fine-tuning, or you’ll get garbage outputs.

Tokenizer Alignment Hell

This one cost me hours. My fine-tuned model worked fine with Hugging Face but produced nonsense in Ollama.

The fix? Copy tokenizer_config.json, tokenizer.json, and special_tokens_map.json from your fine-tuning run into the merged model folder before GGUF conversion, so the converter embeds the tokenizer you actually trained with. The defaults left over from the base model often cause silent mismatches.

Version & Reproducibility Nightmares

What bit me once:

  • PyTorch 2.0 vs 2.1 minor API shifts
  • peft breaking changes after v0.6
  • transformers tokenizer format changing silently

What I do now: I freeze everything via requirements.txt with exact versions. Here’s a snippet from mine:

transformers==4.39.1
peft==0.10.0
torch==2.2.1+cu118
bitsandbytes==0.42.0
datasets==2.18.0
accelerate==0.27.2

Trust me—future you will thank you.


Conclusion: What Actually Matters When Fine-tuning LLaMA 3

“Shipping beats perfection. But clean code, reproducibility, and model behavior? That’s what defines a solid fine-tune.”

If there’s one thing I’ve learned working with LLaMA 3 — especially across LoRA, GGUF, and Ollama — it’s this: tools and frameworks evolve, but your workflow should be predictable, auditable, and modular. I didn’t just follow tutorials. I broke things. I rebuilt pipelines. I’ve run into obscure bugs that only show up on a Sunday night when a dependency silently updates.

And through all of it, here’s what I’d recommend:

  • Own your stack — from virtualenv to GPU compatibility, don’t blindly install.
  • Make dataset prep deterministic — randomness at this stage will bite you during eval.
  • Treat LoRA like a plugin, not a shortcut — minimal GPU footprint doesn’t mean minimal responsibility.
  • Track everything — metrics, configs, weights. You’ll thank yourself a week later.
  • Build locally, deploy locally — Ollama makes that easier than ever, and you stay in control.

And maybe most importantly? Don’t chase perfection. Ship something reproducible first. Then iterate.

Leave a Comment