Fine-Tuning TinyLlama

1. Why TinyLlama?

“Sometimes, smaller isn’t just faster — it’s smarter.”

I’ve fine-tuned a bunch of models over the past few months — Mistral, Phi, even the newer LLaMA variants. But when I stumbled upon TinyLlama, it hit a sweet spot I didn’t expect. If you’re working with constrained resources — say, a single A100 or even a decently powered Colab Pro instance — this thing flies.

Why did I go for TinyLlama?

Here’s the deal: for one of my projects — a lightweight code-generation chatbot that needed to run locally with low latency — I needed something fast, memory-efficient, and still good enough to generate decent completions. TinyLlama delivered.

Mistral is solid, but it’s heavy. Phi is more focused on reasoning and long context. TinyStories is just… tiny in more ways than one. TinyLlama? It’s got that 1.1B sweet spot: not too small to be dumb, not too large to be clunky.

Trade-offs I’ve run into:

  • Speed: Lightning fast on inference, especially with 4-bit quant. Latency feels almost real-time.
  • Capacity: It can struggle with multi-turn reasoning or long-form answers — you’ll feel it.
  • Memory usage: TinyLlama can load in 4-bit on a 16GB GPU with room to spare for PEFT training.
  • Fine-tuning time: Quick. I’ve seen epoch times that would take hours with larger models drop to minutes here.

If you’re building tools that don’t need LLaMA-2-grade reasoning, but still want something more useful than GPT-2 or distilled models, TinyLlama is probably what you’re looking for.


2. Prerequisites: What You Need Before Starting

Let me save you some trial-and-error here — I’ve broken enough training scripts to know what actually works with TinyLlama today.

Hardware Setup

I personally ran most of my fine-tuning experiments on a single A100 (80GB), but I’ve also tested this on Colab Pro with T4s and even a local RTX 4090 (24GB). As long as you keep your batch size low and stick with LoRA + 4-bit quantization, you’ll be fine.

Minimum working setup I’ve tested: Colab Pro with T4 GPU + 4-bit quant + PEFT + gradient accumulation = ✅

If you’re going full 16-bit or plan to fine-tune without LoRA (why though?), you’ll need more memory.

Framework & Library Versions

These versions actually work as of April 2025. Anything older tends to break silently or throw weird dtype mismatches:

transformers==4.39.3
datasets==2.18.0
peft==0.10.0
accelerate==0.29.2
bitsandbytes==0.42.0

I strongly recommend pinning these in your environment file. And make sure triton is up-to-date if you’re using 4-bit quantization.

Dataset Format That Worked for Me

For instruction tuning, I used a JSONL format like this:

{"text": "### Instruction: Write a Python function to reverse a list.\n### Response: def reverse_list(lst): return lst[::-1]"}

You don’t need anything fancy — just ensure it’s a clean string under a "text" field and that your tokenizer can handle it without choking. I ran a quick map() after tokenization to check for empty outputs or overly long samples.

Pro tip: keep your max sequence length under 512 for TinyLlama unless you really need long context. It doesn’t handle 2K tokens gracefully like Mistral does.


3. Loading TinyLlama the Right Way

“The model isn’t slow — you’re just loading it wrong.”

I’ll be honest, the first time I loaded TinyLlama, I tanked my Colab runtime. GPU OOM’d in seconds, and I hadn’t even touched a dataset yet. Turns out, how you load the model matters more than people give it credit for — especially with smaller models like this where your memory budget is tight, but not bottomless.

So here’s what’s actually worked for me.

Load it with the right precision and device map

Use torch_dtype=torch.float16 and device_map="auto" from day one. That combo alone took my memory usage down by nearly 30% on a 24GB VRAM setup.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # crucial for avoiding OOM
    device_map="auto"           # smart placement on multi-GPU or single GPU
)

Quick tip: If you’re on CPU or don’t have CUDA set up properly, device_map="auto" can silently do weird things. Always check model.hf_device_map after loading.
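If you want a quick sanity check right after loading, something like this (a minimal sketch) prints where the weights landed, which dtype they are actually in, and a rough memory footprint:

# Sanity-check placement and precision right after loading
print(model.hf_device_map)                       # e.g. {'': 0} when everything fits on one GPU
print(next(model.parameters()).dtype)            # should be torch.float16 if torch_dtype was respected
print(f"{model.get_memory_footprint() / 1e9:.2f} GB of weights")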

Adapter or full model? Here’s the difference.

If you’re using PEFT or LoRA (which, let’s be real, you probably are), you’ll need to decide whether to load the full base model or attach adapters. Here’s what that looks like.

Without Adapters (Plain base model):

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

With LoRA Adapters:

Assuming you’ve fine-tuned and saved adapters somewhere:

from peft import PeftModel

# Load base model first
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Attach LoRA adapter
model = PeftModel.from_pretrained(base_model, "path/to/adapter")

Personally, I always separate my base model load and adapter attach steps. It gives me more flexibility when testing across different LoRA variants.
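For what it's worth, keeping those two steps separate also lets you hot-swap LoRA variants on the same base model. Roughly like this (the adapter paths and names here are placeholders):

from peft import PeftModel

# Attach the first LoRA variant under an explicit name
model = PeftModel.from_pretrained(base_model, "path/to/adapter-a", adapter_name="variant_a")

# Load a second variant alongside it, then switch the active adapter
model.load_adapter("path/to/adapter-b", adapter_name="variant_b")
model.set_adapter("variant_b")  # the active adapter is the one used for generation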

This might surprise you:

If you skip torch_dtype or let it default to float32, you’ll hit OOM errors even on an A100 when training with batch sizes above 1. Not because TinyLlama is big — but because your tensors are unnecessarily fat.

I’ve also found that mixing float16 with 4-bit quantized LoRA adapters gives a crazy-good tradeoff between memory usage and quality. I’ll cover the 4-bit setup later, but if you’re already comfortable with bitsandbytes, you’re gonna like that section.


4. Dataset Preparation: What Actually Works for TinyLlama

“Garbage in, garbage out — and when you’re working with tiny models, even slightly messy input is enough to tank performance.”

I’ve run instruction tuning and pure causal pretraining on TinyLlama, and I can tell you — the format really matters. This model doesn’t have the extra capacity to “figure things out” from loose or unstructured text. You’ve got to spoon-feed it.

Let me show you what worked for me.

Real-World Example: Instruction-Tuning Format

For my chatbot use case, I stuck to a clean format like this:

{
  "text": "### Instruction: Write a Python function to check for prime numbers.\n### Response: def is_prime(n):\n    if n <= 1: return False\n    for i in range(2, int(n ** 0.5) + 1):\n        if n % i == 0: return False\n    return True"
}

You might be wondering: Why not use ChatML or Alpaca format? I tried — but TinyLlama doesn’t come pretrained on that style, and results were worse. This plain ### Instruction: / ### Response: format gave me noticeably better outputs.
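If your raw data lives in separate instruction/response fields rather than a single string, a small helper keeps the formatting consistent before tokenization. A sketch (the instruction and response field names are placeholders for whatever your dataset actually uses):

def to_prompt(example):
    # Build the plain ### Instruction: / ### Response: string shown above
    example["text"] = (
        f"### Instruction: {example['instruction']}\n"
        f"### Response: {example['response']}"
    )
    return example

dataset = dataset.map(to_prompt)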

Tokenization That Doesn’t Suck

This part is critical — and easy to screw up.

from datasets import load_dataset

dataset = load_dataset("your/custom-dataset")

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

A few practical things I learned the hard way:

  • Truncation is essential. Without it, you’ll hit shape mismatches during training when some examples go over max length. I default to max_length=512 for TinyLlama — it performs noticeably worse above that.
  • Set tokenizer.padding_side = "right" for causal-LM training. Left padding mainly matters when you’re batching prompts for generation at inference time; for single-turn instruction tuning, right padding is cleaner.
  • Watch out for empty token outputs. I ran a simple len(input_ids) == 0 filter pass after tokenization (sketched just below this list) — it caught a few bad apples in scraped data that would’ve silently broken batching later.
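Here's roughly what that filter pass looked like for me. A sketch; adapt the check to whatever junk shows up in your data:

# Drop anything that tokenized to nothing; empty samples silently break batching later
def keep(example):
    return len(example["input_ids"]) > 0

tokenized = tokenized.filter(keep)
print({split: len(ds) for split, ds in tokenized.items()})  # quick count per split after filtering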

Pre-Tokenized Caching = Sanity

If you’re iterating fast, save yourself hours by caching tokenized datasets to disk:

tokenized.save_to_disk("data/tokenized-tinyllama")

Then just reload like this whenever you restart:

from datasets import load_from_disk
dataset = load_from_disk("data/tokenized-tinyllama")

Here’s the deal: If you’re running multiple training experiments (different LoRA configs, dataset slices, etc.), this step alone saves you from re-tokenizing 100k+ samples every single time.
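One small habit that compounds nicely: bake the tokenizer settings and dataset slice into the cache path, so different experiments never pick up the wrong cache. A sketch, where MAX_LEN and SLICE are hypothetical knobs you'd set per experiment:

import os
from datasets import load_from_disk

MAX_LEN = 512      # hypothetical: whatever max_length you tokenize with
SLICE = "full"     # hypothetical: e.g. "full", "50k", "code-only"
cache_dir = f"data/tokenized-tinyllama-{SLICE}-len{MAX_LEN}"

if os.path.isdir(cache_dir):
    tokenized = load_from_disk(cache_dir)
else:
    tokenized = dataset.map(tokenize, batched=True)   # tokenize() from the section above
    tokenized.save_to_disk(cache_dir)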


5. PEFT + LoRA: The Only Way to Fine-Tune This Model Efficiently

“If you’re fine-tuning TinyLlama without PEFT, you’re basically fighting a dragon with a teaspoon.”

That might sound dramatic, but I learned it the hard way. When I tried full fine-tuning the first time (just to test the waters), it ran out of memory faster than I could say CUDA out of memory. That’s where LoRA with PEFT comes in — not as an optimization, but as a necessity.

Why LoRA Has to Be Used

TinyLlama might be small compared to 7B+ models, but when you’re working with 4-bit quantized weights on a single A100 or even 24GB local setups, you quickly hit memory ceilings. Even running a short training loop with AdamW and full gradients was a struggle.

I personally went with LoRA + 4-bit using the bitsandbytes library. Why? Because it let me:

  • Train on 100k+ examples with minimal memory usage
  • Iterate fast — no hour-long warmups or checkpoint overhead
  • Avoid modifying the entire model — just a few attention heads

The PEFT + LoRA Config That Worked Best for Me

You might be wondering: What LoRA config actually gives results instead of just running? Here’s what I used — and why:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

Why These Settings?

Let me break it down:

  • r=16 and alpha=32 felt like the sweet spot. I ran ablations with r=8 and r=32, and r=16 gave me solid generalization without exploding GPU usage.
  • Targeting q_proj and v_proj instead of all_linear kept training focused — I didn’t need to LoRA every dense layer.
  • dropout=0.05 gave me stability without requiring label smoothing or extra regularization tricks.

I tested all this using a small dataset of ~50k instruction pairs before scaling to a larger run — and I recommend you do the same. It’ll save you hours.
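One quick check worth running right after get_peft_model: PEFT can report how many parameters are actually trainable, which confirms the config targeted what you think it did.

# Should report only a small fraction of a percent of the 1.1B parameters as trainable
model.print_trainable_parameters()

If that number looks anywhere near the full model, something in target_modules is off.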

Quick Note on BitsAndBytes

This might surprise you: Not all bnb_config settings are equal.

For example, setting bnb_4bit_quant_type="nf4" instead of "fp4" gave me much better stability. I also avoided int8 because it ended up using more memory with worse convergence in my case.


6. Training Loop with transformers Trainer (Or What Actually Works)

“The best training setup isn’t the most powerful — it’s the one that finishes without crashing and gives you checkpoints that mean something.”

Let me tell you what I learned fine-tuning TinyLlama with LoRA and transformers.Trainer: small tweaks save hours. I’m talking gradient accumulation, logging steps, and knowing exactly when to save checkpoints. I didn’t land on this config overnight — I burned through a few runs (and a couple coffees) before dialing in something stable, efficient, and reproducible.

My TrainingArguments Setup for Fast & Cheap Fine-Tuning

I was working on a single A100 (you could pull this off with 24–40GB cards too), and the goal was simple: finish an epoch in under an hour without wasting memory on logging or checkpoint clutter.

Here’s the setup that worked:

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=200,
    save_total_limit=2,
    fp16=True,
    bf16=False,  # Only turn this on if your GPU supports BF16
    gradient_checkpointing=True,
    output_dir="./tinyllama-lora-finetuned",
    report_to="none",  # Avoid logging to W&B/HuggingFace unless needed
    logging_first_step=True,
    logging_dir="./logs"
)

# Causal-LM labels come from shifting input_ids; the collator builds them and pads each batch
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # make sure a pad token exists
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=data_collator
)

trainer.train()

What Actually Saved Me Hours

Let’s break down what wasn’t obvious but made a big difference:

  • gradient_checkpointing=True: This was a game-changer. I was able to bump my batch size by ~1.5x with it on. Memory footprint dropped instantly.
  • gradient_accumulation_steps=4: Instead of using a larger batch size (which my GPU couldn’t handle anyway), I simulated one.
  • fp16=True: Not optional. If you’re not using mixed precision, you’re probably bottlenecking.
  • save_steps=200: I used to save every 500–1000 steps — big mistake. Saving more frequently with save_total_limit=2 gave me better control during resume, especially for long jobs.

Checkpoint Strategy That Didn’t Burn Me

This might surprise you: even with Trainer, you’re not protected unless your resume logic is clean. Here’s what I did:

  1. Checkpoint folder structure — I always used the default output_dir/checkpoint-* pattern.
  2. Explicit resume — On restart, I’d always pass the last checkpoint path manually:
trainer.train(resume_from_checkpoint="./tinyllama-lora-finetuned/checkpoint-1200")

  3. Avoid eval steps — During LoRA fine-tuning, I didn’t care about eval_dataset. It just slowed things down, and early evaluation on small batches gave misleading metrics.

One Final Tip

If you’re using Trainer, always test your config on 100 samples first. I built a script that slices the dataset, runs one epoch, and checks for crashes/logging/saving behavior. Doing that up front saved me hours later when I was scaling up.
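Here's the shape of that smoke test, more or less. It slices off 100 samples, caps the steps, and reuses the same knobs so you exercise logging and checkpointing before the real run (names like smoke_args are mine, and data_collator is the one defined above):

# Tiny dry run: 100 samples, a handful of steps, same precision and batching as the real job
smoke_ds = tokenized["train"].select(range(100))

smoke_args = TrainingArguments(
    output_dir="./smoke-test",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=10,        # just enough to hit logging and at least one save
    logging_steps=1,
    save_steps=5,
    fp16=True,
    report_to="none",
)

Trainer(model=model, args=smoke_args, train_dataset=smoke_ds, data_collator=data_collator).train()

If you care about a clean start, re-attach a fresh adapter afterwards, since even ten steps nudge the LoRA weights.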


    7. Evaluation That Actually Means Something

    “If your model is generating junk and your metric says it’s great, you’re not evaluating — you’re lying to yourself.”

    I’ve been down that road — looking at BLEU scores and thinking, “Cool, I guess it improved?” — only to test the model and see it struggle to finish a basic function. So I started treating evaluation like debugging: qualitative + quantitative, not just chasing a number.

    Here’s what I do now — and what’s actually given me insight into whether the fine-tuning worked.

    Perplexity Isn’t Useless — But It’s Not Enough

    Perplexity is decent for a quick sanity check — especially if you’re working on next-token prediction tasks. Personally, I always compare the perplexity before and after fine-tuning just to catch any major regressions. But I don’t stop there.
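    Here's the kind of quick check I mean. A rough sketch that averages token-level loss over a few held-out samples and exponentiates it; held_out_texts is whatever small eval set you keep aside:

    import math, torch

    @torch.no_grad()
    def quick_perplexity(model, tokenizer, texts, max_length=512):
        # Approximate perplexity: token-weighted average cross-entropy, then exp()
        total_loss, total_tokens = 0.0, 0
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).to(model.device)
            out = model(**enc, labels=enc["input_ids"])
            n = enc["input_ids"].numel()
            total_loss += out.loss.item() * n
            total_tokens += n
        return math.exp(total_loss / total_tokens)

    print(quick_perplexity(model, tokenizer, held_out_texts))  # run once before and once after fine-tuning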

    When I fine-tuned TinyLlama for coding tasks, I also tried BLEU and ROUGE — they’re okay if you have a strict reference to compare against. But they fall apart fast when there’s more than one correct output — like with natural language generation.

    So what do I rely on now?

    Manual Before/After Generation Is Underrated

    You might be surprised, but some of the best insights come from a simple before/after diff. For one recent run, I tested on a handful of real prompts — the kind users would actually input. Not toy examples.

    Here’s how I set that up:

    input_text = "Write a Python script that sends an email"
    
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=100)
    
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    

    I ran the same prompt on the base model and the fine-tuned one. The difference? The base model spit out generic code snippets. After fine-tuning, the model generated complete, functional scripts — with imports, correct usage of smtplib, and proper error handling.

    That’s not something BLEU is going to capture well.
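    To keep those before/after diffs painless, I'd wrap the comparison in a small loop. A sketch that assumes model is the PeftModel from the LoRA setup, so peft's disable_adapter() can stand in for the base model without reloading anything (the prompts are just examples):

    prompts = [
        "Write a Python script that sends an email",
        "Write a Python function that parses a CSV file into a list of dicts",
    ]

    def generate(prompt):
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        out = model.generate(**inputs, max_new_tokens=100)
        return tokenizer.decode(out[0], skip_special_tokens=True)

    for p in prompts:
        print("=" * 60, "\nPROMPT:", p)
        with model.disable_adapter():                 # LoRA weights temporarily switched off
            print("--- base ---\n", generate(p))
        print("--- fine-tuned ---\n", generate(p))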

    Task-Specific Metrics — But With Caution

    For classification-style tasks, I use accuracy, F1, etc. — but even then, I treat them as directional. What mattered most in my experience? Does the model’s output solve the actual user intent? If it’s for code generation, does the script run? If it’s summarization, does it capture key facts?

    Here’s a trick I use: I’ll create a set of gold prompts + expected behaviors and score each with a pass/fail checklist. It’s basic, but in practice, it surfaces flaws that metrics won’t.
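    The checklist itself can stay dead simple. A sketch of what mine looks like; the prompts and checks below are made-up placeholders for your own gold set:

    # Each gold prompt carries a cheap programmatic check for the behavior I actually care about
    gold = [
        {"prompt": "Write a Python script that sends an email",
         "check": lambda out: "smtplib" in out},
        {"prompt": "Write a Python function to reverse a list",
         "check": lambda out: "def " in out},
    ]

    passed = 0
    for case in gold:
        inputs = tokenizer(case["prompt"], return_tensors="pt").to("cuda")
        out = tokenizer.decode(model.generate(**inputs, max_new_tokens=100)[0], skip_special_tokens=True)
        passed += int(case["check"](out))

    print(f"{passed}/{len(gold)} gold prompts passed")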


    8. Saving + Pushing to Hugging Face Hub

    “If it’s not on the Hub, did it even happen?”

    I’ve messed this up before — uploading full models with huge weight files, losing adapter weights, breaking inference links. Now I keep it clean, fast, and reproducible.

    Saving Only Adapter Weights

    When you’re using PEFT + LoRA, you only need to save the adapter weights, not the entire model. This is what I do:

    model.save_pretrained("./tinyllama-lora-adapter")
    tokenizer.save_pretrained("./tinyllama-lora-adapter")
    

    Make sure your tokenizer is saved too, especially if you made any tweaks. You’ll need both during inference.

    Push to Hub (The Right Way)

    Here’s how I avoid the common pitfalls:

    from huggingface_hub import login
    from peft import PeftModel
    
    # Login to your HF account
    login(token="hf_...")
    
    # Push adapter weights
    model.push_to_hub("your-username/tinyllama-lora-adapter")
    tokenizer.push_to_hub("your-username/tinyllama-lora-adapter")
    

    Don’t forget to add a model card — even a short one. I usually include:

    • What task the adapter is fine-tuned for
    • Base model used
    • LoRA config
    • A sample generation output

    How I Use the Adapter Later

    When loading this for inference:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel
    
    base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    tokenizer = AutoTokenizer.from_pretrained("your-username/tinyllama-lora-adapter")
    
    model = PeftModel.from_pretrained(base_model, "your-username/tinyllama-lora-adapter")
    

    That’s it — small, fast, and portable. I can run inference from this setup without hauling around 5GB+ files.


    9. What I’d Do Differently Next Time

    “Experience is simply the name we give our mistakes.” — Oscar Wilde

    Fine-tuning TinyLlama was a wild ride, full of small wins and some interesting detours. I’ve learned a ton through trial and error, and there are definitely things I’d do differently if I could go back. Let me share a few key lessons that will save you time, effort, and resources.

    Mistakes That Set Me Back

    I’m not gonna sugarcoat it — I made a few mistakes along the way. But hey, that’s how we all get better, right?

    Wrong Tokenization

    The first hiccup I hit was with tokenization. I remember spending hours wondering why my model wasn’t learning well — I was getting poor perplexity scores and the outputs felt random. Turns out, the tokenizer I was using didn’t align well with TinyLlama’s pre-training. The base model expected text to be tokenized differently than what I was doing.

    What I should’ve done: I should have checked the tokenizer documentation for TinyLlama more carefully, especially the padding and truncation settings. I’ve learned to always test the tokenizer on sample input before proceeding.

    For example, here’s the code I would use now to make sure tokenization works as expected:

    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')
    input_text = "Write a Python script that sends an email"
    inputs = tokenizer(input_text, truncation=True, padding=True, max_length=512)
    
    print(inputs)
    

    I test with different max_length, truncation, and padding settings to see how it affects the output. Trust me, doing this upfront saves you headaches down the line.

    Using Incompatible BitsAndBytes (bnb) Versions

    Another thing I regret? Playing fast and loose with versions. At one point, I was using a different version of the bitsandbytes library than what was compatible with my version of transformers. This caused all kinds of issues — crashes, slowdowns, and, worst of all, the model not loading properly.

    What I should’ve done: Always check compatibility between transformers, bnb, and torch versions. These libraries move fast, and sometimes small version changes can cause big headaches. Here’s how I now keep everything synced:

    # Same pins as in the Prerequisites section; install a torch build that matches your CUDA version
    pip install transformers==4.39.3 datasets==2.18.0
    pip install peft==0.10.0 accelerate==0.29.2 bitsandbytes==0.42.0
    

    Best Practices That Saved Time & Compute

    Now, onto the things that actually worked. These are the tricks I wish I’d known from the beginning.

    Gradient Checkpointing

    If you’re fine-tuning a model with limited resources, gradient checkpointing is a must. It saves memory and lets you train with larger batch sizes or deeper models without running into memory bottlenecks. Enabling it was a game-changer for me.

    model.gradient_checkpointing_enable()   # the supported API, rather than poking model.config directly
    model.config.use_cache = False           # the KV cache and checkpointing don't mix during training
    

    FP16 and Mixed Precision

    Using fp16 (half precision) is one of the quickest ways to speed up training. It reduces memory usage and can accelerate training without much impact on performance. I’m always sure to set fp16=True in my TrainingArguments:

    training_args = TrainingArguments(
        fp16=True,
        ...
    )
    

    Smaller, Smarter Batches

    When I first started training TinyLlama, I tried large batch sizes hoping to speed things up. But that caused frequent out-of-memory errors and longer training times because the GPU was bogged down. Smaller batches with gradient accumulation turned out to be the sweet spot:

    training_args = TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        ...
    )
    

    This way, you don’t need to deal with those out-of-memory crashes, and you still get the benefit of larger effective batch sizes.

    Mistakes to Avoid for TinyLlama

    You might be wondering: What are the critical mistakes to avoid, especially with small models like TinyLlama?

    Here’s the deal: Don’t try to treat TinyLlama like a bigger, more resource-heavy model. Small models are more sensitive to overfitting, so I had to be careful about the number of epochs I ran.

    Overfitting

    I quickly learned that less is more with TinyLlama. With such a small model, it’s easy to overfit if you don’t monitor training closely. I started by running too many epochs, thinking I’d get better results, but the model began memorizing the data instead of generalizing.

    What I do now: I limit epochs to 3-4 and keep a close eye on validation loss. If it stops improving, I stop training. Here’s how I set it up:

    training_args = TrainingArguments(
        num_train_epochs=3,
        ...
    )
    

    Also, monitor your model’s validation performance to spot overfitting early.
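    If you'd rather automate that than babysit the loss curve, transformers ships an early-stopping callback. A sketch, assuming you've carved out a small validation split and are reusing the data_collator from the training section:

    from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

    training_args = TrainingArguments(
        output_dir="./tinyllama-lora-finetuned",
        num_train_epochs=4,
        evaluation_strategy="steps",
        eval_steps=200,
        save_steps=200,                       # must line up with eval_steps for load_best_model_at_end
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=True,
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],  # assumes you kept a validation split aside
        data_collator=data_collator,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )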

    Ignoring Data Augmentation

    One of the biggest improvements I made came from using data augmentation. TinyLlama is already a compact model, so augmenting the training data (e.g., using synonym replacement or paraphrasing) helped it generalize better. Without this, it often failed to generate diverse outputs. Here’s an example of what I did:

    from datasets import concatenate_datasets

    def augment_data(example):
        # Simple synonym replacement for data augmentation
        example['text'] = example['text'].replace('email', 'message')
        return example

    # Keep the originals and append the augmented copies instead of overwriting them in place
    augmented = dataset.map(augment_data)
    dataset = concatenate_datasets([dataset, augmented])
    

    What I’d Do Next Time

    1. Tokenizer Testing First: Test your tokenizer upfront.
    2. Version Consistency: Keep track of library versions, especially for bitsandbytes.
    3. Gradient Checkpointing: Always use it for large models.
    4. Smarter Batch Sizes: Stick to smaller batches and use gradient accumulation.
    5. Shorter Training Cycles: Overfitting is easy to hit with small models.
    6. Augment Your Data: Use lightweight data augmentation to help the model generalize.

    I wouldn’t call these “mistakes” anymore — just part of the learning process. Trust me, once you internalize these practices, your fine-tuning will be faster, smoother, and a whole lot more successful.


    Conclusion: Wrapping It All Up

    When it comes to fine-tuning models like TinyLlama, it’s a journey that requires careful balancing of resources, technical know-how, and patience. But with the right tools and a solid understanding of the best practices, you can transform a modest model into something powerful, without burning through time or compute.

    Here’s what I want you to walk away with:

    • Efficiency is Key: Techniques like PEFT (LoRA), gradient checkpointing, and FP16 mixed precision are your best friends when working with limited resources. Don’t waste time trying to fine-tune on massive models that aren’t suited for your use case. TinyLlama is small, but its performance can be impressive if handled correctly.
    • Testing First, Always: Tokenization and version compatibility can make or break your fine-tuning process. Don’t skip these initial steps — trust me, I’ve been there.
    • Smarter Strategies: Use smaller batches, accumulate gradients, and keep epochs in check. Overfitting is sneaky, and it’ll creep up faster on smaller models.
    • Learn from Mistakes: Every mistake is an opportunity to improve. I’ve made plenty of them — from choosing the wrong version of a library to tokenization blunders. With experience, I’ve built a set of strategies that help me avoid those pitfalls.

    In the end, fine-tuning is about finding that sweet spot between optimizing performance and managing resources. You’ll get faster with each attempt, and your models will start to behave more like the magic you envisioned when you first started.

    So, whether you’re diving into TinyLlama or experimenting with a different model, remember that it’s all part of the learning curve. And with the techniques I’ve shared, you’ll save yourself a lot of time, frustration, and unnecessary compute.

    Good luck — and happy fine-tuning!
