1. Introduction
I’ve fine-tuned Code Llama on a bunch of real-world tasks—everything from auto-generating docstrings to translating legacy Python 2 code into modern idiomatic Python 3. And here’s the thing: prompt engineering just didn’t cut it when I needed consistency, reliability, and lower token overhead.
Fine-tuning gave me a level of control that prompting simply couldn’t. For example, when I trained the 13B variant on a custom dataset of Python method definitions and their corresponding docstrings, hallucinations dropped by ~35%, and the model started generating highly structured, context-aware completions. No prompt tricks, no retries—just clean outputs.
So if you’re looking to specialize Code Llama for your in-house codebase or a niche programming task, you’re in the right place. I’m going to walk you through how I got this running on my own hardware, the pitfalls I ran into, and what actually worked—not just what looks good in theory.
2. Prerequisites & Setup
This isn’t one of those sections where I tell you to install Python and sign up for Hugging Face. If you’re here, I’m assuming you already know your way around. Let’s skip straight to the environment that worked for me.
Hardware I Used
For fine-tuning Code Llama 13B, I ran with:
- 2 × A100 (80GB) on a dedicated box
- For a lighter experiment, I also tested 4 × RTX 4090 (24GB each) with QLoRA—tight but doable with careful batching
You can go even leaner with the 7B version, especially if you’re testing ideas or fine-tuning on smaller curated datasets. But for production-level results, I found the 13B model hit a sweet spot between quality and cost.
Choosing Between 7B vs 13B vs 34B
Here’s what I learned the hard way:
- 7B is fast and great for lightweight use-cases like code formatting, small completions, or assistant-style prompting.
- 13B worked best for more semantic tasks—like translating pseudocode to actual functions or generating context-aware docstrings.
- 34B gives impressive results, but the hardware demands are steep. Unless you’re working on infrastructure-level code modeling or commercial-scale dev tools, 13B is probably your best bet.
Required Packages
These versions worked best for me and avoided half the typical dependency issues:
pip install transformers==4.39.3 datasets accelerate bitsandbytes peft trl
huggingface-cli login
Don’t skip the login step—if you haven’t already requested access to the Code Llama weights on Hugging Face, you’ll hit a wall here.
Also, if you’re running multiple GPUs, I highly recommend setting up accelerate properly. It’ll save you from manually messing with torch.distributed configs.
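Concretely, the two commands I’m talking about look like this (train.py is just a placeholder name for whatever script ends up wrapping the training code later in this guide):

accelerate config          # one-time interactive setup: multi-GPU, mixed precision, etc.
accelerate launch train.py # runs your training script across all visible GPUs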
3. Preparing the Dataset
Let me be blunt: the biggest performance gains I saw didn’t come from fancy optimizers or longer training runs. They came from how I structured the data.
What Code Llama Actually Likes
Code Llama is tuned for causal language modeling—so it naturally excels at tasks like:
- Code completion
- Inline comment or docstring generation
- Bug fixing (when framed as masked edits or before-after pairs)
- Translating code between formats (Python to Bash, for instance)
With that in mind, I shaped my dataset to follow a strict prompt-completion structure. I wasn’t trying to do anything exotic—just clean, consistent pairs of input and expected output. And honestly, that made all the difference.
You might be wondering: How should I format my own dataset to mimic that structure? Here’s exactly what worked for me.
Prompt-Completion Format
Most of my data lives in JSONL format, where each line looks something like this:
{"prompt": "### Translate this Python snippet to Bash:\n\n", "completion": "echo $((3 + 4))"}
Or for docstring generation:
{"prompt": "### Add a docstring:\ndef add(a, b):\n return a + b", "completion": "\"\"\"Add two numbers and return the result.\"\"\""}
Notice how the prompt sets up context, and the completion does exactly one thing. No extra text. No noise.
You can also load this into Hugging Face’s datasets library without converting formats, which was a nice time-saver.
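For example, a minimal sketch of loading the JSONL directly (data.jsonl is a placeholder filename):

from datasets import load_dataset

dataset = load_dataset("json", data_files={"train": "data.jsonl"})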
Preprocessing & Tokenization
Here’s the preprocessing block I used. Short, clean, and tailored to how Code Llama expects inputs:
from datasets import load_dataset
# Load your JSONL or Hugging Face dataset
dataset = load_dataset("path/to/your/dataset")
# Format to <s> prompt completion </s> — as Code Llama prefers
def format(example):
    return {
        "text": f"<s> {example['prompt']} {example['completion']} </s>"
    }
# Apply formatting
dataset = dataset.map(format)
Yes, Code Llama understands <s> and </s> as special tokens (especially when working with Hugging Face tokenizers). And trust me, adding those consistently helped reduce weird truncations and tokenization edge cases during training.
A Tip That Saved Me Hours
Don’t forget to run dataset = dataset.shuffle(seed=42) after formatting if your samples are even remotely clustered by task type. When I skipped this, my training loss dropped quickly—but generalization suffered.
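And since the trainer setup later expects a validation split, here’s a minimal sketch of how I’d carve one out right after shuffling (assuming a single train split; the 5% size is arbitrary):

from datasets import DatasetDict

dataset = dataset.shuffle(seed=42)
split = dataset["train"].train_test_split(test_size=0.05, seed=42)
dataset = DatasetDict({"train": split["train"], "validation": split["test"]})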
4. Loading the Model with QLoRA (or Full Precision if You’ve Got the GPUs)
At this point, you’ve got the dataset ready and formatted cleanly. So let’s talk about getting the model up and running—without blowing up your VRAM.
I’ve run Code Llama both ways: full-precision on A100s and 4-bit quantized with QLoRA on 4090s. And honestly? If you’re not running at least 80GB per GPU, QLoRA is your best friend.
Here’s the deal:
You don’t want to waste time tweaking dtype and device_map combinations by hand. Just use AutoModelForCausalLM with a proper BitsAndBytesConfig, and let device_map="auto" figure it out. It works reliably with the Hugging Face transformers stack.
Here’s the exact code I use to load the 13B model in 4-bit QLoRA mode:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-13b-hf")
tokenizer.pad_token = tokenizer.eos_token
I’ve seen a few folks forget to set the pad_token, and it ends up breaking batching later when you push samples through the Trainer or SFTTrainer. Save yourself the headache—just map it to the EOS token early on.
When I Skip QLoRA
Personally, I only go full-precision when:
- I’m doing long-form training runs (>10k steps)
- I want to evaluate subtle differences in gradients (e.g. analyzing loss spikes or overfitting behavior)
- I have A100s on hand
Even then, the benefits aren’t always worth the extra memory unless you’re doing complex multi-task training. For most code-related fine-tuning (like doc gen, code edits, etc.), QLoRA performs surprisingly close to full finetuning, especially if you prep your dataset properly (which you already did in the last section).
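For reference, the full-precision load is the same from_pretrained call minus the quantization config. A minimal sketch (bf16 assumed, since I only do this on A100s):

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)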
5. Applying QLoRA with PEFT
If you’re fine-tuning Code Llama and not using PEFT, you’re probably wasting compute. I’ve gone through the painful full-finetune path—and unless you’re retraining on massive multi-language corpora, it’s overkill.
Here’s the thing: with QLoRA and PEFT, you only touch a small number of trainable parameters—yet the results often get you 90–95% of the gains you’d see with full finetuning. I’ve personally used this setup to fine-tune the 13B model on code repair and inline comment generation, and it holds up shockingly well.
Let’s jump straight into the config that worked for me:
LoRA Config (Tailored for Code Llama)
from peft import LoraConfig, get_peft_model, TaskType
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
Let’s unpack this just a bit:
- target_modules=["q_proj", "v_proj"]: I’ve tested other combinations (like k_proj, o_proj, even gate_proj), but sticking to q_proj and v_proj hits the best trade-off for size vs. impact, at least on Code Llama.
- r=64: I started with 8 and 16, but saw noticeable jumps when I moved to 64. Especially on 13B, that extra capacity gives the adapter room to actually learn the task.
- lora_dropout=0.1: This was critical when training on small datasets—helped avoid overfitting without hurting convergence.
Once you apply get_peft_model, you can call print_trainable_parameters() and confirm that only the adapter weights are being trained. That was my checkpoint before launching any serious job—just to make sure I hadn’t messed up the config.
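If you want something stricter than eyeballing the printout, here’s a small check I’d run (a sketch, assuming the PEFT-wrapped model from the block above):

# every tensor that still requires grad should be a LoRA adapter weight
for name, param in model.named_parameters():
    if param.requires_grad:
        assert "lora_" in name, f"unexpected trainable parameter: {name}"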
6. Training Configuration
If you’re running Code Llama on consumer-grade GPUs (like I am on 4 × 4090s), getting the training config right isn’t optional—it’s survival.
I’ve spent a lot of time testing combinations of Trainer vs SFTTrainer (from the trl library), and for most custom code-gen tasks, SFTTrainer has been the smoother experience. It gives you more flexibility when integrating PEFT, plus native support for packing, which is a big deal when optimizing VRAM usage on longer sequences.
That said, here’s the training config I keep coming back to when fine-tuning Code Llama 13B on a multi-GPU setup (2 × A100s or 4 × 4090s):
My Go-To TrainingArguments Setup
from transformers import TrainingArguments
args = TrainingArguments(
    per_device_train_batch_size=2,   # Works well for 13B with LoRA
    gradient_accumulation_steps=8,   # Simulates an effective batch of 16 per GPU
    learning_rate=2e-5,              # Sweet spot for adapter tuning
    num_train_epochs=3,
    fp16=True,                       # Or use bf16=True on A100s
    logging_steps=10,
    save_strategy="epoch",
    output_dir="./codellama-ft",
)
Let me explain a few of the choices here, just from what I’ve seen firsthand:
- Batch size & accumulation: I initially tried a batch size of 4 with lower accumulation, but that caused frequent out-of-memory errors on the 4090s—especially with sequences over 1024 tokens. Keeping batch size at 2 with higher accumulation was a stable workaround.
- Learning rate: I’ve played with everything from 5e-6 to 1e-4, but 2e-5 consistently delivered strong convergence without overshooting. Especially when paired with LoRA, this rate gives the adapter enough room to learn without destabilizing things.
- Precision: I usually go with fp16 on 4090s, but on A100s, bf16=True is just better—faster and numerically more stable for longer training runs. One thing I learned the hard way: mixing the two in a multi-GPU setup can cause silent divergence issues.
7. Running the Fine-Tuning
At this point, you’ve wired up your model with QLoRA, dialed in your training arguments, and everything looks good on paper. Now it’s time to actually run the fine-tuning loop.
When I’m training models like Code Llama, I don’t overcomplicate the trainer setup. In fact, 90% of the time, this minimal block of code just works—clean and reliable:
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"] if "validation" in dataset else None,
    dataset_text_field="text",  # the formatted field we built during preprocessing
    max_seq_length=1024,        # keeps 4090s out of OOM territory; raise it if you have headroom
)
trainer.train()
That’s it. No magic. No hacks. If your datasets and tokenizer are prepared correctly, this just runs.
What I Watch While It Trains
Here’s the deal: a lot of people focus too much on loss curves and not enough on what those curves actually mean in context.
Personally, I keep a close eye on the training loss, especially around the 500–1000 step mark. With code models, I’ve found that a loss in the 1.5 to 2.5 range usually indicates that the model is learning meaningful patterns—assuming your dataset is clean and reasonably tokenized.
If you’re seeing it plateau early or dip below 1.0 too fast, chances are you’re overfitting. I’ve seen this happen especially when fine-tuning on domain-specific codebases with low entropy. The model memorizes structure fast, but that doesn’t translate to generalization.
Also, small tip from my experience: if your eval loss is zig-zagging all over the place while training loss is smooth, double-check your tokenizer alignment. I learned this the hard way on one of my earlier runs where tokenizers weren’t synced, and the eval metrics were basically garbage.
So yeah—don’t overthink the loop. Set it up, monitor loss in context, and make sure you’re watching what really matters. Once you’ve got a stable setup, scaling to larger datasets or longer epochs becomes way easier.
8. Evaluation
“Not everything that counts can be counted…” — but in code generation, you better be counting the right things.
Once training wraps up, I don’t just glance at the final loss and call it a day. That’s fine for toy models, but when you’re tuning something like Code Llama for serious tasks—code completion, synthesis, or repair—you need to go deeper.
I typically run a few targeted evaluations depending on the use case:
- Exact Match: Useful when you’re working with templates or completion tasks where the output has a clear canonical answer.
- BLEU: Personally, I use this when comparing structural similarity across generations, especially when the output isn’t strictly deterministic.
- CodeEval / Functional Tests: This is where things get interesting. For function-level synthesis, I write unit tests or use a sandboxed exec() setup to check whether the generated function actually works (see the sketch right after this list). And yes, I’ve used actual exec()—but only when I can fully isolate the environment. Don’t run that on your local dev box unless you like surprises.
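Here’s a minimal sketch of that functional check. Note this version shells out to a subprocess with a timeout instead of calling exec() directly, and the add example is just a placeholder:

import subprocess
import sys
import tempfile

def passes_functional_test(generated_code: str, test_code: str, timeout: int = 5) -> bool:
    """Write the generated function plus a tiny test to a temp file and run it in isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# hypothetical usage
print(passes_functional_test("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))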
Here’s the script I often use to run quick prediction checks right after training:
inputs = tokenizer("def quicksort(", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
You might be wondering: Is this enough for production eval? No, but it’s fast and catches obvious regressions. For more rigorous setups, I run eval on curated prompt-response pairs and validate functional correctness across edge cases.
In short, eval isn’t just a metric—it’s where the model proves whether it actually understands code or is just mimicking syntax.
9. Saving and Inference (LoRA or Merged)
Here’s the deal: once the model’s trained, you’ve got two ways to wrap things up—save just the LoRA adapters (lean and portable), or merge and save the full model (standalone, deploy-ready).
If you’re sticking with LoRA:
model.save_pretrained("codellama-lora-adapter")
tokenizer.save_pretrained("codellama-lora-adapter")
This is the approach I prefer during iteration—especially when I’m still tuning or experimenting across datasets. Small footprint, fast reloads.
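When I come back to an adapter later, reloading it looks roughly like this (base model ID and adapter path match the earlier blocks):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "codellama-lora-adapter")
tokenizer = AutoTokenizer.from_pretrained("codellama-lora-adapter")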
If you’re ready to deploy or test the full model end-to-end:
model = model.merge_and_unload()
model.save_pretrained("codellama-finetuned")
Once merged, the adapters are baked into the base weights. This makes it easier to ship the model to inference pipelines without needing PEFT or LoRA-specific code.
Just one thing I learned the hard way: always double-check the outputs before and after merging. Sometimes the merge introduces tiny numerical differences that affect generation—especially if you’re pushing the model on tight constraints like token budgets or greedy decoding.
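For what it’s worth, here’s a minimal sketch of how I structure that check: capture a greedy generation while the adapter is still attached, merge, then diff the outputs (the docstring prompt is just a placeholder):

prompt = "### Add a docstring:\ndef add(a, b):\n    return a + b"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# generation with the adapter still attached
with_adapter = tokenizer.decode(
    model.generate(**inputs, max_new_tokens=64, do_sample=False)[0],
    skip_special_tokens=True,
)

# merge, then generate again with the standalone weights
merged = model.merge_and_unload()
after_merge = tokenizer.decode(
    merged.generate(**inputs, max_new_tokens=64, do_sample=False)[0],
    skip_special_tokens=True,
)

print(with_adapter == after_merge)  # tiny numerical drift can flip this on long generations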
10. Troubleshooting & Common Pitfalls
“The model trains… until it doesn’t.”
I’ve spent enough late nights staring at dead logs to know: fine-tuning always breaks the first time. Here’s what I’ve run into—and what I double-check when things go sideways.
Model not training at all?
First thing I check: Are the LoRA layers actually attached?
With peft, it’s surprisingly easy to misconfigure and end up freezing everything. I’ve done this myself—trained for 3 hours and realized the adapter layers weren’t even connected. Do a quick model.print_trainable_parameters() to confirm you’re not training thin air.
CUDA OOM?
If your GPU’s screaming for help, drop the batch size or switch on gradient_checkpointing.
Personally, I keep a config template handy that includes this:
args.gradient_checkpointing = True
args.per_device_train_batch_size = 2
Sometimes, reducing max_seq_length can also give you breathing room if your code samples are unnecessarily long.
Output is garbage?
Here’s where it gets tricky. Most of the time, when I see completely unusable generations, it’s a tokenization mismatch. Either the tokenizer wasn’t properly aligned with the base model, or there’s some strange preprocessing bug.
Pro tip: Always log a few tokenized inputs to visually inspect what’s going in. I’ve caught issues where special tokens were missing or whitespace got nuked in preprocessing.
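A minimal version of that spot-check, assuming the dataset and tokenizer objects from the earlier sections:

sample = dataset["train"][0]["text"]
ids = tokenizer(sample).input_ids
print(ids[:20])                                   # raw token IDs going into the model
print(tokenizer.convert_ids_to_tokens(ids[:20]))  # confirm <s>, whitespace, and special tokens survived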
Validation loss flat or rising?
This might surprise you: it’s not always the optimizer’s fault.
- If you’re overfitting: your dataset is probably too small, or too repetitive.
- If you’re underfitting: maybe you’re not training long enough, or your learning rate’s too low.
Sometimes, I intentionally train longer on a smaller dataset just to overfit on purpose and validate that the model can actually learn the pattern. If it can’t, the problem is structural.
11. Conclusion
Fine-tuning Code Llama, especially with LoRA, has been surprisingly efficient for me.
What I like most? You get targeted improvements on real-world tasks—without needing to retrain the entire 7B+ stack from scratch.
If you’re building serious code generation systems, I can’t stress this enough: custom fine-tuning changes everything. You go from “sort of works” to “this actually understands my task.”
And once you’ve got that dialed in… the output speaks for itself.
