1. Introduction
You already know the theory behind language models. You’ve read the papers, experimented with transformers, maybe even fine-tuned a few.
But when it comes to actually aligning these models with human preferences—ranking outputs, training reward models, and using DPO or PPO—it’s easy to get lost in vague tutorials or bloated theory.
I’ve been down that road. I’ve fine-tuned models with noisy feedback, crashed GPUs with PPO instability, and wasted time reading guides that skip the hard parts.
This post is different.
I’ll walk you through exactly how I fine-tuned LLMs using human preference data, covering:
- How I trained reward models from scratch
- My real setup for Direct Preference Optimization (DPO)
- Using PPO with `trl` the right way (without blowing up your memory)
- How I handled quantized models, LoRA adapters, and scaling on consumer-grade hardware
It’s hands-on, and everything here is production-ready. Let’s dive in.
2. Setup and Requirements
Let’s not waste time on the basics. If you’re here, I’ll assume:
- You know how `transformers` works
- You’ve fine-tuned models before (maybe even with LoRA)
- You want to build an actual RLHF pipeline, not just toy around in Colab
This section covers everything I personally set up before training models with human feedback—no fluff, just the exact stack that worked for me.
Tools I Used
Here’s what I installed for PPO, DPO, and reward model fine-tuning:
pip install transformers trl accelerate peft bitsandbytes
# Optional but helpful:
pip install wandb datasets
Why each one?
- transformers: base models and tokenizers
- trl: PPOTrainer, DPOTrainer — core of RLHF training
- accelerate: easy multi-GPU training and fp16/bf16 support
- peft: LoRA adapter training
- bitsandbytes: 4-bit/8-bit quantization (yes, it works for RLHF too)
If you’re serious about logging and debugging, I highly recommend integrating Weights & Biases. Personally, I don’t run any RLHF experiments without it—it’s saved me more than once.
Hardware Requirements
This might surprise you: I trained DPO with a 7B model on one A100 (40GB) using LoRA + 4-bit quantization.
PPO, on the other hand, is trickier—you’ll want at least 2 x A10s or 1 x A100, depending on your batch sizes and rollout length.
If you’re stuck on smaller GPUs, don’t worry—I’ll show how I used LoRA and quantization to make it work without killing my system.
Environment Setup
This is what I run before any RLHF training—clean, fast, and tested:
# Optional: Create a new conda environment
conda create -n rlhf python=3.10
conda activate rlhf
# Install dependencies
pip install -r requirements.txt
Example `requirements.txt`:
transformers==4.39.3
trl==0.7.4
peft==0.10.0
accelerate==0.27.2
bitsandbytes==0.43.1
datasets
wandb
Then I configure accelerate:
accelerate config
Here’s my typical config for A100:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
If you’re running on CUDA 12+ with A100/H100, I highly recommend enabling Flash Attention 2 (install via `pip install flash-attn --no-build-isolation`) to get faster training during PPO and DPO, especially with larger context windows.
3. Step 1: Supervised Fine-Tuning (SFT) – Optional, But It Helps
I’ll be honest—while SFT isn’t strictly required for RLHF, I almost always start with it. Why? Because in my experience, running PPO or DPO on a raw pretrained model (like a base LLaMA or Mistral) often leads to unstable behavior—especially if your prompts are diverse or domain-specific.
So even though some guides skip this step, I don’t. I fine-tune with a small SFT pass first to give the model some grounding in the task. Think of it as giving it some intuition before asking it to optimize for preference.
Dataset Format
You’ll want something like this for reward model training (which then feeds PPO) or for DPO:
{"prompt": "Why is the sky blue?", "chosen": "Because molecules scatter blue light.", "rejected": "It’s just blue, trust me."}
If your raw annotations come in a “pick one of two” form instead, for example:
{
  "prompt": "Why is the sky blue?",
  "response_1": "Because molecules scatter blue light.",
  "response_2": "It’s just blue, trust me.",
  "preferred": 1
}
convert them to the prompt / chosen / rejected layout above before training; that’s what TRL’s DPOTrainer (and the reward model code later in this post) expects.
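A minimal sketch of that conversion with the `datasets` library (the file name `preferences_raw.jsonl` is a placeholder; column names match the example above):

```python
from datasets import load_dataset

raw = load_dataset("json", data_files="preferences_raw.jsonl", split="train")

def to_pairwise(example):
    # Map the "pick one of two" annotation onto chosen/rejected
    first_preferred = example["preferred"] == 1
    return {
        "prompt": example["prompt"],
        "chosen": example["response_1"] if first_preferred else example["response_2"],
        "rejected": example["response_2"] if first_preferred else example["response_1"],
    }

preference_dataset = raw.map(to_pairwise, remove_columns=["response_1", "response_2", "preferred"])
```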
Tokenization + Preprocessing
Here’s the actual preprocessing code I used in one of my DPO + SFT experiments with Mistral-7B:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", use_fast=True)
def preprocess(example):
    return tokenizer(
        example["prompt"] + "\n\n" + example["chosen"],
        truncation=True,
        padding="max_length",
        max_length=1024
    )
You can use `DataCollatorWithPadding`, but personally I go with fixed padding and truncate aggressively to avoid OOM errors during SFT, especially when using quantized models.
Code: LoRA + Trainer-based SFT
Here’s a minimal example of the exact setup I used to run SFT on a quantized 7B model with LoRA:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
import torch
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    load_in_4bit=True,
    device_map="auto"
)
# Apply LoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
training_args = TrainingArguments(
    output_dir="./sft-checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
    num_train_epochs=3,
    warmup_steps=50,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer
)
trainer.train()
Pro Tip: I set `gradient_checkpointing=True` if I’m pushing model limits on A10s. Also, `bnb_4bit_use_double_quant=True` shaves a bit more memory off 4-bit loading.
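To make both concrete, here’s roughly how they plug into the load step above (a sketch; `bnb_4bit_compute_dtype` is my usual pick, adjust to your hardware):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # extra memory savings on 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # or pass gradient_checkpointing=True to TrainingArguments
```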
4. Step 2: Training the Reward Model
Here’s the deal: your reward model can make or break PPO-style fine-tuning. A poorly trained reward model won’t just hurt alignment—it’ll give your PPO trainer garbage signals, and you’ll burn through hours of compute for nothing.
I’ve made that mistake before. Now, I train a clean, minimal reward model using a pairwise ranking loss and keep the architecture simple.
Dataset Format
The format here needs to be tight. I stick with JSONL that includes both chosen and rejected completions for the same prompt:
{"prompt": "Explain relativity.", "chosen": "Relativity is...", "rejected": "Einstein was a guy who..."}
You can convert your SFT dataset into this format easily if you already have upvote/downvote logs or preference annotations.
Code: Reward Model Architecture + Loss
I usually start with a distilled or small base model + scalar head for reward scoring.
Here’s what I used in a recent project:
from transformers import AutoModelForCausalLM
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
        self.value_head = nn.Linear(self.model.config.hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
        last_hidden = outputs.hidden_states[-1]
        # Score the last non-padding token of each sequence (with right padding,
        # position -1 would usually be a pad token)
        last_token_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        reward = self.value_head(last_hidden[batch_idx, last_token_idx])  # scalar reward per sequence
        return reward.squeeze(-1)
Reward Loss (Bradley-Terry / Pairwise Ranking)
This is the exact loss I use:
import torch.nn.functional as F
def compute_loss(chosen_reward, rejected_reward):
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
This simple loss has worked consistently for me. I’ve tried alternatives like hinge loss, but `logsigmoid` tends to converge more smoothly, especially when the preference signal is weak or noisy.
Trainer Setup (Simplified)
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./reward-model",
per_device_train_batch_size=4,
num_train_epochs=3,
learning_rate=1e-5,
logging_steps=20,
fp16=True
)
trainer = Trainer(
model=reward_model,
args=training_args,
train_dataset=reward_dataset,
tokenizer=tokenizer,
compute_metrics=None # usually reward models don't use accuracy here
)
trainer.train()
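One caveat: the stock Trainer doesn’t know what to do with chosen/rejected pairs, so the simplified setup above needs a thin subclass (or trl’s RewardTrainer) to actually apply the pairwise loss. A minimal sketch, assuming your collator produces the hypothetical `chosen_*` / `rejected_*` tensor names used below:

```python
from transformers import Trainer

class PairwiseRewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Score both completions with the reward model defined earlier
        chosen_reward = model(
            input_ids=inputs["chosen_input_ids"],
            attention_mask=inputs["chosen_attention_mask"],
        )
        rejected_reward = model(
            input_ids=inputs["rejected_input_ids"],
            attention_mask=inputs["rejected_attention_mask"],
        )
        loss = compute_loss(chosen_reward, rejected_reward)  # Bradley-Terry loss from above
        return (loss, {"chosen": chosen_reward, "rejected": rejected_reward}) if return_outputs else loss
```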
Heads-up: Be very careful with overfitting here. I always check reward score distributions on a held-out set to make sure the model isn’t just memorizing noisy preferences.
5. Step 3: Fine-Tuning via Human Feedback
“When the reward is fuzzy, the optimization better be sharp.”
Let’s not waste time with theory—you already know what PPO and DPO are. So I’ll walk you through how I actually implement both, when I choose one over the other, and the exact code I’ve used in production training runs.
Option A: PPO (Proximal Policy Optimization)
When do I use PPO?
Personally, I reach for PPO when I have a reward model that I trust—trained cleanly on real human preferences. This setup gives you more flexibility and fine-grained control over reward shaping. But heads up: PPO is slow and expensive. You’ll feel it when you’re running this on anything smaller than an A100.
Architecture Summary
- Model: LoRA or 4-bit quantized causal LM
- Reward: Scalar score from a pre-trained reward model
- Optimizer: KL-penalized policy gradient
PPO Code (from real use)
Here’s the actual PPO loop I’ve used with TRL’s `PPOTrainer`. It includes KL control, reward model integration, and W&B logging:
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
import torch
# Define PPO config
ppo_config = PPOConfig(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    learning_rate=1e-5,
    batch_size=4,
    mini_batch_size=1,
    gradient_accumulation_steps=4,
    log_with="wandb"
)
# Load tokenizer + policy model with value head (LoRA/4-bit recommended)
tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    ppo_config.model_name,
    load_in_4bit=True,
    device_map="auto",
    peft_config=lora_config  # reuse the LoraConfig from the SFT section
)
# Load reward model separately; PPOTrainer doesn't take it as an argument,
# you score the responses yourself and pass the rewards to step()
reward_model = RewardModel(base_model_name="mistralai/Mistral-7B-Instruct-v0.2").to("cuda")
reward_model.eval()
# Define the trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
)
# Training loop (train_dataloader yields batches of raw prompt strings)
for batch in train_dataloader:
    # PPOTrainer.step expects lists of per-example 1-D tensors
    query_tensors = [
        tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0).to("cuda")
        for prompt in batch["prompt"]
    ]
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=128)

    # Score prompt + response together with the reward model
    texts = [tokenizer.decode(torch.cat([q, r]), skip_special_tokens=True)
             for q, r in zip(query_tensors, response_tensors)]
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        reward_scores = reward_model(inputs.input_ids, inputs.attention_mask)
    rewards = [score for score in reward_scores]  # list of scalar tensors

    ppo_trainer.step(query_tensors, response_tensors, rewards)
Note: If you’re not logging KL, entropy, and rewards with W&B, you’re flying blind. Personally, I track KL divergence every 10 steps to avoid over-optimization.
Option B: DPO (Direct Preference Optimization)
You might be wondering: why mess with PPO if DPO skips the reward model entirely?
That’s the exact question I asked myself after burning too many hours wrangling reward models. And the answer is: DPO is simply faster and cleaner—if you already have human preference pairs.
When I use DPO
- When I’m fine-tuning with datasets like HH-RLHF, Anthropic HH, or OpenAssistant
- When I don’t want the complexity of a separate reward model
- When I need faster turnaround (especially during experiments)
DPO Code That Just Works
Here’s a minimal but real DPO setup. I’ve used this with both LLaMA-2 and Mistral with smooth training curves:
from trl import DPOTrainer
from transformers import AutoTokenizer, TrainingArguments
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=5e-6,
    logging_steps=10,
    output_dir="./dpo_output",
    fp16=True
)
trainer = DPOTrainer(
    model=model,  # PEFT or quantized model
    args=training_args,
    beta=0.1,  # Temperature for preference strength
    train_dataset=preference_dataset,
    tokenizer=tokenizer
)
trainer.train()
Beta matters. I typically start with `beta=0.1` and scale up to 0.3 if the model isn’t learning strong preferences. Think of it as sharpening the “decision boundary” between responses.
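For intuition on what beta is doing, here’s the DPO objective written out as a small function (a sketch of the standard formulation, not the internals of DPOTrainer; the log-prob inputs are assumed to come from scoring each completion under the policy and the frozen reference model):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of policy vs. frozen reference model for each completion
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Larger beta amplifies the margin, i.e. a sharper preference boundary
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```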
PPO vs DPO: What I Choose
| Scenario | My Pick |
|---|---|
| I have upvote/downvote data only | PPO |
| I have (prompt, response1, response2, preferred) | DPO |
| Need fast iteration | DPO |
| Want full control over reward shaping | PPO |
6. Evaluation & Logging
“Training’s cheap. Bad evaluation is expensive.”
You might’ve nailed PPO or DPO, but without solid evaluation, you’re basically tuning in the dark. I’ve personally made that mistake—trained a gorgeous-looking model (loss was dropping, rewards were rising)… only to find out later it was spitting out repetitive nonsense. Since then, I’ve stuck to a few reliable metrics that give me a clear signal if things are going off track.
Automatic Metrics I Always Track
1. Reward Score Distribution (Before vs. After Fine-Tuning)
If your median reward isn’t moving up after training, something’s broken—period. Here’s how I usually log this:
import matplotlib.pyplot as plt
def plot_reward_distribution(pre_rewards, post_rewards):
    plt.hist(pre_rewards, bins=50, alpha=0.5, label="Before FT")
    plt.hist(post_rewards, bins=50, alpha=0.5, label="After FT")
    plt.legend()
    plt.title("Reward Score Distribution Shift")
    plt.xlabel("Reward")
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()
2. KL Divergence
This is your early-warning system. If KL starts exploding, your model is over-optimizing to rewards and drifting from the base model. Personally, I try to keep it under 0.2 per batch on average.
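TRL’s PPOTrainer reports a KL estimate in its logged stats, but when I run my own checks I track something like this (a sketch; it assumes you’ve kept per-token log-probs from the policy and the frozen reference for the sampled responses, plus a padding mask):

```python
import torch

def mean_kl(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor, mask: torch.Tensor) -> float:
    # Simple estimator: average of log pi(token) - log pi_ref(token) over generated tokens
    kl_per_token = (policy_logprobs - ref_logprobs) * mask
    return (kl_per_token.sum() / mask.sum()).item()
```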
3. Entropy of Responses
Low entropy = repetitive outputs. I run this as a sanity check for response diversity:
import torch
def compute_entropy(logits):
    probs = torch.nn.functional.softmax(logits, dim=-1)
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(-1).mean()
    return entropy.item()
4. Task-Specific Metrics (if applicable)
If I’m fine-tuning for summarization, I’ll still run ROUGE and BLEU—but only to catch regressions, not as the main judge.
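The `evaluate` library is enough for that (a sketch; `preds` and `refs` are plain lists of strings from a held-out eval set, and you’ll need `pip install evaluate rouge_score`):

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=preds, references=refs)
print(scores)  # rouge1 / rouge2 / rougeL as floats
```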
Human Eval: The Gold Standard
This might surprise you: even noisy human preferences outperform nearly all automatic metrics when it comes to alignment. I’ve used two approaches here:
- Custom UIs (simple Flask or Gradio apps where annotators rank responses)
- OpenAI Evals-style JSON format for crowdworker annotation
Even just 50 manually ranked examples can uncover failure cases that your reward model misses.
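For the Gradio route, something this small already gets you useful data. A sketch, writing straight to a JSONL file in the chosen/rejected format from earlier (the file name and storage are placeholders; use whatever backend you trust):

```python
import json
import gradio as gr

def record_preference(prompt, response_a, response_b, preferred):
    # Append one ranked pair in the chosen/rejected format used for training
    record = {
        "prompt": prompt,
        "chosen": response_a if preferred == "A" else response_b,
        "rejected": response_b if preferred == "A" else response_a,
    }
    with open("preferences.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return "Saved."

demo = gr.Interface(
    fn=record_preference,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Textbox(label="Response A"),
        gr.Textbox(label="Response B"),
        gr.Radio(["A", "B"], label="Which response is better?"),
    ],
    outputs=gr.Textbox(label="Status"),
)
demo.launch()
```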
7. Scaling Tips & Tricks
“You don’t need a cluster if you use your memory wisely.”
Let me share what’s actually helped me squeeze performance out of modest setups—without babysitting the GPU every 10 minutes.
Quantization (bitsandbytes)
I almost always fine-tune 7B models in 4-bit using `bitsandbytes`. Here’s how I load them:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", quantization_config=bnb_config)
It cuts memory use drastically—you can get 7B on a single A100 with room to spare.
Flash Attention
If you’re not using `flash-attn`, you’re wasting cycles. On Mistral and LLaMA-2, enabling flash-attn easily shaved 25–35% off my training time.
Make sure your CUDA setup is compatible and pass `attn_implementation="flash_attention_2"` during model loading, as shown below.
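Here’s the loading pattern I mean (a sketch; bf16 is my usual choice on A100/H100):

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```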
Gradient Checkpointing + LoRA
Here’s the combo I rely on when I’m forced to train on 24GB or 40GB cards:
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
This lets me train full 7B models with long prompts on A10s without running out of memory every epoch.
Log Everything
If there’s one piece of advice I’d drill into every RLHF pipeline, it’s this:
Log early. Log often. Log everything.
I log:
- GPU memory usage (`nvidia-smi` every N steps)
- Reward progression (rolling average)
- KL divergence
- Entropy
- Number of degenerate outputs (e.g., blank or token-stuck completions)
You don’t want to wake up after a 10-hour run just to see your model forgot how to speak.
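Concretely, a small helper like this covers most of that list (a sketch; it assumes `wandb.init` has already been called and you compute the values elsewhere in your loop):

```python
import torch
import wandb

def log_step(step, reward_mean, kl, entropy, num_degenerate):
    wandb.log({
        "reward/rolling_mean": reward_mean,
        "kl": kl,
        "entropy": entropy,
        "degenerate_outputs": num_degenerate,
        "gpu_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }, step=step)
```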
8. Bonus: Deploying the Fine-Tuned Model
“A model that can’t be served is just a fancy `.bin` file.”
You’ve gone through SFT, tuned the reward model, survived PPO or DPO, ran your evaluations—and now it’s time to actually use the thing. Here’s how I usually deploy fine-tuned models without turning it into a DevOps nightmare.
Exporting the Model (LoRA or Fully Merged)
If you’re using LoRA, you’ve got two options:
Option 1: Merge adapters (for standalone inference)
Merging helps if you want portability or to avoid PEFT during inference.
from peft import PeftModel
# Load base + LoRA adapters
model = PeftModel.from_pretrained(base_model, "path_to_lora")
model = model.merge_and_unload()
model.save_pretrained("merged_model_dir")
Option 2: Keep it LoRA (for lightweight, faster loading)
Personally, I keep LoRA for internal use where I control the inference script.
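Keeping the adapter separate looks like this at load time (a sketch; the paths match the merge example above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "path_to_lora")  # adapter weights only, tens of MB
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
```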
Quantized Inference with `auto-gptq` or `vLLM`
If you care about fast response times and GPU efficiency, quantization is a must.
For `auto-gptq` (this assumes you’ve already quantized the merged checkpoint to GPTQ format):
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(
    "merged_model_dir",
    use_triton=True,
    device="cuda"
)
For `vLLM`:
If you’re scaling out inference with multiple concurrent users, `vLLM` is a game-changer. I’ve run 7B models on a single A100 and served 30+ concurrent users with sub-second latency.
python -m vllm.entrypoints.openai.api_server \
--model merged_model_dir \
--tokenizer merged_model_dir \
--quantization awq # or gptq; only pass this if the checkpoint was actually quantized that way
You’ll get an OpenAI-compatible endpoint (`/v1/completions`, `/v1/chat/completions`), ready to plug into your existing apps.
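A quick sanity check against that endpoint with the openai client (a sketch; it assumes the server above is running locally on vLLM’s default port 8000 and openai>=1.0):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
response = client.chat.completions.create(
    model="merged_model_dir",  # vLLM names the served model after the --model path
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```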
Serving with FastAPI
If I’m not using vLLM
, I often wrap my model in a simple FastAPI
server for internal tools or testing.
from fastapi import FastAPI
from transformers import TextGenerationPipeline, AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("merged_model_dir")
model = AutoModelForCausalLM.from_pretrained("merged_model_dir")
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
app = FastAPI()
@app.post("/generate")
def generate(prompt: str):
    output = pipeline(prompt, max_new_tokens=200)[0]["generated_text"]
    return {"response": output}
Throw that behind a reverse proxy or Docker container and you’re ready to go.
9. Conclusion: Pulling It All Together
Let’s recap what we covered—this is the RLHF stack I’ve used across multiple projects:
- Supervised Fine-Tuning (SFT) — Get your base behavior right.
- Reward Model — Teach the model what you care about.
- PPO or DPO — Align it to human preferences.
- Evaluation — Trust nothing, verify everything.
- Deployment — Serve your model like a product.
Every project I’ve run with this pipeline had its own flavor—sometimes the reward model needed extra tuning, other times DPO alone got me 90% of the way there. What I’ve learned is: you need to adapt this to your data and users. There’s no universal config file that fits all.
