1. Introduction
You already know the theory behind language models. You’ve read the papers, experimented with transformers, maybe even fine-tuned a few.
But when it comes to actually aligning these models with human preferences—ranking outputs, training reward models, and using DPO or PPO—it’s easy to get lost in vague tutorials or bloated theory.
I’ve been down that road. I’ve fine-tuned models with noisy feedback, crashed GPUs with PPO instability, and wasted time reading guides that skip the hard parts.
This post is different.
I’ll walk you through exactly how I fine-tuned LLMs using human preference data, covering:
- How I trained reward models from scratch
- My real setup for Direct Preference Optimization (DPO)
- Using PPO with `trl` the right way (without blowing up your memory)
- How I handled quantized models, LoRA adapters, and scaling on consumer-grade hardware
It’s hands-on, and everything here is production-ready. Let’s dive in.
2. Setup and Requirements
Let’s not waste time on the basics. If you’re here, I’ll assume:
- You know how `transformers` works
- You’ve fine-tuned models before (maybe even with LoRA)
- You want to build an actual RLHF pipeline, not just toy around in Colab
This section covers everything I personally set up before training models with human feedback—no fluff, just the exact stack that worked for me.
Tools I Used
Here’s what I installed for PPO, DPO, and reward model fine-tuning:
pip install transformers trl accelerate peft bitsandbytes
# Optional but helpful:
pip install wandb datasets
Why each one?
- transformers: base models and tokenizers
- trl: PPOTrainer, DPOTrainer — core of RLHF training
- accelerate: easy multi-GPU training and fp16/bf16 support
- peft: LoRA adapter training
- bitsandbytes: 4-bit/8-bit quantization (yes, it works for RLHF too)
If you’re serious about logging and debugging, I highly recommend integrating Weights & Biases. Personally, I don’t run any RLHF experiments without it—it’s saved me more than once.
Hardware Requirements
This might surprise you: I trained DPO with a 7B model on one A100 (40GB) using LoRA + 4-bit quantization.
PPO, on the other hand, is trickier—you’ll want at least 2 x A10s or 1 x A100, depending on your batch sizes and rollout length.
If you’re stuck on smaller GPUs, don’t worry—I’ll show how I used LoRA and quantization to make it work without killing my system.
Environment Setup
This is what I run before any RLHF training—clean, fast, and tested:
# Optional: Create a new conda environment
conda create -n rlhf python=3.10
conda activate rlhf
# Install dependencies
pip install -r requirements.txt
Example `requirements.txt`:
transformers==4.39.3
trl==0.7.4
peft==0.10.0
accelerate==0.27.2
bitsandbytes==0.43.1
datasets
wandb
Then I configure accelerate:
accelerate config
Here’s my typical config for A100:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
If you’re running on CUDA 12+ with A100/H100, I highly recommend enabling Flash Attention 2 (install via `pip install flash-attn --no-build-isolation`) to get faster training during PPO and DPO, especially with larger context windows.
3. Step 1: Supervised Fine-Tuning (SFT) – Optional, But It Helps
I’ll be honest—while SFT isn’t strictly required for RLHF, I almost always start with it. Why? Because in my experience, running PPO or DPO on a raw pretrained model (like a base LLaMA or Mistral) often leads to unstable behavior—especially if your prompts are diverse or domain-specific.
So even though some guides skip this step, I don’t. I fine-tune with a small SFT pass first to give the model some grounding in the task. Think of it as giving it some intuition before asking it to optimize for preference.
Dataset Format
You’ll want something like this for reward model training (which then feeds PPO) or for DPO:
{"prompt": "Why is the sky blue?", "chosen": "Because molecules scatter blue light.", "rejected": "It’s just blue, trust me."}
If your raw annotations come in a “pick one of two” form instead, for example:
{
  "prompt": "Why is the sky blue?",
  "response_1": "Because molecules scatter blue light.",
  "response_2": "It’s just blue, trust me.",
  "preferred": 1
}
convert them to the prompt / chosen / rejected layout above before training; that’s what TRL’s DPOTrainer (and the reward model code later in this post) expects.
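A minimal sketch of that conversion with the `datasets` library (the file name `preferences_raw.jsonl` is a placeholder; column names match the example above):

```python
from datasets import load_dataset

raw = load_dataset("json", data_files="preferences_raw.jsonl", split="train")

def to_pairwise(example):
    # Map the "pick one of two" annotation onto chosen/rejected
    first_preferred = example["preferred"] == 1
    return {
        "prompt": example["prompt"],
        "chosen": example["response_1"] if first_preferred else example["response_2"],
        "rejected": example["response_2"] if first_preferred else example["response_1"],
    }

preference_dataset = raw.map(to_pairwise, remove_columns=["response_1", "response_2", "preferred"])
```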
Tokenization + Preprocessing
Here’s the actual preprocessing code I used in one of my DPO + SFT experiments with Mistral-7B:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", use_fast=True)
def preprocess(example):
    return tokenizer(
        example["prompt"] + "\n\n" + example["chosen"],
        truncation=True,
        padding="max_length",
        max_length=1024
    )
You can use `DataCollatorWithPadding`, but personally I go with fixed padding and truncate aggressively to avoid OOM errors during SFT, especially when using quantized models.
Code: LoRA + Trainer-based SFT
Here’s a minimal example of the exact setup I used to run SFT on a quantized 7B model with LoRA:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
import torch
# Load 4-bit quantized model
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    load_in_4bit=True,
    device_map="auto"
)
# Apply LoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
training_args = TrainingArguments(
    output_dir="./sft-checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
    num_train_epochs=3,
    warmup_steps=50,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer
)
trainer.train()
Pro Tip: I set `gradient_checkpointing=True` if I’m pushing model limits on A10s. Also, `bnb_4bit_use_double_quant=True` shaves a bit more memory off 4-bit loading.
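To make both concrete, here’s roughly how they plug into the load step above (a sketch; `bnb_4bit_compute_dtype` is my usual pick, adjust to your hardware):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # extra memory savings on 4-bit weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # or pass gradient_checkpointing=True to TrainingArguments
```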
4. Step 2: Training the Reward Model
Here’s the deal: your reward model can make or break PPO-style fine-tuning. A poorly trained reward model won’t just hurt alignment—it’ll give your PPO trainer garbage signals, and you’ll burn through hours of compute for nothing.
I’ve made that mistake before. Now, I train a clean, minimal reward model using a pairwise ranking loss and keep the architecture simple.
Dataset Format
The format here needs to be tight. I stick with JSONL that includes both chosen and rejected completions for the same prompt:
{"prompt": "Explain relativity.", "chosen": "Relativity is...", "rejected": "Einstein was a guy who..."}
You can convert your SFT dataset into this format easily if you already have upvote/downvote logs or preference annotations.
Code: Reward Model Architecture + Loss
I usually start with a distilled or small base model + scalar head for reward scoring.
Here’s what I used in a recent project:
from transformers import AutoModelForCausalLM
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
        self.value_head = nn.Linear(self.model.config.hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
        last_hidden = outputs.hidden_states[-1]
        # Score the last non-padding token of each sequence (with right padding,
        # position -1 would usually be a pad token)
        last_token_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        reward = self.value_head(last_hidden[batch_idx, last_token_idx])  # scalar reward per sequence
        return reward.squeeze(-1)
Reward Loss (Bradley-Terry / Pairwise Ranking)
This is the exact loss I use:
import torch.nn.functional as F
def compute_loss(chosen_reward, rejected_reward):
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
This simple loss has worked consistently for me. I’ve tried alternatives like hinge loss, but `logsigmoid` tends to converge more smoothly, especially when the preference signal is weak or noisy.
Trainer Setup (Simplified)
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./reward-model",
per_device_train_batch_size=4,
num_train_epochs=3,
learning_rate=1e-5,
logging_steps=20,
fp16=True
)
trainer = Trainer(
model=reward_model,
args=training_args,
train_dataset=reward_dataset,
tokenizer=tokenizer,
compute_metrics=None # usually reward models don't use accuracy here
)
trainer.train()
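One caveat: the stock Trainer doesn’t know what to do with chosen/rejected pairs, so the simplified setup above needs a thin subclass (or trl’s RewardTrainer) to actually apply the pairwise loss. A minimal sketch, assuming your collator produces the hypothetical `chosen_*` / `rejected_*` tensor names used below:

```python
from transformers import Trainer

class PairwiseRewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Score both completions with the reward model defined earlier
        chosen_reward = model(
            input_ids=inputs["chosen_input_ids"],
            attention_mask=inputs["chosen_attention_mask"],
        )
        rejected_reward = model(
            input_ids=inputs["rejected_input_ids"],
            attention_mask=inputs["rejected_attention_mask"],
        )
        loss = compute_loss(chosen_reward, rejected_reward)  # Bradley-Terry loss from above
        return (loss, {"chosen": chosen_reward, "rejected": rejected_reward}) if return_outputs else loss
```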
Heads-up: Be very careful with overfitting here. I always check reward score distributions on a held-out set to make sure the model isn’t just memorizing noisy preferences.
5. Step 3: Fine-Tuning via Human Feedback
“When the reward is fuzzy, the optimization better be sharp.”
Let’s not waste time with theory—you already know what PPO and DPO are. So I’ll walk you through how I actually implement both, when I choose one over the other, and the exact code I’ve used in production training runs.
Option A: PPO (Proximal Policy Optimization)
When do I use PPO?
Personally, I reach for PPO when I have a reward model that I trust—trained cleanly on real human preferences. This setup gives you more flexibility and fine-grained control over reward shaping. But heads up: PPO is slow and expensive. You’ll feel it when you’re running this on anything smaller than an A100.
Architecture Summary
- Model: LoRA or 4-bit quantized causal LM
- Reward: Scalar score from a pre-trained reward model
- Optimizer: KL-penalized policy gradient
PPO Code (from real use)
Here’s the actual PPO loop I’ve used with TRL’s `PPOTrainer`. It includes KL control, reward model integration, and W&B logging:
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
import torch
# Define PPO config
ppo_config = PPOConfig(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    learning_rate=1e-5,
    batch_size=4,
    mini_batch_size=1,
    gradient_accumulation_steps=4,
    log_with="wandb"
)
# Load tokenizer + policy model with value head (LoRA/4-bit recommended)
tokenizer = AutoTokenizer.from_pretrained(ppo_config.model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    ppo_config.model_name,
    load_in_4bit=True,
    device_map="auto",
    peft_config=lora_config  # reuse the LoraConfig from the SFT section
)
# Load reward model separately; PPOTrainer doesn't take it as an argument,
# you score the responses yourself and pass the rewards to step()
reward_model = RewardModel(base_model_name="mistralai/Mistral-7B-Instruct-v0.2").to("cuda")
reward_model.eval()
# Define the trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
)
# Training loop (train_dataloader yields batches of raw prompt strings)
for batch in train_dataloader:
    # PPOTrainer.step expects lists of per-example 1-D tensors
    query_tensors = [
        tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0).to("cuda")
        for prompt in batch["prompt"]
    ]
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=128)

    # Score prompt + response together with the reward model
    texts = [tokenizer.decode(torch.cat([q, r]), skip_special_tokens=True)
             for q, r in zip(query_tensors, response_tensors)]
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        reward_scores = reward_model(inputs.input_ids, inputs.attention_mask)
    rewards = [score for score in reward_scores]  # list of scalar tensors

    ppo_trainer.step(query_tensors, response_tensors, rewards)
Note: If you’re not logging KL, entropy, and rewards with W&B, you’re flying blind. Personally, I track KL divergence every 10 steps to avoid over-optimization.
Option B: DPO (Direct Preference Optimization)
You might be wondering: why mess with PPO if DPO skips the reward model entirely?
That’s the exact question I asked myself after burning too many hours wrangling reward models. And the answer is: DPO is simply faster and cleaner—if you already have human preference pairs.
When I use DPO
- When I’m fine-tuning with datasets like HH-RLHF, Anthropic HH, or OpenAssistant
- When I don’t want the complexity of a separate reward model
- When I need faster turnaround (especially during experiments)
DPO Code That Just Works
Here’s a minimal but real DPO setup. I’ve used this with both LLaMA-2 and Mistral with smooth training curves:
from trl import DPOTrainer
from transformers import AutoTokenizer, TrainingArguments
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=5e-6,
    logging_steps=10,
    output_dir="./dpo_output",
    fp16=True
)
trainer = DPOTrainer(
    model=model,  # PEFT or quantized model
    args=training_args,
    beta=0.1,  # Temperature for preference strength
    train_dataset=preference_dataset,
    tokenizer=tokenizer
)
trainer.train()
Beta matters. I typically start with `beta=0.1` and scale up to 0.3 if the model isn’t learning strong preferences. Think of it as sharpening the “decision boundary” between responses.
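For intuition on what beta is doing, here’s the DPO objective written out as a small function (a sketch of the standard formulation, not the internals of DPOTrainer; the log-prob inputs are assumed to come from scoring each completion under the policy and the frozen reference model):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of policy vs. frozen reference model for each completion
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Larger beta amplifies the margin, i.e. a sharper preference boundary
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```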
PPO vs DPO: What I Choose
| Scenario | My Pick |
|---|---|
| I have upvote/downvote data only | PPO |
| I have (prompt, response1, response2, preferred) | DPO |
| Need fast iteration | DPO |
| Want full control over reward shaping | PPO |
6. Evaluation & Logging
“Training’s cheap. Bad evaluation is expensive.”
You might’ve nailed PPO or DPO, but without solid evaluation, you’re basically tuning in the dark. I’ve personally made that mistake—trained a gorgeous-looking model (loss was dropping, rewards were rising)… only to find out later it was spitting out repetitive nonsense. Since then, I’ve stuck to a few reliable metrics that give me a clear signal if things are going off track.
Automatic Metrics I Always Track
1. Reward Score Distribution (Before vs. After Fine-Tuning)
If your median reward isn’t moving up after training, something’s broken—period. Here’s how I usually log this:
import matplotlib.pyplot as plt
def plot_reward_distribution(pre_rewards, post_rewards):
    plt.hist(pre_rewards, bins=50, alpha=0.5, label="Before FT")
    plt.hist(post_rewards, bins=50, alpha=0.5, label="After FT")
    plt.legend()
    plt.title("Reward Score Distribution Shift")
    plt.xlabel("Reward")
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()
2. KL Divergence
This is your early-warning system. If KL starts exploding, your model is over-optimizing to rewards and drifting from the base model. Personally, I try to keep it under 0.2 per batch on average.
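TRL’s PPOTrainer reports a KL estimate in its logged stats, but when I run my own checks I track something like this (a sketch; it assumes you’ve kept per-token log-probs from the policy and the frozen reference for the sampled responses, plus a padding mask):

```python
import torch

def mean_kl(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor, mask: torch.Tensor) -> float:
    # Simple estimator: average of log pi(token) - log pi_ref(token) over generated tokens
    kl_per_token = (policy_logprobs - ref_logprobs) * mask
    return (kl_per_token.sum() / mask.sum()).item()
```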
3. Entropy of Responses
Low entropy = repetitive outputs. I run this as a sanity check for response diversity:
import torch
def compute_entropy(logits):
    probs = torch.nn.functional.softmax(logits, dim=-1)
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(-1).mean()
    return entropy.item()
4. Task-Specific Metrics (if applicable)
If I’m fine-tuning for summarization, I’ll still run ROUGE and BLEU—but only to catch regressions, not as the main judge.
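The `evaluate` library is enough for that (a sketch; `preds` and `refs` are plain lists of strings from a held-out eval set, and you’ll need `pip install evaluate rouge_score`):

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=preds, references=refs)
print(scores)  # rouge1 / rouge2 / rougeL as floats
```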
Human Eval: The Gold Standard
This might surprise you: even noisy human preferences outperform nearly all automatic metrics when it comes to alignment. I’ve used two approaches here:
- Custom UIs (simple Flask or Gradio apps where annotators rank responses)
- OpenAI Evals-style JSON format for crowdworker annotation
Even just 50 manually ranked examples can uncover failure cases that your reward model misses.
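For the Gradio route, something this small already gets you useful data. A sketch, writing straight to a JSONL file in the chosen/rejected format from earlier (the file name and storage are placeholders; use whatever backend you trust):

```python
import json
import gradio as gr

def record_preference(prompt, response_a, response_b, preferred):
    # Append one ranked pair in the chosen/rejected format used for training
    record = {
        "prompt": prompt,
        "chosen": response_a if preferred == "A" else response_b,
        "rejected": response_b if preferred == "A" else response_a,
    }
    with open("preferences.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return "Saved."

demo = gr.Interface(
    fn=record_preference,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Textbox(label="Response A"),
        gr.Textbox(label="Response B"),
        gr.Radio(["A", "B"], label="Which response is better?"),
    ],
    outputs=gr.Textbox(label="Status"),
)
demo.launch()
```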
7. Scaling Tips & Tricks
“You don’t need a cluster if you use your memory wisely.”
Let me share what’s actually helped me squeeze performance out of modest setups—without babysitting the GPU every 10 minutes.
Quantization (bitsandbytes)
I almost always fine-tune 7B models in 4-bit using `bitsandbytes`. Here’s how I load them:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", quantization_config=bnb_config)
It cuts memory use drastically—you can get 7B on a single A100 with room to spare.
Flash Attention
If you’re not using `flash-attn`, you’re wasting cycles. On Mistral and LLaMA-2, enabling flash-attn easily shaved 25–35% off my training time.
Make sure your CUDA setup is compatible and pass `attn_implementation="flash_attention_2"` during model loading, as shown below.
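Here’s the loading pattern I mean (a sketch; bf16 is my usual choice on A100/H100):

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```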
Gradient Checkpointing + LoRA
Here’s the combo I rely on when I’m forced to train on 24GB or 40GB cards:
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
This lets me train full 7B models with long prompts on A10s without running out of memory every epoch.
Log Everything
If there’s one piece of advice I’d drill into every RLHF pipeline, it’s this:
Log early. Log often. Log everything.
I log:
- GPU memory usage (`nvidia-smi` every N steps)
- Reward progression (rolling average)
- KL divergence
- Entropy
- Number of degenerate outputs (e.g., blank or token-stuck completions)
You don’t want to wake up after a 10-hour run just to see your model forgot how to speak.
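Concretely, a small helper like this covers most of that list (a sketch; it assumes `wandb.init` has already been called and you compute the values elsewhere in your loop):

```python
import torch
import wandb

def log_step(step, reward_mean, kl, entropy, num_degenerate):
    wandb.log({
        "reward/rolling_mean": reward_mean,
        "kl": kl,
        "entropy": entropy,
        "degenerate_outputs": num_degenerate,
        "gpu_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }, step=step)
```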
8. Bonus: Deploying the Fine-Tuned Model
“A model that can’t be served is just a fancy `.bin` file.”
You’ve gone through SFT, tuned the reward model, survived PPO or DPO, ran your evaluations—and now it’s time to actually use the thing. Here’s how I usually deploy fine-tuned models without turning it into a DevOps nightmare.
Exporting the Model (LoRA or Fully Merged)
If you’re using LoRA, you’ve got two options:
Option 1: Merge adapters (for standalone inference)
Merging helps if you want portability or to avoid PEFT during inference.
from peft import PeftModel
# Load base + LoRA adapters
model = PeftModel.from_pretrained(base_model, "path_to_lora")
model = model.merge_and_unload()
model.save_pretrained("merged_model_dir")
Option 2: Keep it LoRA (for lightweight, faster loading)
Personally, I keep LoRA for internal use where I control the inference script.
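Keeping the adapter separate looks like this at load time (a sketch; the paths match the merge example above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "path_to_lora")  # adapter weights only, tens of MB
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
```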
Quantized Inference with `auto-gptq` or `vLLM`
If you care about fast response times and GPU efficiency, quantization is a must.
For `auto-gptq` (this assumes you’ve already quantized the merged checkpoint to GPTQ format):
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(
    "merged_model_dir",
    use_triton=True,
    device="cuda"
)
For `vLLM`:
If you’re scaling out inference with multiple concurrent users, `vLLM` is a game-changer. I’ve run 7B models on a single A100 and served 30+ concurrent users with sub-second latency.
python -m vllm.entrypoints.openai.api_server \
--model merged_model_dir \
--tokenizer merged_model_dir \
--quantization awq # or gptq; only pass this if the checkpoint was actually quantized that way
You’ll get an OpenAI-compatible endpoint (`/v1/completions`, `/v1/chat/completions`), ready to plug into your existing apps.
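A quick sanity check against that endpoint with the openai client (a sketch; it assumes the server above is running locally on vLLM’s default port 8000 and openai>=1.0):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key
response = client.chat.completions.create(
    model="merged_model_dir",  # vLLM names the served model after the --model path
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```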
Serving with FastAPI
If I’m not using vLLM
, I often wrap my model in a simple FastAPI
server for internal tools or testing.
from fastapi import FastAPI
from transformers import TextGenerationPipeline, AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("merged_model_dir")
model = AutoModelForCausalLM.from_pretrained("merged_model_dir")
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
app = FastAPI()
@app.post("/generate")
def generate(prompt: str):
    output = pipeline(prompt, max_new_tokens=200)[0]["generated_text"]
    return {"response": output}
Throw that behind a reverse proxy or Docker container and you’re ready to go.
9. Conclusion: Pulling It All Together
Let’s recap what we covered—this is the RLHF stack I’ve used across multiple projects:
- Supervised Fine-Tuning (SFT) — Get your base behavior right.
- Reward Model — Teach the model what you care about.
- PPO or DPO — Align it to human preferences.
- Evaluation — Trust nothing, verify everything.
- Deployment — Serve your model like a product.
Every project I’ve run with this pipeline had its own flavor—sometimes the reward model needed extra tuning, other times DPO alone got me 90% of the way there. What I’ve learned is: you need to adapt this to your data and users. There’s no universal config file that fits all.
