1. Introduction
“Open models won’t replace closed ones overnight — but they will quietly take over the corners that matter most.”
When I first got my hands on Dolly v2, I wasn’t expecting much. Another open-source LLM in a sea of many, right? But after some deep dives and fine-tuning experiments, I realized Dolly is actually just right for certain use cases — especially when you’re working in domains where full control over data, behavior, and deployment is non-negotiable.
What makes Dolly worth your time? For starters, it’s MIT-licensed — which means you can actually use it commercially without jumping through legal hoops. It’s instruction-tuned out of the box on the databricks-dolly-15k dataset, and more importantly, it’s lightweight enough that you don’t need a monster GPU cluster to get started.
I’ve personally fine-tuned Dolly for a financial Q&A assistant as well as a clinical summarization task — and in both cases, it handled domain-specific instructions surprisingly well, especially after a bit of targeted tuning. That’s what this guide is about.
Let me be crystal clear though — this isn’t some generic walkthrough. You won’t find fluffy definitions or long-winded lectures here. I’m not going to explain what a transformer is or how fine-tuning works. If you’re here, you already know that stuff.
What I will walk you through is how I fine-tuned Dolly, end to end — from setting up the environment to actually running inference on a tuned model that behaves the way I want it to.
2. Environment Setup
Before we even touch the model, let’s get the environment sorted — because trust me, if your setup isn’t solid, you’re going to run into silent failures that are frustrating to debug mid-training.
Frameworks I Used
Here’s what I used for my own fine-tuning pipeline:
- transformers — for model and tokenizer handling
- datasets — for loading and preprocessing the training data
- peft — to apply parameter-efficient tuning (LoRA)
- accelerate — for managing training across devices
- bitsandbytes — for 4-bit quantization (optional but very helpful for limited VRAM)
Hardware Considerations
I’ve done this on both a single A100 (40GB VRAM) and a more modest RTX 3090 (24GB). You can get away with less if you’re using QLoRA — I’ll show that in a bit. For RAM, 32GB is the floor if you’re working with multi-million row datasets. CPU doesn’t bottleneck unless you’re tokenizing huge datasets without multiprocessing.
If you’re thinking about trying this on a Colab or other free-tier GPU: it’s possible, but only with aggressive quantization and a small dataset.
Python Setup
I always recommend isolating environments to avoid dependency hell. I used venv for this, but conda works just as well.
Here’s the package list that worked for me — all version-pinned to avoid weird compatibility issues:
pip install transformers==4.39.1 \
datasets==2.19.0 \
peft==0.10.0 \
accelerate==0.27.2 \
bitsandbytes==0.43.0
Optional: QLoRA Setup for Low-End GPUs
If you’re working with a GPU under 24GB VRAM, I highly recommend using 4-bit QLoRA. It dramatically cuts memory usage, letting you fine-tune a 7B model on setups that otherwise wouldn’t cut it.
You don’t need anything beyond the packages above for this: bitsandbytes (already in the pinned list) handles the 4-bit quantization, and peft supplies the LoRA layers on top. Just make sure you load your model with the appropriate quantization config (don’t worry, I’ll show you the code when we get to the fine-tuning part).
3. Choosing the Right Dolly Model
“Pick the wrong foundation and the whole house leans, no matter how fancy the roof looks.”
I’ve tested both Dolly v1 and Dolly v2 in real workflows — and here’s the short version: v1 is a no-go for anything commercial. Its instruction data (the Stanford Alpaca dataset) was generated with an OpenAI model, which makes it legally murky. Great for experiments, but the moment you’re building anything that touches a real product or client-facing app, you’re stepping into terms-of-service and non-commercial-license quicksand.
So what did I go with?
Dolly v2 — specifically, databricks/dolly-v2-7b. It’s clean, commercially viable (MIT-licensed), and trained on Databricks’ own open instruction dataset. I’ve used it as a base for instruction-following models in finance, healthcare, and even internal tooling — and it consistently performs well after fine-tuning. Out of the box, it’s decent, but not ChatGPT-level. That’s fine. We’re here to teach it what you care about.
You might be wondering: “Why the 7B version?”
Honestly? It’s the sweet spot. Here’s how I see it:
- The 3B model is faster and lighter — great for experiments or tight environments, but it starts to break down on anything nuanced.
- The 12B version is powerful but heavy — I tried loading it on a 24GB GPU and it needed serious memory optimization gymnastics.
The 7B model, on the other hand, gives you solid generalization and runs well on a single high-end GPU (like a 3090 or A100). That balance makes it ideal for LoRA or QLoRA fine-tuning without cutting corners.
A quick note on tokenizer compatibility:
One thing I’ve seen folks overlook — Dolly v2 is built on EleutherAI’s Pythia, so it uses a GPT-NeoX-style BPE tokenizer, not the SentencePiece tokenizer LLaMA uses. If you’re preprocessing data with a mismatched tokenizer, you won’t always get errors… but you will get garbage outputs during inference. I learned that the hard way. Always load the tokenizer directly from databricks/dolly-v2-7b and check for EOS padding behavior.
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "databricks/dolly-v2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
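Here’s the quick check I run right after loading, as a minimal sketch: in my runs this tokenizer ships without a pad token, so I point padding at EOS before doing any batched tokenization.
# Sanity-check special tokens (values can differ across repo revisions)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token)  # often None for this tokenizer in my experience

# Reuse EOS as the pad token so batched tokenization and generation behave
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id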
4. Preparing Your Dataset
I’ll say this upfront — getting the dataset right is half the job. I’ve spent more time cleaning and formatting data than actually training the model. And in my experience, Dolly’s performance is ridiculously sensitive to prompt structure, so you want to get this part dialed in from the start.
Format Expectations
Dolly expects each example to follow an instruction-tuned schema. Think of it like a structured conversation between user and model. Here’s what I use:
{
"instruction": "Summarize the following legal contract...",
"context": "Legal contract text here...",
"response": "Here is the summary..."
}
Personally, I keep it simple and consistent. I’ve found that even small inconsistencies — like adding a period in one instruction but not the next — can introduce noise in smaller fine-tunes.
Real Examples (Not Just Dummy Data)
When I was prepping my own data, here’s the kind of sample I worked with:
{
"instruction": "Translate this policy document into simple English.",
"context": "Pursuant to Section 8.2 of the Act, the lessee...",
"response": "According to Section 8.2, the person renting the property must..."
}
And another:
{
"instruction": "Generate a customer support reply for the following complaint.",
"context": "I’ve been waiting for two weeks and still no response on my refund...",
"response": "Hi there, I truly apologize for the delay. Let me fix this immediately..."
}
You get the idea. Keep it tight, and always mirror the kind of interaction you’re optimizing for.
Preprocessing Steps I Actually Use
Here’s the pipeline I usually follow before fine-tuning:
- Edge case filtering: I remove any empty responses or instructions that are too vague. If it wouldn’t make sense to a human, it’s not going to help the model.
- Truncation logic: For long inputs, especially legal and medical texts, I truncate context to a safe max length. You don’t want it silently getting chopped off in the middle of a sentence.
- Tokenizer choice: You might be tempted to assume any tokenizer works — I learned the hard way that using the wrong one (e.g., something mismatched with Dolly’s base) will completely break your formatting. For Dolly, I’ve stuck with the tokenizer that ships with the model (loaded via AutoTokenizer from the Dolly v2 repo) and haven’t had issues. A quick sketch of the filtering and truncation steps is right below.
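Here is roughly what that filtering and truncation pass looks like in my pipeline, as a minimal sketch: it assumes your examples carry the instruction/context/response fields shown earlier, and the 400-token context budget is just a placeholder.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b")
dataset = load_dataset("json", data_files="your_dataset.jsonl")["train"]

MAX_CONTEXT_TOKENS = 400  # placeholder budget; leave room for the instruction and response

def keep(example):
    # Drop empty responses and one-or-two-word instructions a human couldn't act on
    return len(example["response"].strip()) > 0 and len(example["instruction"].split()) >= 3

def truncate_context(example):
    ids = tokenizer(example["context"], truncation=True, max_length=MAX_CONTEXT_TOKENS)["input_ids"]
    example["context"] = tokenizer.decode(ids, skip_special_tokens=True)
    return example

dataset = dataset.filter(keep).map(truncate_context)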
Tokenization in Practice
Here’s what I typically run before fine-tuning:
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b")  # match the base model you're fine-tuning

# Load custom dataset
dataset = load_dataset("json", data_files="your_dataset.jsonl")

# Prompt formatting and tokenization
def tokenize(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Context:\n{example['context']}\n\n### Response:\n"
    # Train the causal LM on prompt + response together; labels mirror input_ids
    # (the model shifts them internally for next-token prediction)
    full_text = prompt + example["response"] + tokenizer.eos_token
    tokens = tokenizer(full_text, truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize)
If you’re experimenting with longer contexts, tweak the max_length carefully — and don’t forget to test how much your model can actually handle at inference. I’ve been burned by silent truncation more than once.
Optional But Worth Testing
Depending on your dataset:
- Data balancing — I once ran into a case where 90% of the data were “summarize”-type instructions. The model overfit hard to that task. A quick sampling fix solved it (see the sketch right after this list).
- Few-shot formatting — On one project, I got better results by injecting 1–2 prior examples in the context (few-shot style). Doesn’t always help, but worth trying if your model’s responses feel shallow.
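For that balancing fix, here is a minimal sketch of what I do. The task label is a hypothetical heuristic (the first word of each instruction), and `examples` is assumed to be your list of instruction/context/response dicts.
import random
from collections import defaultdict

random.seed(42)

# Bucket examples by a crude task label (hypothetical heuristic: first word of the instruction)
buckets = defaultdict(list)
for ex in examples:
    label = (ex["instruction"].split() or ["other"])[0].lower()
    buckets[label].append(ex)

# Cap each bucket at the size of the second-largest one so no single task dominates
sizes = sorted(len(items) for items in buckets.values())
cap = sizes[-2] if len(sizes) > 1 else sizes[-1]

balanced = []
for items in buckets.values():
    random.shuffle(items)
    balanced.extend(items[:cap])
random.shuffle(balanced)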
5. Fine-Tuning Dolly with PEFT (QLoRA or LoRA)
“Sometimes, less is not just more — it’s smarter.”
Full model fine-tuning? I’ve been down that road. It’s GPU-hungry, overkill for most instruction tasks, and frankly, not worth the time when you’re working with solid base models like Dolly. That’s why I now default to parameter-efficient fine-tuning — especially LoRA and QLoRA — for nearly all my Dolly workflows.
I’ll walk you through the exact config I’ve used to get good results without melting my GPU stack.
When I use QLoRA vs. LoRA
- If I’m working on my local 24GB 3090, QLoRA is my go-to. I can squeeze Dolly-7B fine-tuning into ~12GB of VRAM with bnb_4bit.
- On cloud setups (A100s or T4s), I lean toward standard LoRA — faster, cleaner logs, and no quantization headaches.
If you’re just prototyping or iterating on task-specific prompts, LoRA gives you that rapid turnaround without needing a quantized base model. But if memory is your bottleneck, QLoRA can be a game-changer.
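Here is the 4-bit load I promised back in the setup section, as a minimal sketch of how I bring the base model in for QLoRA; the dtype choices are what has worked for me, not the only valid combination.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights via bitsandbytes
    bnb_4bit_quant_type="nf4",              # the quant type QLoRA was introduced with
    bnb_4bit_use_double_quant=True,         # small extra memory saving
    bnb_4bit_compute_dtype=torch.bfloat16,  # or torch.float16 on GPUs without bf16
)

model_id = "databricks/dolly-v2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
If you go this route, wrapping the quantized model with peft’s prepare_model_for_kbit_training() before applying the LoRA config is worth it; it takes care of details like gradient checkpointing and input-embedding gradients for you.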
PEFT Configuration I Personally Use
This config has worked well across tasks — from financial document summarization to internal support chatbot tuning.
from peft import LoraConfig, get_peft_model, TaskType
peft_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=["query_key_value"],  # Dolly v2's GPT-NeoX blocks have no q_proj/v_proj modules
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, peft_config)
- r=8 is a solid default. I’ve tested values up to 16, but didn’t notice much gain unless I had a ton of task-specific data.
- lora_alpha=32 usually balances learning speed with stability.
- lora_dropout=0.05 helps generalize better, especially if your dataset is noisy or diverse.
I’ve also had good results targeting just the fused attention projection (query_key_value in Dolly’s GPT-NeoX blocks). You can go deeper (e.g., include the attention output dense layer or the MLP projections dense_h_to_4h and dense_4h_to_h), but I’d only recommend that if you’re seeing underfitting during eval.
Training Arguments That Have Actually Worked for Me
Here’s a setup I used recently while training on a dataset of 50k custom instructions:
from transformers import TrainingArguments
training_args = TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
warmup_steps=100,
max_steps=1000,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
save_steps=200,
output_dir="./dolly-lora-out",
save_total_limit=2,
report_to="none"
)
- I tune max_steps rather than num_epochs — gives me more control when scaling across datasets of different sizes.
- gradient_accumulation_steps helps simulate a larger batch size — especially critical on 1-GPU setups.
- learning_rate=2e-4 has been a sweet spot for me with LoRA on Dolly. Lower and it stagnates, higher and it spikes loss.
Trainer vs. SFTTrainer (from trl)
I’ve used both. Here’s my take:
- If you’re sticking to causal LM objectives with prompt-response pairs, the regular Trainer does the job.
- But if you want smoother handling of structured formatting (especially with SFT-like tasks), SFTTrainer from trl simplifies a lot.
Here’s what it looked like for one of my runs using SFTTrainer:
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",  # assumes the dataset exposes one pre-formatted prompt+response column
    packing=True  # I usually keep this True for faster training
)
trainer.train()
One thing I’ve learned: setting packing=True in SFTTrainer speeds up training a lot — especially if your samples are short or uneven in length. But keep an eye on truncation if you’re dealing with long-form tasks.
6. Training Deep-Dive
“The training loop is where good models go to thrive… or quietly fall apart.”
Over the last few months, I’ve fine-tuned Dolly on everything from customer support flows to internal knowledge bases. If there’s one thing I’ve learned the hard way, it’s this: training isn’t just about fitting a model to data — it’s about setting up the right scaffolding to see what’s going wrong before it’s too late.
Let me walk you through how I’ve set up logging, precision, checkpointing, and how I handled large-scale datasets — all based on actual runs, not theory.
Logging: I Never Train Without It
I’ve used both Weights & Biases and TensorBoard, but these days I lean heavily on wandb — especially for anything longer than a few hundred steps. I want real-time plots of loss curves, LR schedules, and GPU utilization without digging through logs.
Here’s how I plug it in:
pip install wandb
import wandb
wandb.init(project="dolly-lora", name="run-001")
Then just pass report_to="wandb" in your TrainingArguments:
training_args = TrainingArguments(
...,
report_to="wandb",
logging_steps=10,
)
Pro tip: I always log the tokenizer version and dataset hash as wandb.config so I can trace weird training behavior later. Saved me more than once.
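In practice that is just a couple of extra lines at init time; here is a minimal sketch, and the file-hash helper is my own convention, not a wandb feature.
import hashlib

import transformers
import wandb

def file_sha256(path: str) -> str:
    # Hash the raw dataset file so a run is traceable to the exact data it saw
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

wandb.init(
    project="dolly-lora",
    name="run-001",
    config={
        "transformers_version": transformers.__version__,
        "tokenizer": "databricks/dolly-v2-7b",
        "dataset_sha256": file_sha256("your_dataset.jsonl"),
    },
)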
Mixed Precision (fp16 vs bf16)
I usually default to fp16, especially on consumer GPUs like the 3090 — the memory savings are massive. But when I’m on A100s or H100s, I switch to bf16 — it’s more stable, less likely to blow up during backprop.
Just toggle with:
fp16=True # or bf16=True if your hardware supports it
One thing I’ve noticed: bf16 tends to handle very large datasets and longer context lengths more gracefully. If you’re seeing NaNs in loss out of nowhere on A100s, try switching from fp16 to bf16.
Checkpointing Like a Pro (Save Only the Best)
Here’s the deal: early on, I used to save checkpoints every N steps. But when fine-tuning with LoRA, you’re usually chasing small, incremental gains — and saving every checkpoint just clogs up disk space.
Instead, I now monitor validation loss and save only the best-performing LoRA adapter:
TrainingArguments(
    ...,
    save_strategy="steps",
    save_steps=200,
    evaluation_strategy="steps",  # load_best_model_at_end needs eval on the same schedule
    eval_steps=200,               # (and an eval_dataset passed to the Trainer)
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
And yeah — I always push to the Hub once I’m happy with the adapter weights. It’s just easier to version and share.
trainer.push_to_hub(commit_message="dolly-finetuned-support-bot-v1")  # the repo itself comes from hub_model_id in TrainingArguments
Large Dataset Training Tricks
When I trained Dolly on 500k+ examples, I ran into the usual suspects: out-of-memory errors, endless epochs, exploding loss.
Here’s how I got around it:
1. Gradient Accumulation
I kept the batch size small (per_device_train_batch_size=4) and stacked gradient_accumulation_steps=8. It’s not just about memory — it actually improved generalization for longer contexts.
TrainingArguments(
...,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
)
2. Streaming Datasets
This might surprise you, but I started using Hugging Face’s streaming datasets for ultra-large corpora — no local preprocessing, no RAM bottlenecks.
from datasets import load_dataset
dataset = load_dataset("json", data_files="s3://my-huge-dataset.jsonl", streaming=True)
I usually apply filtering and tokenization on-the-fly with a generator pipeline. It slows down training slightly, but it keeps memory low and code clean.
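Here is roughly what that on-the-fly step looks like, as a minimal sketch with a local JSONL path standing in for the real source: with streaming=True you get an IterableDataset, so the filter and map calls run lazily as training consumes batches.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b")
stream = load_dataset("json", data_files="your_dataset.jsonl", streaming=True)["train"]

def to_features(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Context:\n{example['context']}\n\n### Response:\n"
    tokens = tokenizer(prompt + example["response"] + tokenizer.eos_token, truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Both calls are lazy; nothing is materialized until the training loop iterates the stream
stream = stream.filter(lambda ex: len(ex["response"].strip()) > 0)
stream = stream.map(to_features, remove_columns=["instruction", "context", "response"])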
Side note: If your dataset is sharded across files, use the iterable arrow format with a cache directory. Otherwise, you’ll feel it on the I/O side.
7. Evaluation: Did It Actually Improve?
This is where things get real.
I’ve seen too many people skip proper evaluation after fine-tuning — and just assume the model’s better because “it trained for 3 epochs and loss went down.” But that’s just not enough. If you’re deploying anything remotely serious, you need to see the gains, especially in your actual domain.
Metrics That Actually Help
For most of my projects, especially summarization or translation-style tasks, I’ve used:
- ROUGE: decent for catching content overlap, especially for summaries.
- BLEU: okay for structured translations or templated outputs.
- Custom string match scores: I’ve built task-specific metrics for things like checklist completion or keyword coverage.
- Manual inspection: This still matters a lot. I always set aside 30–50 examples and go through them side by side.
That last one has saved me more than once — I’ve caught hallucinations, incomplete answers, and awkward formatting that wouldn’t have shown up on any automated metric.
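For the custom string-match scores mentioned above, here is the shape mine usually take, as a minimal keyword-coverage sketch; the list of expected keywords is something I tag onto my eval set by hand, not part of the Dolly format.
def keyword_coverage(output: str, expected_keywords: list[str]) -> float:
    # Fraction of must-mention keywords that actually show up in the model output
    if not expected_keywords:
        return 1.0
    hits = sum(1 for kw in expected_keywords if kw.lower() in output.lower())
    return hits / len(expected_keywords)

# Example: score one response against the keywords tagged for that eval prompt
score = keyword_coverage(
    "We will refund the order and email a confirmation within 2 business days.",
    ["refund", "confirmation", "business days"],
)
print(f"coverage = {score:.2f}")  # -> coverage = 1.00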
Base vs. Fine-Tuned: A Side-by-Side Reality Check
You might be wondering — how much of a difference does fine-tuning really make? Here’s a quick example from one of my internal evaluations (support bot use case):
Prompt:
### Instruction:
Respond to this customer complaint politely.
### Context:
I placed an order two weeks ago and still haven’t received anything. No updates either. Very disappointed.
### Response:
Base Dolly output:
Sorry about the inconvenience. Your order will arrive soon.
Fine-tuned Dolly output (after 2 epochs on a domain-specific customer support dataset):
Hi there, I’m really sorry to hear about the delay. I completely understand your frustration. Let me check on your order status right away and get this resolved for you.
The second one sounds more human, includes empathy, and fits the domain tone. That’s not something you’ll see in a loss curve — you have to test it with real examples like this.
Manual Eval Template I Use
When I’m in evaluation mode, I’ll often just spin up a quick notebook or use a Google Sheet with:
| Prompt | Base Output | Fine-tuned Output | Score (1–5) | Notes |
|---|---|---|---|---|
| Summarize terms | Legalese copy | Clean, clear points | 4.5 | Big improvement |
| Generate reply | Robotic | Natural and warm | 5 | Perfect |
| Translate tech doc | Meh | Still a bit off | 3 | Needs more tuning |
I’ll do this across 20–50 varied prompts — it gives a much better sense of how the model generalizes.
Zero-shot vs. Fine-tuned: What I’ve Seen
In my experience, base Dolly does okay on short, simple prompts — especially if you structure them cleanly. But once you move into domain-specific tasks (like summarizing contracts, customer service replies, healthcare Q&A), zero-shot starts to fall short fast.
Fine-tuning doesn’t just improve the accuracy — it brings consistency. Fewer hallucinations, more predictable structure, and better alignment with tone and format. That’s the difference that makes it usable in real-world workflows.
8. Inference Pipeline
“Training gets you the brains. Inference shows whether they’ve actually learned something useful.”
After fine-tuning Dolly with PEFT, the first thing I want is to see what it actually learned. I’m not talking about accuracy metrics — I want to see how it handles my task-specific prompts in the wild.
Here’s the exact inference setup I use to test outputs — from model loading to structured prompt formatting to batching.
Load Your Fine-Tuned Dolly (LoRA or Full)
I always save and load just the adapter weights when using LoRA. It’s cleaner and much faster for deployment.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-7b")
# Load fine-tuned LoRA weights
model = PeftModel.from_pretrained(base_model, "./dolly-lora-out")
model.eval()
If I’m using the full fine-tuned model (rare these days), I just skip PeftModel and load directly with from_pretrained() on my fine-tuned folder.
Simple Inference Script (With Batching)
During QA testing, I typically batch 4–8 inputs at a time. Here’s the core of my inference loop:
import torch

prompts = [
    "### Instruction:\nSummarize this contract.\n\n### Context:\nContract starts here...\n\n### Response:",
    "### Instruction:\nExtract all deadlines from the document.\n\n### Context:\nDocument here...\n\n### Response:",
    # Add more prompts here
]

# Batched tokenization needs a pad token; reuse EOS if none is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,  # greedy decoding; temperature/top_p only matter when sampling is on
    )

responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for r in responses:
    print(r)
I usually disable sampling for production-like inference (do_sample=False), unless I’m deliberately testing creativity or open-endedness.
Prompt Formatting Best Practices
This part makes or breaks Dolly.
You can throw whatever data you want at training — but if your prompts aren’t structured consistently, Dolly won’t generalize properly.
Here’s the format I personally use for all instruction tuning:
prompt = """### Instruction:
Summarize this contract.
### Context:
<insert contract text here>
### Response:"""
Why this format?
- Consistency: During fine-tuning, I always use this structure — Instruction → Context → Response.
- Clear delimiters: These headers act as “anchors” — they help the model latch onto the task even with complex inputs.
- Easy to template: I’ve reused this format across domains — legal, healthcare, software logs — and it just works (tiny helper sketch below).
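That templating is nothing more than a small helper in my code, shown here as a minimal sketch; the function name is my own convention, not anything Dolly-specific.
def build_prompt(instruction: str, context: str = "") -> str:
    # Mirrors the Instruction -> Context -> Response structure used during fine-tuning
    context_block = f"\n\n### Context:\n{context}" if context else ""
    return f"### Instruction:\n{instruction}{context_block}\n\n### Response:\n"

prompt = build_prompt("Summarize this contract.", "Contract starts here...")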
You might be tempted to get fancy with roleplay-style prompts. I’ve tested them too, but Dolly performs best with clear instruction-response framing — especially when using LoRA.
9. Tips from Experience
“Everything works… until it doesn’t. That’s when your real debugging skills kick in.”
I’ve fine-tuned Dolly enough times now to spot the usual landmines — and believe me, there are a few that can quietly break your training while looking perfectly fine on the surface. Here’s what I’ve learned from my own experiments:
Watch Out For
Token Limit Truncation
If your prompts or responses are long, Dolly (especially the smaller variants) will start clipping text without warning. You might not notice it during training — but during inference, outputs will look… incomplete.
What’s helped me: setting an explicit max_length during tokenization and verifying input/output token counts with tokenizer(prompt, return_length=True).
Overfitting to Short Responses
This one bit me early on. I fine-tuned with a dataset that had a lot of short, one-liner answers — and Dolly started copying that behavior, even when prompts deserved richer outputs.
What I do now: I always inspect the response length distribution before training. If needed, I’ll augment with longer examples or penalize short completions with a custom loss wrapper.
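The inspection itself is nothing fancy; here is a minimal sketch of the token-length histogram I glance at before committing to a run, assuming the tokenizer and dataset objects from the preprocessing section.
from collections import Counter

# Token length of every response in the training split
lengths = [len(tokenizer(ex["response"])["input_ids"]) for ex in dataset["train"]]

# Coarse histogram in 32-token buckets, plus the share of very short answers
buckets = Counter((l // 32) * 32 for l in lengths)
for start in sorted(buckets):
    print(f"{start:>4}-{start + 31:<4} {buckets[start]}")
print("share under 16 tokens:", sum(l < 16 for l in lengths) / len(lengths))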
Training Instability
Every once in a while, especially with mixed-precision or aggressive learning rates, Dolly just… spirals. Loss spikes, outputs go off the rails.
Fixes that worked for me:
- Use gradient_checkpointing=True
- Drop the learning rate: if my usual 2e-4 starts spiking the loss, stepping down toward 2e-5 has stabilized runs for me
- Gradually warm up instead of starting cold
Lessons I’ve Learned (The Hard Way)
Good Prompts = Better Model
This might sound obvious — but I’ve seen huge quality jumps by just reworking prompt formats. Adding clear Instruction, Context, and Response blocks trained Dolly to give more structured outputs.
Example prompt I use consistently:
prompt = """### Instruction:
What are the legal implications of breach of contract?
### Context:
The agreement specifies...
### Response:
"""
Even small tweaks like “###” markers made a noticeable difference in generation quality.
Smaller, Cleaner Datasets Sometimes Win
There was a point where I threw 80K noisy samples at Dolly and got worse results than with 4K hand-curated ones. I’ve learned that quality trumps size, especially when fine-tuning on smaller models with limited context windows.
If your data is messy, trimming down might actually help more than expanding.
The Tokenizer Really Matters
You might be wondering: how much can a tokenizer screw things up? Answer: a lot.
One time, I loaded the wrong tokenizer (a generic GPT one) instead of Dolly’s specific tokenizer — training didn’t throw errors, but inference was nonsense.
Now, I always sanity-check tokenization with something like:
tokens = tokenizer(prompt)["input_ids"]
print(tokenizer.convert_ids_to_tokens(tokens[:10]))
If those tokens don’t make sense, stop right there.
10. Deploying Your Fine-Tuned Dolly
“A model that can’t serve isn’t a model — it’s just a GPU souvenir.”
Once I’ve got a solid fine-tuned Dolly model, deployment becomes the next battlefield. And trust me, you’ve got options — I’ve gone from lightweight local inference to pushing full setups into production-grade APIs. Here’s how I approach it.
Convert to GGML/GGUF for Local Inference with llama.cpp (Optional)
Now, this step is optional — but if you’re running inference on laptops or edge devices, quantization + llama.cpp is a game-changer. I’ve personally used it to run Dolly on a MacBook Pro with surprisingly decent performance.
Step 1: Convert to GGUF (via llama.cpp’s converter script)
Install the conversion tools:
pip install huggingface_hub
git clone https://github.com/ggerganov/llama.cpp
Then run the conversion on your LoRA-merged model. Dolly v2 is GPT-NeoX-based, so use llama.cpp’s Hugging Face-to-GGUF converter rather than the LLaMA-specific convert.py, and quantize as a second step (script and tool names have moved around between llama.cpp versions, so check your checkout):
python3 llama.cpp/convert-hf-to-gguf.py ./merged-dolly \
  --outtype f16 \
  --outfile ./dolly-f16.gguf
./llama.cpp/quantize ./dolly-f16.gguf ./dolly-q4_0.gguf q4_0
You’ll need to merge the LoRA adapter into the base model first (I do this with the PeftModel’s merge_and_unload() method).
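Here is the merge step as I run it, a minimal sketch; ./dolly-lora-out is the adapter directory from training, and ./merged-dolly is just the output folder name used in the conversion command above.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./dolly-lora-out")

# Fold the LoRA weights into the base model and drop the adapter wrappers
merged = model.merge_and_unload()
merged.save_pretrained("./merged-dolly")

# Ship the tokenizer alongside the merged weights so the converter can find it
AutoTokenizer.from_pretrained("databricks/dolly-v2-7b").save_pretrained("./merged-dolly")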
Step 2: Run with llama.cpp
Once you have the .gguf file, Dolly can run like a charm on llama.cpp. Just remember: performance will depend heavily on quantization level and system RAM.
Push to Hugging Face Hub (with LoRA Adapters Too)
This might surprise you: you don’t need to upload the full model weights. If you’re using LoRA, uploading just the adapter suffices.
Here’s what I usually push:
# Save tokenizer and adapter
tokenizer.save_pretrained("./dolly-lora-hf/")
model.save_pretrained("./dolly-lora-hf/")
Then login and push:
from huggingface_hub import notebook_login
notebook_login()
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./dolly-lora-hf/")
tokenizer.push_to_hub("your-username/dolly-lora")
model.push_to_hub("your-username/dolly-lora")
This lets others (or you) pull the LoRA adapter and use it with PeftModel.from_pretrained() — lightweight and reproducible.
Simple REST API with FastAPI or text-generation-inference
When I need to expose Dolly to downstream apps or dashboards, I usually wrap it with FastAPI. It’s dead simple and does the job.
Here’s a minimal FastAPI endpoint I’ve used:
from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

app = FastAPI()

# Adapter-only repo: load the base model and stack the LoRA weights on top
base = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-7b", torch_dtype="auto").to("cuda")
model = PeftModel.from_pretrained(base, "your-username/dolly-lora")
tokenizer = AutoTokenizer.from_pretrained("your-username/dolly-lora")
@app.post("/generate")
async def generate(request: Request):
data = await request.json()
prompt = data["prompt"]
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}
Or if you’re scaling, look into text-generation-inference — I’ve used it for higher-traffic setups. Just know that Dolly’s tokenizer and quantization can throw subtle issues in TGI unless you pin versions carefully.
11. Conclusion: What Worked, What Didn’t, and What’s Next
So — here’s what we walked through.
I started with an off-the-shelf Dolly model, ran a LoRA fine-tuning pipeline, troubleshot some of the typical quirks (tokenization, overfitting, unstable gradients), and ended up with a custom-tuned model that actually performs well on my target tasks.
If you’re wondering whether Dolly is production-ready — the answer is: it depends.
When Not to Use Dolly
Let me be honest — Dolly has its limits.
If you’re dealing with very nuanced generative tasks (e.g., legal reasoning, code generation, long-form summarization), or if your project absolutely demands top-tier fluency and depth, models like LLaMA 2, Mistral, or Mixtral will outperform it. They’re just trained on larger, more diverse corpora and benefit from newer architectures.
Personally, I’ve used Dolly for use cases like:
- FAQ-style bots
- Short-form knowledge assistants
- Structured prompt-response formats
And in these scenarios, it holds its own — especially when the prompt formatting is sharp and the dataset is well-defined.
What You Should Try Next
Here’s the deal: if you’re holding back on experimenting with your own domain-specific datasets, don’t. The setup isn’t as heavy as people think — especially with parameter-efficient fine-tuning like LoRA.
A few parting suggestions:
- Try fine-tuning Dolly on your internal support tickets, documentation, or chat logs.
- Experiment with prompt-engineering tricks before you jump into full-scale training.
- If you’re deploying, test both full-model and adapter-based inference pipelines — sometimes you can save serious GPU time without giving up performance.
Let’s Keep It Real
This guide wasn’t meant to romanticize Dolly — just to give you a real-world look at what it can and can’t do. Hopefully, this gives you a clearer picture than most of the vague tutorials floating around.
If you’ve made it this far — I’d seriously encourage you to fire up a Colab or spin up your local setup and try this on something that matters to your workflow.
That’s where the real insights come.
