Fine-Tune the Donut Model: A Practical Guide

1. Why Fine-Tune Donut in the First Place?

“All models are wrong, but some are useful.” — George Box probably wasn’t thinking about Donut when he said this, but the idea still holds.

I’ve used the naver-clova-ix/donut-base model across multiple real-world projects, and while it’s impressive out-of-the-box, it doesn’t generalize well to custom document layouts.

If your data looks anything like government-issued IDs from different countries, receipts with localized tax rules, or vendor-specific invoices, you’ve probably seen this too — the pre-trained model just starts hallucinating fields or misinterpreting layouts entirely.

In my case, I hit walls when dealing with:

  • Receipts in non-English formats (e.g., Japanese + English mix),
  • Invoices with unique fields like “Job ID” or multi-line service descriptions,
  • Scanned ID cards with poor contrast or skewed text blocks.

What really pushed me to fine-tune was the structured output. Donut is OCR-free, which is a blessing, but it also means the model learns to generate your target fields directly, in exactly the schema it saw during training.

If your document format deviates even slightly from the training data, you’ll start seeing inconsistent or incorrect field mappings. It’s not just a drop in accuracy — it’s a semantic mismatch.

And let’s be real: post-processing Donut’s incorrect outputs is a mess. It’s far more efficient to spend time fine-tuning than trying to patch it up downstream.


2. Setup: Your Local + GPU or Cloud Environment

Let me save you some time — Donut isn’t lightweight. You’re going to need a decent GPU setup if you want to fine-tune it without running into memory bottlenecks.

Personally, I’ve run most of my training on a single A100 or 2x V100s. But I’ve also tested it on consumer-grade GPUs like the 3090 — it works, but you’ll need to dial down the batch size aggressively.

Here’s what worked best for me:

  • Python: 3.9
  • PyTorch: 2.0+ (1.13 also worked but I had weird CUDA issues occasionally)
  • Environment: I prefer conda, especially when juggling multiple vision + NLP dependencies.
  • Donut requires: transformers, datasets, accelerate, and of course, donut.
# Core setup for Donut fine-tuning
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/clovaai/donut.git
pip install datasets torchvision accelerate

Heads-up from my experience:

  • If you’re using CUDA 11.8+, make sure your PyTorch + torchvision versions are compatible. I had annoying crashes from mismatched versions.
  • On some setups, donut tries to install its own version of transformers, which can conflict with recent Hugging Face features — pin versions deliberately.

One thing I always do: test inference on the base model before starting fine-tuning. It helps confirm that your setup is CUDA-compatible, and everything loads correctly before you burn time on dataset prep.
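
Here's the kind of quick check I mean, as a minimal sketch: the blank image and the explicit "<s>" start token are just placeholders to prove the stack loads and generates, not anything specific to your documents.

# Sanity check: load the base checkpoint and run one tiny generation.
# The blank image and the explicit "<s>" start token are placeholders.
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

image = Image.new("RGB", (1920, 2560), color="white")  # any document image works here
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
decoder_input_ids = processor.tokenizer(
    "<s>", add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

with torch.no_grad():
    outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=32)

print(processor.batch_decode(outputs, skip_special_tokens=True))

If this runs without CUDA or loading errors, you're clear to move on to data prep.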


3. Understand Your Data Format: Donut Needs JSON + PNG

“If you feed Donut the wrong ingredients, don’t expect a perfect bake.”

That might sound a bit dramatic, but here’s the deal — Donut doesn’t work like traditional OCR-based models. You’re not just throwing in an image and hoping it reads the text line by line.

Instead, it’s closer to a structured image-to-sequence task. Think of it like captioning a document — except the caption is a JSON string representing your fields.

From my experience, the format of the data is where most people trip up. I did too, early on.

Images

Your images should be:

  • In PNG format (Donut expects it, and it’s less lossy than JPEG),
  • Resized to 960px width while maintaining aspect ratio (this is baked into the processor),
  • Clean — that means deskewed, cropped, and ideally without extra padding.

Labels (This is where the magic — and frustration — happens)

Labels aren’t just keys and values. They need to be:

  • Serialized JSON representing the document structure,
  • Written in a consistent schema that your model will learn to generate.

Let me give you a real example. I had to fine-tune on Indian GST invoices — totally different format than what the original model saw. Here’s what one of my structured labels looked like (after a few iterations of trial and error):

{
  "vendor": "ABC Distributors",
  "gst_number": "29ABCDE1234F2Z5",
  "invoice_date": "2024-10-04",
  "items": [
    {"description": "Rice 25kg", "qty": "2", "price": "1250"},
    {"description": "Wheat Flour", "qty": "3", "price": "900"}
  ],
  "total": "3400"
}

Tip from the trenches:

You’ll want to flatten this JSON into a string so Donut can learn to decode it token by token. That’s what I do in preprocessing.

Here’s a sample Python function I used to turn annotated data into Donut-compatible labels:

import json

def build_donut_label(annotation_path):
    with open(annotation_path, "r") as f:
        ann = json.load(f)

    # This is where you flatten structured data into a prompt-like string
    # You can also tokenize keys explicitly if needed
    structured = {
        "vendor": ann.get("vendor_name", ""),
        "gst_number": ann.get("gstin", ""),
        "invoice_date": ann.get("date", ""),
        "items": [],
        "total": ann.get("grand_total", "")
    }

    for item in ann.get("line_items", []):
        structured["items"].append({
            "description": item.get("name", ""),
            "qty": item.get("quantity", ""),
            "price": item.get("amount", "")
        })

    return json.dumps(structured, ensure_ascii=False)

Make sure the keys remain consistent across all samples. If one field is missing, include it with an empty string. I learned this the hard way — inconsistent keys lead to decoding errors or token mismatch issues during fine-tuning.

You might be wondering: what about token overflow?

That’s a real issue. Donut’s tokenizer has a max_length (usually 512). If your JSON is too verbose (especially with nested or repeated fields), it’ll get truncated — and your labels won’t match the input. My workaround was to:

  • Shorten field names ("description" → "desc", etc.)
  • Remove optional verbose fields ("address", "terms", etc.)
  • Keep token count under control with a tokenizer preview before training

Here’s how I’d quickly check if a label is too long:

from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

text = build_donut_label("path/to/sample.json")
tokens = processor.tokenizer(text, return_tensors="pt")  # leave truncation off so you see the real length

print("Token length:", len(tokens["input_ids"][0]))

If that prints anything close to 512 — you’re on thin ice. Either simplify the schema or increase model context (which means deeper surgery).
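
If you go the schema-trimming route, here's roughly what a key-shortening pass can look like. The SHORT_KEYS mapping below is just an illustration I'd adapt per project, not a schema Donut expects.

# Illustrative key-shortening pass; SHORT_KEYS is my own mapping, adjust to taste.
SHORT_KEYS = {"description": "desc", "invoice_date": "date", "gst_number": "gstin"}

def shorten_keys(obj):
    if isinstance(obj, dict):
        return {SHORT_KEYS.get(k, k): shorten_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [shorten_keys(v) for v in obj]
    return obj

# Example: json.dumps(shorten_keys(json.loads(text)), ensure_ascii=False)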


4. Tokenization + OCR-Free Architecture: How to Not Mess Up

“A model with no OCR is like a monk with no eyes — it sees everything, but only if you feed it right.”

That’s not a proverb. That’s my lived experience with Donut.

This might surprise you: Donut doesn’t rely on any external OCR engine. That’s what makes it elegant — and fragile.

All the magic comes from pairing raw document images with structured text strings during training. So if you mess up the tokenization pipeline, the whole thing breaks — silently.

I’ve hit this wall myself. You run training, loss doesn’t move, and you’re left wondering if your GPU’s just generating heat for fun. Nine times out of ten, it’s because your input and output don’t line up.

Let me walk you through how I now handle tokenization religiously when working with Donut.

Start with the Right Processor

Donut uses its own DonutProcessor, which wraps both the image preprocessor and the tokenizer.

from transformers import DonutProcessor
from PIL import Image

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")

# Load your image (already resized to 960px width, etc.)
image = Image.open("sample_invoice.png").convert("RGB")

This part’s straightforward. But here’s where things can get sneaky.

Tokenize with Caution

You’re not just tokenizing labels — you’re tokenizing structured JSON as a string. That string is what the model learns to output, so formatting matters a lot.

# Assume you've already created your structured JSON label string
# e.g. {"vendor": "ABC Co", "invoice_date": "2024-12-01", ...}

text_label = build_donut_label("path/to/annotation.json")

# Append the EOS token manually — this is important!
if not text_label.endswith(processor.tokenizer.eos_token):
    text_label += processor.tokenizer.eos_token

# Tokenize the label text
labels = processor.tokenizer(
    text_label,
    add_special_tokens=False,
    max_length=512,
    truncation=True,
    return_tensors="pt"
)["input_ids"].squeeze(0)

That .eos_token is critical. I’ve seen models fail to converge just because that token was missing. Without it, the decoder doesn’t know when to stop generating — or worse, it misaligns during teacher forcing.

Process the Image

Meanwhile, you’ve got to get the image into pixel_values. Don’t overthink it — the processor handles resizing and normalization.

encoding = processor(image, return_tensors="pt")
pixel_values = encoding["pixel_values"].squeeze(0)  # shape: (3, height, width) after the processor's resize

Now you’re ready to pair things up.

Final Dict You Feed to Donut

sample = {
    "pixel_values": pixel_values,
    "labels": labels
}

That’s your gold standard. I wrap this into a custom dataset class so I can return it in __getitem__ during training.

You might be wondering: do I need to tokenize targets every time?

For inference — no. But for training — yes. I cache them when I preprocess datasets to save GPU cycles. You’ll thank yourself later.
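
Here's a minimal caching sketch, assuming one .pt file per sample; the function name and output path are mine, and the rest reuses build_donut_label from Section 3.

# Minimal caching sketch (one .pt file per sample is an assumption on my side).
import torch
from PIL import Image

def cache_donut_sample(image_path, annotation_path, out_path, processor):
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values.squeeze(0)

    text_label = build_donut_label(annotation_path) + processor.tokenizer.eos_token
    labels = processor.tokenizer(
        text_label, add_special_tokens=False, max_length=512,
        truncation=True, return_tensors="pt"
    ).input_ids.squeeze(0)

    # Saved once; the training dataset can simply torch.load() these in __getitem__.
    torch.save({"pixel_values": pixel_values, "labels": labels}, out_path)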


5. Training Pipeline: HuggingFace Trainer or Custom Loop?

“The right training loop can save your GPU. The wrong one can waste a weekend.”

Let me start with a confession: I wanted to use the HuggingFace Trainer for Donut. Who wouldn’t? It’s plug-and-play, logs everything to TensorBoard, handles mixed precision, and lets you keep your sanity.

But here’s the deal — Donut isn’t your average text classification or token classification model. It’s a VisionEncoderDecoderModel, which means things like padding, attention masks, and loss masking behave a bit differently. I learned that the hard way.

So let me walk you through how I made this work — both with and without Trainer.

Dataset Class (That Actually Works)

I’ve seen people try to stuff everything into a map() transform, but for Donut, you need full control. I had to write a proper PyTorch dataset class to handle image loading, processing, and label tokenization.

from torch.utils.data import Dataset
from PIL import Image
import json

class DonutDataset(Dataset):
    def __init__(self, image_paths, label_paths, processor):
        self.image_paths = image_paths
        self.label_paths = label_paths
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")

        with open(self.label_paths[idx], "r") as f:
            label_str = json.dumps(json.load(f)) + self.processor.tokenizer.eos_token

        pixel_values = self.processor(image, return_tensors="pt").pixel_values.squeeze(0)
        labels = self.processor.tokenizer(label_str, return_tensors="pt", padding="max_length",
                                          truncation=True, max_length=512).input_ids.squeeze(0)

        # The loss only ignores label positions set to -100, so mask the padding explicitly
        labels[labels == self.processor.tokenizer.pad_token_id] = -100

        return {
            "pixel_values": pixel_values,
            "labels": labels
        }

This worked well — especially because I could debug things more easily at the dataset level. If anything goes wrong here, your model will silently suffer.

Custom Collate Function (You Will Need One)

Trainer doesn’t magically know how to pad your pixel values or labels. Here’s the one I use:

import torch

def donut_collate_fn(batch):
    pixel_values = torch.stack([item["pixel_values"] for item in batch])
    labels = torch.stack([item["labels"] for item in batch])
    return {"pixel_values": pixel_values, "labels": labels}

I’ve kept it dead simple, but you can extend it if you’re dealing with variable-length labels and want to apply attention masks. Just remember that the loss only ignores label positions set to -100 (which is why the dataset class masks out the padding above); beyond that, you don’t need to overengineer it.

HuggingFace Trainer Setup

Once I had the dataset and collator ready, Trainer worked surprisingly well. Here’s how I wired it up:

from transformers import VisionEncoderDecoderModel, TrainingArguments, Trainer

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

training_args = TrainingArguments(
    output_dir="./donut-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=5,
    logging_dir="./logs",
    save_strategy="epoch",
    fp16=True,  # If you're using A100s or 3090s
    gradient_checkpointing=True,  # This saved me from constant OOMs
    logging_steps=50,
    evaluation_strategy="epoch",
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=processor.tokenizer,
    data_collator=donut_collate_fn
)
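
For completeness, here's roughly how I wire up train_dataset and val_dataset for the Trainer above. The directory layout and glob patterns are assumptions about my own setup, and the two config assignments are something to verify against your checkpoint rather than blindly copy.

# Hedged wiring sketch; the directory layout is my own convention.
from glob import glob

train_images = sorted(glob("data/train/images/*.png"))
train_labels = sorted(glob("data/train/labels/*.json"))
val_images = sorted(glob("data/val/images/*.png"))
val_labels = sorted(glob("data/val/labels/*.json"))

train_dataset = DonutDataset(train_images, train_labels, processor)
val_dataset = DonutDataset(val_images, val_labels, processor)

# Depending on your checkpoint, these two config fields may need to be set
# explicitly, since the decoder shifts labels right using them. I default to
# the tokenizer's BOS here; check what your checkpoint already defines.
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.bos_token_id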

Personally, I prefer the Trainer when I’m experimenting fast — hyperparameter tuning, logging, evaluation hooks. But for production-grade control (or if you’re trying LoRA or PEFT later), a custom loop might make more sense.

You might be wondering: should you go custom?

Here’s my take — if you’re just fine-tuning on a small domain-specific set (say, 500–2000 documents), the Trainer is more than enough. But if you’re trying to inject adapter layers, freeze encoder blocks, or do any surgical intervention… go manual.
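
To make "go manual" concrete, here's a minimal custom-loop sketch under a few assumptions: encoder frozen, decoder trained with AdamW, fp16 autocast, and the DonutDataset plus collate function from above. Treat it as scaffolding, not a drop-in replacement for the Trainer.

# Minimal custom-loop sketch (my own scaffolding): freeze the encoder,
# train the decoder with AdamW, and use fp16 autocast with a GradScaler.
import torch
from torch.utils.data import DataLoader

device = "cuda"
model.to(device)

for p in model.encoder.parameters():
    p.requires_grad = False  # surgical intervention: the vision encoder stays frozen

optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=3e-5)
loader = DataLoader(train_dataset, batch_size=2, shuffle=True, collate_fn=donut_collate_fn)
scaler = torch.cuda.amp.GradScaler()

model.train()
for epoch in range(5):
    for batch in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            out = model(
                pixel_values=batch["pixel_values"].to(device),
                labels=batch["labels"].to(device),
            )
        scaler.scale(out.loss).backward()
        scaler.step(optimizer)
        scaler.update()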

“Trainer gets you 80% there. Your edge cases live in the other 20%.”


6. Evaluation: Don’t Just Use Loss

“Loss is just a number — your users don’t care if it’s 1.2 or 0.3. They care if the total amount is correct.”

When I started fine-tuning Donut, I quickly realized something: the built-in Trainer evaluation loop is basically useless for real-world document tasks. All it gives you is loss. But for structured document extraction — especially invoices or receipts — a low loss doesn’t mean your dates and totals are accurate.

So I had to build a better way to evaluate. Let me show you what worked for me.

BLEU Score for Structured Output

It’s not perfect, but for quick sanity checks, BLEU is actually decent. Especially when the labels are deterministic and the JSON is consistent across samples. Here’s how I wired it up:

from nltk.translate.bleu_score import sentence_bleu
from transformers import DonutProcessor

def compute_bleu(preds, labels):
    scores = []
    for pred, ref in zip(preds, labels):
        pred_tokens = pred.split()
        ref_tokens = [ref.split()]
        scores.append(sentence_bleu(ref_tokens, pred_tokens))
    return sum(scores) / len(scores)

This helped catch major syntax-level errors early on. But once I got past that…

Field-Level Accuracy: Way More Useful

Here’s the deal — if you’re extracting structured data like this:

{
  "vendor": "Staples",
  "total_amount": "89.10",
  "invoice_date": "2023-03-05"
}

…then BLEU won’t tell you if the total_amount is wrong by one digit.

So I wrote a field-level evaluator that parses both prediction and ground truth JSONs and checks exact string matches per field:

import json

def evaluate_structured_json(preds, labels, fields=["vendor", "total_amount", "invoice_date"]):
    scores = {field: 0 for field in fields}
    total = len(preds)

    for pred, gt in zip(preds, labels):
        try:
            pred_json = json.loads(pred)
            gt_json = json.loads(gt)
            for field in fields:
                if pred_json.get(field) == gt_json.get(field):
                    scores[field] += 1
        except Exception:
            continue  # Skip broken samples

    return {field: round(scores[field] / total, 3) for field in fields}

This gave me a clear, actionable metric. I could literally see, “Okay, model’s 91% accurate on total_amount, but only 76% on invoice_date.”

Decoding Outputs Back to JSON

This part is trickier than it looks. You’ll often get malformed JSON or partial outputs — especially if you didn’t append the EOS token during label generation. Here’s how I decoded outputs during evaluation:

outputs = model.generate(pixel_values)
preds = processor.batch_decode(outputs, skip_special_tokens=True)

# Optional: force-close brackets if malformed
def try_fix_json(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        try:
            return json.loads(text + "}")  # naive fix for missing brackets
        except json.JSONDecodeError:
            return {}

Also: if you’re wondering whether Donut hallucinated a field, the answer is probably yes. I had to manually write validators to flag outputs where total_amount was completely missing or had garbage values like "totlal_amunt": "...".
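
The validators themselves are nothing fancy. Here's a stripped-down version; the field names and regexes are assumptions from my own invoice schema, so swap in yours.

# Stripped-down validator; field names and format checks are from my own schema.
import re

REQUIRED_FIELDS = ["vendor", "total_amount", "invoice_date"]

def validate_prediction(pred_json):
    issues = []
    for field in REQUIRED_FIELDS:
        if not str(pred_json.get(field, "")).strip():
            issues.append(f"missing:{field}")
    # Cheap format checks; tune these to your own documents.
    if pred_json.get("total_amount") and not re.fullmatch(r"\d+(\.\d{1,2})?", str(pred_json["total_amount"])):
        issues.append("bad:total_amount")
    if pred_json.get("invoice_date") and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(pred_json["invoice_date"])):
        issues.append("bad:invoice_date")
    return issues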

Let’s move into inference.


7. Post-Processing & Inference Pipeline

“Training gets you a model. Inference gets you value.”

Once fine-tuning was done, I had to design an inference pipeline that could handle real-world documents — scanned images, mobile captures, PDFs, even receipts with coffee stains. Here’s how I made it work end-to-end.

Image → Token Output → JSON

It’s tempting to just loop over a folder of images and call .generate() on each one. But once you scale past a few dozen samples or want batch inference on GPU, you’ll need more structure. Here’s what I settled on:

from PIL import Image
import torch

def run_inference(image_paths, processor, model, batch_size=4):
    model.eval()
    outputs = []

    with torch.no_grad():
        for i in range(0, len(image_paths), batch_size):
            batch_images = [
                processor(Image.open(p).convert("RGB"), return_tensors="pt").pixel_values
                for p in image_paths[i:i+batch_size]
            ]
            pixel_values = torch.cat(batch_images).to(model.device)
            gen_outputs = model.generate(pixel_values)
            decoded = processor.batch_decode(gen_outputs, skip_special_tokens=True)
            outputs.extend(decoded)
    return outputs

Edge Case Handling

You’ll inevitably run into:

  • Missing fields: e.g., model skips "vendor" completely
  • Truncated output: happens if your max_length is too low
  • Invalid JSON: especially with dates, float numbers, or embedded newlines

My advice: create a postprocessing script that handles fallbacks. If invoice_date is missing, log it and skip; don’t let the whole pipeline crash.
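
Here's roughly what that fallback layer looks like for me; try_fix_json comes from the evaluation section, while the logger name and field choices are my own.

# Fallback layer sketch; the logger name and field choices are assumptions.
import logging

logger = logging.getLogger("donut_postprocess")

def postprocess_output(raw_text, doc_id):
    parsed = try_fix_json(raw_text)  # from the evaluation section above
    if not parsed:
        logger.warning("Could not parse output for %s, skipping.", doc_id)
        return None
    if not parsed.get("invoice_date"):
        logger.warning("invoice_date missing for %s, keeping the rest.", doc_id)
    return parsed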

Batch Inference & Latency Tips

A few things that helped me squeeze out better throughput in production:

  • Mixed precision (torch.float16): reduced GPU memory load (see the sketch after this list)
  • ONNX export: worked for some encoder parts, though decoding was trickier
  • Pre-tokenize inputs if your model expects static input shapes
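
For the mixed-precision point, this is the minimal version I mean; it assumes a GPU that handles fp16 cleanly and reuses the generate-style call from run_inference above, so spot-check a few outputs after switching.

# fp16 inference sketch; assumes the GPU handles half precision cleanly.
model = model.half().to("cuda")

with torch.no_grad():
    gen_outputs = model.generate(pixel_values.half().to("cuda"))
    decoded = processor.batch_decode(gen_outputs, skip_special_tokens=True)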

Also — if you’re serving Donut in real time (say, on a SaaS API), you will want to limit input image resolution. I found 960px width to be the sweet spot between speed and OCR-free performance.


8. Exporting, Saving, and Reusing the Fine-Tuned Model

This might sound basic, but saving Donut the right way actually tripped me up early on. The usual save_pretrained works, yes, but there are caveats—especially if you’ve tweaked the tokenizer or customized the processor mid-finetune.

Here’s the export boilerplate I use at the end of any training script:

# Save the fine-tuned model and processor
model.save_pretrained("my-donut-model")
processor.save_pretrained("my-donut-processor")

Sounds simple, right? But here’s a tip that’ll save you some serious head-scratching later:

If you’ve added special tokens or modified the tokenizer config, make sure you explicitly save that too.

I’ve been burned by this. The tokenizer looked fine at save time, but failed to decode correctly on reload. You can check if your tokenizer’s special tokens got updated like this:

print(processor.tokenizer.special_tokens_map)

If you’ve added something like a [PAD] token for batching or decoding stability, do this before saving:

processor.tokenizer.save_pretrained("my-donut-processor")

When loading it back:

from transformers import VisionEncoderDecoderModel, DonutProcessor

model = VisionEncoderDecoderModel.from_pretrained("my-donut-model")
processor = DonutProcessor.from_pretrained("my-donut-processor")
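
Before wiring the reloaded pipeline into anything downstream, I run a quick round-trip check on the tokenizer; the sample string below is arbitrary, any label from your dataset works.

# Quick round-trip check on the reloaded tokenizer (the sample string is arbitrary).
print(processor.tokenizer.special_tokens_map)

sample = '{"vendor": "ABC Distributors", "total": "3400"}'
ids = processor.tokenizer(sample, add_special_tokens=False).input_ids
print(processor.tokenizer.decode(ids))  # should closely reproduce the JSON string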

And just like that, you’ve got a reloadable Donut pipeline that can be deployed, fine-tuned further, or served via an API.


9. Final Thoughts: When Not to Fine-Tune Donut

“When all you’ve got is a hammer, everything looks like a nail. But Donut is a pretty fancy hammer — don’t use it to swat flies.”

Let me be blunt: Donut isn’t always the right tool. I’ve seen teams waste compute and weeks of engineering time fine-tuning it for problems that could’ve been solved faster, cheaper, and more robustly.

Here’s where I don’t use Donut — even though I could:

When Prompt-Tuning or Synthetic Augmentation Makes More Sense

If your structured outputs are predictable and the input layout is consistent, consider this route: generate synthetic documents using templated text + randomized fields, and train a small T5 or FLAN model with few-shot prompts. It’s 10x faster and usually good enough for narrow tasks.

Personally, I’ve used this approach for expense reports and delivery slips, where the format was basically locked-in. No need to bring in heavy OCR-free models.

When Layout-Aware Models Outperform Donut

Here’s the deal: Donut is layout-blind. It sees pixels, not bounding boxes.

So for document types where layout matters more than text style — like forms, W-2s, or bank statements — I’ve had better results with LayoutLMv3 or StrucTexT. These models combine visual features and layout tokens, giving you richer spatial context.

In one project involving tax forms, Donut just couldn’t consistently pick up field associations — like matching “Total Deductions” with the number next to it. Switching to LayoutLMv3 improved field-level F1 by over 20%.

My Candid Thoughts from the Trenches

Donut’s architecture is elegant. OCR-free is the future. But it’s not plug-and-play, and it’s not magic. In my experience, it shines in these scenarios:

  • Multilingual documents
  • Visually noisy inputs (e.g., receipts, scanned papers)
  • Semi-structured templates with subtle variations

But if your documents are already text-layer PDFs or cleanly OCRable, you’re probably better off with something lighter and more layout-aware.

So before you fine-tune: ask yourself if the task actually needs vision-based modeling. If it doesn’t, Donut might be overkill.
