Fine-Tuning BERT for Named Entity Recognition (NER)

1. Why Fine-Tune BERT for NER Instead of Using Off-the-Shelf Models

“A model trained on everything usually understands nothing deeply.” That’s something I learned the hard way the first time I tried plugging a generic pre-trained BERT into a legal domain use case.

Off-the-shelf NER models (say, a bert-base-cased checkpoint fine-tuned on CoNLL-2003, or spaCy’s en_core_web_trf) are decent for generic tasks — people, organizations, locations — the usual suspects. But the moment you throw in financial lingo, medical entities, or custom taxonomies, they start missing things that should’ve been obvious.

I’ve personally seen this happen in a biotech project — standard models would consistently miss gene names and protein symbols. Not because they were inherently bad, but because they were never trained to recognize those patterns in the first place.

So, when do I go for fine-tuning?

  • Custom entities: If you’re working with non-standard labels — things like PART_NUMBER, COURT_NAME, or CHEMICAL — you need your model to speak your domain.
  • Annotation styles: Pretrained models expect data labeled in a specific format. If your team uses a different BIO scheme, or nested tags, you’ll run into compatibility issues.
  • Label distribution: I’ve had projects where 80% of entities were one class. That skew completely threw off the pre-trained taggers. Fine-tuning helped balance things.

Sure, fine-tuning takes time, GPU cycles, and a bit of trial-and-error. But in my experience, it always pays off when you’re working in domains where missing an entity isn’t just a bug — it’s a business problem.


2. Dataset Format: What You Need and How to Prepare It

Before you even touch the model, your dataset has to be in shape — and I’m not talking about clean CSVs or tidy JSONs. I mean token-level labeling that BERT can digest without choking.

Here’s the format I’ve consistently used:

Apple      B-ORG  
released   O  
the        O  
new        O  
iPhone     B-PRODUCT  

This is a standard CoNLL-style format. If you’re not working with .conll files, that’s fine. I’ve used everything from annotated CSVs to custom JSON exports from Prodigy or Label Studio. The trick is: convert everything to a token + tag format, then align it with BERT’s tokenizer (we’ll handle alignment later).
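
If your annotations come as character spans (typical for Label Studio or Prodigy exports), here is a rough converter I use as a starting point. The record layout below is hypothetical, and the function assumes entities line up with whitespace token boundaries, so flatten nested spans before running it:

def spans_to_bio(record):
    # Assumed record shape: {"text": "...", "entities": [{"start": 0, "end": 5, "label": "ORG"}, ...]}
    tokens, tags = [], []
    offset = 0
    for token in record["text"].split():
        start = record["text"].index(token, offset)
        end = start + len(token)
        offset = end
        tag = "O"
        for ent in record["entities"]:
            if start >= ent["start"] and end <= ent["end"]:
                tag = ("B-" if start == ent["start"] else "I-") + ent["label"]
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags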

You might be wondering: Why not just use spaCy or Flair’s formats? Honestly, I’ve tried them — but when you want full control, especially with HuggingFace’s datasets library, it’s better to build it yourself.

Here’s what I typically do:

from datasets import Dataset

def read_conll_data(file_path):
    tokens, ner_tags = [], []
    temp_tokens, temp_tags = [], []

    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if temp_tokens:
                    tokens.append(temp_tokens)
                    ner_tags.append(temp_tags)
                    temp_tokens, temp_tags = [], []
                continue
            # Token is the first column, tag is the last (works for 2+ column CoNLL files)
            parts = line.split()
            temp_tokens.append(parts[0])
            temp_tags.append(parts[-1])

    # Don't drop the last sentence if the file doesn't end with a blank line
    if temp_tokens:
        tokens.append(temp_tokens)
        ner_tags.append(temp_tags)

    return Dataset.from_dict({"tokens": tokens, "ner_tags": ner_tags})

dataset = read_conll_data("data/train.conll")

And if you’re using HuggingFace’s built-in datasets (like conll2003 or wnut_17), loading is even easier:

from datasets import load_dataset

dataset = load_dataset("conll2003")

A few battle-tested tips from my experience:

  • Label consistency is key — make sure no stray tag like I-PRODUCT appears without a B-PRODUCT before it (the quick checker after this list catches exactly this).
  • Watch for nested entities — if your annotation tool supports nesting, flatten them before training unless you’re using models that can handle span-based labeling.
  • Subword mess — I’ve seen poorly preprocessed datasets ruin the model’s performance because the labels didn’t align with how BERT splits tokens. We’ll fix that in the next section.
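
Here is the quick consistency check I mentioned above. It’s a minimal sketch that assumes string BIO tags, as produced by read_conll_data earlier; the function name is mine, not from any library:

def check_bio_consistency(tokens, tags):
    # Flags I-X tags that don't continue a B-X or I-X of the same entity type
    issues = []
    prev_tag = "O"
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and not (prev_tag.startswith(("B-", "I-")) and prev_tag[2:] == tag[2:]):
            issues.append((token, tag))
        prev_tag = tag
    return issues

for sent_tokens, sent_tags in zip(dataset["tokens"], dataset["ner_tags"]):
    for token, tag in check_bio_consistency(sent_tokens, sent_tags):
        print(f"Stray tag {tag!r} on token {token!r}")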

3. Tokenization and Label Alignment

“If you’ve ever trained a token classification model and ended up with garbage F1 scores, this is probably where it went wrong.”
I’ve made that mistake early on — trusted the tokenizer to “just work” and didn’t bother aligning the labels properly. Big mistake.

Here’s the deal: BERT doesn’t tokenize text the way your annotated dataset does. It breaks words into subwords using WordPiece — and unless you align your labels to those subwords exactly, your model will end up learning nonsense.

Let me show you what I do in production setups. This handles alignment, subwords, and padding correctly.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Expects examples["ner_tags"] to already be integer label ids (see the mapping note below)
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        padding="max_length",  # I usually set max_length manually elsewhere
        return_offsets_mapping=False,
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        aligned_labels = []
        prev_word_idx = None

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens and padding: ignored by the loss
                aligned_labels.append(-100)
            elif word_idx != prev_word_idx:
                # First subword of a word gets the word's label
                aligned_labels.append(label[word_idx])
            else:
                # Subword continuations: replicate the label, or append -100 to ignore them
                aligned_labels.append(label[word_idx])
            prev_word_idx = word_idx
        labels.append(aligned_labels)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
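
To wire this into the rest of the pipeline, I push it through datasets' map. One assumption worth flagging: the function above expects integer tag ids. Hub datasets like conll2003 already ship them; if your tags are strings (as with read_conll_data earlier), map them through a label2id dict first, roughly like this:

label_list = ["O", "B-ORG", "I-ORG", "B-PRODUCT", "I-PRODUCT"]  # your tag set
label2id = {label: i for i, label in enumerate(label_list)}

# Only needed when ner_tags are strings rather than ids
dataset = dataset.map(
    lambda batch: {"ner_tags": [[label2id[t] for t in tags] for tags in batch["ner_tags"]]},
    batched=True,
)

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)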

I’ve seen people ignore this and get strange behaviors like models predicting I- tags at the beginning of spans or entirely skipping entities. Misaligned labels cause silent bugs — no crash, just bad performance.

Quick tip from experience: Always print a few token-label pairs after preprocessing. It saves you from hours of debugging later.

Another edge case that hits harder than you’d think: truncation. BERT has a token limit (usually 512), and if you’re working with long sentences or documents, you’ll lose trailing tokens — and their labels — silently.

I personally use a sliding window strategy with overlaps when handling longer sequences. You don’t always need it, but when you do, it’s non-negotiable.
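
If you would rather have the tokenizer do the windowing for you at preprocessing time, the fast tokenizers can emit overlapping chunks directly. A minimal sketch of that variant, with my usual 50-token overlap; the label alignment then runs per chunk exactly as above:

tokenized = tokenizer(
    examples["tokens"],
    is_split_into_words=True,
    truncation=True,
    max_length=512,
    stride=50,                       # overlap between consecutive windows
    return_overflowing_tokens=True,  # keep every window instead of silently dropping the tail
)

# One original example may now produce several chunks; this maps each chunk back to its
# source example so you can pull the right ner_tags when aligning labels per chunk.
sample_map = tokenized["overflow_to_sample_mapping"]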


4. Model Setup: Choosing the Right Variant

“Not all BERTs are created equal — pick the wrong one, and you’ll waste a week training something that’ll never perform well.”
I’ve personally used at least a dozen variants depending on the domain, the size of the dataset, and the type of entities I’m tagging.

Let’s break this down:

Which BERT variant to choose?

  • bert-base-cased: My go-to default. Good for English datasets where casing helps distinguish entities (like Apple the company vs apple the fruit).
  • bert-large-cased: I reach for this when the dataset is big enough and compute isn’t a bottleneck. It’s twice as deep (24 layers vs 12) with a larger hidden size, but noticeably slower to train and serve.
  • Domain-specific variants: I’ve had the best results using:
    • BioBERT for biomedical papers (protein names, gene symbols, etc.)
    • FinBERT for financial texts (tickers, currency codes, etc.)
    • LegalBERT for legal corpora (acts, case numbers, court references)

If you’re working in a specialized field, use the domain-specific model — general-purpose BERT just won’t cut it, especially for low-resource classes.

Loading the model for token classification

You’ll need to specify the number of labels based on your dataset. Here’s the clean way to set it up:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
)

I usually pass the label2id and id2label dicts explicitly too, especially if I plan to export the model later for inference:

model.config.label2id = label2id
model.config.id2label = id2label
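
Equivalently, you can hand the mappings straight to from_pretrained. A small sketch, assuming label_list is the ordered list of tag names from your dataset:

id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)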

Freezing layers — does it help?

In my experience, yes, but only for smaller datasets or when I want to avoid catastrophic forgetting.

I sometimes freeze the embedding and first few encoder layers to speed up training and focus updates on the final few layers:

# Freeze the embeddings plus the first four encoder layers (adjust the range to taste)
frozen_prefixes = ("bert.embeddings",) + tuple(f"bert.encoder.layer.{i}." for i in range(4))

for name, param in model.named_parameters():
    if name.startswith(frozen_prefixes):
        param.requires_grad = False

One thing I’ve noticed: freezing too much can make the model plateau early. So I usually do it for the first 1–2 epochs, then unfreeze everything.
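
The unfreeze step itself is a two-liner I run between those phases:

# Train the full network again after the frozen warm-up epochs
for param in model.parameters():
    param.requires_grad = True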


5. Training Loop with HuggingFace Trainer API

“There’s nothing more frustrating than realizing your model underperformed because your training loop was the bottleneck — not the data, not the architecture.”
I’ve been there. I’ve tried writing custom PyTorch loops, fiddled with weird LR schedules, and even forgot to enable gradient clipping once (don’t ask). Eventually, I settled on the HuggingFace Trainer — not because it’s trendy, but because it actually works, if you set it up right.

Here’s a base setup I often start with for token classification tasks like NER:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./ner-model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    learning_rate=3e-5,
    warmup_steps=500,
    logging_dir="./logs",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # We'll define this metric next
    greater_is_better=True,
)

I usually experiment with warmup_steps and learning_rate more than anything else — they tend to have the biggest impact on convergence speed.

Now here’s the full Trainer instantiation — again, nothing magical, but it’s bulletproof:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
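
One assumption baked into this setup: because tokenization padded everything to max_length, the default collator is fine. If you switch to dynamic padding, hand the Trainer a token-classification collator so the labels get padded with -100 as well, and then kick off training:

from transformers import DataCollatorForTokenClassification

# Pads input_ids and labels (with -100) per batch; pass it as data_collator=... above
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer.train()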

A few things I always include in real runs:

  • Gradient clipping: Prevents exploding gradients on noisy datasets. By default, HuggingFace clips to 1.0, which is usually enough. I rarely change it unless I’m fine-tuning massive models.
  • Learning rate scheduler: The built-in linear scheduler with warmup is surprisingly effective. With NER, warmups help stabilize training early when label sparsity is a concern.
  • Weight decay vs. layer freezing: I personally prefer light weight decay (0.01) and unfrozen layers. But if I’m dealing with very few samples, freezing the embeddings or the first few encoder layers can stabilize training. It really depends on dataset size and noise.

And yes — load_best_model_at_end=True has saved me from checkpoint regret so many times.


6. Custom Metrics: Precision/Recall/F1 at the Entity Level

“Accuracy is a lie in NER.”
I say this every time I teach or review a project. Here’s why: let’s say you’re tagging a medical text. If your model predicts B-Disease instead of B-Symptom, and the rest of the sentence is right, it’ll still show high accuracy. But your output is semantically useless.

What actually matters is span-level F1 — whether the model correctly identifies the full entity boundaries and their types.

I use seqeval for this. It’s consistent, mature, and gives you proper precision, recall, and F1 across all entity types. Here’s a custom metric function I plug into the Trainer:

import numpy as np
from seqeval.metrics import classification_report, precision_score, recall_score, f1_score

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Keep only positions with real labels (drop the -100 special-token/padding positions)
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # For a per-entity breakdown, print(classification_report(true_labels, true_predictions))
    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }

One tip I’ve learned the hard way: always double-check your label_list alignment with the dataset. If your label-to-id mapping is off by one, your metric scores will be garbage and you won’t know why.
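
For reference, here is where I usually pull label_list from. The first line assumes you loaded a Hub dataset with ClassLabel features (like conll2003 in section 2); the commented line is what I do for custom string-tagged data:

# Hub dataset with ClassLabel features (e.g. conll2003)
label_list = dataset["train"].features["ner_tags"].feature.names

# Custom dataset with string tags: derive the list from the data itself
# label_list = sorted({tag for tags in dataset["ner_tags"] for tag in tags})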

Optional but powerful:

If your dataset has overlapping or nested spans, you’ll need to roll your own evaluator. I’ve done this in projects where entities nested within each other (legal contracts and medical literature mostly). A different flat scheme like BILOU or IOBES isn’t the problem (seqeval handles those); nesting is where it stops helping.

If you go that route, you’ll probably end up writing span-matching logic with set operations over token indices.

But for 90% of standard use cases? The function above will give you solid, reliable metrics.


7. Evaluation: What Really Matters

“You can’t fix what you don’t understand — and you definitely can’t improve what you don’t measure right.”
This part is non-negotiable for me. I’ve seen plenty of teams celebrate high token-level accuracy without realizing their span-level performance was… well, unusable. Especially in NER, metrics can lie if you’re not looking at the right ones.

Let me walk you through how I usually approach evaluation after training:

Span-Level vs Token-Level Metrics

  • Token-level metrics are fast and easy to compute, but they’re misleading. Predicting just one token of a multi-token entity doesn’t count in real-world use.
  • Span-level metrics (via seqeval, as we did earlier) actually reflect whether your model caught the full entity with the right label.

In my projects, I always use span-level F1 as the gold standard — and only glance at token accuracy if something is really off and I’m hunting for a clue.

Qualitative Debugging (The Real Goldmine)

You might be wondering: what do I do when the F1 drops and I don’t know why?
Here’s a trick I use — printing side-by-side predictions and gold labels:

# predictions / labels: per-example predicted and gold label ids (e.g. argmaxed trainer.predict output)
for input_ids, true, pred in zip(tokenized_dataset["input_ids"], labels, predictions):
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    for t, tr, pr in zip(tokens, true, pred):
        if tr != -100:  # skip special tokens and padding
            print(f"{t:15} | true: {label_list[tr]:10} | pred: {label_list[pr]}")
    print("-" * 50)

This has saved me more times than I can count. Misaligned subwords, off-by-one shifts, or even weird tokenization artifacts — they all show up here, plain and clear.

Domain-Specific Failure Modes

Let me give you an example:
In a finance domain project, my model kept misclassifying “Apple Inc.” as a product, not an organization. Why? Because “Apple” was overrepresented as a consumer brand in the pretraining data. I had to oversample certain cases during fine-tuning to balance this out.

So yes, bias from pretraining corpora is real, and you’ll feel it hard if your dataset is small or underrepresents the edge cases.


8. Pushing to Production

“Models don’t live in notebooks — they run in services.”
I learned this the hard way when I built my first NER pipeline that crashed on GPU because I forgot to bundle the tokenizer. So now, I make deployment part of the pipeline from day one.

Exporting the Model (for Speed and Portability)

When latency matters, I move from PyTorch to ONNX. It’s not just a checkbox — I’ve seen inference times drop by 30-40% when done right.

optimum-cli export onnx --model ./ner-model ./onnx-model

This converts both the model and config into ONNX format. If you’re doing batch inference or microservice deployment, this is a solid upgrade.

Bundling the Tokenizer + Model

Whatever you do, don’t forget to include the tokenizer. I always export the tokenizer using:

tokenizer.save_pretrained("./onnx-model")

And I bundle both in my container or service zip. Without the tokenizer, your deployment pipeline is basically flying blind.

Lightweight Inference Script with Confidence Scores

Here’s a stripped-down inference loop I’ve used in production APIs:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import torch.nn.functional as F

# Point this at the directory where you saved the fine-tuned model and tokenizer.
# For the ONNX export, swap in optimum.onnxruntime's ORTModelForTokenClassification instead.
tokenizer = AutoTokenizer.from_pretrained("./ner-model")
model = AutoModelForTokenClassification.from_pretrained("./ner-model")
model.eval()

def predict_ner(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = F.softmax(outputs.logits, dim=-1)
    confs, preds = torch.max(probs, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, preds[0].tolist(), confs[0].tolist()))

I usually attach a confidence threshold to filter weak predictions. In one case, I set min_confidence=0.85 to cut false positives by half without hurting recall.
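
That min_confidence value is my own knob, not a library parameter. A sketch of how I apply it on top of predict_ner, using the id2label mapping we stored on the config earlier:

def filter_predictions(ner_output, min_confidence=0.85):
    # Drop "O" tokens and anything the model isn't confident about
    id2label = model.config.id2label
    return [
        (token, id2label[label_id], conf)
        for token, label_id, conf in ner_output
        if conf >= min_confidence and id2label[label_id] != "O"
    ]

entities = filter_predictions(predict_ner("Apple released the new iPhone"))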


9. Lessons Learned from Real-World Fine-Tuning

“Theory ends where the real-world mess begins.”

This section is personal — because I’ve burned days on bugs that weren’t bugs at all, just subtle issues no textbook warned me about. Let me walk you through the traps I’ve seen (and sometimes fallen into).

Overfitting on Small Datasets: It’s Sneaky

Overfitting in NER isn’t always obvious. I remember a case where my validation F1 looked rock-solid — but once I ran inference on unseen docs, half the entities were either hallucinated or missed entirely.

What helped me catch this:

  • Monitoring confidence scores: consistent 0.99 across predictions? Probably memorized.
  • Evaluating on multiple validation sets, preferably pulled from different sources or time periods.
  • Keeping an eye on span diversity — if your dev set has 100 “New York” tags and nothing else, you’re not testing generalization.

Mitigation tips:

Freeze lower layers if you suspect the model is “too powerful” for your dataset.

Early stopping + weight decay help, but mixing in pseudo-labeled data or weak supervision can do wonders when labeled data is scarce.

Preprocessing Pitfalls That Kill Performance Silently

This one’s a classic: I once had two identical-looking datasets give wildly different results. Turns out the difference was newline characters in the input. Another time, it was whitespace in token lists.

So now, I always:

  • Normalize whitespace during both training and inference.
  • Keep casing consistent between training and inference — don’t lowercase one side and not the other, especially with bert-base-cased.
  • Validate entity spans with visual inspection before training. If the labels don’t align with tokens visually, you’re wasting GPU hours.

Long Document Inference (Don’t Skip This)

Real-world data isn’t always tweet-length. I’ve worked on legal and medical datasets where a single document is 2,000+ tokens.

Here’s what I do:

  • Sliding window strategy with overlap (usually 50 tokens).
  • Merge overlapping predictions — if spans disagree, I keep the one with higher confidence.

Here’s a simplified version of what I use:

def sliding_window_inference(text, window_size=510, stride=256):
    # 510 tokens + the [CLS]/[SEP] added inside predict_ner stays within BERT's 512 limit
    encoding = tokenizer(text, add_special_tokens=False, truncation=False)
    input_ids = encoding["input_ids"]
    outputs = []

    for start in range(0, len(input_ids), stride):
        window = input_ids[start:start + window_size]
        window_text = tokenizer.decode(window, skip_special_tokens=True)
        outputs.append(predict_ner(window_text))  # The NER function from the deployment section
        if start + window_size >= len(input_ids):
            break  # this window already covers the tail, so don't emit empty windows

    return merge_predictions(outputs)  # your helper that resolves overlaps, e.g. by confidence

This trick alone made inference feasible on legal briefs that were previously throwing max length errors.

Tokenizer Mismatch Between Training and Inference

This might surprise you: your model can crash or hallucinate if you forget to ship the tokenizer config.
I learned this the hard way when my tokenizer at inference time didn’t match the one used during training — slightly different vocab file, different special token handling.

So now, every time I save the model, I also save the tokenizer:

model.save_pretrained("final-ner-model")
tokenizer.save_pretrained("final-ner-model")

Bundle both — always. And test them together on at least 5 inference examples before calling it “done.”


10. Conclusion: Where Fine-Tuning Really Shines

By now, you’ve seen the full path — not just the steps, but the landmines I’ve hit along the way:

Dataset → Label Alignment → Fine-tuning → Evaluation → Deployment.

If you ask me where fine-tuning BERT for NER really shines, it’s in those high-stakes, niche domains where prebuilt models just fall short.

I’ve seen massive jumps in span-level F1 when switching from general-purpose NER to something fine-tuned for legal, healthcare, or even fintech terminology.

One last thing I’ll leave you with:

“Don’t fine-tune because it’s trendy. Fine-tune because your data has quirks — and only your model should know them.”
