When Your “Labels” Aren’t Really Labels: Dealing with Entity-Based NLP Datasets

Hugging Face Forums [Unofficial] April 26, 2026

When your “labels” are really entities: how to restructure the dataset

Your current topics column should not be treated as the final supervised label column. In the examples you gave, the values are mostly entities, places, organisations, people, professions, or infrastructure names:

["Doctors", "NHS", "British Medical Association (BMA)"]
["Glasgow"]
["Sutton in Ashfield", "Annesley", "M1 motorway"]
["Elon Musk", "Tesla"]

Those values answer:

Which named things are mentioned?

But your target task asks:

What broad news domain is this article about?

Those are different annotation layers. The right move is not to train directly on the current topics field, but to treat it as weak evidence for a new, cleaner field such as domain_labels.


1. Reframe the task

A better framing is:

Multi-label news-domain classification using entity-derived weak supervision.

That means:

| Layer | Example | Role |
| --- | --- | --- |
| Raw entity/topic hints | NHS, Glasgow, Elon Musk, Tesla | Metadata / weak evidence |
| Entity type | organisation, person, location, road | Helps interpret the hint |
| Candidate domain | Health, Business, Technology, Transport | Possible labels |
| Final article label | Health + Labour + Politics | Supervised target |

So the pipeline should be:

article text
+ current entity-style topics
+ entity normalization
+ entity typing/linking
+ text/context rules
+ zero-shot/embedding suggestions
+ selective human review
→ domain_labels
→ multi-label classifier

not:

topics column
→ train classifier directly

2. Start with a taxonomy, not a model

Before mapping anything, define the target domains. Otherwise, ambiguous categories like Business, Technology, Transport, Science, and Geography will overlap inconsistently.

For news, IPTC Media Topics is a strong reference taxonomy: it is a constantly updated news-focused taxonomy of 1,200+ terms, originally based on IPTC Subject Codes, first released in 2010, and updated at least annually. (IPTC)

You do not need the full IPTC hierarchy at first. Start with a compact top-level taxonomy:

Health
Business & Economy
Technology
Politics & Government
Crime & Justice
Transport & Infrastructure
Education
Science
Environment
Sport
Arts & Entertainment
Society
Conflict & Security
Law
Religion
Labour
Other / Unclear

Important design choice: avoid automatic “Geography”

A place mention is usually a where, not a what.

Glasgow hospital waiting times rise → Health
Glasgow council budget approved → Politics & Government
Glasgow Warriors win final → Sport
Glasgow rail disruption continues → Transport & Infrastructure

So Glasgow should usually become:

{
  "locations": ["Glasgow"],
  "domain_labels": []
}

not:

{
  "domain_labels": ["Geography"]
}

Use a Geography / Places label only if the article is genuinely about geography, land use, maps, demography, regional identity, tourism geography, or places as the subject.


3. Map entities to candidate domains, not final domains

The main mistake to avoid is this:

NHS → Health
Tesla → Technology
Elon Musk → Business
Glasgow → Geography

That looks clean, but it is too rigid. A better mapping is:

entity → candidate domains
article context → final domains

Example: NHS

NHS is a strong Health signal, but not only Health.

NHS waiting lists rise → Health
NHS strike talks collapse → Health + Labour + Politics
NHS funding dispute → Health + Politics + Business/Economy
NHS cyberattack → Health + Technology + Law

So encode:

{
  "National Health Service": [
    "Health",
    "Politics & Government",
    "Labour"
  ]
}

Example: Elon Musk / Tesla

Elon Musk and Tesla are especially ambiguous:

{
  "Tesla": [
    "Business & Economy",
    "Technology",
    "Transport & Infrastructure",
    "Environment",
    "Law"
  ],
  "Elon Musk": [
    "Business & Economy",
    "Technology",
    "Media",
    "Politics & Government",
    "Law",
    "Science"
  ]
}

Then article context decides:

| Article context | Better labels |
| --- | --- |
| Tesla shares fall after weak earnings | Business & Economy |
| Tesla announces self-driving update | Technology + Transport |
| Tesla crash investigated by regulator | Transport + Law + Technology |
| Musk comments on election policy | Politics + Media/Technology |
| Tesla factory expansion approved | Business + Politics + Environment |

4. Recommended dataset schema

Keep the raw entity field, but separate it from the final labels.

Rich development schema

Use this while building and auditing labels:

{
  "article_id": "article_001",
  "title": "Junior doctors announce strike after NHS talks collapse",
  "body": "...",

  "entities_raw": [
    "Doctors",
    "NHS",
    "British Medical Association (BMA)"
  ],

  "entities_normalized": [
    "doctor",
    "National Health Service",
    "British Medical Association"
  ],

  "entity_types": {
    "doctor": ["profession"],
    "National Health Service": ["healthcare organisation", "public body"],
    "British Medical Association": ["professional association", "medical organisation"]
  },

  "people": [],
  "organisations": [
    "National Health Service",
    "British Medical Association"
  ],
  "locations": [],

  "candidate_domain_scores": {
    "Health": 0.96,
    "Politics & Government": 0.74,
    "Labour": 0.71
  },

  "domain_labels": [
    "Health",
    "Politics & Government",
    "Labour"
  ],

  "label_source": "rules_plus_review",
  "review_status": "reviewed",
  "taxonomy_version": "v0.1",
  "labeling_rules_version": "v0.3"
}

Training schema

For Hugging Face model training, simplify to:

{
  "text": "Junior doctors announce strike after NHS talks collapse. ...",
  "domain_labels": ["Health", "Politics & Government", "Labour"],
  "labels": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
}

The readable domain_labels list is for humans. The fixed-length labels vector is for the model.

For Hugging Face datasets, prefer scriptless CSV/JSON/Parquet storage. Hugging Face Datasets can load local or remote CSV, JSON, TXT, and Parquet files with load_dataset(), and the docs note that a custom dataset loading script is usually unnecessary for formats such as CSV, JSON, JSON Lines, text, images, audio, or Parquet. (Hugging Face)

A good repository layout:

README.md
data/train.parquet
data/validation.parquet
data/test.parquet

5. How to generate domain labels at scale

Use a hybrid weak-supervision pipeline. Do not rely on manual mapping, rules, or embeddings alone; each covers a weakness of the others.

| Method | Best role | Weakness |
| --- | --- | --- |
| Manual mapping | Taxonomy design; top frequent entities; validation set | Does not scale alone |
| Rules | Cheap, explainable weak labels | Brittle for ambiguity |
| Entity linking | Resolving ambiguous names | Can be noisy or expensive |
| Embeddings | Clustering, similarity, suggestions | Not reliable as ground truth |
| Zero-shot classification | Cold-start label suggestions | Sensitive to label wording |
| Human review | Resolving uncertain cases | Expensive |
| Fine-tuned classifier | Final scalable prediction | Needs good labels first |

Weak supervision is a good fit because your current entity labels are noisy evidence. Snorkel formalizes this pattern: users write labeling functions that express heuristics, patterns, distant supervision, or other weak supervision sources, then combine and denoise their outputs into training labels. (arXiv)


6. Practical weak-labeling pipeline

Step 1: Rename the field

Do this in your code and documentation:

topics → entities_raw

Avoid calling it labels. That prevents accidental misuse.


Step 2: Normalize entity strings

Create an alias table:

raw,canonical
BMA,British Medical Association
British Medical Association (BMA),British Medical Association
the BMA,British Medical Association
NHS,National Health Service
M1,M1 motorway

This turns:

["Doctors", "NHS", "British Medical Association (BMA)"]

into:

["doctor", "National Health Service", "British Medical Association"]
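
The alias lookup can be sketched in a few lines. This is a minimal illustration, assuming the alias table fits in memory as a dict; the helper name normalize_entities and the extra "doctors" alias row are assumptions added here so the example round-trips, not part of the table above.

```python
# Alias table from the CSV above, keyed by lower-cased raw string.
# The "doctors" row is an assumed extra entry, added for illustration.
ALIASES = {
    "bma": "British Medical Association",
    "british medical association (bma)": "British Medical Association",
    "the bma": "British Medical Association",
    "nhs": "National Health Service",
    "m1": "M1 motorway",
    "doctors": "doctor",
}


def normalize_entities(raw_entities):
    """Map raw strings to canonical names, dropping duplicates but keeping order."""
    seen = set()
    normalized = []
    for raw in raw_entities:
        key = " ".join(raw.split()).lower()  # collapse whitespace, ignore case
        canonical = ALIASES.get(key, raw.strip())  # unknown strings pass through
        if canonical not in seen:
            seen.add(canonical)
            normalized.append(canonical)
    return normalized


print(normalize_entities(["Doctors", "NHS", "British Medical Association (BMA)"]))
```

Unknown strings pass through unchanged, so unmapped entities stay visible in the audit instead of silently disappearing.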

Step 3: Separate entity types

Split entities into useful metadata fields:

people
organisations
locations
infrastructure
professions
products
events

Example:

{
  "entities_raw": ["Sutton in Ashfield", "Annesley", "M1 motorway"],
  "locations": ["Sutton in Ashfield", "Annesley"],
  "infrastructure": ["M1 motorway"],
  "candidate_domains": ["Transport & Infrastructure"]
}
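
One possible way to do the split, assuming a hand-built lookup from canonical entity to schema field. The ENTITY_FIELD table, the split_by_type helper, and the "untyped" bucket are all hypothetical names for this sketch; in practice the lookup might come from NER or entity linking.

```python
from collections import defaultdict

# Hypothetical hand-built lookup: canonical entity -> schema field.
ENTITY_FIELD = {
    "Sutton in Ashfield": "locations",
    "Annesley": "locations",
    "M1 motorway": "infrastructure",
    "Elon Musk": "people",
    "Tesla": "organisations",
}


def split_by_type(entities):
    """Group entities into metadata fields; unknown entities go to 'untyped'."""
    fields = defaultdict(list)
    for entity in entities:
        fields[ENTITY_FIELD.get(entity, "untyped")].append(entity)
    return dict(fields)


row = split_by_type(["Sutton in Ashfield", "Annesley", "M1 motorway"])
print(row)
```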

This matters because locations, people, organisations, and roads are different kinds of evidence.


Step 4: Create entity-to-candidate-domain mappings

Example:

{
  "National Health Service": ["Health", "Politics & Government", "Labour"],
  "British Medical Association": ["Health", "Labour"],
  "doctor": ["Health", "Labour"],
  "Tesla": ["Business & Economy", "Technology", "Transport & Infrastructure", "Environment", "Law"],
  "Elon Musk": ["Business & Economy", "Technology", "Media", "Politics & Government", "Law"],
  "M1 motorway": ["Transport & Infrastructure"],
  "Glasgow": []
}

Notice that Glasgow maps to an empty candidate list. It is metadata unless the text proves a domain.
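
A minimal sketch of how the mapping could be applied to a row. It uses a subset of the mapping above, and the candidate_domains helper is an illustrative name. The result is a candidate list only; the final labels still have to come from article context.

```python
# Subset of the entity-to-candidate-domain mapping shown above.
ENTITY_CANDIDATES = {
    "National Health Service": ["Health", "Politics & Government", "Labour"],
    "British Medical Association": ["Health", "Labour"],
    "doctor": ["Health", "Labour"],
    "M1 motorway": ["Transport & Infrastructure"],
    "Glasgow": [],
}


def candidate_domains(entities):
    """Union of candidate domains across entities, in first-seen order."""
    candidates = []
    for entity in entities:
        # Entities missing from the mapping contribute no candidates.
        for domain in ENTITY_CANDIDATES.get(entity, []):
            if domain not in candidates:
                candidates.append(domain)
    return candidates


print(candidate_domains(["doctor", "National Health Service", "British Medical Association"]))
print(candidate_domains(["Glasgow"]))
```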


Step 5: Add article-context rules

Entity mappings alone are not enough. Add rules based on headline and body text.

Health:
hospital, doctor, patient, medical, nurse, NHS, surgery, waiting list, disease, vaccine

Business:
shares, stock, earnings, revenue, profit, investors, market, acquisition, merger

Technology:
AI, software, cybersecurity, chip, app, platform, data centre, algorithm

Transport:
motorway, road, traffic, rail, train, airport, bridge, lane closure

Politics:
minister, parliament, government, council, election, policy, regulation

Crime/Justice:
police, arrested, charged, court, investigation, sentence, stabbing, fraud

Labour:
strike, union, pay dispute, workers, staff, contract, collective bargaining

Think of these rules as votes, not final truth.
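
The voting idea can be sketched with plain regular expressions. This is an assumption-laden illustration: KEYWORD_RULES holds only a subset of the keyword lists above, keyword_votes is a hypothetical helper, and the optional trailing "s" is a crude stand-in for proper lemmatization.

```python
import re

# A subset of the keyword lists above; extend with the remaining domains.
KEYWORD_RULES = {
    "Health": ["hospital", "doctor", "patient", "nurse", "nhs", "waiting list"],
    "Labour": ["strike", "union", "pay dispute", "workers"],
    "Transport & Infrastructure": ["motorway", "road", "traffic", "rail"],
    "Business & Economy": ["shares", "earnings", "revenue", "investors"],
}


def keyword_votes(text):
    """Count keyword hits per domain; each hit is one vote, not a final label."""
    lowered = text.lower()
    votes = {}
    for domain, keywords in KEYWORD_RULES.items():
        hits = 0
        for keyword in keywords:
            # Word-boundary match; the optional "s" lets "doctor" match "doctors".
            pattern = r"\b" + re.escape(keyword) + r"s?\b"
            hits += len(re.findall(pattern, lowered))
        if hits:
            votes[domain] = hits
    return votes


print(keyword_votes("Junior doctors announce strike after NHS pay talks collapse."))
```

The vote counts are meant to be combined with entity candidates and model suggestions downstream, not thresholded into labels on their own.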


Step 6: Add zero-shot or embedding suggestions

Zero-shot classification is useful before you have gold labels. SetFit’s zero-shot guide describes using class names with pretrained Sentence Transformer models to get a baseline without training samples. (Hugging Face) Hugging Face also has a cookbook showing how SetFit suggestions can support annotation workflows, including multi-label questions. (Hugging Face)

Use it like this:

Article:
Junior doctors announce strike after NHS pay talks collapse.

Candidate labels:
Health
Business & Economy
Technology
Politics & Government
Labour
Transport
Crime & Justice

Expected high scores:

{
  "Health": 0.94,
  "Labour": 0.82,
  "Politics & Government": 0.70
}

But do not treat zero-shot output as ground truth. It is sensitive to label wording, candidate label set, thresholds, and domain mismatch.
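
To make the threshold sensitivity concrete, here is a small sketch that turns a score dict into label suggestions. The scores are the illustrative numbers from the example above plus two assumed low scores; they are not output from a real model, and suggest_labels and the 0.6 default are assumptions.

```python
def suggest_labels(scores, threshold=0.6):
    """Return labels whose score clears the threshold, highest score first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


# Illustrative scores copied from the example above, plus two assumed low scores.
scores = {
    "Health": 0.94,
    "Labour": 0.82,
    "Politics & Government": 0.70,
    "Technology": 0.08,
    "Transport": 0.05,
}

print(suggest_labels(scores))                 # three suggestions at 0.6
print(suggest_labels(scores, threshold=0.8))  # Politics & Government drops out
```

Moving the threshold from 0.6 to 0.8 changes the suggested label set, which is one reason to store suggestions separately from reviewed domain_labels.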


Step 7: Use entity linking selectively

Entity linking is useful when strings are ambiguous. spaCy’s EntityLinker disambiguates named-entity mentions to unique identifiers in a knowledge base, grounding text mentions into “real-world” entities using candidate generation and local context. (spaCy)

Good candidates for entity linking:

Tesla → Tesla, Inc. vs Nikola Tesla
Apple → company vs fruit
Jordan → country vs person
Glasgow → city vs club/university context
BMA → British Medical Association vs another abbreviation

Wikidata can help enrich linked entities because its SPARQL service lets you query structured data and export results as XML, JSON, CSV, TSV, and other formats. (Wikidata)

Use entity linking to improve candidate domains, not to bypass article-level context.


7. Apply the pipeline to your examples

Row 1

["Doctors", "NHS", "British Medical Association (BMA)"]

Candidate labels:

Health
Politics & Government
Labour

Final labels depend on context:

NHS waiting lists rise → Health
Junior doctors strike → Health + Labour + Politics
NHS cyberattack → Health + Technology + Law

Row 2

["Glasgow"]

Better representation:

{
  "locations": ["Glasgow"],
  "candidate_domains": [],
  "review_required": true
}

Final label comes from text:

Glasgow hospital story → Health
Glasgow football story → Sport
Glasgow council story → Politics
Glasgow rail story → Transport

Row 3

["Sutton in Ashfield", "Annesley", "M1 motorway"]

Better representation:

{
  "locations": ["Sutton in Ashfield", "Annesley"],
  "infrastructure": ["M1 motorway"],
  "candidate_domains": ["Transport & Infrastructure"]
}

Context decides whether to add more:

Crash on M1 → Transport + Crime/Justice/Public Safety
Roadworks on M1 → Transport
Planning near M1 → Politics + Transport + Business

Row 4

["Elon Musk", "Tesla"]

Better representation:

{
  "people": ["Elon Musk"],
  "organisations": ["Tesla"],
  "candidate_domains": [
    "Business & Economy",
    "Technology",
    "Transport & Infrastructure",
    "Environment",
    "Law",
    "Politics & Government"
  ],
  "requires_context": true
}

Context examples:

Tesla shares fall → Business
Autopilot update → Technology + Transport
Crash investigation → Transport + Law + Technology
Election-policy comments → Politics + Media/Technology

8. Model training after label construction

Once you have domain_labels, the final model is a normal multi-label document classifier.

Use a fixed label order:

label_names = [
    "Health",
    "Business & Economy",
    "Technology",
    "Politics & Government",
    "Crime & Justice",
    "Transport & Infrastructure",
    "Education",
    "Science",
    "Environment",
    "Sport",
    "Arts & Entertainment",
    "Society",
    "Conflict & Security",
    "Law",
    "Religion",
    "Labour",
    "Other / Unclear",
]

Example multi-hot label vector:

# Health + Politics & Government + Labour
labels = [
    1.0,  # Health
    0.0,  # Business & Economy
    0.0,  # Technology
    1.0,  # Politics & Government
    0.0,  # Crime & Justice
    0.0,  # Transport & Infrastructure
    0.0,  # Education
    0.0,  # Science
    0.0,  # Environment
    0.0,  # Sport
    0.0,  # Arts & Entertainment
    0.0,  # Society
    0.0,  # Conflict & Security
    0.0,  # Law
    0.0,  # Religion
    1.0,  # Labour
    0.0,  # Other / Unclear
]
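
A minimal sketch of the conversion from readable domain_labels to the fixed-length vector, assuming the label order shown above. The to_multi_hot helper is an illustrative name; it raises on labels outside the taxonomy so that stray values fail loudly instead of being silently dropped.

```python
LABEL_NAMES = [
    "Health", "Business & Economy", "Technology", "Politics & Government",
    "Crime & Justice", "Transport & Infrastructure", "Education", "Science",
    "Environment", "Sport", "Arts & Entertainment", "Society",
    "Conflict & Security", "Law", "Religion", "Labour", "Other / Unclear",
]
LABEL_TO_ID = {name: i for i, name in enumerate(LABEL_NAMES)}


def to_multi_hot(domain_labels):
    """Encode a list of label names as a float multi-hot vector."""
    vector = [0.0] * len(LABEL_NAMES)
    for label in domain_labels:
        if label not in LABEL_TO_ID:
            raise ValueError(f"Unknown label: {label!r}")
        vector[LABEL_TO_ID[label]] = 1.0
    return vector


vec = to_multi_hot(["Health", "Politics & Government", "Labour"])
print([i for i, v in enumerate(vec) if v == 1.0])  # positions of the active labels
```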

Hugging Face sequence-classification configs support problem_type="multi_label_classification", and Hugging Face forum guidance for Trainer points to setting this problem type for multi-label classification. (Hugging Face)

Example:

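from transformers import AutoModelForSequenceClassification

# problem_type="multi_label_classification" makes the model use
# BCEWithLogitsLoss, so the labels column must hold float multi-hot vectors.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(label_names),  # one output logit per domain
    id2label={i: label for i, label in enumerate(label_names)},
    label2id={label: i for i, label in enumerate(label_names)},
    problem_type="multi_label_classification",
)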

Start with a text-only classifier, then compare against a version that appends normalized entities to the input:

Title: ...
Entities: NHS; British Medical Association; doctor
Body: ...

Entity injection may help short articles, but it can also cause overfitting to famous names.
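
Both variants can share one formatting helper so the comparison stays fair. This is a sketch assuming the Title/Entities/Body layout above; build_input is a hypothetical name.

```python
def build_input(title, body, entities=None):
    """Format one training text; omit the Entities line for the text-only baseline."""
    parts = [f"Title: {title}"]
    if entities:
        parts.append("Entities: " + "; ".join(entities))
    parts.append(f"Body: {body}")
    return "\n".join(parts)


text_only = build_input("NHS talks collapse", "Junior doctors announce strike.")
with_entities = build_input(
    "NHS talks collapse",
    "Junior doctors announce strike.",
    entities=["NHS", "British Medical Association", "doctor"],
)
print(with_entities)
```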


9. Splitting and evaluation

Use multi-label-aware splitting. The iterative-stratification project provides scikit-learn-compatible cross validators for stratification in multilabel data, and its code comments describe preserving label percentages across folds. (GitHub)

Evaluate with more than one metric:

micro-F1
macro-F1
per-label precision
per-label recall
per-label F1
hamming loss
precision@k
manual error analysis

scikit-learn’s F1 documentation defines F1 as the harmonic mean of precision and recall, and it supports different averaging modes such as micro, macro, weighted, samples, or per-label scoring. (scikit-learn)

Do not rely only on micro-F1. It can look good while rare labels fail badly. Always inspect per-label scores.
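
A toy worked example of that failure mode, written in plain Python so the arithmetic is visible. The two-label data is invented for illustration: a common label predicted perfectly and a rare label never predicted. In practice scikit-learn's f1_score with the averaging modes mentioned above would be the usual tool.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1, per_label_f1) for 0/1 label matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    per_label = [f1_from_counts(*c) for c in counts]
    micro = f1_from_counts(*(sum(col) for col in zip(*counts)))  # pooled counts
    macro = sum(per_label) / n_labels  # unweighted mean over labels
    return micro, macro, per_label


# Columns: [common label, rare label]. The rare label is never predicted.
y_true = [[1, 0]] * 8 + [[0, 1]] * 2
y_pred = [[1, 0]] * 8 + [[0, 0]] * 2

micro, macro, per_label = multilabel_f1(y_true, y_pred)
print(round(micro, 3), macro, per_label)
```

Here micro-F1 is 16/18 (about 0.889) while macro-F1 is 0.5 and the rare label scores 0.0, which is the gap the per-label view exposes.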

Also tune thresholds per label rather than blindly using 0.5:

Health: 0.42
Business & Economy: 0.55
Technology: 0.62
Politics & Government: 0.48
Transport & Infrastructure: 0.50
Labour: 0.38

These are examples; real thresholds should come from validation data.
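
One simple way to derive such thresholds is a grid search per label on validation predictions, keeping the threshold with the best F1. This is a sketch under assumptions: tune_threshold is a hypothetical helper, the 0.05-step grid is arbitrary, and the five probabilities are made up.

```python
def tune_threshold(probs, truths, grid=None):
    """Pick the grid threshold that maximizes F1 for a single label."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(1 for p, y in zip(probs, truths) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, truths) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, truths) if p < t and y == 1)
        denom = 2 * tp + fp + fn
        score = 2 * tp / denom if denom else 0.0
        if score > best_f1:  # strict ">" keeps the lowest threshold among ties
            best_t, best_f1 = t, score
    return best_t, best_f1


# Made-up validation probabilities and gold labels for one domain.
probs = [0.9, 0.45, 0.4, 0.3, 0.1]
truths = [1, 1, 1, 0, 0]

best_t, best_f1 = tune_threshold(probs, truths)
print(best_t, best_f1)
```

With these numbers a fixed 0.5 cut-off would miss two of the three positives, while the tuned threshold recovers all of them. Run the search once per label and store the result alongside the model.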


10. Human review: where to spend effort

Do not manually review everything. Review the rows where automation is least reliable:

location-only rows
rows with no confident label
rows with many candidate labels
rows involving ambiguous high-frequency entities
rows where rules and zero-shot disagree
rows assigned Other / Unclear
rare-label examples
random audit samples
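
Several of these criteria can be checked mechanically. The sketch below is illustrative only: review_reasons, the reason strings, and the max_candidates cut-off are assumptions, and it does not cover rare-label or random audit sampling, which need dataset-level statistics.

```python
def review_reasons(row, rule_labels, zero_shot_labels, max_candidates=4):
    """Return the reasons a row should be queued for human review (empty = skip)."""
    reasons = []
    non_location = (
        row.get("people", [])
        + row.get("organisations", [])
        + row.get("infrastructure", [])
    )
    if row.get("locations") and not non_location:
        reasons.append("location_only")
    if not rule_labels and not zero_shot_labels:
        reasons.append("no_confident_label")
    if len(row.get("candidate_domains", [])) > max_candidates:
        reasons.append("many_candidates")
    if set(rule_labels) != set(zero_shot_labels):
        reasons.append("rules_zero_shot_disagree")
    if "Other / Unclear" in set(rule_labels) | set(zero_shot_labels):
        reasons.append("other_unclear")
    return reasons


glasgow_row = {"locations": ["Glasgow"], "candidate_domains": []}
print(review_reasons(glasgow_row, rule_labels=[], zero_shot_labels=["Health"]))
```

Storing the reasons, not just a boolean flag, makes it easier to see which criterion is filling the review queue.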

A practical target:

Minimum: 500 reviewed examples
Better: 1,000–2,000 reviewed examples
Strong: 5,000+ reviewed examples

The reviewed set is essential. Without it, you cannot tell whether the model is learning meaningful domains or merely reproducing noisy rules.


11. Useful reference dataset pattern

A relevant schema reference is DWIE on Hugging Face. It is an entity-centric document-level information extraction dataset with named entities, coreference, relations, entity linking, and document-level IPTC classification codes; the dataset card describes 23,130 entities, 311 multi-label entity tags, and entity links to Wikipedia. (Hugging Face)

The important lesson from that style of dataset is structural:

mentions/entities/entity links ≠ article-level topic labels

That is exactly the separation your dataset needs.


12. Main pitfalls

| Pitfall | Why it hurts | Better approach |
| --- | --- | --- |
| Training directly on entities | Creates huge sparse label space | Use entities as metadata |
| One entity → one domain | Fails on ambiguous entities | Entity → candidate domains |
| Location → Geography | Mislabels local news | Store locations separately |
| Too many labels too early | Sparse positives, weak metrics | Start with coarse domains |
| No reviewed validation set | No trusted evaluation | Review uncertain + random samples |
| One global threshold | Over/under-predicts labels | Tune thresholds per label |
| No label provenance | Hard to debug changes | Store label source and versions |

13. Recommended first-month plan

Week 1 — Audit and taxonomy

Produce:

top entities
entity frequency distribution
rows with only locations
rows with only people
rows with only organisations
rows with 5+ topics
rare one-off topic rate
taxonomy_v0.md
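
Part of this audit can be sketched with collections.Counter. The audit_topics helper and its output keys are hypothetical, and the fifth row in the sample data is an assumed extra row so that some entities repeat.

```python
from collections import Counter


def audit_topics(rows, many=5):
    """Summarize the raw topics column: frequencies, long rows, one-off rate."""
    counts = Counter(topic for row in rows for topic in row)
    one_offs = sum(1 for c in counts.values() if c == 1)
    return {
        "top_entities": counts.most_common(3),
        "rows_with_many_topics": sum(1 for row in rows if len(row) >= many),
        "one_off_rate": one_offs / len(counts) if counts else 0.0,
    }


rows = [
    ["Doctors", "NHS", "British Medical Association (BMA)"],
    ["Glasgow"],
    ["Sutton in Ashfield", "Annesley", "M1 motorway"],
    ["Elon Musk", "Tesla"],
    ["NHS", "Glasgow"],  # assumed extra row, not from the original examples
]

report = audit_topics(rows)
print(report)
```

A high one-off rate is a quick signal that the raw column is an open-ended entity vocabulary and not a closed label set.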

Week 2 — Weak labels

Create:

entity_aliases.csv
entity_candidate_domains.csv
keyword_rules.py
domain_labels_weak
candidate_domain_scores
review_required

Week 3 — Review

Review 500–1,000 examples, prioritizing:

ambiguous entities
location-only rows
rule disagreements
Other / Unclear rows
rare labels

Week 4 — Train baseline

Train and compare:

text-only classifier
text + normalized entity list classifier

Evaluate with micro-F1, macro-F1, per-label metrics, threshold tuning, and manual error analysis.


Direct answers

1. Best way to map entities → domains at scale?

Use a hybrid pipeline:

entity normalization
+ entity typing/linking
+ manual mappings for frequent entities
+ rule-based weak labels
+ article-text evidence
+ zero-shot/embedding suggestions
+ human review for uncertain cases
+ final supervised classifier

The key rule is:

entity → candidate domains
article context → final domains

2. Manual, rule-based, or embedding-based?

Use all three.

Manual:
taxonomy, high-frequency entities, validation set

Rules:
explainable weak labels

Embeddings:
similarity, clustering, candidate suggestions

Zero-shot:
cold-start label suggestions

Human review:
ambiguous/high-impact examples

Fine-tuned classifier:
final scalable prediction

Embeddings and zero-shot models are useful assistants, not ground truth.


3. How to handle ambiguous entities?

Use many-to-many mappings, context resolution, confidence scores, and review queues.

{
  "entity": "Tesla",
  "candidate_domains": [
    "Business & Economy",
    "Technology",
    "Transport & Infrastructure",
    "Environment",
    "Law"
  ],
  "requires_context": true
}

The same entity can produce different labels in different articles. That is correct.


4. Is this still classification?

Yes, but only after restructuring.

Right now, the problem is:

label construction from noisy entity-style metadata

After that, it becomes:

multi-label document classification

The final model is ordinary. The hard part is building reliable labels.


Compact final recommendation

Build the dataset like this:

1. Rename current topics to entities_raw.
2. Normalize entity strings.
3. Separate people, organisations, locations, infrastructure, products.
4. Treat locations as metadata unless the article is truly about place/geography.
5. Define a small news-domain taxonomy.
6. Map entities to candidate domains, not fixed final domains.
7. Use article text to resolve ambiguity.
8. Generate weak labels with rules + zero-shot/embedding suggestions.
9. Human-review uncertain and high-impact rows.
10. Store final labels as domain_labels.
11. Convert domain_labels to fixed multi-hot float vectors.
12. Train a Hugging Face multi-label classifier.
13. Evaluate per label and tune thresholds.
14. Version taxonomy, rules, labels, and dataset files.

Short summary

  • Your current topics are mostly entities, not true labels.
  • Keep them as entities_raw; create a new domain_labels field.
  • Use IPTC-style news taxonomy thinking, but start with a smaller custom taxonomy.
  • Do not map every entity to exactly one domain.
  • Treat ambiguous entities as candidate-domain generators.
  • Treat locations as metadata, not automatic Geography.
  • Use weak supervision plus selective human review.
  • Then train a normal Hugging Face multi-label classifier.
