
When Your “Labels” Aren’t Really Labels: Dealing with Entity-Based NLP Datasets

Hugging Face Forums [Unofficial] April 26, 2026
I’m working on an NLP news classification task, but my dataset is structured in an unusual way. Each article has multiple “topics” per row, but these topics are actually named entities, not true categories.

For example:
- Row 1 topics: [“Doctors”, “NHS”, “British Medical Association (BMA)”] → clearly belongs to a broader domain like Health
- Row 2 topics: [“Glasgow”] → a location
- Row 3 topics: [“Sutton in Ashfield”, “Annesley”, “M1 motorway”] → places/infrastructure
- Row 4 topics: [“Elon Musk”, “Tesla”] → could belong to Business or Technology

So the problem is:
- My labels are inconsistent and too granular (entities instead of domains).
- Each row has multi-label outputs, but they don’t directly map to meaningful categories.
- There is no predefined mapping from entities → domains.
- Some entities are ambiguous (e.g., Elon Musk could be Business or Tech).

What I’m trying to do:
- Convert these entity-level labels into higher-level domains (like Health, Business, Tech, Geography, etc.).
- Then train a multi-label classifier on those domains.

My main questions:
1. What is the best way to map entities → domains at scale?
2. Should this mapping be manual, rule-based, or embedding-based?
3. How should I handle ambiguous entities that can belong to multiple domains?
4. Is this still a classification problem, or should I rethink it entirely?

Any guidance on restructuring this dataset or designing a proper pipeline would help.
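To make the target format concrete, here is a minimal sketch of the entity → domain conversion described above, using the four example rows. It assumes a small hand-written lookup; the `ENTITY_TO_DOMAINS` table, the `DOMAINS` list, and both helper functions are hypothetical names with illustrative assignments, not something that exists in the dataset. Ambiguous entities simply map to more than one domain, and entities without a mapping are collected separately so they can be reviewed.

```python
# Hypothetical lookup table: entity -> set of domains.
# The assignments are illustrative assumptions, not ground truth.
ENTITY_TO_DOMAINS = {
    "Doctors": {"Health"},
    "NHS": {"Health"},
    "British Medical Association (BMA)": {"Health"},
    "Glasgow": {"Geography"},
    "Sutton in Ashfield": {"Geography"},
    "Annesley": {"Geography"},
    "M1 motorway": {"Geography"},
    # Ambiguous entities keep every plausible domain.
    "Elon Musk": {"Business", "Tech"},
    "Tesla": {"Business", "Tech"},
}

# Fixed domain order so every row yields a vector of the same shape.
DOMAINS = ["Business", "Geography", "Health", "Tech"]


def entities_to_domains(entities):
    """Return (sorted domain labels, entities with no known mapping)."""
    domains = set()
    unmapped = []
    for entity in entities:
        if entity in ENTITY_TO_DOMAINS:
            domains |= ENTITY_TO_DOMAINS[entity]
        else:
            unmapped.append(entity)
    return sorted(domains), unmapped


def to_multi_hot(domains):
    """Encode domain labels as a 0/1 vector aligned with DOMAINS."""
    return [1 if d in domains else 0 for d in DOMAINS]


rows = [
    ["Doctors", "NHS", "British Medical Association (BMA)"],
    ["Glasgow"],
    ["Sutton in Ashfield", "Annesley", "M1 motorway"],
    ["Elon Musk", "Tesla"],
]

for entities in rows:
    domains, unmapped = entities_to_domains(entities)
    print(entities, "->", domains, to_multi_hot(domains), "unmapped:", unmapped)
```

The multi-hot vectors are the kind of target a multi-label classifier would train on. Whether the lookup should be built by hand, by rules, or from embeddings is exactly the open question; this only shows the shape of the output.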
