Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihlr66kwmikjhiqmvdz6tllydmy22ejvz5qfuqda6pw7mfxrhqwxy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkfphzwlkuo2"
  },
  "path": "/t/when-your-labels-aren-t-really-labels-dealing-with-entity-based-nlp-datasets/175571#post_1",
  "publishedAt": "2026-04-26T12:55:56.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "I’m working on an NLP news classification task, but my dataset is structured in an unusual way.\n\nEach article has multiple “topics” per row, but these topics are actually named entities, not true categories. For example:\n\nRow 1 topics: [“Doctors”, “NHS”, “British Medical Association (BMA)”] → clearly belongs to a broader domain like Health Row 2 topic: [“Glasgow”] → a location Row 3 topics: [“Sutton in Ashfield”, “Annesley”, “M1 motorway”] → places/infrastructure Row 4 topics: [“Elon Musk”, “Tesla”] → could belong to Business or Technology\n\nSo the problem is:\n\nMy labels are inconsistent and too granular (entities instead of domains) Each row has multi-label outputs, but they don’t directly map to meaningful categories. There is no predefined mapping from entities → domain Some entities are ambiguous (e.g., Elon Musk could be Business or Tech)\n\nWhat I’m trying to do:\n\nConvert these entity-level labels into higher-level domains (like Health, Business, Tech, Geography, etc.) Then train a multi-label classifier on those domains\n\nMy main questions:\n\n  1. What is the best way to map entities → domains at scale?\n\n  2. Should this mapping be manual, rule-based, or embedding-based?\n\n  3. How should I handle ambiguous entities that can belong to multiple domains?\n\n  4. Is this still a classification problem, or should I rethink it entirely?\n\n\n\n\nAny guidance on restructuring this dataset or designing a proper pipeline would help.domain.",
  "title": "When Your “Labels” Aren’t Really Labels: Dealing with Entity-Based NLP Datasets"
}