Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigfzh24nitx2ouq56qp3lejf6k7cd5b6xbzzl3p6cn3wtkrvy6pku",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkgqzq3rbtl2"
  },
  "path": "/t/when-your-labels-aren-t-really-labels-dealing-with-entity-based-nlp-datasets/175571#post_2",
  "publishedAt": "2026-04-26T22:58:25.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "IPTC",
    "Hugging Face",
    "arXiv",
    "spaCy",
    "Wikipedia Data",
    "GitHub",
    "scikit-learn"
  ],
  "textContent": "for now:\n\n* * *\n\n# When your “labels” are really entities: how to restructure the dataset\n\nYour current `topics` column should **not** be treated as the final supervised label column. In the examples you gave, the values are mostly **entities** , **places** , **organisations** , **people** , **professions** , or **infrastructure names** :\n\n\n    [\"Doctors\", \"NHS\", \"British Medical Association (BMA)\"]\n    [\"Glasgow\"]\n    [\"Sutton in Ashfield\", \"Annesley\", \"M1 motorway\"]\n    [\"Elon Musk\", \"Tesla\"]\n\n\nThose values answer:\n\n\n    Which named things are mentioned?\n\n\nBut your target task asks:\n\n\n    What broad news domain is this article about?\n\n\nThose are different annotation layers. The right move is not to train directly on the current `topics` field. The right move is to treat it as **weak evidence** for a new, cleaner field such as `domain_labels`.\n\n* * *\n\n## 1. Reframe the task\n\nA better framing is:\n\n\n    Multi-label news-domain classification using entity-derived weak supervision.\n\n\nThat means:\n\nLayer | Example | Role\n---|---|---\nRaw entity/topic hints | `NHS`, `Glasgow`, `Elon Musk`, `Tesla` | Metadata / weak evidence\nEntity type | organisation, person, location, road | Helps interpret the hint\nCandidate domain | Health, Business, Technology, Transport | Possible labels\nFinal article label | Health + Labour + Politics | Supervised target\n\nSo the pipeline should be:\n\n\n    article text\n    + current entity-style topics\n    + entity normalization\n    + entity typing/linking\n    + text/context rules\n    + zero-shot/embedding suggestions\n    + selective human review\n    → domain_labels\n    → multi-label classifier\n\n\nnot:\n\n\n    topics column\n    → train classifier directly\n\n\n* * *\n\n## 2. Start with a taxonomy, not a model\n\nBefore mapping anything, define the target domains. 
Otherwise, ambiguous categories like `Business`, `Technology`, `Transport`, `Science`, and `Geography` will overlap inconsistently.\n\nFor news, **IPTC Media Topics** is a strong reference taxonomy: it is a constantly updated news-focused taxonomy of 1,200+ terms, originally based on IPTC Subject Codes, first released in 2010, and updated at least annually. (IPTC)\n\nYou do **not** need the full IPTC hierarchy at first. Start with a compact top-level taxonomy:\n\n\n    Health\n    Business & Economy\n    Technology\n    Politics & Government\n    Crime & Justice\n    Transport & Infrastructure\n    Education\n    Science\n    Environment\n    Sport\n    Arts & Entertainment\n    Society\n    Conflict & Security\n    Law\n    Religion\n    Labour\n    Other / Unclear\n\n\n### Important design choice: avoid automatic “Geography”\n\nA place mention is usually a **where**, not a **what**.\n\n\n    Glasgow hospital waiting times rise → Health\n    Glasgow council budget approved → Politics & Government\n    Glasgow Warriors win final → Sport\n    Glasgow rail disruption continues → Transport & Infrastructure\n\n\nSo `Glasgow` should usually become:\n\n\n    {\n      \"locations\": [\"Glasgow\"],\n      \"domain_labels\": []\n    }\n\n\nnot:\n\n\n    {\n      \"domain_labels\": [\"Geography\"]\n    }\n\n\nUse a `Geography / Places` label only if the article is genuinely about geography, land use, maps, demography, regional identity, tourism geography, or places as the subject.\n\n* * *\n\n## 3. Map entities to candidate domains, not final domains\n\nThe main mistake to avoid is this:\n\n\n    NHS → Health\n    Tesla → Technology\n    Elon Musk → Business\n    Glasgow → Geography\n\n\nThat looks clean, but it is too rigid. 
A better mapping is:\n\n\n    entity → candidate domains\n    article context → final domains\n\n\n### Example: NHS\n\n`NHS` is a strong Health signal, but not only Health.\n\n\n    NHS waiting lists rise → Health\n    NHS strike talks collapse → Health + Labour + Politics\n    NHS funding dispute → Health + Politics + Business/Economy\n    NHS cyberattack → Health + Technology + Law\n\n\nSo encode:\n\n\n    {\n      \"National Health Service\": [\n        \"Health\",\n        \"Politics & Government\",\n        \"Labour\"\n      ]\n    }\n\n\n### Example: Elon Musk / Tesla\n\n`Elon Musk` and `Tesla` are especially ambiguous:\n\n\n    {\n      \"Tesla\": [\n        \"Business & Economy\",\n        \"Technology\",\n        \"Transport & Infrastructure\",\n        \"Environment\",\n        \"Law\"\n      ],\n      \"Elon Musk\": [\n        \"Business & Economy\",\n        \"Technology\",\n        \"Media\",\n        \"Politics & Government\",\n        \"Law\",\n        \"Science\"\n      ]\n    }\n\n\nThen article context decides:\n\nArticle context | Better labels\n---|---\nTesla shares fall after weak earnings | Business & Economy\nTesla announces self-driving update | Technology + Transport\nTesla crash investigated by regulator | Transport + Law + Technology\nMusk comments on election policy | Politics + Media/Technology\nTesla factory expansion approved | Business + Politics + Environment\n\n* * *\n\n## 4. 
Recommended dataset schema\n\nKeep the raw entity field, but separate it from the final labels.\n\n### Rich development schema\n\nUse this while building and auditing labels:\n\n\n    {\n      \"article_id\": \"article_001\",\n      \"title\": \"Junior doctors announce strike after NHS talks collapse\",\n      \"body\": \"...\",\n\n      \"entities_raw\": [\n        \"Doctors\",\n        \"NHS\",\n        \"British Medical Association (BMA)\"\n      ],\n\n      \"entities_normalized\": [\n        \"doctor\",\n        \"National Health Service\",\n        \"British Medical Association\"\n      ],\n\n      \"entity_types\": {\n        \"doctor\": [\"profession\"],\n        \"National Health Service\": [\"healthcare organisation\", \"public body\"],\n        \"British Medical Association\": [\"professional association\", \"medical organisation\"]\n      },\n\n      \"people\": [],\n      \"organisations\": [\n        \"National Health Service\",\n        \"British Medical Association\"\n      ],\n      \"locations\": [],\n\n      \"candidate_domain_scores\": {\n        \"Health\": 0.96,\n        \"Politics & Government\": 0.74,\n        \"Labour\": 0.71\n      },\n\n      \"domain_labels\": [\n        \"Health\",\n        \"Politics & Government\",\n        \"Labour\"\n      ],\n\n      \"label_source\": \"rules_plus_review\",\n      \"review_status\": \"reviewed\",\n      \"taxonomy_version\": \"v0.1\",\n      \"labeling_rules_version\": \"v0.3\"\n    }\n\n\n### Training schema\n\nFor Hugging Face model training, simplify to:\n\n\n    {\n      \"text\": \"Junior doctors announce strike after NHS talks collapse. ...\",\n      \"domain_labels\": [\"Health\", \"Politics & Government\", \"Labour\"],\n      \"labels\": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]\n    }\n\n\nThe readable `domain_labels` list is for humans. 
The fixed-length `labels` vector is for the model.\n\nFor Hugging Face datasets, prefer scriptless CSV/JSON/Parquet storage. Hugging Face Datasets can load local or remote CSV, JSON, TXT, and Parquet files with `load_dataset()`, and the docs note that a custom dataset loading script is usually unnecessary for formats such as CSV, JSON, JSON Lines, text, images, audio, or Parquet. (Hugging Face)\n\nA good repository layout:\n\n\n    README.md\n    data/train.parquet\n    data/validation.parquet\n    data/test.parquet\n\n\n* * *\n\n## 5. How to generate domain labels at scale\n\nUse a **hybrid weak-supervision pipeline**. Do not choose only manual, only rules, or only embeddings.\n\nMethod | Best role | Weakness\n---|---|---\nManual mapping | Taxonomy design; top frequent entities; validation set | Does not scale alone\nRules | Cheap, explainable weak labels | Brittle for ambiguity\nEntity linking | Resolving ambiguous names | Can be noisy or expensive\nEmbeddings | Clustering, similarity, suggestions | Not reliable as ground truth\nZero-shot classification | Cold-start label suggestions | Sensitive to label wording\nHuman review | Resolving uncertain cases | Expensive\nFine-tuned classifier | Final scalable prediction | Needs good labels first\n\nWeak supervision is a good fit because your current entity labels are noisy evidence. Snorkel formalizes this pattern: users write labeling functions that express heuristics, patterns, distant supervision, or other weak supervision sources, then combine and denoise their outputs into training labels. (arXiv)\n\n* * *\n\n## 6. Practical weak-labeling pipeline\n\n### Step 1: Rename the field\n\nDo this in your code and documentation:\n\n\n    topics → entities_raw\n\n\nAvoid calling it `labels`. 
That prevents accidental misuse.\n\n* * *\n\n### Step 2: Normalize entity strings\n\nCreate an alias table:\n\n\n    raw,canonical\n    BMA,British Medical Association\n    British Medical Association (BMA),British Medical Association\n    the BMA,British Medical Association\n    NHS,National Health Service\n    M1,M1 motorway\n\n\nThis turns:\n\n\n    [\"Doctors\", \"NHS\", \"British Medical Association (BMA)\"]\n\n\ninto:\n\n\n    [\"doctor\", \"National Health Service\", \"British Medical Association\"]\n\n\n* * *\n\n### Step 3: Separate entity types\n\nSplit entities into useful metadata fields:\n\n\n    people\n    organisations\n    locations\n    infrastructure\n    professions\n    products\n    events\n\n\nExample:\n\n\n    {\n      \"entities_raw\": [\"Sutton in Ashfield\", \"Annesley\", \"M1 motorway\"],\n      \"locations\": [\"Sutton in Ashfield\", \"Annesley\"],\n      \"infrastructure\": [\"M1 motorway\"],\n      \"candidate_domains\": [\"Transport & Infrastructure\"]\n    }\n\n\nThis matters because locations, people, organisations, and roads are different kinds of evidence.\n\n* * *\n\n### Step 4: Create entity-to-candidate-domain mappings\n\nExample:\n\n\n    {\n      \"National Health Service\": [\"Health\", \"Politics & Government\", \"Labour\"],\n      \"British Medical Association\": [\"Health\", \"Labour\"],\n      \"doctor\": [\"Health\", \"Labour\"],\n      \"Tesla\": [\"Business & Economy\", \"Technology\", \"Transport & Infrastructure\", \"Environment\", \"Law\"],\n      \"Elon Musk\": [\"Business & Economy\", \"Technology\", \"Media\", \"Politics & Government\", \"Law\"],\n      \"M1 motorway\": [\"Transport & Infrastructure\"],\n      \"Glasgow\": []\n    }\n\n\nNotice that `Glasgow` maps to an empty candidate list. It is metadata unless the text proves a domain.\n\n* * *\n\n### Step 5: Add article-context rules\n\nEntity mappings alone are not enough. 
Add rules based on headline and body text.\n\n\n    Health:\n    hospital, doctor, patient, medical, nurse, NHS, surgery, waiting list, disease, vaccine\n\n    Business:\n    shares, stock, earnings, revenue, profit, investors, market, acquisition, merger\n\n    Technology:\n    AI, software, cybersecurity, chip, app, platform, data centre, algorithm\n\n    Transport:\n    motorway, road, traffic, rail, train, airport, bridge, lane closure\n\n    Politics:\n    minister, parliament, government, council, election, policy, regulation\n\n    Crime/Justice:\n    police, arrested, charged, court, investigation, sentence, stabbing, fraud\n\n    Labour:\n    strike, union, pay dispute, workers, staff, contract, collective bargaining\n\n\nThink of these rules as **votes**, not final truth.\n\n* * *\n\n### Step 6: Add zero-shot or embedding suggestions\n\nZero-shot classification is useful before you have gold labels. SetFit’s zero-shot guide describes using class names with pretrained Sentence Transformer models to get a baseline without training samples. (Hugging Face) Hugging Face also has a cookbook showing how SetFit suggestions can support annotation workflows, including multi-label questions. (Hugging Face)\n\nUse it like this:\n\n\n    Article:\n    Junior doctors announce strike after NHS pay talks collapse.\n\n    Candidate labels:\n    Health\n    Business & Economy\n    Technology\n    Politics & Government\n    Labour\n    Transport\n    Crime & Justice\n\n\nExpected high scores:\n\n\n    {\n      \"Health\": 0.94,\n      \"Labour\": 0.82,\n      \"Politics & Government\": 0.70\n    }\n\n\nBut do not treat zero-shot output as ground truth. It is sensitive to label wording, candidate label set, thresholds, and domain mismatch.\n\n* * *\n\n### Step 7: Use entity linking selectively\n\nEntity linking is useful when strings are ambiguous. 
spaCy’s `EntityLinker` disambiguates named-entity mentions to unique identifiers in a knowledge base, grounding text mentions into “real-world” entities using candidate generation and local context. (spaCy)\n\nGood candidates for entity linking:\n\n\n    Tesla → Tesla, Inc. vs Nikola Tesla\n    Apple → company vs fruit\n    Jordan → country vs person\n    Glasgow → city vs club/university context\n    BMA → British Medical Association vs another abbreviation\n\n\nWikidata can help enrich linked entities because its SPARQL service lets you query structured data and export results as XML, JSON, CSV, TSV, and other formats. (Wikipedia Data)\n\nUse entity linking to improve candidate domains, not to bypass article-level context.\n\n* * *\n\n## 7. Apply the pipeline to your examples\n\n### Row 1\n\n\n    [\"Doctors\", \"NHS\", \"British Medical Association (BMA)\"]\n\n\nCandidate labels:\n\n\n    Health\n    Politics & Government\n    Labour\n\n\nFinal labels depend on context:\n\n\n    NHS waiting lists rise → Health\n    Junior doctors strike → Health + Labour + Politics\n    NHS cyberattack → Health + Technology + Law\n\n\n* * *\n\n### Row 2\n\n\n    [\"Glasgow\"]\n\n\nBetter representation:\n\n\n    {\n      \"locations\": [\"Glasgow\"],\n      \"candidate_domains\": [],\n      \"review_required\": true\n    }\n\n\nFinal label comes from text:\n\n\n    Glasgow hospital story → Health\n    Glasgow football story → Sport\n    Glasgow council story → Politics\n    Glasgow rail story → Transport\n\n\n* * *\n\n### Row 3\n\n\n    [\"Sutton in Ashfield\", \"Annesley\", \"M1 motorway\"]\n\n\nBetter representation:\n\n\n    {\n      \"locations\": [\"Sutton in Ashfield\", \"Annesley\"],\n      \"infrastructure\": [\"M1 motorway\"],\n      \"candidate_domains\": [\"Transport & Infrastructure\"]\n    }\n\n\nContext decides whether to add more:\n\n\n    Crash on M1 → Transport + Crime/Justice/Public Safety\n    Roadworks on M1 → Transport\n    Planning near M1 → Politics + 
Transport + Business\n\n\n* * *\n\n### Row 4\n\n\n    [\"Elon Musk\", \"Tesla\"]\n\n\nBetter representation:\n\n\n    {\n      \"people\": [\"Elon Musk\"],\n      \"organisations\": [\"Tesla\"],\n      \"candidate_domains\": [\n        \"Business & Economy\",\n        \"Technology\",\n        \"Transport & Infrastructure\",\n        \"Environment\",\n        \"Law\",\n        \"Politics & Government\"\n      ],\n      \"requires_context\": true\n    }\n\n\nContext examples:\n\n\n    Tesla shares fall → Business\n    Autopilot update → Technology + Transport\n    Crash investigation → Transport + Law + Technology\n    Election-policy comments → Politics + Media/Technology\n\n\n* * *\n\n## 8. Model training after label construction\n\nOnce you have `domain_labels`, the final model is a normal multi-label document classifier.\n\nUse a fixed label order:\n\n\n    label_names = [\n        \"Health\",\n        \"Business & Economy\",\n        \"Technology\",\n        \"Politics & Government\",\n        \"Crime & Justice\",\n        \"Transport & Infrastructure\",\n        \"Education\",\n        \"Science\",\n        \"Environment\",\n        \"Sport\",\n        \"Arts & Entertainment\",\n        \"Society\",\n        \"Conflict & Security\",\n        \"Law\",\n        \"Religion\",\n        \"Labour\",\n        \"Other / Unclear\",\n    ]\n\n\nExample multi-hot label vector:\n\n\n    # Health + Politics & Government + Labour\n    labels = [\n        1.0,  # Health\n        0.0,  # Business & Economy\n        0.0,  # Technology\n        1.0,  # Politics & Government\n        0.0,  # Crime & Justice\n        0.0,  # Transport & Infrastructure\n        0.0,  # Education\n        0.0,  # Science\n        0.0,  # Environment\n        0.0,  # Sport\n        0.0,  # Arts & Entertainment\n        0.0,  # Society\n        0.0,  # Conflict & Security\n        0.0,  # Law\n        0.0,  # Religion\n        1.0,  # Labour\n        0.0,  # Other / Unclear\n    ]\n\n\nHugging Face 
sequence-classification configs support `problem_type=\"multi_label_classification\"`, and Hugging Face forum guidance for `Trainer` points to setting this problem type for multi-label classification. (Hugging Face)\n\nExample:\n\n\n    from transformers import AutoModelForSequenceClassification\n\n    model = AutoModelForSequenceClassification.from_pretrained(\n        \"microsoft/deberta-v3-base\",\n        num_labels=len(label_names),\n        id2label={i: label for i, label in enumerate(label_names)},\n        label2id={label: i for i, label in enumerate(label_names)},\n        problem_type=\"multi_label_classification\",\n    )\n\n\nStart with a text-only classifier, then compare against a version that appends normalized entities to the input:\n\n\n    Title: ...\n    Entities: NHS; British Medical Association; doctor\n    Body: ...\n\n\nEntity injection may help short articles, but it can also cause overfitting to famous names.\n\n* * *\n\n## 9. Splitting and evaluation\n\nUse multi-label-aware splitting. The `iterative-stratification` project provides scikit-learn-compatible cross validators for stratification in multilabel data, and its code comments describe preserving label percentages across folds. (GitHub)\n\nEvaluate with more than one metric:\n\n\n    micro-F1\n    macro-F1\n    per-label precision\n    per-label recall\n    per-label F1\n    hamming loss\n    precision@k\n    manual error analysis\n\n\nscikit-learn’s F1 documentation defines F1 as the harmonic mean of precision and recall, and it supports different averaging modes such as micro, macro, weighted, samples, or per-label scoring. (scikit-learn)\n\nDo not rely only on micro-F1. It can look good while rare labels fail badly. 
Always inspect per-label scores.\n\nAlso tune thresholds per label rather than blindly using `0.5`:\n\n\n    Health: 0.42\n    Business & Economy: 0.55\n    Technology: 0.62\n    Politics & Government: 0.48\n    Transport & Infrastructure: 0.50\n    Labour: 0.38\n\n\nThese are examples; real thresholds should come from validation data.\n\n* * *\n\n## 10. Human review: where to spend effort\n\nDo not manually review everything. Review the rows where automation is least reliable:\n\n\n    location-only rows\n    rows with no confident label\n    rows with many candidate labels\n    rows involving ambiguous high-frequency entities\n    rows where rules and zero-shot disagree\n    rows assigned Other / Unclear\n    rare-label examples\n    random audit samples\n\n\nA practical target:\n\n\n    Minimum: 500 reviewed examples\n    Better: 1,000–2,000 reviewed examples\n    Strong: 5,000+ reviewed examples\n\n\nThe reviewed set is essential. Without it, you cannot tell whether the model is learning meaningful domains or merely reproducing noisy rules.\n\n* * *\n\n## 11. Useful reference dataset pattern\n\nA relevant schema reference is **DWIE** on Hugging Face. It is an entity-centric document-level information extraction dataset with named entities, coreference, relations, entity linking, and document-level IPTC classification codes; the dataset card describes 23,130 entities, 311 multi-label entity tags, and entity links to Wikipedia. (Hugging Face)\n\nThe important lesson from that style of dataset is structural:\n\n\n    mentions/entities/entity links ≠ article-level topic labels\n\n\nThat is exactly the separation your dataset needs.\n\n* * *\n\n## 12. 
Main pitfalls\n\nPitfall | Why it hurts | Better approach\n---|---|---\nTraining directly on entities | Creates huge sparse label space | Use entities as metadata\nOne entity → one domain | Fails on ambiguous entities | Entity → candidate domains\nLocation → Geography | Mislabels local news | Store locations separately\nToo many labels too early | Sparse positives, weak metrics | Start with coarse domains\nNo reviewed validation set | No trusted evaluation | Review uncertain + random samples\nOne global threshold | Over/under-predicts labels | Tune thresholds per label\nNo label provenance | Hard to debug changes | Store label source and versions\n\n* * *\n\n## 13. Recommended first-month plan\n\n### Week 1 — Audit and taxonomy\n\nProduce:\n\n\n    top entities\n    entity frequency distribution\n    rows with only locations\n    rows with only people\n    rows with only organisations\n    rows with 5+ topics\n    rare one-off topic rate\n    taxonomy_v0.md\n\n\n### Week 2 — Weak labels\n\nCreate:\n\n\n    entity_aliases.csv\n    entity_candidate_domains.csv\n    keyword_rules.py\n    domain_labels_weak\n    candidate_domain_scores\n    review_required\n\n\n### Week 3 — Review\n\nReview 500–1,000 examples, prioritizing:\n\n\n    ambiguous entities\n    location-only rows\n    rule disagreements\n    Other / Unclear rows\n    rare labels\n\n\n### Week 4 — Train baseline\n\nTrain and compare:\n\n\n    text-only classifier\n    text + normalized entity list classifier\n\n\nEvaluate with micro-F1, macro-F1, per-label metrics, threshold tuning, and manual error analysis.\n\n* * *\n\n# Direct answers\n\n## 1. 
Best way to map entities → domains at scale?\n\nUse a hybrid pipeline:\n\n\n    entity normalization\n    + entity typing/linking\n    + manual mappings for frequent entities\n    + rule-based weak labels\n    + article-text evidence\n    + zero-shot/embedding suggestions\n    + human review for uncertain cases\n    + final supervised classifier\n\n\nThe key rule is:\n\n\n    entity → candidate domains\n    article context → final domains\n\n\n* * *\n\n## 2. Manual, rule-based, or embedding-based?\n\nUse all three.\n\n\n    Manual:\n    taxonomy, high-frequency entities, validation set\n\n    Rules:\n    explainable weak labels\n\n    Embeddings:\n    similarity, clustering, candidate suggestions\n\n    Zero-shot:\n    cold-start label suggestions\n\n    Human review:\n    ambiguous/high-impact examples\n\n    Fine-tuned classifier:\n    final scalable prediction\n\n\nEmbeddings and zero-shot models are useful assistants, not ground truth.\n\n* * *\n\n## 3. How to handle ambiguous entities?\n\nUse many-to-many mappings, context resolution, confidence scores, and review queues.\n\n\n    {\n      \"entity\": \"Tesla\",\n      \"candidate_domains\": [\n        \"Business & Economy\",\n        \"Technology\",\n        \"Transport & Infrastructure\",\n        \"Environment\",\n        \"Law\"\n      ],\n      \"requires_context\": true\n    }\n\n\nThe same entity can produce different labels in different articles. That is correct.\n\n* * *\n\n## 4. Is this still classification?\n\nYes, but only after restructuring.\n\nRight now, the problem is:\n\n\n    label construction from noisy entity-style metadata\n\n\nAfter that, it becomes:\n\n\n    multi-label document classification\n\n\nThe final model is ordinary. The hard part is building reliable labels.\n\n* * *\n\n# Compact final recommendation\n\nBuild the dataset like this:\n\n\n    1. Rename current topics to entities_raw.\n    2. Normalize entity strings.\n    3. 
Separate people, organisations, locations, infrastructure, products.\n    4. Treat locations as metadata unless the article is truly about place/geography.\n    5. Define a small news-domain taxonomy.\n    6. Map entities to candidate domains, not fixed final domains.\n    7. Use article text to resolve ambiguity.\n    8. Generate weak labels with rules + zero-shot/embedding suggestions.\n    9. Human-review uncertain and high-impact rows.\n    10. Store final labels as domain_labels.\n    11. Convert domain_labels to fixed multi-hot float vectors.\n    12. Train a Hugging Face multi-label classifier.\n    13. Evaluate per label and tune thresholds.\n    14. Version taxonomy, rules, labels, and dataset files.\n\n\n**Short summary**\n\n  * Your current `topics` are mostly entities, not true labels.\n  * Keep them as `entities_raw`; create a new `domain_labels` field.\n  * Use IPTC-style news taxonomy thinking, but start with a smaller custom taxonomy.\n  * Do not map every entity to exactly one domain.\n  * Treat ambiguous entities as candidate-domain generators.\n  * Treat locations as metadata, not automatic `Geography`.\n  * Use weak supervision plus selective human review.\n  * Then train a normal Hugging Face multi-label classifier.\n\n",
  "title": "When Your “Labels” Aren’t Really Labels: Dealing with Entity-Based NLP Datasets"
}