{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiaiies2opme5znrposzciucvip2ktgndzaifmdj7tdpyyiojpp2rm",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mjg7vxb2nvj2"
},
"path": "/t/discussion-about-improving-intent-classification-accuracy-in-low-data-settings-with-overlapping-semantic-signals-using-lightweight-non-llm-techniques/175202#post_2",
"publishedAt": "2026-04-14T00:00:46.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"ACL Anthology",
"Sbert",
"scikit-learn",
"Hugging Face",
"Rasa Community Forum",
"GitHub"
],
"textContent": "From what I can tell, you might have wandered into some kind of strange maze:\n\n* * *\n\nFor your case, the main problem is not that embeddings are weak. It is that you are asking one similarity score to do three different jobs at once:\n\n 1. find roughly related intents,\n 2. separate very close intents, and\n 3. reject intents that are semantically nearby but still wrong.\n\n\n\nRecent work on few-shot text classification and intent classification keeps finding the same pattern: **label semantics matter** , **richer class descriptions help** , **hard-negative boundaries matter** , and performance drops when overlapping intents are added without explicitly arbitrating the collision. (ACL Anthology)\n\n## Why your current setup is failing\n\nDense sentence embeddings are good at broad semantic grouping. They are much worse when the true distinction is carried by a small number of **functional cues** such as:\n\n * the main action,\n * the target object,\n * the required state,\n * the relation between entities,\n * or an exclusion condition.\n\n\n\nThat is exactly why papers on label semantics and complex class descriptions report gains from using richer class representations rather than short labels or plain sentence similarity. The “complex class descriptions” paper is especially relevant here because it reframes classification as **matching examples to class descriptions** , and reports strong few-shot gains over baselines when classes are too complex to be represented by short names alone. (ACL Anthology)\n\nThere is also a second failure mode: your model probably has weak **negative boundaries**. The hard-negative OOS paper shows that intent classifiers struggle much more on out-of-scope inputs that are semantically close to in-scope intents than on generic OOS inputs, and that adding such hard negatives improves robustness. In other words, the model often does not really know what a class excludes. 
(ACL Anthology)\n\nA third failure mode is structural. Some confusions are not really model mistakes. They are **taxonomy collisions**. Redwood was built around this exact idea and shows that performance suffers when colliding intents are added without arbitration. That matters for your problem because some of your “overlap” may come from the ontology itself rather than the encoder. (ACL Anthology)\n\n## The most useful mental model\n\nDo not think of this as “intent = one embedding.”\n\nThink of it as:\n\n * **candidate generation** ,\n * **pairwise disambiguation** ,\n * **accept / reject decision**.\n\n\n\nSentence Transformers’ retrieve-and-rerank guidance matches this well: use a fast first stage to retrieve plausible candidates, then a stronger reranker for precision. That architecture is a much better fit for your problem than a single cosine score. (Sbert)\n\n## What I would build instead\n\n### 1. Replace label strings with intent cards\n\nEach intent should be represented as a structured object, not just a title and a few paraphrases.\n\nA good intent card should contain:\n\n * a positive definition,\n * the core action,\n * the target object,\n * required conditions,\n * exclusions,\n * nearest confusable intents,\n * a handful of positive examples,\n * and a handful of negative examples.\n\n\n\nExample shape:\n\n\n Intent: cancel_subscription\n\n Positive definition:\n The user wants to end an existing subscription or stop future renewals.\n\n Core action:\n cancel / stop / terminate\n\n Target object:\n subscription / membership / recurring plan\n\n Required conditions:\n An existing active or recurring service is implied.\n\n Exclusions:\n - canceling a one-time order\n - pausing temporarily\n - changing the plan tier\n - asking about pricing only\n\n Confusable intents:\n pause_subscription\n change_plan\n cancel_order\n\n Positive examples:\n \"Stop my membership.\"\n \"I want to cancel the premium plan.\"\n\n Negative examples:\n \"I only want to 
pause it.\"\n \"How much does the premium plan cost?\"\n \"I need to cancel yesterday's order.\"\n\n\nThis is not just good documentation. It is a better machine-readable class representation, and it is supported by the literature on label semantics and complex class descriptions. (ACL Anthology)\n\n### 2. Make the first stage sparse or hybrid, not dense-only\n\nFor your setting, I would give **lexical evidence more authority** than dense similarity.\n\nWhy:\n\n * sparse methods keep token-level evidence visible,\n * common terms are naturally down-weighted,\n * and hybrid retrieval is explicitly recommended in current Sentence Transformers material for combining recall and precision. (Sbert)\n\n\n\nThat means the first stage should be one of these:\n\n * BM25 or TF-IDF,\n * a neural sparse encoder,\n * or sparse + dense hybrid retrieval.\n\n\n\nThis stage should only produce top-k candidate intents. It should not make the final decision.\n\nThat is the first place where you solve “high-recall generic terms dominate scoring.” Sparse retrieval is much better than dense similarity at exposing which terms are carrying the match. The Sentence Transformers sparse docs and Hugging Face’s sparse-encoder article both position sparse methods as a useful middle ground between classic lexical retrieval and dense embeddings. (Sbert)\n\n### 3. 
Put a supervised sparse classifier at the center\n\nFor your actual decision layer, I would use a sparse linear classifier with explicit features.\n\nThis is the feature stack:\n\n * word n-grams,\n * character n-grams,\n * action features,\n * object features,\n * slot/entity flags,\n * negation and modality features,\n * state or status features.\n\n\n\nIn scikit-learn terms, that means:\n\n * `TfidfVectorizer` for word n-grams,\n * `TfidfVectorizer(analyzer=\"char_wb\")` for robust character features,\n * `DictVectorizer` or similar for hand-built symbolic features,\n * `FeatureUnion` to merge them,\n * `OneVsRestClassifier` with `LogisticRegression`,\n * then calibration and threshold tuning. (scikit-learn)\n\n\n\nWhy this is a strong fit:\n\n * `char_wb` builds character n-grams only from text inside word boundaries, so it keeps subword and wording cues without adding n-grams that span words. (scikit-learn)\n * `FeatureUnion` lets you combine lexical and symbolic signals in one model. (scikit-learn)\n * `OneVsRestClassifier` fits one binary classifier per class, which gives you per-class decision functions and therefore class-specific boundaries. (Hugging Face)\n * `LogisticRegression` handles sparse inputs directly. (scikit-learn)\n\n\n\nThis gives you something dense-only approaches do not: **signed evidence**. A feature can carry a positive weight for one class and a negative weight for another.\n\nThat is how you start modeling “this signal should exclude the class.”\n\n### 4. Use a reranker only for close calls\n\nYour NLI-style second pass is a good instinct. The problem is using it too broadly.\n\nA better setup is:\n\n 1. first stage retrieves top 3–5 intents,\n 2. second stage reranks only those candidates against the full intent card.\n\n\n\nThat is exactly the retrieve-and-rerank pattern recommended in the Sentence Transformers docs. 
(Sbert)\n\nThis second stage can be:\n\n * a cross-encoder,\n * an NLI model,\n * or a label-description matching model.\n\n\n\nThe “complex class descriptions” paper and the intent-aware encoder paper both support this direction: intent classification improves when the model is allowed to align utterances with richer intent semantics, not just raw surface similarity. (ACL Anthology)\n\nSo the reranker’s question is not:\n\n> “Which class is nearest in embedding space?”\n\nIt is:\n\n> “Given these few candidate intents, which full intent description best matches this utterance, and which exclusions are violated?”\n\nThat is a much better question.\n\n### 5. Model negative boundaries with data, not rules\n\nYour biggest gap is probably here.\n\nThe clean way to encode negative boundaries is to build three kinds of negatives for each intent:\n\n#### Sibling negatives\n\nExamples from the most confusable neighboring intents.\n\n#### Minimal-pair negatives\n\nSmall edits that flip the class:\n\n * change the action,\n * change the object,\n * add or remove negation,\n * change the status,\n * swap the relation.\n\n\n\n#### Hard-negative OOS\n\nExamples that look domain-relevant and share vocabulary, but belong to no supported intent.\n\nThis is directly supported by the hard-negative OOS paper. Generic OOS is too easy. You need close, misleading negatives. (ACL Anthology)\n\nFor example, if an intent is `change_billing_date`, useful hard negatives are not random unrelated queries. They are things like:\n\n * “Why was I billed on this date?”\n * “Can I change my payment method?”\n * “Pause my subscription until next month.”\n * “Move my renewal to next week.”\n\n\n\nThese sit near the decision boundary. That is exactly where your system is weak.\n\n### 6. 
Treat confusion pairs as first-class objects\n\nDo not inspect only global accuracy, macro-F1, or top-1 intent accuracy.\n\nBuild a **confusion graph**:\n\n * nodes = intents,\n * edge weight = how often two intents appear together as the top-2 candidates or are confused with each other.\n\n\n\nThen take the worst edges and treat each as its own subproblem.\n\nFor each bad pair, ask:\n\n * are the positive definitions actually distinct,\n * are the exclusions explicit,\n * are the examples balanced,\n * should this distinction be an entity or slot instead of an intent split,\n * is this actually a multi-intent case,\n * or should the pair be merged?\n\n\n\nRedwood is the main source behind this advice. It shows that collision handling is not optional when intent sets grow or overlap. (ACL Anthology)\n\nThere is also a very practical community lesson here. Threads from Rasa and Stack Overflow often end up concluding that some “different intents” are actually the same intent plus a different entity or slot. That is not a deep theorem, but it is a useful diagnostic pattern. (Rasa Community Forum)\n\n### 7. Add a real reject path\n\nYou should not force a label for every input.\n\nUse:\n\n * a per-intent acceptance threshold,\n * a top-1 minus top-2 margin threshold,\n * and optionally an OOS detector.\n\n\n\nscikit-learn now provides both `CalibratedClassifierCV` and `TunedThresholdClassifierCV`. The first calibrates scores. The second explicitly tunes the cut-off used to turn scores into labels; it supports binary targets only, so in a multi-class setup it has to be applied to each one-vs-rest sub-problem. That is exactly the tooling you want for “accept, reject, or escalate.” (scikit-learn)\n\nThis is much cleaner than hand-written rules like “if score < 0.62, return unknown.”\n\nIt also aligns with a real production pain point: confidence scores are often unreliable across intents, especially when some classes are much tighter than others. (Rasa Community Forum)\n\n## A concrete lightweight pipeline\n\nThis is the pipeline I would actually recommend.\n\n### Stage A. 
Candidate generation\n\nUse:\n\n * TF-IDF or BM25,\n * or a sparse / hybrid retriever.\n\n\n\nGoal:\n\n * high recall,\n * top 5 intent candidates,\n * transparent lexical evidence.\n\n\n\nThis is supported by the current sparse and retrieve-and-rerank guidance. (Sbert)\n\n### Stage B. Feature-based supervised classifier\n\nTrain a one-vs-rest (OvR) classifier on:\n\n * word TF-IDF,\n * character n-gram TF-IDF (`char_wb`),\n * action/object/slot/state/negation features,\n * optional metadata.\n\n\n\nGoal:\n\n * class-specific signed evidence,\n * interpretable coefficients,\n * better precision on overlapping classes. (scikit-learn)\n\n\n\n### Stage C. Pairwise reranking\n\nFor only the top few candidates:\n\n * compare utterance vs full intent card,\n * score entailment / compatibility / exclusion violations.\n\n\n\nGoal:\n\n * resolve the close cases where the sparse backbone still hesitates. (ACL Anthology)\n\n\n\n### Stage D. Thresholded decision\n\nUse:\n\n * calibrated probability or decision score,\n * per-class threshold,\n * top-1 minus top-2 margin,\n * OOS fallback.\n\n\n\nGoal:\n\n * avoid wrong forced labels. (scikit-learn)\n\n\n\n## How to prioritize discriminative signals\n\nYou asked specifically about weighting discriminative signals over generic ones.\n\nI would do it in four ways.\n\n### A. Use sparse lexical features\n\nWith IDF weighting, sparse lexical features down-weight common terms explicitly, which dense similarity does not do. (Sbert)\n\n### B. Add feature selection\n\nFor non-negative sparse features, scikit-learn’s `chi2` can rank features by class association. That gives you a simple way to identify which terms are actually discriminative and which are just frequent. (scikit-learn)\n\n### C. Use regularized linear models\n\nRegularized logistic regression on sparse features is a strong baseline for text classification and handles sparse inputs directly. (scikit-learn)\n\n### D. Split generic terms from functional features\n\nDo not let “subscription,” “account,” “billing,” or “transfer” be the only strong signals. 
Put action and constraint features in their own channel so the model can learn that:\n\n * `cancel + subscription` matters,\n * `pause + subscription` is different,\n * `why + subscription billed` is different again.\n\n\n\nThat separation is the practical version of the “functional signals” idea you raised.\n\n## Better intent representations than plain embeddings\n\nThe strongest alternatives are:\n\n### 1. Intent cards\n\nBest overall option for your use case. Supported by label-semantics and complex-description work. (ACL Anthology)\n\n### 2. Intent name + keyphrase set\n\nSupported by the intent-aware encoder work, which tries to align utterances with intent names and key phrases rather than relying only on whole-utterance similarity. (ACL Anthology)\n\n### 3. Positive and negative prototypes\n\nInstead of one prototype per class, keep:\n\n * one positive prototype bank,\n * one exclusion prototype bank.\n\n\n\nThen score both:\n\n * “How much does this look like class A?”\n * “How much does this violate class A?”\n\n\n\nThat does not come directly from any single paper, but it is consistent with the hard-negative and complex-description literature. (ACL Anthology)\n\n## What data work matters most\n\nWith only a few examples per intent, the biggest gains usually come from **better examples**, not a fancier model.\n\nThe most valuable new examples are:\n\n * confusion-pair minimal pairs,\n * hard-negative near-misses,\n * OOS queries that share vocabulary with in-scope intents,\n * and examples covering status and constraint changes.\n\n\n\nThe few-shot retrieval paper is useful here because it treats adaptation as changing the retrieval index and the local evidence, not retraining a giant model every time. That fits your low-data constraint well. (ACL Anthology)\n\n## Common pitfalls for your exact setup\n\n### 1. Using short intent labels as if they were enough\n\nThey usually are not. (ACL Anthology)\n\n### 2. 
Trusting dense similarity too early\n\nEven sentence-transformer users have reported cases where BM25 outperformed embedding search on their task. That is anecdotal, not universal, but it is a good warning against assuming semantic search always wins. (GitHub)\n\n### 3. Using generic OOS only\n\nGeneric OOS examples are too easy to reject. Hard-negative OOS is the real test. (ACL Anthology)\n\n### 4. Treating every bad confusion as a model issue\n\nSome are taxonomy issues or mislabeled data issues. BANKING77 even has a paper reporting potential label errors affecting intent-classification experiments. (ACL Anthology)\n\n### 5. One global threshold for all intents\n\nUsually too crude. Use per-intent thresholding and calibration instead. (scikit-learn)\n\n## The best practical starting point\n\nFor your exact constraints, the highest-leverage first build is:\n\n * sparse lexical first stage,\n * intent cards,\n * OvR sparse logistic regression with hand-built functional features,\n * top-3 reranking with an NLI or cross-encoder style matcher,\n * calibrated thresholds plus OOS fallback,\n * and a confusion-driven hard-negative collection loop.\n\n\n\nThat stack is lightweight, interpretable, fast enough for production, and much better aligned with overlapping semantic boundaries than plain embedding routing. It is also consistent with current few-shot tooling like SetFit, which is explicitly designed for little labeled data and supports a logistic-regression classification head on top of a SentenceTransformer body. 
(GitHub)\n\n## The single most useful next move\n\nBuild a baseline that deliberately **does not depend on dense similarity**:\n\n * word TF-IDF,\n * character n-gram TF-IDF (`char_wb`),\n * slot/entity/action/object/negation features,\n * OvR logistic regression,\n * calibration,\n * pairwise hard-negative evaluation.\n\n\n\nThat baseline will tell you very quickly whether your real problem is:\n\n * representation,\n * boundary quality,\n * or ontology design.\n\n\n\nOnce you know which of those three is dominant, the rest of the roadmap becomes much clearer.",
"title": "Discussion about improving intent classification accuracy in low-data settings with overlapping semantic signals using lightweight, non-LLM techniques"
}