{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidj42hvvfdna7xqdn4lns7ffxhwrllcjdxpuw75yl7pqcszkhr5ii",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlwef6c7f4t2"
  },
  "path": "/t/continuation-bryla-semantic-representation-ablation-masked-loss-results/176048#post_1",
  "publishedAt": "2026-05-15T22:09:04.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Szukam feedbacku — własna reprezentacja semantyczna \"bryła\" dla małych modeli",
    "@john6666",
    "krzysiekpl/bryla-kris"
  ],
  "textContent": "This is a continuation of my earlier thread which auto-closed before I could share results: Szukam feedbacku — własna reprezentacja semantyczna \"bryła\" dla małych modeli\n\n@john6666 - thank you again for the detailed feedback. I’ve spent the last two days running the experiments you suggested, plus a few I came up with along the way. Here’s what happened - including the failures, because I think they’re as informative as the successes.\n\nI’ve also added an English version of my project documentation to my HF repo: krzysiekpl/bryla-kris - see README_EN.md for the full story.\n\n## What I did\n\n**1. Field ablation (your suggestion: “split bryła into field families”)**\n\nBuilt 3 schema variants:\n\n  * MIN (3 fields: type, polarity, sep)\n  * MID (7 fields: + scope, intent, intensity, core)\n  * FULL (20 fields: all original fields)\n\n\n\nTrained 4 variants (RAW + 3 bryła) x 3 seeds on 695 Q/A pairs (welding/materials).\n\nResult: **MID won 2/3 seeds, +3% vs RAW. FULL did NOT beat MID** (lost by ~9 ppl). Suggests the 13 “default” fields in FULL are noise. Effect small (~3%), but the direction is consistent.\n\n**2. Honest perplexity metric (my own addition)**\n\nWhen scaling to Wikipedia (decoder-only LM), I noticed standard val_ppl was misleading because tags are deterministic and easy to predict. I added:\n\n  * val_ppl_std (standard, on all tokens)\n  * val_ppl_clean (ONLY on Polish text after [SEP_BRYLA])\n  * val_ppl_tags (only on bryła tags)\n\n\n\nFor FULL bryla on Wikipedia: val_ppl_std = 2.03 but val_ppl_clean = 3.10. The standard metric was hiding ~35% of the real perplexity. I now think this is a methodological lesson: any ablation that adds prefix tokens should use a target-only perplexity.\n\n**3. 
Three types of leakage I caught**\n\n  * surface_text duplicated inside bryła AND after [SEP_BRYLA] (model copying)\n  * [FACTS] block included previous Polish text\n  * Anchors contained 80-char surface_text snippets (still leaking after I fixed the first leak)\n\nEach time I had to retrain. The last attempt still showed a suspiciously low standard deviation across seeds (±0.01), which suggested the model was matching templates from a biased corpus (Wikipedia has ~5000 nearly identical village descriptions).\n\n**4. Token economy and latency (your suggestion)**\n\nVariant | Tokens vs RAW | Training time | Inference latency\n---|---|---|---\nRAW | 1.0x | 4 min | 5.7 ms/tok\nMIN | 1.81x | 8 min | 5.9 ms/tok\nMID | 2.82x | 12 min | 5.8 ms/tok\nFULL | 6.06x | 30 min | 5.8 ms/tok\n\nFULL costs about 6x the tokens of RAW for at best a ~3% perplexity improvement. The tradeoff is poor.\n\n**5. Masked loss (the most interesting experiment)**\n\nI had an intuition: “bryła should CARRY information needed to generate the answer. The text is what the model should learn. The bryła is just context the model receives.”\n\nThis is equivalent to prefix-LM / conditional generation - loss only on Polish text after [SEP_BRYLA], bryła as context only.\n\nResult: val_ppl_clean was almost identical (3.10 vs 3.18), so numerically neutral.\n\nBUT when I tested the masked model in a mini-chat with manually crafted bryła prefixes differing only in polarity:\n\n    [OTHER] [POL:neutral]  -> geographic / astronomical content\n    [OTHER] [POL:positive] -> villages, places\n    [OTHER] [POL:negative] -> sports, competition\n\nThree different topical distributions for three polarities. The model IS reading the bryła as conditioning information. 
The numerical val_ppl doesn’t show this, but the generation does.\n\n## Honest summary\n\nWhat I think I showed:\n\n  * Field ablation: fewer informative fields > many fields with defaults\n  * val_ppl_clean is a necessary metric when tags are added to sequences\n  * Three types of leakage to watch for\n  * Conditional generation works: bryła as prefix DOES condition the output\n\nWhat I did NOT show:\n\n  * That bryła “helps” in a strong sense (gains are small, ~3%)\n  * That the approach scales (33M tokens is ~5% of Chinchilla optimal)\n  * That the parser is good enough (87% of Wikipedia sentences got [OTHER] - my parser was built for technical Q/A, not general text)\n\n## A question, if you have a minute\n\nI’m thinking about what to try next. The parser bottleneck (87% [OTHER]) suggests two options:\n\n  1. Extend the parser with more domain-specific rules\n  2. Build a smaller but balanced, multi-domain corpus (~200-500 examples per domain: biographies, geography, technique, daily life, science)\n\nSince Bielik 11B was built by volunteers (SpeakLeash community), I’m wondering: do you (or anyone reading this) know of clean, diverse Polish-language Q/A datasets, or have suggestions for community-driven small dataset construction?\n\nEven pointers to papers/projects that did small-data multi-domain ablation studies well would be very helpful - I’m in territory where I don’t quite know what good practice looks like.\n\nThanks again. Whatever happens next, this conversation has taught me more about experimental methodology than the actual results.\n\nBest,\nKrzysztof\nkrzysiekpl/bryla-kris",
  "title": "[Continuation] bryła semantic representation: ablation + masked loss results"
}