[Continuation] bryła semantic representation: ablation + masked loss results
This is a continuation of my earlier thread, which auto-closed before I could share results: "Looking for feedback: my own semantic representation 'bryła' for small models" (originally posted in Polish).
@john6666 - thank you again for the detailed feedback. I’ve spent the last two days running the experiments you suggested, plus a few I came up with along the way. Here’s what happened - including the failures, because I think they’re as informative as the successes.
I’ve also added an English version of my project documentation to my HF repo: krzysiekpl/bryla-kris - see README_EN.md for the full story.
What I did
1. Field ablation (your suggestion: “split bryła into field families”)
Built 3 schema variants:
- MIN (3 fields: type, polarity, sep)
- MID (7 fields: + scope, intent, intensity, core)
- FULL (20 fields: all original fields)
Trained 4 variants (RAW + 3 bryła) x 3 seeds on 695 Q/A pairs (welding/materials).
Result: MID won in 2 of 3 seeds, with perplexity about 3% better than RAW. FULL did NOT beat MID (it lost by 9 perplexity points). This suggests the 13 extra “default” fields in FULL are noise. The effect is small (3%), but the direction is consistent.
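For readers who want to reproduce the ablation, the variants can be viewed as projections of one full bryła record onto a smaller set of fields. This is only a minimal sketch: the MIN and MID field names are taken from the list above, but the dict layout and the `project` helper are my illustration here, not the code in the repo. The 13 extra FULL fields are not enumerated in this post, so FULL simply keeps everything.

```python
# Field names for MIN and MID come from the post; the dict-based
# representation and this helper are assumptions made for illustration.
MIN_FIELDS = ("type", "polarity", "sep")
MID_FIELDS = MIN_FIELDS + ("scope", "intent", "intensity", "core")


def project(bryla: dict, variant: str) -> dict:
    """Keep only the fields that belong to the chosen schema variant."""
    if variant == "RAW":
        return {}  # RAW carries no bryła prefix at all
    if variant == "FULL":
        return dict(bryla)  # all original fields, defaults included
    keep = {"MIN": MIN_FIELDS, "MID": MID_FIELDS}[variant]
    return {name: value for name, value in bryla.items() if name in keep}
```

Generating all variants from the same parsed record keeps the comparison fair: the only thing that changes between runs is which fields reach the model.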
2. Honest perplexity metric (my own addition)
When scaling to Wikipedia (decoder-only LM), I noticed standard val_ppl was misleading because tags are deterministic and easy to predict. I added:
- val_ppl_std (standard, on all tokens)
- val_ppl_clean (ONLY on Polish text after [SEP_BRYLA])
- val_ppl_tags (only on bryła tags)
For FULL bryła on Wikipedia: val_ppl_std = 2.03 but val_ppl_clean = 3.10. The standard metric was hiding ~35% of the real perplexity ((3.10 - 2.03) / 3.10 ≈ 0.35). I now think this is a methodological lesson: any ablation that adds prefix tokens should report a target-only perplexity.
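To make the three metrics concrete, here is a minimal sketch of how they can be computed from per-token negative log-likelihoods. It assumes you already have one NLL per predicted token (natural log) and a boolean flag marking tokens that come after [SEP_BRYLA]; the function names are illustrative, not the ones in my training script.

```python
import math


def perplexity(nlls):
    """exp of the mean negative log-likelihood (natural log)."""
    return math.exp(sum(nlls) / len(nlls))


def split_perplexities(token_nlls, is_target):
    """token_nlls: NLL per token; is_target: True for tokens after [SEP_BRYLA]."""
    clean = [n for n, t in zip(token_nlls, is_target) if t]
    tags = [n for n, t in zip(token_nlls, is_target) if not t]
    return {
        "val_ppl_std": perplexity(token_nlls),  # all tokens
        "val_ppl_clean": perplexity(clean),     # Polish text only
        "val_ppl_tags": perplexity(tags),       # bryła tags only
    }
```

Because the mean NLL over all tokens is a length-weighted average of the two partial means, val_ppl_std always lies between val_ppl_tags and val_ppl_clean. The more (easy) tag tokens a variant adds, the further the standard number is pulled down, which is exactly the effect described above.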
3. Three types of leakage I caught
- surface_text duplicated inside bryła AND after [SEP_BRYLA] (model copying)
- [FACTS] block included previous Polish text
- Anchors contained 80-char surface_text snippets (still leaking after I removed the first leak)
Each time I had to retrain. The last attempt still showed a suspiciously low standard deviation between seeds (±0.01), which suggested the model was matching templates from a biased corpus (Wikipedia has ~5000 nearly-identical village descriptions).
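A cheap check that would have caught all three leaks before training is to look for long character windows of the target text inside the prefix. The sketch below is a hypothetical helper, not what I actually ran: it assumes each training example is a single string with the bryła prefix, then [SEP_BRYLA], then the Polish text, and the window size `k = 20` is an arbitrary choice.

```python
SEP = "[SEP_BRYLA]"


def leaks(prefix: str, target: str, k: int = 20) -> bool:
    """True if any k-character window of the target also occurs in the prefix."""
    if len(target) < k:
        # Target shorter than the window: compare it as a whole.
        return bool(target) and target in prefix
    return any(target[i:i + k] in prefix for i in range(len(target) - k + 1))


def example_leaks(example: str, k: int = 20) -> bool:
    """Split one training example at the separator and test it for leakage."""
    prefix, _, target = example.partition(SEP)
    return leaks(prefix, target.strip(), k)
```

Running `example_leaks` over the whole corpus and counting the hits gives a leakage rate per variant. Anything above zero deserves a look before spending GPU time on a retrain.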
4. Token economy and latency (your suggestion)
| Variant | Tokens vs RAW | Training | Inference |
|---|---|---|---|
| RAW | 1.0x | 4 min | 5.7 ms/tok |
| MIN | 1.81x | 8 min | 5.9 ms/tok |
| MID | 2.82x | 12 min | 5.8 ms/tok |
| FULL | 6.06x | 30 min | 5.8 ms/tok |
FULL costs about 6x the tokens of RAW (and 7.5x the training time) for ~3% perplexity improvement. The tradeoff is poor.
5. Masked loss (the most interesting experiment)
I had an intuition: “bryła should CARRY information needed to generate the answer. The text is what the model should learn. The bryła is just context the model receives.”
This is equivalent to prefix-LM / conditional generation - loss only on Polish text after [SEP_BRYLA], bryła as context only.
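In practice, masking the loss comes down to how the label sequence is built. A minimal sketch, assuming token ids are plain Python lists and that the loss function skips positions labeled -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`); the `mask_prefix` helper and the separator id are illustrative:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prefix(input_ids, sep_id):
    """Copy input_ids to labels, ignoring everything up to and including sep_id."""
    labels = list(input_ids)
    try:
        cut = input_ids.index(sep_id) + 1
    except ValueError:
        cut = 0  # no separator (e.g. RAW variant): train on every token
    labels[:cut] = [IGNORE_INDEX] * cut
    return labels
```

The model still attends to the bryła tokens as input, so they can shape the prediction of the Polish text, but no gradient is spent on learning to predict the tags themselves.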
Result: val_ppl_clean almost identical (3.10 vs 3.18). Numerically neutral.
BUT when I tested the masked model in a mini-chat with manually-crafted bryła prefixes differing only in polarity:
[OTHER] [POL:neutral] -> geographic / astronomical content
[OTHER] [POL:positive] -> villages, places
[OTHER] [POL:negative] -> sports, competition
Three different topical distributions for three polarities. The model IS reading the bryła as conditioning information. The numerical val_ppl doesn’t show this, but the generation does.
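So far this observation is qualitative. One way to turn it into a number would be to generate several samples per prefix, label each sample with a topic by hand, and compare the label distributions. The sketch below assumes exactly that workflow: the prompt format with a trailing [SEP_BRYLA] and both helper names are my assumptions, and total variation distance is just one reasonable choice of measure.

```python
from collections import Counter


def minimal_pairs(type_tag="OTHER", polarities=("neutral", "positive", "negative")):
    """Prefixes that differ only in the polarity field."""
    return {p: f"[{type_tag}] [POL:{p}] [SEP_BRYLA]" for p in polarities}


def total_variation(labels_a, labels_b):
    """Distance in [0, 1] between two non-empty lists of topic labels."""
    ca, cb = Counter(labels_a), Counter(labels_b)
    na, nb = len(labels_a), len(labels_b)
    topics = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[t] / na - cb[t] / nb) for t in topics)
```

A distance near 0 between two polarities would mean the prefix is being ignored, and a distance near 1 would mean almost no topical overlap. Comparing against the distance between two sample sets drawn from the same prefix gives a noise baseline.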
Honest summary
What I think I showed:
- Field ablation: fewer informative fields > many fields with defaults
- val_ppl_clean is a necessary metric when tags are added to sequences
- Three types of leakage to watch for
- Conditional generation works: bryła as prefix DOES condition the output
What I did NOT show:
- That bryła “helps” in a strong sense (gains are small, ~3%)
- That the approach scales (33M tokens is ~5% of Chinchilla optimal)
- That the parser is good enough (87% of Wikipedia sentences got [OTHER] - my parser was built for technical Q/A, not general text)
A question, if you have a minute
I’m thinking about what to try next. The parser bottleneck (87% [OTHER]) suggests two options:
- Extend the parser with more domain-specific rules
- Build a smaller but balanced, multi-domain corpus (~200-500 examples per domain: biographies, geography, technique, daily life, science)
Since Bielik 11B was built by volunteers (SpeakLeash community), I’m wondering: do you (or anyone reading this) know of clean, diverse Polish-language Q/A datasets, or have suggestions for community-driven small dataset construction?
Even pointers to papers/projects that did small-data multi-domain ablation studies well would be very helpful - I’m in territory where I don’t quite know what good practice looks like.
Thanks again. Whatever happens next, this conversation has taught me more about experimental methodology than the actual results.
Best, Krzysztof krzysiekpl/bryla-kris