[Continuation] bryła semantic representation: ablation + masked loss results

Hugging Face Forums [Unofficial] May 15, 2026

This is a continuation of my earlier thread, which auto-closed before I could share results: Looking for feedback: my own semantic representation "bryła" for small models

@john6666 - thank you again for the detailed feedback. I’ve spent the last two days running the experiments you suggested, plus a few I came up with along the way. Here’s what happened - including the failures, because I think they’re as informative as the successes.

I’ve also added an English version of my project documentation to my HF repo: krzysiekpl/bryla-kris - see README_EN.md for the full story.

What I did

1. Field ablation (your suggestion: “split bryła into field families”)

Built 3 schema variants:

  • MIN (3 fields: type, polarity, sep)
  • MID (7 fields: + scope, intent, intensity, core)
  • FULL (20 fields: all original fields)

Trained 4 variants (RAW + 3 bryła) x 3 seeds on 695 Q/A pairs (welding/materials).

Result: MID won on 2 of 3 seeds, about 3% better than RAW. FULL did NOT beat MID (it was worse by 9 perplexity points). This suggests the 13 “default” fields that FULL adds on top of MID are noise. The effect is small (3%), but the direction is consistent.
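A minimal sketch of how the schema variants could be derived from one annotated record, assuming each bryła is a flat dict of field name to value. The tag spellings for `scope`, `intent`, `intensity`, and `core` are hypothetical, as is the `serialize` helper; only the `[OTHER]`, `[POL:...]`, and `[SEP_BRYLA]` spellings come from this post, and the 20 FULL field names are not listed here, so FULL is left out.

```python
# Field subsets for the two smaller variants. "sep" is counted as a field in
# the post; here it is the separator that is always emitted.
MIN_FIELDS = ("type", "polarity")
MID_FIELDS = MIN_FIELDS + ("scope", "intent", "intensity", "core")

# Assumed tag prefixes; only POL is taken from the post.
TAG_PREFIX = {"polarity": "POL", "scope": "SCOPE", "intent": "INTENT",
              "intensity": "INT", "core": "CORE"}

SEP = "[SEP_BRYLA]"


def serialize(bryla: dict, fields: tuple, text: str) -> str:
    """Render the selected fields as tags, then the separator, then the text."""
    tags = []
    for name in fields:
        if name not in bryla:
            continue  # skip missing fields instead of emitting a default value
        if name == "type":
            tags.append(f"[{bryla[name]}]")
        else:
            tags.append(f"[{TAG_PREFIX[name]}:{bryla[name]}]")
    return " ".join(tags + [SEP, text])


record = {"type": "OTHER", "polarity": "neutral", "scope": "local"}
print(serialize(record, MIN_FIELDS, "Spawanie stali."))
print(serialize(record, MID_FIELDS, "Spawanie stali."))
```

Skipping absent fields rather than filling in defaults is one way to test the "defaults are noise" hypothesis directly: the same record then yields a shorter prefix instead of a padded one.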

2. Honest perplexity metric (my own addition)

When scaling to Wikipedia (decoder-only LM), I noticed that the standard val_ppl was misleading: the tags are deterministic and easy to predict, so they pull the average down. I now report three numbers:

  • val_ppl_std (standard, on all tokens)
  • val_ppl_clean (ONLY on Polish text after [SEP_BRYLA])
  • val_ppl_tags (only on bryła tags)

For FULL bryła on Wikipedia: val_ppl_std = 2.03 but val_ppl_clean = 3.10. The standard metric was hiding ~35% of the real perplexity. I now think this is a methodological lesson: any ablation that adds prefix tokens should report a target-only perplexity.
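The three metrics can be sketched as follows, assuming per-token negative log-likelihoods (in nats) are already available along with a boolean mask marking the text tokens after [SEP_BRYLA]. The function name and the toy numbers are illustrative, not taken from the actual training code.

```python
import math


def split_perplexity(token_nll, is_target):
    """Return (ppl_std, ppl_clean, ppl_tags) from per-token NLLs in nats.

    is_target[i] is True for text tokens after [SEP_BRYLA], False for tag tokens.
    """
    def ppl(values):
        # Perplexity is exp of the mean NLL; undefined for an empty group.
        return math.exp(sum(values) / len(values)) if values else float("nan")

    clean = [nll for nll, t in zip(token_nll, is_target) if t]
    tags = [nll for nll, t in zip(token_nll, is_target) if not t]
    return ppl(token_nll), ppl(clean), ppl(tags)


# Toy numbers (made up): tags are nearly deterministic, text is harder.
nll = [0.05, 0.05, 0.05, 1.2, 1.0, 1.1]
mask = [False, False, False, True, True, True]
std, clean, tags = split_perplexity(nll, mask)
print(f"std={std:.2f} clean={clean:.2f} tags={tags:.2f}")
```

Because easy tag tokens enter the same average, the standard number always lands between the tag and text perplexities. With the reported values, (3.10 - 2.03) / 3.10 ≈ 0.345, which is where the ~35% figure comes from.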

3. Three types of leakage I caught

  1. surface_text duplicated inside the bryła AND after [SEP_BRYLA] (model copying)
  2. [FACTS] block included previous Polish text
  3. Anchors contained 80-char surface_text snippets (still leaking after I removed (1))

Each time I had to retrain. The last attempt still showed a suspiciously low standard deviation across seeds (±0.01), which suggested that the model was matching templates from a biased corpus (Wikipedia has ~5000 nearly identical village descriptions).
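A rough automated check for this kind of leakage, assuming each training example is a single string with the bryła prefix before [SEP_BRYLA] and the Polish text after it. The helper names, the `[ANCHOR:...]` tag spelling, and the 4-word window are assumptions chosen for illustration.

```python
import re


def split_example(example: str, sep: str = "[SEP_BRYLA]"):
    """Split one serialized example into (prefix, target) at the separator."""
    prefix, _, target = example.partition(sep)
    return prefix, target


def leaked_ngrams(prefix: str, target: str, n: int = 4) -> set:
    """Word n-grams of the target that also occur verbatim in the prefix."""
    def ngrams(text):
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(prefix) & ngrams(target)


# Hypothetical example where an anchor repeats part of the target sentence.
leaky = "[OTHER] [POL:neutral] [ANCHOR:wies w gminie Nowa Wola] [SEP_BRYLA] Kowale to wies w gminie Nowa Wola."
clean = "[OTHER] [POL:neutral] [SEP_BRYLA] Kowale to wies w gminie Nowa Wola."

print(leaked_ngrams(*split_example(leaky)))
print(leaked_ngrams(*split_example(clean)))
```

Running a check like this over the whole corpus before training is cheaper than discovering the leak from a suspicious validation curve; it only catches verbatim copies, not paraphrased ones.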

4. Token economy and latency (your suggestion)

Variant   Tokens vs RAW   Training time   Inference latency
RAW       1.0x            4 min           5.7 ms/tok
MIN       1.81x           8 min           5.9 ms/tok
MID       2.82x           12 min          5.8 ms/tok
FULL      6.06x           30 min          5.8 ms/tok

FULL costs about 6x the tokens and 7.5x the training time of RAW for at most a ~3% perplexity improvement. The tradeoff is poor.
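For reference, a token ratio like the one in the "Tokens vs RAW" column can be measured with a few lines. This is a sketch under stated assumptions: `token_ratio` is a hypothetical helper, and whitespace splitting is only a stand-in for the model's real tokenizer, which would give different absolute numbers.

```python
def token_ratio(variant_texts, raw_texts, tokenize=str.split):
    """Total token count of a variant relative to the RAW baseline."""
    variant_total = sum(len(tokenize(t)) for t in variant_texts)
    raw_total = sum(len(tokenize(t)) for t in raw_texts)
    return variant_total / raw_total


# Toy corpus: one RAW sentence and the same sentence with a short prefix.
raw = ["Spawanie stali wymaga oslony gazowej"]
tagged = ["[OTHER] [POL:neutral] [SEP_BRYLA] Spawanie stali wymaga oslony gazowej"]
print(token_ratio(tagged, raw))
```

Passing the actual tokenizer's encode function as `tokenize` keeps the measurement comparable across schema variants.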

5. Masked loss (the most interesting experiment)

I had an intuition: “bryła should CARRY information needed to generate the answer. The text is what the model should learn. The bryła is just context the model receives.”

This is equivalent to prefix-LM / conditional generation - loss only on Polish text after [SEP_BRYLA], bryła as context only.
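A minimal sketch of the label masking this implies, assuming token ids are plain Python lists and that [SEP_BRYLA] maps to a single token id. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the function name and the toy ids are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prefix_labels(input_ids, sep_id):
    """Copy input_ids into labels, masking everything up to and including sep_id."""
    labels = list(input_ids)
    try:
        sep_pos = input_ids.index(sep_id)
    except ValueError:
        return labels  # no separator found: train on the whole sequence
    for i in range(sep_pos + 1):
        labels[i] = IGNORE_INDEX
    return labels


# Toy ids: 7 and 8 stand for tag tokens, 3 for [SEP_BRYLA], the rest for text.
ids = [7, 8, 3, 41, 42, 43]
print(mask_prefix_labels(ids, sep_id=3))  # [-100, -100, -100, 41, 42, 43]
```

The model still attends to the prefix tokens as input; they simply contribute nothing to the loss, so the gradient comes only from predicting the Polish text.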

Result: val_ppl_clean almost identical (3.10 vs 3.18). Numerically neutral.

BUT when I tested the masked model in a mini-chat with manually crafted bryła prefixes differing only in polarity:

[OTHER] [POL:neutral]  -> geographic / astronomical content
[OTHER] [POL:positive] -> villages, places
[OTHER] [POL:negative] -> sports, competition

Three different topical distributions for three polarities. The model IS reading the bryła as conditioning information. The numerical val_ppl doesn’t show this, but the generation does.

Honest summary

What I think I showed:

  • Field ablation: fewer informative fields > many fields with defaults
  • val_ppl_clean is a necessary metric when tags are added to sequences
  • Three types of leakage to watch for
  • Conditional generation works: bryła as prefix DOES condition the output

What I did NOT show:

  • That bryła “helps” in a strong sense (gains are small, ~3%)
  • That the approach scales (33M tokens is ~5% of Chinchilla optimal)
  • That the parser is good enough (87% of Wikipedia sentences got [OTHER] - my parser was built for technical Q/A, not general text)

A question, if you have a minute

I’m thinking about what to try next. The parser bottleneck (87% [OTHER]) suggests two options:

  1. Extend the parser with more domain-specific rules
  2. Build a smaller but balanced, multi-domain corpus (~200-500 examples per domain: biographies, geography, technology, daily life, science)

Since Bielik 11B was built by volunteers (SpeakLeash community), I’m wondering: do you (or anyone reading this) know of clean, diverse Polish-language Q/A datasets, or have suggestions for community-driven small dataset construction?

Even pointers to papers/projects that did small-data multi-domain ablation studies well would be very helpful - I’m in territory where I don’t quite know what good practice looks like.

Thanks again. Whatever happens next, this conversation has taught me more about experimental methodology than the actual results.

Best,
Krzysztof
krzysiekpl/bryla-kris
