External Publication

[Continuation] bryła semantic representation: ablation + masked loss results

Hugging Face Forums [Unofficial] May 15, 2026

This is a continuation of my earlier thread, which auto-closed before I could share results: Looking for feedback: my own semantic representation "bryła" for small models

@john6666 - thank you again for the detailed feedback. I’ve spent the last two days running the experiments you suggested, plus a few I came up with along the way. Here’s what happened - including the failures, because I think they’re as informative as the successes.

I’ve also added an English version of my project documentation to my HF repo: krzysiekpl/bryla-kris - see README_EN.md for the full story.

What I did

1. Field ablation (your suggestion: “split bryła into field families”)

Built 3 schema variants:

MIN (3 fields: type, polarity, sep)
MID (7 fields: + scope, intent, intensity, core)
FULL (20 fields: all original fields)
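One way to picture the variants is as a field whitelist applied before serialization. This is a minimal sketch, assuming a bracketed tag format like the `[OTHER] [POL:neutral]` prefixes quoted later in this post. Only the field names come from the post; the tag prefixes for scope, intent, intensity and core, and the helper `serialize`, are hypothetical and not taken from the repo.

```python
# Field whitelists. The separator is the third MIN "field" and is always emitted.
MIN_FIELDS = ("type", "polarity")
MID_FIELDS = MIN_FIELDS + ("scope", "intent", "intensity", "core")
SEP = "[SEP_BRYLA]"

# Assumed tag prefixes; the real format may differ.
TAG_PREFIX = {"polarity": "POL", "scope": "SCOPE", "intent": "INTENT",
              "intensity": "INT", "core": "CORE"}

def serialize(bryla: dict, fields: tuple, text: str) -> str:
    """Render the whitelisted fields as tags, then the separator and the text."""
    tags = []
    for name in fields:
        if name not in bryla:
            continue  # skip absent fields instead of emitting a default value
        value = bryla[name]
        if name == "type":
            tags.append(f"[{value}]")
        else:
            tags.append(f"[{TAG_PREFIX[name]}:{value}]")
    return " ".join(tags + [SEP, text])

example = {"type": "OTHER", "polarity": "neutral", "scope": "local"}
min_line = serialize(example, MIN_FIELDS, "tekst")
mid_line = serialize(example, MID_FIELDS, "tekst")
print(min_line)
print(mid_line)
```

Skipping absent fields rather than writing a default is a deliberate choice in this sketch: it keeps uninformative tags out of the sequence.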

Trained 4 variants (RAW + 3 bryła) x 3 seeds on 695 Q/A pairs (welding/materials).

Result: MID won 2/3 seeds, about 3% better than RAW. FULL did NOT beat MID (lost by ~9 perplexity points). This suggests the 13 “default” fields in FULL are noise. The effect is small (~3%), but the direction is consistent.

2. Honest perplexity metric (my own addition)

When scaling to Wikipedia (decoder-only LM), I noticed standard val_ppl was misleading because tags are deterministic and easy to predict. I added:

val_ppl_std (standard, on all tokens)
val_ppl_clean (ONLY on Polish text after [SEP_BRYLA])
val_ppl_tags (only on bryła tags)

For FULL bryla on Wikipedia: val_ppl_std = 2.03 but val_ppl_clean = 3.10. The standard metric was hiding ~35% of the real perplexity. I now think this is a methodological lesson: any ablation that adds prefix tokens should use a target-only perplexity.
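The three metrics differ only in which token positions are averaged. Here is a minimal sketch in plain Python, assuming per-token negative log-likelihoods (in nats) have already been extracted from the model. The function name `split_perplexities` and the toy numbers are illustrative, not taken from the project's code.

```python
import math

def split_perplexities(tokens, nll, sep="[SEP_BRYLA]"):
    """Return (ppl_std, ppl_clean, ppl_tags) from per-token NLLs in nats.

    Tokens up to and including `sep` count as tags; tokens after it as text.
    """
    if len(tokens) != len(nll):
        raise ValueError("tokens and nll must have the same length")
    cut = tokens.index(sep) + 1  # first position of the target text

    def ppl(values):
        return math.exp(sum(values) / len(values))

    return ppl(nll), ppl(nll[cut:]), ppl(nll[:cut])

tokens = ["[OTHER]", "[POL:neutral]", "[SEP_BRYLA]", "Wieś", "w", "Polsce"]
nll = [0.1, 0.1, 0.1, 1.2, 1.0, 1.1]  # toy values: tags are easy, text is hard
std, clean, tags = split_perplexities(tokens, nll)
print(f"std={std:.2f} clean={clean:.2f} tags={tags:.2f}")
```

For a whole validation set, the NLL sums and token counts should be accumulated across all sequences before exponentiating, rather than averaging per-sequence perplexities. The ~35% figure above is simply 1 - 2.03/3.10 ≈ 0.345.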

3. Three types of leakage I caught

surface_text duplicated inside bryla AND after [SEP_BRYLA] (model copying)
[FACTS] block included previous Polish text
Anchors contained 80-char surface_text snippets (still leaking after I removed the first duplication)

Each time I had to retrain. Last attempt still showed suspiciously low std between seeds (±0.01) - which suggested the model was matching templates from a biased corpus (Wikipedia has ~5000 nearly-identical village descriptions).
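All three leaks share one signature: a long verbatim span of the target text also appears somewhere in the prefix. Below is a rough check, assuming each training example can be split at `[SEP_BRYLA]` into prefix and target. The 20-character threshold, the `[ANCHOR:...]` tag and the function names are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher

SEP = "[SEP_BRYLA]"

def longest_shared_span(example: str) -> str:
    """Return the longest substring shared by the prefix and the target text."""
    prefix, _, target = example.partition(SEP)
    matcher = SequenceMatcher(None, prefix, target, autojunk=False)
    m = matcher.find_longest_match(0, len(prefix), 0, len(target))
    return prefix[m.a:m.a + m.size]

def is_leaky(example: str, min_chars: int = 20) -> bool:
    """Flag examples whose prefix repeats at least `min_chars` of the target."""
    return len(longest_shared_span(example).strip()) >= min_chars

# Toy examples, not taken from the corpus.
clean = "[OTHER] [POL:neutral] [SEP_BRYLA] Spawanie metodą TIG wymaga gazu osłonowego."
leaky = ("[OTHER] [ANCHOR:Spawanie metodą TIG wymaga gazu] [SEP_BRYLA] "
         "Spawanie metodą TIG wymaga gazu osłonowego.")
print(is_leaky(clean), is_leaky(leaky))
```

Running a check like this over the whole training file before each run is cheaper than discovering the leak after retraining.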

4. Token economy and latency (your suggestion)

Variant	Tokens vs RAW	Training time	Inference latency
RAW	1.0x	4 min	5.7 ms/tok
MIN	1.81x	8 min	5.9 ms/tok
MID	2.82x	12 min	5.8 ms/tok
FULL	6.06x	30 min	5.8 ms/tok

FULL costs about 6x the tokens of RAW for a perplexity improvement of ~3% at best. That is a poor tradeoff.
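The rows of the table can be put on a common scale by dividing each one by the RAW baseline. This is a small calculation using only the numbers from the table; the dictionary layout and variable names are just for illustration.

```python
# (tokens vs RAW, training minutes, inference ms/token), copied from the table
rows = {
    "RAW": (1.00, 4, 5.7),
    "MIN": (1.81, 8, 5.9),
    "MID": (2.82, 12, 5.8),
    "FULL": (6.06, 30, 5.8),
}
base_tokens, base_minutes, _ = rows["RAW"]

overhead = {}
for name, (tokens, minutes, _latency) in rows.items():
    time_ratio = minutes / base_minutes
    # Values above 1.0 mean training time grew faster than the token count.
    overhead[name] = time_ratio / (tokens / base_tokens)
    print(f"{name}: {tokens:.2f}x tokens, {time_ratio:.1f}x training time, "
          f"time/token ratio {overhead[name]:.2f}")
```

Training time grows somewhat faster than token count (FULL: 7.5x the time for 6.06x the tokens), while per-token latency stays flat at 5.7-5.9 ms.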

5. Masked loss (the most interesting experiment)

I had an intuition: “bryła should CARRY information needed to generate the answer. The text is what the model should learn. The bryla is just context the model receives.”

This is equivalent to prefix-LM / conditional generation - loss only on Polish text after [SEP_BRYLA], bryła as context only.
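In practice this usually means masking the labels for every position up to and including the separator. Here is a minimal sketch, assuming the common PyTorch / Hugging Face convention that a label of `-100` is ignored by the cross-entropy loss. The token ids and the function name `mask_prefix` are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_prefix(input_ids, sep_id):
    """Copy input_ids to labels, hiding everything up to and including sep_id."""
    cut = input_ids.index(sep_id) + 1
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])

input_ids = [11, 12, 99, 501, 502, 503]  # 99 stands in for [SEP_BRYLA]
labels = mask_prefix(input_ids, sep_id=99)
print(labels)
```

The model still attends to the prefix tokens; only the loss ignores them. Hugging Face causal LM classes typically shift the labels internally, so the masked list is passed unshifted.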

Result: val_ppl_clean almost identical (3.10 vs 3.18). Numerically neutral.

BUT when I tested the masked model in a mini-chat with manually-crafted bryła prefixes differing only in polarity:

[OTHER] [POL:neutral]  -> geographic / astronomical content
[OTHER] [POL:positive] -> villages, places
[OTHER] [POL:negative] -> sports, competition

Three different topical distributions for three polarities. The model IS reading the bryla as conditioning information. The numerical val_ppl doesn’t show this, but the generation does.
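One way to put a number on this kind of probe is to hold the prefix fixed except for polarity and measure how much the generations overlap. This is only a sketch under assumptions: that the prompt ends with `[SEP_BRYLA]`, that `generate` is any prompt-to-text callable wrapping the model, and that word-set (Jaccard) overlap is an acceptable crude proxy for topical similarity. The stub generator below is a placeholder, not model output.

```python
from itertools import combinations

SEP = "[SEP_BRYLA]"
POLARITIES = ("neutral", "positive", "negative")

def build_probes(bryla_type: str = "OTHER") -> dict:
    """One prompt per polarity; every other part of the prefix is held fixed."""
    return {p: f"[{bryla_type}] [POL:{p}] {SEP}" for p in POLARITIES}

def jaccard(a: str, b: str) -> float:
    """Word-set overlap of two generations (1.0 means identical vocabulary)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0

def probe(generate) -> dict:
    """Run `generate` (prompt -> text) on each probe and compare outputs pairwise."""
    outputs = {p: generate(prompt) for p, prompt in build_probes().items()}
    return {(a, b): jaccard(outputs[a], outputs[b])
            for a, b in combinations(POLARITIES, 2)}

# Stand-in for the real model: replace with a call to the actual sampling loop.
def fake_generate(prompt: str) -> str:
    return "wieś w Polsce" if "positive" in prompt else "mecz piłki nożnej"

scores = probe(fake_generate)
print(scores)
```

A single sample per prefix is noisy, so any real comparison would need many samples per polarity and a baseline of overlap between repeated samples from the same prefix.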

Honest summary

What I think I showed:

Field ablation: fewer informative fields > many fields with defaults
val_ppl_clean is a necessary metric when tags are added to sequences
Three types of leakage to watch for
Conditional generation works: bryła as prefix DOES condition the output

What I did NOT show:

That bryla “helps” in a strong sense (gains are small, ~3%)
That the approach scales (33M tokens is ~5% of Chinchilla optimal)
That the parser is good enough (87% of Wikipedia sentences got [OTHER] - my parser was built for technical Q/A, not general text)

A question, if you have a minute

I’m thinking about what to try next. The parser bottleneck (87% [OTHER]) suggests two options:

Extend the parser with more domain-specific rules
Build a smaller but balanced, multi-domain corpus (~200-500 examples per domain: biographies, geography, technology, daily life, science)

Since Bielik 11B was built by volunteers (SpeakLeash community), I’m wondering: do you (or anyone reading this) know of clean, diverse Polish-language Q/A datasets, or have suggestions for community-driven small dataset construction?

Even pointers to papers/projects that did small-data multi-domain ablation studies well would be very helpful - I’m in territory where I don’t quite know what good practice looks like.

Thanks again. Whatever happens next, this conversation has taught me more about experimental methodology than the actual results.

Best,
Krzysztof
krzysiekpl/bryla-kris
