Custom semantic representation ("bryła") beats raw text in 24/27 configs — built solo on an RTX 2060, looking for feedback
This is probably a big step forward, and there seems to be a clear next procedure to recommend:
This looks like a much stronger proof-of-concept than the previous stage.
The important part is not only “Bryła wins in 24/27 configurations.” The more important part is that you found a concrete failure mode — low structured-input diversity — and then rebuilt the test around a controlled grid. That makes the result more credible.
I would still be careful with the claim, but the direction is good.
My direct answer would be:
Do not scale the synthetic setup much further yet. The next high-value step is a small natural-data test with DOMAIN, shuffled-structure, and random-structure controls.
1. How I would read the current result
From what you describe, the result is now approximately:
| Stage | What it shows | What it does not yet show |
|---|---|---|
| Earlier technical QA result | Bryła can improve a tiny matched setup | Maybe fragile / domain-specific |
| Field ablation | compact fields are better than default-heavy FULL | not yet general |
| Clean PPL / masked loss | tags must be treated as context, not target | PPL alone is still incomplete |
| Current 24/27 grid | Bryła can transmit useful structured signal in a controlled setup | not yet proven on messy natural Polish data |
So the current claim I would make is:
In a controlled synthetic setting, Bryła appears to be a real conditioning signal rather than just noise. The next question is whether the same advantage survives natural data, real parser errors, and stronger controls.
That is already a good research position.
2. The next decisive experiment
I would run a very small natural-data test, not a larger synthetic one.
Use four or five conditions:
RAW
DOMAIN + RAW
BRYLA + RAW
SHUFFLED-BRYLA + RAW
RANDOM-BRYLA + RAW
The most important comparisons:
| Comparison | Meaning |
|---|---|
| BRYLA > RAW | Bryła still helps outside the synthetic setup |
| BRYLA > DOMAIN | Bryła adds more than a simple domain label |
| BRYLA > SHUFFLED-BRYLA | field-value alignment matters |
| BRYLA > RANDOM-BRYLA | result is not just prefix-format regularization |
| DOMAIN ≈ BRYLA | current Bryła may mostly encode domain/topic |
| SHUFFLED-BRYLA ≈ BRYLA | structure labels may not be semantically used |
| RANDOM-BRYLA helps | possible regularization / format artifact |
The core next question is:
Does Bryła beat DOMAIN-only and shuffled-structure controls on natural data?
If yes, the claim becomes much stronger.
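A minimal sketch of how the five conditions could be built from one parsed example. The `[DOMAIN:...]` tag, the `build_conditions` helper, and the `field_vocab` argument are my assumptions for illustration, not part of your pipeline; the `[SEP_BRYLA]` separator follows the layout used later in this reply.

```python
import random

SEP = "[SEP_BRYLA]"

def serialize(fields):
    """Render a field dict as a symbolic prefix such as [TYPE:fact] [POL:neutral]."""
    return " ".join(f"[{k}:{v}]" for k, v in fields.items())

def build_conditions(text, domain, fields, field_vocab, rng):
    """Return the five inputs for one example.

    fields      -- parser output for this example, e.g. {"TYPE": "fact"}
    field_vocab -- values seen per field across the corpus (assumed to exist)
    rng         -- a random.Random instance, so the controls are reproducible
    """
    keys = list(fields)
    values = [fields[k] for k in keys]
    # SHUFFLED: keep the same values but rotate them by a nonzero offset, so
    # each value moves to a different field name (duplicate values can still
    # coincide with the original).
    shift = rng.randrange(1, len(keys)) if len(keys) > 1 else 0
    shuffled = dict(zip(keys, values[shift:] + values[:shift]))
    # RANDOM: same field names and format, values drawn independently of the text.
    randomized = {k: rng.choice(field_vocab[k]) for k in keys}
    return {
        "RAW": text,
        "DOMAIN": f"[DOMAIN:{domain}] {SEP} {text}",
        "BRYLA": f"{serialize(fields)} {SEP} {text}",
        "SHUFFLED": f"{serialize(shuffled)} {SEP} {text}",
        "RANDOM": f"{serialize(randomized)} {SEP} {text}",
    }
```

Keeping the prefix length and format identical across BRYLA, SHUFFLED, and RANDOM is the point of the design: any remaining gap between them should then come from the content of the fields, not from the presence of a prefix.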
3. Natural-data mini-benchmark
I would start small:
6 domains × 50 examples = 300 examples
Suggested domains:
| Domain | Why useful |
|---|---|
| technical / welding / materials | original strongest area |
| geography / places | tests templatic factual data |
| biographies | tests people, dates, roles, events |
| science explanations | tests definitions and causal relations |
| daily-life / practical QA | tests intent, urgency, user-facing pragmatics |
| sports / events | tests event structure and temporal facts |
Report results by domain, not only aggregate.
Example table:
| Domain | RAW | DOMAIN | BRYLA | SHUFFLED | Best | Note |
|---|---|---|---|---|---|---|
| technical | ||||||
| geography | ||||||
| biography | ||||||
| science | ||||||
| daily life | ||||||
| sports | ||||||
This matters because Bryła may help one domain and hurt another. That would still be useful information.
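A small sketch of the per-domain breakdown, assuming each run is logged as a `(domain, condition, clean_ppl)` tuple. The function name and the record layout are hypothetical.

```python
from collections import defaultdict

def per_domain_table(results):
    """Aggregate runs into a per-domain summary.

    results -- iterable of (domain, condition, clean_ppl) tuples, one per run
    Returns {domain: {"mean": {condition: mean_ppl}, "best": condition}}.
    """
    runs = defaultdict(lambda: defaultdict(list))
    for domain, condition, ppl in results:
        runs[domain][condition].append(ppl)
    table = {}
    for domain, by_condition in runs.items():
        means = {c: sum(v) / len(v) for c, v in by_condition.items()}
        # Lower perplexity is better, so "best" is the condition with the minimum mean.
        table[domain] = {"mean": means, "best": min(means, key=means.get)}
    return table
```

Averaging over seeds inside each domain before picking the best condition keeps one lucky seed from deciding the "Best" column.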
4. Replace synthetic parser noise with real parser error
The 0/10/20% parser-noise grid is useful, but the next step should be real parser failure.
A good progression would be:
synthetic parser noise
→ real parser errors
→ real domain shift
→ real QA / generation metric
Synthetic noise tells you the model is robust to artificial corruption. Real parser errors tell you whether the full system works.
I would report:
% parsed
% partial
% OTHER
field default rate
field entropy
field/domain correlation
Example parser dashboard:
| Domain | Parsed % | Partial % | OTHER % | Main failure mode |
|---|---|---|---|---|
| technical | ||||
| geography | ||||
| biography | ||||
| science | ||||
| daily life | ||||
| sports | ||||
The parser is now part of the research object, not just preprocessing.
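The coverage columns of that dashboard could be computed along these lines. This is only a sketch: the three status labels and the `(domain, status)` record format are my assumptions about what your parser can report.

```python
from collections import Counter, defaultdict

STATUSES = ("parsed", "partial", "other")  # hypothetical parser outcome labels

def parser_dashboard(records):
    """Compute per-domain parser coverage.

    records -- iterable of (domain, status) pairs, one per example
    Returns {domain: {status: percentage of that domain's examples}}.
    """
    counts = defaultdict(Counter)
    for domain, status in records:
        if status not in STATUSES:
            raise ValueError(f"unknown parser status: {status!r}")
        counts[domain][status] += 1
    dashboard = {}
    for domain, c in counts.items():
        total = sum(c.values())
        dashboard[domain] = {s: 100.0 * c[s] / total for s in STATUSES}
    return dashboard
```

The "Main failure mode" column still needs manual inspection of the partial and OTHER examples; I would not try to automate that part at this scale.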
5. Keep structure-diversity metrics permanently
The best methodological insight in the new result may be the structured-diversity issue.
I would report these in every experiment:
unique raw texts
unique Bryła strings
Bryła/raw diversity ratio
field entropy
default-field ratio
parser OTHER%
average input tokens
Example:
| Metric | RAW | BRYLA |
|---|---|---|
| unique text strings | ||
| unique structured strings | — | |
| average source tokens | ||
| field entropy | — | |
| default-field ratio | — | |
| parser OTHER% | — | |
This helps distinguish:
Bryła is useful
from:
Bryła collapsed many examples into the same structure
or:
Bryła mostly encoded domain/template identity
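Here is a sketch of how the diversity numbers could be computed, assuming each example has a raw text and a parsed field dict. I take "field entropy" to mean the Shannon entropy (in bits) of each field's value distribution; the `defaults` mapping and the function names are hypothetical.

```python
import math
from collections import Counter

def field_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def diversity_report(raw_texts, structures, defaults):
    """Summarize structured-input diversity.

    raw_texts  -- list of raw strings
    structures -- list of field dicts, aligned with raw_texts
    defaults   -- {field: default value}, used for the default-field ratio
    """
    unique_raw = len(set(raw_texts))
    # Sort items so two dicts with the same content count as the same structure.
    unique_struct = len({tuple(sorted(s.items())) for s in structures})
    fields = sorted({k for s in structures for k in s})
    total_slots = sum(len(s) for s in structures)
    default_slots = sum(
        1 for s in structures for k, v in s.items() if defaults.get(k) == v
    )
    return {
        "unique_raw": unique_raw,
        "unique_structures": unique_struct,
        "diversity_ratio": unique_struct / unique_raw,
        "field_entropy": {f: field_entropy([s.get(f) for s in structures]) for f in fields},
        "default_field_ratio": default_slots / total_slots,
    }
```

A field with entropy near zero carries almost no information for the model, and a diversity ratio far below 1 is the collapse case described above.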
6. Keep clean PPL / masked loss
For prefix-style experiments, full-sequence PPL can be misleading because the model may get rewarded for predicting easy deterministic tags.
So I would keep:
val_ppl_clean = only natural Polish target text
val_ppl_tags = only Bryła tags
val_ppl_std = full sequence, diagnostic only
Primary metric:
val_ppl_clean
The Hugging Face docs on fixed-length-model perplexity are useful here because they emphasize that PPL depends on the exact likelihood/evaluation setup:
- Hugging Face docs — Perplexity of fixed-length models
For decoder-only prefix conditioning, I would use masked loss:
input:
[BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]
labels:
[-100 ... -100] [-100] [POLISH TEXT LABELS]
That matches the conceptual setup:
Bryła = context
Polish text = target
7. Try cooldown
The most interesting next experiment after the control ladder is cooldown.
This is close to the idea in MeCo: train with metadata, then cool down on raw text so the model can function without metadata at inference time.
Resources:
- MeCo — Metadata Conditioning Accelerates Language Model Pre-training
- MeCo code
- MeCo OpenReview
For Bryła:
Phase 1:
train on BRYLA + text
Phase 2:
short cooldown on RAW-only text
Eval:
RAW-only
Controls:
RAW baseline
DOMAIN + text -> RAW cooldown
BRYLA + text -> RAW cooldown
RANDOM-BRYLA + text -> RAW cooldown
Interpretation:
| Result | Meaning |
|---|---|
| Bryła cooldown > RAW | Bryła may work as a training scaffold |
| Bryła cooldown ≈ RAW | no retained scaffold effect |
| DOMAIN cooldown ≈ Bryła cooldown | domain metadata may explain much of the gain |
| random-prefix cooldown helps | possible curriculum/regularization effect |
| Bryła requires Bryła at inference | useful, but deployment depends on parser |
If cooldown works, the story becomes stronger:
Bryła is not only an inference-time representation.
It may be a training scaffold for small models.
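The two-phase schedule could look like this in a training loop. It is a sketch only: the 10% cooldown fraction is an arbitrary placeholder to tune, and both function names are hypothetical.

```python
def input_format_for_step(step, total_steps, cooldown_frac=0.1, conditioned="BRYLA"):
    """Return the input format to use at `step` (0-based).

    The first part of training uses the conditioned format; the final
    `cooldown_frac` of steps uses RAW text only.
    """
    if not 0.0 <= cooldown_frac <= 1.0:
        raise ValueError("cooldown_frac must be in [0, 1]")
    cooldown_steps = int(round(total_steps * cooldown_frac))
    cooldown_start = total_steps - cooldown_steps
    return conditioned if step < cooldown_start else "RAW"

def make_example(text, prefix, fmt):
    """Build the training string: raw text, or prefix + separator + text."""
    return text if fmt == "RAW" else f"{prefix} [SEP_BRYLA] {text}"
```

Running the same schedule with `conditioned="DOMAIN"` or `conditioned="RANDOM"` gives the cooldown controls listed above, with the total number of steps held equal across all runs.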
8. Test serialization format
Current Bryła looks like a compact symbolic representation. That may be best for tiny models, but it should be tested.
Structured-representation work suggests that code-like formats may be less model-friendly than natural-language descriptions in some settings.
Useful resources:
- SR-LLM — Rethinking the Structured Representation in Large Language Models
- SR-LLM arXiv
I would test:
BRYLA-symbolic
BRYLA-verbalized
BRYLA-hybrid
BRYLA-no-defaults
Example:
Symbolic:
[TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [CORE:yes]
Verbalized:
This is a neutral factual statement with general scope. The intent is to inform. The main content is central.
Hybrid:
[type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]
Also test field order, because sequence order can matter a lot for structured inputs.
Useful resource:
- Linearization Order Matters for AMR-to-Text Generation Input
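To keep the format comparison fair, I would generate every variant from the same field dict. A rough sketch, assuming the fields are available as a Python dict; the lookup tables for long names and values are hypothetical, and the verbalized variant is left out because it needs hand-written templates.

```python
def symbolic(fields, order=None):
    """Compact form, e.g. [TYPE:fact] [POL:neutral]; `order` overrides field order."""
    keys = order or list(fields)
    return " ".join(f"[{k}:{fields[k]}]" for k in keys)

def hybrid(fields, long_names, long_values, order=None):
    """Bracketed form with readable names, e.g. [type: factual statement]."""
    keys = order or list(fields)
    parts = []
    for k in keys:
        name = long_names.get(k, k.lower())      # fall back to the lowercased tag
        value = long_values.get(fields[k], fields[k])
        parts.append(f"[{name}: {value}]")
    return " ".join(parts)

def no_defaults(fields, defaults):
    """Drop every field whose value equals its default."""
    return {k: v for k, v in fields.items() if defaults.get(k) != v}
```

For example, `symbolic(no_defaults(fields, defaults))` gives the BRYLA-no-defaults input, and passing a permuted `order` list gives the field-order ablation without touching the content.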
9. Polish datasets and resources
For natural Polish QA / MRC testing, I would look at these.
| Resource | Use |
|---|---|
| PolQA | Polish OpenQA; useful for question/answer type analysis and evidence passages |
| PolQA dataset | practical HF dataset |
| PoQuAD | Polish SQuAD-like QA, includes impossible questions and generative answer layer |
| PoQuAD paper | dataset background |
| PolEval 2024 QA task | Polish reading-comprehension evaluation style |
| PolEval 2024 QA GitHub | task data/code |
| PUGG | Polish KBQA/MRC/IR construction pipeline |
| PUGG GitHub | implementation |
| PUGG dataset | HF dataset |
I would not mix all of these into one training soup immediately.
Better:
small clean natural benchmark
+ controlled ablations
+ separate larger-data experiments later
10. Suggested next reporting table
A compact table like this would be very clear:
| Setup | Data | Control type | Clean PPL | Task metric | Tokens | Wins/seeds | Comment |
|---|---|---|---|---|---|---|---|
| RAW | natural | baseline | |||||
| DOMAIN | natural | simple metadata | |||||
| BRYLA | natural | real structure | |||||
| SHUFFLED | natural | broken alignment | |||||
| RANDOM | natural | format control | |||||
And by domain:
| Domain | BRYLA > RAW? | BRYLA > DOMAIN? | BRYLA > SHUFFLED? | Parser OTHER% | Note |
|---|---|---|---|---|---|
| technical | |||||
| geography | |||||
| biography | |||||
| science | |||||
| daily life | |||||
| sports | |||||
11. What would make the claim much stronger
The result would become much harder to dismiss if the next stage shows:
BRYLA > RAW
BRYLA > DOMAIN
BRYLA > SHUFFLED-BRYLA
BRYLA > RANDOM-BRYLA
on small natural Polish data, with:
clean target-only loss
parser coverage reported
field entropy reported
token cost reported
domain-level breakdown
That would support the claim:
Bryła adds useful structure beyond domain conditioning and prefix-format effects.
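For each of those pairwise comparisons, a win count over seeds plus a simple sign test is usually enough at this scale. The sketch below assumes paired runs (same seed and data split for both conditions) and uses a one-sided exact sign test, which is my choice of test and not something your setup prescribes.

```python
from math import comb

def paired_wins(ppl_a, ppl_b):
    """Compare condition A against B over paired seeds (lower PPL wins).

    Returns (wins, losses, p_value), where p_value is the one-sided exact
    sign-test probability of seeing at least `wins` wins out of the
    non-tied pairs if A and B were equally good.
    """
    if len(ppl_a) != len(ppl_b):
        raise ValueError("need one value per seed for both conditions")
    wins = sum(a < b for a, b in zip(ppl_a, ppl_b))
    losses = sum(a > b for a, b in zip(ppl_a, ppl_b))
    n = wins + losses  # ties carry no information and are dropped
    if n == 0:
        return wins, losses, 1.0
    p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins, losses, p_value
```

With only three to five seeds the test has little power, so I would report the raw wins and losses next to the p-value instead of relying on a threshold.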
12. What would weaken the claim
These would not kill the project, but they would change the interpretation:
| Observation | Interpretation |
|---|---|
| DOMAIN ≈ BRYLA | Bryła may mostly encode domain/topic |
| SHUFFLED ≈ BRYLA | field-value alignment may not matter |
| RANDOM helps | prefix format may act as regularization |
| gains vanish on natural data | synthetic setup may be too clean |
| gains only appear in full PPL | tag-prediction artifact |
| parser outputs mostly [OTHER] | structure is not reaching the model |
| Bryła works only in one domain | still useful, but domain-specific |
Short version
This is good progress.
The next step is not “make it bigger.” The next step is:
small natural data
+ DOMAIN control
+ shuffled-structure control
+ random-prefix control
+ clean PPL
+ parser diagnostics
If Bryła still wins there, the result becomes much stronger.