Custom semantic representation ("bryła") beats raw text in 24/27 configs — built solo on an RTX 2060, looking for feedback

Hugging Face Forums [Unofficial] June 5, 2026

This is probably a big step forward. Here is the next procedure I would recommend:

This looks like a much stronger proof-of-concept than the previous stage.

The important part is not only “Bryła wins in 24/27 configurations.” The more important part is that you found a concrete failure mode — low structured-input diversity — and then rebuilt the test around a controlled grid. That makes the result more credible.

I would still be careful with the claim, but the direction is good.

My direct answer would be:

Do not scale the synthetic setup much further yet. The next high-value step is a small natural-data test with DOMAIN, shuffled-structure, and random-structure controls.

1. How I would read the current result

From what you describe, the result is now approximately:

| Stage | What it shows | What it does not yet show |
| --- | --- | --- |
| Earlier technical QA result | Bryła can improve a tiny matched setup | maybe fragile / domain-specific |
| Field ablation | compact fields are better than default-heavy FULL | not yet general |
| Clean PPL / masked loss | tags must be treated as context, not target | PPL alone is still incomplete |
| Current 24/27 grid | Bryła can transmit useful structured signal in a controlled setup | not yet proven on messy natural Polish data |

So the current claim I would make is:

In a controlled synthetic setting, Bryła appears to be a real conditioning signal rather than just noise. The next question is whether the same advantage survives natural data, real parser errors, and stronger controls.

That is already a good research position.

2. The next decisive experiment

I would run a very small natural-data test, not a larger synthetic one.

Use four or five conditions:

RAW
DOMAIN + RAW
BRYLA + RAW
SHUFFLED-BRYLA + RAW
RANDOM-BRYLA + RAW

The most important comparisons:

| Comparison | Meaning |
| --- | --- |
| `BRYLA > RAW` | Bryła still helps outside the synthetic setup |
| `BRYLA > DOMAIN` | Bryła adds more than a simple domain label |
| `BRYLA > SHUFFLED-BRYLA` | field-value alignment matters |
| `BRYLA > RANDOM-BRYLA` | result is not just prefix-format regularization |
| `DOMAIN ≈ BRYLA` | current Bryła may mostly encode domain/topic |
| `SHUFFLED-BRYLA ≈ BRYLA` | structure labels may not be semantically used |
| `RANDOM-BRYLA` helps | possible regularization / format artifact |

The core next question is:

Does Bryła beat DOMAIN-only and shuffled-structure controls on natural data?

If yes, the claim becomes much stronger.
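To make the control ladder concrete, here is a minimal sketch of how the five inputs could be built from one example. It assumes a Bryła is available as an ordered list of (field, value) pairs and is rendered as `[FIELD:value]` tags. The `[DOMAIN:...]` tag, the per-field `vocab` dictionary, and all function names are hypothetical. I am also assuming that "shuffled" means permuting values across fields within one example; shuffling whole Bryła strings across examples would be a different, equally valid control.

```python
import random

SEP = "[SEP_BRYLA]"  # separator token name as used in the masked-loss layout in this post

def render(fields):
    """Serialize (field, value) pairs as [FIELD:value] tags."""
    return " ".join(f"[{k}:{v}]" for k, v in fields)

def shuffled(fields, rng):
    """Keep field names and the same set of values, but break their alignment."""
    names = [k for k, _ in fields]
    values = [v for _, v in fields]
    if len(set(values)) > 1:  # a different ordering exists only if values differ
        permuted = values[:]
        while permuted == values:
            rng.shuffle(permuted)
        values = permuted
    return list(zip(names, values))

def randomized(fields, vocab, rng):
    """Keep field names, draw every value at random from that field's vocabulary."""
    return [(k, rng.choice(vocab[k])) for k, _ in fields]

def build_conditions(text, domain, fields, vocab, seed=0):
    """Return the model input for each condition (hypothetical formats)."""
    rng = random.Random(seed)
    return {
        "RAW": text,
        "DOMAIN": f"[DOMAIN:{domain}] {SEP} {text}",
        "BRYLA": f"{render(fields)} {SEP} {text}",
        "SHUFFLED-BRYLA": f"{render(shuffled(fields, rng))} {SEP} {text}",
        "RANDOM-BRYLA": f"{render(randomized(fields, vocab, rng))} {SEP} {text}",
    }
```

Keeping the prefix length and tag format identical across the three Bryła variants is the point of the design: any gap between them can then be attributed to the content of the fields rather than to the format.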

3. Natural-data mini-benchmark

I would start small:

6 domains × 50 examples = 300 examples

Suggested domains:

| Domain | Why useful |
| --- | --- |
| technical / welding / materials | original strongest area |
| geography / places | tests templatic factual data |
| biographies | tests people, dates, roles, events |
| science explanations | tests definitions and causal relations |
| daily-life / practical QA | tests intent, urgency, user-facing pragmatics |
| sports / events | tests event structure and temporal facts |

Report results by domain, not only aggregate.

Example table:

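| Domain | RAW | DOMAIN | BRYLA | SHUFFLED | Best | Note |
| --- | --- | --- | --- | --- | --- | --- |
| technical | | | | | | |
| geography | | | | | | |
| biography | | | | | | |
| science | | | | | | |
| daily life | | | | | | |
| sports | | | | | | |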

This matters because Bryła may help one domain and hurt another. That would still be useful information.

4. Replace synthetic parser noise with real parser error

The 0/10/20% parser-noise grid is useful, but the next step should be real parser failure.

A good progression would be:

synthetic parser noise
→ real parser errors
→ real domain shift
→ real QA / generation metric

Synthetic noise tells you whether the model is robust to artificial corruption. Real parser errors tell you whether the full system works.

I would report:

% parsed
% partial
% OTHER
field default rate
field entropy
field/domain correlation
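Most of these numbers can be computed directly from parser output. The sketch below assumes each parse is a dict of field to value, that the parser writes the literal value `OTHER` when it cannot fill a field, and that a `defaults` dict names the default value per field. All three are assumptions about your parser, and the definitions of "parsed", "partial", and "OTHER" are my own guess at reasonable ones. Field/domain correlation is left out.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def parser_report(parses, defaults, other="OTHER"):
    """parses: list of {field: value} dicts, one per example."""
    n = len(parses)
    status = Counter()
    for p in parses:
        n_other = sum(v == other for v in p.values())
        if n_other == 0:
            status["parsed"] += 1      # every field filled
        elif n_other == len(p):
            status["other"] += 1       # nothing recovered
        else:
            status["partial"] += 1     # some fields fell back to OTHER
    slots = [(k, v) for p in parses for k, v in p.items()]
    fields = sorted({k for k, _ in slots})
    return {
        "parsed_pct": 100 * status["parsed"] / n,
        "partial_pct": 100 * status["partial"] / n,
        "other_pct": 100 * status["other"] / n,
        # share of all field slots that hold that field's default value
        "default_rate": sum(v == defaults.get(k) for k, v in slots) / len(slots),
        # low entropy means a field carries little information
        "field_entropy": {f: entropy_bits([p[f] for p in parses if f in p]) for f in fields},
    }
```

Running this once per domain gives the rows of the dashboard; a field whose entropy is near zero is effectively a constant and probably not worth its tokens.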

Example parser dashboard:

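| Domain | Parsed % | Partial % | OTHER % | Main failure mode |
| --- | --- | --- | --- | --- |
| technical | | | | |
| geography | | | | |
| biography | | | | |
| science | | | | |
| daily life | | | | |
| sports | | | | |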

The parser is now part of the research object, not just preprocessing.

5. Keep structure-diversity metrics permanently

The best methodological insight in the new result may be the structured-diversity issue.

I would report these in every experiment:

unique raw texts
unique Bryła strings
Bryła/raw diversity ratio
field entropy
default-field ratio
parser OTHER%
average input tokens

Example:

| Metric | RAW | BRYLA |
| --- | --- | --- |
| unique text strings | | |
| unique structured strings | n/a | |
| average source tokens | | |
| field entropy | n/a | |
| default-field ratio | n/a | |
| parser OTHER% | n/a | |

This helps distinguish:

Bryła is useful

from:

Bryła collapsed many examples into the same structure

or:

Bryła mostly encoded domain/template identity
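The collapse case is cheap to detect before any training. A minimal sketch, assuming you have the raw texts and their serialized Bryła strings as two parallel lists (the function and key names are made up for illustration):

```python
from collections import Counter

def diversity_report(raw_texts, bryla_strings):
    """Compare how many distinct inputs each representation actually provides."""
    unique_raw = len(set(raw_texts))
    unique_bryla = len(set(bryla_strings))
    _, top_count = Counter(bryla_strings).most_common(1)[0]
    return {
        "unique_raw": unique_raw,
        "unique_bryla": unique_bryla,
        # a ratio far below 1 means many texts map to the same structure
        "bryla_raw_ratio": unique_bryla / unique_raw,
        # share of examples that fall into the single most common Bryła string
        "largest_bryla_bucket": top_count / len(bryla_strings),
    }
```

If `largest_bryla_bucket` is large, a big part of the dataset sees the same prefix, and any gain on those examples is unlikely to come from the structure itself.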

6. Keep clean PPL / masked loss

For prefix-style experiments, full-sequence PPL can be misleading because the model may get rewarded for predicting easy deterministic tags.

So I would keep:

val_ppl_clean = only natural Polish target text
val_ppl_tags  = only Bryła tags
val_ppl_std   = full sequence, diagnostic only

Primary metric:

val_ppl_clean
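As a sketch of how the three numbers relate, the function below splits per-token negative log-likelihoods by a tag mask. It assumes you can export, for the validation set, one NLL value (natural log) per predicted token plus a boolean saying whether that token belongs to the Bryła prefix. The function names are illustrative.

```python
import math

def ppl(nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

def split_ppl(token_nll, is_tag):
    """token_nll: per-token NLL values; is_tag: True where the token is a Bryła tag."""
    clean = [x for x, t in zip(token_nll, is_tag) if not t]
    tags = [x for x, t in zip(token_nll, is_tag) if t]
    return {
        "val_ppl_clean": ppl(clean),    # natural Polish target text only
        "val_ppl_tags": ppl(tags),      # Bryła tags only
        "val_ppl_std": ppl(token_nll),  # full sequence, diagnostic only
    }
```

Pooling tokens over the whole validation set before averaging, rather than averaging per-example perplexities, keeps the number comparable between RAW and prefixed runs. Near-deterministic tags contribute NLL close to zero, which is why `val_ppl_std` can look much better than `val_ppl_clean` without the model being any better at Polish.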

The Hugging Face docs on fixed-length-model perplexity are useful here because they emphasize that PPL depends on the exact likelihood/evaluation setup:

Hugging Face docs — Perplexity of fixed-length models

For decoder-only prefix conditioning, I would use masked loss:

input:
  [BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]

labels:
  [-100 ... -100] [-100]      [POLISH TEXT LABELS]

That matches the conceptual setup:

Bryła = context
Polish text = target
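In code, the layout above comes down to one small function. This is a sketch on plain Python lists of token ids, assuming the prefix, separator, and text have already been tokenized; the function name is hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, and as far as I know Hugging Face causal LM models shift the labels internally, so `labels` can stay aligned with `input_ids`.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_example(prefix_ids, sep_id, text_ids):
    """Return (input_ids, labels) with loss computed only on the Polish text."""
    input_ids = list(prefix_ids) + [sep_id] + list(text_ids)
    # mask the Bryła prefix and the separator; keep the text tokens as targets
    labels = [IGNORE_INDEX] * (len(prefix_ids) + 1) + list(text_ids)
    return input_ids, labels
```

For the RAW condition the same function works with an empty prefix if you still insert the separator, or you can skip it and use `labels = input_ids`; either way the loss covers exactly the same text tokens in every condition.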

7. Try cooldown

The most interesting next experiment after the control ladder is cooldown.

This is close to the idea in MeCo: train with metadata, then cool down on raw text so the model can function without metadata at inference time.

Resources:

MeCo — Metadata Conditioning Accelerates Language Model Pre-training
MeCo code
MeCo OpenReview

For Bryła:

Phase 1:
  train on BRYLA + text

Phase 2:
  short cooldown on RAW-only text

Eval:
  RAW-only

Controls:

RAW baseline
DOMAIN + text -> RAW cooldown
BRYLA + text -> RAW cooldown
RANDOM-BRYLA + text -> RAW cooldown
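The two-phase schedule can be expressed as a formatting rule that depends on the training step. A minimal sketch follows; the function names and the `cooldown_frac=0.1` default are placeholders to tune, not values taken from your setup.

```python
def phase_for_step(step, total_steps, cooldown_frac=0.1):
    """Return 'conditioned' for Phase 1 steps and 'cooldown' for Phase 2 steps."""
    cooldown_steps = round(total_steps * cooldown_frac)
    return "cooldown" if step >= total_steps - cooldown_steps else "conditioned"

def format_example(text, prefix, step, total_steps, cooldown_frac=0.1, sep="[SEP_BRYLA]"):
    """Prepend the prefix during Phase 1; emit raw text during cooldown.

    prefix is the DOMAIN, BRYLA, or RANDOM-BRYLA string, or None for the RAW baseline.
    """
    if prefix is None or phase_for_step(step, total_steps, cooldown_frac) == "cooldown":
        return text
    return f"{prefix} {sep} {text}"
```

Using the same `total_steps` and `cooldown_frac` for every control keeps the runs matched in compute, so the RAW-only evaluation compares only what the prefix did during Phase 1.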

Interpretation:

| Result | Meaning |
| --- | --- |
| Bryła cooldown > RAW | Bryła may work as a training scaffold |
| Bryła cooldown ≈ RAW | no retained scaffold effect |
| DOMAIN cooldown ≈ Bryła cooldown | domain metadata may explain much of the gain |
| random-prefix cooldown helps | possible curriculum/regularization effect |
| Bryła-trained model requires Bryła at inference | useful, but deployment depends on parser |

If cooldown works, the story becomes stronger:

Bryła is not only an inference-time representation.
It may be a training scaffold for small models.

8. Test serialization format

Current Bryła looks like a compact symbolic representation. That may be best for tiny models, but it should be tested.

Structured-representation work suggests that code-like formats may be less model-friendly than natural-language descriptions in some settings.

Useful resources:

SR-LLM — Rethinking the Structured Representation in Large Language Models
SR-LLM arXiv

I would test:

BRYLA-symbolic
BRYLA-verbalized
BRYLA-hybrid
BRYLA-no-defaults

Example:

Symbolic:
[TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [CORE:yes]

Verbalized:
This is a neutral factual statement with general scope. The intent is to inform. The main content is central.

Hybrid:
[type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]

Also test field order, because sequence order can matter a lot for structured inputs.

Useful resource:

Linearization Order Matters for AMR-to-Text Generation Input
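The variants are easiest to compare if they are all generated from the same underlying fields. A minimal sketch, assuming a Bryła is an ordered list of (field, value) pairs: the `FIELD_NAMES` mapping, the sentence template in `verbalized`, and the `defaults` dict are hypothetical, and the verbalized output is more mechanical than the hand-written example above.

```python
# Hypothetical expansion of short field codes into readable names.
FIELD_NAMES = {"TYPE": "type", "POL": "polarity", "SCOPE": "scope",
               "INTENT": "intent", "CORE": "core"}

def symbolic(fields):
    """[TYPE:fact] [POL:neutral] ..."""
    return " ".join(f"[{k}:{v}]" for k, v in fields)

def hybrid(fields):
    """[type: fact] [polarity: neutral] ..."""
    return " ".join(f"[{FIELD_NAMES.get(k, k.lower())}: {v}]" for k, v in fields)

def verbalized(fields):
    """One short templated sentence per field."""
    return " ".join(f"The {FIELD_NAMES.get(k, k.lower())} is {v}." for k, v in fields)

def no_defaults(fields, defaults):
    """Drop fields that only carry their default value."""
    return [(k, v) for k, v in fields if defaults.get(k) != v]

def reorder(fields, order):
    """Re-emit fields in a given order, for the field-order ablation."""
    lookup = dict(fields)
    return [(k, lookup[k]) for k in order if k in lookup]
```

Because the formats differ a lot in length, it is worth reporting average prefix tokens next to each result, so a win for the verbalized form is not confused with a win from simply having more context tokens.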

9. Polish datasets and resources

For natural Polish QA / MRC testing, I would look at these.

| Resource | Use |
| --- | --- |
| PolQA | Polish OpenQA; useful for question/answer type analysis and evidence passages |
| PolQA dataset | practical HF dataset |
| PoQuAD | Polish SQuAD-like QA, includes impossible questions and generative answer layer |
| PoQuAD paper | dataset background |
| PolEval 2024 QA task | Polish reading-comprehension evaluation style |
| PolEval 2024 QA GitHub | task data/code |
| PUGG | Polish KBQA/MRC/IR construction pipeline |
| PUGG GitHub | implementation |
| PUGG dataset | HF dataset |

I would not mix all of these into one training soup immediately.

Better:

small clean natural benchmark
+ controlled ablations
+ separate larger-data experiments later

10. Suggested next reporting table

A compact table like this would be very clear:

| Setup | Data | Control type |
| --- | --- | --- |
| RAW | natural | baseline |
| DOMAIN | natural | simple metadata |
| BRYLA | natural | real structure |
| SHUFFLED | natural | broken alignment |
| RANDOM | natural | format control |

And by domain:

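| Domain | BRYLA > RAW? | BRYLA > DOMAIN? | BRYLA > SHUFFLED? | Parser OTHER% | Note |
| --- | --- | --- | --- | --- | --- |
| technical | | | | | |
| geography | | | | | |
| biography | | | | | |
| science | | | | | |
| daily life | | | | | |
| sports | | | | | |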

11. What would make the claim much stronger

The result would become much harder to dismiss if the next stage shows:

BRYLA > RAW
BRYLA > DOMAIN
BRYLA > SHUFFLED-BRYLA
BRYLA > RANDOM-BRYLA

on small natural Polish data, with:

clean target-only loss
parser coverage reported
field entropy reported
token cost reported
domain-level breakdown

That would support the claim:

Bryła adds useful structure beyond domain conditioning and prefix-format effects.

12. What would weaken the claim

These would not kill the project, but they would change the interpretation:

| Observation | Interpretation |
| --- | --- |
| `DOMAIN ≈ BRYLA` | Bryła may mostly encode domain/topic |
| `SHUFFLED ≈ BRYLA` | field-value alignment may not matter |
| `RANDOM` helps | prefix format may act as regularization |
| gains vanish on natural data | synthetic setup may be too clean |
| gains only appear in full PPL | tag-prediction artifact |
| parser outputs mostly `[OTHER]` | structure is not reaching the model |
| Bryła works only in one domain | still useful, but domain-specific |
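The first three rows can be checked mechanically once you have `val_ppl_clean` per condition. The sketch below is one way to do it. The relative tolerance `tol` that decides what counts as "≈" is a made-up placeholder; ideally it would be replaced by the spread you observe across random seeds. The function name and the note strings are illustrative.

```python
def interpret(ppl, tol=0.02):
    """ppl: {condition: val_ppl_clean}, lower is better. Returns warning notes."""
    def close(a, b):
        # treat two conditions as tied if they differ by at most tol (relative)
        return abs(ppl[a] - ppl[b]) <= tol * ppl[b]

    def beats(a, b):
        return ppl[a] < ppl[b] * (1 - tol)

    notes = []
    if not beats("BRYLA", "RAW"):
        notes.append("no clear Bryła gain over RAW")
    if close("DOMAIN", "BRYLA"):
        notes.append("Bryła may mostly encode domain/topic")
    if close("SHUFFLED", "BRYLA"):
        notes.append("field-value alignment may not matter")
    if beats("RANDOM", "RAW"):
        notes.append("prefix format may act as regularization")
    return notes
```

Running it per domain rather than on the aggregate also covers the last row of the table: an empty list in one domain and several notes in the others points to a domain-specific effect.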

Short version

This is good progress.

The next step is not “make it bigger.” The next step is:

small natural data
+ DOMAIN control
+ shuffled-structure control
+ random-prefix control
+ clean PPL
+ parser diagnostics

If Bryła still wins there, the result becomes much stronger.