Custom semantic representation ("bryła") beats raw text in 24/27 configs — built solo on an RTX 2060, looking for feedback

Hugging Face Forums [Unofficial] June 5, 2026
This is probably a big step forward, and there seems to be a clear recommended next step:


This looks like a much stronger proof-of-concept than the previous stage.

The important part is not only “Bryła wins in 24/27 configurations.” The more important part is that you found a concrete failure mode — low structured-input diversity — and then rebuilt the test around a controlled grid. That makes the result more credible.

I would still be careful with the claim, but the direction is good.

My direct answer would be:

Do not scale the synthetic setup much further yet. The next high-value step is a small natural-data test with DOMAIN, shuffled-structure, and random-structure controls.

1. How I would read the current result

From what you describe, the result is now approximately:

| Stage | What it shows | What it does not yet show |
|---|---|---|
| Earlier technical QA result | Bryła can improve a tiny matched setup | Maybe fragile / domain-specific |
| Field ablation | compact fields are better than default-heavy FULL | not yet general |
| Clean PPL / masked loss | tags must be treated as context, not target | PPL alone is still incomplete |
| Current 24/27 grid | Bryła can transmit useful structured signal in a controlled setup | not yet proven on messy natural Polish data |

So the current claim I would make is:

In a controlled synthetic setting, Bryła appears to be a real conditioning signal rather than just noise. The next question is whether the same advantage survives natural data, real parser errors, and stronger controls.

That is already a good research position.

2. The next decisive experiment

I would run a very small natural-data test, not a larger synthetic one.

Use four or five conditions:

RAW
DOMAIN + RAW
BRYLA + RAW
SHUFFLED-BRYLA + RAW
RANDOM-BRYLA + RAW

The most important comparisons:

| Comparison | Meaning |
|---|---|
| BRYLA > RAW | Bryła still helps outside the synthetic setup |
| BRYLA > DOMAIN | Bryła adds more than a simple domain label |
| BRYLA > SHUFFLED-BRYLA | field-value alignment matters |
| BRYLA > RANDOM-BRYLA | result is not just prefix-format regularization |
| DOMAIN ≈ BRYLA | current Bryła may mostly encode domain/topic |
| SHUFFLED-BRYLA ≈ BRYLA | structure labels may not be semantically used |
| RANDOM-BRYLA helps | possible regularization / format artifact |

The core next question is:

Does Bryła beat DOMAIN-only and shuffled-structure controls on natural data?

If yes, the claim becomes much stronger.
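A minimal sketch of how the five inputs could be built for one example. It assumes each example is a dict with `text`, `domain`, and a `fields` mapping; those names and the `[DOMAIN:...]` tag format are hypothetical. Here SHUFFLED-BRYLA rotates the values across field names to break field-value alignment, and RANDOM-BRYLA borrows a well-formed structure from another example. Both are only one possible reading of those controls.

```python
import random

def serialize(fields):
    """Render a field dict as bracketed tags, e.g. [TYPE:fact] [INTENT:inform]."""
    return " ".join(f"[{name}:{value}]" for name, value in fields.items())

def build_conditions(example, pool, rng):
    """Return the five inputs for one example.

    example: dict with 'text', 'domain', 'fields' (hypothetical layout)
    pool:    list of examples; must contain at least one other example
    rng:     random.Random instance, so the controls are reproducible
    """
    text, fields = example["text"], example["fields"]

    # SHUFFLED: keep the same values but rotate them by a nonzero offset,
    # so every field name is paired with a value from a different position.
    names, values = list(fields), list(fields.values())
    k = rng.randrange(1, len(values)) if len(values) > 1 else 0
    shuffled = dict(zip(names, values[k:] + values[:k]))

    # RANDOM: a well-formed structure taken from an unrelated example.
    donor = rng.choice([ex for ex in pool if ex is not example])

    return {
        "RAW": text,
        "DOMAIN": f"[DOMAIN:{example['domain']}] {text}",
        "BRYLA": f"{serialize(fields)} {text}",
        "SHUFFLED-BRYLA": f"{serialize(shuffled)} {text}",
        "RANDOM-BRYLA": f"{serialize(donor['fields'])} {text}",
    }
```

Rotation is used instead of a plain shuffle because a shuffle can return the original order, which would silently turn the control back into BRYLA.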

3. Natural-data mini-benchmark

I would start small:

6 domains × 50 examples = 300 examples

Suggested domains:

| Domain | Why useful |
|---|---|
| technical / welding / materials | original strongest area |
| geography / places | tests templatic factual data |
| biographies | tests people, dates, roles, events |
| science explanations | tests definitions and causal relations |
| daily-life / practical QA | tests intent, urgency, user-facing pragmatics |
| sports / events | tests event structure and temporal facts |

Report results by domain, not only aggregate.

Example table:

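| Domain | RAW | DOMAIN | BRYLA | SHUFFLED | Best | Note |
|---|---|---|---|---|---|---|
| technical | | | | | | |
| geography | | | | | | |
| biography | | | | | | |
| science | | | | | | |
| daily life | | | | | | |
| sports | | | | | | |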

This matters because Bryła may help one domain and hurt another. That would still be useful information.
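A small sketch of how the per-domain table could be filled from run logs. It assumes results are available as `(domain, condition, clean_ppl)` tuples, one per run or seed; that layout and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def per_domain_table(results):
    """results: iterable of (domain, condition, clean_ppl) tuples.

    Returns {domain: {"means": {condition: mean_ppl}, "best": condition}}.
    Lower perplexity is treated as better.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for domain, condition, ppl in results:
        grouped[domain][condition].append(ppl)

    table = {}
    for domain, by_condition in grouped.items():
        means = {cond: mean(vals) for cond, vals in by_condition.items()}
        table[domain] = {"means": means, "best": min(means, key=means.get)}
    return table
```

Keeping the per-domain means next to the "best" label makes it easy to see when the winner is decided by a very small margin.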

4. Replace synthetic parser noise with real parser error

The 0/10/20% parser-noise grid is useful, but the next step should be real parser failure.

A good progression would be:

synthetic parser noise
→ real parser errors
→ real domain shift
→ real QA / generation metric

Synthetic noise tells you whether the model is robust to artificial corruption. Real parser errors tell you whether the full system works.

I would report:

% parsed
% partial
% OTHER
field default rate
field entropy
field/domain correlation

Example parser dashboard:

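| Domain | Parsed % | Partial % | OTHER % | Main failure mode |
|---|---|---|---|---|
| technical | | | | |
| geography | | | | |
| biography | | | | |
| science | | | | |
| daily life | | | | |
| sports | | | | |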

The parser is now part of the research object, not just preprocessing.
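A rough sketch of how the parsed / partial / OTHER percentages could be computed per domain. It assumes the parser returns a dict of field values, or `None` when it falls back to `[OTHER]`, and that "partial" means some but not all fields were filled. Those definitions and the field list (borrowed from the symbolic example in section 8) are assumptions to adjust to the real parser.

```python
from collections import Counter

# Field names taken from the symbolic example in this post; adjust to the real schema.
FIELDS = ("TYPE", "POL", "SCOPE", "INTENT", "CORE")
STATUSES = ("parsed", "partial", "other")

def parse_status(parsed):
    """Classify one parser output (dict of field -> value, or None for the OTHER fallback)."""
    filled = sum(1 for name in FIELDS if parsed and parsed.get(name) is not None)
    if filled == len(FIELDS):
        return "parsed"
    return "partial" if filled else "other"

def coverage_by_domain(rows):
    """rows: iterable of (domain, parser_output). Returns {domain: {status: percent}}."""
    counts = {}
    for domain, parsed in rows:
        counts.setdefault(domain, Counter())[parse_status(parsed)] += 1
    return {
        domain: {s: 100.0 * c[s] / sum(c.values()) for s in STATUSES}
        for domain, c in counts.items()
    }
```

The "main failure mode" column still needs manual inspection of the partial and OTHER cases; this only gives the rates.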

5. Keep structure-diversity metrics permanently

The best methodological insight in the new result may be the structured-diversity issue.

I would report these in every experiment:

unique raw texts
unique Bryła strings
Bryła/raw diversity ratio
field entropy
default-field ratio
parser OTHER%
average input tokens

Example:

| Metric | RAW | BRYLA |
|---|---|---|
| unique text strings | | |
| unique structured strings | | |
| average source tokens | | |
| field entropy | | |
| default-field ratio | | |
| parser OTHER% | | |

This helps distinguish:

Bryła is useful

from:

Bryła collapsed many examples into the same structure

or:

Bryła mostly encoded domain/template identity
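A minimal sketch of the diversity metrics, assuming the raw texts are a list of strings, each Bryła is a dict of field values, and a `defaults` dict says which value counts as the default for each field. All three names are hypothetical, and the entropy is the plain Shannon entropy of each field's observed values.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def diversity_report(raw_texts, bryla_fields, defaults):
    """raw_texts: list of str; bryla_fields: list of {field: value}; defaults: {field: default}."""
    strings = [" ".join(f"[{k}:{v}]" for k, v in f.items()) for f in bryla_fields]
    slots = [(k, v) for f in bryla_fields for k, v in f.items()]
    names = sorted({k for k, _ in slots})
    return {
        "unique_raw": len(set(raw_texts)),
        "unique_bryla": len(set(strings)),
        # A ratio far below 1 means many different texts share one structure.
        "bryla_raw_ratio": len(set(strings)) / len(set(raw_texts)),
        "field_entropy": {n: entropy_bits([f.get(n) for f in bryla_fields]) for n in names},
        "default_field_ratio": sum(defaults.get(k) == v for k, v in slots) / len(slots),
    }
```

A field with entropy near zero carries almost no information across the dataset, which is one quick way to spot the collapse case described above.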

6. Keep clean PPL / masked loss

For prefix-style experiments, full-sequence PPL can be misleading because the model may get rewarded for predicting easy deterministic tags.

So I would keep:

val_ppl_clean = only natural Polish target text
val_ppl_tags  = only Bryła tags
val_ppl_std   = full sequence, diagnostic only

Primary metric:

val_ppl_clean
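A minimal sketch of how the three numbers could be derived from the same forward pass, assuming you already have a per-token negative log-likelihood (natural log) and a boolean mask marking which tokens belong to the Bryła prefix. The function and argument names are hypothetical.

```python
import math

def perplexity(nlls):
    """exp of the mean negative log-likelihood over the given tokens."""
    return math.exp(sum(nlls) / len(nlls))

def split_perplexities(token_nlls, is_tag):
    """token_nlls: per-token NLL; is_tag: True where the token is part of the Bryła prefix."""
    clean = [n for n, t in zip(token_nlls, is_tag) if not t]
    tags = [n for n, t in zip(token_nlls, is_tag) if t]
    return {
        "val_ppl_clean": perplexity(clean),      # natural Polish target text only
        "val_ppl_tags": perplexity(tags),        # Bryła tags only
        "val_ppl_std": perplexity(token_nlls),   # full sequence, diagnostic only
    }
```

For a validation set, I would pool the token losses across all examples before exponentiating rather than averaging per-example perplexities, so long and short examples are weighted by token count.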

The Hugging Face docs on fixed-length-model perplexity are useful here because they emphasize that PPL depends on the exact likelihood/evaluation setup:

  • Hugging Face docs — Perplexity of fixed-length models

For decoder-only prefix conditioning, I would use masked loss:

input:
  [BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]

labels:
  [-100 ... -100] [-100]      [POLISH TEXT LABELS]

That matches the conceptual setup:

Bryła = context
Polish text = target
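The label layout above can be sketched in a few lines, assuming the prefix, separator, and text are already tokenized to integer IDs (the function name and ID values are hypothetical). `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why Hugging Face causal-LM training uses it to exclude positions from the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(prefix_ids, sep_id, text_ids):
    """Concatenate prefix + separator + text and mask everything except the text.

    Labels are aligned one-to-one with input_ids; this assumes the model shifts
    them internally for next-token prediction, as Hugging Face causal LMs do.
    """
    input_ids = list(prefix_ids) + [sep_id] + list(text_ids)
    labels = [IGNORE_INDEX] * (len(prefix_ids) + 1) + list(text_ids)
    return {"input_ids": input_ids, "labels": labels}
```

With this layout the model still attends to the Bryła prefix, but only the Polish text tokens contribute to the training loss and to the clean perplexity.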

7. Try cooldown

The most interesting next experiment after the control ladder is cooldown.

This is close to the idea in MeCo: train with metadata, then cool down on raw text so the model can function without metadata at inference time.

Resources:

  • MeCo — Metadata Conditioning Accelerates Language Model Pre-training
  • MeCo code
  • MeCo OpenReview

For Bryła:

Phase 1:
  train on BRYLA + text

Phase 2:
  short cooldown on RAW-only text

Eval:
  RAW-only

Controls:

RAW baseline
DOMAIN + text -> RAW cooldown
BRYLA + text -> RAW cooldown
RANDOM-BRYLA + text -> RAW cooldown
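A minimal sketch of the two-phase schedule, assuming each example dict carries its `text` plus one pre-rendered prefix string per condition (hypothetical keys such as `bryla` or `domain`). The `cooldown_fraction=0.1` default is an arbitrary placeholder to tune, not a recommendation.

```python
def cooldown_start(total_steps, cooldown_fraction=0.1):
    """First step of the RAW-only phase; the 0.1 default is a placeholder."""
    return total_steps - round(total_steps * cooldown_fraction)

def training_input(example, step, total_steps, prefix_key="bryla", cooldown_fraction=0.1):
    """Return the training string for this step.

    Phase 1: prefix + text (BRYLA, DOMAIN, or RANDOM-BRYLA, chosen by prefix_key).
    Phase 2: text only, so the model is finished on the same format used at eval.
    """
    if step >= cooldown_start(total_steps, cooldown_fraction):
        return example["text"]
    return f"{example[prefix_key]} {example['text']}"
```

Using the same function for every control keeps the number of cooldown steps identical across conditions, so differences at RAW-only eval are not caused by one run simply seeing more raw text.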

Interpretation:

| Result | Meaning |
|---|---|
| Bryła cooldown > RAW | Bryła may work as a training scaffold |
| Bryła cooldown ≈ RAW | no retained scaffold effect |
| DOMAIN cooldown ≈ Bryła cooldown | domain metadata may explain much of the gain |
| random-prefix cooldown helps | possible curriculum/regularization effect |
| Bryła requires Bryła at inference | useful, but deployment depends on parser |

If cooldown works, the story becomes stronger:

Bryła is not only an inference-time representation.
It may be a training scaffold for small models.

8. Test serialization format

Current Bryła looks like a compact symbolic representation. That may be best for tiny models, but it should be tested.

Structured-representation work suggests that code-like formats may be less model-friendly than natural-language descriptions in some settings.

Useful resource:

  • SR-LLM — Rethinking the Structured Representation in Large Language Models
  • SR-LLM arXiv

I would test:

BRYLA-symbolic
BRYLA-verbalized
BRYLA-hybrid
BRYLA-no-defaults

Example:

Symbolic:
[TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [CORE:yes]

Verbalized:
This is a neutral factual statement with general scope. The intent is to inform. The main content is central.

Hybrid:
[type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]

Also test field order, because sequence order can matter a lot for structured inputs.

Useful resource:

  • Linearization Order Matters for AMR-to-Text Generation Input
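A small sketch of how the cheaper variants could be generated from one field dict: the symbolic form, a no-defaults form, and a reordered form for the field-order test. The `defaults` mapping and function names are hypothetical. The verbalized and hybrid forms are left out here because they need hand-written wording per field.

```python
def symbolic(fields):
    """[TYPE:fact] [POL:neutral] ... in the dict's own field order."""
    return " ".join(f"[{name}:{value}]" for name, value in fields.items())

def no_defaults(fields, defaults):
    """Drop every field whose value equals its default (defaults: field -> default value)."""
    return symbolic({k: v for k, v in fields.items() if defaults.get(k) != v})

def reordered(fields, order):
    """Same fields and values, serialized in the given field order."""
    return symbolic({name: fields[name] for name in order})
```

For example, with the fields from the symbolic example above and `POL:neutral` and `SCOPE:general` treated as defaults, `no_defaults` keeps only the TYPE, INTENT, and CORE tags, which also shortens the prefix and should be reported in the token-cost column.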

9. Polish datasets and resources

For natural Polish QA / MRC testing, I would look at these.

| Resource | Use |
|---|---|
| PolQA | Polish OpenQA; useful for question/answer type analysis and evidence passages |
| PolQA dataset | practical HF dataset |
| PoQuAD | Polish SQuAD-like QA, includes impossible questions and generative answer layer |
| PoQuAD paper | dataset background |
| PolEval 2024 QA task | Polish reading-comprehension evaluation style |
| PolEval 2024 QA GitHub | task data/code |
| PUGG | Polish KBQA/MRC/IR construction pipeline |
| PUGG GitHub | implementation |
| PUGG dataset | HF dataset |

I would not mix all of these into one training soup immediately.

Better:

small clean natural benchmark
+ controlled ablations
+ separate larger-data experiments later

10. Suggested next reporting table

A compact table like this would be very clear:

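| Setup | Data | Control type | Clean PPL | Task metric | Tokens | Wins/seeds | Comment |
|---|---|---|---|---|---|---|---|
| RAW | natural | baseline | | | | | |
| DOMAIN | natural | simple metadata | | | | | |
| BRYLA | natural | real structure | | | | | |
| SHUFFLED | natural | broken alignment | | | | | |
| RANDOM | natural | format control | | | | | |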

And by domain:

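| Domain | BRYLA > RAW? | BRYLA > DOMAIN? | BRYLA > SHUFFLED? | Parser OTHER% | Note |
|---|---|---|---|---|---|
| technical | | | | | |
| geography | | | | | |
| biography | | | | | |
| science | | | | | |
| daily life | | | | | |
| sports | | | | | |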
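For the Wins/seeds column, a paired count per seed is probably enough at this scale. A minimal sketch, assuming clean PPL is stored per seed and per condition in a nested dict (a hypothetical layout) and that a strictly lower PPL counts as a win:

```python
def wins_over(ppl_by_seed, condition="BRYLA", control="RAW"):
    """ppl_by_seed: {seed: {condition_name: clean_ppl}}.

    Returns (wins, seeds): the number of seeds where `condition` has strictly
    lower clean PPL than `control`, out of the seeds where both were run.
    """
    paired = [
        (run[condition], run[control])
        for run in ppl_by_seed.values()
        if condition in run and control in run
    ]
    wins = sum(1 for cond_ppl, ctrl_ppl in paired if cond_ppl < ctrl_ppl)
    return wins, len(paired)
```

Comparing within the same seed keeps initialization and data order fixed, so a "win" reflects the input format rather than run-to-run noise between unrelated runs.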

11. What would make the claim much stronger

The result would become much harder to dismiss if the next stage shows:

BRYLA > RAW
BRYLA > DOMAIN
BRYLA > SHUFFLED-BRYLA
BRYLA > RANDOM-BRYLA

on small natural Polish data, with:

clean target-only loss
parser coverage reported
field entropy reported
token cost reported
domain-level breakdown

That would support the claim:

Bryła adds useful structure beyond domain conditioning and prefix-format effects.

12. What would weaken the claim

These would not kill the project, but they would change the interpretation:

| Observation | Interpretation |
|---|---|
| DOMAIN ≈ BRYLA | Bryła may mostly encode domain/topic |
| SHUFFLED ≈ BRYLA | field-value alignment may not matter |
| RANDOM helps | prefix format may act as regularization |
| gains vanish on natural data | synthetic setup may be too clean |
| gains only appear in full PPL | tag-prediction artifact |
| parser outputs mostly [OTHER] | structure is not reaching the model |
| Bryła works only in one domain | still useful, but domain-specific |

Short version

This is good progress.

The next step is not “make it bigger.” The next step is:

small natural data
+ DOMAIN control
+ shuffled-structure control
+ random-prefix control
+ clean PPL
+ parser diagnostics

If Bryła still wins there, the result becomes much stronger.
