Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiel7cy4m4b7zwnwf6dll2vwvn35wuxc252zytuab2e7ledtpn2hr4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnj7xobj7xv2"
  },
  "path": "/t/custom-semantic-representation-bryla-beats-raw-text-in-24-27-configs-built-solo-on-an-rtx-2060-looking-for-feedback/176542#post_2",
  "publishedAt": "2026-06-05T02:21:44.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face docs — Perplexity of fixed-length models",
    "MeCo — Metadata Conditioning Accelerates Language Model Pre-training",
    "MeCo code",
    "MeCo OpenReview",
    "SR-LLM — Rethinking the Structured Representation in Large Language Models",
    "SR-LLM arXiv",
    "Linearization Order Matters for AMR-to-Text Generation Input",
    "PolQA",
    "PolQA dataset",
    "PoQuAD",
    "PoQuAD paper",
    "PolEval 2024 QA task",
    "PolEval 2024 QA GitHub",
    "PUGG",
    "PUGG GitHub",
    "PUGG dataset"
  ],
  "textContent": "Probably a big step forward. It looks like there is a recommended next procedure:\n\n* * *\n\nThis looks like a much stronger proof-of-concept than the previous stage.\n\nThe important part is not only “Bryła wins in 24/27 configurations.” The more important part is that you found a concrete failure mode — low structured-input diversity — and then rebuilt the test around a controlled grid. That makes the result more credible.\n\nI would still be careful with the claim, but the direction is good.\n\nMy direct answer would be:\n\n> Do not scale the synthetic setup much further yet.\n>  The next high-value step is a small natural-data test with `DOMAIN`, shuffled-structure, and random-structure controls.\n\n## 1. How I would read the current result\n\nFrom what you describe, the result is now approximately:\n\nStage | What it shows | What it does not yet show\n---|---|---\nEarlier technical QA result | Bryła can improve a tiny matched setup | Maybe fragile / domain-specific\nField ablation | compact fields are better than default-heavy FULL | not yet general\nClean PPL / masked loss | tags must be treated as context, not target | PPL alone is still incomplete\nCurrent 24/27 grid | Bryła can transmit useful structured signal in a controlled setup | not yet proven on messy natural Polish data\n\nSo the current claim I would make is:\n\n> In a controlled synthetic setting, Bryła appears to be a real conditioning signal rather than just noise. The next question is whether the same advantage survives natural data, real parser errors, and stronger controls.\n\nThat is already a good research position.\n\n## 2. The next decisive experiment\n\nI would run a very small natural-data test, not a larger synthetic one.\n\nUse four or five conditions:\n\n\n    RAW\n    DOMAIN + RAW\n    BRYLA + RAW\n    SHUFFLED-BRYLA + RAW\n    RANDOM-BRYLA + RAW\n\n\nThe most important comparisons:\n\nComparison | Meaning\n---|---\n`BRYLA > RAW` | Bryła still helps outside the synthetic setup\n`BRYLA > DOMAIN` | Bryła adds more than a simple domain label\n`BRYLA > SHUFFLED-BRYLA` | field-value alignment matters\n`BRYLA > RANDOM-BRYLA` | result is not just prefix-format regularization\n`DOMAIN ≈ BRYLA` | current Bryła may mostly encode domain/topic\n`SHUFFLED-BRYLA ≈ BRYLA` | structure labels may not be semantically used\n`RANDOM-BRYLA` helps | possible regularization / format artifact\n\nThe core next question is:\n\n\n    Does Bryła beat DOMAIN-only and shuffled-structure controls on natural data?\n\n\nIf yes, the claim becomes much stronger.\n\n## 3. Natural-data mini-benchmark\n\nI would start small:\n\n\n    6 domains × 50 examples = 300 examples\n\n\nSuggested domains:\n\nDomain | Why useful\n---|---\ntechnical / welding / materials | original strongest area\ngeography / places | tests templatic factual data\nbiographies | tests people, dates, roles, events\nscience explanations | tests definitions and causal relations\ndaily-life / practical QA | tests intent, urgency, user-facing pragmatics\nsports / events | tests event structure and temporal facts\n\nReport results by domain, not only aggregate.\n\nExample table:\n\nDomain | RAW | DOMAIN | BRYLA | SHUFFLED | Best | Note\n---|---|---|---|---|---|---\ntechnical |  |  |  |  |  |\ngeography |  |  |  |  |  |\nbiography |  |  |  |  |  |\nscience |  |  |  |  |  |\ndaily life |  |  |  |  |  |\nsports |  |  |  |  |  |\n\nThis matters because Bryła may help one domain and hurt another. That would still be useful information.\n\n## 4. Replace synthetic parser noise with real parser error\n\nThe 0/10/20% parser-noise grid is useful, but the next step should be real parser failure.\n\nA good progression would be:\n\n\n    synthetic parser noise\n    → real parser errors\n    → real domain shift\n    → real QA / generation metric\n\n\nSynthetic noise tells you the model is robust to artificial corruption. Real parser errors tell you whether the full system works.\n\nI would report:\n\n\n    % parsed\n    % partial\n    % OTHER\n    field default rate\n    field entropy\n    field/domain correlation\n\n\nExample parser dashboard:\n\nDomain | Parsed % | Partial % | OTHER % | Main failure mode\n---|---|---|---|---\ntechnical |  |  |  |\ngeography |  |  |  |\nbiography |  |  |  |\nscience |  |  |  |\ndaily life |  |  |  |\nsports |  |  |  |\n\nThe parser is now part of the research object, not just preprocessing.\n\n## 5. Keep structure-diversity metrics permanently\n\nThe best methodological insight in the new result may be the structured-diversity issue.\n\nI would report these in every experiment:\n\n\n    unique raw texts\n    unique Bryła strings\n    Bryła/raw diversity ratio\n    field entropy\n    default-field ratio\n    parser OTHER%\n    average input tokens\n\n\nExample:\n\nMetric | RAW | BRYLA\n---|---|---\nunique text strings |  |\nunique structured strings | — |\naverage source tokens |  |\nfield entropy | — |\ndefault-field ratio | — |\nparser OTHER% | — |\n\nThis helps distinguish:\n\n\n    Bryła is useful\n\n\nfrom:\n\n\n    Bryła collapsed many examples into the same structure\n\n\nor:\n\n\n    Bryła mostly encoded domain/template identity\n\n\n## 6. Keep clean PPL / masked loss\n\nFor prefix-style experiments, full-sequence PPL can be misleading because the model may get rewarded for predicting easy deterministic tags.\n\nSo I would keep:\n\n\n    val_ppl_clean = only natural Polish target text\n    val_ppl_tags  = only Bryła tags\n    val_ppl_std   = full sequence, diagnostic only\n\n\nPrimary metric:\n\n\n    val_ppl_clean\n\n\nThe Hugging Face docs on fixed-length-model perplexity are useful here because they emphasize that PPL depends on the exact likelihood/evaluation setup:\n\n  * Hugging Face docs — Perplexity of fixed-length models\n\n\n\nFor decoder-only prefix conditioning, I would use masked loss:\n\n\n    input:\n      [BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]\n\n    labels:\n      [-100 ... -100] [-100]      [POLISH TEXT LABELS]\n\n\nThat matches the conceptual setup:\n\n\n    Bryła = context\n    Polish text = target\n\n\n## 7. Try cooldown\n\nThe most interesting next experiment after the control ladder is cooldown.\n\nThis is close to the idea in MeCo: train with metadata, then cool down on raw text so the model can function without metadata at inference time.\n\nResource:\n\n  * MeCo — Metadata Conditioning Accelerates Language Model Pre-training\n  * MeCo code\n  * MeCo OpenReview\n\n\n\nFor Bryła:\n\n\n    Phase 1:\n      train on BRYLA + text\n\n    Phase 2:\n      short cooldown on RAW-only text\n\n    Eval:\n      RAW-only\n\n\nControls:\n\n\n    RAW baseline\n    DOMAIN + text -> RAW cooldown\n    BRYLA + text -> RAW cooldown\n    RANDOM-BRYLA + text -> RAW cooldown\n\n\nInterpretation:\n\nResult | Meaning\n---|---\nBryła cooldown > RAW | Bryła may work as a training scaffold\nBryła cooldown ≈ RAW | no retained scaffold effect\nDOMAIN cooldown ≈ Bryła cooldown | domain metadata may explain much of the gain\nrandom-prefix cooldown helps | possible curriculum/regularization effect\nBryła requires Bryła at inference | useful, but deployment depends on parser\n\nIf cooldown works, the story becomes stronger:\n\n\n    Bryła is not only an inference-time representation.\n    It may be a training scaffold for small models.\n\n\n## 8. Test serialization format\n\nCurrent Bryła looks like a compact symbolic representation. That may be best for tiny models, but it should be tested.\n\nStructured-representation work suggests that code-like formats may be less model-friendly than natural-language descriptions in some settings.\n\nUseful resource:\n\n  * SR-LLM — Rethinking the Structured Representation in Large Language Models\n  * SR-LLM arXiv\n\n\n\nI would test:\n\n\n    BRYLA-symbolic\n    BRYLA-verbalized\n    BRYLA-hybrid\n    BRYLA-no-defaults\n\n\nExample:\n\n\n    Symbolic:\n    [TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [CORE:yes]\n\n    Verbalized:\n    This is a neutral factual statement with general scope. The intent is to inform. The main content is central.\n\n    Hybrid:\n    [type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]\n\n\nAlso test field order, because sequence order can matter a lot for structured inputs.\n\nUseful resource:\n\n  * Linearization Order Matters for AMR-to-Text Generation Input\n\n\n\n## 9. Polish datasets and resources\n\nFor natural Polish QA / MRC testing, I would look at these.\n\nResource | Use\n---|---\nPolQA | Polish OpenQA; useful for question/answer type analysis and evidence passages\nPolQA dataset | practical HF dataset\nPoQuAD | Polish SQuAD-like QA, includes impossible questions and generative answer layer\nPoQuAD paper | dataset background\nPolEval 2024 QA task | Polish reading-comprehension evaluation style\nPolEval 2024 QA GitHub | task data/code\nPUGG | Polish KBQA/MRC/IR construction pipeline\nPUGG GitHub | implementation\nPUGG dataset | HF dataset\n\nI would not mix all of these into one training soup immediately.\n\nBetter:\n\n\n    small clean natural benchmark\n    + controlled ablations\n    + separate larger-data experiments later\n\n\n## 10. Suggested next reporting table\n\nA compact table like this would be very clear:\n\nSetup | Data | Control type | Clean PPL | Task metric | Tokens | Wins/seeds | Comment\n---|---|---|---|---|---|---|---\nRAW | natural | baseline |  |  |  |  |\nDOMAIN | natural | simple metadata |  |  |  |  |\nBRYLA | natural | real structure |  |  |  |  |\nSHUFFLED | natural | broken alignment |  |  |  |  |\nRANDOM | natural | format control |  |  |  |  |\n\nAnd by domain:\n\nDomain | BRYLA > RAW? | BRYLA > DOMAIN? | BRYLA > SHUFFLED? | Parser OTHER% | Note\n---|---|---|---|---|---\ntechnical |  |  |  |  |\ngeography |  |  |  |  |\nbiography |  |  |  |  |\nscience |  |  |  |  |\ndaily life |  |  |  |  |\nsports |  |  |  |  |\n\n## 11. What would make the claim much stronger\n\nThe result would become much harder to dismiss if the next stage shows:\n\n\n    BRYLA > RAW\n    BRYLA > DOMAIN\n    BRYLA > SHUFFLED-BRYLA\n    BRYLA > RANDOM-BRYLA\n\n\non small natural Polish data, with:\n\n\n    clean target-only loss\n    parser coverage reported\n    field entropy reported\n    token cost reported\n    domain-level breakdown\n\n\nThat would support the claim:\n\n\n    Bryła adds useful structure beyond domain conditioning and prefix-format effects.\n\n\n## 12. What would weaken the claim\n\nThese would not kill the project, but they would change the interpretation:\n\nObservation | Interpretation\n---|---\n`DOMAIN ≈ BRYLA` | Bryła may mostly encode domain/topic\n`SHUFFLED ≈ BRYLA` | field-value alignment may not matter\n`RANDOM` helps | prefix format may act as regularization\ngains vanish on natural data | synthetic setup may be too clean\ngains only appear in full PPL | tag-prediction artifact\nparser outputs mostly `[OTHER]` | structure is not reaching the model\nBryła works only in one domain | still useful, but domain-specific\n\n## Short version\n\nThis is good progress.\n\nThe next step is not “make it bigger.”\nThe next step is:\n\n\n    small natural data\n    + DOMAIN control\n    + shuffled-structure control\n    + random-prefix control\n    + clean PPL\n    + parser diagnostics\n\n\nIf Bryła still wins there, the result becomes much stronger.",
  "title": "Custom semantic representation (\"bryła\") beats raw text in 24/27 configs — built solo on an RTX 2060, looking for feedback"
}