{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreifpdhk2amcpr6mxiagfr4ybl5xrg4s3n4t4thklpdn7cszg6nkgnu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlxfx4ag4t32"
  },
  "path": "/t/continuation-bryla-semantic-representation-ablation-masked-loss-results/176048#post_2",
  "publishedAt": "2026-05-16T07:28:44.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Sennrich & Haddow 2016 — Linguistic Input Features Improve Neural Machine Translation",
    "arXiv version",
    "Hugging Face docs — Perplexity of fixed-length models",
    "Lee et al. 2022 — Deduplicating Training Data Makes Language Models Better",
    "Google Research summary",
    "Hugging Face / BigCode deduplication post",
    "CTRL paper — A Conditional Transformer Language Model for Controllable Generation",
    "CTRL GitHub repo",
    "AMR survey — Survey of Abstract Meaning Representation: Then, Now, Future",
    "DRS generation — Text Generation from Discourse Representation Structures",
    "Penman library paper — Good tools matter for graph notation",
    "Penman GitHub",
    "PolQA paper — ACL Anthology",
    "PolQA arXiv",
    "PolQA Hugging Face dataset",
    "PoQuAD GitHub",
    "PoQuAD article — ACM",
    "PolEval 2024 Task 1 page",
    "PolEval 2024 QA task GitHub",
    "PUGG paper — ACL Findings 2024",
    "PUGG arXiv",
    "PUGG GitHub",
    "PUGG Hugging Face dataset",
    "SpeakLeash GitHub organization",
    "SpeakLeash Hugging Face organization",
    "SpeakLeash package",
    "Bielik-PL-11B-v3.0-Instruct model card",
    "Hugging Face dataset cards",
    "Datasheets for Datasets",
    "Data Statements for NLP",
    "Model Cards for Model Reporting",
    "SR-LLM — ACL Anthology",
    "SR-LLM — arXiv",
    "Linearization Order Matters for AMR-to-Text Generation Input",
    "MeCo — Metadata Conditioning Accelerates Language Model Pre-training",
    "MeCo OpenReview"
  ],
  "textContent": "Hi! For now:\n\n* * *\n\n# Answer / suggestions for the current Bryła results\n\nThis update is useful because it moves the project from:\n\n\n    Maybe Bryła helps small models.\n\n\nto the much better research question:\n\n\n    Which parts of Bryła carry useful signal, which parts are noise, and how do we measure the effect without leakage or metric artifacts?\n\n\nThat is already a more serious position.\n\nThe most important result is not simply “MID improved PPL by about 3%.” The stronger result is:\n\n\n    compact useful structure > large default-heavy structure\n\n\nThat is exactly what I would expect if Bryła is acting as an inductive bias rather than as magic. A small set of informative fields can help; a large bundle of weak/default fields can add noise, token cost, and evaluation artifacts.\n\nMy short recommendation:\n\n> Do not expand `FULL` right now.\n>  Freeze `MID`, add stronger controls, use clean target-only loss, measure parser coverage, and build a small balanced diagnostic corpus before adding many new parser rules.\n\n* * *\n\n## 1. How I read the ablation result\n\nYou tested:\n\n\n    RAW\n\n    MIN:\n      type\n      polarity\n      sep\n\n    MID:\n      type\n      polarity\n      scope\n      intent\n      intensity\n      core\n      sep\n\n    FULL:\n      all 20 fields\n\n\nResult:\n\n\n    MID won 2/3 seeds.\n    MID was about +3% vs RAW.\n    FULL did not beat MID.\n    FULL lost by about 9 PPL vs MID.\n\n\nThe safest interpretation:\n\n> Bryła has a useful compact region. The 20-field version is currently too noisy, too default-heavy, or too expensive.\n\nThis is actually a good sign. If `FULL` had automatically won just because it had more tags, I would be more suspicious of leakage or metric artifacts. The fact that `MID > FULL` suggests that the useful signal is concentrated in a smaller subset.\n\nThis connects well to older work on adding explicit linguistic input features. 
Sennrich and Haddow showed that neural MT can benefit from additional input features such as morphology, POS tags, and dependency labels, improving perplexity and translation metrics. The relevant lesson for Bryła is not “add every possible annotation,” but “external structure can help when the features are informative and controlled.”\n\nReference:\n\n  * Sennrich & Haddow 2016 — Linguistic Input Features Improve Neural Machine Translation\n  * arXiv version\n\n\n\nSo I would phrase the result like this:\n\n> In a tiny Polish technical QA setting, a compact 7-field Bryła representation gives a small but repeatable gain over raw input, while the 20-field version adds substantial token cost and does not improve over the compact version.\n\nThat is cautious, but credible.\n\n* * *\n\n## 2. Freeze `MID` as `BRYLA-MID-v1`\n\nI would now freeze the current MID schema as:\n\n\n    BRYLA-MID-v1\n\n\nFields:\n\n\n    TYPE\n    POLARITY\n    SCOPE\n    INTENT\n    INTENSITY\n    CORE\n    SEP\n\n\nDo not keep changing this during the next phase. If the schema keeps moving, you cannot tell whether a gain came from Bryła, field order, tokenization, defaults, leakage removal, parser changes, or random seed effects.\n\nFor each field, document:\n\nField | What to define\n---|---\n`TYPE` | allowed values, default, missing value, expected effect\n`POLARITY` | allowed values, whether it means sentiment, stance, or something else\n`SCOPE` | local/general/contextual meaning\n`INTENT` | ask/inform/warn/instruct/etc.\n`INTENSITY` | low/mid/high or another fixed scale\n`CORE` | whether this marks central content/salience\n`SEP` | fixed boundary marker\n\nI would keep:\n\n\n    RAW = baseline\n    MIN = cheap lower bound\n    MID = main representation\n    FULL = diagnostic / stress test only\n\n\nFor now, `FULL` should not be your main story.\n\n* * *\n\n## 3. 
`val_ppl_clean` is essential, not optional\n\nYour split into:\n\n\n    val_ppl_std   = all tokens\n    val_ppl_clean = only Polish text after [SEP_BRYLA]\n    val_ppl_tags  = only Bryła tags\n\n\nis one of the most important improvements in the project.\n\nThe issue is simple:\n\n\n    Full-sequence PPL answers:\n    Can the model predict tags + text?\n\n    Clean PPL answers:\n    Does the prefix help predict the Polish target text?\n\n\nThose are different questions.\n\nYour example is very clear:\n\n\n    FULL Bryła on Wikipedia:\n\n    val_ppl_std   = 2.03\n    val_ppl_clean = 3.10\n\n\nThat means the standard metric was partially rewarding the model for predicting deterministic, low-entropy schema tokens.\n\nFor Bryła experiments, the primary metric should be:\n\n\n    val_ppl_clean\n\n\nnot:\n\n\n    val_ppl_std\n\n\nThis fits the general warning around perplexity: the value depends on exactly which tokens are included in the likelihood calculation and how evaluation context is handled.\n\nReference:\n\n  * Hugging Face docs — Perplexity of fixed-length models\n\n\n\nFor decoder-only Bryła-prefix training, masked loss should become the default:\n\n\n    input:\n      [BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]\n\n    labels:\n      [-100 ... -100] [-100]      [POLISH TEXT LABELS]\n\n\nConceptually:\n\n\n    Bryła = context\n    Polish text = prediction target\n\n\nReport all three metrics, but treat them differently:\n\nMetric | Role\n---|---\n`val_ppl_clean` | primary result\n`val_ppl_std` | diagnostic only\n`val_ppl_tags` | diagnostic only\n`chrF` | useful Polish-friendly generation metric\ntoken F1 / EM | useful QA metric if answers are short\nsmall blind human eval | final sanity check\n\n* * *\n\n## 4. The leakage failures are valuable findings\n\nYou found three leakage paths:\n\n\n    1. surface_text duplicated inside Bryła and after [SEP_BRYLA]\n    2. [FACTS] block included previous Polish text\n    3. 
anchors contained 80-character surface_text snippets\n\n\nThis is not embarrassing. This is exactly what makes the experiment more credible now. Structured-prefix methods are very vulnerable to accidental copy shortcuts.\n\nAdd a permanent leakage-check script.\n\nMinimum checks:\n\nCheck | Why\n---|---\ntarget text appears in prefix | direct copying\nlong n-grams shared between prefix and target | partial copying\nanchors longer than a threshold | hidden text leakage\n`[FACTS]` contains answer-bearing text | retrieval-style leakage\nsame source document in train/dev | source leakage\nnear-duplicate pages across splits | template leakage\nsame generated/paraphrased seed across splits | synthetic leakage\n\nThis is especially important for Wikipedia-like data. If there are thousands of near-identical village descriptions, random row splitting is dangerous.\n\nUse:\n\n\n    split_group = source_article_id\n\n\nor:\n\n\n    split_group = template_cluster_id\n\n\nThen split by group, not by row.\n\nUseful references:\n\n  * Lee et al. 2022 — Deduplicating Training Data Makes Language Models Better\n  * arXiv version\n  * Google Research summary\n  * Hugging Face / BigCode deduplication post\n\n\n\n* * *\n\n## 5. Token economy basically rejects `FULL` for now\n\nYour token table is one of the clearest results:\n\nVariant | Tokens vs RAW | Training | Inference\n---|---|---|---\nRAW | 1.0x | 4 min | 5.7 ms/tok\nMIN | 1.81x | 8 min | 5.9 ms/tok\nMID | 2.82x | 12 min | 5.8 ms/tok\nFULL | 6.06x | 30 min | 5.8 ms/tok\n\n`FULL` costs about 6x the source tokens and gives no clear win over `MID`. For weak hardware, token count is budget. 
Every extra prefix token costs memory, training time, attention work, context length, and overfitting risk.\n\nThe design target should become:\n\n\n    maximum useful information per token\n\n\nnot:\n\n\n    maximum number of semantic/pragmatic fields\n\n\nAdd a reporting table like this:\n\nVariant | Clean PPL | Δ vs RAW | Tokens vs RAW | Practical verdict\n---|---|---|---|---\nRAW | `<value>` | — | 1.00x | baseline\nMIN | `<value>` | `<value>` | 1.81x | cheap but maybe under-informative\nMID | `<value>` | `<value>` | 2.82x | current best tradeoff\nFULL | `<value>` | `<value>` | 6.06x | too expensive / noisy\n\nThe current engineering conclusion:\n\n> `MID` is the best current quality/cost point. `FULL` should be paused.\n\n* * *\n\n## 6. The masked-loss result is not a failure\n\nYou got:\n\n\n    val_ppl_clean almost identical:\n    3.10 vs 3.18\n\n\nNumerically neutral.\n\nBut generation changed when you manually changed Bryła polarity:\n\n\n    [OTHER] [POL:neutral]  -> geographic / astronomical content\n    [OTHER] [POL:positive] -> villages / places\n    [OTHER] [POL:negative] -> sports / competition\n\n\nThis proves one thing:\n\n\n    The model is reading the prefix.\n\n\nIt does not yet prove:\n\n\n    The model understands polarity semantically.\n\n\nIt may mean:\n\n\n    POLARITY has become a hidden domain/topic label.\n\n\nThis is exactly the kind of thing known from control-code language modeling. CTRL trained a conditional Transformer on control codes that specify domain, subdomain, entities, relationships, dates, and task behavior. Such control codes steer generation, but the model learns corpus correlations, not human definitions of the labels.\n\nReferences:\n\n  * CTRL paper — A Conditional Transformer Language Model for Controllable Generation\n  * CTRL GitHub repo\n\n\n\nSo I would phrase your masked-loss result like this:\n\n> Masked-loss training confirms that Bryła tokens can act as conditioning signals. 
However, the observed polarity effect may be entangled with domain/topic correlations, so the next step is to compare Bryła against explicit `DOMAIN` controls and counterfactual prompts where topic is held constant.\n\nThat is precise and defensible.\n\n* * *\n\n## 7. Add `DOMAIN` as the mandatory next control\n\nThis is the most important next control.\n\nYour prefix may be helping because it encodes:\n\n\n    technical\n    geography\n    sports\n    biography\n    science\n    daily_life\n\n\nrather than because it encodes deeper semantic-pragmatic structure.\n\nRun this ladder:\n\n\n    RAW\n    DOMAIN + RAW\n    MID + RAW\n    DOMAIN + MID + RAW\n    MID shuffled values + RAW\n    MID shuffled field order + RAW\n    random tags same distribution + RAW\n\n\nInterpretation:\n\nResult | Meaning\n---|---\n`MID > DOMAIN > RAW` | strong: Bryła adds information beyond domain\n`MID ≈ DOMAIN > RAW` | Bryła is currently mostly domain conditioning\n`DOMAIN > MID` | current Bryła fields are noisy or parser is weak\n`DOMAIN + MID > both` | domain and Bryła are complementary\nshuffled values ≈ MID | field-value semantics are weak\nrandom tags help | formatting/regularization artifact\nshuffled order hurts badly | serialization order is part of the method\n\nThe decisive comparison is:\n\n\n    MID + RAW\n    vs\n    DOMAIN + RAW\n\n\nIf `MID` beats `DOMAIN`, Bryła has a stronger claim.\nIf `DOMAIN` matches `MID`, the current story becomes simpler: metadata/domain conditioning helps, but semantic-pragmatic structure is not yet proven.\n\nThis is still useful; it just changes the claim.\n\n* * *\n\n## 8. Do not blindly expand parser rules\n\nYou saw:\n\n\n    87% of Wikipedia sentences -> [OTHER]\n\n\nThat is a parser coverage problem.\n\nBut I would not immediately add many domain-specific rules. 
That risks building:\n\n\n    a parser for Wikipedia village templates\n\n\ninstead of:\n\n\n    a general Polish semantic-pragmatic parser\n\n\nSemantic-representation systems usually have this problem. AMR and DRS work both show that representation quality and parser quality are part of the system, not preprocessing details.\n\nUseful references:\n\n  * AMR survey — Survey of Abstract Meaning Representation: Then, Now, Future\n  * DRS generation — Text Generation from Discourse Representation Structures\n  * Penman library paper — Good tools matter for graph notation\n  * Penman GitHub\n\n\n\nThe next step should be:\n\n\n    measure parser behavior first\n\n\nnot:\n\n\n    add rules until [OTHER] decreases\n\n\n* * *\n\n## 9. Build a parser dashboard\n\nFor each domain:\n\nDomain | Parsed % | Partial % | OTHER % | Main failure type\n---|---|---|---|---\ntechnical/welding | `<value>` | `<value>` | `<value>` | `<note>`\ngeography | `<value>` | `<value>` | `<value>` | `<note>`\nbiography | `<value>` | `<value>` | `<value>` | `<note>`\ndaily life | `<value>` | `<value>` | `<value>` | `<note>`\nscience | `<value>` | `<value>` | `<value>` | `<note>`\nsports/events | `<value>` | `<value>` | `<value>` | `<note>`\n\nFor each field:\n\nField | Default rate | Entropy | Missing rate | Top values | Domain correlation\n---|---|---|---|---|---\n`TYPE` | `<value>` | `<value>` | `<value>` | `<values>` | `<value>`\n`POLARITY` | `<value>` | `<value>` | `<value>` | `<values>` | `<value>`\n`SCOPE` | `<value>` | `<value>` | `<value>` | `<values>` | `<value>`\n`INTENT` | `<value>` | `<value>` | `<value>` | `<values>` | `<value>`\n`INTENSITY` | `<value>` | `<value>` | `<value>` | `<values>` | `<value>`\n`CORE` | `<value>` | `<value>` | `<value>` | `<values>` | `<value>`\n\nThis tells you:\n\n  * which fields are dead/default-heavy;\n  * which fields actually vary;\n  * which fields are domain proxies;\n  * which domains the parser cannot handle;\n  * whether `[OTHER]` hides several 
different failure modes.\n\n\n\nA field with 95–99% default rate probably does not deserve tokens.\n\n* * *\n\n## 10. Replace one broad `[OTHER]` with typed unknowns\n\n`[OTHER]` is too lossy: it collapses every parse failure into one token and discards whatever the parser did recover.\n\nInstead of:\n\n\n    [OTHER]\n\n\ntry:\n\n\n    [TYPE:unknown] [DOMAIN:geography]\n    [TYPE:unknown] [DOMAIN:biography]\n    [TYPE:unknown] [DOMAIN:sports]\n    [TYPE:unknown] [DOMAIN:technical]\n\n\nor:\n\n\n    [PARSE:partial] [DOMAIN:geography] [INTENT:inform]\n\n\nThis separates:\n\n\n    the parser does not know the semantic type\n\n\nfrom:\n\n\n    the system knows nothing at all\n\n\nA partial prefix can still carry useful information.\n\n* * *\n\n## 11. Create a tiny oracle-Bryła set\n\nTake 100–200 examples and manually assign correct `MID` fields.\n\nThen compare:\n\n\n    RAW\n    DOMAIN-only\n    parser-MID\n    manual/oracle-MID\n\n\nInterpretation:\n\nResult | Meaning\n---|---\noracle-MID helps, parser-MID does not | parser bottleneck\nparser-MID ≈ oracle-MID | parser is good enough\nneither helps | representation/model/task issue\nDOMAIN ≈ oracle-MID | Bryła mostly encodes domain\noracle-MID > DOMAIN | semantic-pragmatic fields add real signal\n\nThis is one of the cleanest possible experiments because it separates:\n\n\n    representation quality\n\n\nfrom:\n\n\n    parser quality\n\n\nEven 100 examples can be enough to identify the bottleneck.\n\n* * *\n\n## 12. Build the smaller balanced corpus first\n\nBetween:\n\n\n    1. Extend the parser with more domain-specific rules.\n    2. 
Build a smaller balanced multi-domain corpus.\n\n\nI would choose:\n\n> Build the smaller balanced corpus first, then extend the parser based on measured failures.\n\nStart with a diagnostic set:\n\n\n    6 domains × 50 examples = 300 examples\n\n\nSuggested domains:\n\nDomain | Why include it\n---|---\nwelding/materials/technical | original strongest domain\ngeography/places | tests template-heavy factual text\nbiographies | tests people, roles, dates, events\ndaily life/practical advice | tests intent, urgency, pragmatic cues\nscience explanations | tests definitions and causality\nsports/events | tests competitions, events, temporal facts\n\nThen scale later:\n\n\n    6 domains × 200 examples = 1,200 examples\n\n\nor:\n\n\n    6 domains × 500 examples = 3,000 examples\n\n\nDo not start with another huge uncontrolled corpus. If the parser fails on 300 balanced examples, it will also fail on 3,000.\n\n* * *\n\n## 13. Polish QA/data resources worth using\n\nUse Polish datasets by role, not as one mixed pool.\n\n### PolQA\n\nPolQA is one of the strongest Polish QA references. It contains 7,000 questions, 87,525 manually labeled evidence passages, and over 7 million candidate passages. It also classifies questions by formulation, question type, and answer entity type.\n\nLinks:\n\n  * PolQA paper — ACL Anthology\n  * PolQA arXiv\n  * PolQA Hugging Face dataset\n\n\n\nUse it for:\n\n  * question-type analysis;\n  * answer-type analysis;\n  * evidence-aware QA;\n  * retrieval + abstractive reader experiments;\n  * annotation-design inspiration.\n\n\n\nBe careful: OpenQA adds retrieval as another variable. For mechanism tests, use a controlled subset.\n\n### PoQuAD\n\nPoQuAD is a Polish QA dataset modeled on SQuAD 2.0. 
It includes impossible questions and a generative answer layer.\n\nLinks:\n\n  * PoQuAD GitHub\n  * PoQuAD article — ACM\n\n\n\nUse it for:\n\n  * passage-grounded QA;\n  * impossible/answerability cases;\n  * testing `SCOPE`, `SOURCE`, `CERTAINTY`, `CORE`, `INTENT`;\n  * generation metrics beyond PPL.\n\n\n\n### PolEval 2024 QA / Reading Comprehension\n\nPolEval 2024 Task 1 gives systems a question with a paired passage; some questions are impossible.\n\nLinks:\n\n  * PolEval 2024 Task 1 page\n  * PolEval 2024 QA task GitHub\n\n\n\nUse it for:\n\n  * Polish QA evaluation protocol;\n  * answerability scoring;\n  * passage-grounded experiments;\n  * moving beyond PPL.\n\n\n\n### PUGG\n\nPUGG is especially relevant because it is not only a dataset, but a semi-automated construction methodology for Polish KBQA, MRC, and IR.\n\nLinks:\n\n  * PUGG paper — ACL Findings 2024\n  * PUGG arXiv\n  * PUGG GitHub\n  * PUGG Hugging Face dataset\n\n\n\nUse it for:\n\n  * community-driven construction ideas;\n  * semi-automated Polish QA/MRC/IR design;\n  * baseline reporting style;\n  * low-resource dataset-building patterns.\n\n\n\n### SpeakLeash / Polish LLM ecosystem\n\nLinks:\n\n  * SpeakLeash GitHub organization\n  * SpeakLeash Hugging Face organization\n  * SpeakLeash package\n  * Bielik-PL-11B-v3.0-Instruct model card\n\n\n\nUse this ecosystem for:\n\n  * Polish data discovery;\n  * community contacts;\n  * documentation examples;\n  * possible weak teacher/evaluator models, with caution.\n\n\n\nDo not frame Bryła as competing with large Polish LLMs. Frame it as:\n\n\n    explicit structure for very small Polish models under weak-hardware / low-data constraints\n\n\n* * *\n\n## 14. 
Dataset strategy\n\nDo not mix all data into one pool.\n\nRole | Good sources | Purpose\n---|---|---\nclean controlled benchmark | your own balanced set, PoQuAD subset, PolEval subset | mechanism isolation\nevidence/OpenQA experiments | PolQA | retrieval + answer generation\nconstruction methodology | PUGG | semi-automated dataset building\nweak training / stress testing | larger Polish corpora | pretraining or parser stress\nfinal claim | small clean human-verified test | credible result\n\nAvoid:\n\n\n    PolQA + PoQuAD + Wikipedia + generated data -> one mixed pool -> one aggregate PPL\n\n\nPrefer:\n\n\n    small clean benchmark\n    + clear controls\n    + separate weak-data experiments\n\n\n* * *\n\n## 15. Community-driven small dataset construction\n\nA useful first dataset could be:\n\n\n    Bryła-MiniPL-QA v0.1\n\n\nStart with:\n\n\n    300 diagnostic examples\n\n\nThen:\n\n\n    1,200 benchmark examples = 6 domains × 200\n\n\nThen, only if the signal is real:\n\n\n    3,000 examples = 6 domains × 500\n\n\nSuggested schema:\n\n\n    id: geo_000123\n    domain: geography\n    source_type: manual | wikipedia | public_domain | synthetic_seeded\n    license: CC-BY-SA | CC0 | own | other\n    question: \"...\"\n    context: \"...\"\n    answer: \"...\"\n    answer_type: entity | date | number | yes_no | definition | procedure | explanation | list | unanswerable\n    is_answerable: true\n    bryla_mid: \"...\"\n    parser_status: parsed | partial | other | failed | oracle\n    parser_version: parser_v0.3\n    schema_version: bryla_mid_v1\n    split_group: source_article_or_template_id\n    split: train | dev | test\n    notes: \"optional\"\n\n\nMost important fields:\n\n\n    domain\n    answer_type\n    parser_status\n    split_group\n    schema_version\n    parser_version\n\n\nCommunity workflow:\n\n\n    1. Contributor writes question/context/answer.\n    2. Script runs parser and creates Bryła MID.\n    3. Reviewer checks answer correctness.\n    4. 
Bryła reviewer checks fields on a subset.\n    5. Maintainer runs leakage checks and split generation.\n\n\nKeep volunteer tasks small. Do not require every contributor to understand the whole parser.\n\nReview policy:\n\n\n    100% single review\n    10–20% double review\n    all disagreements saved\n\n\nDisagreements are useful because they reveal ambiguous schema definitions.\n\nDocumentation references:\n\n  * Hugging Face dataset cards\n  * Datasheets for Datasets\n  * Data Statements for NLP\n  * Model Cards for Model Reporting\n\n\n\n* * *\n\n## 16. Experiments I would run next\n\n### Experiment A: control ladder\n\nThis is the most important next experiment.\n\n\n    RAW\n    DOMAIN + RAW\n    MID + RAW\n    DOMAIN + MID + RAW\n    MID shuffled values + RAW\n    MID shuffled order + RAW\n    random tags same frequency + RAW\n\n\nUse:\n\n\n    masked loss\n    val_ppl_clean\n    chrF / F1 if possible\n    tokens vs RAW\n    same seeds\n    same split\n    same tokenizer\n\n\nMain question:\n\n\n    Does MID actually beat simple domain conditioning?\n\n\n* * *\n\n### Experiment B: field survival tournament\n\nStart from MID.\n\nLeave-one-out:\n\n\n    MID\n    MID - TYPE\n    MID - POLARITY\n    MID - SCOPE\n    MID - INTENT\n    MID - INTENSITY\n    MID - CORE\n\n\nSingle-field versions:\n\n\n    TYPE only\n    POLARITY only\n    SCOPE only\n    INTENT only\n    INTENSITY only\n    CORE only\n    DOMAIN only\n\n\nInterpretation:\n\nPattern | Meaning\n---|---\nfield helps alone and hurts when removed | strong useful field\nfield helps alone but not in MID | redundant\nfield only helps with another field | interaction\nfield does nothing | remove\nfield only helps without `DOMAIN` | likely domain proxy\n\nThis is more informative than only `MIN/MID/FULL`.\n\n* * *\n\n### Experiment C: serialization variants\n\nTest the same information in different formats.\n\n\n    MID-symbolic\n    MID-verbalized\n    MID-hybrid\n    MID-no-defaults\n    
MID-shuffled-order\n\n\nExamples:\n\n\n    Symbolic:\n    [TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [INTENSITY:low] [CORE:yes]\n\n    Verbalized:\n    This is a neutral factual statement with general scope. The intent is to inform. The main content is central.\n\n    Hybrid:\n    [type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]\n\n\nWhy: structured representation format matters. SR-LLM argues that code-like structured representations can be less effective than natural-language descriptions, depending on model and setting.\n\nReferences:\n\n  * SR-LLM — ACL Anthology\n  * SR-LLM — arXiv\n  * Linearization Order Matters for AMR-to-Text Generation Input\n\n\n\n* * *\n\n### Experiment D: cooldown\n\nThis is one of the most interesting directions.\n\nMeCo trains with metadata prefixes, then uses a cooldown phase on standard text so the model can function without metadata at inference time.\n\nReferences:\n\n  * MeCo — Metadata Conditioning Accelerates Language Model Pre-training\n  * MeCo OpenReview\n\n\n\nFor Bryła, test:\n\n\n    RAW baseline\n\n    MID + text\n    eval: MID + text\n\n    MID + text for 80–90% of training\n    RAW text only for final 10–20%\n    eval: RAW text\n\n    DOMAIN + text for 80–90%\n    RAW text only for final 10–20%\n    eval: RAW text\n\n    random MID + text for 80–90%\n    RAW text only for final 10–20%\n    eval: RAW text\n\n\nMain question:\n\n\n    Is Bryła an inference-time dependency or a training scaffold?\n\n\nIf cooldown preserves some gain, that is a much stronger story.\n\n* * *\n\n### Experiment E: counterfactual prefix tests\n\nFormalize your mini-chat test.\n\nCreate 20–50 fixed content prompts. 
For each prompt, vary one field only:\n\n\n    same topic + different POLARITY\n    same topic + different INTENT\n    same topic + different INTENSITY\n    same topic + different CORE\n    same topic + different SCOPE\n\n\nExample topic:\n\n\n    gas cylinder leak during welding\n\n\nVariants:\n\n\n    [INTENT:inform]\n    [INTENT:warn]\n    [INTENT:instruct]\n\n\nManual scoring:\n\nCriterion | 0 | 1 | 2\n---|---|---|---\ntopic preserved | no | partly | yes\nintended control effect | no | partly | yes\nfactual consistency | no | partly | yes\nno domain drift | no | partly | yes\nanswer usefulness | no | partly | yes\n\nThis separates:\n\n\n    prefix changes output distribution\n\n\nfrom:\n\n\n    prefix controls the intended property\n\n\nThose are not the same thing.\n\n* * *\n\n## 17. What would convince me Bryła is doing something useful?\n\nA convincing pattern would be:\n\nTest | Desired result\n---|---\n`MID > RAW` | yes\n`MID > DOMAIN` | yes\n`MID > shuffled values` | yes\n`MID > random tags` | yes\nclean PPL improves | yes\nimprovement is not only full-sequence PPL | yes\nat least one task metric improves | yes\nparser coverage is reported | yes\nleakage checks pass | yes\ngroup splits are used | yes\nuseful fields are identified by ablation | yes\ncounterfactual tests preserve topic | yes\ncooldown preserves some gain | very strong bonus\n\nThe first four are especially important:\n\n\n    MID > RAW\n    MID > DOMAIN\n    MID > shuffled MID\n    MID > random tags\n\n\nThat would make the result much harder to dismiss.\n\n* * *\n\n## 18. 
What would make me skeptical?\n\nOutcome | Why it is a problem\n---|---\n`DOMAIN ≈ MID` | Bryła may mostly encode domain\nshuffled values ≈ real MID | field meanings may not matter\nrandom tags help | formatting/regularization artifact\nonly `val_ppl_std` improves | tag-prediction artifact\n`val_ppl_clean` does not improve | no target-text gain\none field changes topic instead of style | control is not semantic\nparser mostly outputs `[OTHER]` | model receives little structure\nseed std is extremely tiny on template data | near-duplicate/template issue\nrandom row split on Wikipedia | contamination risk\nFULL wins only when tags are included in loss | metric artifact\n\nThese are not reasons to stop. They are diagnostics.\n\n* * *\n\n## 19. Recommended 4-week plan\n\n### Week 1 — freeze and instrument\n\nDeliverables:\n\n\n    BRYLA-MID-v1 frozen\n    masked loss implemented\n    val_ppl_clean / val_ppl_std / val_ppl_tags reported\n    parser dashboard created\n    leakage checks scripted\n    DOMAIN prefix added\n\n\nDo not run many big trainings yet.\n\n### Week 2 — run the control ladder\n\nRun:\n\n\n    RAW\n    DOMAIN + RAW\n    MID + RAW\n    DOMAIN + MID + RAW\n    MID shuffled values + RAW\n    MID shuffled order + RAW\n    random tags same frequency + RAW\n\n\nMinimum:\n\n\n    3 seeds\n\n\nBetter:\n\n\n    5 seeds\n\n\nReport:\n\n\n    clean PPL\n    std PPL\n    tag PPL\n    tokens vs RAW\n    train time\n    inference time\n    win count\n\n\n### Week 3 — build 300-example diagnostic set\n\nCreate:\n\n\n    6 domains × 50 examples\n\n\nDomains:\n\n\n    technical\n    geography\n    biography\n    daily life\n    science\n    sports/events\n\n\nFor each example:\n\n\n    question\n    context\n    answer\n    domain\n    answer_type\n    parser_status\n    bryla_mid\n    split_group\n\n\nRun parser diagnostics first. 
Do not scale yet.\n\n### Week 4 — oracle Bryła + counterfactual probes\n\nCreate:\n\n\n    100–200 manually corrected MID examples\n\n\nCompare:\n\n\n    RAW\n    DOMAIN\n    parser-MID\n    oracle-MID\n\n\nAlso create:\n\n\n    20–50 counterfactual prefix probes\n\n\nThis will tell you whether the next bottleneck is parser quality or representation design.\n\n* * *\n\n## 20. Best public framing\n\nI would write the current state like this:\n\n> I found that the compact MID schema is a better tradeoff than the full 20-field schema: it gives a small but repeatable improvement in the technical QA setting, while FULL adds many mostly-default fields and a large token cost. I also found that full-sequence perplexity is misleading for prefix-tag experiments, so I now report target-only clean PPL after the separator. Masked-loss training shows that the model does read Bryła prefixes as conditioning information, but the observed polarity effect may be entangled with domain/topic correlations. The next step is to test MID against DOMAIN-only, shuffled-field, and random-tag controls under clean masked loss, and to build a small balanced multi-domain Polish QA set to measure parser coverage outside the original technical domain.\n\nAvoid saying:\n\n\n    Bryła proves semantic understanding.\n    Bryła replaces raw text.\n    Bryła scales generally.\n    FULL Bryła is better.\n    Polarity controls semantics.\n\n\nUse:\n\n\n    Bryła conditions generation.\n    MID is the current best tradeoff.\n    Clean PPL is required.\n    Parser coverage is the bottleneck.\n    Domain controls are necessary.\n    Cooldown is the next high-value test.\n\n\n* * *\n\n## 21. Direct answer to the two options\n\nBetween:\n\n\n    1. Extend the parser with more domain-specific rules.\n    2. Build a smaller balanced multi-domain corpus.\n\n\nI would choose:\n\n> Build the smaller balanced corpus first. 
Then extend the parser only where that corpus shows failures.\n\nReason:\n\n  * rule expansion without a balanced diagnostic set can overfit the parser to one corpus;\n  * the current parser failure is a coverage problem, but you need coverage by domain/type;\n  * a balanced dataset separates semantic usefulness from domain/template effects;\n  * a small clean dataset is more useful than a large noisy one at this stage.\n\n\n\nBest immediate target:\n\n\n    300 examples for diagnostics\n\n\nthen:\n\n\n    1,200 examples for real experiments\n\n\nnot another large uncontrolled Wikipedia run.\n\n* * *\n\n## Short summary\n\n  * The update is good because the failures make the result more credible.\n  * `MID > FULL` is important: compact informative fields beat default-heavy annotation.\n  * `val_ppl_clean` should be the primary metric from now on.\n  * Masked loss is the right objective for Bryła-as-context.\n  * The polarity generation result proves conditioning, but may also reveal domain leakage.\n  * Add `DOMAIN` as a mandatory control.\n  * Test `MID` against `DOMAIN`, shuffled MID, and random tags.\n  * Do not expand `FULL` now.\n  * Do not blindly add parser rules.\n  * Build a small balanced multi-domain diagnostic corpus first.\n  * Use PolQA, PoQuAD, PolEval, and PUGG as references/resources.\n  * Add parser dashboards, leakage checks, group splits, oracle-Bryła examples, and counterfactual prefix tests.\n  * The strongest next claim would be:\n\n\n\n\n    compact Bryła helps beyond domain conditioning under clean target-only loss.\n",
  "title": "[Continuation] bryła semantic representation: ablation + masked loss results"
}