[Continuation] bryła semantic representation: ablation + masked loss results
Hi! For now:
Answer / suggestions for the current Bryła results
This update is useful because it moves the project from:
Maybe Bryła helps small models.
to the much better research question:
Which parts of Bryła carry useful signal, which parts are noise, and how do we measure the effect without leakage or metric artifacts?
That is already a more serious position.
The most important result is not simply “MID improved PPL by about 3%.” The stronger result is:
compact useful structure > large default-heavy structure
That is exactly what I would expect if Bryła is acting as an inductive bias rather than as magic. A small set of informative fields can help; a large bundle of weak/default fields can add noise, token cost, and evaluation artifacts.
My short recommendation:
Do not expand FULL right now. Freeze MID, add stronger controls, use clean target-only loss, measure parser coverage, and build a small balanced diagnostic corpus before adding many new parser rules.
1. How I read the ablation result
You tested:
RAW
MIN:
type
polarity
sep
MID:
type
polarity
scope
intent
intensity
core
sep
FULL:
all 20 fields
Result:
MID won 2/3 seeds.
MID was about 3% better than RAW in PPL.
FULL did not beat MID.
FULL lost by about 9 PPL vs MID.
The safest interpretation:
Bryła has a useful compact region. The 20-field version is currently too noisy, too default-heavy, or too expensive.
This is actually a good sign. If FULL had automatically won just because it had more tags, I would be more suspicious of leakage or metric artifacts. The fact that MID > FULL suggests that the useful signal is concentrated in a smaller subset.
This connects well to older work on adding explicit linguistic input features. Sennrich and Haddow showed that neural MT can benefit from additional input features such as morphology, POS tags, and dependency labels, improving perplexity and translation metrics. The relevant lesson for Bryła is not “add every possible annotation,” but “external structure can help when the features are informative and controlled.”
Reference:
- Sennrich & Haddow 2016 — Linguistic Input Features Improve Neural Machine Translation
- arXiv version
So I would phrase the result like this:
In a tiny Polish technical QA setting, a compact 7-field Bryła representation gives a small but repeatable gain over raw input, while the 20-field version adds substantial token cost and does not improve over the compact version.
That is cautious, but credible.
2. Freeze MID as BRYLA-MID-v1
I would now freeze the current MID schema as:
BRYLA-MID-v1
Fields:
TYPE
POLARITY
SCOPE
INTENT
INTENSITY
CORE
SEP
Do not keep changing this during the next phase. If the schema keeps moving, you cannot tell whether a gain came from Bryła, field order, tokenization, defaults, leakage removal, parser changes, or random seed effects.
For each field, document:
| Field | What to define |
|---|---|
| TYPE | allowed values, default, missing value, expected effect |
| POLARITY | allowed values, whether it means sentiment, stance, or something else |
| SCOPE | local/general/contextual meaning |
| INTENT | ask/inform/warn/instruct/etc. |
| INTENSITY | low/mid/high or another fixed scale |
| CORE | whether this marks central content/salience |
| SEP | fixed boundary marker |
I would keep:
RAW = baseline
MIN = cheap lower bound
MID = main representation
FULL = diagnostic / stress test only
For now, FULL should not be your main story.
3. val_ppl_clean is essential, not optional
Your split into:
val_ppl_std = all tokens
val_ppl_clean = only Polish text after [SEP_BRYLA]
val_ppl_tags = only Bryła tags
is one of the most important improvements in the project.
The issue is simple:
Full-sequence PPL answers:
Can the model predict tags + text?
Clean PPL answers:
Does the prefix help predict the Polish target text?
Those are different questions.
Your example is very clear:
FULL Bryła on Wikipedia:
val_ppl_std = 2.03
val_ppl_clean = 3.10
That means the standard metric was partially rewarding the model for predicting deterministic, low-entropy schema tokens.
For Bryła experiments, the primary metric should be:
val_ppl_clean
not:
val_ppl_std
This fits the general warning around perplexity: the value depends on exactly which tokens are included in the likelihood calculation and how evaluation context is handled.
Reference:
- Hugging Face docs — Perplexity of fixed-length models
For decoder-only Bryła-prefix training, masked loss should become the default:
input:
[BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]
labels:
[-100 ... -100] [-100] [POLISH TEXT LABELS]
Conceptually:
Bryła = context
Polish text = prediction target
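A minimal sketch of how such labels could be built, assuming token IDs are plain Python lists. The function name `build_masked_labels` and the toy IDs are hypothetical; `sep_id` stands for whatever ID your tokenizer assigns to `[SEP_BRYLA]`. The value `-100` is the default ignore index of PyTorch's cross-entropy loss, which is why it works as a mask.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_masked_labels(input_ids, sep_id):
    """Copy input_ids to labels, masking the prefix and the separator.

    Everything up to and including the first sep_id becomes IGNORE_INDEX,
    so only the Polish target tokens contribute to the loss.
    """
    if sep_id not in input_ids:
        raise ValueError("separator token not found; refusing to train on tags")
    sep_pos = input_ids.index(sep_id)
    return [IGNORE_INDEX] * (sep_pos + 1) + list(input_ids[sep_pos + 1:])


# Toy IDs (hypothetical): 11, 12 = Bryła tags, 99 = [SEP_BRYLA], 501.. = Polish text
labels = build_masked_labels([11, 12, 99, 501, 502, 503], sep_id=99)
print(labels)  # [-100, -100, -100, 501, 502, 503]
```

Raising on a missing separator is a deliberate choice in this sketch: a silently unmasked example would put tag tokens back into the loss.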
Report all three metrics, but treat them differently:
| Metric | Role |
|---|---|
| val_ppl_clean | primary result |
| val_ppl_std | diagnostic only |
| val_ppl_tags | diagnostic only |
| chrF | useful Polish-friendly generation metric |
| token F1 / EM | useful QA metric if answers are short |
| small blind human eval | final sanity check |
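The three perplexities can be derived from one evaluation pass if you keep per-token losses. The sketch below assumes you already have a list of per-token negative log-likelihoods (in nats) and a parallel boolean list marking which tokens are Polish text after `[SEP_BRYLA]`; the function name and the toy numbers are illustrative, not taken from your code.

```python
import math


def split_perplexities(token_nlls, is_target):
    """Return (ppl_std, ppl_clean, ppl_tags) from per-token NLLs.

    token_nlls: negative log-likelihood in nats for each predicted token.
    is_target:  True for Polish text tokens after [SEP_BRYLA], False for tags.
    """
    def ppl(values):
        return math.exp(sum(values) / len(values)) if values else float("nan")

    clean = [n for n, t in zip(token_nlls, is_target) if t]
    tags = [n for n, t in zip(token_nlls, is_target) if not t]
    return ppl(token_nlls), ppl(clean), ppl(tags)


# Toy numbers (hypothetical): three cheap tag tokens, two harder text tokens.
std, clean, tags = split_perplexities(
    [0.1, 0.1, 0.1, 1.2, 1.2],
    [False, False, False, True, True],
)
print(std, clean, tags)  # std falls between tags and clean
```

With easy-to-predict tags in the mix, the full-sequence value is pulled below the clean value, which is the artifact described above.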
4. The leakage failures are valuable findings
You found three leakage paths:
1. surface_text duplicated inside Bryła and after [SEP_BRYLA]
2. [FACTS] block included previous Polish text
3. anchors contained 80-character surface_text snippets
This is not embarrassing. This is exactly what makes the experiment more credible now. Structured-prefix methods are very vulnerable to accidental copy shortcuts.
Add a permanent leakage-check script.
Minimum checks:
| Check | Why |
|---|---|
| target text appears in prefix | direct copying |
| long n-grams shared between prefix and target | partial copying |
| anchors longer than a threshold | hidden text leakage |
| [FACTS] contains answer-bearing text | retrieval-style leakage |
| same source document in train/dev | source leakage |
| near-duplicate pages across splits | template leakage |
| same generated/paraphrased seed across splits | synthetic leakage |
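A rough sketch of the per-example checks, assuming the prefix is a string of bracketed tags such as `[TYPE:fact]` and the target is the Polish text after the separator. The function names, the 5-word n-gram size, and the 40-character tag threshold are all assumptions to tune; the cross-split checks (source document, near-duplicates, synthetic seeds) need corpus-level logic that is not shown here.

```python
import re


def shared_ngrams(prefix, target, n=5):
    """Return word n-grams that occur in both prefix and target."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(prefix) & ngrams(target)


def leakage_flags(prefix, target, n=5, max_tag_chars=40):
    """Flag three copy shortcuts for one example."""
    norm_target = target.strip().lower()
    tag_bodies = re.findall(r"\[([^\]]*)\]", prefix)  # text inside [...]
    return {
        "target_in_prefix": bool(norm_target) and norm_target in prefix.lower(),
        "shared_ngram": bool(shared_ngrams(prefix, target, n)),
        "long_tag": any(len(body) > max_tag_chars for body in tag_bodies),
    }


leaky = leakage_flags(
    "[TYPE:fact] [ANCHOR:wieś w Polsce położona w województwie mazowieckim w powiecie]",
    "Wieś w Polsce położona w województwie mazowieckim",
)
print(leaky)  # all three flags are True for this example
```

Running this over every row before training, and failing the run when any flag is set, keeps the check permanent rather than a one-off audit.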
This is especially important for Wikipedia-like data. If there are thousands of near-identical village descriptions, random row splitting is dangerous.
Use:
split_group = source_article_id
or:
split_group = template_cluster_id
Then split by group, not by row.
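One simple way to split by group is to hash the group ID, so the assignment depends only on `split_group` and never on the row. This is a sketch under assumptions: the function name, the `salt` string, and the 10% dev fraction are placeholders, and it produces only train/dev (a test split would need a second threshold).

```python
import hashlib


def assign_split(split_group, dev_fraction=0.1, salt="bryla-v1"):
    """Deterministically map a group ID to 'train' or 'dev'.

    All rows sharing a split_group land in the same split, because the
    decision is a pure function of the group ID.
    """
    key = f"{salt}:{split_group}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "dev" if bucket < dev_fraction else "train"


rows = [
    {"id": 1, "split_group": "article_17"},
    {"id": 2, "split_group": "article_17"},
    {"id": 3, "split_group": "template_cluster_4"},
]
for row in rows:
    row["split"] = assign_split(row["split_group"])
print(rows[0]["split"] == rows[1]["split"])  # True: same group, same split
```

Because the mapping is deterministic, adding new rows later does not move existing groups between splits.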
Useful references:
- Lee et al. 2022 — Deduplicating Training Data Makes Language Models Better
- arXiv version
- Google Research summary
- Hugging Face / BigCode deduplication post
5. Token economy basically rejects FULL for now
Your token table is one of the clearest results:
| Variant | Tokens vs RAW | Training | Inference |
|---|---|---|---|
| RAW | 1.0x | 4 min | 5.7 ms/tok |
| MIN | 1.81x | 8 min | 5.9 ms/tok |
| MID | 2.82x | 12 min | 5.8 ms/tok |
| FULL | 6.06x | 30 min | 5.8 ms/tok |
FULL costs about 6x the source tokens and gives no clear win over MID. For weak hardware, token count is budget. Every extra prefix token costs memory, training time, attention work, context length, and overfitting risk.
The design target should become:
maximum useful information per token
not:
maximum number of semantic/pragmatic fields
Add a reporting table like this:
| Variant | Clean PPL | Δ vs RAW | Tokens vs RAW | Practical verdict |
|---|---|---|---|---|
| RAW | <value> | — | 1.00x | baseline |
| MIN | <value> | <value> | 1.81x | cheap but maybe under-informative |
| MID | <value> | <value> | 2.82x | current best tradeoff |
| FULL | <value> | <value> | 6.06x | too expensive / noisy |
The current engineering conclusion:
MID is the best current quality/cost point. FULL should be paused.
6. The masked-loss result is not a failure
You got:
val_ppl_clean almost identical:
3.10 vs 3.18
Numerically neutral.
But generation changed when you manually changed Bryła polarity:
[OTHER] [POL:neutral] -> geographic / astronomical content
[OTHER] [POL:positive] -> villages / places
[OTHER] [POL:negative] -> sports / competition
This proves one thing:
The model is reading the prefix.
It does not yet prove:
The model understands polarity semantically.
It may mean:
POLARITY has become a hidden domain/topic label.
This is exactly the kind of thing known from control-code language modeling. CTRL trained a conditional Transformer on control codes that specify domain, subdomain, entities, relationships, dates, and task behavior. Such control codes steer generation, but the model learns corpus correlations, not human definitions of the labels.
References:
- CTRL paper — A Conditional Transformer Language Model for Controllable Generation
- CTRL GitHub repo
So I would phrase your masked-loss result like this:
Masked-loss training confirms that Bryła tokens can act as conditioning signals. However, the observed polarity effect may be entangled with domain/topic correlations, so the next step is to compare Bryła against explicit DOMAIN controls and counterfactual prompts where topic is held constant.
That is precise and defensible.
7. Add DOMAIN as the mandatory next control
This is the most important next control.
Your prefix may be helping because it encodes:
technical
geography
sports
biography
science
daily_life
rather than because it encodes deeper semantic-pragmatic structure.
Run this ladder:
RAW
DOMAIN + RAW
MID + RAW
DOMAIN + MID + RAW
MID shuffled values + RAW
MID shuffled field order + RAW
random tags same distribution + RAW
Interpretation:
| Result | Meaning |
|---|---|
| MID > DOMAIN > RAW | strong: Bryła adds information beyond domain |
| MID ≈ DOMAIN > RAW | Bryła is currently mostly domain conditioning |
| DOMAIN > MID | current Bryła fields are noisy or parser is weak |
| DOMAIN + MID > both | domain and Bryła are complementary |
| shuffled values ≈ MID | field-value semantics are weak |
| random tags help | formatting/regularization artifact |
| shuffled order hurts badly | serialization order is part of the method |
The decisive comparison is:
MID + RAW
vs
DOMAIN + RAW
If MID beats DOMAIN, Bryła has a stronger claim.
If DOMAIN matches MID, the current story becomes simpler: metadata/domain conditioning helps, but semantic-pragmatic structure is not yet proven.
This is still useful; it just changes the claim.
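The two shuffled controls in the ladder could be generated as below. This is a sketch that assumes each prefix is held as a dict of field to value before serialization, and it takes one possible reading of "shuffled values": each field's values are permuted across examples, which breaks the link between text and tags while keeping every field's value frequencies intact. Function names and seeds are hypothetical.

```python
import random


def shuffled_values(prefixes, seed=0):
    """Permute each field's values across examples, keeping per-field frequencies."""
    if not prefixes:
        return []
    rng = random.Random(seed)
    fields = list(prefixes[0].keys())
    columns = {f: [p[f] for p in prefixes] for f in fields}
    for f in fields:
        rng.shuffle(columns[f])  # shuffles a copy, inputs stay untouched
    return [{f: columns[f][i] for f in fields} for i in range(len(prefixes))]


def shuffled_order(prefix, seed=0):
    """Keep field values but permute the order in which fields are serialized."""
    rng = random.Random(seed)
    items = list(prefix.items())
    rng.shuffle(items)
    return dict(items)  # dicts preserve insertion order in Python 3.7+


rows = [
    {"TYPE": "fact", "POLARITY": "neutral"},
    {"TYPE": "question", "POLARITY": "negative"},
    {"TYPE": "fact", "POLARITY": "positive"},
]
print(shuffled_values(rows, seed=1))
print(shuffled_order(rows[0], seed=3))
```

Fixing the seed per control keeps the shuffled variants identical across training seeds, so only the model seed varies between runs.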
8. Do not blindly expand parser rules
You saw:
87% of Wikipedia sentences -> [OTHER]
That is a parser coverage problem.
But I would not immediately add many domain-specific rules. That risks building:
a parser for Wikipedia village templates
instead of:
a general Polish semantic-pragmatic parser
Semantic-representation systems usually have this problem. AMR and DRS work both show that representation quality and parser quality are part of the system, not preprocessing details.
Useful references:
- AMR survey — Survey of Abstract Meaning Representation: Then, Now, Future
- DRS generation — Text Generation from Discourse Representation Structures
- Penman library paper — Good tools matter for graph notation
- Penman GitHub
The next step should be:
measure parser behavior first
not:
add rules until [OTHER] decreases
9. Build a parser dashboard
For each domain:
| Domain | Parsed % | Partial % | OTHER % | Main failure type |
|---|---|---|---|---|
| technical/welding | <value> | <value> | <value> | <note> |
| geography | <value> | <value> | <value> | <note> |
| biography | <value> | <value> | <value> | <note> |
| daily life | <value> | <value> | <value> | <note> |
| science | <value> | <value> | <value> | <note> |
| sports/events | <value> | <value> | <value> | <note> |
For each field:
| Field | Default rate | Entropy | Missing rate | Top values | Domain correlation |
|---|---|---|---|---|---|
| TYPE | <value> | <value> | <value> | <values> | <value> |
| POLARITY | <value> | <value> | <value> | <values> | <value> |
| SCOPE | <value> | <value> | <value> | <values> | <value> |
| INTENT | <value> | <value> | <value> | <values> | <value> |
| INTENSITY | <value> | <value> | <value> | <values> | <value> |
| CORE | <value> | <value> | <value> | <values> | <value> |
This tells you:
- which fields are dead/default-heavy;
- which fields actually vary;
- which fields are domain proxies;
- which domains the parser cannot handle;
- whether [OTHER] hides several different failure modes.
A field with 95–99% default rate probably does not deserve tokens.
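The per-field columns of the dashboard are cheap to compute. A sketch, assuming you can collect one list of parser outputs per field; `field_stats`, the `default` argument, and the use of `None` for missing values are assumptions about your data layout. Domain correlation is left out because it needs the domain labels as a second input.

```python
import math
from collections import Counter


def field_stats(values, default, missing=None):
    """Default rate, missing rate, entropy in bits, and top values for one field."""
    if not values:
        raise ValueError("no values to summarize")
    n = len(values)
    counts = Counter(values)
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return {
        "default_rate": counts[default] / n,
        "missing_rate": counts[missing] / n,
        "entropy_bits": entropy,
        "top_values": counts.most_common(3),
    }


# Toy column (hypothetical): POLARITY is "neutral" for 98 of 100 sentences.
polarity = ["neutral"] * 98 + ["positive", "negative"]
print(field_stats(polarity, default="neutral"))
```

A field that shows a default rate near 1.0 and entropy near 0 bits is a candidate for removal or for the "no defaults" serialization.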
10. Replace one broad [OTHER] with typed unknowns
[OTHER] is too destructive.
Instead of:
[OTHER]
try:
[TYPE:unknown] [DOMAIN:geography]
[TYPE:unknown] [DOMAIN:biography]
[TYPE:unknown] [DOMAIN:sports]
[TYPE:unknown] [DOMAIN:technical]
or:
[PARSE:partial] [DOMAIN:geography] [INTENT:inform]
This separates:
the parser does not know the semantic type
from:
the system knows nothing at all
A partial prefix can still carry useful information.
11. Create a tiny oracle-Bryła set
Take 100–200 examples and manually assign correct MID fields.
Then compare:
RAW
DOMAIN-only
parser-MID
manual/oracle-MID
Interpretation:
| Result | Meaning |
|---|---|
| oracle-MID helps, parser-MID does not | parser bottleneck |
| parser-MID ≈ oracle-MID | parser is good enough |
| neither helps | representation/model/task issue |
| DOMAIN ≈ oracle-MID | Bryła mostly encodes domain |
| oracle-MID > DOMAIN | semantic-pragmatic fields add real signal |
This is one of the cleanest possible experiments because it separates:
representation quality
from:
parser quality
Even 100 examples can be enough to identify the bottleneck.
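Before training anything on the oracle set, the manual labels already give a direct parser measurement: per-field agreement between parser-MID and oracle-MID. A sketch, assuming both are lists of dicts aligned by example; the function name and the toy rows are hypothetical.

```python
def field_agreement(parser_rows, oracle_rows):
    """Per-field fraction of examples where the parser matches the manual label."""
    if len(parser_rows) != len(oracle_rows) or not oracle_rows:
        raise ValueError("need two aligned, non-empty lists")
    fields = list(oracle_rows[0].keys())
    total = len(oracle_rows)
    return {
        f: sum(p.get(f) == o[f] for p, o in zip(parser_rows, oracle_rows)) / total
        for f in fields
    }


parser_rows = [
    {"TYPE": "fact", "INTENT": "inform"},
    {"TYPE": "other", "INTENT": "inform"},
]
oracle_rows = [
    {"TYPE": "fact", "INTENT": "inform"},
    {"TYPE": "fact", "INTENT": "warn"},
]
print(field_agreement(parser_rows, oracle_rows))  # {'TYPE': 0.5, 'INTENT': 0.5}
```

Low agreement on a field that also looks useful in ablations points at a parser bottleneck for that specific field.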
12. Build the smaller balanced corpus first
Between:
1. Extend the parser with more domain-specific rules.
2. Build a smaller balanced multi-domain corpus.
I would choose:
Build the smaller balanced corpus first, then extend the parser based on measured failures.
Start with a diagnostic set:
6 domains × 50 examples = 300 examples
Suggested domains:
| Domain | Why include it |
|---|---|
| welding/materials/technical | original strongest domain |
| geography/places | tests template-heavy factual text |
| biographies | tests people, roles, dates, events |
| daily life/practical advice | tests intent, urgency, pragmatic cues |
| science explanations | tests definitions and causality |
| sports/events | tests competitions, events, temporal facts |
Then scale later:
6 domains × 200 examples = 1,200 examples
or:
6 domains × 500 examples = 3,000 examples
Do not start with another huge uncontrolled corpus. If the parser fails on 300 balanced examples, it will also fail on 3,000.
13. Polish QA/data resources worth using
Use Polish datasets by role, not as one mixed pool.
PolQA
PolQA is one of the strongest Polish QA references. It contains 7,000 questions, 87,525 manually labeled evidence passages, and over 7 million candidate passages. It also classifies questions by formulation, question type, and answer entity type.
Links:
- PolQA paper — ACL Anthology
- PolQA arXiv
- PolQA Hugging Face dataset
Use it for:
- question-type analysis;
- answer-type analysis;
- evidence-aware QA;
- retrieval + abstractive reader experiments;
- annotation-design inspiration.
Be careful: OpenQA adds retrieval as another variable. For mechanism tests, use a controlled subset.
PoQuAD
PoQuAD is a Polish QA dataset modeled on SQuAD 2.0. It includes impossible questions and a generative answer layer.
Links:
- PoQuAD GitHub
- PoQuAD article — ACM
Use it for:
- passage-grounded QA;
- impossible/answerability cases;
- testing SCOPE, SOURCE, CERTAINTY, CORE, INTENT;
- generation metrics beyond PPL.
PolEval 2024 QA / Reading Comprehension
PolEval 2024 Task 1 gives systems a question with a paired passage; some questions are impossible.
Links:
- PolEval 2024 Task 1 page
- PolEval 2024 QA task GitHub
Use it for:
- Polish QA evaluation protocol;
- answerability scoring;
- passage-grounded experiments;
- moving beyond PPL.
PUGG
PUGG is especially relevant because it is not only a dataset, but a semi-automated construction methodology for Polish KBQA, MRC, and IR.
Links:
- PUGG paper — ACL Findings 2024
- PUGG arXiv
- PUGG GitHub
- PUGG Hugging Face dataset
Use it for:
- community-driven construction ideas;
- semi-automated Polish QA/MRC/IR design;
- baseline reporting style;
- low-resource dataset-building patterns.
SpeakLeash / Polish LLM ecosystem
Links:
- SpeakLeash GitHub organization
- SpeakLeash Hugging Face organization
- SpeakLeash package
- Bielik-PL-11B-v3.0-Instruct model card
Use this ecosystem for:
- Polish data discovery;
- community contacts;
- documentation examples;
- possible weak teacher/evaluator models, with caution.
Do not frame Bryła as competing with large Polish LLMs. Frame it as:
explicit structure for very small Polish models under weak-hardware / low-data constraints
14. Dataset strategy
Do not mix all data into one pool.
| Role | Good sources | Purpose |
|---|---|---|
| clean controlled benchmark | your own balanced set, PoQuAD subset, PolEval subset | mechanism isolation |
| evidence/OpenQA experiments | PolQA | retrieval + answer generation |
| construction methodology | PUGG | semi-automated dataset building |
| weak training / stress testing | larger Polish corpora | pretraining or parser stress |
| final claim | small clean human-verified test | credible result |
Avoid:
PolQA + PoQuAD + Wikipedia + generated data -> one mixed pool -> one aggregate PPL
Prefer:
small clean benchmark
+ clear controls
+ separate weak-data experiments
15. Community-driven small dataset construction
A useful first dataset could be:
Bryła-MiniPL-QA v0.1
Start with:
300 diagnostic examples
Then:
1,200 benchmark examples = 6 domains × 200
Then, only if the signal is real:
3,000 examples = 6 domains × 500
Suggested schema:
id: geo_000123
domain: geography
source_type: manual | wikipedia | public_domain | synthetic_seeded
license: CC-BY-SA | CC0 | own | other
question: "..."
context: "..."
answer: "..."
answer_type: entity | date | number | yes_no | definition | procedure | explanation | list | unanswerable
is_answerable: true
bryla_mid: "..."
parser_status: parsed | partial | other | failed | oracle
parser_version: parser_v0.3
schema_version: bryla_mid_v1
split_group: source_article_or_template_id
split: train | dev | test
notes: "optional"
Most important fields:
domain
answer_type
parser_status
split_group
schema_version
parser_version
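A small validator can enforce the suggested schema at contribution time. The sketch below takes the allowed values from the schema listing above; the choice of which keys are required and the answerability consistency check are my assumptions, and `validate_record` is a hypothetical name.

```python
ALLOWED = {
    "source_type": {"manual", "wikipedia", "public_domain", "synthetic_seeded"},
    "answer_type": {"entity", "date", "number", "yes_no", "definition",
                    "procedure", "explanation", "list", "unanswerable"},
    "parser_status": {"parsed", "partial", "other", "failed", "oracle"},
    "split": {"train", "dev", "test"},
}
# Assumed required keys; "answer" is left optional for unanswerable items.
REQUIRED = ["id", "domain", "question", "answer_type", "parser_status",
            "parser_version", "schema_version", "split_group"]


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing: {k}" for k in REQUIRED if not record.get(k)]
    for key, allowed in ALLOWED.items():
        if key in record and record[key] not in allowed:
            problems.append(f"bad {key}: {record[key]!r}")
    if record.get("answer_type") == "unanswerable" and record.get("is_answerable"):
        problems.append("answer_type is unanswerable but is_answerable is true")
    return problems


record = {
    "id": "geo_000123", "domain": "geography", "source_type": "manual",
    "question": "...", "answer": "...", "answer_type": "entity",
    "is_answerable": True, "parser_status": "parsed",
    "parser_version": "parser_v0.3", "schema_version": "bryla_mid_v1",
    "split_group": "article_42", "split": "train",
}
print(validate_record(record))  # []
```

Returning a list of problems instead of raising lets a maintainer report every issue in a contributed batch at once.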
Community workflow:
1. Contributor writes question/context/answer.
2. Script runs parser and creates Bryła MID.
3. Reviewer checks answer correctness.
4. Bryła reviewer checks fields on a subset.
5. Maintainer runs leakage checks and split generation.
Keep volunteer tasks small. Do not require every contributor to understand the whole parser.
Review policy:
100% single review
10–20% double review
all disagreements saved
Disagreements are useful because they reveal ambiguous schema definitions.
Documentation references:
- Hugging Face dataset cards
- Datasheets for Datasets
- Data Statements for NLP
- Model Cards for Model Reporting
16. Experiments I would run next
Experiment A: control ladder
This is the most important next experiment.
RAW
DOMAIN + RAW
MID + RAW
DOMAIN + MID + RAW
MID shuffled values + RAW
MID shuffled order + RAW
random tags same frequency + RAW
Use:
masked loss
val_ppl_clean
chrF / F1 if possible
tokens vs RAW
same seeds
same split
same tokenizer
Main question:
Does MID actually beat simple domain conditioning?
Experiment B: field survival tournament
Start from MID.
Leave-one-out:
MID
MID - TYPE
MID - POLARITY
MID - SCOPE
MID - INTENT
MID - INTENSITY
MID - CORE
Single-field versions:
TYPE only
POLARITY only
SCOPE only
INTENT only
INTENSITY only
CORE only
DOMAIN only
Interpretation:
| Pattern | Meaning |
|---|---|
| field helps alone and hurts when removed | strong useful field |
| field helps alone but not in MID | redundant |
| field only helps with another field | interaction |
| field does nothing | remove |
| field only helps without DOMAIN | likely domain proxy |
This is more informative than only MIN/MID/FULL.
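The variant list for the tournament can be generated rather than typed by hand, which avoids silently skipping a run. A sketch: `MID_FIELDS` mirrors the six content fields of the frozen schema (SEP is treated as a fixed boundary, not an ablatable field), and the function name is hypothetical. The `DOMAIN only` run is a separate control and is not produced here.

```python
MID_FIELDS = ("TYPE", "POLARITY", "SCOPE", "INTENT", "INTENSITY", "CORE")


def tournament_variants(fields=MID_FIELDS):
    """Map variant name to the fields kept: MID, leave-one-out, single-field."""
    variants = {"MID": list(fields)}
    for f in fields:
        variants[f"MID - {f}"] = [g for g in fields if g != f]
        variants[f"{f} only"] = [f]
    return variants


for name, kept in tournament_variants().items():
    print(f"{name:16s} {kept}")
```

With six fields this yields 13 runs per seed, which is a useful number to budget before starting.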
Experiment C: serialization variants
Test the same information in different formats.
MID-symbolic
MID-verbalized
MID-hybrid
MID-no-defaults
MID-shuffled-order
Examples:
Symbolic:
[TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [INTENSITY:low] [CORE:yes]
Verbalized:
This is a neutral factual statement with general scope. The intent is to inform. The main content is central.
Hybrid:
[type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]
Why: structured representation format matters. SR-LLM argues that code-like structured representations can be less effective than natural-language descriptions, depending on model and setting.
References:
- SR-LLM — ACL Anthology
- SR-LLM — arXiv
- Linearization Order Matters for AMR-to-Text Generation Input
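Keeping the fields in one dict and rendering each format from it guarantees that the variants carry the same information. The sketch below covers the symbolic and no-defaults formats only; the `POL` abbreviation follows the symbolic example above, while the function names and the `defaults` mapping are assumptions (your real defaults belong in the frozen schema document).

```python
ABBREV = {"POLARITY": "POL"}  # the symbolic example shortens POLARITY to POL


def serialize_symbolic(fields):
    """Render fields as bracketed tags, e.g. [TYPE:fact] [POL:neutral]."""
    return " ".join(f"[{ABBREV.get(k, k)}:{v}]" for k, v in fields.items())


def serialize_no_defaults(fields, defaults):
    """Like the symbolic format, but drop fields whose value equals the default."""
    kept = {k: v for k, v in fields.items() if defaults.get(k) != v}
    return serialize_symbolic(kept)


fields = {"TYPE": "fact", "POLARITY": "neutral", "SCOPE": "general",
          "INTENT": "inform", "INTENSITY": "low", "CORE": "yes"}
hypothetical_defaults = {"POLARITY": "neutral", "INTENSITY": "low"}

print(serialize_symbolic(fields))
# [TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [INTENSITY:low] [CORE:yes]
print(serialize_no_defaults(fields, hypothetical_defaults))
# [TYPE:fact] [SCOPE:general] [INTENT:inform] [CORE:yes]
```

The verbalized and hybrid formats would follow the same pattern with their own templates; those templates are a design decision, so they are not guessed at here.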
Experiment D: cooldown
This is one of the most interesting directions.
MeCo trains with metadata prefixes, then uses a cooldown phase on standard text so the model can function without metadata at inference time.
References:
- MeCo — Metadata Conditioning Accelerates Language Model Pre-training
- MeCo OpenReview
For Bryła, test:
RAW baseline
MID + text
eval: MID + text
MID + text for 80–90% of training
RAW text only for final 10–20%
eval: RAW text
DOMAIN + text for 80–90%
RAW text only for final 10–20%
eval: RAW text
random MID + text for 80–90%
RAW text only for final 10–20%
eval: RAW text
Main question:
Is Bryła an inference-time dependency or a training scaffold?
If cooldown preserves some gain, that is a much stronger story.
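The cooldown conditions differ only in when the prefix is dropped, so a step-based switch in the data formatting is enough. This is a sketch of that switch, not MeCo's implementation: the function names, the step-based schedule, and the exact string layout around `[SEP_BRYLA]` are assumptions.

```python
def use_prefix(step, total_steps, cooldown_fraction=0.1):
    """True while training on prefix + text, False during the final RAW-only phase."""
    if not 0.0 <= cooldown_fraction <= 1.0:
        raise ValueError("cooldown_fraction must be in [0, 1]")
    cooldown_start = int(round(total_steps * (1.0 - cooldown_fraction)))
    return step < cooldown_start


def format_example(prefix, text, step, total_steps, cooldown_fraction=0.1):
    """Prepend prefix and separator only outside the cooldown phase."""
    if use_prefix(step, total_steps, cooldown_fraction):
        return f"{prefix} [SEP_BRYLA] {text}"
    return text


print(format_example("[TYPE:fact]", "Spawanie wymaga osłony.", step=0, total_steps=1000))
print(format_example("[TYPE:fact]", "Spawanie wymaga osłony.", step=950, total_steps=1000))
```

Passing a DOMAIN-only or random prefix through the same function gives the two comparison conditions without any other code change.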
Experiment E: counterfactual prefix tests
Formalize your mini-chat test.
Create 20–50 fixed content prompts. For each prompt, vary one field only:
same topic + different POLARITY
same topic + different INTENT
same topic + different INTENSITY
same topic + different CORE
same topic + different SCOPE
Example topic:
gas cylinder leak during welding
Variants:
[INTENT:inform]
[INTENT:warn]
[INTENT:instruct]
Manual scoring:
| Criterion | 0 | 1 | 2 |
|---|---|---|---|
| topic preserved | no | partly | yes |
| intended control effect | no | partly | yes |
| factual consistency | no | partly | yes |
| no domain drift | no | partly | yes |
| answer usefulness | no | partly | yes |
This separates:
prefix changes output distribution
from:
prefix controls the intended property
Those are not the same thing.
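Two small helpers make the probe set reproducible: one builds prefixes that differ from a base in exactly one field, the other averages the 0 to 2 rubric scores per criterion. Both are sketches with hypothetical names; the base prefix and the ratings below are made-up examples, not results.

```python
def counterfactual_prefixes(base, field, values):
    """Return one prefix dict per value, differing from base only in `field`."""
    if field not in base:
        raise KeyError(f"{field} is not a field of the base prefix")
    return [{**base, field: v} for v in values]


def mean_scores(ratings):
    """Average 0-2 rubric scores per criterion over all rated outputs."""
    if not ratings:
        raise ValueError("no ratings given")
    criteria = list(ratings[0].keys())
    return {c: sum(r[c] for r in ratings) / len(ratings) for c in criteria}


base = {"TYPE": "fact", "POLARITY": "neutral", "INTENT": "inform"}
probes = counterfactual_prefixes(base, "INTENT", ["inform", "warn", "instruct"])
print(probes)

ratings = [  # hypothetical manual scores for two generated outputs
    {"topic preserved": 2, "intended control effect": 1},
    {"topic preserved": 1, "intended control effect": 0},
]
print(mean_scores(ratings))  # {'topic preserved': 1.5, 'intended control effect': 0.5}
```

A high "topic preserved" mean together with a low "intended control effect" mean is the signature of a prefix that changes the output distribution without controlling the intended property.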
17. What would convince me Bryła is doing something useful?
A convincing pattern would be:
| Test | Desired result |
|---|---|
| MID > RAW | yes |
| MID > DOMAIN | yes |
| MID > shuffled values | yes |
| MID > random tags | yes |
| clean PPL improves | yes |
| improvement is not only full-sequence PPL | yes |
| at least one task metric improves | yes |
| parser coverage is reported | yes |
| leakage checks pass | yes |
| group splits are used | yes |
| useful fields are identified by ablation | yes |
| counterfactual tests preserve topic | yes |
| cooldown preserves some gain | very strong bonus |
The first four are especially important:
MID > RAW
MID > DOMAIN
MID > shuffled MID
MID > random tags
That would make the result much harder to dismiss.
18. What would make me skeptical?
| Outcome | Why it is a problem |
|---|---|
| DOMAIN ≈ MID | Bryła may mostly encode domain |
| shuffled values ≈ real MID | field meanings may not matter |
| random tags help | formatting/regularization artifact |
| only val_ppl_std improves | tag-prediction artifact |
| val_ppl_clean does not improve | no target-text gain |
| one field changes topic instead of style | control is not semantic |
| parser mostly outputs [OTHER] | model receives little structure |
| seed std is extremely tiny on template data | near-duplicate/template issue |
| random row split on Wikipedia | contamination risk |
| FULL wins only when tags are included in loss | metric artifact |
These are not reasons to stop. They are diagnostics.
19. Recommended 4-week plan
Week 1 — freeze and instrument
Deliverables:
BRYLA-MID-v1 frozen
masked loss implemented
val_ppl_clean / val_ppl_std / val_ppl_tags reported
parser dashboard created
leakage checks scripted
DOMAIN prefix added
Do not run many big trainings yet.
Week 2 — run the control ladder
Run:
RAW
DOMAIN + RAW
MID + RAW
DOMAIN + MID + RAW
MID shuffled values + RAW
MID shuffled order + RAW
random tags same frequency + RAW
Minimum:
3 seeds
Better:
5 seeds
Report:
clean PPL
std PPL
tag PPL
tokens vs RAW
train time
inference time
win count
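For the per-variant report, mean, spread, and win count against RAW can come from one helper. A sketch using only the standard library; the function name is hypothetical and the numbers in the example are placeholders, not your results. Both lists must hold clean PPL in the same seed order.

```python
from statistics import mean, stdev


def summarize_seeds(variant_ppl, baseline_ppl):
    """Mean, sample std, win count, and relative gain of a variant vs a baseline.

    Lower PPL is better, so a win is variant < baseline on the same seed.
    """
    if len(variant_ppl) != len(baseline_ppl) or len(variant_ppl) < 2:
        raise ValueError("need paired results for at least two seeds")
    wins = sum(v < b for v, b in zip(variant_ppl, baseline_ppl))
    base_mean = mean(baseline_ppl)
    return {
        "mean": mean(variant_ppl),
        "std": stdev(variant_ppl),
        "wins": f"{wins}/{len(variant_ppl)}",
        "delta_pct": 100.0 * (base_mean - mean(variant_ppl)) / base_mean,
    }


# Placeholder numbers for three seeds (hypothetical).
print(summarize_seeds([9.7, 9.6, 10.1], [10.0, 10.0, 10.0]))
```

Reporting the win count next to the mean matters with three seeds, since one outlier seed can dominate the average.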
Week 3 — build 300-example diagnostic set
Create:
6 domains × 50 examples
Domains:
technical
geography
biography
daily life
science
sports/events
For each example:
question
context
answer
domain
answer_type
parser_status
bryla_mid
split_group
Run parser diagnostics first. Do not scale yet.
Week 4 — oracle Bryła + counterfactual probes
Create:
100–200 manually corrected MID examples
Compare:
RAW
DOMAIN
parser-MID
oracle-MID
Also create:
20–50 counterfactual prefix probes
This will tell you whether the next bottleneck is parser quality or representation design.
20. Best public framing
I would write the current state like this:
I found that the compact MID schema is a better tradeoff than the full 20-field schema: it gives a small but repeatable improvement in the technical QA setting, while FULL adds many mostly-default fields and a large token cost. I also found that full-sequence perplexity is misleading for prefix-tag experiments, so I now report target-only clean PPL after the separator. Masked-loss training shows that the model does read Bryła prefixes as conditioning information, but the observed polarity effect may be entangled with domain/topic correlations. The next step is to test MID against DOMAIN-only, shuffled-field, and random-tag controls under clean masked loss, and to build a small balanced multi-domain Polish QA set to measure parser coverage outside the original technical domain.
Avoid saying:
Bryła proves semantic understanding.
Bryła replaces raw text.
Bryła scales generally.
FULL Bryła is better.
Polarity controls semantics.
Use:
Bryła conditions generation.
MID is the current best tradeoff.
Clean PPL is required.
Parser coverage is the bottleneck.
Domain controls are necessary.
Cooldown is the next high-value test.
21. Direct answer to the two options
Between:
1. Extend the parser with more domain-specific rules.
2. Build a smaller balanced multi-domain corpus.
I would choose:
Build the smaller balanced corpus first. Then extend the parser only where that corpus shows failures.
Reason:
- rule expansion without a balanced diagnostic set can overfit the parser to one corpus;
- the current parser failure is a coverage problem, but you need coverage by domain/type;
- a balanced dataset separates semantic usefulness from domain/template effects;
- a small clean dataset is more useful than a large noisy one at this stage.
Best immediate target:
300 examples for diagnostics
then:
1,200 examples for real experiments
not another large uncontrolled Wikipedia run.
Short summary
The update is good because the failures make the result more credible.
MID > FULL is important: compact informative fields beat default-heavy annotation.
val_ppl_clean should be the primary metric from now on.
Masked loss is the right objective for Bryła-as-context.
The polarity generation result proves conditioning, but may also reveal domain leakage.
Add DOMAIN as a mandatory control.
Test MID against DOMAIN, shuffled MID, and random tags.
Do not expand FULL now.
Do not blindly add parser rules.
Build a small balanced multi-domain diagnostic corpus first.
Use PolQA, PoQuAD, PolEval, and PUGG as references/resources.
Add parser dashboards, leakage checks, group splits, oracle-Bryła examples, and counterfactual prefix tests.
The strongest next claim would be:
compact Bryła helps beyond domain conditioning under clean target-only loss.