{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibicdnfozppxxuk5rxrzt6frnphbl5l6igky3gs6stm54r4seu5ue",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlxfx75glbo2"
  },
  "path": "/t/legal-data-creation/175953#post_2",
  "publishedAt": "2026-05-16T07:09:02.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "LegalBench",
    "LegalBench-RAG",
    "VLQA",
    "ViLQA / ViBidLQA",
    "ARES",
    "RAGAS",
    "BEIR",
    "HotpotQA",
    "FEVER",
    "RAGBench"
  ],
  "textContent": "Hmm…\n\n* * *\n\n# Building a Vietnamese Legal QA Dataset for Evaluating RAG Systems\n\nI think your current plan is directionally correct, but the main weakness is **where the QA generation starts**.\n\nRight now, your pipeline is roughly:\n\n\n    one legal article\n    → strict prompt\n    → low-temperature LLM generation\n    → question-answer pair\n    → local validator\n\n\nThat is acceptable for **easy factual questions**, because the answer is supposed to be inside one legal text.\n\nBut it is structurally weak for **medium** and **hard** legal questions, because a legally complete answer often needs more than the seed article. It may need:\n\n  * a definition from another article;\n  * a special rule for a special class of people;\n  * an exception;\n  * an implementing decree or circular;\n  * a procedure article;\n  * an authority/competence article;\n  * an amendment or effective-date rule;\n  * a higher-level law that controls a lower-level document.\n\n\n\nSo the model may produce an answer that is:\n\n  * faithful to the given article;\n  * locally correct;\n  * fluent;\n  * stable under low temperature;\n  * accepted by a validator trained on that law;\n\n\n\nbut still **legally incomplete** because the full answer requires another law or article.\n\nFor RAG evaluation, that is the key issue. A RAG benchmark should not only test whether the generator can write a nice answer. It should test whether the system can:\n\n\n    retrieve all legally required provisions\n    → use them correctly\n    → cite them correctly\n    → produce a legally complete answer\n\n\nThis is why I would redesign your dataset around **evidence packets**, not isolated articles.\n\n* * *\n\n## 1. 
Core recommendation\n\nDo **not** generate medium/hard QA directly from a single article.\n\nInstead:\n\n\n    seed article\n    → retrieve related legal provisions\n    → classify legal relations\n    → build an evidence packet\n    → generate the question\n    → generate the gold answer\n    → validate citation coverage\n    → validate legal completeness\n\n\nIn short:\n\n> **Generate the evidence first, then generate the QA.**\n\nAn **evidence packet** is the complete set of legal provisions required to answer a question correctly.\n\nFor example:\n\n\n    {\n      \"seed_provision\": \"<main_article_or_clause>\",\n      \"required_provisions\": [\n        {\n          \"citation\": \"<supporting_law_article_or_clause>\",\n          \"role\": \"special_subject_rule\",\n          \"why_required\": \"The general rule changes for this type of subject.\"\n        },\n        {\n          \"citation\": \"<definition_article_or_clause>\",\n          \"role\": \"definition\",\n          \"why_required\": \"This provision defines a term used in the question.\"\n        }\n      ],\n      \"hard_negatives\": [\n        {\n          \"citation\": \"<similar_but_not_controlling_article>\",\n          \"reason\": \"Topically similar, but not legally required.\"\n        }\n      ]\n    }\n\n\nThis is the main fix.\n\n* * *\n\n## 2. 
Why this matters for RAG evaluation\n\nA normal QA dataset can often be built like this:\n\n\n    paragraph → question → answer\n\n\nLegal QA is different.\n\nLegal reasoning is often distributed across multiple legal units:\n\n\n    general rule\n    + definition\n    + exception\n    + special rule\n    + procedure\n    + authority\n    + effective date\n    = complete legal answer\n\n\nThis is why legal RAG benchmarks should evaluate **retrieval completeness**, not only final answer similarity.\n\nUseful references:\n\n  * LegalBench shows that legal reasoning is not one generic ability; it includes multiple forms of legal reasoning.\n  * LegalBench-RAG focuses specifically on the retrieval step in legal RAG and emphasizes precise legal text retrieval.\n  * VLQA is highly relevant because it is a Vietnamese legal QA dataset with expert verification and statutory references.\n  * ViLQA / ViBidLQA is also relevant because it uses LLM-generated Vietnamese legal QA corrected by domain experts.\n  * ARES is useful for thinking about lightweight RAG judges trained on synthetic data plus limited human labels.\n  * RAGAS is useful for RAG evaluation and synthetic test-set generation ideas.\n  * BEIR is useful because it shows BM25 is still a strong retrieval baseline, while reranking/late-interaction methods are often stronger but more expensive.\n  * HotpotQA is useful as a multi-hop QA reference because it stores supporting facts, not only final answers.\n  * FEVER is useful because it includes evidence-based labels like Supported / Refuted / NotEnoughInfo.\n  * RAGBench is useful because it emphasizes explainable/actionable RAG evaluation labels.\n\n\n\n* * *\n\n## 3. 
Refine your taxonomy\n\nYour current taxonomy is good:\n\n### Difficulty\n\n  * **easy**: basic comprehension; answer is in the given legal text.\n  * **medium**: requires some legal knowledge/reasoning; may need multiple articles/laws.\n  * **hard**: requires deep understanding, exceptions, case analysis, unusual situations, or expert grading.\n\n\n\n### Question type\n\n  * **factual**: direct fact from text.\n  * **interpretation**: meaning/purpose of a law or legal phrase.\n  * **analytical**: comparison, relationship, connection, difference.\n  * **application**: scenario/case/facts applied to law.\n\n\n\nYour mapping is mostly right:\n\n\n    factual        → easy\n    interpretation → medium\n    analytical     → medium or hard\n    application    → hard\n\n\nBut I would add one more axis:\n\n> **Evidence scope**\n\nBecause difficulty and evidence scope are not the same.\n\nA question can be easy because it needs one clause. A question can be medium because it needs two related articles. 
A question can be hard because it needs multiple laws, an exception, and a fact pattern.\n\nUse this expanded schema:\n\n\n    {\n      \"difficulty\": \"easy | medium | hard\",\n      \"question_type\": \"factual | interpretation | analytical | application\",\n      \"evidence_scope\": \"single_clause | single_article | multi_article_same_law | multi_law | exception_based | temporal | insufficient_facts\",\n      \"legal_operation\": [\n        \"definition\",\n        \"rule_extraction\",\n        \"comparison\",\n        \"condition_check\",\n        \"exception\",\n        \"special_subject_rule\",\n        \"procedure\",\n        \"authority\",\n        \"sanction\",\n        \"hierarchy\",\n        \"amendment\",\n        \"case_application\"\n      ],\n      \"answerability\": \"answerable | partially_answerable | insufficient_facts | not_in_corpus | version_unclear\"\n    }\n\n\nThis makes the dataset much more useful for diagnosing RAG failures.\n\n* * *\n\n## 4. Difficulty design\n\n## Easy\n\n**Purpose:** test basic retrieval and comprehension.\n\nEasy questions should be:\n\n  * factual only;\n  * answerable from one clause/article;\n  * direct;\n  * objectively gradable;\n  * citation-simple.\n\n\n\nExample patterns:\n\n\n    Theo Điều <article>, cơ quan nào có thẩm quyền <action>?\n    Theo khoản <clause>, điều kiện để <do_something> gồm những gì?\n    Luật quy định thời hạn <process> là bao lâu?\n    Hành vi nào bị cấm theo Điều <article>?\n\n\nRecommended schema:\n\n\n    {\n      \"difficulty\": \"easy\",\n      \"question_type\": \"factual\",\n      \"evidence_scope\": \"single_clause\",\n      \"required_citations\": [\n        {\n          \"role\": \"answer_source\",\n          \"citation\": \"<law_article_clause>\"\n        }\n      ]\n    }\n\n\nReject an easy question if:\n\n  * it needs another law;\n  * it asks “why”;\n  * it requires interpretation;\n  * it requires exceptions;\n  * it depends on external legal knowledge.\n\n\n\n* * 
*\n\n## Medium\n\n**Purpose:** test whether the RAG system can connect legal provisions.\n\nMedium does **not** mean “longer answer.” Medium means the answer requires at least two legal points.\n\nTypical evidence scopes:\n\n\n    multi_article_same_law\n    multi_law\n    definition_based\n    procedure_based\n    special_subject_based\n\n\nMedium question types:\n\n\n    interpretation\n    analytical\n\n\nGood medium patterns:\n\n\n    Cụm từ <legal_phrase> trong quy định này nên được hiểu như thế nào?\n    Mục đích của quy định <rule> là gì?\n    Vì sao luật quy định riêng trường hợp <special_subject>?\n    Phân biệt <concept_A> và <concept_B> theo quy định pháp luật.\n    Điều kiện áp dụng <rule> gồm những yếu tố nào, và căn cứ pháp lý nằm ở đâu?\n\n\nRecommended schema:\n\n\n    {\n      \"difficulty\": \"medium\",\n      \"question_type\": \"interpretation\",\n      \"evidence_scope\": \"multi_article_same_law\",\n      \"legal_operation\": [\n        \"definition\",\n        \"purpose_interpretation\"\n      ],\n      \"required_citations\": [\n        {\n          \"role\": \"anchor_rule\",\n          \"citation\": \"<main_article>\"\n        },\n        {\n          \"role\": \"definition_or_purpose\",\n          \"citation\": \"<supporting_article>\"\n        }\n      ]\n    }\n\n\nA good medium item should pass this test:\n\n\n    Answer using only the seed article = incomplete.\n    Answer using the full evidence packet = complete.\n\n\nReject a medium item if:\n\n  * it is fully answerable from the seed article alone;\n  * the supporting article is only topically similar, not legally necessary;\n  * the answer relies on vague policy reasoning without citation;\n  * required citations are unclear.\n\n\n\n* * *\n\n## Hard\n\n**Purpose:** test legally complete reasoning.\n\nHard questions should contain a legal trap or require multiple legal operations.\n\nGood hard patterns include:\n\nTrap type | What it tests\n---|---\nGeneral rule vs exception | Does 
the system retrieve the exception?\nGeneral law vs special law | Does the system find the controlling special rule?\nDefinition trap | Does it use statutory meaning, not ordinary meaning?\nProcedure trap | Does it know the required process/authority?\nTemporal trap | Does it know effective date/amendment status?\nMissing-fact trap | Does it avoid over-answering?\nSimilar-term trap | Does it distinguish close legal concepts?\nCross-domain trap | Does it combine civil/criminal/administrative/labor/tax law?\n\nHard question types:\n\n\n    analytical\n    application\n\n\nHard application answers should follow:\n\n\n    Issue\n    → Applicable rules\n    → Application to facts\n    → Conclusion\n    → Caveats / missing facts\n\n\nRecommended schema:\n\n\n    {\n      \"difficulty\": \"hard\",\n      \"question_type\": \"application\",\n      \"evidence_scope\": \"exception_based\",\n      \"legal_operation\": [\n        \"case_application\",\n        \"exception\",\n        \"special_subject_rule\"\n      ],\n      \"answerability\": \"partially_answerable\",\n      \"required_citations\": [\n        {\n          \"role\": \"anchor_rule\",\n          \"citation\": \"<main_rule>\"\n        },\n        {\n          \"role\": \"exception\",\n          \"citation\": \"<exception_rule>\"\n        },\n        {\n          \"role\": \"procedure_or_authority\",\n          \"citation\": \"<procedure_rule>\"\n        }\n      ],\n      \"expert_review_required\": true,\n      \"mos_threshold\": 3\n    }\n\n\nYour 0–5 MOS idea is good. 
I would use:\n\nScore | Meaning\n---|---\n0 | Irrelevant, hallucinated, or dangerous\n1 | Mentions topic but misses controlling law\n2 | Partially correct but misses a major rule, exception, or required law\n3 | Minimally acceptable; correct general conclusion but incomplete nuance/citation\n4 | Correct, well-grounded, cites required provisions, handles main exceptions\n5 | Expert-level: complete rule synthesis, application, caveats, citations, and limits\n\nAcceptance rule:\n\n\n    accept hard item only if MOS >= 3\n\n\nAlso store disagreement:\n\n\n    {\n      \"reviewer_scores\": [2, 4, 3],\n      \"mos\": 3.0,\n      \"score_range\": 2,\n      \"needs_adjudication\": true\n    }\n\n\nIf lawyers disagree heavily, mark the item as ambiguous instead of pretending it is clean.\n\n* * *\n\n## 5. Evidence packet generation\n\nThis is the core pipeline.\n\nFor each seed provision:\n\n\n    1. Retrieve candidate related provisions.\n    2. Classify each candidate.\n    3. Keep only legally necessary provisions.\n    4. Build an evidence packet.\n    5. 
Generate QA from that packet.\n\n\nCandidate provisions should be classified as:\n\n\n    REQUIRED\n    HELPFUL_BACKGROUND\n    HARD_NEGATIVE\n    IRRELEVANT\n\n\nMeaning:\n\nLabel | Meaning\n---|---\nREQUIRED | Omitting this provision makes the answer incomplete or wrong\nHELPFUL_BACKGROUND | Useful but not necessary\nHARD_NEGATIVE | Similar enough to confuse retrieval, but not legally controlling\nIRRELEVANT | Not useful\n\nPrompt:\n\n\n    You are building a Vietnamese legal RAG evaluation dataset.\n\n    Seed legal provision:\n    <seed_provision>\n\n    Candidate legal provision:\n    <candidate_provision>\n\n    Task:\n    Classify whether the candidate provision is legally required to answer questions about the seed provision.\n\n    Return JSON only:\n    {\n      \"status\": \"REQUIRED | HELPFUL_BACKGROUND | HARD_NEGATIVE | IRRELEVANT\",\n      \"relation_type\": \"definition | exception | condition | special_subject_rule | procedure | authority | sanction | amendment | same_topic | irrelevant\",\n      \"why\": \"...\",\n      \"would_omission_make_answer_incomplete\": true\n    }\n\n    Rules:\n    - REQUIRED means omitting this provision can make a legal answer incomplete or wrong.\n    - HELPFUL_BACKGROUND means useful but not necessary.\n    - HARD_NEGATIVE means similar enough to confuse retrieval but not legally controlling.\n    - IRRELEVANT means unrelated.\n    - Do not mark a provision REQUIRED merely because it is about the same topic.\n\n\n* * *\n\n## 6. 
Use a legal graph over your 4,000 documents\n\nWith 4,000+ documents, you should build a legal graph.\n\nNodes:\n\n\n    law\n    chapter\n    section\n    article\n    clause\n    point\n    sentence/span\n\n\nMinimum useful node:\n\n\n    {\n      \"node_id\": \"<doc_id>_article_<n>_clause_<m>\",\n      \"law_title\": \"<law_title>\",\n      \"law_number\": \"<law_number>\",\n      \"authority_level\": \"law | decree | circular | resolution\",\n      \"article\": \"<article_number>\",\n      \"clause\": \"<clause_number>\",\n      \"text\": \"<legal_text>\",\n      \"effective_date\": \"<date>\",\n      \"status\": \"effective | amended | repealed | unknown\",\n      \"topics\": [\"<topic_1>\", \"<topic_2>\"],\n      \"explicit_references\": []\n    }\n\n\nEdges:\n\nEdge type | Meaning\n---|---\nexplicit_reference | One provision cites another\ndefines_term | One provision defines a term used elsewhere\nexception_to | One provision creates an exception\nspecial_rule_for | One provision creates special treatment for a group\ncondition_for | One provision gives conditions\nprocedure_for | One provision explains process or authority\nsanction_for | One provision gives consequence/penalty\nimplements | Decree/circular implements a law\namends | Later law changes earlier law\nrepeals | Later law removes earlier law\nsame_topic | Related but not necessarily required\n\nMedium/hard QA should be generated from strong legal edges:\n\n\n    anchor_rule + definition\n    anchor_rule + exception\n    anchor_rule + special_subject_rule\n    anchor_rule + procedure\n    anchor_rule + amendment\n\n\nNot merely:\n\n\n    anchor_rule + same_topic\n\n\n* * *\n\n## 7. 
Retrieval should be hybrid\n\nDo not use only vector search.\n\nVietnamese legal retrieval needs exact matching and semantic matching.\n\nUse:\n\n\n    BM25\n    + dense retrieval\n    + citation regex search\n    + graph expansion\n    + reranking\n\n\nWhy?\n\nLegal documents contain exact signals:\n\n  * law names;\n  * article numbers;\n  * clause numbers;\n  * “Điều”, “khoản”, “điểm”;\n  * “trừ trường hợp”;\n  * “theo quy định tại”;\n  * “sửa đổi, bổ sung”;\n  * formal legal phrases.\n\n\n\nA practical retrieval stack:\n\n\n    Stage 1:\n      BM25 top 50\n      dense retrieval top 50\n      citation index top 20\n      graph neighbors top 20\n\n    Merge:\n      deduplicate by node_id\n\n    Stage 2:\n      rerank top 50 to top 10 or top 20\n\n    Stage 3:\n      evidence sufficiency check\n\n\nSuggested metrics:\n\n\n    anchor_recall@k\n    required_citation_recall@k\n    all_required_retrieved@k\n    supporting_law_recall@k\n    exception_recall@k\n    definition_recall@k\n    hard_negative_rate\n\n\nThe most important one for medium/hard items:\n\n\n    all_required_retrieved@k\n\n\nIf the retriever misses one controlling provision, the answer may be incomplete.\n\n* * *\n\n## 8. 
QA generation prompt\n\nGenerate QA only after the evidence packet is ready.\n\n\n    You are generating Vietnamese legal QA data for evaluating RAG systems.\n\n    Evidence packet:\n    <evidence_packet>\n\n    Generate one question and one gold answer.\n\n    Requirements:\n    - The question must require all REQUIRED provisions.\n    - The answer must cite all REQUIRED provisions.\n    - The question must not be fully answerable from the seed provision alone.\n    - The answer must be legally cautious.\n    - Do not invent policy reasons that are not supported by the legal provisions.\n    - If facts are insufficient, say so.\n\n    Return JSON:\n    {\n      \"question\": \"...\",\n      \"gold_answer\": \"...\",\n      \"difficulty\": \"...\",\n      \"question_type\": \"...\",\n      \"evidence_scope\": \"...\",\n      \"legal_operation\": [...],\n      \"required_citations\": [...],\n      \"supporting_legal_points\": [...],\n      \"why_seed_alone_is_insufficient\": \"...\",\n      \"unacceptable_incomplete_answer\": \"...\",\n      \"rubric_0_5\": {...}\n    }\n\n\n* * *\n\n## 9. Add the seed-only vs full-packet test\n\nFor every medium/hard item, run two validations.\n\n### Test A: seed-only\n\nGive the model only the seed article and ask it to answer.\n\nExpected result:\n\n\n    incomplete\n\n\n### Test B: full packet\n\nGive the model all required provisions and ask it to answer.\n\nExpected result:\n\n\n    complete\n\n\nStore this:\n\n\n    {\n      \"seed_only_answer_status\": \"incomplete\",\n      \"full_packet_answer_status\": \"complete\"\n    }\n\n\nThis directly catches your current failure mode.\n\nIf the seed-only answer is already complete, then the question is probably not truly medium/hard.\n\n* * *\n\n## 10. 
Add a missing-law validator\n\nYour validator should not only ask:\n\n\n    Is the answer consistent with the given law?\n\n\nIt should ask:\n\n\n    Does the answer omit any required law/article/clause?\n\n\nPrompt:\n\n\n    Question:\n    <question>\n\n    Required citations:\n    <required_citations>\n\n    Candidate answer:\n    <answer>\n\n    Evaluate whether the answer is legally complete.\n\n    Return JSON only:\n    {\n      \"score_0_to_5\": 0,\n      \"complete\": true,\n      \"failure_type\": \"none | missing_required_citation | missing_exception | wrong_citation | unsupported_claim | wrong_legal_conclusion | insufficient_facts_not_detected\",\n      \"missing_required_citations\": [],\n      \"missing_legal_points\": [],\n      \"wrong_or_irrelevant_citations\": [],\n      \"unsupported_claims\": [],\n      \"short_reason\": \"...\"\n    }\n\n    Rules:\n    - If any required citation is missing, score cannot exceed 3.\n    - If the conclusion is wrong, score cannot exceed 2.\n    - If the answer is faithful to one provision but misses another controlling provision, mark it as missing_required_citation.\n\n\nThis validator is more aligned with legal RAG than a generic answer-correctness judge.\n\n* * *\n\n## 11. 
How to use your LoRA validator\n\nKeep the LoRA validator, but change its role.\n\nUse it as a **cheap first-pass validator**, not as the final legal judge.\n\nGood uses:\n\n  * detecting obviously bad outputs;\n  * checking local consistency;\n  * validating easy factual items;\n  * filtering bad generations;\n  * reducing API cost.\n\n\n\nWeak uses:\n\n  * cross-law completeness;\n  * special-law exceptions;\n  * amendment/version issues;\n  * hierarchy conflicts;\n  * hard case studies.\n\n\n\nRecommended validation stack:\n\n\n    rule-based citation checker\n    → LoRA local validator\n    → stronger LLM missing-law judge\n    → lawyer review for hard/uncertain items\n\n\nTrain or adapt future validators on failure types:\n\n\n    complete\n    missing_supporting_law\n    missing_exception\n    wrong_citation\n    unsupported_claim\n    wrong_legal_version\n    insufficient_facts_not_detected\n\n\nThis is more useful than only training on correct/incorrect labels.\n\n* * *\n\n## 12. Store supporting legal points\n\nDo not store only final answers.\n\nStore the legal points required for the answer.\n\nExample:\n\n\n    {\n      \"supporting_legal_points\": [\n        {\n          \"point\": \"The general rule requires considering individual characteristics.\",\n          \"source_node_id\": \"<node_a>\",\n          \"role\": \"anchor_rule\"\n        },\n        {\n          \"point\": \"A special rule applies to persons under 18.\",\n          \"source_node_id\": \"<node_b>\",\n          \"role\": \"special_subject_rule\"\n        }\n      ]\n    }\n\n\nThis is similar in spirit to multi-hop QA datasets that store supporting facts, such as HotpotQA, but adapted to legal provisions.\n\n* * *\n\n## 13. 
Add unanswerable and insufficient-fact items\n\nLegal systems should not always answer confidently.\n\nSome legal questions are incomplete because:\n\n  * key facts are missing;\n  * effective date is unknown;\n  * the controlling document is not in the corpus;\n  * the answer depends on a contract or administrative decision;\n  * the situation requires case-specific legal judgment.\n\n\n\nAdd labels:\n\n\n    {\n      \"answerability\": \"answerable | partially_answerable | insufficient_facts | not_in_corpus | version_unclear\"\n    }\n\n\nExpected answer style for insufficient facts:\n\n\n    Chưa thể kết luận chắc chắn vì tình huống chưa nêu rõ <missing_fact>.\n    Nếu <condition_A> được đáp ứng thì <legal_consequence_A>.\n    Nếu không đáp ứng <condition_A> thì <legal_consequence_B>.\n\n\nThis is useful because real legal RAG should be able to say:\n\n\n    The available facts are not enough for a final conclusion.\n\n\nThe FEVER style of evidence-based “Supported / Refuted / NotEnoughInfo” labeling is a useful conceptual reference.\n\n* * *\n\n## 14. 
Recommended final dataset schema\n\nUse JSONL.\n\nOne item per line:\n\n\n    {\n      \"id\": \"vnlegalrag_000001\",\n      \"language\": \"vi\",\n      \"jurisdiction\": \"VN\",\n      \"as_of_date\": \"2026-05-16\",\n\n      \"difficulty\": \"medium\",\n      \"question_type\": \"analytical\",\n      \"evidence_scope\": \"multi_law\",\n      \"legal_operation\": [\n        \"general_rule\",\n        \"special_subject_rule\",\n        \"policy_reasoning\"\n      ],\n      \"answerability\": \"answerable\",\n\n      \"question\": \"<question_text>\",\n      \"gold_answer\": \"<gold_answer_text>\",\n\n      \"required_citations\": [\n        {\n          \"node_id\": \"<node_id_1>\",\n          \"citation_text\": \"<citation_text_1>\",\n          \"role\": \"anchor_rule\",\n          \"must_retrieve\": true,\n          \"must_cite\": true\n        },\n        {\n          \"node_id\": \"<node_id_2>\",\n          \"citation_text\": \"<citation_text_2>\",\n          \"role\": \"special_subject_rule\",\n          \"must_retrieve\": true,\n          \"must_cite\": true\n        }\n      ],\n\n      \"supporting_legal_points\": [\n        {\n          \"point\": \"<legal_point>\",\n          \"source_node_id\": \"<node_id>\",\n          \"role\": \"anchor_rule\"\n        }\n      ],\n\n      \"hard_negatives\": [\n        {\n          \"node_id\": \"<negative_node_id>\",\n          \"reason\": \"Topically similar but not controlling.\"\n        }\n      ],\n\n      \"unacceptable_incomplete_answers\": [\n        {\n          \"failure_type\": \"missing_supporting_law\",\n          \"description\": \"States the general rule but omits the special rule.\"\n        }\n      ],\n\n      \"rubric_0_5\": {\n        \"0\": \"Irrelevant, hallucinated, or legally wrong.\",\n        \"1\": \"Mentions topic but misses main rule.\",\n        \"2\": \"States main rule but misses required supporting law.\",\n        \"3\": \"Minimally acceptable but incomplete citations or 
reasoning.\",\n        \"4\": \"Correct and cites all required provisions.\",\n        \"5\": \"Expert-quality answer with complete reasoning, citations, exceptions, and caveats.\"\n      },\n\n      \"validation\": {\n        \"seed_only_answer_status\": \"incomplete\",\n        \"full_packet_answer_status\": \"complete\",\n        \"automatic_validation_passed\": true,\n        \"expert_review_status\": \"not_required\"\n      }\n    }\n\n\n* * *\n\n## 15. Evaluation design\n\nCreate three benchmark tracks.\n\n### Track A: retrieval-only\n\nInput:\n\n\n    question\n\n\nExpected output:\n\n\n    required legal provisions\n\n\nMetrics:\n\n\n    required_citation_recall@5\n    required_citation_recall@10\n    all_required_citations_retrieved@10\n    MRR\n    nDCG\n    hard_negative_rate\n\n\n### Track B: gold-context generation\n\nInput:\n\n\n    question + gold evidence packet\n\n\nExpected output:\n\n\n    answer\n\n\nMetrics:\n\n\n    legal correctness\n    faithfulness\n    citation use\n    clarity\n    exception handling\n\n\n### Track C: full RAG\n\nInput:\n\n\n    question + corpus\n\n\nSystem must:\n\n\n    retrieve → answer\n\n\nMetrics:\n\n\n    retrieval completeness\n    legal correctness\n    citation completeness\n    faithfulness\n    answer usefulness\n\n\nDiagnostic table:\n\nRetrieval-only | Gold-context generation | Full RAG | Diagnosis\n---|---|---|---\nGood | Good | Good | System works\nBad | Good | Bad | Retriever failed\nGood | Bad | Bad | Generator/reasoner failed\nGood | Good | Bad | Context assembly or prompt failed\nBad | Bad | Bad | Both retrieval and generation are weak\n\nThis separation is important. Otherwise, you may blame the LLM when the real problem is missing retrieval.\n\n* * *\n\n## 16. 
Recommended dataset size\n\nDo not start with a huge dataset.\n\nStart with a high-quality pilot.\n\n### Version 0.1\n\n\n    500 easy\n    300 medium\n    50 hard\n    50 insufficient/unanswerable\n\n\nGoal:\n\n\n    debug schema, prompts, validators, retrieval metrics\n\n\n### Version 0.2\n\n\n    2,000 easy\n    1,000 medium\n    150 hard\n    150 insufficient/unanswerable\n\n\nGoal:\n\n\n    evaluate real RAG pipelines\n\n\n### Version 1.0\n\n\n    5,000–8,000 total\n    20–30% multi-source\n    5–10% expert-reviewed hard\n    5–10% insufficient/unanswerable\n\n\nA smaller expert-reviewed hard set is more valuable than a large weak synthetic hard set.\n\n* * *\n\n## 17. Suggested distribution\n\nFor a RAG evaluation benchmark:\n\nDifficulty | Share | Purpose\n---|---|---\nEasy | 35–45% | Basic retrieval/comprehension\nMedium | 40–50% | Main RAG stress test\nHard | 10–15% | Expert-reviewed reasoning\nInsufficient/unanswerable | 5–10% | Safety and uncertainty handling\n\nBy question type:\n\nType | Share\n---|---\nFactual | 35–45%\nInterpretation | 15–20%\nAnalytical | 25–35%\nApplication | 10–15%\n\nBy evidence scope:\n\nScope | Share\n---|---\nSingle clause/article | 35–45%\nMulti-article same law | 20–25%\nMulti-law | 20–25%\nException/special/temporal | 10–15%\n\n* * *\n\n## 18. 
Common pitfalls\n\n### Pitfall 1: generating medium/hard questions from one article\n\nProblem:\n\n\n    The answer is locally correct but legally incomplete.\n\n\nFix:\n\n\n    Generate from evidence packets.\n\n\n### Pitfall 2: confusing topical similarity with legal necessity\n\nProblem:\n\n\n    A provision may discuss the same topic but not control the answer.\n\n\nFix:\n\n\n    Classify provisions as REQUIRED / HELPFUL_BACKGROUND / HARD_NEGATIVE / IRRELEVANT.\n\n\n### Pitfall 3: using dense retrieval only\n\nProblem:\n\n\n    Embeddings may miss article numbers, exact legal terms, and formal references.\n\n\nFix:\n\n\n    BM25 + dense retrieval + citation search + reranker.\n\n\n### Pitfall 4: scoring only final answer similarity\n\nProblem:\n\n\n    A semantically similar answer can still miss a controlling exception.\n\n\nFix:\n\n\n    Score required citation coverage, legal conclusion correctness, context sufficiency, and exception handling.\n\n\n### Pitfall 5: treating a narrow LoRA validator as a universal legal judge\n\nProblem:\n\n\n    It may validate local consistency but miss cross-law incompleteness.\n\n\nFix:\n\n\n    Use it as a cheap filter, then use missing-law validation and expert review.\n\n\n### Pitfall 6: ignoring legal version/date\n\nProblem:\n\n\n    The correct answer may depend on effective date, amendment, or repeal status.\n\n\nFix:\n\n\n    {\n      \"as_of_date\": \"<date>\",\n      \"law_status\": \"effective | amended | repealed | unknown\",\n      \"effective_date\": \"<date>\"\n    }\n\n\n* * *\n\n## 19. 
Final architecture\n\n\n    4,000+ Vietnamese legal documents\n            ↓\n    legal parser\n            ↓\n    article/clause nodes\n            ↓\n    metadata extraction\n            ↓\n    citation + relation extraction\n            ↓\n    legal graph\n            ↓\n    BM25 + dense + citation indexes\n            ↓\n    evidence packet builder\n            ↓\n    QA generator\n            ↓\n    seed-only vs full-packet validation\n            ↓\n    citation completeness validator\n            ↓\n    LoRA cheap validator\n            ↓\n    strong LLM judge for medium/hard\n            ↓\n    lawyer MOS review for hard\n            ↓\n    final RAG evaluation dataset\n\n\n* * *\n\n## 20. Final answer\n\nYour current method works for **easy factual QA**, but it is not enough for **medium** and **hard** legal QA.\n\nThe main issue is not temperature, prompt strictness, or generation style.\n\nThe main issue is:\n\n> **The model is generating from incomplete legal evidence.**\n\nTo fix that, build every medium/hard item around a **gold evidence packet**.\n\nEach dataset item should store:\n\n\n    question\n    gold answer\n    required citations\n    supporting legal points\n    hard negatives\n    difficulty\n    question type\n    evidence scope\n    legal operation\n    answerability\n    validation status\n    expert review status\n\n\nThe strongest version of your project would not be just a Vietnamese legal QA dataset.\n\nIt would be:\n\n> **A Vietnamese legal RAG benchmark that measures required legal evidence retrieval and legal completeness.**\n\nThat is a much more valuable contribution than plain synthetic QA.",
  "title": "Legal data creation"
}