{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicbha23pd4lxkz5gzfcn6nxl44znwffusr5ezpn4gqiwfklwgwjqa",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlo5vzdtjsv2"
  },
  "path": "/t/legal-data-creation/175953#post_1",
  "publishedAt": "2026-05-12T15:03:33.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Example"
  ],
  "textContent": "Hi, I’m trying to create my own Vietnamese Legal Data to help with **evaluating RAG systems**. Currently I have obtain more than **4000 legal documents**. I wanted to use LLM to **synthesize my dataset** which is a question and answer pair. I devise my dataset into these:\n\n3 types of difficulties:\n\n  * **easy** : basic Comprehension questions, the answer is in the given legal text.\n\n  * **medium** : Require some legal knowledge and reasoning, the answer may contains multiple laws or articles to be consider correct.\n\n  * **hard** : Deep understanding and analysis, ask about exceptions to the rules, case studies, or how to handle unusual situations. Require multiple lawyers to grade the answer with the score from 0-5 to get at least 3+ MOS (Mean Opinion Score)\n\n\n\n\n4 types of questions:\n\n  * **factual** : Direct questions about specific facts mentioned in the text, so this type is always easy and easy can only be this type\n\n  * **interpretation** : Asking for the meaning or the goal behind a law. It focus on what does the legal term actually means in the real world. Ex: explain the meaning of the phrase “mineral processing activities not tied to an investment project” in the law\n\n  * **analytical** : These require breaking down the law to find connections, differences, or relationships between concepts. Ex: rules for exploring a mine vs the rules for mining it, and to explain why those differences exist.\n\n  * **application** : Made up scenarios, case studies, or real-life cases (has been solve with answer)\n\n\n\n\n=> {Factual} ∈ Easy, {Interpretation, analytical} ∈ Medium, {Analytical, Application} ∈ Hard\n\nThe way I synthesize my data is just **writing prompt with strict guidelines** , give it a law article to generate question & answer, set low temp, call API. I also have a validator which is pre-train on 1 Vietnamese Legal law using **LoRA** (I’m poor, can’t train more) which I use to **validate the generate question and answer** of that law. So the problem that I ran into is that the synthesized medium and hard answer were correct but only based on that given article, the fully correct answer needed another law. Example\n\nI haven’t though of anything that could help me so **PLEASE HELP ME!**",
  "title": "Legal data creation"
}