
Legal data creation

Hugging Face Forums [Unofficial] May 12, 2026

Hi, I’m trying to create my own Vietnamese legal dataset to help evaluate RAG systems. So far I have obtained more than 4,000 legal documents. I want to use an LLM to synthesize the dataset, which consists of question-and-answer pairs. I divide the dataset as follows:

3 difficulty levels:

  • easy : basic comprehension questions; the answer is in the given legal text.

  • medium : requires some legal knowledge and reasoning; the answer may need to cite multiple laws or articles to be considered correct.

  • hard : requires deep understanding and analysis; asks about exceptions to the rules, case studies, or how to handle unusual situations. Multiple lawyers grade each answer on a 0-5 scale, and the answer must reach an MOS (Mean Opinion Score) of at least 3.
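To make the MOS rule for hard questions concrete, here is a minimal sketch. The function names, the requirement of at least two graders (my reading of "multiple lawyers"), and the inclusive threshold of 3 are assumptions for illustration, not something fixed by the setup above.

```python
from statistics import mean

def mean_opinion_score(scores):
    """Average the per-lawyer grades, each expected to be between 0 and 5."""
    if len(scores) < 2:
        # Assumption: "multiple lawyers" means at least two graders.
        raise ValueError("need grades from at least two lawyers")
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("each grade must be between 0 and 5")
    return mean(scores)

def passes_mos(scores, threshold=3.0):
    """Return True when the mean grade reaches the threshold (inclusive)."""
    return mean_opinion_score(scores) >= threshold
```

For example, grades of 3, 4 and 2 average to exactly 3 and pass, while 2, 3 and 3 average below 3 and fail.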

4 types of questions:

  • factual : direct questions about specific facts mentioned in the text. This type is always easy, and easy questions can only be this type.

  • interpretation : asks for the meaning or purpose behind a law. It focuses on what a legal term actually means in the real world. Ex: explain the meaning of the phrase “mineral processing activities not tied to an investment project” in the law.

  • analytical : requires breaking down the law to find connections, differences, or relationships between concepts. Ex: compare the rules for exploring a mine with the rules for mining it, and explain why those differences exist.

  • application : made-up scenarios, case studies, or real-life cases (ones that have already been resolved, so an answer exists).

=> Easy = {factual}, Medium = {interpretation, analytical}, Hard = {analytical, application}
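The mapping between difficulty and question type can be enforced when each record is created. This is only a sketch: the `QAPair` class, its field names, and `article_id` are hypothetical and not part of my actual pipeline.

```python
from dataclasses import dataclass

# Which question types are allowed at each difficulty level.
ALLOWED_TYPES = {
    "easy": {"factual"},
    "medium": {"interpretation", "analytical"},
    "hard": {"analytical", "application"},
}

@dataclass(frozen=True)
class QAPair:
    """One synthesized record; field names are hypothetical."""
    question: str
    answer: str
    difficulty: str
    question_type: str
    article_id: str

    def __post_init__(self):
        allowed = ALLOWED_TYPES.get(self.difficulty)
        if allowed is None:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        if self.question_type not in allowed:
            raise ValueError(
                f"{self.question_type!r} is not allowed for {self.difficulty!r}"
            )
```

With this check, a record tagged `easy` and `application` is rejected at construction time instead of slipping into the dataset.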

The way I synthesize the data is simple: I write a prompt with strict guidelines, give it a law article to generate a question and answer from, set a low temperature, and call the API. I also have a validator, trained with LoRA on a single Vietnamese law (I’m poor and can’t train more), which I use to validate the generated questions and answers for that law. The problem I ran into is that the synthesized medium and hard answers are correct, but only with respect to the given article; a fully correct answer needs another law as well. Example
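Roughly, the generation loop looks like the sketch below. Everything here is illustrative: the prompt template, the `Q:`/`A:` output format, and the `generate` and `validate` callables are hypothetical stand-ins for the real API call and the LoRA validator, and the default temperature of 0.2 is just an example of a "low" value.

```python
from typing import Callable, Optional

# Hypothetical prompt template; the real guidelines are much stricter.
PROMPT_TEMPLATE = (
    "Write one {difficulty} {question_type} question about the article below,\n"
    "followed by its answer.\n\n"
    "Article:\n{article}\n\n"
    "Reply with one line starting with 'Q:' and one line starting with 'A:'."
)

def build_prompt(article: str, difficulty: str, question_type: str) -> str:
    return PROMPT_TEMPLATE.format(
        article=article, difficulty=difficulty, question_type=question_type
    )

def parse_qa(text: str) -> Optional[tuple]:
    """Extract the question and answer lines from the raw model output."""
    question = answer = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:"):
            answer = line[2:].strip()
    return (question, answer) if question and answer else None

def synthesize(
    article: str,
    difficulty: str,
    question_type: str,
    generate: Callable[[str, float], str],      # stand-in for the API call
    validate: Callable[[str, str, str], bool],  # stand-in for the validator
    temperature: float = 0.2,
) -> Optional[dict]:
    """Generate one Q&A pair for an article; return None if it is rejected."""
    raw = generate(build_prompt(article, difficulty, question_type), temperature)
    parsed = parse_qa(raw)
    if parsed is None:
        return None
    question, answer = parsed
    if not validate(article, question, answer):
        return None
    return {
        "question": question,
        "answer": answer,
        "difficulty": difficulty,
        "question_type": question_type,
    }
```

Note that both the generator and the validator only ever see the single article passed in, which is where the issue described above comes from.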

I haven’t thought of anything that could help, so PLEASE HELP ME!
