External Publication
Visit Post

How can i build a High Quality dataset?

Hugging Face Forums [Unofficial] June 8, 2026
Source
Thank you But I still have one more question: how should I prepare the SFT dataset? For example, I need natural Persian SFT data, but I have not been able to find a dataset that feels truly natural in Persian. Also, as you said, I need to cover all the capabilities the model is expected to have, and writing everything by hand would take too much time. The model needs capabilities such as: * tool calling * school-grade math reasoning * strong Persian academic/helpful support for students * grammar help in Farsi, such as فاعل، مفعول، متمم, and similar concepts If the right approach is something like: CPT → SFT → Hand-Written SFT and I can keep the final hand-written SFT dataset small, that is completely fine for me, I am willing to write that part manually. However, I still do not understand how to generate SFT data efficiently from the raw text I already have. Assume the text has already been cleaned using the pipeline we discussed earlier. How can I turn that cleaned raw text into a useful SFT dataset in an efficient way? I remember you mentioned using a larger LLM to help generate SFT data from raw text. If possible, could you explain the best workflow for that? Also, if you can give me an example of the dataset size usually needed for SFT, I think I could make a much better plan.

Discussion in the ATmosphere

Loading comments...