Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihpp2dkyex2wjntsjrz3xmmxkjbx7of3vbuu57pzjyu2mk4haw57m",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnqyf3nasok2"
  },
  "path": "/t/how-can-i-build-a-high-quality-dataset/176571#post_5",
  "publishedAt": "2026-06-08T05:29:06.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Thank you  But I still have one more question: how should I prepare the SFT dataset?\n\nFor example, I need natural Persian SFT data, but I have not been able to find a dataset that feels truly natural in Persian. Also, as you said, I need to cover all the capabilities the model is expected to have, and writing everything by hand would take too much time.\n\nThe model needs capabilities such as:\n\n  * tool calling\n\n  * school-grade math reasoning\n\n  * strong Persian academic/helpful support for students\n\n  * grammar help in Farsi, such as فاعل، مفعول، متمم, and similar concepts\n\n\n\n\nIf the right approach is something like:\n\nCPT → SFT → Hand-Written SFT\n\nand I can keep the final hand-written SFT dataset small, that is completely fine for me, I am willing to write that part manually.\n\nHowever, I still do not understand how to generate SFT data efficiently from the raw text I already have.\n\nAssume the text has already been cleaned using the pipeline we discussed earlier.\n\nHow can I turn that cleaned raw text into a useful SFT dataset in an efficient way?\n\nI remember you mentioned using a larger LLM to help generate SFT data from raw text. If possible, could you explain the best workflow for that?\n\nAlso, if you can give me an example of the dataset size usually needed for SFT, I think I could make a much better plan.",
  "title": "How can i build a High Quality dataset?"
}