Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid4mysy4htfcn5zifq5cr4kq65wsy7tjm4pqm6dnv3ej3hngm7zfu",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgaxn2m3vqs2"
  },
  "path": "/t/need-help-to-find-a-dataset-for-fine-tuning/87645#post_2",
  "publishedAt": "2026-03-04T15:10:27.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Here’s a short, engaging reply you can post on that thread (helpful first, no selling):\n\n* * *\n\nHi Aditi, yes, this is doable and there are a couple good dataset directions.\n\n  1. Look for scientific paper summarization datasets where you can treat parts of the paper as “extractive style input” and the abstract as the target. The most common are arXiv and PubMed style datasets used in scientific summarization.\n\n  2. If you specifically need “extractive summary to abstractive summary” pairs, you usually create the extractive side yourself. For example, run a simple extractive method like TextRank or use the top k sentences from the paper, then train the model to map that to the abstract.\n\n\n\n\nModel wise, BART or T5 are strong baselines for abstractive summarization and work well with the Transformers library.\n\nQuick question so people can recommend the right dataset and setup. Do you want to summarize full papers, or only specific sections like introduction plus conclusion, and which domain, arXiv or biomedical?",
  "title": "Need help to find a dataset for fine tuning"
}
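
The second suggestion in the reply (build the extractive side yourself, then pair it with the abstract) could be sketched roughly as follows. This is a minimal illustration, not the poster's method: it reads "top k sentences" as the first k sentences of the paper body, uses a naive regex sentence splitter, and assumes each paper is a dict with hypothetical `body` and `abstract` fields. A graph-based ranker such as TextRank could replace `lead_k_extract` without changing the pairing step.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def lead_k_extract(body, k=3):
    """Crude extractive summary: the first k sentences of the paper body."""
    return " ".join(split_sentences(body)[:k])


def build_pairs(papers, k=3):
    """Turn each paper into an (extractive input, abstract target) pair.

    The 'body' and 'abstract' keys are assumed field names for this sketch;
    real datasets may name them differently.
    """
    return [
        {"source": lead_k_extract(p["body"], k), "target": p["abstract"]}
        for p in papers
    ]


# Toy record, invented purely for illustration.
papers = [
    {
        "body": "We study X. Prior work used Y. We propose Z. Results improve.",
        "abstract": "Z improves on Y for X.",
    }
]

pairs = build_pairs(papers, k=2)
print(pairs[0]["source"])
```

The resulting `source`/`target` pairs are in the shape a sequence-to-sequence model such as BART or T5 would typically be fine-tuned on, with `source` as the input text and `target` as the label text.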