Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicu2gvkagjlplanhtza6pmb2lhfxelp6gdahn2kc3gue2peyar5oy",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mh6kfzmnwyc2"
  },
  "path": "/t/best-practices-to-create-an-audio-dataset/174312#post_2",
  "publishedAt": "2026-03-16T12:37:03.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Google for Developers",
    "scikit-learn",
    "CHiME Challenges and Workshops"
  ],
  "textContent": "Since having enough data during fine-tuning is all that matters, there isn’t really a fixed ratio. The key is maybe whether there is enough data.\n\n* * *\n\nYes. There is a guideline. But it is **not** a universal percentage rule.\n\nFor audio datasets, the best practice is:\n\n**Use a split that matches the kind of “unseen” audio you want the model to handle, and keep validation and test independent.**\n\nYour example, **85% train / 5% validation / 10% test** , can be a perfectly reasonable starting point. But it is only good if the validation and test portions are still large enough to be meaningful, representative, and cleanly separated from training data. Google’s ML guidance is explicit that there are **no fixed percentage requirements** for train, validation, and test sets, and that good holdout sets should be representative, statistically meaningful, and free of duplicates from training. (Google for Developers)\n\n## What the three splits are for\n\nThe **training set** is the part the model learns from.\nThe **validation set** is the part you use during development to make decisions, such as hyperparameter tuning, early stopping, augmentation choices, threshold selection, model selection, and preprocessing choices.\nThe **test set** is the final untouched set you use only after development is finished, to estimate how well the system will work on new data. Google describes this as the normal train → validate → adjust → test workflow, and scikit-learn states plainly that learning parameters and testing on the same data is a methodological mistake. (Google for Developers)\n\n## So is 85 / 5 / 10 a good split?\n\nIt is a **valid starting split** , not a law.\n\nThe important question is not “Is 85 / 5 / 10 the standard?” The important question is:\n\n**Do the 5% and 10% sets contain enough diversity and enough examples to tell you something reliable?**\n\nIf yes, then 85 / 5 / 10 can work well. 
If not, then even a “standard-looking” split is weak. Google’s guidance does not prescribe a single ratio. It emphasizes instead that the holdout sets must be large enough to be statistically meaningful and must resemble real-world data. (Google for Developers)\n\n## Why audio is different from many other data types\n\nWith audio, the biggest danger is usually **not** the exact percentage. It is **leakage**.\n\nTwo files can be technically different files and still be too closely related for fair evaluation. That happens when:\n\n  * the same **speaker** appears in train and test,\n  * many clips are cut from the same **source recording**,\n  * the same **session**, **room**, or **microphone** appears across splits,\n  * augmented variants of the same clip are spread across train, validation, and test,\n  * or repeated utterances appear in more than one split. (scikit-learn)\n\nThis is exactly why scikit-learn provides `GroupKFold` and `StratifiedGroupKFold`: they keep groups non-overlapping across splits instead of pretending every row is independent. (scikit-learn)\n\n## What strong audio benchmarks do\n\nWell-designed audio benchmarks usually do **not** just random-shuffle files.\n\nFor example, the CHiME challenge states that its evaluation data is disjoint from training and development, with **no overlap in participants or rooms**, and warns that participant overlap in a dev set can encourage overfitting. ESC-50 uses predefined folds so that clips from the same **original source** stay in the same fold. VoxCeleb explicitly uses **disjoint speakers** between development and test. These are not arbitrary details. They are the reason the benchmark is credible. 
(CHiME Challenges and Workshops)\n\n## The right way to think about splitting audio\n\nDo not start with percentages.\n\nStart with this question:\n\n**What does “unseen” mean for my real use case?**\n\nIf the model must work on **new speakers**, then split by speaker.\nIf it must work on **new sessions or meetings**, split by session or meeting.\nIf it must work on **new environments**, split by room, device, or environment.\nIf your clips are cut from long recordings, split by **source recording**, not by clip. That is the same logic used in grouped cross-validation and in audio benchmarks such as ESC-50 and CHiME. (scikit-learn)\n\n## A practical recommendation\n\nIf you are building an audio dataset from scratch and do not yet have a special reason to do otherwise, this is a strong default:\n\n  * use your proposed **85 / 5 / 10** or a similar three-way split,\n  * but make the split by **speaker, session, or source recording**, not by raw file name alone,\n  * and check that each split is still representative in class balance, duration, noise conditions, and recording conditions. (Google for Developers)\n\nIf the dataset is small, a fixed three-way split may waste too much data. In that case, a cleaner approach is often to keep one final holdout test set and use grouped cross-validation on the rest during development. scikit-learn’s cross-validation guidance supports this logic, especially when you need model selection without contaminating the final test estimate. (scikit-learn)\n\n## Your bonus question: can test and evaluation use the same MP3 files?\n\nFor a proper setup, **no**.\n\nIf by “evaluation” you mean **validation/dev**, then using the **same MP3 files** for validation and test defeats the purpose of having both sets. The validation set is used repeatedly while you make decisions. Once you use those files to choose model settings, they are no longer an independent final test. 
Google explicitly says validation and test sets can “wear out” with repeated use, and scikit-learn warns that test information leaking into fitting or preprocessing produces overly optimistic scores. (Google for Developers)\n\nSo the clean rule is:\n\n  * **training** teaches the model,\n  * **validation** helps you choose,\n  * **test** gives the final unbiased check. (Google for Developers)\n\nIf validation and test are the same files, then in reality you have only **one holdout set**, not two separate ones. That is acceptable only as a compromise when data is scarce, and then it should be described honestly as a single holdout setup, not as a full train/validation/test design. (Google for Developers)\n\n## In audio, “same files” is broader than it sounds\n\nThis point matters a lot.\n\nEven if the MP3 filenames are different, the split can still be weak if validation and test contain:\n\n  * different cuts from the same long recording,\n  * noisy and clean versions of the same utterance,\n  * augmented variants of the same clip,\n  * repeated recordings from the same speaker in the same session,\n  * or material that is nearly identical except for trivial transformations.\n\nGoogle recommends removing duplicates from validation and test, and scikit-learn recommends splitting before preprocessing to avoid leakage. ESC-50’s fold design is a concrete example of keeping related clips together. (Google for Developers)\n\n## The shortest useful answer\n\nYes, **85 / 5 / 10** can be a good guideline.\nNo, it is **not** a universal rule.\nFor audio, **how you split** matters more than the exact percentages.\nAnd **validation and test should not be the same MP3 files** if you want an honest final result. (Google for Developers)",
  "title": "Best practices to create an audio dataset"
}
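
## Sketch: a grouped 85 / 5 / 10 split

The post's central advice, splitting by speaker, session, or source recording and not by file, can be illustrated with a few lines of plain Python. This is only a minimal sketch under stated assumptions, not the method of the post or of any library: the function `grouped_split`, the `speaker` field, and the file names are all hypothetical, and the greedy assignment is just one simple way to approximate the target percentages. With scikit-learn available, `GroupShuffleSplit` or `GroupKFold` would be the more usual route.

```python
import random
from collections import defaultdict


def grouped_split(clips, group_key, percents=(85, 5, 10), seed=0):
    """Assign whole groups to train/validation/test so no group spans two splits.

    `clips` is a list of dicts; `group_key` names the field that identifies
    the group (speaker, session, or source recording). Hypothetical helper.
    """
    # Collect clips per group so a group is always moved as a unit.
    by_group = defaultdict(list)
    for clip in clips:
        by_group[clip[group_key]].append(clip)

    # Shuffle group ids reproducibly.
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)

    names = ("train", "validation", "test")
    # Integer targets (scaled by 100) avoid floating-point tie surprises.
    targets = {n: p * len(clips) for n, p in zip(names, percents)}
    splits = {n: [] for n in names}

    for gid in group_ids:
        # Greedy: give the group to the split furthest below its target size.
        name = max(names, key=lambda n: targets[n] - 100 * len(splits[n]))
        splits[name].extend(by_group[gid])
    return splits


# Hypothetical toy dataset: 40 speakers with 5 clips each.
clips = [
    {"path": f"spk{s:02d}_utt{u}.mp3", "speaker": f"spk{s:02d}"}
    for s in range(40)
    for u in range(5)
]
splits = grouped_split(clips, group_key="speaker")

for name, part in splits.items():
    speakers = {c["speaker"] for c in part}
    print(name, len(part), "clips from", len(speakers), "speakers")
```

The percentages are only approximated, because groups are never broken up. With very few or very uneven groups, the validation or test split can end up tiny or empty, which is the situation where the post suggests one held-out test set plus grouped cross-validation on the rest.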