{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiddvyowgvgzlgb6eyi4jwejjxpas4o5zl4uoxmhmgkhlwegzvxjgm",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkgr23lwntl2"
},
"path": "/t/how-to-edit-dataset-to-train-ai/175577#post_2",
"publishedAt": "2026-04-26T22:46:15.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Hugging Face",
"Creative Commons"
],
"textContent": "It’s a bit long to put into words, but basically:\n\n * Just to be safe, delete any metadata “within the image files.”\n * Convert the image files into lightweight formats suitable for training, such as JPEG, and use those as the main part of the dataset. (You could keep the TIFF files for those who want high-quality images.) It’s faster to do this in Python… and it’s easy to have a generative AI generate the actual code.\n * Create a CSV or JSON file to be used as metadata (labels) during training. Any text file format that’s easy for you to edit will work. (Typically, users decide for themselves which fields to use and how, so it’s fine to have many metadata fields. The accuracy and precision of the labels directly impact the model’s performance. This is the “main dataset creation phase” that comes after gathering the image collection.)\n * Write the license and `README.md`\n * Other minor details\n\n\n\nThe process should generally follow these steps.\n\n* * *\n\n# What to do next with your Hugging Face portrait dataset\n\nYou already did the hardest first step: you made a public Hugging Face dataset and selected **CC0 1.0** , which is the right license direction if your goal is broad reuse, including AI/ML training and commercial reuse. The next step is not to train a model; it is to make the dataset **clear, loadable, legally understandable, and useful to people or systems that may include it in training corpora**.\n\nRight now, your dataset page shows **CC0 1.0** , **2 rows** , a total file size of about **350 MB** , an empty README, and a Dataset Viewer failure because it tried to scan **350,119,395 bytes** against a **300,000,000-byte** limit. That means the problem is not the idea; the problem is packaging. (Hugging Face)\n\n* * *\n\n## 1. 
Reframe the dataset\n\nUse this framing:\n\n> A consented, CC0, high-resolution portrait photograph released for unrestricted public reuse, including AI/ML training, evaluation, research, commercial use, redistribution, modification, and inclusion in larger datasets.\n\nAvoid framing it as:\n\n> A TIFF file for AI to train on.\n\nThat is too narrow. AI systems do not automatically train on every public upload. People and dataset builders are more likely to use it if the repo is easy to understand, easy to load, and legally clear.\n\n* * *\n\n## 2. Separate the archival image from the training image\n\nYour TIFF is valuable as a **source/master file** , but it should not be the default training row. A 302 MB-style archival image is heavy for previews and most training pipelines. Your current viewer error is evidence that the default data is too large for convenient Hub preview. Hugging Face documents `TooBigContentError` as a Dataset Viewer limit issue and recommends avoiding very large first-row content or moving large payloads to separate files when possible. (Hugging Face)\n\nUse this target structure:\n\n\n README.md\n\n train/\n ian_portrait_001.jpg\n metadata.csv\n\n original/\n ian_portrait_001.tif\n\n\nWhat this means:\n\nPath | Role\n---|---\n`train/ian_portrait_001.jpg` | Normal dataset image; easy to preview/load/train from\n`train/metadata.csv` | Caption, license, consent, and source metadata\n`original/ian_portrait_001.tif` | Full-resolution archival source\n\nThis turns the repo from “two versions of the same image as two rows” into “one usable dataset row plus one archival source file.”\n\n* * *\n\n## 3. 
Create a smaller training-friendly image\n\nMake a high-quality JPEG or PNG derivative from the TIFF.\n\nSuggested default:\n\n * Format: **JPEG**\n * Long side: **2048 px** or **1024 px**\n * Color mode: **RGB**\n * Quality: high, but not huge\n * Metadata: stripped unless intentionally retained\n * Purpose: default loadable image\n\n\n\nExample:\n\n\n # pip install pillow\n\n from pathlib import Path\n from PIL import Image, ImageOps\n\n source_path = Path(\"Ian-1.tif\")\n output_dir = Path(\"train\")\n output_dir.mkdir(exist_ok=True)\n\n img = Image.open(source_path)\n img = ImageOps.exif_transpose(img)\n img = img.convert(\"RGB\")\n\n # Keep quality high while making the file practical for preview/training.\n max_side = 2048\n img.thumbnail((max_side, max_side))\n\n img.save(output_dir / \"ian_portrait_001.jpg\", quality=95, optimize=True)\n\n\nThis gives users a practical default image while preserving the TIFF for people who need the full-resolution source.\n\n* * *\n\n## 4. Check and strip hidden metadata\n\nHigh-resolution portraits may include EXIF/IPTC metadata such as camera model, lens, timestamp, editing software, creator fields, contact info, or GPS coordinates. Use ExifTool to inspect metadata; it is a standard tool for reading/writing image metadata. (Creative Commons)\n\nInspect:\n\n\n exiftool Ian-1.tif\n exiftool train/ian_portrait_001.jpg\n\n\nStrip the public training derivative:\n\n\n exiftool -all= -overwrite_original train/ian_portrait_001.jpg\n\n\nRecommended approach:\n\nFile | Recommendation\n---|---\nPrivate original TIFF | Keep untouched locally\nPublic training JPG | Strip metadata\nPublic archival TIFF | Inspect; remove private/GPS/contact metadata if present\nExtra PNG derivative | Strip metadata if kept\n\n* * *\n\n## 5. 
Add `metadata.csv`\n\nCreate:\n\n\n train/metadata.csv\n\n\nUse this:\n\n\n file_name,text,subject_type,source_format,archival_file,license,ai_training_permission,depicted_person_consent,rights_holder_release,no_endorsement\n ian_portrait_001.jpg,\"Studio portrait photograph of a young Black man, neutral expression, direct gaze, plain background.\",human portrait,TIFF,original/ian_portrait_001.tif,cc0-1.0,yes,yes,yes,\"Reuse does not imply endorsement by the depicted person.\"\n\n\nWhy this works:\n\nColumn | Purpose\n---|---\n`file_name` | Connects the row to the image file\n`text` | Caption for image-captioning/search/training workflows\n`subject_type` | States that this is a human portrait\n`source_format` | Explains the original file format\n`archival_file` | Points to the TIFF without making it a training row\n`license` | Makes the license explicit at row level\n`ai_training_permission` | States your intent clearly\n`depicted_person_consent` | Important for a recognizable person\n`rights_holder_release` | Important if a photographer/copyright holder is involved\n`no_endorsement` | Clarifies that reuse does not imply endorsement\n\nHugging Face’s image dataset guide supports this no-code pattern: image files plus `metadata.csv`, `metadata.jsonl`, or `metadata.parquet`, with `file_name` linking metadata to images. (Hugging Face)\n\nAvoid naming the TIFF pointer `original_file_name` or `archival_file_name`. Since Hugging Face treats `file_name` / `*_file_name` fields as media references, use `archival_file` instead.\n\n* * *\n\n## 6. Replace the empty README with a real dataset card\n\nOn Hugging Face, `README.md` is the dataset card. Dataset-card metadata helps with license display, tags, discoverability, size, language, and data-files configuration. 
(Hugging Face)\n\nUse this polished README:\n\n\n ---\n license: cc0-1.0\n pretty_name: \"CC0 High-Resolution Portrait Photograph of a Young Black Man\"\n language:\n - en\n tags:\n - image\n - portrait\n - photography\n - human\n - cc0\n - public-domain\n - ai-training\n - computer-vision\n - image-captioning\n task_categories:\n - image-to-text\n - text-to-image\n size_categories:\n - n<1K\n configs:\n - config_name: default\n data_dir: train\n default: true\n ---\n\n # CC0 High-Resolution Portrait Photograph of a Young Black Man\n\n ## Dataset Summary\n\n This dataset contains a consented, high-resolution portrait photograph intentionally released for unrestricted public reuse, including AI/ML training, fine-tuning, evaluation, research, education, commercial use, redistribution, modification, and inclusion in larger datasets.\n\n The default dataset contains a training-friendly image derivative in `train/`. The full-resolution TIFF source file is preserved separately in `original/`.\n\n ## Dataset Contents\n\n | Path | Purpose |\n |---|---|\n | `train/ian_portrait_001.jpg` | Training-friendly public image |\n | `train/metadata.csv` | Caption, license, consent, and source metadata |\n | `original/ian_portrait_001.tif` | Archival full-resolution source image |\n\n ## Data Fields\n\n The default dataset contains:\n\n - `image`: the training-friendly portrait image\n - `text`: factual image description\n - `subject_type`: broad subject category\n - `source_format`: original source format\n - `archival_file`: path to the full-resolution source file\n - `license`: row-level license identifier\n - `ai_training_permission`: explicit AI/ML training permission\n - `depicted_person_consent`: consent flag\n - `rights_holder_release`: rights-holder release flag\n - `no_endorsement`: no-endorsement statement\n\n ## Intended Use\n\n This dataset may be used for:\n\n - AI/ML training\n - image-captioning examples\n - text-to-image dataset experiments\n - computer-vision 
testing\n - public-domain image reuse\n - research and education\n - commercial and non-commercial projects\n - inclusion in larger datasets\n\n ## AI/ML Training Permission\n\n This image is intentionally released for AI/ML training, fine-tuning, evaluation, research, commercial use, redistribution, modification, and inclusion in larger datasets under CC0 1.0.\n\n ## Consent and Rights\n\n The depicted person has consented to public release of this image for unrestricted reuse, including AI/ML training and commercial use.\n\n The uploader represents that they have the rights necessary to release the image and its derivatives under CC0 1.0.\n\n Reuse of this image does not imply endorsement by the depicted person, uploader, photographer, or any other contributor.\n\n ## License\n\n This dataset is released under CC0 1.0.\n\n No attribution is required. Attribution is appreciated but not required.\n\n ## Limitations\n\n This is a single-image dataset. It is useful as a public portrait sample, image-captioning example, dataset-loading example, test image, or one image in a larger corpus.\n\n It is not large enough by itself to train a robust identity model, face-recognition model, or general image-generation model.\n\n This dataset contains one person and should not be treated as representative of any demographic group.\n\n ## Ethical Considerations\n\n This dataset contains a recognizable human portrait. Users should consider privacy, publicity, likeness, and endorsement issues in downstream uses, even when the image is openly licensed.\n\n Do not imply that the depicted person endorses a downstream model, product, dataset, output, or use case unless separate permission has been granted.\n\n ## How to Load\n\n ```python\n from datasets import load_dataset\n\n ds = load_dataset(\"jericho98/tiff-photograph-of-young-black-man\", split=\"train\")\n print(ds)\n print(ds[0])\n ```\n\n ## Citation\n\n No citation is required under CC0. 
If you cite the dataset anyway, cite the Hugging Face dataset page.\n\n\nThe `configs` block matters because it tells the Hub that `train/` is the default dataset directory, rather than treating every image-like file as a normal row. Hugging Face supports YAML configuration for dataset cards and data files. (Hugging Face)\n\n* * *\n\n## 7. Be precise about CC0, consent, and endorsement\n\nCC0 is a good fit for your goal because it allows copying, modification, distribution, and commercial use without asking permission, to the extent allowed by law. (Creative Commons)\n\nBut a portrait has an extra layer: **likeness, privacy, publicity, and endorsement**. Creative Commons notes that CC0 does not remove rights others may have around image, likeness, privacy, or publicity. (Creative Commons)\n\nSo your dataset card should say both:\n\n\n This dataset is released under CC0 1.0.\n\n\nand:\n\n\n The depicted person has consented to public release for unrestricted reuse, including AI/ML training and commercial use.\n\n\nIf you are the depicted person, use:\n\n\n The depicted person is the uploader and has intentionally released this image for unrestricted public use, including AI/ML training and commercial reuse.\n\n\nIf a photographer took the photo, add:\n\n\n The photographer/copyright holder has granted permission to release this image and its derivatives under CC0 1.0.\n\n\n* * *\n\n## 8. Upload/edit workflow\n\n### Easiest path: Hugging Face web UI\n\n 1. Open your dataset.\n\n 2. Go to **Files and versions**.\n\n 3. Upload:\n\n * `train/ian_portrait_001.jpg`\n * `train/metadata.csv`\n * `original/ian_portrait_001.tif`\n 4. Replace `README.md` with the dataset card above.\n\n 5. Remove or move the old root-level image files.\n\n 6. Commit with:\n\n\n\n\n\n Restructure dataset with metadata and archival source\n\n\nHugging Face supports uploading datasets through the Hub UI, including common formats such as images, CSV, JSONL, and Parquet. 
(Hugging Face)\n\n### Git path\n\n\n git lfs install\n\n git clone https://huggingface.co/datasets/jericho98/tiff-photograph-of-young-black-man\n cd tiff-photograph-of-young-black-man\n\n mkdir -p train original\n\n git mv Ian-1.tif original/ian_portrait_001.tif\n cp /path/to/ian_portrait_001.jpg train/ian_portrait_001.jpg\n\n # Create train/metadata.csv and replace README.md before this step.\n git add README.md train/metadata.csv train/ian_portrait_001.jpg original/ian_portrait_001.tif\n git commit -m \"Restructure dataset with metadata and archival source\"\n git push\n\n\nIf you do not want to keep the old PNG:\n\n\n git rm Ian-1.png\n git commit -m \"Remove duplicate root-level image derivative\"\n git push\n\n\n* * *\n\n## 9. Test the cleaned dataset\n\nInstall:\n\n\n pip install -U datasets pillow\n\n\nTest:\n\n\n from datasets import load_dataset\n\n repo_id = \"jericho98/tiff-photograph-of-young-black-man\"\n\n ds = load_dataset(repo_id, split=\"train\")\n\n print(ds)\n print(ds.features)\n print(ds[0])\n\n image = ds[0][\"image\"]\n print(type(image), image.size, image.mode)\n image.save(\"loaded_test.jpg\")\n\n\nExpected result:\n\n\n Dataset({\n features: ['image', 'text', 'subject_type', ...],\n num_rows: 1\n })\n\n\nYou want:\n\n * one default row,\n * working image preview,\n * caption visible,\n * license/consent fields visible,\n * TIFF preserved but not loaded as the default row,\n * no custom dataset builder script.\n\n\n\n* * *\n\n## 10. Do not use a dataset builder script\n\nFor your case, a builder script is unnecessary. Hugging Face’s `ImageFolder` exists so image datasets can be loaded without writing custom dataset code. (Hugging Face)\n\nUse:\n\n\n train/\n ian_portrait_001.jpg\n metadata.csv\n\n\nAvoid:\n\n\n dataset.py\n builder.py\n tiff_photograph_of_young_black_man.py\n\n\nA simple file-based dataset is more durable, easier for beginners, and easier for the Hub to preview.\n\n* * *\n\n## 11. 
Be honest about what one photo can do\n\nOne image is useful as:\n\n * a public-domain portrait asset,\n * a dataset-loading example,\n * an image-captioning example,\n * a computer-vision test image,\n * an image-processing source,\n * one item in a larger training corpus,\n * a clean consent/licensing example.\n\n\n\nOne image is **not** enough by itself for:\n\n * robust face recognition,\n * general portrait generation,\n * identity modeling,\n * demographic evaluation,\n * a balanced dataset,\n * strong DreamBooth/LoRA subject personalization.\n\n\n\nSo do not oversell it. Say:\n\n\n This is a single-image dataset and should not be treated as representative of any demographic group.\n\n\nThat is accurate and responsible.\n\n* * *\n\n## 12. Optional later: add more photos\n\nIf you later want to make this more useful for subject-personalization training, add several real photos with variation:\n\n\n front-facing portrait\n three-quarter angle\n side angle\n different expression\n different lighting\n different background\n close-up\n mid-shot\n possibly full-body\n\n\nFuture structure:\n\n\n README.md\n\n train/\n ian_portrait_001.jpg\n ian_portrait_002.jpg\n ian_portrait_003.jpg\n metadata.csv\n\n original/\n ian_portrait_001.tif\n ian_portrait_002.tif\n ian_portrait_003.tif\n\n\nExample metadata:\n\n\n file_name,text,view,expression,lighting,background,license,ai_training_permission,depicted_person_consent,archival_file\n ian_portrait_001.jpg,\"Studio portrait photograph of a young Black man facing the camera, neutral expression.\",front,neutral,studio,plain,cc0-1.0,yes,yes,original/ian_portrait_001.tif\n ian_portrait_002.jpg,\"Studio portrait photograph of a young Black man at a three-quarter angle, slight smile.\",three-quarter,slight smile,studio,plain,cc0-1.0,yes,yes,original/ian_portrait_002.tif\n\n\nDo not add crops or filters as if they were independent originals. If you add derivatives, label them as derivatives.\n\n* * *\n\n## 13. 
Optional later: DOI\n\nA DOI is useful if you want the dataset to be cited formally. Hugging Face supports DOIs for datasets and models, but DOI-linked objects are meant to be persistent, so cleanup should come first. (Hugging Face)\n\nWait until:\n\n * README is complete,\n * file structure is stable,\n * metadata is correct,\n * Dataset Viewer works,\n * `load_dataset()` works,\n * you are confident you will not restructure again.\n\n\n\n* * *\n\n# Final checklist\n\n## Do first\n\n * Make `train/ian_portrait_001.jpg`.\n * Move TIFF to `original/ian_portrait_001.tif`.\n * Add `train/metadata.csv`.\n * Replace empty README with a real dataset card.\n * Remove root-level duplicate image files.\n * Test with `load_dataset()`.\n\n\n\n## Key best practices\n\n * Keep TIFF as archival source, not default row.\n * Use a smaller JPG as the normal training image.\n * Add caption, consent, license, and AI-training permission.\n * Be explicit that commercial use and AI training are allowed.\n * Say reuse does not imply endorsement.\n * Do not use a custom dataset builder script.\n * Do not claim one image is a complete training corpus.\n\n\n\n## Short summary\n\n * Your dataset idea is good.\n * The current repo needs cleanup: large default files, two near-duplicate rows, empty README, viewer failure.\n * The best version is simple: `README.md`, `train/metadata.csv`, one training-friendly JPG, and the full TIFF in `original/`.\n * The dataset’s value is **rights clarity + consent + CC0 + easy loading**, not size.\n\n",
"title": "How to edit dataset to train AI?"
}