{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiba3nhcjvr7lvnxrd6efoqmzmiuomebnarupfhp74hj47beq6z6au",
    "uri": "at://did:plc:2jhaz5p2w5y3oosc3zr26cjo/app.bsky.feed.post/3me5n6jxuc7n2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreibvbabdb35okvpcxdjtmigjmditzjyqks32o6wvhynnpexh6vgule"
    },
    "mimeType": "image/jpeg",
    "size": 123127
  },
  "description": "How to find duplicates, missing originals, and silent re-encodes.",
  "path": "/media-file-check/",
  "publishedAt": "2026-02-06T00:22:29.000Z",
  "site": "https://www.yonkeydonkey.blog",
  "tags": [
    "check-media-integrity",
    "FFprobe vs MediaInfo comparison",
    "how Czkawka detects duplicates",
    "dupeGuru duplicate finder",
    "Duplicate-Media-Finder"
  ],
  "textContent": "Your media library can look clean while it rots underneath. Same filenames, different guts. “Edited” copies that quietly replaced the only original. A folder full of videos that play fine until minute 43, then fall apart like cheap alibis.\n\nA media file integrity check is how you stop guessing. You collect facts, not vibes. You prove what’s real, what’s duplicated, what’s missing, and what got re-encoded when you weren’t looking.\n\nYou don’t need a lab. You need a routine. And the nerve to trust evidence over memory.\n\n## Start with a media file truth check\n\n### Before you delete anything.\n\nImportant rule: don’t clean up while you’re still blind. Make a snapshot copy (or at least a read-only mount) so your checking doesn’t become the damage.\n\nStart by separating three different problems that get mixed up on bad days:\n\n  * **Corruption**: the file can’t be decoded all the way through.\n  * **Duplicates**: two files that are the same, or close enough to trick you.\n  * **Silent changes**: same content, different encoding, stripped metadata, new timestamps.\n\n\n\nFor video and audio integrity, `ffmpeg` is the gatekeeper. It doesn’t care about your feelings.\n\n  * Fast decode scan: `ffmpeg -v error -i \"file.mp4\" -f null -`\n  * Stream-focused scan: `ffprobe -v error -show_entries format=duration:stream=codec_name,codec_type,bit_rate,width,height,r_frame_rate -of default=nw=1 \"file.mp4\"`\n\n\n\nIf you want a purpose-built sweep across folders (photos, videos, audio), consider the open-source CLI tool check-media-integrity. It’s a blunt instrument, which is what you want at the start.\n\nNow the trap most people step into: timestamps. Filesystems lie all the time.\n\n  * Copy tools can rewrite `mtime`.\n  * Timezone fixes can shift EXIF dates without changing the pixels.\n  * Cloud sync can “touch” files during conflict resolution.\n\n\n\nSo treat timestamps as hints, not proof. 
If you need metadata you can interrogate, use ExifTool:\n\n  * Quick camera dates: `exiftool -time:all -a -G1 -s \"IMG_1234.JPG\"`\n  * Batch export dates: `exiftool -r -csv -CreateDate -DateTimeOriginal -ModifyDate /path/to/library > dates.csv`\n\n\n\nIf you’re comparing analysis tools, this FFprobe vs MediaInfo comparison lays out the strengths and limits in plain terms. In practice, you’ll use both. `ffprobe` for scriptable truth, MediaInfo for quick human scanning and “what did this come from?” clues.\n\n## Duplicates: byte-for-byte twins, and the ones wearing a mask\n\nSome duplicates are harmless. A second copy on another drive. A mirror for backup. Fine.\n\nThe dangerous ones are the near-duplicates. Same scene, same duration, different encoding. One of them is the original; the other is a second-generation story. Softer edges, crushed shadows, smeared noise. Still watchable. Still wrong.\n\nStart with the cleanest win: cryptographic hashes. If the hashes match, the files are identical. No debate.\n\n  * Create a SHA-256 manifest: `hashdeep -r -c sha256 /path/to/media > SHA256SUMS.txt`\n  * Verify later: `hashdeep -a -k SHA256SUMS.txt -r /path/to/media`\n\n\n\nThen find true duplicates in place:\n\n  * `fdupes -r /path/to/media`\n  * `rdfind -makesymlinks true /path/to/media` (use linking only when you’re confident and backed up)\n\n\n\nIf you prefer a GUI with serious teeth, Czkawka scales well without getting sloppy. Their write-up on how Czkawka detects duplicates is worth reading because it explains why “same name” is a junk signal. Available for Mac, Linux and Windows.\n\nDupeGuru is another solid GUI when you want something calmer and cross-platform: dupeGuru duplicate finder. Also available for Mac, Linux and Windows.\n\nWhen you hit near-duplicates, switch methods. Hashes won’t help because the bytes differ. That’s where perceptual matching comes in, the idea of “looks the same” even if it’s not bit-identical. 
For photos, Czkawka’s similar-image mode can help. For mixed image-and-video sets, this project is a starting point: Duplicate-Media-Finder.\n\nHere’s a quick decision table you can live by:\n\nEvidence you see | What it usually means | What you do next\n---|---|---\nSHA-256 matches | Safe duplicate | Keep one, archive the other, or hardlink\nHash differs, same duration and frame size | Likely re-encode or remux | Compare codec, bitrate, encoder tags\nHash differs, same date/name pattern | Likely edited metadata or moved container | Check EXIF/QuickTime tags, then stream info\nPerceptual match, different resolution/bitrate | Likely export/proxy | Tag it as derivative, don’t replace original\n\n## Missing originals and silent re-encodes\n\n### The quiet damage.\n\nMissing originals don’t announce themselves. They sit behind your “organized” exports and pretend everything’s fine.\n\nYou notice later. Years later. When you want the RAW. When you want the full-quality video. When you want the version before the app “optimized” it.\n\nA practical way to catch missing originals is to define what an original looks like in your world, then scan for gaps.\n\n  * For photos: originals are often `.CR2`, `.NEF`, `.ARW`, `.DNG`, `.RAF` plus the camera JPEG.\n  * For video: originals might be `.MOV` from a phone, `.MP4` from an action cam, or high-bitrate camera files.\n\n\n\nUse file lists to compare basenames. 
Example pattern:\n\nList likely originals:\n\n\n    find /media -type f \\( -iname \"*.cr2\" -o -iname \"*.nef\" -o -iname \"*.arw\" -o -iname \"*.dng\" -o -iname \"*.raf\" \\) -printf \"%f\\n\" | sed 's/\\.[^.]*$//' | sort -u > originals.txt\n\nList exports:\n\n\n    find /media -type f -iname \"*.jpg\" -printf \"%f\\n\" | sed 's/\\.[^.]*$//' | sort -u > exports.txt\n\nThen compare with `comm` (Linux and macOS only). Note that `-printf` is a GNU `find` option, so on macOS the two commands above need GNU findutils (`gfind`):\n\n\n    comm -13 originals.txt exports.txt > exports_without_raw.txt\n\nNow the other threat, the one that ruins archives while you sleep: the silent re-encode.\n\nA remux is just a container swap. The streams stay the same. A re-encode changes the stream. Quality shifts, encoder changes, GOP structure changes, bitrate behavior changes. The duration can remain identical, which is why people get fooled.\n\nInterrogate the file like it owes you money:\n\nContainer and stream facts:\n\n\n    ffprobe -hide_banner -show_format -show_streams -of default=nw=1 \"file.mp4\"\n\nLook for encoder tags with MediaInfo (or JSON output if you script it):\n\n\n    mediainfo \"file.mp4\"\n\nPitfalls you should expect:\n\n  * **Variable bitrate** makes size comparisons unreliable.\n  * **GOP changes** can happen with “smart render” tools, even when the video feels unchanged.\n  * **Metadata stripping** happens when apps rewrite MP4 atoms, so dates vanish or shift.\n  * **Timezone edits** can make two copies look “different” to naïve tools while the content is identical.\n\n\n\nIf your “original” file has an encoder tag like `Lavf` or `HandBrake` and you don’t remember doing that, take it as a confession. Compare it to other copies. Find the cleanest lineage.\n\n## A repeatable audit routine\n\n### And a golden record that doesn’t lie\n\nYou don’t need heroics. You need a routine you can run when tired.\n\n  1. 
**Freeze a working copy** (snapshot, read-only mount, or backup clone).\n  2. **Run corruption checks** on video and audio (`ffmpeg -v error ... -f null -`).\n  3. **Generate checksums** for the golden set (`hashdeep ... > SHA256SUMS.txt`).\n  4. **Find true duplicates** (fdupes/rdfind/Czkawka), then decide what gets removed or linked.\n  5. **Hunt near-duplicates** with perceptual tools, then label derivatives (exports, proxies, social versions).\n  6. **Compare expected originals vs exports** (by extension and basename), then fix gaps while you still can.\n\n\n\nKeep your “golden record” simple, boring, and consistent:\n\n  * Folder structure: `Media/YYYY/YYYY-MM-DD_Event/Device/`\n  * Filenames: `YYYYMMDD_HHMMSS_Device_Sequence.ext` (no spaces, no mystery)\n  * Checksums: one `SHA256SUMS.txt` per event folder, plus an optional `SHA256SUMS.txt.sig` if you sign files\n  * Sidecars: keep edits as `.xmp` for photos, and keep exported videos in an `Exports/` subfolder so they never impersonate originals\n\n\n\nWhen you finish, you’re not just cleaning. You’re putting your name on a record. A media file integrity check is how you tell your future self, “I didn’t guess, I verified.”",
  "title": "A “media file truth” check",
  "updatedAt": "2026-02-06T00:22:29.000Z"
}