fdupes is great — until you can't install it. I built a zero-install duplicate finder.
fdupes, jdupes, rdfind, fclones — the duplicate-file finders are all excellent. They're also all native binaries you have to install first. Which is exactly what you can't do on the box that actually has the duplicate-file problem: the locked-down work laptop, the client's server, the CI runner, the throwaway container, the colleague's machine you're helping debug.
So I built duphunt: a duplicate finder that runs the second you have Node or Python, with nothing to install and no dependencies of its own.
$ npx duphunt ~/Downloads
2 duplicate group(s), 5 files, 8.1 MB reclaimable
4.1 MB × 2 4.1 MB reclaimable
/Users/me/Downloads/invoice.pdf
/Users/me/Downloads/invoice (1).pdf
2.0 MB × 3 4.0 MB reclaimable
/Users/me/Downloads/clip.mp4
/Users/me/Downloads/clip-copy.mp4
/Users/me/Downloads/old/clip.mp4
Groups are sorted biggest-waste-first, so the files worth deleting are right at the top.
How it works
- Group by size. Two files of different sizes can't be byte-identical, so files with a unique size are never even opened.
- Hash the collisions. Within each size group, each file gets a streamed SHA-256 (64 KB chunks — multi-GB files won't blow up memory).
- Report identical content. Files with the same size and the same SHA-256 are reported as byte-for-byte duplicates (an accidental collision is astronomically unlikely). Groups are ranked by reclaimable space.
It reports — it never deletes. You decide what goes.
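The three steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration of the size-then-hash approach, not duphunt's actual source: the names `sha256_of` and `find_duplicates` and the `min_size` parameter are hypothetical, and reclaimable space is assumed to be the file size times the number of extra copies.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 64 * 1024  # 64 KB, the chunk size mentioned above


def sha256_of(path):
    """Stream a file through SHA-256 without loading it whole into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, min_size=1):
    """Return (size, [paths]) groups, sorted biggest waste first.

    Hypothetical helper for illustration; symlinks and files smaller
    than min_size are skipped.
    """
    # Step 1: group by size. Files with a unique size are never opened.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= min_size:
                by_size[size].append(path)

    groups = []
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        # Step 2: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        # Step 3: same hash means same content.
        for same in by_hash.values():
            if len(same) > 1:
                groups.append((size, sorted(same)))

    # Assumed definition: reclaimable = size * (copies - 1).
    groups.sort(key=lambda g: g[0] * (len(g[1]) - 1), reverse=True)
    return groups
```

Calling `find_duplicates(os.path.expanduser("~/Downloads"))` would return the groups in the same biggest-waste-first order as the report above. Like the tool itself, the sketch only reports and deletes nothing.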
Install
npx duphunt . # Node — nothing to install
pip install duphunt # Python — same tool, same results
The two builds (Node and Python) both hash with SHA-256 and produce identical output, so duphunt slots into whatever a given machine already has.
Use it in CI
duphunt assets/ --exit-code # fail the build if duplicate assets sneak in
duphunt . --json # or pipe the groups into your own tooling
A few honest details
- Zero dependencies in both builds: standard library only. A "find my duplicates" tool that pulled in a dependency tree of its own would be a bit much.
- Each physical file is counted once. Repeated or overlapping roots (duphunt ~/a ~/a/b) and symlink aliases are de-duplicated by real path, so they never inflate the numbers, while genuine hard links still surface. (This one took a couple of rounds to get right.)
- Empty files and symlinks are skipped by default (--min-size 0 and --follow if you want them).
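The "counted once" rule can be illustrated with a short sketch that keys every candidate on its resolved path. It is an assumption about how such de-duplication might look, not duphunt's actual code: `unique_files` and its `follow` parameter are hypothetical names, and here `follow` only applies to symlinks that point at files.

```python
import os


def unique_files(roots, follow=False):
    """Yield each physical path once, even when roots repeat or overlap.

    Hypothetical helper: keys on the resolved real path, so a symlink
    alias collapses into its target, while hard links (distinct paths
    to the same data) are still yielded separately.
    """
    seen = set()
    for root in roots:
        # os.walk does not descend into symlinked directories by default.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) and not follow:
                    continue  # symlinks are skipped unless requested
                real = os.path.realpath(path)
                if real in seen or not os.path.isfile(real):
                    continue
                seen.add(real)
                yield real
```

With roots `~/a` and `~/a/b`, every file under `~/a/b` is visited twice but yielded once, so the totals are not inflated. Two hard-linked names resolve to two different real paths, so both still reach the hashing stage and show up as a duplicate group.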
Links
- npm: https://www.npmjs.com/package/duphunt
- PyPI: https://pypi.org/project/duphunt/
- Source: https://github.com/jjdoor/duphunt
What do you reach for to find duplicate files today — and is "I can't install anything on this box" a problem you've hit too? Curious whether anyone would actually gate CI on a duplicate check.