Speeding up fdupes

tl;dr: use jdupes

I was merging some fileserver content, and realised I would inevitably end up with duplicates. “Aha”, I thought “time to use good old fdupes“. Well yes, except a few hours later, fdupes was still only at a few percent. Turns out running it on a collected merged mélange of files which are several terabytes in size is not a speedy process.

Enter jdupes, Jody Bruchon’s fork of fdupes. It’s reportedly many times faster than the original, but that’s only half the story. The key, as with things like Project Euler is to figure out the smart way of doing things– in this case smart way is to find duplicates on a subset of files. That might be between photo directories if you think you might have imported duplicates.

In my case, I care about disk space (still haven’t got that LTO drive), and so restricting the search to files over, say, 50 megabytes seemed reasonable. I could probably have gone higher. Even still, it finished in minutes, rather than interminable hours.

/jdupes -S -Z -Q -X size-:50M -r ~/storage/

NB: Jdoy Bruchon makes an excellent point below about the use of -Q. From the documentation:

-Q --quick skip byte-for-byte confirmation for quick matching
WARNING: -Q can result in data loss! Be very careful!

As I was going to manually review (± delete) the duplicates myself, potential collisions are not a huge issue. I would not recommend using it if data loss is a concern, or if using the automated removal option.

jdupes is in Arch AUR and some repos for Debian, but the source code is easy to compile in any case.