Speeding up fdupes

tl;dr: use jdupes

I was merging some fileserver content, and realised I would inevitably end up with duplicates. “Aha”, I thought, “time to use good old fdupes”. Well, yes, except a few hours later fdupes was still only at a few percent. It turns out that running it on a merged mélange of files several terabytes in size is not a speedy process.

Enter jdupes, Jody Bruchon’s fork of fdupes. It’s reportedly many times faster than the original, but that’s only half the story. The key, as with things like Project Euler, is to figure out the smart way of doing things: in this case, running the duplicate search on a subset of files. That might be between photo directories, if you think you might have imported duplicates.

In my case, I care about disk space (I still haven’t got that LTO drive), so restricting the search to files over, say, 50 megabytes seemed reasonable. I could probably have gone higher. Even so, it finished in minutes rather than interminable hours.

jdupes -S -Z -Q -X size-:50M -r ~/storage/

NB: Jody Bruchon makes an excellent point below about the use of -Q. From the documentation:

-Q --quick skip byte-for-byte confirmation for quick matching
WARNING: -Q can result in data loss! Be very careful!

As I was going to manually review (± delete) the duplicates myself, potential collisions were not a huge issue. I would not recommend -Q if data loss is a concern, or if using the automated removal option.
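For anyone wary of -Q, a safer variant of the invocation above simply drops it, keeping the byte-for-byte confirmation at the cost of some speed (~/storage/ is just my tree; substitute your own):

```shell
# Same size-filtered search, but without -Q: every hash match is
# confirmed byte-for-byte, so a collision cannot cause a false match.
jdupes -S -Z -X size-:50M -r ~/storage/
```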

jdupes is in the Arch AUR and in some Debian repositories, but the source code is easy to compile in any case.

4 Replies to “Speeding up fdupes”

  1. A very popular feature request is going to be fulfilled in the next few months: https://github.com/jbruchon/jdupes/issues/38

    “Cache hash information across invocations,” basically a way to keep databases of the file hash data and file metadata so that comparing known files against new additions does not require touching the known files until the final byte-for-byte comparison.

    Also, be really careful with -Q, it does not check the files matched by hash for actual identical contents and could cause data loss in the event of a hash collision. My hash algorithm is not a “secure hash” so collisions are mathematically more likely than with, say, MD5 or SHA-family hashes. Admittedly, my tests show that my hash algorithm does not collide for most of the data I toss at it.

  2. That’s an excellent feature!

    I was thinking about something along those lines when I was originally watching fdupes slowly tick its way through files, but I must confess my thinking was more out of scope: “hey, what if it built a database of files, file sizes and hashes…”, but that is a slightly different tool!

    re: -Q, the warnings are a very good point. In my case I was happy about not doing a byte-for-byte comparison as the matches would be manually reviewed by me anyway, but I would definitely omit it if I was using automated deletion! I’ll edit a warning into the post.

    With the features and speedups you’ve implemented, here’s hoping your fork becomes the standard folks reach for when wanting to find/remove duplicate files. I came across it by searching something along the lines of ‘speed up fdupes’, and your blog post was the top result, so you’re doing well there!

  3. I really tried hard to make fdupes as fast as possible, and I only started “marketing” jdupes in the past year or so. One of the biggest slowdowns in fdupes is that it uses the MD5 hash algorithm, which involves a lot of unnecessary calculation. The hash code is only used to EXCLUDE files, since differing hashes guarantee that files differ; collisions are only a “problem” in that they cause file checking to move along to the next (longer) comparison stage unnecessarily. With a hash that has good distribution, though not necessarily “secure hash” level distribution, collisions are rare enough not to matter. (I’ve refined jodyhash to the point that a “jdupes -d” run never reports more than “0 hash fail”, which is the collision stat in the debug stats.)
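    The hash-as-exclusion-filter idea can be sketched in a few lines of shell (a hypothetical illustration, with md5sum standing in for jodyhash): a mismatched size or hash rules a pair out immediately, and cmp(1) remains the final word on a match, as jdupes does when run without -Q.

```shell
#!/bin/sh
# Sketch only: hashes EXCLUDE candidate pairs; a real match is still
# confirmed byte-for-byte at the end.
maybe_duplicate() {
    a=$1; b=$2
    # Different sizes: guaranteed different, no hashing needed.
    [ "$(wc -c < "$a")" -eq "$(wc -c < "$b")" ] || return 1
    # Different hashes: guaranteed different (md5sum stands in for jodyhash).
    [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ] || return 1
    # Hashes agree: confirm with a full byte-for-byte comparison.
    cmp -s "$a" "$b"
}
```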

    One particular place jdupes is massively faster is when used over SSH. fdupes prints progress to the screen for every single change in progress, and the latency of SSH compared to a local terminal means that fdupes can spend a total of several extra minutes (!!!) blocked from doing any work while it waits for the progress printout to reach the other end and be acknowledged. This was and still is my main way of using Linux machines, so it was a really serious problem. One of the very first optimizations I made was to introduce a counter: instead of printing every time, the display was only updated on every 256th progress event. It was not the best solution, because if 255 little files were scanned and then a huge file was scanned next, the progress indicator could freeze unnecessarily for a pretty long time, but the overhead reduction for comparing even a small number of smaller files was insane! jdupes now uses a “has one second or more passed?” check instead, with a partial file comparison progress indicator if it spins on one file long enough, so the display constantly shows that work is being done. This alleviates the severe overhead problem in fdupes and provides consistent feedback so users don’t think the program has frozen up; I canceled several perfectly fine fdupes runs back in the day because the progress would stick on large files and I thought something was wrong.
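    The once-per-second check can be sketched as a toy bash loop (illustrative only, not jdupes’s actual code; bash’s built-in SECONDS counter stands in for a real clock check):

```shell
#!/bin/bash
# Toy version of time-based progress throttling: redraw at most once
# per second, no matter how many items are processed in between.
last=-1
processed=0
for ((i = 1; i <= 100000; i++)); do
    processed=$i
    if (( SECONDS != last )); then          # a second (or more) has passed
        printf '\rprogress: %d' "$i" >&2    # cheap, infrequent redraw
        last=$SECONDS
    fi
done
printf '\rprogress: %d\n' "$processed" >&2  # final state
```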

    I added the -Z option because if I started a comparison of lots of files and it had done some work but I needed to kill it for a while to process something more important, I got very tired of losing all the matching work that had been done already. I added -Q for obvious reasons. One of the features that was missing in fdupes but that I really wanted was the hard link feature -L, so I took a suggested patch from the fdupes issue tracker and expanded upon it to make it more robust. Hard linking is great for cutting down piles of read-only data and I have found it to be extremely helpful for my Windows driver database since countless drivers include identical libraries and runtime installers (dpinst.exe, Visual C++ and .NET Framework installers, etc.) and I’ve cut down the disk space usage by several GB just by hard linking.
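    The space effect of hard linking is easy to see with plain ln(1) (a manual stand-in for what jdupes -L automates across a whole tree): after linking, both names share one inode, so the payload is stored only once.

```shell
#!/bin/sh
# Mimicking the effect of `jdupes -L` by hand on a pair of duplicates.
tmp=$(mktemp -d)
printf 'identical payload\n' > "$tmp/a.bin"
cp "$tmp/a.bin" "$tmp/b.bin"             # two copies, two inodes
ln -f "$tmp/a.bin" "$tmp/b.bin"          # replace b.bin with a hard link to a.bin
stat -c '%i' "$tmp/a.bin" "$tmp/b.bin"   # same inode number printed twice
```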

    About the database building, that’s exactly what it would be doing! It will be a simple flat text file with size, modify time, partial and full hash, and relative file path. If you want a hash database that uses a more secure hash algorithm, check out md5deep and friends.
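    A rough approximation of such a flat-file index can already be produced with standard tools (an assumption-laden sketch: md5sum stands in for jodyhash, and there is no partial-hash column):

```shell
#!/bin/sh
# One line per file: size, mtime (epoch), hash, relative path.
# GNU find's -printf emits size and mtime; md5sum appends hash and path.
find . -type f -printf '%s %T@ ' -exec md5sum {} \;
```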

  4. That’s some really useful insight into the development process, thank you for taking the time to reply.

    The speed gains you’ve achieved are very impressive. I didn’t realise there was so much potential for improvement. Especially printing to the screen! Getting output right is tricky — particularly as the UNIX philosophy is to produce little or no output — but clearly spending whole minutes blocking on output is pretty inefficient! I guess the [hashing XX%] that gets appended is the partial file comparison progress indicator?

    I lack the mathematical/computational wherewithal to comment on what makes a good hash but as you rightly say they are a precursor to a byte-for-byte comparison anyway (I’d say it has a **high negative predictive value** to put it in terms I’d understand!) so if making it a 64-bit int has a decent speedup that sounds like a clear winner.

    The added features are killer, even without the speedups. Filtering on size is a big one for me, but I hadn’t even considered the hard link aspects! I’ve yet to get around to using *dupes on my backups, which rely heavily on hard linking to stay space efficient for increments; and I’m keen now to try it on Windows as you suggested – my Windows install is on a smallish SSD partition, so space savings there are really important. -Z is a great one too.

    Looking forward to the database feature! I guess when I started thinking down those lines my brain went into full ‘feature-creep’ mode and started going down the path of an automated periodic+new-write scan that could be integrated into the filesystem driver… but going too far down that road just gets you to data deduplication 😛
