Speeding up fdupes

tl;dr: use jdupes

I was merging some fileserver content, and realised I would inevitably end up with duplicates. “Aha,” I thought, “time to use good old fdupes.” Well yes, except a few hours later, fdupes was still only at a few percent. Turns out running it on a merged mélange of files totalling several terabytes is not a speedy process.

Enter jdupes, Jody Bruchon’s fork of fdupes. It’s reportedly many times faster than the original, but that’s only half the story. The key, as with things like Project Euler, is to figure out the smart way of doing things: in this case, the smart way is to look for duplicates in a subset of files. That might mean comparing only your photo directories if you think you might have imported duplicates.

In my case, I care about disk space (still haven’t got that LTO drive), and so restricting the search to files over, say, 50 megabytes seemed reasonable. I could probably have gone higher. Even still, it finished in minutes, rather than interminable hours.

./jdupes -S -Z -Q -X size-:50M -r ~/storage/

NB: Jody Bruchon makes an excellent point below about the use of -Q. From the documentation:

-Q --quick skip byte-for-byte confirmation for quick matching
WARNING: -Q can result in data loss! Be very careful!

As I was going to manually review (± delete) the duplicates myself, potential collisions are not a huge issue. I would not recommend using it if data loss is a concern, or if using the automated removal option.
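To make the two ideas above concrete (only look at big files, and decide whether to bother with a byte-for-byte check), here is a rough Python sketch. It is emphatically not how jdupes works internally: the function names, the SHA-256 hashing and the `confirm` switch are my own stand-ins for illustration, with `confirm=False` playing roughly the role of -Q.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def _sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit in memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_large_duplicates(root, min_size=50 * 1024 * 1024, confirm=True):
    """Return sorted groups of identical files under root of at least min_size bytes.

    Hypothetical helper for illustration only; not part of jdupes or fdupes.
    """
    # Step 1: bucket files by size, skipping symlinks and anything too small.
    # Files of different sizes cannot be duplicates, so this is a cheap filter.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if size >= min_size:
                by_size[size].append(path)

    # Step 2: within each size bucket, group files by content hash.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[_sha256(path)].append(path)
        for group in by_hash.values():
            if len(group) < 2:
                continue
            if confirm:
                # Step 3 (the part a "quick" mode skips): compare each file
                # byte-for-byte against the first one in the group.
                first = group[0]
                group = [first] + [
                    other for other in group[1:]
                    if filecmp.cmp(first, other, shallow=False)
                ]
            if len(group) > 1:
                duplicates.append(sorted(group))
    return sorted(duplicates)
```

Calling `find_large_duplicates(os.path.expanduser("~/storage"))` would list groups of identical files of 50 MiB or more. The size filter is where the big saving comes from: small files never get opened at all, which on a multi-terabyte tree is most of them.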

jdupes is in Arch AUR and some repos for Debian, but the source code is easy to compile in any case.

Wanted: One LTO-4/5/6/7 Drive!

I am something of a digital hoarder. I have files dating back to one of the earliest computers that anyone in my family owned. I think I even still have diskettes for an older word processor, the name of which escapes me at the moment. As such, I have slightly more than average storage requirements.

At present I handle these requirements via a Linux fileserver, using 3TB drives RAID6’d via mdadm. On top of that I use LVM to serve up some volumes for Xen, but that’s not strictly relevant to storage.

Looking at the capacities of LTO makes me quite covetous. LTO tapes are small, capacious and reliable: with a few tapes, I could archive a fair amount of data. I could also move the tapes outside my house, and lo, offline offsite backups!

Sadly, drives are expensive, unless you’re stepping back to relatively small-capacity* LTO-2 drives.

At present, given the cost of drives, some back-of-the-envelope calculations show that for any reasonable** dataset, simply buying hard drives (at time of writing, 3TB is cheapest per GB) is the most cost-effective means of archiving. Given that hard drives are where the focus of development is, I don’t think this is likely to change soon.
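For the curious, the back-of-the-envelope sum looks something like the Python sketch below. The function names are mine, and every price in the example call is a made-up placeholder rather than a real quote; the only thing carried over from the reasoning above is the shape of the calculation, namely that tape has a large one-off cost (the drive) and hard drives do not.

```python
import math


def archive_costs(data_tb, hdd_tb, hdd_price, tape_tb, tape_price, tape_drive_price):
    """Return (hdd_total, tape_total) for archiving data_tb terabytes once.

    Media are bought in whole units, hence the rounding up.
    """
    hdd_total = math.ceil(data_tb / hdd_tb) * hdd_price
    tape_total = tape_drive_price + math.ceil(data_tb / tape_tb) * tape_price
    return hdd_total, tape_total


def break_even_tb(hdd_tb, hdd_price, tape_tb, tape_price, tape_drive_price):
    """Approximate dataset size (TB) above which tape becomes cheaper.

    Ignores whole-unit rounding. Returns None if tape media are not cheaper
    per terabyte than hard drives, since tape then never catches up.
    """
    hdd_per_tb = hdd_price / hdd_tb
    tape_per_tb = tape_price / tape_tb
    if tape_per_tb >= hdd_per_tb:
        return None
    return tape_drive_price / (hdd_per_tb - tape_per_tb)


if __name__ == "__main__":
    # PLACEHOLDER numbers, not real prices: substitute current ones.
    hdd, tape = archive_costs(
        data_tb=12, hdd_tb=3, hdd_price=100,
        tape_tb=1.5, tape_price=30, tape_drive_price=1000,
    )
    print(f"HDD: {hdd}, tape: {tape}")
    print("Break-even (TB):", break_even_tb(3, 100, 1.5, 30, 1000))
```

The point of writing it this way is that the drive price only gets paid back by the per-terabyte saving on media, so the smaller that saving, the more data you need before tape makes sense.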

I’ll just have to wait for a going-out-of-business auction, and hope the liquidators overlook the value of the backup system…

Night Light

Back from hiatus again, really only to comment on how light it is here tonight. I went outside at half past one because I looked out the window and thought someone was shining a light on my garden. It was really that light.

It must just be a reflection of the city lights off the clouds, but normally it doesn’t get quite so light. I mean I checked – I could read outside in the ambient light. And this isn’t me bragging about my super-duper night vision either – I’d had lights on in various rooms before I went outside.

I tried to take some photos to illustrate what I mean but if my camera’s screen is any indication they failed horribly, even on night setting. I mean, my camera is a wee PAS number; this is another of the times I’d like a decent camera. A dSLR with a full range of options on exposure and lenses would be nice, but that way lies madness and spending armfuls of cash.

GU Amnesty

Just to let you guys know that since joining them a month and a bit ago, not only am I involved with the GU Amnesty website as previously mentioned, but we also have a blog now. If you know me, you’ll know I can get passionate about things. Amnesty gives me a way to direct that energy (apoplexy is extremely exothermic) into something constructive. The people there are great too – just as energetic, enthusiastic and go-getting as me, probably even more so. I sense great things ahead.

Protect the Human!

Edit: Fixed link

New Stuff: Gallery2

I’ve got a new toy of sorts – I installed Gallery2 on my server and I’m going to play with it to see how it works. First impressions seem to point towards it being pretty cool. I’ll give you a taste of one of the early images uploaded:

[Image: 29]

This is from Kenny – a demonstration of the Incinerate plasmid in Bioshock. Bioshock is a pretty cool thing too, but you’ll need to wait for my review.

Edit: Link to the gallery isn’t working properly at the moment, but you can find it at http://gallery.roberthallam.org

Spam Begone Update

Three months ago I installed Spam Karma 2 from dr Dave to deal with all the spam comments I was getting. As I commented then, the blog was being inundated with hundreds of spams, so much so that legitimate comments were being buried. In fact, when I recently wrote again on this matter, there had been 262 spam runs, for a total of 5,406 spam comments. There have now been a total of 1,350 spam runs – about five times more. SK2 has let none of the thousands of spam comments through, has flagged up the valid ones, and I’m sure it will continue to do a sterling job in the future.

Thanks again to dr Dave