Better Backups: Decide What You’re Going To Back Up

tl;dr: Picking what you are going to back up (i) helps keep backup space usage minimal and (ii) helps inform the choice of backup program

Following on from picking a backup system in the backups series, now that you’ve picked a system, what exactly should you back up?

You could make the argument that really, what you’re going to back up is part of your requirements gathering. Frequently-changing data (eg documents) is different from a snapshot of a Windows installation is different from an archive of the family photos.

In my case, I want to back up my home directory, which is a mix of things:

  • documents of all sorts
  • code (some mine, some open source tools)
  • application configuration data
  • browser history etc
  • miscellaneous downloads

It totals less than 20GB, most of which is split between downloads, browser data and code (around 3:1:1, according to ncdu). Some things, like documents, code and browser data, will change semi-frequently and old versions are useful; others, like downloads, will stay relatively static, and version history is not so important.
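For the curious, the ratio came from pointing ncdu at the home directory and reading off the per-directory totals; the exact invocation doesn’t matter much, but it would have been something along the lines of:

ncdu -x ~

where -x stops it wandering across filesystem boundaries.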

Some downloads were for a one-off specific purpose and have since been removed. It would be possible to pare things down further by removing more downloads and some code — wine is the largest directory in ~/code/, and I don’t remember the last time I used it — but the savings aren’t enough for me to treat it as a priority.

Is there anything in this set of data that doesn’t need to be kept? Frequently-changing-but-low-utility files like the browser cache are worth excluding, as they will cause the (incremental) backups to grow in size. Incidentally, cache was the next largest item in the ratio above!

Some of the files will change relatively frequently, and I’d like to keep their history. So: I have decided that I want to keep my entire home directory, minus browser cache. This helps inform what I need my backup program to be able to do, and what to do with it once I’ve decided.
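To make that concrete, here is roughly what the exclusion looks like with rdiff-backup, since that’s what I’m running today. The destination path is invented for illustration, and I’m assuming the browsers keep their caches under ~/.cache:

rdiff-backup --exclude "$HOME/.cache" "$HOME" /srv/backups/home

The other candidates can do much the same; borg create accepts --exclude patterns and bup handles exclusions at the bup index stage, so this requirement doesn’t lock me into any particular program.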

Better Backups: Pick a System

You have backups, right?

— SuperUser’s chat room motto


This started out as an intro to bup. Somewhere along the way it underwent a philosophical metamorphosis.

I’m certainly not the first person to say that backups are like insurance: it’s a bit of a hassle to figure out which one will work best, you set it up and forget about it, and hopefully you won’t need it*.

Many moons ago, I had backups taken care of by a simple shell script. Later, this got promoted to a Python script which handled hourly, daily, weekly and monthly rotation, and saved space by using hard links (cp -al ...). It even differentiated between local and remote backups. That was probably my backup zenith, at least when time and effort are factored in.
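For anyone who hasn’t met the hard-link trick: the details of my script don’t matter, but the idea (with invented paths) boils down to something like:

# yesterday's snapshot becomes a hard-linked copy...
cp -al /backups/daily.0 /backups/daily.1
# ...and rsync then refreshes daily.0 in place; only files that actually
# changed are rewritten (and stop sharing an inode), so each snapshot
# costs little more than the changed data
rsync -a --delete /home/user/ /backups/daily.0/

plus a bit of housekeeping to shuffle daily.1 into daily.2 and so on before each run.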

Really, the more sensible approach is to use an existing tried-and-tested solution rather than reinvent the wheel. So I moved to rdiff-backup, and it was good; its simplicity meant I could set up ‘fire-and-forget’ backups via cron. I was able to restore files from backups that I had set up and then forgotten about.
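‘Fire-and-forget’ really did mean a single crontab line. The schedule and paths below are invented for illustration, but it was in this spirit:

# 03:00 every night: back up the home directory to the backup host over ssh
0 3 * * * rdiff-backup /home/user backuphost::/srv/backups/home

Since rdiff-backup keeps the latest state as a plain mirror plus reverse increments, pulling an old file back out later is straightforward.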

With the recent expansion of the fileserver ongoing, now’s a good time to take stock and re-evaluate the options. I have created a Xen DomU dedicated to backups (aptly named pandora) with its own dedicated logical volume. From here, I need to decide:

1) whether to keep going with rdiff-backup or switch to e.g. bup or borg; and
2) whether different machines could use different schedules or approaches (answer: probably), and if so, what those would be (answer: …)

I don’t want to spend too long on this — premature optimisation being the root of all evil — but the aim is to create a backup system which is:

  • robust
  • reliable
  • maintenance-minimal

*: If you do use your backups or insurance a lot, it’s probably a sign that something is going wrong somewhere

Speeding up fdupes

tl;dr: use jdupes

I was merging some fileserver content, and realised I would inevitably end up with duplicates. “Aha”, I thought, “time to use good old fdupes“. Well yes, except a few hours later, fdupes was still only at a few percent. Turns out running it on a freshly merged mélange of files several terabytes in size is not a speedy process.

Enter jdupes, Jody Bruchon’s fork of fdupes. It’s reportedly many times faster than the original, but that’s only half the story. The key, as with things like Project Euler, is to figure out the smart way of doing things; in this case, the smart way is to restrict the duplicate search to a subset of files. That might be between photo directories, if you think you might have imported duplicates.

In my case, I care about disk space (still haven’t got that LTO drive), and so restricting the search to files over, say, 50 megabytes seemed reasonable. I could probably have gone higher. Even still, it finished in minutes, rather than interminable hours.

./jdupes -S -Z -Q -X size-:50M -r ~/storage/

NB: Jody Bruchon makes an excellent point below about the use of -Q. From the documentation:

-Q --quick skip byte-for-byte confirmation for quick matching
WARNING: -Q can result in data loss! Be very careful!

As I was going to manually review (± delete) the duplicates myself, potential collisions are not a huge issue. I would not recommend -Q if data loss is a concern, or if you are using the automated removal option.

jdupes is in the AUR for Arch and in some repos for Debian, but the source code is easy to compile in any case.
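For the compile-it-yourself route, it’s a standard Makefile affair; something like the following should do it (the clone URL is from memory, so check Jody’s site if the project has moved):

git clone https://github.com/jbruchon/jdupes.git
cd jdupes
make
sudo make install    # or just run ./jdupes straight from the build directory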

Browsing MySQL Backups

tl;dr: Seems the quickest way of doing this was to fire up a VM, install mysql-server and mysql-client and browse that way.

I have backups of things. This is important, because as the old adage goes: running without backups is data loss waiting to happen. I’m not sure if that’s the adage, but it’s something resembling what I say to people. I’m a real hit at parties.

I wanted to check the backups of the database powering this blog, as there was a post that I could swear I remembered referring to (iterating over files in bash) but couldn’t find. I had a gzipped dump of the MySQL database, and wanted to check that.

zgrep bash mysql.sql.gz | less was my first thought, but that gave me a huge amount of irrelevant stuff.

A few iterations later I was at zgrep bash mysql.sql.gz | grep -i iterate | grep -i files | grep -v comments, and was none the wiser. I had hoped there was some tool to perform arbitrary queries on dump files rather than going through a proper database server, but that’s basically sqlite, and as far as my limited searching went, such a tool didn’t seem to exist for MySQL.

What I ended up doing was firing up a VM, installing mysql-server and mysql-client and dumping the dump into that server via zcat:

zcat mysql.sql.gz | mysql -u 'root' -p database
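One wrinkle worth noting, from memory: a dump taken without --databases contains no CREATE DATABASE statement, so the (empty) target database has to exist before that import will work. Something like the following does it, with wordpress standing in for whatever the database is actually called (it’s the placeholder database in the command above):

mysql -u root -p -e 'CREATE DATABASE wordpress;'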

And then querying the database:

select post_title, post_date from wp_posts where post_title like '%bash%';
select post_content from wp_posts where post_title like '%terate%';

And the post is back!