I thought we did that already
Update: Helps if you actually fix the underlying problem so it doesn’t happen again a week later. D’oh.
As we’ve already seen, one of the nice things about having visualisations for ZFS iostats with grafana is that you can keep an eye on how processes are running:
Did you know that, if you’re lucky, you can also spot problems before they develop into something really unfortunate? Well, around lunchtime, I saw this activity on the array:
Now, since there wasn’t meant to be any activity, especially not anything intense and sustained, this clued me in to something being amiss. The free space dropping from around 850GB to 330GB was also a clue!
Since I recently reinstated some backups (see link above), I figured I should check on borg. Sure enough, there were a couple of processes running. Checking against systemd:
The I/O activity had indeed been going on for about an hour and twenty minutes. I was relatively sure about the snapshot backups, as I’d tested them as a one-off when I sorted backups the other day. The last one, backing up Windows partitions, deserved a closer look though.
Hmm. Now why would backing up an EFI partition, which is generally tiny (hundreds of megabytes), take over an hour?
The script I wrote accessed the partitions using /dev/disk/by-id/, i.e. the WWN identifiers. Rather than using /dev/sd*, which can (but often doesn’t) change across boots, I used a persistent identifier to make sure I got the right disk/partition.
So what went wrong?
Aha! That identifier now points to a 500GB NTFS partition, which is definitely not an EFI partition. The penny dropped: I had moved my Windows install off that disk but not updated the backup script.
Luckily I caught the problem, so I wasn’t bitten by it if it ever came to a restore! Cleanup was simple:
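A persistent identifier only guarantees you get the same physical partition, not that it still holds what you expect. Here is a minimal sketch of a guard the backup script could run first, assuming bash, util-linux's lsblk and blockdev, and a GPT disk. The function names and the 2 GiB size limit are my own illustrative choices, not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch: refuse to back up a partition that doesn't look like an EFI System Partition.

# GPT partition type GUID for an EFI System Partition (defined by the UEFI spec).
ESP_GUID="c12a7328-f81e-11d2-ba4b-00a0c93ec93b"
# Assumed upper bound: ESPs are normally a few hundred MB, so 2 GiB is generous.
MAX_ESP_BYTES=$((2 * 1024 * 1024 * 1024))

# True if the given partition type GUID (any letter case) is the ESP GUID.
is_esp_guid() {
    local guid
    guid=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [ "$guid" = "$ESP_GUID" ]
}

# True if the given size in bytes is positive and no larger than the limit.
is_plausible_esp_size() {
    [ "$1" -gt 0 ] && [ "$1" -le "$MAX_ESP_BYTES" ]
}

# Check a real device: resolve the by-id symlink, then ask lsblk and blockdev.
# Reading the device size generally needs root, like the backup itself.
check_esp_device() {
    local dev guid size
    dev=$(readlink -f "$1") || return 1
    guid=$(lsblk -dno PARTTYPE "$dev") || return 1
    size=$(blockdev --getsize64 "$dev") || return 1
    is_esp_guid "$guid" && is_plausible_esp_size "$size"
}

# Hypothetical usage (the device path is a placeholder):
# check_esp_device /dev/disk/by-id/wwn-EXAMPLE-part1 \
#     || { echo "not an EFI partition, skipping backup" >&2; exit 1; }
```

With a check like this, a 500GB NTFS partition would fail on both the type GUID and the size, and the backup would stop before writing anything.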
borg list --consider-checkpoints /path/to/repo
borg delete /path/to/repo::archive.checkpoint
borg compact /path/to/repo
It ran successfully, though it took a while:
Sorted. Score another point for grafana visualisations!
Edit: Turns out by ‘sorted’ I meant ‘I deleted the problematic backup but didn’t actually fix the script that caused it’.
This addendum brought to you by “why is borg complaining it can’t acquire a lock… because the target is out of space… oh.”
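For the out-of-space half of that addendum, a free-space check before borg starts would at least fail loudly instead of filling the target. This is a rough sketch assuming GNU df. The 50 GiB threshold and the function names are made up for illustration, and /path/to/repo is the same placeholder as above.

```shell
#!/usr/bin/env bash
# Sketch: bail out before the backup if the target is short on space.

# Assumed threshold: require at least 50 GiB free before starting a backup.
MIN_FREE_BYTES=$((50 * 1024 * 1024 * 1024))

# Print the available bytes on the filesystem holding the given path (GNU df).
avail_bytes() {
    df --output=avail -B1 "$1" | tail -n 1 | tr -d ' '
}

# True if the available byte count ($1) is at least the required count ($2).
has_enough_space() {
    [ "$1" -ge "$2" ]
}

# Hypothetical usage:
# free=$(avail_bytes /path/to/repo)
# has_enough_space "$free" "$MIN_FREE_BYTES" \
#     || { echo "backup target low on space, aborting" >&2; exit 1; }
```

It wouldn't have fixed the script, but it would have turned a confusing lock error into an obvious one.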