all posts games

Take 2 Claims ‘WZLJHRS’

Jack Howitzer as Jack Howitzer in ‘Jack Howitzer’

I played some GTA V: Online the other night — my three word review: ‘fun but clunky’ — and uploaded the footage of it as I usually do, leaving it as a draft to be later updated with my automation tools.

Later on I saw I had a notification on YouTube and thought “Ah! Someone’s subscribed, or commented, or similar”. Actually, I had a copyright claim from Take 2 Interactive for ‘WZLJHRS’. What?

“There are some visibility restrictions on your video. However, your channel isn’t affected. No one can view this video due to one or more of the Content ID claims below. WZLJHRS: Video cannot be seen or monetized; Blocked in all territories by Take 2 Interactive”

The just under two minute segment in question was a GTA teevee programme (‘Jack Howitzer’, a documentary/mockumentary about a washed up action movie actor) I watched while waiting for my friend to arrive at my office. It had some funny moments.

I am mindful of YouTube’s content ID system, and I mute game music pre-emptively having been bitten in the past by that. I didn’t suspect for a second that a fake TV show in a game would result in an entire video being blocked.

I will have to amend the video and reupload.

PS: WZLJHRS: WZL ? Weazel News network || JHRS ? Jack Howitzer show?

all posts

A Sunlit Kelpie

Cold but bright

It may have been down at around -2°C that morning, but the walk in Falkirk was great. The Kepies are really impressive!

all posts

Automating YouTube Uploads With OCR Part 10: Reflection, Lessons Learned, and Improvements

Every day is a school day

We set out to use OCR to extract metadata from frames of the loading and ending screens of Deep Rock Galactic to use to fill in details of videos destined for YouTube.

In other words we went from:



It’s always good to reflect when you’ve done something. Did it go well, or not as well as expected? What did you hope to achieve? Did you achieve that? What has it changed? There’s as many ways to reflect as there are things to reflect on.

In this project I wanted to achieve a greater degree of automation with my video creation workflow. Partly because it would save me time:

The ever-relevant XKCD (

The other reason is because copying text is no longer the provenance of monks in a scriptorium- it’s a repetitive, uncreative task. I enjoy spending time playing games with my friends, and those videos are there so that they and others can relive and enjoy them too; spending time copying text is not a good use of my time.

However, there’s a more pertinent image for this sort of task:

Pretty much spot on (

There were 47 videos in the test batch. Let’s say that I would have spent five minutes per video copying across the title, writing a description, figuring out the tags and such; doing that manually would have taken 235 minutes, or nearly four hours. That might sound like a lot, but it’s certainly less time than I worked on the automation.

The automatic OCR will have ongoing benefits – there are more videos to process.

But the best part is that I learned. I learned about tesseract and OCR, a bit about OpenCV, and honed my python programming skills.

Lessons Learned

OCR is good enough to extract text from video stills. I assumed this, but it is good to have it confirmed.

Cleaning up images makes a huge difference to OCR accuracy. I could probably have improved detection in the opening image to use just that if I had cleaned up earlier in the process; but using both loading and ending images gives more metadata, so it worked out okay.

It’s really easy to leak file descriptors. Late on, when I went to test with a wider variety of videos, I ran into this issue “OSError: [Errno 24] Too many open files“. Instead of using tempfile.mktemp, which unexpectedly kept the fd, I had to use tempfile.NamedTemporaryFile. That one took a bit of hunting down as it looked like pytesseract was failing, and coincidentally they had a couple of issues in previous versions due to the same issue (mktemp vs NamedTemporaryFile)! Most confusing.

What Would I Do Differently?

Implement automated testing. This would have hugely helped in the refinements stage, where regressions in detection accuracy occurred as I refined. There were a couple of reasons that put me off at the time, but they were more excuses than reasons:

  • this was a “quick and dirty” attempt to get a tool working, refinements to it can come later

    This an old, old excuse; proved false time and again. It’s sometimes phrased as “This is just a temporary fix, will do it properly later” and other variants. What it boils down to is “We’re going to do this the ‘wrong’ way for now, and change it later”.

    It sounds fine, if you actually sort it later, but invariably that doesn’t happen. Time and effort have to be focused somewhere, and it’s a harder sell to redo something that “works” (however hackily) than to implement a new feature, or get a product out the door.

    Here it was even worse: doing that work may well have improved the “quick and dirty” process.
  • the frame extraction + OCR processes aren’t quick, and tests should be quick to run; it’s also hard to break apart the pipeline

    This excuse is on slightly firmer ground, but not by much! It’s true that these things take time, but they can be broken down to components and tested individually using sample images (for example).

    It might not provide the coverage of a real life full data set, but it’ll catch the worst of regressions.

Future Improvements + Directions

Use only a start or end frame if one is missing. At the moment a video is skipped if either the start or end frame is not detected. That leaves the video to be done entirely manually- we could get at least some of the metadata from without the other.

Detect in-game menu screen. For times when I hit the record button too late (or OBS takes too long to spin up), I could go into the menu which has a couple of bits of metadata. I would need to remember to do this, but I usually realise I’ve hit record too late. Combined with the above improvement, we could increase video coverage.

Expand OCR to other games. This is non-trivial but an obvious way to go. Killing Floor 2 is the likeliest next candidate as at the moment it’s the one we play the most and also has metadata to capture.

Consider a further automated pipeline. As it stands, I have to run the program against videos manually; not a big deal. But a tool that detected new videos, automatically runs the OCR tool against them and puts them and the JSON output in a convenient place (± automatically uploading them to YouTube) would make the process more streamlined. This may be beyond my own need or indeed tolerance- I could see it being potentially frustrating if I wanted to manually handle a video differently.

Overall though, I am happy with how the tool turned out.

all posts automation ocr python

Automating YouTube Uploads With OCR Part 6: Automatically Detecting Loading / Ending Screens

We’re using OCR to extract game metadata from Deep Rock Galactic videos. We’re now at the point where if we give our script two images – one of the loading screen, one of the end screen – it does a pretty good job of pulling out the information.

Now we need a way to pull out the images automatically. Since the loading and end screens are a variable distance but close to the start and end of the recording, we need to find the screens rather than rely on a fixed time.

Off the top of my head, two potential approaches spring to mind:

  • use OCR to detect known elements of each screen
  • use trained CV to recognise the screens

While the second option sound fun and interesting, it’s not something I know a lot about. Perhaps we can return to that at some point. For now we’ll try OCR. Start at the beginning and seek forward, and start at the end and seek backwards. For OCR, we want something that will be i) reliably read and ii) doesn’t move, ideally.

End Screen

A couple of elements jump out as possibilities: the “MISSION TIME” string on the right is large and clear; “TOTAL HAZARD BONUS” is also reasonably clear, and the CONTINUE button looks like it is in a fixed position.

Loading Screen

The loading screen is trickier. Most of the elements look dynamic. Of the text, the mission name is probably the clearest, and we do have a list of possible mission names. The player names and classes are there- I should be in all of my videos, and we can also test for the presence of “DRILLER/SCOUT/GUNNER/ENGINEER” somewhere in the top quarter of the image.

Quick OCR

The approach used sets a start time, an end time and a step, generates frames for those, then OCR’s the frames and scores them based on what is present.

In the case of the loading screen, recognised players names and classes are scored, and then the frame with the highest score is picked. A score of 0 means the detection has failed, for example if the time period in the video does not contain a loading/ending screen.

This approach has the advantage of picking the best frame; though it is slower than picking the first acceptable frame. Minimising runtime isn’t crucial here however, as uploading the video takes orders of magnitude longer than the time to run the script. We could exit as soon as a frame scores 8 (four players names and four classes)

In the case of the ending screen, we match on “MISSION TIME:”, “TOTAL HAZARD BONUS” and “CONTINUE”, each word here scoring 1 point. Here, because the elements are known in advance and should always be present, we can have an early exit for a frame that scores the maximum of 6.

Putting Detection and OCR Together to Test

Our previous version took images to work on as arguments, which was fine when we were testing, but now we’re testing videos, so the code needs tweaked to handle that.

Throwing a bunch of videos at it, showed a couple of issues. One video had OCR fail on the mission name, so I tweaked the box and applied a bunch of enhancements (grayscale, posterize, invert, autocontrast, border) to get the text OCR’d correctly.

I also changed my name checking list to have the expected version of the names, rather than lower case, for the purposes of doing some Levenshtein distance checks. This led to some name combinations not being detected, so the any() logic needed changed:

if any(n in [name.lower() for name in names] for n in namecheck):


if any(name.lower() in [n.lower() for n in namecheck]                                                       
               for name in names):

Also, remember a few paragraphs when I said “time taken doesn’t really matter”, well it does when you’re making changes and retesting! When I set up the list of 10 videos to collect output from, I had to do other things a few times. As ever, the truth can be found in xkcd:

“My OCR is running!”

The output we got with minimal changes is pretty good:

                      file                                  names       mission_type                       biome    hazard        mission_name               minerals
0  2019-10-13 21-41-44.mkv  [BertieB, graham, MaoTheCat, ksume99]  Mining Expedition  Radioactive Exclusion Zone  Hazard 3          Open Trick  [Umanite, Enor Pearl]
1  2019-10-13 22-08-16.mkv           [BertieB, graham, MaoTheCat]  Mining Expedition              Glacial Strata  Hazard 3     Purified Legacy     [Umanite, Magnite]
2  2019-10-14 20-06-55.mkv                    [BertieB, Costello]  Salvage Operation                  Magma Core  Hazard 4     Unhealthy Wreck      [Magnite, Croppa]
3  2019-10-14 20-54-49.mkv  [BertieB, graham, Noobface, Costello]   Point Extraction                   Salt Pits  Hazard 4        Rapid Pocket   [Bismor, Enor Pearl]
4  2019-10-14 21-16-24.mkv          [BertieB, Costello, Noobface]  Salvage Operation                   Salt Pits  Hazard 4          Angry Luck      [Umanite, Bismor]
5  2019-10-15 18-04-10.mkv                    [BertieB, Costello]           Egg Hunt                 Fungus Bogs  Hazard 4      Ranger's Prize        [Croppa, Jadiz]
6  2019-10-15 18-26-15.mkv            [BertieB, Costello, graham]  Mining Expedition              Glacial Strata  Hazard 4     Second Comeback        [Jadiz, Bismor]
7  2019-10-17 19-14-51.mkv              [eVNS, BertieB, Costello]           Egg Hunt  Radioactive Exclusion Zone  Hazard 4       Colossal Doom  [Umanite, Enor Pearl]
8  2019-10-17 19-41-38.mkv            [BertieB, Costello, graham]           Egg Hunt                 Fungus Bogs  Hazard 4  Illuminated Pocket  [Magnite, Enor Pearl]
9  2019-10-17 20-13-07.mkv            [BertieB, Costello, graham]           Egg Hunt         Crystalline Caverns  Hazard 4         Red Oddness      [Umanite, Bismor]

There’s a couple foibles: ksyme99 is detected as ksume99 in 0, and there’s a spurious detection of ‘eVNS’ in 7. This suggests name detection could be improved, though recall we weren’t able to hard-cast the output as there’s the possibility of unknown player names. However, we can use our good friend Levenshtein distance to fix off-by-one-character issues like the above.

for name in names:                                                                                      
    if name in namecheck:  # Already good!                                                              
        for known_name in namecheck:                                                                    
            if distance(name, known_name) <= 2:                                                         
                names.remove(name)  # remove the 'bad name'                                             
                names.append(known_name)  # add the known good one

Anything with a Levenshtein distance of 1 or 2 gets clamped to a known player name. This sort of optimisation is very helpful if you have a set of regulars that you play with, but less so if every game is with different people.

This gets us some decent output! The spurious detection is an issue, and one that could be mitigated by some careful DSP. But the output is usable, so we’ll move on to the next step: integrating with our existing workflow!

all posts automation computer vision ocr programming python

Automating YouTube Uploads With OCR Part 4: Exploring Approaches To Improve Detection

My path in the woods diverged, and I took them all

We’ve been seeing if we can apply OCR to the loading screen of Deep Rock Galactic to generate metadata for YouTube uploads for automation.

Last time, we got a quick-and-dirty script that would pull out the various parts of one image successfully. Now we’d like to do that for any given loading screen- any number of dwarves, hazard level, biome, level mutators (which the original image lacked).

We picked nine loading screens to expand our detection to:

The results are mixed:

Starting DRG OCR...
             file                          names       mission_type    mission_name                       biome    hazard                                objective
0  tests/drg1.png   [graham, MaoTheCat, BertieB]           EGG HUNT     DEFECT CELL                  MAGMA CORE  HAZARD 3     COLLECT 6 EGGS\nCollect 25 Hollomite
1  tests/drg2.png                  [&l, [T, @&3]               IR e      OPEN TRICK  RADIOACTIVE EXCLUSION ZONE  HAZARD 3  (COLLECT 225 MORKITE\nCollect 10 Fossil
2  tests/drg3.png       [BertieB, L), MaoTheCat]    INING EXPEDITI!  URIFIED LEGAC)              GLACIAL STRATA  HAZARD 3   COLLECT 250 MORKITE\nCollect 10 Fossil
3  tests/drg4.png               [T, 3 Oz!, o\no]     VAGE OPERATION    HEALTHY WREC                  MAGMA CORE   LLrZtl]         SR RTINS\nCollect 15 Apoca Bloom
4  tests/drg5.png                [o383, (o383, ]        LT X g (o))    RAPID POCKET                    SALTPITS    HAZARD                  COLLECT 10 AQUARQS\n(=R
5  tests/drg6.png  [BertieB, Costello, Noobface]  SALVAGE OPERATION      ANGRY LUCK               DENSE BIOZONE    HAZARD             NIV T\nCollect 20 Boolo Cap.
6  tests/drg7.png            [®29, @&28, T VL R]                      ANGER’S PRIZE                 FUNGUS BOGS    HAZARD     COLLECT 6 EGGS\nCollect 25 Hollomite
7  tests/drg8.png         [IR A )], Costello, T]  MINING EXPEDITION    BRIGHT JEWEL              GLACIAL STRATA    HAZARD            (eI VRS\nCollect 20 Boolo Cap
8  tests/drg9.png             [. ®29, (o], I ‘4]                          HIOELR DY  RADIOACTIVE EXCLUSION ZONE    LLYZU]     COLLECT 6 EGGS\nCollect 20 Boolo Cap

or in image form:

The mission type was a source of issue before for text detection, but looking at the generated crop boxes, it seems text is getting cut off, which will also affect the mission name detection as they are presented together.

When we started this, I knew the number of players would have an impact on the locations of the text for the player names. However, given only up to four players can play at once, it wouldn’t be too bad to write detection for the four possibilities. But if other text is moving, that gets messy very quickly.

We have a couple of options at this point:

  • enlarge the detection boxes for the longest/biggest text we have in the examples and see if that works across all of them
  • think about using something like OpenCV to do text ROI (region of interest) detection (eg as pyimagesearch does it)

The first seems like it could be done quicker than the second, so we’ll give that a try first. We’re still in the “what approach works” stage (aka the quick-and-dirty stage) here!

Unfortunately, the approach wasn’t quite successful. It’s possible that the particular frames we picked from each video had an impact, but that’s not something we can easily test around with our current setup. Let’s see about adding OpenCV to the mix…


We’re going to reuse the approach taken by Adrian on pyimagesearch as the work has been done for us, and see where that gets us.


Well, the short answer is: not as far as I had hoped!

The boxes it detects on a full image detects either too little or too much, though the latter could probably be helped by some video pixel averaging to blur the background and keep the text crisp. However it also splits on non-word boundaries. All of these problems can be worked around, but perhaps there’s another approach we can add to the mix?

Another Image

As well as a start screen, there’s also an end screen:

Another successful mission!

The information is presented slightly differently, but importantly i) it presents the info more uniformly ii) background noise looks like less of an issue. Let’s put this one through the paces we did for the loading screen.

Overall naive OCR pulls out names well but misses about everything else. Mission name: yes. Mission type: nope. Minerals: yes. Promising! Heck, we could even pull out mission time and total hazard bonus if we wanted.

Let’s put OpenCV on the back burner for the time being, and see what a combined approach using two images gets us.

             file                          names       mission_type                       biome      hazard     mission_name                     minerals
0  tests/drg1.png   [graham, MaoTheCat, BertieB]           EGG HUNT                  MAGMA CORE  HAZARD 3 -       OPEN TRICK                 F ATl\n\nEL]
1  tests/drg2.png                  [&l, [T, @&3]       INING EXPEDI  RADIOACTIVE EXCLUSION ZONE  HAZARD 3 -  PURIFIED LEGACY              RGN AL\n\n48 17
2  tests/drg3.png       [BertieB, L), MaoTheCat]  MINING EXPEDITIO|              GLACIAL STRATA  HAZARD 3 -  UNHEALTHY WRECK    MAGNITE 3 CROPPA\n\n39 -3
3  tests/drg4.png               [T, 3 Oz!, o\no]        AL ol 2N ()                  MAGMA CORE    HAZARD 4     RAPID POCKET       2 nli) |2 el 1T\n\n3 4
4  tests/drg5.png                [o383, (o383, ]   POINT EXTRACTION                    SALTPITS    HAZARD 4       ANGRY LUCK               BISMOR UMANITE
5  tests/drg6.png  [BertieB, Costello, Noobface]  SALVAGE OPERATIO|               DENSE BIOZONE    HAZARD 4         I E Vi S             Tt AL v4\n\n3} 8
6  tests/drg7.png            [®29, @&28, T VL R]                                    FUNGUS BOGS    HAZARD 4  CECOND COMEBACK               S 6syTel) fivd
7  tests/drg8.png         [IR A )], Costello, T]  MINING EXPEDITION              GLACIAL STRATA    HAZARD 4    COLDSSAL DOOM       [IChley [ (e\n\n169 48
8  tests/drg9.png             [. ®29, (o], I ‘4]                     RADIOACTIVE EXCLUSION ZONE    HAZARD 4   TIRTIY TN T (3  COSLINCL IR MAGNITE\n\nX 3]

Improvement! We’re getting somewhere now, and we’ll see what we can do to clean the rest of it up using two images as a basis.

all posts

Services, Servers, DomUs and Containers, Oh My!

What a tangled web we weave…

I was confused. Staring at a console window, wondering how I’d installed a program†. The system package manager knew nothing about it, pip pretended like it never heard of it, and I hadn’t downloaded and compiled the source.

I’ve never really had what anyone would call a sane management approach to my servers and services. The closest I’ve got is using Xen as a hypervisor and trying to separate DomUs (VM guests) by service type, backed by LVM-on-mdraid storage*. That sounds alright in theory, but in practice most services have tended to congregate on the largest guest, turning that DomU into a virtualised general purpose server. A server that mixes and matches the system package manager, other package managers like pip, SteamCMD and manual installs.

Pug fugly, in other words.

Pug-fugly, as coined in Pgymoelian

* the storage also sounds good in theory but the implementation has led me to dub the machine the ‘Frankenserver’ (more on that another time)

This is fairly typical. You need to do X. Not only that, but you need to do it RIGHT NOW. Software ABC does that. So you download it from whatever source seems most convenient and up-to-date, glance at the ‘Quickstart guide’ (while saying thank goodness for those!), do a bit of minimal configuration and then you’re up and running, doing X.

But what’s the big deal? The service[s] work, after all.

The issue is a general one: it takes time to figure out the setup before you can usefully interact with it.

This doesn’t just apply to DevOps; but to coding, writing, maintenance and repair, DIY, cooking, house management.

Or more simply: Fail to plan, plan to fail.

I found that it was taking me time to get my head around:

  •  what I was dealing with
  • how it had been set up
  • why it wasn’t working
  • how to fix it
  • how to update it

I’d do those things, sort whatever, get the service working again, then six months later I’d have to figure it all out again.

“What a way to run a railroad…”

Clearly, there must be a better way.

In fact there are several better ways, depending on what you want to do. DevOps is huge business, and scales into the multinational megacorp range. But the home user can benefit too. There are clear benefits in using a well-organised system for pretty much anything, and managing servers, services and other applications is no exception. Used well, maintainability, security, reliability are all enhanced.

But how does one get started? There are plenty to choose from. Some folks I know love Docker, others opt for LXD on LXC (those options are not exclusive). There are also the configuration management tools, like Puppet, Ansible, Chef (etc).

Well, I briefly used Docker in the past, and now have it on one of the DomU guests, hosting a few services I used to run elsewhere. This seems like reason enough to dip my toes deeper in the waters and move more services to containers, or at least to automated processes.

It’s rarely glamorous, but writing good documentation can make a huge difference to the person that follows you. Even when that person is you.

As an aside, the other key ingredient other than having good systems in place is to have good documentation.

For example, before writing this up I wondered about installing a new spam plugin. I used to use Spam Karma 2, but that’s been unmaintained for a long while. But which one? Well, seems I’ve used Akismet and Anti Spam Bee in the past, but why did I stop using them? I have a vague recollection of the former re-moderating old comments and declaring them spam, and the latter not working in some way, but what?

Good documentation, make it your non-New Year’s Resolution.

So the take-home message here is that ad-hoc setup pop up and stick around for longer than they should; don’t do that, have a good system instead and document what you’re doing and why.

Because it’s good to have a goal, my aim is to get low-hanging fruit services moved over to Docker in the first instance (heh) to learn more about the tech. Fore there I can decide what I can move to containers, and maybe even see if LXC would fit my needs anywhere. I’d also like to see if I can apply this to the wee tools I write myself to help automate my workflows- rather than running them manually, perhaps I can develop them as services. And while doing all of this documenting what I am doing and why.

Then maybe one day I won’t have to ask “I have a (python-based) program installed that doesn’t seem to have been installed either by apt or pip, and obviously I can’t remember… is there any way to figure out how I actually installed it? :D”.

beets, for music library organisation/tagging/management

Featured image by steve gibson on Flickr

all posts

[solved] MySQL: Can’t find file ‘./db/table.frm’ (errno: 13)

tl;dr If you’re seeing this and the table does exist- check (and fix) permissions!

I was searching my backups for a database file which contained the entries for an old ranty blog I used to have before I cancelled the domain.

Lo and behold I had a named, dated .tgz file. Unusually, it contained a backup of the MySQL directory structure; rather than a mysqldump‘d set of SQL queries to reconstruct the databases and tables. No matter, I copied the database directory into /var/lib/mysql/db.

Browsing via the command-line interface indicated the database was present (SHOW DATABASES) and usable (USE db). So I tried to SELECT post_title FROM ranty_posts LIMIT 5. But no can do:

Can’t find file: ‘./db/ranty_posts.frm’ (errno: 13)

The problem is permissions, the file is there, slightly-misleading error message notwithstanding. Fortunately, it’s an easy fix- give mysqld the ability to read the files, eg:

# chown mysql:mysql /var/lib/mysql/db/ -R

Which will change the user and group ownership to mysql.

Database and table names changed to protect the innocent

all posts linux

[Fixed] MySQL: Table is marked as crashed and last (automatic?) repair failed (+ WordPress)

tl;dr: run myisamchk on the problematic table

I’ve run into the following error in my Apache error.log recently:

Table 'database.tablename' is marked as crashed and last (automatic?) repair failed

Fortunately the fix is simple: run myisamchk on the table which is marked as crashed:

$ sudo su
# service mysql stop
# cd /var/lib/mysql/databasename
# myisamchk -r tablename
MyISAM-table 'tablename' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f)
 option or by not using the --quick (-q) flag
# myisamchk -r -o -f tablename
Data records: 107435
Found block that points outside data file at 16166832
# service mysql start

I’ve run into these errors before due to running out of disk space on the (admittedly tiny) VPS I had.

I also had this problem with a WordPress database able, causing the often-seen and unhelpfully terse:

Error establishing a database connection

Interestingly, this wasn’t getting bounced to error.log, and I had to use the WordPress database repair screen to track down which one needed the fix (which was the same myisamchk).

All sorted now!

all posts linux misc

Speeding up fdupes

tl;dr: use jdupes

I was merging some fileserver content, and realised I would inevitably end up with duplicates. “Aha”, I thought “time to use good old fdupes“. Well yes, except a few hours later, fdupes was still only at a few percent. Turns out running it on a collected merged mélange of files which are several terabytes in size is not a speedy process.

Enter jdupes, Jody Bruchon’s fork of fdupes. It’s reportedly many times faster than the original, but that’s only half the story. The key, as with things like Project Euler is to figure out the smart way of doing things– in this case smart way is to find duplicates on a subset of files. That might be between photo directories if you think you might have imported duplicates.

In my case, I care about disk space (still haven’t got that LTO drive), and so restricting the search to files over, say, 50 megabytes seemed reasonable. I could probably have gone higher. Even still, it finished in minutes, rather than interminable hours.

/jdupes -S -Z -Q -X size-:50M -r ~/storage/

NB: Jdoy Bruchon makes an excellent point below about the use of -Q. From the documentation:

-Q --quick skip byte-for-byte confirmation for quick matching
WARNING: -Q can result in data loss! Be very careful!

As I was going to manually review (± delete) the duplicates myself, potential collisions are not a huge issue. I would not recommend using it if data loss is a concern, or if using the automated removal option.

jdupes is in Arch AUR and some repos for Debian, but the source code is easy to compile in any case.

all posts linux

Compressing Teamspeak 3 Recordings Using sox

tl;dr: Loop through the files in bash, sox them to FLAC


I’ve been combining fileserver contents recently, and I came across a little archive of Teamspeak 3 recordings:

$ du -sh .
483G /home/robert/storage/media/ts_recordings/


I wrote a quick-and-dirty script to convert the files:


total=$(ls *.wav|wc)
ls *.wav | while read file; do
        sox -q ${file} ${file%.*}.flac
        if [ -e ${file%.*}.flac ]; then
                if ! [ -s {file%.*}.flac ]; then
                        rm ${file}
                        echo "${file%.*}.flac is zero-length!"
                echo "Failed on ${file}"

        if  ! ((n % 10 )); then
                echo "${n} of ${total}"

The script checks that the FLACs replacing the WAVs exist and are not zero-length before removing the original.

This was fine, but after finishing, I was still left with a bunch of uncompressed files in RF64 format, which unfortunately errored.

It turns out sox 14.4.2 added RF64 read support, so I grabbed that on my Arch machine, and converted the few remaining files (substituting wav ? rf64 twice in the script above.

The final result?

$ du -sh .
64G /home/robert/storage/raid6/media/ts_recordings/

400 gigs less space and still lossless? Ahh, much better.