Automating YouTube Uploads With OCR Part 9: Bringing it All Together

I love it when a plan comes together

We’ve been finding a way to automate YouTube uploads, using tesseract to OCR frames from Deep Rock Galactic videos and extract metadata for the videos on YouTube.

We got to the stage where we have good, useful JSON output that our automated upload tool can work on. Job done? Well, yes- I could point the tool at it and let it work on that, but it would take quite a while. You see, to give a broad test base and plenty of ‘live-fire’ ammunition, I let a backlog of a month’s videos build up.

Automating Metadata Updates

Why is that an issue for an automated tool? The YouTube API by default permits 10 000 units per day of access, and uploading a video costs 1600 units. That limits us to six videos per day max, or five once the costs of other API calls are factored in. So I’d rather upload the videos in the background using the web API, and let our automated tool set the metadata.
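The quota arithmetic can be sketched in a few lines. The 10 000 and 1600 figures come from the paragraph above; the 150-unit overhead per video is an assumed figure for illustration only, standing in for whatever the other API calls add up to.

```python
DAILY_QUOTA = 10_000   # default daily allowance, in units
UPLOAD_COST = 1_600    # cost of uploading one video, in units


def videos_per_day(quota, upload_cost, overhead_per_video=0):
    """Whole videos that fit into the daily quota."""
    return quota // (upload_cost + overhead_per_video)


# Uploads alone: 10000 // 1600
uploads_only = videos_per_day(DAILY_QUOTA, UPLOAD_COST)

# With a hypothetical 150 units of extra calls per video: 10000 // 1750
with_overhead = videos_per_day(DAILY_QUOTA, UPLOAD_COST, overhead_per_video=150)

print(uploads_only, with_overhead)
```

At that rate a backlog of 47 videos would take over a week to clear through the API alone.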

For that we need the videoIds reported by the API. My tool of choice to obtain those was shoogle. I wrapped it in a python script to get the playlistId of the uploads playlist, then grabbed the videoIds of the 100 latest videos, got the fileDetails of those to get the uploaded fileName… and matched that list to the filename of JSON entries.
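The matching step might look something like the sketch below. It leaves out the shoogle calls entirely and assumes they have already produced a mapping of videoId to uploaded fileName. The function name, the `videoId` key added to each entry and the sample ID are all hypothetical.

```python
def match_video_ids(entries, uploaded):
    """Attach a videoId to each JSON entry whose filename was uploaded.

    entries:  list of dicts, each with a "filename" key
    uploaded: dict mapping videoId -> fileName as reported by the API
    Returns (matched, unmatched) lists; the input is not modified.
    """
    id_by_name = {name: video_id for video_id, name in uploaded.items()}
    matched, unmatched = [], []
    for entry in entries:
        video_id = id_by_name.get(entry["filename"])
        if video_id is None:
            unmatched.append(entry)
        else:
            matched.append({**entry, "videoId": video_id})
    return matched, unmatched


# Made-up sample data: one entry has been uploaded, one has not.
entries = [
    {"filename": "2019-10-17 19-41-38.mkv"},
    {"filename": "not-uploaded-yet.mkv"},
]
uploaded = {"hypothetical_id_1": "2019-10-17 19-41-38.mkv"}
matched, unmatched = match_video_ids(entries, uploaded)
```

Keeping the unmatched entries around makes it obvious which videos still need uploading.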

So far so good.

Faster Thumbnails

But one of the personal touches that I like to do, and that will likely not be automated away is to pick a frame from the video for the thumbnail. So I need a way to quickly go through the videos, find a frame that would make a good thumbnail, and add that as a field to thumb for the correct video entry. I’ve used xdotool in the past to speed up some of the more repetitive parts of data entry (if you’ve used AutoHotKey for Windows, it’s similar to that in some ways).

I threw together a quick script to switch to the terminal with vim, go to the filename of the current video in VLC (VLC can expose a JSON interface with current video metadata- the fields I’m interested in are the filename and the current seek position), create a thumb → time entry with the current time and then switch back to VLC. That script can be assigned a key combo in Openbox, so the process is: find frame, hit hotkey, find frame in next video, hotkey, repeat.
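As a rough sketch of the VLC half of that script: the function below pulls the filename and seek position out of an already-parsed status document. The nested layout (`information` → `category` → `meta` → `filename`, plus a top-level `time` in seconds) is my assumption about what VLC’s JSON status looks like, and the HH:MM:SS format for the thumb time is also an assumption, so check both against your own setup.

```python
def thumb_time_from_status(status):
    """Return (filename, "HH:MM:SS") from a parsed VLC status dict.

    Assumes status["time"] is the seek position in whole seconds and
    that the filename sits under information/category/meta.
    """
    filename = status["information"]["category"]["meta"]["filename"]
    seconds = int(status["time"])
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return filename, f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Hand-written sample in the assumed shape, not real VLC output.
sample_status = {
    "time": 754,
    "information": {
        "category": {"meta": {"filename": "2019-10-17 19-41-38.mkv"}}
    },
}
filename, thumb_time = thumb_time_from_status(sample_status)
```

The xdotool side (switching windows and typing into vim) is left out, since it depends entirely on the desktop setup.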

Though the process is streamlined, finding a good frame in 47 videos isn’t the quickest! But the final result is worth it:

We have videos with full metadata, thumbnail and scheduled date/time set.


I included a video that failed OCR due to a missing loading screen (I hit record too late). There’s a handful of those- I found five while doing the thumbnails. I could do a bit of further work and get partial output from the loading/ending screen alone; or I could bite the bullet and do those ones manually, using it as a reminder to hit the record button at the right time!

Automating YouTube Uploads With OCR Part 8: Output

Nearly a working tool!

We’ve been using python and tesseract to OCR frames from video footage of Deep Rock Galactic to extract metadata which we can use for putting the videos on YouTube.


Nearly all of the elements are captured, there’s just the mutators left to capture: warnings and anomalies. These appear in text form on the starting screen on either side of the mission block:

Here we have a Cave Leech Cluster and a Rich Atmosphere.

Since the text of these mutators comes from a known list of ten or fewer for each, we can OCR them using a wide box, then hard-cast the result to whichever known value it has the smallest Levenshtein distance to.
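A minimal sketch of that snapping step, using a hand-rolled Levenshtein distance so it has no dependencies. The function names are mine, and the candidate lists are short illustrative subsets rather than the full in-game lists; only Cave Leech Cluster and Rich Atmosphere come from the screenshot above.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def snap_to_known(ocr_text, candidates):
    """Return the candidate closest to the OCR text, ignoring case."""
    cleaned = ocr_text.strip().lower()
    return min(candidates, key=lambda c: levenshtein(cleaned, c.lower()))


# Illustrative subsets only, not the complete lists.
WARNINGS = ["Cave Leech Cluster", "Low Oxygen", "Parasites"]
ANOMALIES = ["Rich Atmosphere", "Low Gravity", "Gold Rush"]

warning = snap_to_known("Cave Leach Clusler\n", WARNINGS)
anomaly = snap_to_known("Rich Atmosphre", ANOMALIES)
```

One thing to watch: this always returns something, even for an empty box, so a mission with no mutator would need a distance threshold or a separate check.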

Tie-Breaking Frames

The loading/ending frame detection works well for most, but on the odd one or two it suffers. It’s best to ignore the frames which are completely/pretty dark (ie either transition or fade-in), and the ones that are very bright (eg light flash) as that hurts contrast and so hurts OCR.

Using ImageStat from PIL we can grab the frame mean (averaged across RGB values), then normalise it to add to our frame scoring function in the detection routine.

We want to normalise between 0 and 1, which is easy to do if you want to scale linearly between 0 and 255 (RGB max value): just divide the average by 255. But we don’t want that. Manually looking at a few good, contrasty frames it seemed that the value of 75 was the best- even by 150 the frame was looking quite washed out. So we want to have a score of 0 at mean pixel value of 0 and 150; and a score of 1 at mean pixel value of 75:

# Tie break score graph should look something like:
# (tb_val)          
# |    /\            
# |   /  \           
# |  /    \          
# |_/      \_ (x)                
# 0    75    150                
# For sake of argument using 75 as goldilocks value
# ie not too dark, not too bright

75 is thus our ‘goldilocks’ value- not too dark, not too light. So our tiebreak value is:

tb_val = (goldilocks - (abs(goldilocks - frame_mean)))/goldilocks
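Wrapped up as a function, the tie-break might look like the sketch below. The `max()` clamp is my addition: the formula on its own goes negative for means above 150, whereas the graph shows a flat tail at 0. The frame mean is assumed to have been computed already (eg from ImageStat).

```python
GOLDILOCKS = 75  # mean pixel value judged "just right" by eye


def tie_break(frame_mean, goldilocks=GOLDILOCKS):
    """Score a frame from 0 to 1 by how close its mean is to goldilocks.

    Rises linearly from 0 at a mean of 0 to 1 at goldilocks, falls back
    to 0 at twice goldilocks, and is clamped to 0 beyond that.
    """
    score = (goldilocks - abs(goldilocks - frame_mean)) / goldilocks
    return max(0.0, score)


for mean in (0, 37.5, 75, 150, 200):
    print(mean, tie_break(mean))
```

Because the result stays within 0 to 1 it can be added straight onto the existing frame score without swamping it.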


Since we’ve gotten detection of the various elements to where we want them, we can start generating output. Our automated YT uploader works with JSON, and looks for the following fields: filename, title, description, tags, playlists, game, thumb (→ time, title, additional), and scheduled.

Thumb time and additional we can safely ignore. Title is easy, as I use mission_type: mission_name. All of my Deep Rock Galactic uploads go into the one playlist. Tags are a bunch of things like hazard level, minerals, biome and some other common-to-all ones like “Deep Rock Galactic” (for game auto detection). The fun ones are description and scheduled.

Funnily enough, one of my earliest forays into javascript was a mad-libs style page which took the phrases via prompt() and put them in some text.

This was back in the days of IE4, and javascript wasn’t quite what it is today…

For the description, I took a bit of a “mad libs” style approach: use the various bits and pieces we’ve captured with a variety of linking verbs and phrases to give non-repetitive output. This mostly comes down to writing the phrases, sticking them in a bunch of lists and using random.choice() to pick one of them.
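Something along these lines, as a sketch. The phrase lists are invented for illustration (apart from the first of each, which echoes the sample output further down), and the function and parameter names are hypothetical. Passing in the random source makes the output repeatable when testing.

```python
import random

# Hypothetical phrase lists; the real ones are longer.
BRIEFINGS = [
    "get their orders from Mission Control",
    "grab their gear",
    "finish their drinks",
]
ARRIVALS = [
    "get dropped in to the {biome}",
    "head down to the {biome}",
    "take the drop pod to the {biome}",
]


def join_names(players):
    """'A', 'A and B', or 'A, B and C'."""
    if len(players) == 1:
        return players[0]
    return ", ".join(players[:-1]) + " and " + players[-1]


def describe(players, biome, mission_type, mission_name, recorded, rng=random):
    """Assemble a description from randomly chosen phrases."""
    briefing = rng.choice(BRIEFINGS)
    arrival = rng.choice(ARRIVALS).format(biome=biome)
    return (
        f"{join_names(players)} {briefing} and {arrival} "
        f"for {mission_name} ({mission_type})\n\nRecorded on {recorded}"
    )


text = describe(
    ["BertieB", "Costello", "graham"], "Fungus Bogs",
    "Elimination", "Illuminated Pocket", "2019-10-17",
    rng=random.Random(0),
)
```

With three phrases in each list that is already nine combinations; a few more lists and the repeats become rare.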

For obvious reasons, I don’t want to publish fifty-odd videos at once, rather spread them out over a period. I publish a couple of DRG videos on a Monday, Wednesday, Friday and at the weekend. To do this in python, I decided to use a generator, and call next() on it every time we need to populate the scheduled field. The function itself is fairly simple: if the time of scheduled_date is the earlier of the times at which I publish, go to the later one and return the full date; if it’s at the later time, increment by two days (if Monday/Wednesday), or one day and set the time to the earlier one.
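A sketch of such a generator is below. The two publish times are placeholders (12:00 and 18:00 are assumptions, not necessarily the real slots), and unlike the description above this version yields the starting slot first and then advances.

```python
from datetime import datetime, time, timedelta
from itertools import islice

EARLY = time(12, 0)  # assumed earlier publish time
LATE = time(18, 0)   # assumed later publish time


def schedule_slots(start):
    """Yield publish slots forever, starting with `start` itself.

    Two slots per publishing day. After the late slot, Monday and
    Wednesday skip ahead two days; every other day moves on by one.
    A start time that is not EARLY is treated as the late slot.
    """
    current = start
    while True:
        yield current
        if current.time() == EARLY:
            current = datetime.combine(current.date(), LATE)
        else:
            step = 2 if current.weekday() in (0, 2) else 1  # 0=Mon, 2=Wed
            next_day = current.date() + timedelta(days=step)
            current = datetime.combine(next_day, EARLY)


# 2019-11-18 was a Monday.
slots = [
    slot.strftime("%Y-%m-%d %H:%M")
    for slot in islice(schedule_slots(datetime(2019, 11, 18, 12, 0)), 11)
]
```

In the real tool each call to next() on the generator would fill in the scheduled field of one video.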

We run this through json.dumps() and we have output! For example:

{
  "filename": "2019-10-17 19-41-38.mkv",
  "title": "Elimination: Illuminated Pocket",
  "description": "BertieB, Costello and graham get their orders from Mission Control and get dropped in to the Fungus Bogs to take on the mighty Dreadnoughts in Illuminated Pocket (Elimination)\n\nRecorded on 2019-10-17",
  "tags": [
    "Deep Rock Galactic",
    "Fungus Bogs",
    "Hazard 4",
    "Enor Pearl"
  ],
  "playlists": "Deep Rock Galactic",
  "game": "drg",
  "thumb": {
    "title": "Pocket Elimination"
  },
  "scheduled": "2019-11-18 18:00"
}

Looks good!
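For reference, an entry like this could be put together roughly as follows. This is a sketch with hypothetical function and parameter names; it leaves out the thumb block, since how the thumb title is derived isn’t covered here.

```python
import json


def build_entry(filename, mission_type, mission_name, biome, hazard,
                minerals, description, scheduled):
    """Assemble one upload entry in the shape the uploader expects."""
    return {
        "filename": filename,
        "title": f"{mission_type}: {mission_name}",
        "description": description,
        # Common-to-all tag first, then the per-mission ones.
        "tags": ["Deep Rock Galactic", biome, f"Hazard {hazard}", *minerals],
        "playlists": "Deep Rock Galactic",
        "game": "drg",
        "scheduled": scheduled,
    }


entry = build_entry(
    filename="2019-10-17 19-41-38.mkv",
    mission_type="Elimination",
    mission_name="Illuminated Pocket",
    biome="Fungus Bogs",
    hazard=4,
    minerals=["Enor Pearl"],
    description="(generated description goes here)",
    scheduled="2019-11-18 18:00",
)
output = json.dumps(entry, indent=2)
print(output)
```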

Automating YouTube Uploads With OCR Part 3: Programming with pytesseract and pillow

Last time, a bit of investigating showed that with a little cropping, tesseract can give good OCR results on a still of Deep Rock Galactic’s loading screen.

However, we were cropping manually, which defeats the purpose of this exercise, which is to automate metadata generation.

Thankfully, most of the operations we want to do are purely crops, so it’s straightforward to write a basic python script to get tesseract to recognise the right words.

Let’s jump right in with something quick and dirty. The goal here is to get some useful output quickly, so we can confirm that the approach is viable; proper code architecture comes later.

Starting DRG OCR...
['BertieB', 'graham', 'ksyme99']
Collect 15 Apoca Bloom

We got nearly all of what we want from the image, except for the minerals which are pictographs, which tesseract to my knowledge doesn’t handle.

There was one gotcha though. While the mission type (Point Extraction) was handled fine when using the full-sized image, all the crop boxes I tried didn’t manage to OCR the text correctly. If I used a box which included the mission name, it read both okay; so it would have been possible to do a combined OCR and split on newline.
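The combined-box fallback would have been simple enough. Here’s a sketch of the splitting part, working on whatever text the OCR returned; the function name is hypothetical, and which line is the mission type and which is the name depends on the crop, so the order in the example is an assumption.

```python
def split_ocr_lines(ocr_text, expected=2):
    """Split OCR output on newlines, dropping blank lines and padding.

    Raises ValueError if the number of non-empty lines is not what we
    expected, so a bad crop fails loudly instead of mislabelling text.
    """
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    if len(lines) != expected:
        raise ValueError(f"expected {expected} lines, got {len(lines)}: {lines}")
    return lines


# Sample text in the shape OCR tends to return: blank lines and a
# trailing newline included. The mission name here is made up.
combined = "Point Extraction\n\nExample Mission Name\n"
mission_type, mission_name = split_ocr_lines(combined)
```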

One of the techniques to get a more accurate result with tesseract is to surround a small box with a border, which gave the right result:

from PIL import ImageOps
import pytesseract
img_mission_type = ImageOps.expand(img.crop(mission_type_box), border=10, fill="white")
mission_type = pytesseract.image_to_string(img_mission_type)

Our very quick-and-dirty script gets what we’re expecting. The next step is to clean it up and expand our testing base. We can also consider the final output – if we’re giving it a set of images to improve the range it can deal with, we might as well get useful output from it!

We’ll start by adapting it to these nine images. The one at middle bottom might be an issue due to the exhaust from the drop ship changing the contrast quite significantly- either it’ll be made to work or we’ll have to choose a different frame from that video.

Running the script as-is on image 1 (top-left), we get:

Starting DRG OCR...
['graham', 'PR A', 'BertieB']
Collect 25 Hollomite

Not bad, but it’s tripped up on MaoTheCat and added an extra apostrophe to the mission name. Looking at the crop boxes, it seems one’s too high for the middle player, and the mission name box is getting just a tiny bit of the mission icon. Tweaking the boxes, we get:

Starting DRG OCR...
['graham', 'MaoTheCat', 'BertieB']
Collect 25 Hollomite

And the output from the original image remains accurate too. We will continue this process for the rest of the test images and see where it takes us…

Automating YouTube Uploads With OCR Part 2: Getting Started with tesseract

Last time, we decided that Deep Rock Galactic is a game which is ripe for extracting video metadata from, thanks to its beautiful loading screen filled with information:

For OCR we need look no further than tesseract! It’s open source, under development (since 1985 no less!) and easy to install in Arch.

Let’s jump right in and point it at the image above, default settings.

$ tesseract drg-ocr-1.png stdout                                
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 189                              

7 A

Sticktogether and help your fellow dwarves. Getting incapacitated too far away from your team might
‘mean they won' be able to getto you.



' x Collect 15 Apoca Bloom [ErE hhb :: ‘


-1581440568 -1581440568 654 3 2

Oh, er. Now, for an image that’s a still from a video that’s not too bad, actually! It missed the names, classes, and biome, and thinks “Alone!” is “Aloney”; but on the plus side it got the mission type, name, objectives and hazard level.

Not a bad start, and I reckon we can clean that up when we get to actually processing the image with a bit of smarts.

Perhaps using a smaller region would help?

Let’s see:

Detected 34 diacritics


‘ /‘ e I

4Bert1ea j 3 Eraham
DHILLEH b scout

' /xé./f,,
" // II/ s

Eh, sort of? Given we’ve done no processing or cleanup, tesseract isn’t doing terribly.

Let’s make it real easy!

$ tesseract drg-ocr-name-bertieb.jpg stdout


We haven’t done any of the things that can improve tesseract’s accuracy, like image clean up or changing page segmentation mode. Despite that, we’re getting good, usable results from simply cropping.

The next stage is automation!