Automating YouTube Uploads With OCR Part 10: Reflection, Lessons Learned, and Improvements

Every day is a school day

We set out to use OCR to extract metadata from frames of the loading and ending screens of Deep Rock Galactic, and to use that metadata to fill in the details of videos destined for YouTube.

In other words we went from:

To:

Why?

It’s always good to reflect when you’ve done something. Did it go well, or not as well as expected? What did you hope to achieve? Did you achieve that? What has it changed? There are as many ways to reflect as there are things to reflect on.

In this project I wanted to achieve a greater degree of automation with my video creation workflow. Partly because it would save me time:

The ever-relevant XKCD (https://xkcd.com/1205/)

The other reason is that copying text is no longer the province of monks in a scriptorium; it’s a repetitive, uncreative task. I enjoy spending time playing games with my friends, and those videos are there so that they and others can relive and enjoy them too; spending time copying text is not a good use of my time.

However, there’s a more pertinent image for this sort of task:

Pretty much spot on (https://xkcd.com/1319/)

There were 47 videos in the test batch. Let’s say that I would have spent five minutes per video copying across the title, writing a description, figuring out the tags and such; doing that manually would have taken 235 minutes, or nearly four hours. That might sound like a lot, but it’s certainly less time than I worked on the automation.

The automatic OCR will have ongoing benefits – there are more videos to process.

But the best part is that I learned. I learned about tesseract and OCR, a bit about OpenCV, and honed my python programming skills.

Lessons Learned

OCR is good enough to extract text from video stills. I assumed this, but it is good to have it confirmed.

Cleaning up images makes a huge difference to OCR accuracy. I could probably have improved detection in the opening image to use just that if I had cleaned up earlier in the process; but using both loading and ending images gives more metadata, so it worked out okay.

It’s really easy to leak file descriptors. Late on, when I went to test with a wider variety of videos, I ran into this issue: “OSError: [Errno 24] Too many open files”. Instead of using tempfile.mktemp, which unexpectedly kept the fd, I had to use tempfile.NamedTemporaryFile. That one took a bit of hunting down as it looked like pytesseract was failing, and coincidentally they had a couple of issues in previous versions due to the same issue (mktemp vs NamedTemporaryFile)! Most confusing.
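As a rough illustration of how this kind of leak can arise, here is a minimal sketch. It assumes the descriptor came from tempfile.mkstemp, which returns an already-open OS-level file descriptor alongside the path (tempfile.mktemp, by contrast, only returns a name). The helper names are hypothetical, and this is not the code from the tool itself.

```python
import os
import tempfile


def leaky_temp_path():
    """Return a temp path but forget the open descriptor: one fd leaks per call."""
    fd, path = tempfile.mkstemp(suffix=".png")
    return path  # fd is never closed, so repeated calls eventually hit Errno 24


def safe_temp_path():
    """Return a temp path after closing the descriptor that mkstemp opened."""
    fd, path = tempfile.mkstemp(suffix=".png")
    os.close(fd)
    return path


def write_frame(data: bytes) -> str:
    """Write bytes to a named temp file; the context manager closes the handle."""
    # delete=False is an assumption here: it keeps the file around for a later
    # OCR step, so the caller is responsible for removing it.
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

Calling leaky_temp_path() once per extracted frame is enough to exhaust the per-process descriptor limit on a long batch, which is why the error only showed up with a wider set of videos.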

What Would I Do Differently?

Implement automated testing. This would have hugely helped in the refinements stage, where regressions in detection accuracy occurred as I refined. There were a couple of reasons that put me off at the time, but they were more excuses than reasons:

  • this was a “quick and dirty” attempt to get a tool working, refinements to it can come later

    This is an old, old excuse, proved false time and again. It’s sometimes phrased as “This is just a temporary fix, will do it properly later” and other variants. What it boils down to is “We’re going to do this the ‘wrong’ way for now, and change it later”.

    It sounds fine, if you actually sort it later, but invariably that doesn’t happen. Time and effort have to be focused somewhere, and it’s a harder sell to redo something that “works” (however hackily) than to implement a new feature, or get a product out the door.

    Here it was even worse: doing that work may well have improved the “quick and dirty” process.
  • the frame extraction + OCR processes aren’t quick, and tests should be quick to run; it’s also hard to break apart the pipeline

    This excuse is on slightly firmer ground, but not by much! It’s true that these things take time, but they can be broken down to components and tested individually using sample images (for example).

    It might not provide the coverage of a real life full data set, but it’ll catch the worst of regressions.
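As an illustration of testing one component in isolation, here is a sketch of a fast regression test for the text-matching step, which needs no video and no OCR run. The edit_distance and closest_choice helpers are hypothetical stand-ins written for this example (the tool itself uses the Levenshtein package), and the garbled strings are taken from raw OCR output shown in Part 5.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_choice(detected: str, choices: list[str]) -> str:
    """Return the choice with the smallest case-insensitive edit distance."""
    return min(choices, key=lambda c: edit_distance(c.lower(), detected.lower()))


MISSION_TYPES = ["Mining Expedition", "Egg Hunt", "Salvage Operation",
                 "Point Extraction", "Elimination"]


def test_garbled_mission_types():
    """Known-bad OCR strings should still land on the right mission type."""
    assert closest_choice("ALVAGE OPERATION", MISSION_TYPES) == "Salvage Operation"
    assert closest_choice("> miNiNG ExPeDITIBN", MISSION_TYPES) == "Mining Expedition"
    assert closest_choice("| EGGHUNT", MISSION_TYPES) == "Egg Hunt"
```

A test like this runs in milliseconds, so it can be run after every tweak; the slow frame extraction and OCR stages would need their own tests against a handful of saved sample images.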

Future Improvements + Directions

Use only a start or end frame if one is missing. At the moment a video is skipped if either the start or end frame is not detected. That leaves the video to be done entirely manually, when we could get at least some of the metadata from one frame without the other.

Detect in-game menu screen. For times when I hit the record button too late (or OBS takes too long to spin up), I could go into the menu which has a couple of bits of metadata. I would need to remember to do this, but I usually realise I’ve hit record too late. Combined with the above improvement, we could increase video coverage.

Expand OCR to other games. This is non-trivial but an obvious way to go. Killing Floor 2 is the likeliest next candidate as at the moment it’s the one we play the most and also has metadata to capture.

Consider a further automated pipeline. As it stands, I have to run the program against videos manually; not a big deal. But a tool that detects new videos, automatically runs the OCR tool against them, and puts them and the JSON output in a convenient place (± automatically uploading them to YouTube) would make the process more streamlined. This may be beyond my own need, or indeed tolerance: I could see it being frustrating if I wanted to handle a video differently by hand.

Overall though, I am happy with how the tool turned out.

Using Discourse Dev with Traefik (without ‘Bad Gateway’ + ‘blocked host’)

tl;dr:

  • Traefik grabs the first port it sees, which on the dev image is 1080; we want port 9292. Use --label=traefik.http.services.discourse-dev.loadBalancer.server.port=9292
  • You need to set a dev host using an env var in the container: -e DISCOURSE_DEV_HOSTS=your_dev_hostname \

With the dev version of Discourse working, I wanted to let its connectivity be managed by the traefik proxy. But whichever way I sliced it, I would get a Bad Gateway error. The usual suspect for this is not setting a port, or having the service on a different network from traefik itself. However, this issue persisted for me.

I had to add the following to (discourse_source_root)/bin/docker/boot_dev, in the docker run ... section:

    --network=traefik_default \
    --label=traefik.port=80 \
    --label=traefik.docker.network=traefik_default \
    --label=traefik.http.routers.discourse-dev.rule=Host\(\`$DEVHOST\`\) \
    --label=traefik.http.services.discourse-dev.loadBalancer.server.port=9292 \

I set DEVHOST=<my dev host> earlier in the file, or you can use the host there directly. The last line points traefik at the correct port (9292) in the discourse-dev container.

Accessing by host then produces a page with a blocked host error:

Blocked host: discourse_dev_host
To allow requests to discourse_dev_host, add the following to your environment configuration:
config.hosts << “discourse_dev_host”

Setting DISCOURSE_DEV_HOSTS permits access on those hosts. We need to do this in the container, so add the following to the same section in the same file:

-e DISCOURSE_DEV_HOSTS=$DEVHOST \

Which permits access via that (or those) hostname(s).

[solved] ‘Connection closed’ in Discourse Dev Install

tl;dr: this was a temporary issue solved by a later commit. If you checked out discourse after 28 Oct but before 4 Nov, git pull to update.


Having installed a production Discourse forum, I wanted to get a local dev instance up and running for testing.

There are good instructions for doing just that using Docker. Don’t do what I did: follow the production install method and assume that will work by pointing the prod hostname at it in /etc/hosts.

Unfortunately, when I followed the instructions to set up the dev instance, I was greeted with an ‘Unable to connect’ screen (ERR_FAILED). Even using telnet from the same host failed:

bertieb@ubunutu-vm:~/discourse$ telnet 127.0.0.1 9292
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Connection closed by foreign host.

Dang. I tried this across fresh Arch and Ubuntu Server (19.10 + 18.04.3 LTS) installs and got the same thing.

Installing the non-Docker version worked but only for localhost; then a comment on that guide’s topic pointed me at a recent change to interface binding. Checking out the commit before that change let me connect from other hosts in both the Docker and non-Docker versions.

As of 2019-11-04 a later commit sorted this issue and added a specific flag (-b) for permitting connections from other hosts.

Automating YouTube Uploads With OCR Part 9: Bringing it All Together

I love it when a plan comes together

We’ve been finding a way to automate YouTube uploads using tesseract to OCR frames from Deep Rock Galactic videos to extract metadata used in the video on YouTube.

We got to the stage where we have good, useful JSON output that our automated upload tool can work on. Job done? Well, yes- I could point the tool at it and let it work on that, but it would take quite a while. You see, to give a broad test base and plenty of ‘live-fire’ ammunition, I let a backlog of a month’s videos build up.

Automating Metadata Updates

Why is that an issue for an automated tool? The YouTube API by default permits 10 000 units per day of access, and uploading a video costs 1600 units. That limits us to six videos per day max, or five once the costs of other API calls are factored in. So I’d rather upload the videos in the background using the web API, and let our automated tool set the metadata.
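The quota arithmetic can be checked with a few lines of Python. The 10 000 and 1600 figures are the ones quoted above; the per-video overhead of 100 units is a hypothetical number chosen only to show how the limit drops from six to five.

```python
DAILY_QUOTA = 10_000   # default daily units, as stated above
UPLOAD_COST = 1_600    # units per video upload, as stated above


def uploads_per_day(overhead_per_video=0):
    """Whole videos that fit in one day's quota, given extra units spent per video."""
    return DAILY_QUOTA // (UPLOAD_COST + overhead_per_video)


def days_to_clear(backlog, per_day):
    """Days needed to upload a backlog at a fixed daily rate (ceiling division)."""
    return -(-backlog // per_day)
```

With no overhead, uploads_per_day() gives 6 (9600 units, leaving 400 spare); with the hypothetical overhead, uploads_per_day(100) gives 5, and days_to_clear(47, 5) shows the 47-video backlog would take 10 days through the API alone.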

For that we need the videoIds reported by the API. My tool of choice to obtain those was shoogle. I wrapped it in a python script to get the playlistId of the uploads playlist, then grabbed the videoIds of the 100 latest videos, got the fileDetails of those to get the uploaded fileName… and matched that list to the filename of JSON entries.
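The matching step at the end is a dictionary lookup. The sketch below assumes the API responses have already been flattened into a list of dicts with videoId and fileName keys; that shape, the function name and the sample ID in the usage note are all hypothetical, and the shoogle calls themselves are left out.

```python
def match_video_ids(uploads, entries):
    """Attach a videoId to each JSON entry whose filename matches an upload.

    uploads: list of dicts with 'videoId' and 'fileName' keys (assumed shape)
    entries: list of dicts with a 'filename' key, as produced by the OCR tool
    Returns (matched, unmatched) so that misses can be handled by hand.
    """
    ids_by_filename = {u["fileName"]: u["videoId"] for u in uploads}
    matched, unmatched = [], []
    for entry in entries:
        video_id = ids_by_filename.get(entry["filename"])
        if video_id is None:
            unmatched.append(entry)
        else:
            matched.append({**entry, "videoId": video_id})
    return matched, unmatched
```

For example, an upload {"videoId": "abc123", "fileName": "2019-10-17 19-41-38.mkv"} would be paired with the JSON entry for that recording, while entries with no corresponding upload come back in the second list.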

So far so good.

Faster Thumbnails

But one of the personal touches that I like to do, and that will likely not be automated away, is to pick a frame from the video for the thumbnail. So I need a way to quickly go through the videos, find a frame that would make a good thumbnail, and add that as a field to thumb for the correct video entry. I’ve used xdotool in the past to speed up some of the more repetitive parts of data entry (if you’ve used AutoHotKey for Windows, it’s similar to that in some ways).

I threw together a quick script to switch to the terminal with vim, go to the filename of the current video in VLC (VLC can expose a JSON interface with current video metadata- the ones I’m interested in are the filename and the current seek position), create a thumb → time entry with the current time and then switch back to VLC. That script can be assigned a key combo in Openbox, so the process is: find frame, hit hotkey, find frame in next video, hotkey, repeat.

Though the process is streamlined, finding a good frame in 47 videos isn’t the quickest! But the final result is worth it:

We have videos with full metadata, thumbnail and scheduled date/time set.

Glorious.

I included a video that failed OCR due to a missing loading screen (I hit record too late). There’s a handful of those- I found five while doing the thumbnails. I could do a bit of further work and get partial output from the loading/ending screen alone; or I could bite the bullet and do those ones manually, using it as a reminder to hit the record button at the right time!

Automating YouTube Uploads With OCR Part 8: Output

Nearly a working tool!

We’ve been using python and tesseract to OCR frames from video footage of Deep Rock Galactic to extract metadata which we can use for putting the videos on YouTube.

Mutators

Nearly all of the elements are captured, there’s just the mutators left to capture: warnings and anomalies. These appear in text form on the starting screen on either side of the mission block:

Here we have a Cave Leech Cluster and a Rich Atmosphere.

Since the possible text of these mutators is limited to a list of ten or fewer for each, we can detect them using a wide box, then hard-cast the result to whichever known value has the smallest Levenshtein distance.

Tie-Breaking Frames

The loading/ending frame detection works well for most, but on the odd one or two it suffers. It’s best to ignore the frames which are completely or nearly dark (ie either transition or fade-in), and the ones that are very bright (eg light flash), as that hurts contrast and so hurts OCR.

Using ImageStat from PIL we can grab the frame mean (averaged across RGB values), then normalise it to add to our frame scoring function in the detection routine.

We want to normalise between 0 and 1, which is easy to do if you want to scale linearly between 0 and 255 (RGB max value): just divide the average by 255. But we don’t want that. Manually looking at a few good, contrasty frames it seemed that the value of 75 was the best- even by 150 the frame was looking quite washed out. So we want to have a score of 0 at mean pixel value of 0 and 150; and a score of 1 at mean pixel value of 75:

# Tie break score graph should look something like:
# (tb_val)          
# |    /\            
# |   /  \           
# |  /    \          
# |_/      \_ (x)                
# 0    75    150                
#                   
# For sake of argument using 75 as goldilocks value
# ie not too dark, not too bright

75 is thus our ‘goldilocks’ value- not too dark, not too light. So our tiebreak value is:

tb_val = (goldilocks - (abs(goldilocks - frame_mean)))/goldilocks
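As a small sketch, the tie-break can be wrapped in a function. Note that the formula as written goes negative once the frame mean passes 150; clamping it to 0 is my assumption here, made to match the flat tails in the graph above. The function name is hypothetical, and frame_mean is assumed to be the per-band means from PIL's ImageStat.Stat(image).mean averaged together.

```python
def tiebreak_score(frame_mean, goldilocks=75.0):
    """Score a frame's brightness: 1.0 at the goldilocks mean, 0.0 at 0 and at 2x.

    frame_mean is the mean pixel value on a 0-255 scale.
    """
    raw = (goldilocks - abs(goldilocks - frame_mean)) / goldilocks
    # Assumption: very bright frames are floored at 0 instead of being allowed
    # to pull the overall frame score down with a negative value.
    return max(0.0, raw)
```

So a frame with a mean of 75 scores 1.0, one at 37.5 scores 0.5, and frames at 0, 150 or 255 all score 0.0.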

Output

Since we’ve gotten detection of the various elements to where we want them, we can start generating output. Our automated YT uploader works with JSON, and looks for the following fields: filename, title, description, tags, playlists, game, thumb (→ time, title, additional), and scheduled.

Thumb time and additional we can safely ignore. Title is easy, as I use mission_type: mission_name. All of my Deep Rock Galactic uploads go into the one playlist. Tags are a bunch of things like hazard level, minerals, biome and some other common-to-all ones like “Deep Rock Galactic” (for game auto detection). The fun ones are description and scheduled.

Funnily enough, one of my earliest forays into javascript was a mad-libs style page which took the phrases via prompt() and put them in some text.

This was back in the days of IE4, and javascript wasn’t quite what it is today…

For the description, I took a bit of a “mad libs” style approach: use the various bits and pieces we’ve captured with a variety of linking verbs and phrases to give non-repetitive output. This mostly comes down to writing the phrases, sticking them in a bunch of lists and using random.choice() to pick one of them.
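A stripped-down sketch of the idea follows. Only the first phrase in each list comes from the example output in this post; the alternatives, the function names and the rng parameter (there so the choice can be seeded) are invented for illustration, and the real tool has more phrase lists than this.

```python
import random

# First entry in each list is from the example output; the rest are made up.
ORDERS = [
    "get their orders from Mission Control and get dropped in to",
    "strap in to the drop pod and head for",
    "grab their pickaxes and descend into",
]
GOALS = {
    "Elimination": ["take on the mighty Dreadnoughts", "hunt down some Dreadnoughts"],
}
DEFAULT_GOALS = ["get the job done"]


def join_names(players):
    """Format names as 'A', 'A and B', or 'A, B and C'."""
    if len(players) == 1:
        return players[0]
    return ", ".join(players[:-1]) + " and " + players[-1]


def build_description(players, biome, mission_type, mission_name, recorded_on,
                      rng=random):
    """Assemble a description from randomly chosen linking phrases."""
    orders = rng.choice(ORDERS)
    goal = rng.choice(GOALS.get(mission_type, DEFAULT_GOALS))
    return (f"{join_names(players)} {orders} the {biome} to {goal} "
            f"in {mission_name} ({mission_type})\n\nRecorded on {recorded_on}")
```

Passing rng=random.Random(0) makes the output repeatable, which is handy when comparing runs; left at the default, each video gets a fresh combination.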

For obvious reasons, I don’t want to publish fifty-odd videos at once, rather spread them out over a period. I publish a couple of DRG videos on a Monday, Wednesday, Friday and at the weekend. To do this in python, I decided to use a generator, and call next() on it every time we need to populate the scheduled field. The function itself is fairly simple: if the time of scheduled_date is the earlier of the times at which I publish, go to the later one and return the full date; if it’s at the later time, increment by two days (if Monday/Wednesday), or one day and set the time to the earlier one.
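The generator described above might look something like this sketch. The 18:00 slot matches the example output below; the 20:00 slot and the function name are assumptions, and the start date is expected to sit on the earlier slot of a publishing day.

```python
from datetime import datetime, timedelta

EARLY_HOUR = 18  # matches the "scheduled" value in the example output
LATE_HOUR = 20   # hypothetical later slot


def schedule_slots(start):
    """Yield publish times forever: two per day on Mon, Wed, Fri, Sat and Sun."""
    current = start
    while True:
        yield current.strftime("%Y-%m-%d %H:%M")
        if current.hour == EARLY_HOUR:
            # Earlier slot used: move to the later slot on the same day.
            current = current.replace(hour=LATE_HOUR)
        else:
            # Later slot used: skip Tuesday/Thursday after Monday/Wednesday,
            # otherwise just move to the next day, back at the earlier slot.
            days = 2 if current.weekday() in (0, 2) else 1
            current = (current + timedelta(days=days)).replace(hour=EARLY_HOUR)
```

Starting from datetime(2019, 11, 18, 18, 0), a Monday, successive next() calls give Monday 18:00 and 20:00, then Wednesday, Friday, Saturday and Sunday in the same pattern before wrapping round to the following Monday.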

We run this through json.dumps() and we have output! For example:

{
  "filename": "2019-10-17 19-41-38.mkv",
  "title": "Elimination: Illuminated Pocket",
  "description": "BertieB, Costello and graham get their orders from Mission Control and get dropped in to the Fungus Bogs to take on the mighty Dreadnoughts in Illuminated Pocket (Elimination)\n\nRecorded on 2019-10-17",
  "tags": [
    "Deep Rock Galactic",
    "DRG",
    "PC",
    "Co-op",
    "Gaming",
    "Elimination",
    "Dreadnought",
    "Fungus Bogs",
    "Hazard 4",
    "Magnite",
    "Enor Pearl"
  ],
  "playlists": "Deep Rock Galactic",
  "game": "drg",
  "thumb": {
    "title": "Pocket Elimination"
  },
  "scheduled": "2019-11-18 18:00"
}

Looks good!

Adding Discourse to a mix of nginx-hosted sites [How]

I set up a Discourse server today. It was pleasantly straightforward. The official docs work well enough, though there are a few things I did:

  • integrated with existing nginx sites by cribbing from this guide (short version: forward Discourse requests to a socket)
  • set up email delivery via MailJet- their admin interface makes getting credentials and setting up + verifying SPF and DKIM records simple
  • set up certbot to generate LetsEncrypt certs (thanks Arch wiki) and get HTTPS rolling (bonus: https for existing sites for free!)
  • added SSO for GitHub and Discord (short version: create the applications on the respective sites; OAuth support for these is baked in)
  • typoed DenyUsers as DenyUser, locking myself out of ssh access

Maybe skip the last one if you’re doing the install yourself?

Automating YouTube Uploads With OCR Part 7: Refinement

In which things aren’t quite as smooth as they seem

We’ve been using OCR to extract metadata from Deep Rock Galactic loading and end screens, and we’re at the stage of doing most of that automatically.

I was quite pleased with the progress we’ve made. But as so often is the case, I went to demo the program and it failed. The mission name seemed to be an issue. I had a look and the image the end_screen_detector had supplied was very dark- it managed to find one on a fade out; though this was too dark for decent OCR.

Improving Detection

To solve that, I used ImageStat to pull out the mean pixel value, and eventually settled on giving the image an extra point for its score if it was over a threshold. For those paying close attention, this also means that I had to change my “no image found” logic to trigger on values <= 1, as a bright frame with no elements matched would score 1. Acceptable.

A Bit of Visual Debugging

However, I continued to run into issues with the mission name. After debugging some individual images, I decided to simply show() the region that was being fed to tesseract across a number of input videos:

As I’ve said before: “ah”, and “hmm”. It seems as though the mission text can move, after all. It’s not enough that a cursory glance would see it – maybe ~10-15 pixels – but enough to completely butcher the text for OCR purposes.

I’m not sure if there’s a way to know a priori in which position the name will be – assuming there are only two possibilities, for now – so we’ll have to do something similar to what we did for the player names, which can also move.

Using Confidence and Pixel Nudging

For the names we relied on detecting a known name, but perhaps there’s a better approach here?

image_to_data Returns result containing box boundaries, confidences, and other information

This looks like what we want! We’ll send two regions (for now) to pytesseract, and take the one with the highest confidence; then send that text to be hard-cast to the known possible mission names. I say “for now” as if there are two ways that we can see, there may be more that crop up through testing, so I am writing this to work with any number of mission name regions.
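The selection step itself is small. In this sketch each candidate region has already been reduced to a (text, confidence) pair; how those pairs are pulled out of pytesseract's image_to_data output is left out, and the function name and the confidence numbers in the usage note are hypothetical.

```python
def best_region(candidates):
    """Pick the text from the candidate region with the highest confidence.

    candidates: list of (text, confidence) pairs, one per region tried.
    Returns None if every region came back empty.
    """
    usable = [(text.strip(), conf) for text, conf in candidates if text.strip()]
    if not usable:
        return None
    # Works for any number of regions, not just the two found so far.
    return max(usable, key=lambda pair: pair[1])[0]
```

For example, best_region([("RANGER'S PRIZE", 91.0), ("ANGER'S PRIZ", 40.0)]) returns "RANGER'S PRIZE", which would then go on to be hard-cast against the known mission names.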

A few pixels difference can have a huge impact on accuracy!

Much better! Although Ranger’s Prize seems to have flipped over to using the worse of the two, it still is detected as “Ranger’s Prize” by OCR, so I’ll let that pass for now.

We could use the ‘best confidence’ approach to improve the detection of names, and help reject over-detections. It could probably also be applied to mission types:

Looks like there’s also a bit of movement on the mission type. Let’s see if we can improve things here too. While the detection rate is good – only error is the Elimination being cast to Egg Hunt – we can improve this which helps when we expand to even more videos.

Much better.

I also ripped out the guts of name matching so that it could use the confidence method for those. They took a little more thinking about because of an extra dimension (n name regions to look up instead of one region) but that was changed over and retains the Levenshtein distance of <= 2 for clamping OCR output to a name.

Law of Unintended Consequences

However, that unearthed a side-effect: the confidence of 2 name detection will always be higher than 4 name detection, even when there are 4 names as the middle two names occupy the same space for two players and four players. So instead of using the max value for confidence, we can apply a threshold instead, and if the detection for 4 names is over that threshold we can apply that.
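One way to express that rule is sketched below: prefer the layout with the most names, as long as its confidence clears a threshold. The threshold of 60 and the function name are hypothetical, and the fallback to the highest-confidence layout when nothing clears the bar is my own assumption.

```python
def choose_name_layout(conf_by_layout, threshold=60.0):
    """Choose how many name boxes to trust.

    conf_by_layout maps a player count (e.g. 2 or 4) to the OCR confidence
    obtained with that layout's boxes.
    """
    # Try the layouts with the most names first, since the 2-name boxes will
    # score well even when 4 players are present.
    for count in sorted(conf_by_layout, reverse=True):
        if conf_by_layout[count] >= threshold:
            return count
    # Assumption: if nothing clears the threshold, take the most confident one.
    return max(conf_by_layout, key=conf_by_layout.get)
```

With {2: 96.0, 4: 70.0} this picks the 4-name layout even though the 2-name confidence is higher; with {2: 96.0, 4: 26.0} it falls back to 2.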

Digging deeper, I found that the confidence varied hugely between the 4-name and 2-name detectors for the same word in the same position:

... 'nametext1': 'graham', 'nameconf1': 26.0 ...
... 'nametext0': 'graham', 'nameconf0': 96.0 ...

Same text, same position, *slightly* different bounding box by a few pixels. It’s possible that the 2-player name boxes are off by a few pixels (though if that was the case, the 4-player detection should be better, not worse!); but it’s more likely that my own box drawing had differences between the two scenarios, as that was a manual process.

As noted: a few pixels can make a huge difference in confidence and accuracy of OCR.

Automating YouTube Uploads With OCR Part 6: Automatically Detecting Loading / Ending Screens

We’re using OCR to extract game metadata from Deep Rock Galactic videos. We’re now at the point where if we give our script two images – one of the loading screen, one of the end screen – it does a pretty good job of pulling out the information.

Now we need a way to pull out the images automatically. Since the loading and end screens are a variable distance but close to the start and end of the recording, we need to find the screens rather than rely on a fixed time.

Off the top of my head, two potential approaches spring to mind:

  • use OCR to detect known elements of each screen
  • use trained CV to recognise the screens

While the second option sounds fun and interesting, it’s not something I know a lot about. Perhaps we can return to that at some point. For now we’ll try OCR. Start at the beginning and seek forward, and start at the end and seek backwards. For OCR, we want something that will be i) reliably read and ii) doesn’t move, ideally.

End Screen

A couple of elements jump out as possibilities: the “MISSION TIME” string on the right is large and clear; “TOTAL HAZARD BONUS” is also reasonably clear, and the CONTINUE button looks like it is in a fixed position.

Loading Screen

The loading screen is trickier. Most of the elements look dynamic. Of the text, the mission name is probably the clearest, and we do have a list of possible mission names. The player names and classes are there- I should be in all of my videos, and we can also test for the presence of “DRILLER/SCOUT/GUNNER/ENGINEER” somewhere in the top quarter of the image.

Quick OCR

The approach used sets a start time, an end time and a step, generates frames for those, then OCR’s the frames and scores them based on what is present.

In the case of the loading screen, recognised player names and classes are scored, and then the frame with the highest score is picked. A score of 0 means the detection has failed, for example if the time period in the video does not contain a loading/ending screen.

This approach has the advantage of picking the best frame, though it is slower than picking the first acceptable frame. Minimising runtime isn’t crucial here however, as uploading the video takes orders of magnitude longer than the time to run the script. We could exit as soon as a frame scores 8 (four player names and four classes).

In the case of the ending screen, we match on “MISSION TIME:”, “TOTAL HAZARD BONUS” and “CONTINUE”, each word here scoring 1 point. Here, because the elements are known in advance and should always be present, we can have an early exit for a frame that scores the maximum of 6.
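A sketch of the end-screen scoring with the early exit is below. It works on text that has already been OCR'd, so frame extraction and the tesseract call are out of scope; the function names and the (timestamp, text) input shape are assumptions for the example.

```python
END_SCREEN_WORDS = ["MISSION", "TIME", "TOTAL", "HAZARD", "BONUS", "CONTINUE"]


def score_end_screen(ocr_text):
    """One point for each expected end-screen word found in the OCR text."""
    found = set(ocr_text.upper().replace(":", " ").split())
    return sum(1 for word in END_SCREEN_WORDS if word in found)


def pick_best_frame(frames, max_score=len(END_SCREEN_WORDS)):
    """Return (timestamp, score) for the best frame, or None if detection failed.

    frames: iterable of (timestamp, ocr_text) pairs in seek order.
    """
    best = None
    for timestamp, text in frames:
        score = score_end_screen(text)
        if best is None or score > best[1]:
            best = (timestamp, score)
        if score == max_score:
            break  # early exit: every expected word is present
    if best is None or best[1] == 0:
        return None  # a score of 0 means no end screen in this time window
    return best
```

Because frames can be a generator, the early exit also saves the OCR work for every frame after the first perfect one.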

Putting Detection and OCR Together to Test

Our previous version took images to work on as arguments, which was fine when we were testing, but now we’re testing videos, so the code needs tweaked to handle that.

Throwing a bunch of videos at it showed a couple of issues. One video had OCR fail on the mission name, so I tweaked the box and applied a bunch of enhancements (grayscale, posterize, invert, autocontrast, border) to get the text OCR’d correctly.

I also changed my name checking list to have the expected version of the names, rather than lower case, for the purposes of doing some Levenshtein distance checks. This led to some name combinations not being detected, so the any() logic needed changed:

if any(n in [name.lower() for name in names] for n in namecheck):

Became:

if any(name.lower() in [n.lower() for n in namecheck]                                                       
               for name in names):

Also, remember a few paragraphs ago when I said “time taken doesn’t really matter”? Well, it does when you’re making changes and retesting! When I set up the list of 10 videos to collect output from, I had to do other things a few times. As ever, the truth can be found in xkcd:

“My OCR is running!”

The output we got with minimal changes is pretty good:

                      file                                  names       mission_type                       biome    hazard        mission_name               minerals
0  2019-10-13 21-41-44.mkv  [BertieB, graham, MaoTheCat, ksume99]  Mining Expedition  Radioactive Exclusion Zone  Hazard 3          Open Trick  [Umanite, Enor Pearl]
1  2019-10-13 22-08-16.mkv           [BertieB, graham, MaoTheCat]  Mining Expedition              Glacial Strata  Hazard 3     Purified Legacy     [Umanite, Magnite]
2  2019-10-14 20-06-55.mkv                    [BertieB, Costello]  Salvage Operation                  Magma Core  Hazard 4     Unhealthy Wreck      [Magnite, Croppa]
3  2019-10-14 20-54-49.mkv  [BertieB, graham, Noobface, Costello]   Point Extraction                   Salt Pits  Hazard 4        Rapid Pocket   [Bismor, Enor Pearl]
4  2019-10-14 21-16-24.mkv          [BertieB, Costello, Noobface]  Salvage Operation                   Salt Pits  Hazard 4          Angry Luck      [Umanite, Bismor]
5  2019-10-15 18-04-10.mkv                    [BertieB, Costello]           Egg Hunt                 Fungus Bogs  Hazard 4      Ranger's Prize        [Croppa, Jadiz]
6  2019-10-15 18-26-15.mkv            [BertieB, Costello, graham]  Mining Expedition              Glacial Strata  Hazard 4     Second Comeback        [Jadiz, Bismor]
7  2019-10-17 19-14-51.mkv              [eVNS, BertieB, Costello]           Egg Hunt  Radioactive Exclusion Zone  Hazard 4       Colossal Doom  [Umanite, Enor Pearl]
8  2019-10-17 19-41-38.mkv            [BertieB, Costello, graham]           Egg Hunt                 Fungus Bogs  Hazard 4  Illuminated Pocket  [Magnite, Enor Pearl]
9  2019-10-17 20-13-07.mkv            [BertieB, Costello, graham]           Egg Hunt         Crystalline Caverns  Hazard 4         Red Oddness      [Umanite, Bismor]

There are a couple of foibles: ksyme99 is detected as ksume99 in 0, and there’s a spurious detection of ‘eVNS’ in 7. This suggests name detection could be improved, though recall we weren’t able to hard-cast the output as there’s the possibility of unknown player names. However, we can use our good friend Levenshtein distance to fix off-by-one-character issues like the above.

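from Levenshtein import distance

clamped = []
for name in names:
    if name not in namecheck:  # Already-good names pass straight through
        for known_name in namecheck:
            if distance(name, known_name) <= 2:
                name = known_name  # swap the 'bad name' for the known good one
                break
    clamped.append(name)
names = clamped  # rebuilt, as removing from a list while looping over it skips entries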

Anything with a Levenshtein distance of 1 or 2 gets clamped to a known player name. This sort of optimisation is very helpful if you have a set of regulars that you play with, but less so if every game is with different people.

This gets us some decent output! The spurious detection is an issue, and one that could be mitigated by some careful DSP. But the output is usable, so we’ll move on to the next step: integrating with our existing workflow!

Automating YouTube Uploads With OCR Part 5: Refinements and Improving Accuracy

Having limited output possibilities helps immensely

We’ve been using pytesseract to help us OCR screens in Deep Rock Galactic to get metadata for YouTube uploads.

Last time we explored a number of approaches to get the output on the right track. We settled on using a second image from the end screen which had clearer text to augment the processing.

Colour Inversion

Let’s see if we can improve that further with box refinements and what the tesseract wiki suggests.

Yes:

             file                          names             mission_type                       biome      hazard        mission_name                      minerals
0  tests/drg1.png   [graham, MaoTheCat, BertieB]               1 EGG HUNT                  MAGMA CORE  HAZARD 3 -          OPEN TRICK       .ENDH PEARL UMANITE\n98
1  tests/drg2.png                  [&l, [T, @&3]      > miNiNG ExPeDITIBN  RADIOACTIVE EXCLUSION ZONE  HAZARD 3 -     PURIFIED LEGACY         ‘ MAGNITE UMANITE\n17
2  tests/drg3.png       [BertieB, L), MaoTheCat]        MINING EXPEDITION              GLACIAL STRATA  HAZARD 3 -     UNHEALTHY WRECK          ‘ MAGNITE CROPPA\n41
3  tests/drg4.png               [T, 3 Oz!, o\no]         ALVAGE OPERATION                  MAGMA CORE    HAZARD 4        RAPID POCKET      BISMOR ENOR PEARL\n22 24
4  tests/drg5.png                [o383, (o383, ]       ~ POINT EXTRACTION                    SALTPITS    HAZARD 4          ANGRY LUCK         BISMOR UMANITE\n94 19
5  tests/drg6.png  [BertieB, Costello, Noobface]        SALVAGE OPERATION               DENSE BIOZONE    HAZARD 4      RANGER'S PRIZE             ‘ CROPPA JADIZ\n8
6  tests/drg7.png            [®29, @&28, T VL R]                | EGGHUNT                 FUNGUS BOGS    HAZARD 4     CECOND COMEBACK             ‘BISHUH JADIZ\na8
7  tests/drg8.png         [IR A )], Costello, T]  y\n\n MINING EXPEDITION              GLACIAL STRATA    HAZARD 4       COLDSSAL DOOM  ‘ UMANITE ENOR PEARL\n169 48
8  tests/drg9.png             [. ®29, (o], I ‘4]           EGG HUNT __ .l  RADIOACTIVE EXCLUSION ZONE    HAZARD 4  ILLUMINATED POCKET     .ENDH PEARL I MAGNITE\n29

Inverting the image to be black-on-white helps hugely. In fact, given many of the fields have very restricted possibilities, we probably have enough to work with, once we take care of the variable number of names.

Handling Different Numbers of Players / Names

In DRG there are 1-4 players. My games are usually 3 or 4 players, sometimes 2, very very rarely solo. As the player names appear in different positions depending on the number of players, we need to either:

i) use fixed boxes for each number and see which one has sensible output

ii) use OpenCV to detect text to OCR

The first way is manageable in a relatively straightforward manner. Since there is a small number of regular players including myself, we can check for the presence of any of those in the output and keep it if it seems sensible.

Doing that gets us to:

There’s a bit of overdetection, particularly in the last row, which actually only had two players. We can clean things up by:

i) if a name is BertieB with anything else, it’s BertieB as my name doesn’t change (Note this may not be true for everyone- some folks like to change their username)

ii) non-alphanumeric names can be pruned

iii) names of 1-3 chars are likely noise detected as text*

* The last one could probably be dealt with by appropriate thresholding, but that’s a topic for another time.
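Those three rules can be sketched roughly as follows. This is one reading of them rather than the exact code: "non-alphanumeric" is taken here to mean "contains anything other than letters and digits", which would also throw away legitimate names with spaces or underscores, and the clean_names name and min_length parameter are made up for the example.

```python
def clean_names(names, own_name="BertieB", min_length=4):
    """Tidy a list of OCR'd player names using the three rules above."""
    cleaned = []
    for name in names:
        name = name.strip()
        if own_name.lower() in name.lower():
            # Rule i: our own name plus any stray characters is just our name
            cleaned.append(own_name)
        elif name.isalnum() and len(name) >= min_length:
            # Rule ii: drop anything with non-alphanumeric characters
            # Rule iii: drop names of 1-3 characters as likely noise
            cleaned.append(name)
    return cleaned


if __name__ == "__main__":
    print(clean_names(["BertieB.", "L)", "MaoTheCat"]))
    print(clean_names(["T", "3 Oz!", "abc"]))
```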

Doing that, we get:

Which is a huge improvement. We could hard-lock the output to a subset of names (which 99% of my games are with), but that would be a headache to remember to check in the case of playing a game on a public server or people who want to join in my stream. This is “good enough” for the time being!

Levenshtein Distance

Using the Levenshtein distance – the number of edits needed to transform a string into another – we can compare the OCR’d text to the five mission types, and pick whichever is closest. We can do the same thing with the biomes, minerals, and mission names. It should work excellently for the first three as there are few choices; however it should still work acceptably well for the mission names, even though there are over a hundred first and last names.

Our code is simple:

from Levenshtein import distance


def hard_cast_text(detected_text, choices):
    """Hard cast detected_text to the closest of a list of choices"""
    distances = {}

    # Compare case-insensitively, as the OCR'd text is mostly upper case
    for choice in choices:
        distances[choice] = distance(choice.lower(),
                                     detected_text.lower())

    # The choice with the smallest edit distance wins
    return min(distances, key=distances.get)

This could probably be made a one-liner if I thought long and hard enough about it. But we’re here to automate, not golf Python.

The minerals needed a little extra to handle enor pearl being two words and certain detections being closer in Levenshtein distance to, say, jadiz. Another scoring system that weights the beginning of strings more heavily may have helped there, but keeping it to Levenshtein means I can strip out the external library and implement my own if I so wish.
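For what it's worth, a home-grown replacement for the library call doesn't need much. Below is a minimal sketch of the standard dynamic-programming Levenshtein distance in pure Python, plus a closest_choice helper doing the same job as hard_cast_text. Both function names are made up for this example, and the list of choices is only the four mission types that appear in the test output, not the full set.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_choice(detected_text, choices):
    """Return the choice with the smallest case-insensitive edit distance."""
    return min(choices,
               key=lambda c: levenshtein(c.lower(), detected_text.lower()))


if __name__ == "__main__":
    # Only the mission types seen in the test output; not the complete list
    mission_types = ["Egg Hunt", "Mining Expedition",
                     "Salvage Operation", "Point Extraction"]
    for raw in ("ALVAGE OPERATION", "INING EXPEDITI!"):
        print(raw, "->", closest_choice(raw, mission_types))
```

This is slower than the C-backed library, but with a handful of choices per field that hardly matters.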

Our output for these nine tests looks good:

             file                                  names       mission_type                       biome    hazard        mission_name               minerals
0  tests/drg1.png           [graham, MaoTheCat, BertieB]           Egg Hunt                  Magma Core  Hazard 3          Open Trick  [Umanite, Enor Pearl]
1  tests/drg2.png  [BertieB, graham, MaoTheCat, ksyme99]  Mining Expedition  Radioactive Exclusion Zone  Hazard 3     Purified Legacy     [Magnite, Umanite]
2  tests/drg3.png           [BertieB, graham, MaoTheCat]  Mining Expedition              Glacial Strata  Hazard 3     Unhealthy Wreck      [Croppa, Magnite]
3  tests/drg4.png                    [BertieB, Costello]  Salvage Operation                  Magma Core  Hazard 4        Rapid Pocket   [Bismor, Enor Pearl]
4  tests/drg5.png  [BertieB, graham, Noobface, Costello]   Point Extraction                   Salt Pits  Hazard 4          Angry Luck      [Bismor, Umanite]
5  tests/drg6.png          [BertieB, Costello, Noobface]  Salvage Operation               Dense Biozone  Hazard 4      Ranger's Prize        [Jadiz, Croppa]
6  tests/drg7.png           [BertieB, Costello, bTRRABN]           Egg Hunt                 Fungus Bogs  Hazard 4     Second Comeback        [Bismor, Jadiz]
7  tests/drg8.png            [BertieB, Costello, graham]  Mining Expedition              Glacial Strata  Hazard 4       Colossal Doom  [Umanite, Enor Pearl]
8  tests/drg9.png                    [BertieB, Costello]           Egg Hunt  Radioactive Exclusion Zone  Hazard 4  Illuminated Pocket  [Magnite, Enor Pearl]

Next step? Further automation, of course!

Automating YouTube Uploads With OCR Part 4: Exploring Approaches To Improve Detection

My path in the woods diverged, and I took them all

We’ve been seeing if we can apply OCR to the loading screen of Deep Rock Galactic to generate metadata for YouTube uploads for automation.

Last time, we got a quick-and-dirty script that would pull out the various parts of one image successfully. Now we’d like to do that for any given loading screen: any number of dwarves, hazard level, biome, or level mutators (which the original image lacked).

We picked nine loading screens to expand our detection to:

The results are mixed:

Starting DRG OCR...
             file                          names       mission_type    mission_name                       biome    hazard                                objective
0  tests/drg1.png   [graham, MaoTheCat, BertieB]           EGG HUNT     DEFECT CELL                  MAGMA CORE  HAZARD 3     COLLECT 6 EGGS\nCollect 25 Hollomite
1  tests/drg2.png                  [&l, [T, @&3]               IR e      OPEN TRICK  RADIOACTIVE EXCLUSION ZONE  HAZARD 3  (COLLECT 225 MORKITE\nCollect 10 Fossil
2  tests/drg3.png       [BertieB, L), MaoTheCat]    INING EXPEDITI!  URIFIED LEGAC)              GLACIAL STRATA  HAZARD 3   COLLECT 250 MORKITE\nCollect 10 Fossil
3  tests/drg4.png               [T, 3 Oz!, o\no]     VAGE OPERATION    HEALTHY WREC                  MAGMA CORE   LLrZtl]         SR RTINS\nCollect 15 Apoca Bloom
4  tests/drg5.png                [o383, (o383, ]        LT X g (o))    RAPID POCKET                    SALTPITS    HAZARD                  COLLECT 10 AQUARQS\n(=R
5  tests/drg6.png  [BertieB, Costello, Noobface]  SALVAGE OPERATION      ANGRY LUCK               DENSE BIOZONE    HAZARD             NIV T\nCollect 20 Boolo Cap.
6  tests/drg7.png            [®29, @&28, T VL R]                      ANGER’S PRIZE                 FUNGUS BOGS    HAZARD     COLLECT 6 EGGS\nCollect 25 Hollomite
7  tests/drg8.png         [IR A )], Costello, T]  MINING EXPEDITION    BRIGHT JEWEL              GLACIAL STRATA    HAZARD            (eI VRS\nCollect 20 Boolo Cap
8  tests/drg9.png             [. ®29, (o], I ‘4]                          HIOELR DY  RADIOACTIVE EXCLUSION ZONE    LLYZU]     COLLECT 6 EGGS\nCollect 20 Boolo Cap

or in image form:

The mission type was a source of issues before for text detection, but looking at the generated crop boxes, it seems text is getting cut off, which will also affect the mission name detection as the two are presented together.

When we started this, I knew the number of players would have an impact on the locations of the text for the player names. However, given only up to four players can play at once, it wouldn’t be too bad to write detection for the four possibilities. But if other text is moving, that gets messy very quickly.

We have a couple of options at this point:

  • enlarge the detection boxes for the longest/biggest text we have in the examples and see if that works across all of them
  • think about using something like OpenCV to do text ROI (region of interest) detection (e.g. as pyimagesearch does it)

The first seems like it could be done quicker than the second, so we’ll give that a try first. We’re still in the “what approach works” stage (aka the quick-and-dirty stage) here!
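Enlarging a detection box amounts to taking the union of the boxes that fit each example, optionally with a bit of padding. A minimal sketch, assuming boxes are (left, upper, right, lower) pixel tuples as Pillow's crop expects; the union_box name and the coordinates below are made up for illustration.

```python
def union_box(boxes, margin=0, image_size=None):
    """Smallest box covering all given (left, upper, right, lower) boxes.

    margin pads the result on every side; image_size, if given as
    (width, height), clamps the result to the image bounds.
    """
    lefts, tops, rights, bottoms = zip(*boxes)
    left, top = min(lefts) - margin, min(tops) - margin
    right, bottom = max(rights) + margin, max(bottoms) + margin
    if image_size is not None:
        width, height = image_size
        left, top = max(left, 0), max(top, 0)
        right, bottom = min(right, width), min(bottom, height)
    return (left, top, right, bottom)


if __name__ == "__main__":
    # Hypothetical boxes that fit the mission type in two different screenshots
    boxes = [(100, 50, 300, 80), (90, 52, 340, 78)]
    print(union_box(boxes))
    print(union_box(boxes, margin=20, image_size=(350, 90)))
```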

Unfortunately, the approach wasn’t quite successful. It’s possible that the particular frames we picked from each video had an impact, but that’s not something we can easily test with our current setup. Let’s see about adding OpenCV to the mix…

OpenCV

We’re going to reuse the approach taken by Adrian on pyimagesearch as the work has been done for us, and see where that gets us.

(…)

Well, the short answer is: not as far as I had hoped!

The boxes it detects on a full image capture either too little or too much, though the latter could probably be helped by some video pixel averaging to blur the background and keep the text crisp. However, it also splits on non-word boundaries. All of these problems can be worked around, but perhaps there’s another approach we can add to the mix?
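For reference, the pixel-averaging idea could look something like the sketch below. It was not something tried here, so treat it as an untested assumption: it presumes a handful of equally sized 8-bit frames from the same loading screen are already in memory as NumPy arrays, and the average_frames name is made up for the example.

```python
import numpy as np


def average_frames(frames):
    """Per-pixel mean of equally sized 8-bit frames.

    Static pixels (such as overlaid text) keep their value, while a
    moving background is smeared towards its average.
    """
    stack = np.stack([np.asarray(frame, dtype=np.float32) for frame in frames])
    mean = stack.mean(axis=0)
    return np.clip(np.rint(mean), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Three tiny fake frames: the first pixel is static 'text',
    # the second is a changing background
    frames = [np.array([[255, value]], dtype=np.uint8) for value in (0, 100, 200)]
    print(average_frames(frames))
```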

Another Image

As well as a start screen, there’s also an end screen:

Another successful mission!

The information is presented slightly differently, but importantly i) it presents the info more uniformly and ii) background noise looks like less of an issue. Let’s put this one through the paces we did for the loading screen.

Overall naive OCR pulls out names well but misses about everything else. Mission name: yes. Mission type: nope. Minerals: yes. Promising! Heck, we could even pull out mission time and total hazard bonus if we wanted.

Let’s put OpenCV on the back burner for the time being, and see what a combined approach using two images gets us.

             file                          names       mission_type                       biome      hazard     mission_name                     minerals
0  tests/drg1.png   [graham, MaoTheCat, BertieB]           EGG HUNT                  MAGMA CORE  HAZARD 3 -       OPEN TRICK                 F ATl\n\nEL]
1  tests/drg2.png                  [&l, [T, @&3]       INING EXPEDI  RADIOACTIVE EXCLUSION ZONE  HAZARD 3 -  PURIFIED LEGACY              RGN AL\n\n48 17
2  tests/drg3.png       [BertieB, L), MaoTheCat]  MINING EXPEDITIO|              GLACIAL STRATA  HAZARD 3 -  UNHEALTHY WRECK    MAGNITE 3 CROPPA\n\n39 -3
3  tests/drg4.png               [T, 3 Oz!, o\no]        AL ol 2N ()                  MAGMA CORE    HAZARD 4     RAPID POCKET       2 nli) |2 el 1T\n\n3 4
4  tests/drg5.png                [o383, (o383, ]   POINT EXTRACTION                    SALTPITS    HAZARD 4       ANGRY LUCK               BISMOR UMANITE
5  tests/drg6.png  [BertieB, Costello, Noobface]  SALVAGE OPERATIO|               DENSE BIOZONE    HAZARD 4         I E Vi S             Tt AL v4\n\n3} 8
6  tests/drg7.png            [®29, @&28, T VL R]                                    FUNGUS BOGS    HAZARD 4  CECOND COMEBACK               S 6syTel) fivd
7  tests/drg8.png         [IR A )], Costello, T]  MINING EXPEDITION              GLACIAL STRATA    HAZARD 4    COLDSSAL DOOM       [IChley [ (e\n\n169 48
8  tests/drg9.png             [. ®29, (o], I ‘4]                     RADIOACTIVE EXCLUSION ZONE    HAZARD 4   TIRTIY TN T (3  COSLINCL IR MAGNITE\n\nX 3]

Improvement! We’re getting somewhere now, and we’ll see what we can do to clean the rest of it up using two images as a basis.