Transcribing A Video’s Audio Using Google Cloud Speech-to-Text

Navigating a series of pitfalls

Background: I created a highlights video recently for the monthly subscriber crown that I run. I figured the highlights video itself was worth posting, but I like to have the speech captioned. I’ve previously written a small script to generate the appropriate transparent PNGs for overlaying, but that was for a short, manually transcribed highlight (this one).

I forget how I stumbled across it, but I had a page saved on how to transcribe audio in Python that I’d been meaning to check out. I figured this would be a good way to put that into practice: let it transcribe the audio from the video, and then fix it up, since it’s rather noisy audio.

Having recently installed pyenv and pyenv-virtualenv for another purpose, I set up a virtualenv and grabbed the deps:

(transcribe) $ pip install pylint pydub SpeechRecognition

though I actually could have omitted pydub as that’s only used to convert / split audio, for which I prefer to use ffmpeg manually anyway. I threw together a quick script, and used the recognize_google method.
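For reference, a quick script of this kind might look something like the sketch below. It is not the exact script from this post: the function name `transcribe` and the variable `recogniser` match the traceback that follows, but the body is an assumption built from SpeechRecognition's `Recognizer`, `AudioFile`, `record` and `recognize_google` calls.

```python
def transcribe(path):
    """Return Google's best-guess transcript for a WAV, AIFF or FLAC file."""
    # Imported here so the module still loads where SpeechRecognition is absent.
    import speech_recognition as sr

    recogniser = sr.Recognizer()
    # AudioFile reads WAV, AIFF and FLAC, so the audio has to be
    # extracted from the video (e.g. with ffmpeg) beforehand.
    with sr.AudioFile(path) as source:
        audio = recogniser.record(source)  # read the whole file into memory
    try:
        return recogniser.recognize_google(audio)
    except sr.UnknownValueError:
        # Raised when the service cannot make out any speech at all.
        return ""
```

Connection problems surface as `speech_recognition.RequestError`, which this sketch deliberately lets propagate.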

First Pitfall: Broken Pipe

Traceback (most recent call last):                                                                                  
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/urllib/request.py", line 1346, in do_open                  
    h.request(req.get_method(), req.selector, req.data, headers,                                                    
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 1253, in request                     
    self._send_request(method, url, body, headers, encode_chunked)                                                  
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 1299, in _send_request               
    self.endheaders(body, encode_chunked=encode_chunked)                                                            
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 1248, in endheaders                  
    self._send_output(message_body, encode_chunked=encode_chunked)                                                  
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 1047, in _send_output                
    self.send(chunk)                                                                                                
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/http/client.py", line 969, in send                         
    self.sock.sendall(data)                                                                                         
BrokenPipeError: [Errno 32] Broken pipe                                                                             
                                                                                                                    
During handling of the above exception, another exception occurred:                                                 
                                                                                                                    
Traceback (most recent call last):                                                                                  
  File "/home/robert/.pyenv/versions/transcribe/lib/python3.9/site-packages/speech_recognition/__init__.py", line 840, in recognize_google
    response = urlopen(request, timeout=self.operation_timeout)                                                     
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/urllib/request.py", line 214, in urlopen                   
    return opener.open(url, data, timeout)                                                                          
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/urllib/request.py", line 517, in open                      
    response = self._open(req, data)                                                                                
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/urllib/request.py", line 534, in _open                     
    result = self._call_chain(self.handle_open, protocol, protocol +                                                
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/urllib/request.py", line 494, in _call_chain               
    result = func(*args)                                                                                            
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/urllib/request.py", line 1375, in http_open                
    return self.do_open(http.client.HTTPConnection, req)                                                            
  File "/home/robert/.pyenv/versions/3.9.5/lib/python3.9/urllib/request.py", line 1349, in do_open                  
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 32] Broken pipe>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/robert/code/transcribe/transcribe.py", line 36, in <module>
    transcribe(sys.argv[1])
  File "/home/robert/code/transcribe/transcribe.py", line 27, in transcribe
    print(recogniser.recognize_google(audio))
  File "/home/robert/.pyenv/versions/transcribe/lib/python3.9/site-packages/speech_recognition/__init__.py", line 844, in recognize_google
    raise RequestError("recognition connection failed: {}".format(e.reason))
speech_recognition.RequestError: recognition connection failed: [Errno 32] Broken pipe

Hmmm. An SO answer suggested that that method wasn’t great (hacky) and probably wouldn’t work with long files. So I went for another approach, using ‘proper’ Google API access.

Second Pitfall: Missing Python Libraries (GAPC & oauth2client)

Request error: missing google-api-python-client module: ensure that google-api-python-client is set up correctly.

Pitfall Three: ApplicationDefaultCredentialsError – Type Field Should be Defined (Wrong JSON)

I got the above error even after pip install google-api-python-client! Turns out I needed oauth2client as well. However, even with both installed, it didn’t like the (already existing) JSON I supplied it with from client-secrets.json:

Traceback (most recent call last):
  File "/home/robert/.pyenv/versions/transcribe/lib/python3.9/site-packages/oauth2client/client.py", line 1289, in from_stream
    return _get_application_default_credential_from_file(
  File "/home/robert/.pyenv/versions/transcribe/lib/python3.9/site-packages/oauth2client/client.py", line 1395, in _get_application_default_credential_from_file
    raise ApplicationDefaultCredentialsError(
oauth2client.client.ApplicationDefaultCredentialsError: 'type' field should be defined (and have one of the 'authorized_user' or 'service_account' values)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/robert/code/transcribe/transcribe.py", line 48, in <module>
    transcribe(sys.argv[1])
  File "/home/robert/code/transcribe/transcribe.py", line 34, in transcribe
    print(recogniser.
  File "/home/robert/.pyenv/versions/transcribe/lib/python3.9/site-packages/speech_recognition/__init__.py", line 916, in recognize_google_cloud
    api_credentials = GoogleCredentials.from_stream(f.name)
  File "/home/robert/.pyenv/versions/transcribe/lib/python3.9/site-packages/oauth2client/client.py", line 1294, in from_stream
    _raise_exception_for_reading_json(credential_filename, 
  File "/home/robert/.pyenv/versions/transcribe/lib/python3.9/site-packages/oauth2client/client.py", line 1427, in _raise_exception_for_reading_json
    raise ApplicationDefaultCredentialsError(
oauth2client.client.ApplicationDefaultCredentialsError: An error was encountered while reading json file: /tmp/tmpufo022el (provided as parameter to the from_stream() method): 'type' field should be defined (and have one of the 'authorized_user' or 'service_account' values)

Fortunately the answer was, as ever, contained in a five year old SO answer sitting at +2:

The JSON file that you provided is for an OAuth client. When you using application default credentials, you need to provide a JSON service account key. Try going to this page, clicking “Create service account”, filling out the form, and choosing “Furnish a new private key.”

Patrick on SO

Once I figured out my way around the Google Cloud Platform, I was able to create a service account, then create a new key for it and finally download it to pop into the script.
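The two kinds of JSON file are easy to tell apart before handing them to a library. The sketch below is a hypothetical helper (the name `credential_kind` and its return strings are made up for illustration) that relies only on the shapes described by the error messages in this post: a top-level "type" of 'authorized_user' or 'service_account' for application default credentials, versus a single "web" or "installed" property for an OAuth client secrets file.

```python
import json


def credential_kind(path):
    """Classify a Google credentials JSON file by its top-level keys."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        return "unknown"
    # Application default credentials carry a "type" field.
    if data.get("type") in ("service_account", "authorized_user"):
        return data["type"]
    # OAuth client secrets wrap everything in one "web" or "installed" object.
    for key in ("web", "installed"):
        if key in data:
            return "oauth_client_" + key
    return "unknown"
```

Running it on the downloaded key should report `service_account`; an old client-secrets.json would report one of the `oauth_client_` values instead.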

Pitfall Four: Payload Size Exceeds Limit

However, while that authorised me to make the request properly, it didn’t mean the request would go through:

Request error: <HttpError 400 when requesting https://speech.googleapis.com/v1/speech:recognize?alt=json returned "Request payload size exceeds the limit: 10485760 bytes.". Details: "Request payload size exceeds the limit: 10485760 bytes.">

Side note: at this point, I looked at a fork, since the original was seemingly unmaintained with outdated services. That didn’t help.

I found the support page for the Cloud Speech-to-Text API and looked up the payload limit:

You have exceeded the 10 MB size limit for a single request sent to the API using a local file. You can move your audio file to a Google Cloud Storage (GCS) bucket to avoid the 10 MB limit. See the quotas & limits page for more information.

Welp. The file I had was 92MB. I tried FLAC to see if that would help, but that only brought it down by about 20MB, so no luck there.
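A rough size check up front would have flagged this before any request was sent. The sketch below is only back-of-envelope arithmetic: the constant comes from the error message above, while the helper names and the assumption that the limit applies to the base64-encoded request body (inline audio is sent as base64 inside the JSON) are mine, not something the error message confirms.

```python
import os

REQUEST_LIMIT_BYTES = 10485760  # the limit quoted in the error message


def base64_size(raw_bytes):
    """Size of raw_bytes once base64-encoded: 4 output bytes per 3 input bytes, padded."""
    return 4 * ((raw_bytes + 2) // 3)


def fits_inline(path, limit=REQUEST_LIMIT_BYTES):
    """True if the file, once base64-encoded, stays under the request limit (assumed)."""
    return base64_size(os.path.getsize(path)) < limit


def pcm_wav_bytes(seconds, sample_rate=44100, channels=2, sample_width=2):
    """Approximate size of uncompressed PCM audio, ignoring the small WAV header."""
    return int(seconds * sample_rate * channels * sample_width)
```

Assuming eight minutes of 16-bit stereo at 48 kHz (a guess, since the format isn't stated here), `pcm_wav_bytes(480, 48000)` comes to 92,160,000 bytes, in the same ballpark as the 92MB file. Even as 16 kHz mono, `pcm_wav_bytes(480, 16000, 1)` is 15,360,000 bytes, which is still over the limit.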

‘Pitfall’ Five: Cloud Storage

At this point it was becoming apparent that my choices were to split up the file into 10MB chunks, or figure out Google Cloud Storage. This was actually relatively straightforward; the trickiest bit was finding what they offered for free. I found a reference somewhere that the first 5GB/month was free. Ideal!

From here I switched from a Python script to shoogle to interact with the API. I wasn’t sure that SpeechRecognition was smart enough to use the longrunningrecognize API method, which is what I needed. I’ve used shoogle plenty before, so it was ready to go.

{
  "body": {
    "config": {
      "language_code": "en-GB"
    },
    "audio": {
      "uri": "gs://example-storage/audio.wav"
    }
  }
}
$ shoogle execute speech:v1.speech.longrunningrecognize testreq.json 
[ERROR] Server error response (403): {
  "error": {
    "code": 403,
    "message": "The request is missing a valid API key.",
    "status": "PERMISSION_DENIED"
  }
}

# try the service JSON file we created before...

oauth2client.clientsecrets.InvalidClientSecretsError: Invalid file format. See https://developers.google.com/api-client-library/python/guide/aaa_client_secrets Expected a JSON object with a single property for a "web" or "installed" application

Mmmph. It turns out that I couldn’t use the service account JSON key I’d created previously; I had to go through an OAuth2 dance with shoogle in my browser.

Pitfall Six: Multichannel Audio

[ERROR] Server error response (400): {
  "error": {
    "code": 400,
    "message": "Must use single channel (mono) audio, but WAV header indicates 2 channels.",
    "status": "INVALID_ARGUMENT"
  }
}

Now this one I’d seen before! The support page suggests specifying multiple channels, but I didn’t want to do that. Their idea of multiple channels was for a channel-per-speaker type setup, which my audio was not. Stereo can be downmixed to mono in ffmpeg with something like: ffmpeg -i <input video> -vn -ac 1 audio.wav. I submitted the longrunningrecognize request and got my “name” (really a numeric id) back.
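The channel count the API complains about can be read from the WAV header before uploading anything. The sketch below uses only Python's standard `wave` and `array` modules; `channel_count` and `downmix_to_mono` are illustrative names, and the downmix simply averages the channels of 16-bit PCM, which is a simplification and not a replacement for the ffmpeg command above.

```python
import array
import sys
import wave


def channel_count(path):
    """Read the channel count straight from the WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels()


def downmix_to_mono(src, dst):
    """Average the channels of a 16-bit PCM WAV file into a single channel."""
    with wave.open(src, "rb") as wav:
        params = wav.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", wav.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    n = params.nchannels
    # Each frame holds one sample per channel, interleaved; average each frame.
    mono = array.array(
        "h", (sum(samples[i:i + n]) // n for i in range(0, len(samples), n))
    )
    if sys.byteorder == "big":
        mono.byteswap()
    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(params.framerate)
        out.writeframes(mono.tobytes())
```

A check like `channel_count("audio.wav") == 1` before uploading to the bucket would catch this error without a round trip to the API.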

After some faff figuring out the right API incantation (I dived into the shoogle source code, as I didn’t realise I’d duplicated a prefix), I got my result back in JSON form, one entry per sentence or so. I wrote a quick script to join the lines in sentence structure:

“Controller. Play some good possession basketball lads is that a basketball when am I getting massive. I don’t know how you tackle new standard front yyy jumps for the block the shot stop playing American football even though I have no idea to play that is log of didn’t do anything. Continue to the growing invisible I can’t see oh my god why. Inside the bar and. That table there visible going back on the fifth year ago shooting fail I swear to god it’s Elizabeth for you this is. Very early I don’t know how to do it’s nothing to judging by his invisible. Give me the look in His Eyes towards people don’t. Touchstone. OK how can we not intercept that just go for the blogs over the high blocks Roblox Roblox. No it’s fine on Door Store. Excellent counter. Icelandic I just held the XX when I got close to the best oh I’m. I don’t know when my team has two big men and Lee commentator sound like the Herbert from Family Guy controller Club. Can I get the bullet we have an instant replay on there I was the other ground leaving the James. Another one. I don’t know I don’t. What paints did Ascot look forward to it sideways. Has the ball old cold it’s like ok so.”
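A joining script along those lines could be as small as the sketch below. It is a reconstruction, not the script actually used: `join_transcripts` is a made-up name, and it assumes the usual Speech-to-Text result shape, where each entry in "results" holds a list of "alternatives" and the first alternative's "transcript" is the one to keep.

```python
def join_transcripts(results):
    """Turn Speech-to-Text result entries into capitalised, full-stopped sentences."""
    sentences = []
    for result in results:
        alternatives = result.get("alternatives") or []
        if not alternatives:
            continue  # some entries carry no transcript at all
        text = alternatives[0].get("transcript", "").strip()
        if text:
            # Capitalise the first letter only; leave the rest as returned.
            sentences.append(text[0].upper() + text[1:] + ".")
    return " ".join(sentences)
```

Given the parsed JSON response in a variable such as `response`, the call would be something like `join_transcripts(response["results"])`, depending on how deeply the tool nests the results.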

So in conclusion:

Pitfall Seven: The Results May Be Entirely Unhelpful

The output I got above had some gems:

  • lads is that a basketball when am I getting massive
  • Continue to the growing invisible I can’t see oh my god why
  • I swear to god it’s Elizabeth for you this is
  • Roblox Roblox
  • I don’t know when my team has two big men and Lee

but really there wasn’t a lot of output for 8-some minutes of input. To be fair to Google’s Cloud Speech-to-Text, it wasn’t really designed for this kind of input. My input audio was:

  • noisy – it was an old recording when I was still using a headset mic and very little audio processing
  • had two people talking/shouting over each other at times
  • had video game commentary running throughout
  • had game audio in the background

I might give it a try with a more modern recording, which uses a proper mic, is calmer, less overtalking, and has a decent audio setup with ducking.
