Evaluating Automatic Speech Recognition for Closed Captioning

In the captioning business you are often asked why more of the work isn’t fully automated. People observe the excellent recognition of Siri/Google/Lex and the high accuracy claims of many ASR providers[1] and assume that this must be transferable.

In reality, the quality of text that even the best ASR engines produce is far below what end-users of captioning would consider acceptable, except in some very narrow, ideal conditions: a single speaker, a high-quality recording with no background noise, and a limited, predictable vocabulary. The weather clip shown in the table below is a good example. In truth, even in these ideal conditions the results are often mediocre at best.

Although it is far from perfect, the accuracy of the best ASR is improving year on year, and for many types of programmes we are already crossing the threshold at which it becomes cheaper and quicker to correct ASR-generated text than to produce text from scratch. In the case of live captioning, where the need to produce captions in real time already reduces accuracy compared with captions for pre-recorded television, the bar a fully automated ASR solution needs to reach is lower; but fully automated ASR also performs worse here, because it has less context to work with. For now, the gap between fully automated ASR and current production methods remains significant, with fully automated ASR giving results far below end-users’ and regulators’ expectations.

As a result, at Red Bee Media we needed a way to assess new ASR solutions and see how they might improve our productivity in offline captioning production, and we needed to be able to measure the current gap between fully automated ASR and human-led live captioning methods. To achieve this we realised we needed a standard metric for ASR accuracy that could be generated quickly and automatically. Having such a method means we can rapidly baseline any new ASR solution or configuration against all of the previous ASR testing we have performed.

Academic publishing on ASR generally uses the Word Error Rate (WER)[2] as its measure of accuracy, a sequence-similarity score based on the Levenshtein distance[3]. This is a perfectly appropriate method, and our preferred measurement is closely related, but we have narrowed it down to a percentage score of “matched words”. For our purposes we find this correlates more directly with the effort required to correct the text.
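To make the relationship between the two measures concrete, here is a minimal sketch, in Python, of a word-level Levenshtein alignment that yields both a WER and a simple matched-words percentage. The function and the example sentences are illustrative assumptions only; our production tooling applies further normalisation and weighting.

```python
# Illustrative sketch only: word-level Levenshtein alignment between a
# verbatim reference transcript and an ASR hypothesis, giving both WER
# and a simple "matched words" percentage.

def align_counts(reference: str, hypothesis: str):
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    # Trace back to count "hits": reference words matched exactly
    i, j, hits = len(ref), len(hyp), 0
    while i > 0 and j > 0:
        if ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1   # substitution
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1                # deletion
        else:
            j -= 1                # insertion
    return len(ref), d[len(ref)][len(hyp)], hits

ref_len, edits, hits = align_counts(
    "showers moving in from the west this evening",
    "showers moving in from the west this morning")
print(f"WER: {edits / ref_len:.1%}")            # 12.5%
print(f"Matched words: {hits / ref_len:.1%}")   # 87.5%
```

In this sketch the matched-words score simply counts the reference words the ASR got right, which is why it tracks correction effort a little more directly than WER, where insertions can push the error rate above 100%.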

The accuracy figures quoted in academic papers and by commercial ASR providers are highly dependent on the material used. Academic papers often rely on standard benchmark sets such as the “Switchboard” data set[4], and solutions often appear to be over-trained on these particular sets, since in real-world use the accuracy is generally significantly lower. Many of the better ASR providers now steer clear of quoting accuracy figures at all because of the difficulty of measuring them objectively.

To minimise these issues, we built our automated comparison system around a wide variety of broadcast media covering a range of acoustic environments, dialects and speech styles. For every test data set we have produced a verbatim transcript to measure against, and the ASR output is transformed into a common format so that it is directly comparable. This approach allows us to assess any new solution quickly and see its strengths and weaknesses immediately. We have been using these systematic assessments to drive our internal development and to help us choose which providers to partner with.
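As a rough illustration of the kind of transformation involved (the rules below are illustrative assumptions, not our actual pipeline), a normaliser might lowercase both texts, strip punctuation and caption line breaks, and collapse whitespace before the alignment step:

```python
import re

# Example normaliser: reduce both the verbatim transcript and the ASR
# output to a comparable stream of plain lowercase words before alignment.
# A real pipeline also has to handle numbers, currencies, speaker labels,
# music notes in captions, and so on.

def normalise(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[\r\n]+", " ", text)        # caption line breaks -> spaces
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # drop punctuation, keep apostrophes
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(normalise("Showers, moving in\nfrom the West..."))
# -> "showers moving in from the west"
```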

[Graph: matched-words scores for a small representative sample of ASR providers and media types.]

One of the big surprises has been that the general-purpose ASR solutions from most of the large internet companies do not perform well on broadcast media, and this is especially noticeable with non-American dialects (see Solution A and Solution B in the table). Aside from the natural North American bias in their English-language training data, we believe this is because their training data is also skewed towards command-and-control voice interaction rather than the transcription of scripted and unscripted dialogue.

As you can see from the table above, there is a wide spread of accuracy in the market, and wide variation across media types, with even the best solutions rarely exceeding 90% “matched words” accuracy. But the best ASR is still well worth using alongside human intervention, and with this automated test methodology we are confident that we will always be using the best ASR available to produce the best and cheapest captions.

I’ll leave you with a humorous illustration of captions that are missing “the other 7%”, courtesy of the BBC.

https://www.youtube.com/watch?v=ONWNWBoqTuM

P.S. Just to be a bit of a killjoy, I actually measured the accuracy of the clip’s Syncopaticaption solution as per our “matched words” metric, and it was more like 75% than 93%… which is not completely uncompetitive, all the same.

[1] https://www.ibm.com/blogs/watson/2017/03/reaching-new-records-in-speech-recognition/, https://9to5google.com/2017/06/01/google-speech-recognition-humans/

[2] https://martin-thoma.com/word-error-rate-calculation/

[3] https://en.wikipedia.org/wiki/Levenshtein_distance

[4] https://www.voicebot.ai/2017/03/13/ibm-claims-new-speech-recognition-record-watson/

 

Hewson Maxwell, Technology Manager, Access Services