How good are live subtitles? Well, it depends who you ask! Of course you can watch some to get a sense of the quality, but to get a quantifiable answer you have to do some serious maths.

First of all, what should we be measuring? Well, there’s the textual accuracy for a start. Counting the number of words that have come out wrong is one way of measuring the accuracy of live subtitles. For example:

“There were for men standing in a line.”

Clearly the word “for” is wrong here. That’s one mistake in eight words, so the text is 87.5% accurate.
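For anyone who likes to see the sums written out, here’s the same calculation as a quick Python sketch – the numbers are just the ones from the example above:

```python
# Simple word-level accuracy: correct words divided by total words.
total_words = 8   # "There were for men standing in a line."
errors = 1        # "for" should have been "four"

accuracy = (total_words - errors) / total_words * 100
print(f"{accuracy:.1f}% accurate")  # 87.5% accurate
```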

At Red Bee we’ve assessed subtitlers using this method for years. Don’t worry, we don’t send out subtitles of such low accuracy! That was just an over-simplified example to keep my word count down. Our current target is a minimum of 98% accuracy. Hopefully that sounds pretty good, especially if you’re aware of the complexities of respeaking and stenography.

However, at Red Bee we know that’s not the full story. Pure textual accuracy is important, but what about comparing the subtitles to what was actually said? We call this content inclusion. For example, the newsreader might say:

“In the USA, four people have eaten crumpets and turned blue.”

But the subtitles might read:

“Some people have eaten crumpets.”

Clearly that would be terrible subtitling, but how should we quantify the terribleness? In this case three facts are missing – the USA, the number four and the turning blue. So maybe we could dock the subtitler one point for each fact. But that would mean all three facts were equally weighted, when you could argue that “some people” isn’t so different to “four people” and is a much less serious omission than the fact that they turned blue.

This issue of the relative seriousness of errors also applies to assessing textual accuracy. In the first example, you could probably tell that the sentence was supposed to say “four men”, so the error is arguably very minor. If it said the following, the mistake would be more serious because the sentence would be harder to read:

“There wherefore men standing in a line.”

So quantifying the quality of live subtitles while factoring in all of these issues is tricky. Ofcom is running a project that aims to do just that, and they have turned to the NER model for help. This really is serious maths. It was developed by Juan Martínez and Pablo Romero-Fresco at Roehampton University, and it looks like this:
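Accuracy = (N − E − R) / N × 100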

You see – simple when you know how! Let me explain. N is the number of words, E is the score for omissions (Edition Errors) and R is the score for mistakes in the text (Recognition Errors). So by taking the word count and deducting the omission and mistake scores, you can calculate a percentage accuracy that factors in both content inclusion and textual accuracy. What’s even more cunning about this model is that different types of errors are given different values. So a serious mistake of the “wherefore” or “turned blue” variety scores 1, while a standard error scores 0.5 and a minor error like “for/four” scores 0.25.

Another great feature of this model is that it acknowledges that not every word that drops from a speaker’s lips should be subtitled, partly because we often speak much faster than most people can comfortably read and partly because subtitling every verbal tic can make a speaker look, well, um, a bit sort of dim. The model calls omissions of unimportant fluff like that Correct Editions, and they score 0 points.
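To make the scoring concrete, here’s a rough Python sketch of how an NER calculation could be tallied up. The segment, the error lists and the function name are hypothetical examples of mine rather than part of the official model, but the weights and the formula follow the description above:

```python
# A rough sketch of NER-style scoring, assuming the weights described above:
# serious = 1, standard = 0.5, minor = 0.25, correct edition = 0.
WEIGHTS = {"serious": 1.0, "standard": 0.5, "minor": 0.25, "correct_edition": 0.0}

def ner_accuracy(word_count, edition_errors, recognition_errors):
    """Accuracy = (N - E - R) / N * 100, where E and R are the summed
    weighted scores of the edition and recognition errors respectively."""
    e = sum(WEIGHTS[severity] for severity in edition_errors)
    r = sum(WEIGHTS[severity] for severity in recognition_errors)
    return (word_count - e - r) / word_count * 100

# Hypothetical ten-minute segment: 1,500 words subtitled, with one serious
# omission ("turned blue"), one harmless omission counted as a correct
# edition, and two minor recognition errors of the "for/four" kind.
editions = ["serious", "correct_edition"]
recognitions = ["minor", "minor"]
print(f"{ner_accuracy(1500, editions, recognitions):.2f}%")  # 99.90%
```

Of course, a human assessor still has to decide which category every single error falls into, which is presumably where much of the time goes.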

The trouble with NER is that while our enthusiasm for assessing subtitles is limitless, the time available is not. Calculating an NER accuracy for ten minutes of subtitling takes about two hours. When you think that Red Bee produces 165 hours of live subtitling every single day – which would take nearly 2,000 hours of analysis to score in full – the chance of getting a fair picture of subtitling accuracy across the board using NER seems slim.

How would you assess live subtitles? Let us know in the comments below!

Rachel Thorn, Strategic Planner.