It would seem that in many countries, Spain included, the battle to subtitle a high proportion of television output is slowly being won, but this in itself has brought new challenges. Subtitling a large percentage of TV output means subtitling a lot of live programming through respeaking and/or stenography, so accuracy and delay come increasingly to the forefront in any discussion of subtitling.
But how do we measure the quality of our live subtitles? How can we objectively measure their accuracy?
Of course, many factors influence subtitle quality: delay, positioning, speaker identification and reading speed, as well as the types of error that affect viewers’ comprehension. The Word Error Rates long used to measure accuracy in speech recognition are tried and tested, but in the case of respeaking for live subtitling, calculating an error rate alone may not give the full picture. Some errors render a phrase impossible to understand; others may raise a smile without affecting comprehension at all.
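For readers unfamiliar with the metric, here is a minimal sketch of how a classic Word Error Rate can be computed, as the word-level edit distance between a reference transcript and the subtitle output divided by the reference length. The function and example sentences are purely illustrative, not part of any broadcaster’s tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Classic WER: (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a six-word reference -> WER ~= 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The weakness the article points to is visible here: the metric counts every edit as equally bad, regardless of whether it destroys the meaning or merely raises a smile.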
However, the challenge is that a full contextual analysis of multiple variables takes a very long time. It might be feasible for companies or researchers to do this sort of analysis on some programmes, but with broadcasters like the BBC or Sky producing 600 hours of live subtitling a week, how can we analyse the quality of such large amounts of data accurately?
Pablo Romero Fresco and Juan Martínez think they have the solution in the NER Model, presented in Barcelona: a system for measuring and classifying recognition and edition errors as Serious, Standard or Minor. Under this system, Serious errors, which change the meaning of the text, are penalised more heavily than Standard errors, which make the text harder to understand, or Minor errors, which don’t affect the viewer’s ability to receive the message. The model also leaves room for an overall assessment, a more subjective evaluation of the subtitles’ flow, delay and coherence.
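As a rough illustration of how such a weighted score might be calculated, the sketch below assumes the formula and weights usually associated with the NER model, accuracy = (N − E − R) / N × 100, with serious, standard and minor errors counting 1, 0.5 and 0.25 points respectively. Those figures come from the published model rather than from this article, so treat the code as an assumption-laden sketch, not a reference implementation.

```python
# Assumed weights per error, following the weighting usually cited for the NER model.
ERROR_WEIGHTS = {"serious": 1.0, "standard": 0.5, "minor": 0.25}

def ner_accuracy(n_words: int,
                 edition_errors: list[str],
                 recognition_errors: list[str]) -> float:
    """Accuracy = (N - E - R) / N * 100, where N is the number of words in the
    respoken subtitles, E the weighted edition errors and R the weighted
    recognition errors."""
    e = sum(ERROR_WEIGHTS[severity] for severity in edition_errors)
    r = sum(ERROR_WEIGHTS[severity] for severity in recognition_errors)
    return (n_words - e - r) / n_words * 100

# Example: 500 subtitled words, one serious and two minor edition errors,
# plus three standard recognition errors -> 99.4
print(round(ner_accuracy(500,
                         ["serious", "minor", "minor"],
                         ["standard", "standard", "standard"]), 2))
```

The arithmetic is trivial; the hard, time-consuming part is the one the model cannot automate, namely classifying each error against the original audio.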
The model is convincing, not least because it takes into account the different degrees of editing required for different programmes and countries, and the possibility of edited but still accurate respeaking. The problem is that any analysis of this type requires a full, accurate transcription of the original, and that transcription has to be produced somehow. Quite apart from the time it would take to produce such a transcription to compare with the respoken text, how do we check that it really is full and accurate? Romero and Martínez suggested that a first transcription be produced by automatic speech recognition and then corrected by a human, but if ASR accuracy rates sit well below those of respeaking, and humans are naturally prone to error, what model do we use to assess the quality of the transcript we’re comparing the respeaking against?
Looks like we’ve arrived somewhere between a rock and a hard place.
Anyone got any ideas? Let us know in the comments box below or tweet us at @RedBeeMediaESP.
María Jesús Granero, Live Subtitler, Red Bee Media Spain