Only a few short years ago, every part of the process of preparing subtitles for recorded television was driven manually by the subtitler.
The text had to be typed in, then separated out into tidy subtitles, then coloured according to who was speaking and, finally, carefully timed to the video. The process was extremely time-consuming, labour-intensive and expensive.
The proportion of television output that is legally required to be subtitled is increasing every year and is now nearing one hundred percent for many major broadcasters, so there is ever-greater pressure to find ways to reduce the cost of producing these subtitles. As workflows are already about as efficient as they can be, future improvements will have to come from technology.
The single most time-consuming part of the subtitling process is the conversion of dialogue into accurate text. As a result, the change that will most transform this process is accurate speaker-independent voice recognition. Speaker-independent systems are not trained to any one voice and can operate on the original soundtrack of the media, despite the presence of background noise and music.
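To make the idea concrete, the short Python sketch below shows what a speaker-independent transcription call might look like using the open-source SpeechRecognition library and a cloud recognition service; the file name and the choice of library are illustrative assumptions rather than a description of any particular subtitling system.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Load the programme's original soundtrack (hypothetical file name)
with sr.AudioFile("programme_audio.wav") as source:
    audio = recognizer.record(source)  # capture the full audio

# A general-purpose, speaker-independent model: no training to any one voice
try:
    transcript = recognizer.recognize_google(audio)
    print(transcript)
except sr.UnknownValueError:
    # Background noise and music can still defeat the recogniser entirely
    print("Audio could not be transcribed")
```

Even when such a call succeeds, the raw transcript still needs punctuation, subtitle segmentation and timing before it is usable.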
Currently, outside of highly specialised language areas with limited background noise, such as weather forecasts, the accuracy of these systems tends to be around 60-65%, and the text they produce has limited punctuation or none at all. Accuracy would need to approach 95% before the benefits gained outweighed the time consumed in checking and fixing all of the mistakes.
The shift to speaker-independent recognition is a few years away, but anyone who has seen the massive improvements in voice control systems over the last few years can be in no doubt that its time will come. In the meantime, the current system of “respeaking”, whereby the subtitler repeats everything they hear in a clear voice and with spoken punctuation, allows for extremely quick text input and accuracies of around 98%.
Using diarisation, or speaker identification, for automatic colouring of subtitles is increasingly a possibility, as the speaker recognition algorithms now available are very strong. However, they have yet to be included in any subtitling packages, probably because, while the subtitlers themselves are still generating the text, adding colour changes as they go is not a lot of extra work.
This speaker information can, however, be hugely useful to speaker-independent recognition systems, as it allows them to split the files into single-speaker segments. It could also be a great work-saver once those systems are implemented, by avoiding the need to re-colour the automatically generated text. As such, these two developments are likely to arrive hand in hand.
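As a rough illustration of how diarisation output lends itself to both uses, the sketch below labels who speaks when using the open-source pyannote.audio toolkit; the model name and file name are assumptions made for the example, not a reference to any tool currently used in subtitling workflows.

```python
from pyannote.audio import Pipeline

# Load a pretrained diarisation pipeline (model name is illustrative)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Run it over the programme's soundtrack (hypothetical file name)
diarization = pipeline("programme_audio.wav")

# Each track gives a time span and an anonymous speaker label, which could
# later be mapped to a subtitle colour or used to cut the audio into
# single-speaker segments for the recogniser
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:6.1f}s - {segment.end:6.1f}s  {speaker}")
```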
Currently, a lot of productivity is lost in delivering large video files to subtitlers. Even at modern bandwidths, with modern file delivery mechanisms, this process can cost minutes per file in regional offices and tens of minutes for homeworkers. Over the last couple of years, many budding online platforms and streaming software clients have emerged that offer the possibility of eliminating these download times entirely. They also offer greater security for client media, as the video is never stored on the subtitler’s machine. The current batch offers fairly basic functionality, but we can expect them to improve over the next few years until they take over completely from offline clients.
Increasingly, broadcasters are looking to use subtitles as metadata, allowing them to search their own output accurately for research and compliance purposes. Subtitles are very useful for this, but as they stand they offer only full-text search. To extract better data from them, named entity recognition algorithms are likely to be deployed, allowing for databases that store every occurrence of a proper noun (names, places and so forth). This will offer quicker and more focused searching than whole-text search of subtitle output, and will also allow for meta-analysis of the popularity and connectedness of topics, personalities and so on.
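By way of illustration, the sketch below shows how named entity recognition could be applied to subtitle text with the open-source spaCy library; the model name and the sample line of dialogue are assumptions made for the example.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model (illustrative choice)

# A line of subtitle text (hypothetical example)
subtitle_text = "The Prime Minister met Angela Merkel in Berlin on Tuesday."

doc = nlp(subtitle_text)

# Keep only people, places and organisations for the index
entities = [(ent.text, ent.label_) for ent in doc.ents
            if ent.label_ in {"PERSON", "GPE", "ORG"}]

# Counting occurrences per programme is the basis for the kind of
# popularity and connectedness analysis described above
print(Counter(entities))
```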
Although these automated processes will improve vastly over the next few years, and will become accurate enough to be worth deploying, they will always make mistakes that a human would not. They will continue to struggle with unclear or overlapping dialogue, with television programmes that use new slang, and with infrequently occurring specialist subjects that require a lot of research. They will also struggle to reproduce the written style of any given channel accurately. And so, while the role of subtitlers in generating the text for subtitles will shrink over the coming years, their roles in quality control and as the arbiters of human taste and judgement will become more vital than ever.
Hewson Maxwell, IT Manager (Spain), Access Services.