This week at NAB marks a significant milestone for us: the debut of our closed captioning service in the US. And how times have changed.

The team that now leads Ericsson’s closed captioning service started using speech recognition in production processes as far back as 1999, automatically aligning text and audio to reduce the effort and cost of creating offline caption files. Assisted Subtitling, as it was known in the UK, won a Royal Television Society Award for Innovation for BBC R&D and created a foundation of interest and knowledge in speech recognition technology that underpins much of what Ericsson is doing in captioning today.

Sixteen years on, the vast majority of our captions, both live and offline, are created wholly or in part using speech recognition technologies to meet the very stringent quality expectations set by broadcasters, audiences and regulators such as the FCC and Ofcom. We see speech recognition as a fundamental technology for us and our customers. It is an area full of bold claims and some smoke and mirrors, but it nonetheless holds genuinely important potential for innovation: to improve quality, reduce cost, enable the efficient re-use of captions across multiple platforms and add further value to broadcasters from their captioning spend.

Speech recognition is used in multiple ways. It’s useful, first and foremost, simply as a way of producing text quickly and efficiently. Captioners now listen to programme sound, both live and recorded, and “respeak” the dialogue, using software trained to their individual voice to create text. One great advantage of this approach is that, because the software is being developed so intensively by interested parties outside captioning, it keeps improving for our purposes. It is consequently far easier to find large numbers of people capable of producing broadcast-standard captions, and much quicker to train them, than with comparable methods.
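As a rough illustration of the respeaking idea (and only that), the sketch below uses the open-source SpeechRecognition package to turn a captioner’s respoken audio into text. Unlike the production tools described above, this generic recogniser is not trained to an individual captioner’s voice, and the workflow shown is illustrative rather than a description of our systems.

```python
# Minimal respeaking sketch using the open-source SpeechRecognition
# package (pip install SpeechRecognition pyaudio). This is a generic,
# speaker-independent recogniser, unlike the speaker-trained tools
# captioners actually use.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # calibrate for booth noise
    print("Respeak the programme dialogue now...")
    audio = recognizer.listen(source, phrase_time_limit=10)

try:
    # In a real workflow the respoken text would then be formatted,
    # coloured and timed before transmission as captions.
    text = recognizer.recognize_google(audio)
    print("Caption text:", text)
except sr.UnknownValueError:
    print("Speech not recognised - respeak the line.")
```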

We continue to use speaker-independent software to perform the alignment, or syncing, of text as described above, taking some of the less interesting labour out of the offline captioning process. Speaker-independent speech recognition, or automatic speech recognition (ASR), is also good at identifying pre-existing text from audio – this gives us a way of automatically associating previously created captions with video clips or VOD programmes, a key requirement for broadcasters and content owners today.
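To make the alignment idea concrete, here is a minimal sketch, assuming the ASR engine has already produced a word-level transcript with timestamps (the input format and example data are hypothetical). It matches a trusted caption script against the recognised words using Python’s difflib and copies the timings across; it is a simplification of what a production aligner does.

```python
import difflib

def _norm(word):
    """Normalise a word for matching: lower-case, strip punctuation."""
    return word.lower().strip(".,!?;:\"'")

def align_captions(caption_words, asr_words):
    """Assign timings to a pre-existing caption script by matching it
    against a speaker-independent ASR transcript of the audio.

    caption_words: list of caption words (the text we already trust)
    asr_words:     list of (word, start_sec, end_sec) from the recogniser
    Returns (caption_word, start_sec, end_sec) tuples; words the
    recogniser never matched keep None timings.
    """
    recognised = [_norm(w) for w, _, _ in asr_words]
    script = [_norm(w) for w in caption_words]
    timed = [(w, None, None) for w in caption_words]

    matcher = difflib.SequenceMatcher(a=script, b=recognised, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            _, start, end = asr_words[block.b + offset]
            timed[block.a + offset] = (caption_words[block.a + offset], start, end)
    return timed

# Hypothetical ASR output for a short clip
asr = [("welcome", 0.0, 0.4), ("back", 0.4, 0.7), ("to", 0.7, 0.8),
       ("the", 0.8, 0.9), ("programme", 0.9, 1.4)]
script = ["Welcome", "back", "to", "the", "programme."]
print(align_captions(script, asr))
```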

Speech recognition is now being used for monitoring purposes, checking for the presence of captions, and it offers good potential as an automated QC solution to help maintain quality standards without imposing onerous and costly reviewing obligations on broadcasters or service providers.
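As an illustration of that QC idea, the sketch below is a minimal, hypothetical check: it computes a word error rate between an ASR transcript of the broadcast audio and the caption text that went to air, and flags a segment when the two diverge beyond a threshold or when no captions are present at all. It is the basic comparison such a monitoring system could build on, not a description of one.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

def flag_segment(asr_text, caption_text, threshold=0.4):
    """Flag a segment for human review when captions are missing or
    drift too far from what the recogniser heard in the audio."""
    if not caption_text.strip():
        return "no captions present"
    wer = word_error_rate(asr_text, caption_text)
    return f"review (WER {wer:.0%})" if wer > threshold else "ok"

print(flag_segment("and now the weather for the weekend",
                   "and now the weather for the weekend"))      # ok
print(flag_segment("and now the weather for the weekend", ""))  # no captions present
```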

Using ASR to create text from scratch is a far harder task, particularly in the TV environment, where the language within content is so eclectic and unpredictable. Speech recognition thrives on preparation and on knowing what’s coming, and live TV, or TV in general, rarely works that way. It’s certainly not impossible though, and work we are doing with academic and industry partners around the world suggests that, by breaking language down into very specific genres and by having access to large volumes of high-quality training material, you can begin to see a way to achieve levels of quality which will reduce the effort and cost of captioning even further.
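To show what breaking language down into genres can mean in practice, here is a small, hypothetical sketch: it builds a per-genre vocabulary from training transcripts, which could then be used to bias a recogniser towards the words a given programme type is likely to contain. The genre names, transcripts and biasing step are illustrative assumptions, not our production approach.

```python
from collections import Counter

# Hypothetical training transcripts, grouped by programme genre.
TRAINING_TRANSCRIPTS = {
    "weather": ["outbreaks of rain spreading east with brisk winds",
                "a frosty start then sunny spells for most regions"],
    "football": ["a corner swung in and headed clear at the near post",
                 "the referee waves play on after a strong challenge"],
}

def build_genre_vocabulary(transcripts, top_n=1000):
    """Count word frequencies per genre to produce a biasing vocabulary:
    the words a recogniser should favour for that type of programme."""
    vocab = {}
    for genre, lines in transcripts.items():
        counts = Counter(word for line in lines for word in line.lower().split())
        vocab[genre] = [word for word, _ in counts.most_common(top_n)]
    return vocab

vocabularies = build_genre_vocabulary(TRAINING_TRANSCRIPTS)

# A recogniser that accepts phrase or vocabulary hints could be primed
# with vocabularies["weather"] before transcribing a weather bulletin,
# rather than relying on one general-purpose model for all of TV.
print(vocabularies["weather"][:5])
```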

We’re a long way off fully automated captioning. But speech recognition and other automated technologies such as machine translation and semantic analysis offer the industry ever greater potential and must not be ignored. And for captioners, these technologies could enable a shift from dialogue transcription towards content curation, allowing more time to enhance caption data with other useful metadata as content owners seek to make their programmes ever more discoverable and commercially valuable.

David Padmore, Head of Access Services