You may have heard the headlines: “We’ve reached human parity” (Microsoft, October 2016), on the back of an accuracy of over 94%; Google openly planning to compete with Nuance, the developer of Dragon; Amazon attempting to revolutionise access to the internet via Echo and Alexa. It seems like everyone’s in the speech recognition game. Surely the end is nigh for traditional methods of creating captions?
I think captioners can rest easy for a good while yet — for a few simple reasons. The first is simply the scale of the task that regulators and audiences set the captioner; typically a pre-recorded programme must be captioned 100% accurately, and a live show should hit at least 98%. Taking the pre-recorded example, how hard can that be for a machine? Surely there’s all the time in the world to get it right?
Consider what 100% actually means. Not only does every word have to be identified and spelt correctly (no mean feat on a show such as Mastermind, where deliberately obscure questions can trigger equally obscure and possibly wrong answers), but the result also has to work as readable text. Imagine writing down every word you utter during any given day; would you go for something akin to the dialogue in a play, accurate with all its ‘disfluencies’ (those crutch-like ‘Ums’ and ‘Errs’ that let your brain change gear whilst letting your mouth free-wheel)? Do you talk in nice, tidy grammatical sentences? Do you pause neatly for mental punctuation? I guessed as much. If you simply transcribe such speech verbatim you’ll get a very accurate representation of the words uttered, but that won’t make for comprehensible captions, and the text could well go by too fast to read.
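To put a number on targets like these, one common yardstick in speech recognition is word-level accuracy: count the fewest word substitutions, insertions and deletions needed to turn the machine's output into the reference text, then divide by the length of the reference. The sketch below is my own illustration of that idea, not the method any regulator uses to score captions; the function names are made up for the example, and it makes the simplifying assumption that punctuation stuck to a word counts as part of that word.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Fewest word substitutions, insertions and deletions needed to
    turn the hypothesis into the reference (edit distance over words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] = edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """1.0 means every word matched; can go below 0 if the hypothesis
    contains many extra words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - word_errors(reference, hypothesis) / ref_len


if __name__ == "__main__":
    # Illustrative quiz-show answer and a plausible misrecognition.
    reference = "the capital of burkina faso is ouagadougou"
    hypothesis = "the capital of burkina faso is wagga do goo"
    print(word_errors(reference, hypothesis))
    print(f"{word_accuracy(reference, hypothesis):.1%}")
```

A single obscure proper noun is enough to drag a short sentence a long way below 98%, and a measure like this says nothing about whether the surviving words read well as a caption.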
Speech recognition also thrives on good quality audio; not just a clear voice, but an absence of echo, background noise, music and so forth. It is possible with care and a complex workflow to ensure that the music and the speech remain separate in a recording — but that doesn’t help with poor acoustics or a duff recording. Much more research is needed to assist with improving ASR in complex audio environments — and we’re helping a PhD student at the University of Edinburgh to research precisely this.
The automatic insertion of punctuation is in its infancy; some inroads have been made by our research partners at Edinburgh, using techniques more commonly found in Machine Translation. Whilst ASR uses a largely probability-based approach to working out what’s been said, punctuation needs something more rule-based. Questions are another matter entirely. Most languages let you ask a question without the formality of question words, so cadence can be a good indicator for some speakers, but that’s not a universal rule.
Identifying speaker changes is another area that needs more research; for many of our clients we need to be able to identify accurately either a change of speaker (denoted by chevrons or a change in text colour) or the speaker themselves. Whilst automated ‘diarisation’ reaches good levels of accuracy, it doesn’t yet meet the standard required for broadcast.
Does this mean we can’t use ASR at all? I think not. Not all content is the same; it’s not all shouty gameshows, talkshows where each guest cuts across everyone else and sports output captured in the open, with the roar of the crowd and the rumble of the music bed. Some material is recorded cleanly, with professional speakers speaking at a moderate pace on a subject matter with plenty of background data to assist with the more tricky terms. If we have enough of this kind of data we can train ASR engines to make a pretty good job of transcription. We can utilise the vast archives of media with matching captions to create speech recognition engines, punctuation models and caption ‘translation’ systems to replicate the kind of output that a human could produce. We can then use audio ‘alignment’ tools to break this transcription up into readable blocks and time-align them to the original speaker’s voice, leading to fully automated captions.
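As a rough illustration of that last step, the sketch below takes a transcript that has already been time-aligned word by word and groups it into caption blocks. Everything specific here is an assumption made for the example rather than a description of any real tool: the `(word, start, end)` input format, the `build_captions` name, the 37-character limit and the one-second pause threshold are all placeholders.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]     # (text, start in seconds, end in seconds)
Caption = Tuple[float, float, str]  # (start in seconds, end in seconds, text)


def _close(block: List[Word]) -> Caption:
    """Turn a run of words into one caption spanning first start to last end."""
    return (block[0][1], block[-1][2], " ".join(w[0] for w in block))


def build_captions(words: List[Word],
                   max_chars: int = 37,      # assumed caption width
                   max_gap: float = 1.0      # assumed pause threshold, seconds
                   ) -> List[Caption]:
    """Group time-aligned words into caption blocks.

    A new block starts when adding the next word would exceed max_chars,
    or when the speaker pauses for longer than max_gap.
    """
    captions: List[Caption] = []
    current: List[Word] = []
    for word in words:
        if current:
            length = len(" ".join(w[0] for w in current)) + 1 + len(word[0])
            pause = word[1] - current[-1][2]
            if length > max_chars or pause > max_gap:
                captions.append(_close(current))
                current = []
        current.append(word)
    if current:
        captions.append(_close(current))
    return captions


if __name__ == "__main__":
    aligned = [
        ("Good", 0.0, 0.3), ("evening", 0.3, 0.8), ("and", 0.9, 1.0),
        ("welcome", 1.0, 1.5), ("Tonight", 3.0, 3.5),
    ]
    for start, end, text in build_captions(aligned, max_chars=20):
        print(f"{start:5.1f} -> {end:5.1f}  {text}")
```

A production system would also want to break at punctuation and clause boundaries and to hold each caption on screen long enough to be read, which is exactly where the punctuation models mentioned above come in.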
No doubt if I review this article in ten years’ time I’ll cringe at the bold assertions made about the progress of automated captioning, but I feel confident that genres such as comedy will remain a bastion of human-generated captioning even in 2027. Comedy is typically based around word play, incongruity and surprise. Speech engines are most comfortable with the opposite of this — they know what they’ve been trained on, and a new comic turn of phrase will almost certainly bring about an unintentionally comic transcription. I’m pretty sure that a human captioner will be wrestling with the likes of Have I Got News For You for many years to come.
Matt Simpson, Head of Product Management, Access Services, Broadcast and Media Services