A Behind-the-Scenes Peek at Automatic Captions on YouTube

On November 19, 2009, Google announced that it was launching automatic captions on YouTube. The University of Washington recently joined the list of early testers of this feature, so I thought I’d offer a sneak peek at how it works.

The typical YouTube video player has a button in the lower right corner that, for a while now, has allowed users to hover with a mouse and either turn on captions or automatically translate captions into any of nearly 50 supported languages. Both of these features obviously depend on the video already being captioned. (Unfortunately, that button in the lower right corner does require a mouse. It can’t be triggered by keyboard alone in any browser.)

What’s new is that this same button (on select videos in select channels) now includes an option to "Transcribe Audio". If you select that option, Google will do its best to automatically transcribe the video. You can sample this yourself with almost any video on the uwhuskies YouTube channel.

[Screenshot of a YouTube video, including a mouse cursor hovering over the caption menu]

What about accuracy, you ask? Well, it’s right about where I would expect it to be. It does better with people speaking clearly in high-fidelity acoustic environments than it does in any other scenario, and unfortunately most of our videos fall into the latter category. As a result, an interview with a scientist who says "We’re projecting sea levels rising on the beach" is transcribed as "We’re protecting seals’ right to free speech". Anyone who’s ever used Dragon NaturallySpeaking or similar products knows that speech recognition technology can produce some very interesting results.

The hope here is that with massive amounts of new data coming in, Google’s speech recognition technology will improve over time. However, my hunch is that in order to learn from its mistakes, Google needs to know which words it has transcribed inaccurately, and how those words or phrases should have been transcribed. For this to happen, I suspect that we users need to diligently edit the automated captions and upload them back to Google. By doing this, we’ll be helping to accelerate the viability of the service. And if the automated captions are accurate at all, we’ll be able to edit them and get actual usable captions more quickly than if we were to transcribe our videos manually from scratch.

On the administrative side, when you are logged in to your YouTube account, and if your account supports automatic captioning, you will see a link to "Machine Captions" somewhere within the "Captions and Subtitles" area for a particular video. Selecting that link displays your captions in a grid. Each row of data includes start time, end time, and caption text. When I first saw this, I excitedly thought Google was providing a caption editor, but it currently doesn’t support direct editing. The data is read-only, and provides a way to quickly navigate the video. Clicking on a caption makes the video jump to that point and start playing.

[Screenshot of the YouTube admin view, with captions displayed in a grid]

Google does, however, provide a "Download" link, which allows account owners to download the machine-generated .sbv caption file, which looks like this:

My name is on the hill

I’m a professor of medicine at the technology
and literature and this and that university

This file can then be imported into your favorite captioning tool, edited, and uploaded back to YouTube as the official caption track, replacing the machine-generated caption track.
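If you would rather script the cleanup than use a captioning tool, the round trip can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions, not Google's own tooling: it assumes each caption in the .sbv file is a timing line of the form `H:MM:SS.mmm,H:MM:SS.mmm` (start time, then end time) followed by one or more lines of text, with a blank line between captions. The names `Cue`, `parse_sbv`, and `format_sbv` are hypothetical, and the timestamps in the sample are invented for illustration.

```python
import re
from dataclasses import dataclass

# Assumed timing line: "H:MM:SS.mmm,H:MM:SS.mmm" (start, end).
TIMING = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)


@dataclass
class Cue:
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str     # caption text, possibly spanning several lines


def _seconds(h, m, s, ms):
    """Convert the captured timestamp parts to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def _stamp(seconds):
    """Convert seconds back to the assumed H:MM:SS.mmm form."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h}:{m:02d}:{s:02d}.{ms:03d}"


def parse_sbv(content):
    """Split caption text into cues; blocks without a timing line are skipped."""
    cues = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue
        match = TIMING.match(lines[0].strip())
        if not match:
            continue
        parts = match.groups()
        cues.append(
            Cue(_seconds(*parts[:4]), _seconds(*parts[4:]), "\n".join(lines[1:]))
        )
    return cues


def format_sbv(cues):
    """Serialize cues back to text, one blank line between captions."""
    blocks = (f"{_stamp(c.start)},{_stamp(c.end)}\n{c.text}" for c in cues)
    return "\n\n".join(blocks) + "\n"


# Hypothetical sample: the caption text is from the file above,
# but the timestamps are made up for this example.
sample = (
    "0:00:01.000,0:00:03.500\n"
    "My name is on the hill\n"
    "\n"
    "0:00:03.500,0:00:07.250\n"
    "I’m a professor of medicine at the technology\n"
    "and literature and this and that university\n"
)

cues = parse_sbv(sample)
# Correct cue.text here by hand or by script, then write the result out.
corrected = format_sbv(cues)
```

Working with cues as (start, end, text) rows mirrors the grid in the admin view, so you can fix the wording without touching the timing that Google already worked out.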

It’s not perfect. We still don’t have a quick, easy, and cheap way to accurately transcribe and caption all our videos. But it’s a start, and I look forward to keeping an eye on this, and using it, as it matures.