Audio Description using the Web Speech API

When HTML5 was published, it introduced the <video> and <audio> elements, as well as the <track> element. The <track> element provides a standard means of synchronizing text with media for a variety of purposes. The HTML5 spec specifically defines five kinds of track: captions, subtitles, chapters, metadata, and descriptions. The last of these is particularly interesting, and is the topic of this post.

Description, historically known as "audio description" or various other terms, is a supplemental narration track designed for people who are unable to see the video. It describes content that is presented visually, and is needed if that content is not otherwise accessible via the program audio. Historically we've outsourced description to professionals, but with prices starting at $15 per video minute, we've never gotten the kind of participation we need if video everywhere is going to be fully accessible.

With HTML5, description can now be written in any text editor. All five kinds of tracks, including description, are written in the same file format: WebVTT, which is essentially a time-stamped text file. Imagine that you have a video that begins with a short interview with someone notable, say the president of your university. The president's name and affiliation appear visually in an on-screen graphic but they're never specifically identified in the program audio, so a non-visual person has no idea who's speaking and whether to take them seriously. This is a really simple use case, but common among videos I see in higher education: The video can easily be made accessible by creating a WebVTT file that includes the speaker's name and affiliation, with start and end times indicating when this text should be spoken. There's a bit of thought that must go into timing, as you want to avoid colliding with other important audio, but otherwise it's a really simple task. The result would be a file that looks something like this:


WEBVTT

00:00:05.000 --> 00:00:08.000
Ana Mari Cauce, President,
University of Washington

Save that as a VTT file. Done. Thirty seconds and your video has description.
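
To make the format concrete, here is a rough TypeScript sketch of how a script might turn a file like that into timed cues. It is an illustration under simplifying assumptions, not how Able Player or any browser actually parses WebVTT: the `DescriptionCue` type and the `parseTimestamp` and `parseCues` helpers are hypothetical names, and the sketch only handles the simple case of a header followed by cues separated by blank lines.

```typescript
// Hypothetical shape for one description cue; times are in seconds.
interface DescriptionCue {
  start: number;
  end: number;
  text: string;
}

// Convert "hh:mm:ss.ttt" (or "mm:ss.ttt") into seconds.
function parseTimestamp(stamp: string): number {
  const parts = stamp.trim().split(":").map((part) => Number(part));
  if (parts.length < 2 || parts.length > 3 || parts.some((n) => Number.isNaN(n))) {
    throw new Error(`Bad timestamp: ${stamp}`);
  }
  // Fold from the left: hours, then minutes, then seconds.
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// Split a WebVTT document into cues. Blocks without a timing line
// (such as notes) are skipped.
function parseCues(vtt: string): DescriptionCue[] {
  const blocks = vtt.replace(/\r\n?/g, "\n").trim().split(/\n{2,}/);
  if (!blocks[0].startsWith("WEBVTT")) {
    throw new Error("Not a WebVTT file: missing WEBVTT header");
  }
  const cues: DescriptionCue[] = [];
  for (const block of blocks.slice(1)) {
    const lines = block.split("\n");
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue;
    const [startStamp, rest] = lines[timingIndex].split("-->");
    const endStamp = rest.trim().split(/\s+/)[0]; // ignore any cue settings
    cues.push({
      start: parseTimestamp(startStamp),
      end: parseTimestamp(endStamp),
      // Join multi-line cue text into a single string to be spoken.
      text: lines.slice(timingIndex + 1).join(" ").trim(),
    });
  }
  return cues;
}

const exampleCues = parseCues(
  "WEBVTT\n\n00:00:05.000 --> 00:00:08.000\nAna Mari Cauce, President,\nUniversity of Washington\n"
);
// exampleCues[0] is
// { start: 5, end: 8, text: "Ana Mari Cauce, President, University of Washington" }
```

In a browser you would normally not parse the file by hand at all; pointing a <track kind="descriptions"> element at it lets the browser load the cues for you.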

With text-based description, it's easy to edit the content or the timing (not true if the description narration is mixed into the video). Plus, computers can read it themselves; we don't need to hire professional voice actors to do that. This makes description affordable and easy to do for most of the videos we're using in academia, and it increases the likelihood that video owners will actually do it.

But how can text-based description be delivered to users?

Reading Description with Screen Readers

Able Player, my free, open source HTML5 media player, has provided support for HTML5 description tracks for many years (since soon after HTML5 was published). Originally, it did so by writing the description content to an ARIA live region at the designated start time. An ARIA live region is an element on a web page coded with aria-live="polite" or aria-live="assertive". When content within that element changes, screen readers announce the change.
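
As a rough illustration of the live-region approach, the TypeScript sketch below writes a description into a live region when playback enters its time window. This is a sketch under my own assumptions, not Able Player's actual code: `LiveRegionLike`, `TimedDescription`, `initLiveRegion`, and `updateLiveRegion` are hypothetical names, and the element is described by a tiny interface so the logic can be read (and run) without a browser.

```typescript
// Minimal element shape so the sketch does not depend on DOM typings;
// a real HTMLElement satisfies it.
interface LiveRegionLike {
  textContent: string | null;
  setAttribute(name: string, value: string): void;
}

// One description with its start and end times in seconds.
interface TimedDescription {
  start: number;
  end: number;
  text: string;
}

// Mark the element as a polite live region. Do this once, up front,
// so the screen reader is already watching it when text arrives.
function initLiveRegion(region: LiveRegionLike): void {
  region.setAttribute("aria-live", "polite");
}

// Call on every time update from the player. Writes the active
// description (or clears the region) and returns true if the text changed.
function updateLiveRegion(
  region: LiveRegionLike,
  descriptions: TimedDescription[],
  currentTime: number
): boolean {
  const active = descriptions.find(
    (d) => currentTime >= d.start && currentTime < d.end
  );
  const text = active ? active.text : "";
  // Only touch the region when the text actually changes, so the same
  // description is not rewritten on every tick.
  if (region.textContent === text) return false;
  region.textContent = text;
  return true;
}

// Possible browser wiring (the "description" id is a made-up example):
//   const region = document.getElementById("description")!;
//   const video = document.querySelector("video")!;
//   initLiveRegion(region);
//   video.addEventListener("timeupdate", () =>
//     updateLiveRegion(region, descriptions, video.currentTime));
```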

Given the above example, the speaker's name and affiliation are scripted to begin 5 seconds into the video. So at that time, "Ana Mari Cauce, President, University of Washington" would be written to a live region, and would be announced by the user's screen reader. We've gotten a lot of positive feedback about this feature over the years, but it does have some interesting problems:

  1. It assumes the individual who needs description is using a screen reader. This is a reasonable assumption, but not necessarily true.
  2. Some videos have very little available quiet time for injecting description content. The solution for this problem is called extended audio description: When description starts, the video pauses, then playback resumes when the description is finished. Able Player has supported this for years, but it's always been problematic because we have no way of knowing when the user's screen reader is finished reading the description. Most users I know crank up the speed of their screen readers and read at a downright dizzying rate. However, we can't assume that's true of everyone, and even if we did, how fast is fast enough? And how fast is too fast? If we pause the video at the start time indicated in the VTT file, then resume playback again at the end time, we're trusting all description authors to accurately guess how much time it will take for screen readers to read their description. In Able Player, we opted not to guess, and instead users must manually resume playback using the spacebar. This might not be a huge deal, but it could be pretty inconvenient if there's a lot of description.
  3. Screen readers really shouldn't be burdened with this. A sighted user can do other things while watching a video, and screen reader users might want to do so as well. However, if their screen reader is in the middle of reading a description and the user presses any key, that interrupts the description. Depending on which key they pressed, the video will likely continue to play but it will no longer be accessible.
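
The guessing game in item 2 can be made concrete with a little arithmetic. The TypeScript sketch below estimates how long a description takes to speak at a given rate and compares that with the cue's time window. The `estimateSpeechSeconds` and `fitsInWindow` helpers are hypothetical, and the words-per-minute figures are illustrative assumptions, not measurements of any screen reader.

```typescript
// Rough estimate of speaking time in seconds at a given words-per-minute rate.
function estimateSpeechSeconds(text: string, wordsPerMinute: number): number {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length;
  return (words / wordsPerMinute) * 60;
}

// Does the estimated speaking time fit between the cue's start and end?
function fitsInWindow(
  text: string,
  start: number,
  end: number,
  wordsPerMinute: number
): boolean {
  return estimateSpeechSeconds(text, wordsPerMinute) <= end - start;
}

const descriptionText = "Ana Mari Cauce, President, University of Washington";

// Seven words in a three-second window (5s to 8s):
fitsInWindow(descriptionText, 5, 8, 150); // true: roughly 2.8 seconds of speech
fitsInWindow(descriptionText, 5, 8, 120); // false: roughly 3.5 seconds of speech
```

The same text fits or overruns depending entirely on a rate the author can't know in advance, which is why resuming playback at the scripted end time is a gamble.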

Reading Description with the Web Speech API

Earlier this month, Able Player 4.0 was released. It includes several major new features, including robust support for accessible media playlists, support for Vimeo, and—most relevant to this blog post—a new way of delivering text-based description.

The Web Speech API makes it possible to have both speech synthesis and speech recognition within a website, provided by the web browser. The latter is still not well supported among browsers, but speech synthesis has widespread support (all major browsers have supported it for a while, the lone exception being Internet Explorer).

Able Player performs a quick check to see if the user's browser supports speech synthesis. If it doesn't, Able Player falls back to the old way of doing things using ARIA live regions. But if it does, the browser reads the description text in a pre-defined synthesized voice, keeping the screen reader free to do other things. Also, the Web Speech API fires an event when an utterance has finished, so we know exactly when the browser is done reading and can automatically resume playback at that point. Therefore, automatic extended description is now much more feasible than before.
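
Here is a minimal TypeScript sketch of that flow: check for support, fall back if it's missing, and otherwise pause the video, speak the description, and resume when the utterance ends. It is not Able Player's actual implementation. `detectSpeechSupport`, `speakDescription`, and the small interfaces are hypothetical names of my own; the only real browser pieces assumed are the global `speechSynthesis` object, the `SpeechSynthesisUtterance` constructor with its `onend` and `onerror` handlers, and a media element's `pause()` and `play()` methods.

```typescript
// Minimal shapes for the browser objects this sketch touches, declared
// locally so the example does not depend on DOM typings.
interface UtteranceLike {
  onend: (() => void) | null;
  onerror: (() => void) | null;
}

interface SpeechSupport {
  synth: { speak(utterance: UtteranceLike): void };
  createUtterance(text: string): UtteranceLike;
}

interface PausableMedia {
  pause(): void;
  play(): unknown;
}

// Feature check: returns null when the browser (or a non-browser runtime)
// does not expose speech synthesis.
function detectSpeechSupport(): SpeechSupport | null {
  const g = globalThis as any;
  if (!g.speechSynthesis || typeof g.SpeechSynthesisUtterance !== "function") {
    return null;
  }
  return {
    synth: g.speechSynthesis,
    createUtterance: (text: string) => new g.SpeechSynthesisUtterance(text),
  };
}

// Speak one description, pausing the media while it is read. Falls back
// to the supplied function (for example, a write to an ARIA live region)
// when speech synthesis is unavailable.
function speakDescription(
  text: string,
  media: PausableMedia,
  support: SpeechSupport | null,
  fallback: (text: string) => void
): "spoken" | "fallback" {
  if (support === null) {
    fallback(text);
    return "fallback";
  }
  const utterance = support.createUtterance(text);
  const resume = (): void => {
    media.play();
  };
  utterance.onend = resume; // fires when the browser finishes speaking
  utterance.onerror = resume; // don't leave the video stuck if speech fails
  media.pause();
  support.synth.speak(utterance);
  return "spoken";
}

// Possible browser usage (writeToLiveRegion is a made-up fallback):
//   speakDescription(cueText, videoElement, detectSpeechSupport(), writeToLiveRegion);
```

Passing the speech support in as a parameter, rather than reaching for the globals inside `speakDescription`, is purely a choice made for this sketch so the pause/speak/resume ordering can be exercised without a browser.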

This also means description is now available to anyone who wants it. You don't have to be using a screen reader.

The Web Speech API allows a lot of choice (e.g., voice, volume, and rate of speech can all be adjusted). I'm considering adding some speech configuration options to the Able Player preferences dialog, but decided against this in version 4.0 because there's a lot of variation across operating systems and devices in terms of available voices, and some voices are more configurable than others. For now, the voice is automatically set to the voice that tops the list when querying the Web Speech API for available voices in the language of the web page.
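
For illustration, picking the first listed voice in the page's language might look something like the TypeScript sketch below. The `VoiceLike` interface and the `primarySubtag` and `pickVoice` helpers are hypothetical, and matching on the primary language subtag (treating "en-US" and "en-GB" as both matching a page marked "en") is my assumption about what "in the language of the web page" means, not a description of Able Player's code. The voice names in the example are made up.

```typescript
// The two properties of a browser voice that this sketch relies on;
// real SpeechSynthesisVoice objects have both.
interface VoiceLike {
  name: string;
  lang: string;
}

// "en-US" -> "en". Tolerates either "-" or "_" as the separator.
function primarySubtag(tag: string): string {
  return tag.toLowerCase().split(/[-_]/)[0];
}

// Return the first voice in list order whose language matches the page
// language, or undefined if none does.
function pickVoice<V extends VoiceLike>(
  voices: readonly V[],
  pageLang: string
): V | undefined {
  const wanted = primarySubtag(pageLang);
  return voices.find((voice) => primarySubtag(voice.lang) === wanted);
}

const exampleVoices: VoiceLike[] = [
  { name: "Voice A", lang: "de-DE" },
  { name: "Voice B", lang: "en-US" },
  { name: "Voice C", lang: "en-GB" },
];

pickVoice(exampleVoices, "en"); // the "Voice B" entry, the first English voice listed
pickVoice(exampleVoices, "fr"); // undefined, no French voice available

// Possible browser usage:
//   const voice = pickVoice(speechSynthesis.getVoices(), document.documentElement.lang);
//   if (voice) utterance.voice = voice;
//   utterance.rate = 1;   // speaking rate, 1 is the default
//   utterance.volume = 1; // 0 to 1
```

One wrinkle to be aware of: in some browsers `speechSynthesis.getVoices()` can return an empty list until the voices have loaded, so a real implementation may need to wait for the `voiceschanged` event before choosing.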

If anyone uses Able Player 4.0 or higher to watch video that has text-based description, I'd love to get your feedback. Feel free to comment on this post, or contact me privately via my Contact form.
