When HTML5 was published, it introduced the <video> and <audio> elements, as well as the <track> element. The <track> element provides a standard means of synchronizing text with media for a variety of purposes. The HTML5 spec specifically defines five kinds of track: captions, subtitles, chapters, metadata, and descriptions. The last of these is particularly interesting, and is the topic of this post.
Description, historically known as "audio description" or various other terms, is a supplemental narration track designed for people who are unable to see the video. It describes content that is presented visually, and is needed if that content is not otherwise accessible via the program audio. Traditionally we've outsourced description to professionals, but with prices starting at $15 per video minute, we've never gotten the kind of participation we need if video everywhere is going to be fully accessible.
With HTML5, description can now be written in any text editor. All five kinds of tracks, including description, are written in the same file format: WebVTT, which is essentially a time-stamped text file.
Imagine that you have a video that begins with a short interview with someone notable, say the president of your university. The president's name and affiliation appear visually in an on-screen graphic but they're never specifically identified in the program audio, so a viewer who can't see the screen has no idea who's speaking and whether to take them seriously. This is a really simple use case, but common among videos I see in higher education: The video can easily be made accessible by creating a WebVTT file that includes the speaker's name and affiliation, with start and end times indicating when this text should be spoken. There's a bit of thought that must go into timing, as you want to avoid colliding with other important audio, but otherwise it's a really simple task. The result would be a file that looks something like this:
WEBVTT

00:00:05.000 --> 00:00:08.000
Ana Mari Cauce, President,
University of Washington
Save that as a VTT file. Done. Thirty seconds and your video has description.
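If you have more than a handful of cues, or your timings live in a spreadsheet, you might prefer to generate the file rather than type it. Here is a minimal Python sketch of that idea. The function names (format_timestamp, build_vtt) and the cue list are my own illustrations, not part of any library; the only format rules it relies on are that a WebVTT file starts with a WEBVTT header line and that cues are separated by blank lines.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_vtt(cues):
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    blocks = ["WEBVTT"]  # required header line
    for start, end, text in cues:
        if end <= start:
            raise ValueError("cue end time must be after its start time")
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # A blank line separates the header and each cue
    return "\n\n".join(blocks) + "\n"


# Hypothetical cue list: the single description from the example above
cues = [(5.0, 8.0, "Ana Mari Cauce, President,\nUniversity of Washington")]
vtt_text = build_vtt(cues)
print(vtt_text)
```

Write vtt_text to a file with a .vtt extension (UTF-8 encoded) and you have the same description file as the hand-written one. The script does not check whether a cue collides with important program audio; that judgment still has to come from someone watching the video.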
With text-based description, it's easy to edit the content or the timing (not true if the description narration is mixed into the video). Plus computers can read it themselves; we don't need to hire professional voice actors to do that. This makes description affordable and easy to do for most of the videos we're using in academia, and it increases the likelihood that video owners will actually do it.
But how can text-based description be delivered to users?