Constant vs Variable Bit Rate MP3 Encoding and Timed Text

HTML5 introduces all sorts of exciting possibilities with timed text that can be synchronized with media. The new <track> element, used in conjunction with <audio> and <video>, makes it possible to add captions, subtitles, descriptions, chapters, and metadata to your media. As a musician I’m particularly interested in synchronizing text with audio to create, for example, karaoke-style lyrics to accompany my songs (follow the bouncing ball!). I have even cooler ideas than that, but I’m not quite ready to share them yet (please stay tuned).

One problem I discovered, however, is that the audio sometimes gets out of sync with the timed text, especially when users scrub ahead or back. Troubleshooting this was maddening, but after many attempts I finally isolated the cause: variable bit rate MP3 encoding.

Constant vs Variable Bit Rate Encoding

When encoding audio or video, there is typically an option to use either Variable Bit Rate (VBR) or Constant Bit Rate (CBR) encoding.

VBR uses an algorithm to efficiently compress the media, varying between low and high bit rates depending on the complexity of the data at a given moment. CBR, in contrast, compresses the file using the same bit rate throughout. VBR is more efficient than CBR, and can therefore deliver content of comparable quality in a smaller file size, which sounds attractive, yes?

Unfortunately, there’s a tradeoff if the media is being streamed (including progressive download), especially if timed text is involved. As I’ve learned, VBR-encoded MP3 files do not play back with dependable timing if the user scrubs ahead or back.
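One plausible explanation, offered here as an assumption rather than something established in this post, is that a player without a seek index guesses the byte offset for a requested time by assuming the bytes are spread evenly over the duration. The sketch below models that guess for MPEG-1 Layer III frames (1152 samples each, roughly 144 × bit rate / sample rate bytes, padding ignored). The function names and the two made-up bit rate profiles are hypothetical, and real browsers may seek quite differently.

```python
import bisect

SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III


def build_frames(bitrates_kbps, sample_rate=44100):
    """Lay out one frame per bit rate; return (start_times, start_bytes, duration, size)."""
    times, offsets = [], []
    t, offset = 0.0, 0
    for kbps in bitrates_kbps:
        times.append(t)
        offsets.append(offset)
        # Approximate frame size in bytes (padding bit ignored).
        offset += 144 * kbps * 1000 // sample_rate
        t += SAMPLES_PER_FRAME / sample_rate
    return times, offsets, t, offset


def seek_error(bitrates_kbps, target_s):
    """Seconds between the requested time and the frame actually landed on,
    ASSUMING the player guesses the byte offset by linear interpolation."""
    times, offsets, duration, size = build_frames(bitrates_kbps)
    guessed_byte = size * target_s / duration
    # Last frame that starts at or before the guessed byte.
    index = bisect.bisect_right(offsets, guessed_byte) - 1
    return times[index] - target_s


# Two hypothetical files with the same frame count and the same total size.
cbr = [128] * 10000
vbr = [64] * 5000 + [192] * 5000  # quiet first half, busy second half

print(seek_error(cbr, 100.0))
print(seek_error(vbr, 100.0))
```

In this model the CBR seek lands within one frame (about 26 ms) of the requested time, while the VBR seek lands tens of seconds away, because the linear guess ignores how unevenly the bytes are distributed.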

I created a test page that demonstrates the problem.

The test page includes an audio mix featuring the lovely and talented vocalist Vicki, who counts off the seconds for five minutes, accompanied by a groovy dance beat. The mix is encoded into various audio formats using both VBR and CBR. In all browsers, all mixes play reliably from start to finish if you don't scrub. However, in the VBR MP3 mixes, if you scrub ahead and back a few times, the time announced by Vicki (which is the correct time) drifts out of sync with the time reported by the media player. Chrome seems to be better than the other browsers at keeping it close, but the rest get way out of sync, off by as many as ten seconds in my observations.

This problem appears to be unique to MP3. Another format, Ogg Vorbis—which is supported by Firefox, Opera, and Chrome—is a VBR format by design. However, browsers that support this format do not exhibit the same disconnect they exhibit with the VBR MP3.

All of my test files were encoded using Audacity, with one exception. To test whether this is a problem with the MP3 format itself, or with the MP3 file as encoded by Audacity, I encoded a second VBR MP3 using MediaHuman Audio Converter, and it too has the same problem. This isn’t a definitive test, however. Both tools use FFmpeg to do their encoding, and FFmpeg uses the LAME MP3 Encoding Library. The latter has a ton of options, so the devil may very well be in the details. Isolating the problem further is a bigger job than I have time to tackle.

Implications for Timed Text

Timed text cues are triggered based on the current time as reported by the HTML5 media API (specifically the currentTime property). So the timed text will be in sync with the media player, but if the media player is reporting the current time inaccurately, the timed text will be displayed at the wrong times relative to the media file.
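A minimal sketch of that dependency, written in Python rather than against the real browser API: a cue is treated as active while its start time is at or before the reported time and its end time is after it. The `Cue` class, the helper names, and the one-cue-per-second track (mimicking Vicki's count) are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def active_cues(cues, current_time):
    """Texts of the cues that are active at the reported current time."""
    return [c.text for c in cues if c.start <= current_time < c.end]


def displayed_text(cues, true_position, clock_error):
    """What is shown when the audio is really at true_position but the
    player reports true_position + clock_error as the current time."""
    return active_cues(cues, true_position + clock_error)


# Hypothetical track: one cue per second, labelled with the second it covers.
cues = [Cue(float(i), float(i + 1), str(i)) for i in range(300)]

print(displayed_text(cues, 42.5, 0.0))   # player clock is accurate
print(displayed_text(cues, 42.5, 10.0))  # player clock is ten seconds ahead
```

With an accurate clock the cue for second 42 is shown while second 42 is being sung. With a ten-second error the cue for second 52 is shown instead, even though the cue lookup itself is behaving correctly.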

This is a huge problem, but a problem with a simple solution: Only use CBR MP3 (with Ogg Vorbis as a fallback option) if your audio application includes timed text.

What about video?

This problem appears to be unique to audio. I conducted a similar test of video using MP4 (H.264) and WebM files, encoded at both VBR and CBR, and experienced no problems with any of these files in any browser.

One thought on “Constant vs Variable Bit Rate MP3 Encoding and Timed Text”

  1. Good catch! Glad you're highlighting this, although it isn't entirely surprising - MP3 audio is just a sequence of sync frames, so there's no navigation info to help the decoder get to a particular point in the timeline. It would be different if it were an MPEG Transport Stream or Program Stream.

    I believe there's navigation info in the video formats you mention, although I don't know for sure. Ask the experts at Zencoder/Brightcove/VideoJS, or Matt Ward from Google (@therealwardo).