Accessible Documents in HTML, Word, and PDF

Yesterday I gave a couple of presentations at the Digital Accessibility Expo, a wonderful event in its second year, hosted by the University of Illinois at Chicago. Slides for my presentations are here:

The latter presentation, with the full title "Accessibility of Online Instructional Tools and Documents", included a demonstration of a set of mock physics syllabi created in HTML, Microsoft Word, and Adobe PDF in order to test how well each of these formats supports certain accessibility features, and how well these features are preserved when converting from one file format to another.

The complete file set is provided below, with links to each file and an explanation of its accessibility features or lack thereof. All files contain exactly the same content and visible structure. The goal of course is to translate that visual structure into an actual underlying semantic structure that assistive technologies can understand. So far I've only tested these files with JAWS 11 (additional tests with Window-Eyes and NVDA are forthcoming), but I think the JAWS tests give us a pretty good indication of the current accessibility of the file formats.

Accessible HTML version

HTML is historically the most accessible of the three file formats, especially when factoring in screen reader support for accessibility features. Therefore, this HTML file is the benchmark by which all other files in this set should be measured.

  • Human language: Can be defined (using the lang attribute) for the overall document or for specific elements within the document. Since this sample document includes both, JAWS reads this document in English until it discovers a French paragraph at the end, at which point it switches into French mode.
  • Alt text: The banner image includes alternate text (alt="Accessible University"). JAWS identifies this as an image and announces its alt text.
  • Headings: Headings are explicitly identified in this document using <h1> and <h2> elements. Screen readers provide a mechanism for jumping between these headings, so users can get a quick appreciation of the outline of the page and/or jump directly to the content that meets their needs.
  • Lists: Course objectives are organized using an unordered list. As soon as the list gets focus, JAWS announces it as a list with three items, which is helpful information so users know how many list items to expect.
  • Tables (Simple): The Class Schedule is organized using a simple data table, with table headers marked up using <th scope="col">. As JAWS users navigate through the table using JAWS table keys (e.g., Alt+Control+arrows), JAWS announces the column headers associated with each data cell.
  • Tables (Complex): Grades are organized using a complex data table (with nested columns). Given its complexity, it requires additional accessibility markup, including a summary attribute (JAWS reads this upon entering the table); plus id attributes on each <th> and headers attributes on each <td>. The headers attributes provide a space-separated list of ids for each header that is associated with that data cell. With this information, JAWS announces all relevant headers as users navigate through the table using JAWS table keys, which enables users to understand the complex relationships communicated by this table.
  • Validation: This file validates to the XHTML Strict 1.0 specification.
  • File size: A mere 5 kb
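
To make the features above concrete, here is a minimal sketch of markup along the same lines. It is not the actual syllabus file: the image file name, page title, list items, table contents, and summary text are all placeholders assumed for illustration, and the Grades table uses a hypothetical two-level header layout rather than the real one.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- Document language: English -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Sample syllabus (placeholder title)</title>
</head>
<body>
  <!-- Alt text on the banner image (file name is a placeholder) -->
  <p><img src="banner.png" alt="Accessible University" /></p>

  <!-- Headings marked up as real headings -->
  <h1>Sample Syllabus</h1>

  <h2>Course Objectives</h2>
  <!-- Unordered list: announced as a list with three items -->
  <ul>
    <li>First objective (placeholder)</li>
    <li>Second objective (placeholder)</li>
    <li>Third objective (placeholder)</li>
  </ul>

  <h2>Class Schedule</h2>
  <!-- Simple table: scope="col" ties each header to its column -->
  <table>
    <tr>
      <th scope="col">Week</th>
      <th scope="col">Topic</th>
    </tr>
    <tr>
      <td>1</td>
      <td>Placeholder topic</td>
    </tr>
  </table>

  <h2>Grades</h2>
  <!-- Complex table: id on every th, headers on every td -->
  <table summary="Placeholder summary explaining how the table is organized.">
    <tr>
      <th id="student" rowspan="2">Student</th>
      <th id="exams" colspan="2">Exams</th>
    </tr>
    <tr>
      <th id="midterm">Midterm</th>
      <th id="final">Final</th>
    </tr>
    <tr>
      <td headers="student">Student A (placeholder)</td>
      <td headers="exams midterm">Placeholder score</td>
      <td headers="exams final">Placeholder score</td>
    </tr>
  </table>

  <!-- Language change for a single element -->
  <p lang="fr" xml:lang="fr">Paragraphe en français (texte fictif).</p>
</body>
</html>
```

Each headers value lists the ids of every header cell that applies to that data cell, which is what lets a screen reader announce both "Exams" and "Midterm" for the same score.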

Microsoft Word version

This file demonstrates the potential of Word to explicitly communicate structure and relationships between elements. It has many of the same capabilities as HTML, but falls short on table accessibility.

  • Human Language: In Word 2003 the default language of a document is defined with Tools > Language. In Word 2007, it's defined under Review > Proofing, where there's a button labeled "Set Language". Language can be defined for the entire document, or for selected text. JAWS fully supports this when reading Word documents.
  • Alt text: The banner image includes alternate text. In Word 2003, this is accomplished by right clicking on the image and selecting "Format Picture", then "Alt Text". This is different (and arguably less intuitive) in Word 2007, where alt text is assigned by right clicking on the image and selecting "Size", then "Alt Text". JAWS reads the alt text as the image is encountered within the flow of the Word document.
  • Headings: Headings are explicitly identified by selecting "Heading 1" and "Heading 2" styles from Word's format menu. JAWS supports navigation among headings in Word using QuickKeys: Insert+Z toggles QuickKeys mode, then the "h" key navigates among headings just as it does in HTML.
  • Lists: Course objectives are organized using an unordered list using Word's bullet button. JAWS precedes each item in a bulleted list by saying "bullet" but it doesn't call a list a list, nor does it announce the number of items in the list as it does in HTML.
  • Tables: In both of the data tables in this document, one or more rows of column headers can be identified by selecting the row of headers, right clicking, selecting "Table Properties", selecting the "Row" tab, and checking the box labeled "Repeat as header row at the top of each page". We can't do the same with row headers (which might, for example, be located in the first column), nor can we explicitly associate individual data cells with headers, nor can we attach a summary to the table (all of these things are possible in HTML). JAWS users can read tables linearly but with no real accessible markup inherent in Word docs, complex tables are especially challenging to understand.
  • File size: 52 kb

Microsoft Word 2007 version

This file has the same features as the previous file, but was saved as a Word 2007 .docx file rather than a Word 2003 .doc file. This has no apparent effect on accessibility, but the file is significantly smaller (34 kb).

Adobe PDF - Scanned Image version

This file was scanned and saved as PDF using Adobe Acrobat. It is simply an image, with neither text nor structural markup. When opening this file, Adobe Acrobat asks users whether they would like to try to convert the document to text. However, this process should be conducted by the author or distributor (e.g., library), rather than burdening the user with it. This is especially true when the audience may include screen reader users, who are unable to correct errors that occur in the conversion to text.

Adobe PDF - Scanned Document version

This file was scanned and saved as PDF using Adobe Acrobat, but in doing so was identified as a document, so the conversion to text happened automatically. This is more accessible than the preceding version since there is actual text for screen readers to read. However, there is no structural markup (no explicit human language, no alternate text on images, no headings, no lists, and no accessible table markup).

Adobe PDF - Converted from Word

This is a tagged PDF file, created from the semi-accessible Word file described above. It was exported using the Adobe PDFMaker Plugin, which is added to the Microsoft Office menu and toolbar when Adobe Acrobat (Standard or Professional) is installed over an existing installation of Office. Note that this is different from printing to a PDF file, which does not produce a tagged PDF. Also, it is currently only available in Windows. Word for Mac cannot produce tagged, accessible PDF files directly.

  • Human Language is not passed to the PDF. In Adobe Acrobat, if we select File > Properties > Advanced, we find that Language is unspecified.
  • Alt Text: The alternate text for the image is passed to the PDF, and is read by JAWS.
  • Headings: The <h1> and <h2> headings are passed to the PDF, and JAWS supports them just as it does in HTML.
  • Lists: The unordered list structure that is used to organize the Course Objectives is passed to the PDF, and JAWS supports it just as it does in HTML.
  • Tables: For both data tables in this sample file, by checking the box labeled "Repeat as header row at the top of each page" within Word, each table header is in fact passed to the PDF as a table header. However, PDF also provides support for the scope, id, and headers attributes, just like HTML; and none of this information is available in the Word file, and therefore is not available in this PDF. Consequently, JAWS does not identify column headings as we navigate through the table using JAWS table keys. Even more critically, JAWS misreads the complex Grades table entirely. For each data cell, it erroneously announces the header that is one column to the right, which in this context could have a critical adverse effect on the student's performance in the class.

Accessible Adobe PDF version

This file was created by touching up an existing PDF file using the Advanced > Accessibility features within Adobe Acrobat. It is possible to attain this result with any of the inaccessible PDF documents described above, but the more accessible a document is from the outset, the fewer steps are required to make it fully accessible. Therefore, we started with the version created from Word using the PDFMaker Plugin. Starting from that version, only three additional steps were required:

  • Human Language (Document): The language of the document can be identified within File > Properties > Advanced.
  • Human Language (Paragraph): To set the language of an individual element, the first task is to find that element within the Tags Panel (View > Navigation Panels > Tags), which is only available within Adobe Acrobat Professional. Within the Tags Panel, navigate the tag structure to the relevant <p> tag, right click on that tag, select Properties, and identify the language of the paragraph as "French". Once done, JAWS now reads the French paragraph in French, as it should.
  • Tables: PDF supports all the same table accessibility features that HTML supports. By right clicking on a table within Adobe Acrobat, we can "Edit Table Summary" or access the "Table Editor", which allows us to identify the id and scope (column, row, or both) of each header cell and (in the complex table) identify which of the header cells is associated with each data cell. Many of these features can also be accessed via Advanced > Accessibility > "Touch Up Reading Order", which is a very useful feature for much more than the name implies, and can be the hub for most accessibility improvements within a PDF document. This process is explained very nicely in the California State University's PDF Accessibility Tutorials. There are some differences in JAWS's support for these features in PDF and in HTML. In PDF, JAWS reads the relevant column headers when id and headers attributes are assigned, as in the complex table. However, without id and headers attributes, even if scope is identified as "Column", JAWS does not announce column headers as users navigate from cell to cell, nor does JAWS announce the table summary when the table first receives focus, as it does in HTML.
  • File size: 51 kb

HTML from Word version

This file was created by converting the above Word document to HTML by selecting Save As > Web Page from within Word 2007.

  • Human Language: The HTML that results from this conversion does indeed have lang attributes, and they're supported by JAWS.
  • Alternate Text: The alternate text for the image is passed to the HTML file, and is read by JAWS.
  • Headings: The <h1> and <h2> headings are passed to the HTML file, and are supported by JAWS.
  • Lists: The bulleted list is not passed as an unordered list to HTML as it was when converting to PDF. Instead, each list item is a paragraph containing a bullet symbol followed by text.
  • Tables: For both data tables, checking the box labeled "Repeat as header row at the top of each page" within Word does not result in table headers being passed to the HTML file as they were when converting to PDF. Consequently, these HTML data tables are not accessible to screen reader users.
  • File size: 81 kb
  • HTML Validation: 284 errors
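
The list and table problems are easier to picture with a simplified sketch. The "before" markup below only illustrates the pattern described above; it is not Word's literal output (which is far more verbose), it assumes the header row arrives as ordinary data cells, and the item and header text are placeholders. The "after" markup shows the kind of hand repair that would restore the semantics.

```html
<!-- Before (simplified illustration): looks like a list, but is only paragraphs -->
<p>&bull; First objective (placeholder)</p>
<p>&bull; Second objective (placeholder)</p>
<p>&bull; Third objective (placeholder)</p>

<!-- After: a real unordered list that screen readers can announce as a list -->
<ul>
  <li>First objective (placeholder)</li>
  <li>Second objective (placeholder)</li>
  <li>Third objective (placeholder)</li>
</ul>

<!-- Before (assumed): the header row is made of plain data cells -->
<table>
  <tr>
    <td>Week</td>
    <td>Topic</td>
  </tr>
</table>

<!-- After: header cells marked up as column headers -->
<table>
  <tr>
    <th scope="col">Week</th>
    <th scope="col">Topic</th>
  </tr>
</table>
```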

Filtered HTML from Word version

This file was created by converting the accessible Word document to HTML by selecting "Save As" > "Web Page, Filtered" from within Word 2007. The result is a file that at 31 kb is considerably less bulky than its unfiltered 81 kb sibling, though still much bulkier than my 5 kb original. Accessibility (both good and bad) is unchanged from the unfiltered version. I tried to validate this file, but it has no doctype and the W3C Markup Validator had no idea what kind of document it was.

Accessible PDF from HTML

This file was created by converting the accessible HTML document to PDF from within Adobe Acrobat, by selecting File > "Create PDF" > "From Web Page". The most critical among the options in this process is in the Settings dialog: A checkbox labeled "Create PDF Tags", which is not checked by default. This creates a tagged PDF, but it does not import all of the accessibility features from the HTML file. Based on what I'm seeing in this particular document, the problem can be summed up with two statements:

  • All accessibility-related HTML elements are successfully converted to PDF tags, including <h1>, <h2>, <ul>, <li>, and <th>, but...
  • No accessibility-related HTML attributes are passed to the PDF, including lang, alt, summary, scope, id, and headers.

Summary

I'm hesitant to reach sweeping conclusions from this test since I'm working with a very simple, one-page document with a one-column layout. I also have said nothing here about DAISY, which for longer documents should also be included in the mix of viable options. Of the three file formats discussed in this post though, I will say this: Unless you have a compelling reason to distribute documents in PDF or Word, use HTML. It's non-proprietary, cross-platform, and as we've seen here, highly supportive of accessibility and well supported by assistive technologies. That said, PDF and Word are improving, and PDF in particular has an underlying tagged structure that is comparable to HTML in many ways. If our results with JAWS can be generalized to reflect the overall state of screen reader support, there is still room for improvement. However, if documents only require basic accessibility features such as alt text and headings, these features are already well supported, and PDF and Word can be as accessible as HTML. Unfortunately, accessibility features in one document format don't necessarily survive the conversion to another, so there's also room for improvement in conversion tools.

6 comments on “Accessible Documents in HTML, Word, and PDF”

  1. Hi Terrill,
    Great presentation at the UIC Digital Accessibility Expo!
    Thanks for posting these resources.

    Terry Morris

  2. Hi Terrill,

    Thanks for posting the information and resources. Great stuff!

    I've been working with PDFs for a while, but haven't worked with Word documents with images until this year. The Word 2007 documents have many images; all images are captioned using the built-in Word 2007 feature for captioning images. A few questions for you:

    1. If the image captions are descriptive, does ALT text need to be added for the image? If yes, is the ALT text unique? In other words, ALT text should not duplicate the text in the caption, is that correct?
    2. How does Acrobat PDF Maker convert the ALT text and captions?
    3. How does JAWS (other screenreaders) read the caption and the ALT text for an image?

  3. Good questions, Deborah. My answers:

    1. The goal with ALT text is to provide non-visual users with access to the content. So if an image communicates information to visual users, that same information needs to be communicated to non-visual users. Ordinarily the method for communicating that information is to add ALT text to the image, but if you're communicating that information with a caption, the ALT text really isn't necessary - it would just be redundant. In HTML, the image should still have an ALT attribute, but it should be empty (i.e., alt=""). This is a standard method for telling screen readers to ignore the image.

    Note that when you add an image in Word (at least in 2007; not sure about 2003) it automatically adds the file name as alt text. If the file name is intuitive, this might actually be better than nothing, but if the filename is IMG_0126.jpg, that's worthless information. Bottom line: Make it a habit to add alt text to your images in Word. If the information isn't already communicated in a caption, add equivalent text; otherwise, delete what Word adds automatically and leave it blank.

    2. When you convert from Word to PDF using the Adobe PDFMaker plug-in, the ALT text is converted as ALT text (a property of the image), and the caption is converted as a paragraph. The same is true if you "Save as Web Page" from Word.

    3. Screen readers typically identify an image as an image, then they announce its alt text. For example, this blog includes a photo of me near the top of the page, to which Blogger by default adds alt="My photo" (which, annoyingly, can't easily be modified within the Blogger dashboard). JAWS announces this as "Graphic My photo". It announces images within PDF the same way. If an image has no ALT text in PDF, or alt="" in HTML, screen readers ignore that image entirely. Since a caption is just converted as a paragraph, that's read along with all other text within the flow of the document. It's not explicitly associated with the image.
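
    As a rough sketch of point 1 in HTML terms (the caption text and the banner file name are placeholders, and the markup is only an illustration of the two cases):

    ```html
    <!-- The caption already conveys the information, so alt is present but empty -->
    <p><img src="IMG_0126.jpg" alt="" /></p>
    <p>Figure 1: Placeholder caption describing what the image shows.</p>

    <!-- No caption: the alt text carries the information instead -->
    <p><img src="banner.png" alt="Accessible University" /></p>
    ```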

  4. Could you help me with a question regarding "headers", NOT "headings" in Word 2007?

    Are headers and footers accessible in Word 2007, and if not, how do I make them so?

  5. Hi Terrill,

    Thanks for these great test cases and information. I'm currently using the accessible PDF version in my work to improve NVDA's support for PDF in Adobe Reader.

    I noticed an error in the Accessible Adobe PDF version. The Homework cell spans two rows both visually in the PDF and according to the markup in the HTML version. However, the PDF markup doesn't indicate a RowSpan for the Homework cell. Instead, the cell beneath it (containing "1") is marked with a ColSpan of 2, which doesn't match the visual appearance or the HTML version.