Converting Word to PDF or HTML: Options for Accessibility

I write a lot. I'm writing this blog post in the rich text editor that's provided with WordPress, and I trust it will output nice clean HTML. This is a good way of working, but my writing often involves much lengthier documents, frequently written in collaboration with others. The tool of choice for these projects tends to be Microsoft Word, and the final document will usually be published either in PDF or HTML.

I also evangelize a lot, encouraging authors to take a few simple steps to ensure the documents they produce are accessible to readers with disabilities. Most authors I work with are writing and editing in Word, and when their document is finished they will publish and distribute it in PDF, motivated primarily by the desire to ensure consistency across platforms, and in some cases to protect it from modification.

I'm also a Mac user, as are many of the people I work with. This blog post documents my quest for Mac-only solutions for converting Microsoft Word documents to accessible tagged PDF and clean, accessible HTML, but my lessons learned on this quest might apply to users of other platforms as well.

First, let's consider what I mean by "accessible document". Here are a few basic features that an accessible document should have:

  • Headings and subheadings that clearly communicate the overall structure of the document. This helps screen reader users to understand how the document is organized, and users can easily and efficiently navigate the document by jumping between headings.
  • Lists, explicitly marked up as lists so screen reader users can easily understand the nature of the content (e.g., this is a list, it has X items in it).
  • Alt text for images so people who are unable to see the images can access the message these images are trying to communicate.
  • Data tables that have explicit markup that enables screen reader users to understand the relationships between column and/or row headers and data cells.
  • Language explicitly identified, both of the document and of any parts that deviate from the primary document language, so screen readers can accurately pronounce the text (this is especially important in mixed language documents such as those used in a foreign language course). In Word 2011 for Mac, this is done by selecting all document text (Command + A), then going to Tools > Language, and choosing the language from the list. Then, repeat this procedure for each block of text that deviates from the default language.
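To make the checklist above concrete, here is a minimal Python sketch that counts these features in an HTML file. It uses only the standard library's `html.parser`; the class name `AccessibilityAudit`, the helper `audit`, and the sample markup are hypothetical names invented for illustration, and the checks are deliberately shallow (they confirm the markup exists, not that it is meaningful).

```python
from html.parser import HTMLParser


class AccessibilityAudit(HTMLParser):
    """Collect a few structural facts about an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []           # heading levels in document order
        self.list_count = 0          # number of <ul>/<ol> elements
        self.images_missing_alt = 0  # <img> elements with no alt attribute at all
        self.header_cells = 0        # number of <th> elements
        self.languages = []          # every lang attribute value found

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.headings.append(int(tag[1]))
        elif tag in ("ul", "ol"):
            self.list_count += 1
        elif tag == "img" and "alt" not in attrs:
            # An empty alt="" is allowed for decorative images, so only
            # a completely absent attribute is counted here.
            self.images_missing_alt += 1
        elif tag == "th":
            self.header_cells += 1
        if attrs.get("lang"):
            self.languages.append(attrs["lang"])


def audit(html_text):
    parser = AccessibilityAudit()
    parser.feed(html_text)
    parser.close()
    return parser


sample = """<html lang="en"><body>
<h1>Title</h1><h2>Section</h2>
<ul><li>One</li><li>Two</li></ul>
<img src="a.png" alt="A chart"><img src="b.png">
<table><tr><th>Name</th></tr><tr><td>Ann</td></tr></table>
<p lang="fr">Bonjour</p>
</body></html>"""

report = audit(sample)
print(report.headings, report.list_count, report.images_missing_alt,
      report.header_cells, report.languages)
```

A script like this only flags what is missing; judging whether the alt text or heading text is actually useful still takes a human reader.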

Here's a sample accessible Word document that I often use in trainings on this topic at the University of Washington. It has been prepared with accessibility in mind, and includes all of the features listed above. It includes an image with alt text, a main heading and multiple secondary headings, an unordered list, two tables with column headings identified, document language identified as English, and a single sentence identified as French.

Regarding alt text, Word 2010 (Windows) and 2011 (Mac) provide two fields (Title and Description) instead of one, with the idea that images sometimes require both a short and long description. This is analogous to the alt and longdesc attributes that have historically been used for this purpose in HTML. In my sample document, I've used both fields for testing purposes.

Regarding data tables, my sample document includes both a simple and complex table. On the simple table, the header row can be flagged as a header row within Table Properties (Row tab), which helps screen readers to understand which cells contain the column headers. The complex table is more challenging to make accessible for screen reader users, and no word processing application, including Word, is up to the challenge. The complex table has multiple rows of nested column headers, and as such requires more extensive markup that explicitly communicates the relationship between each of the column and/or row headers and each of the individual data cells. This level of detail is supported in both HTML and tagged PDF, but not in Word. Therefore, making the complex table accessible after converting to PDF or HTML will always require some post-production work, either in an HTML editor or Adobe Acrobat Pro.
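For readers unfamiliar with what "more extensive markup" means for a complex table, the following Python sketch generates an HTML table with two rows of nested column headers, using the standard HTML `id` and `headers` attributes to tie each data cell to its row header, group header, and sub-header. The function name `build_grouped_table`, the id scheme (`g0`, `g0s1`, `r0`), and the sample data are all assumptions made up for this illustration; they are not taken from the sample document.

```python
from html import escape


def build_grouped_table(groups, rows):
    """Build an HTML table with two rows of nested column headers.

    groups: list of (group_label, [sub_labels])
    rows:   list of (row_label, [values]), one value per sub-column
    """
    top_cells, sub_cells, column_refs = [], [], []
    for g, (group_label, sub_labels) in enumerate(groups):
        group_id = f"g{g}"
        top_cells.append(
            f'<th id="{group_id}" colspan="{len(sub_labels)}">{escape(group_label)}</th>'
        )
        for s, sub_label in enumerate(sub_labels):
            sub_id = f"{group_id}s{s}"
            sub_cells.append(
                f'<th id="{sub_id}" headers="{group_id}">{escape(sub_label)}</th>'
            )
            # Each data column points at its group header and its own header.
            column_refs.append(f"{group_id} {sub_id}")

    body_rows = []
    for r, (row_label, values) in enumerate(rows):
        row_id = f"r{r}"
        cells = [f'<th id="{row_id}">{escape(row_label)}</th>']
        for refs, value in zip(column_refs, values):
            cells.append(f'<td headers="{row_id} {refs}">{escape(str(value))}</td>')
        body_rows.append("<tr>" + "".join(cells) + "</tr>")

    return "\n".join([
        "<table>",
        '<tr><td rowspan="2"></td>' + "".join(top_cells) + "</tr>",
        "<tr>" + "".join(sub_cells) + "</tr>",
        *body_rows,
        "</table>",
    ])


table_html = build_grouped_table(
    groups=[("Fall", ["Midterm", "Final"]), ("Spring", ["Midterm", "Final"])],
    rows=[("Ann", [90, 85, 88, 92])],
)
print(table_html)
```

In the output, the cell holding 88 carries `headers="r0 g1 g1s0"`, which lets a screen reader announce "Ann, Spring, Midterm" before the value. This is the kind of per-cell relationship that Word has no way to record.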

With all this in mind, I conducted the following tests using a few of the most common word processing applications:

  1. For applications other than Word, import my original Word document, and examine whether the structure and accessibility features of the original document are preserved.
  2. Convert the document to PDF and examine whether the document's structure and accessibility features are passed on to the PDF.
  3. Convert the document to HTML and examine whether the document's structure and accessibility features are passed on to the HTML file.

The tools I tested

In addition to Word, I tested Google Docs, Open Office, Libre Office, and Word Perfect. Each of these tools is free and cross-platform except Word Perfect. Corel Word Perfect Office X7 is Windows-only and sells for $250. That's two strikes, and I confess I'm only including this tool for nostalgic purposes. I learned to crawl as a computer user on Word Perfect, so I thought I'd try their 30-day trial and see how they perform these days on accessibility. If their product is more accessible than all the others, perhaps this could be the key that turns their business around and enables them to regain prominence in the word processing market!

Open Office and Libre Office are very similar products, as the latter was forked from the former in 2010. After four years of independent development though, there are a few important differences between the two products as we'll see.

Importing from Accessible Word

Importing Headings

All tools tested use heading styles similar to Word, and the import process successfully preserves the original heading structure.

Importing Lists

Lists are also preserved, although not always as gracefully as headings. Open Office recognizes the list as a list (the unordered list button is pressed when the list has focus), but for my sample document it was unable to display the bullets correctly, and a new bullet had to be selected from the list settings. This was not a problem in either Libre Office or Google Docs. In Word Perfect, the list is displayed correctly as a bulleted list, but the numbered list button is pressed, which I suppose makes my list an ordered unordered list. As it turns out, this ultimately doesn't matter because Word Perfect doesn't export list tags to either PDF or HTML, but I'm getting ahead of myself.

Importing Alt Text

Alt text is, frankly, a bit of a mess. I intend to explore this in more depth in a future blog post, but for now here's a summary of what happens to Title and Description when a Word doc is imported into other tools:

  • Like Word, Google supports two fields named Title and Description, but they don't map accurately to Word's fields. The Title field from the original document is imported into Google's Description field, and Google's Title field is blank. The Description field from the original document is lost.
  • Open Office and Libre Office have two features for specifying alt text. If you right click on an image within the document, the context menu includes a "Description" option which brings up a dialog that includes both Title and Description fields. There's also a separate "Alternative (text only)" field available in the Picture Properties dialog. This latter field actually maps to the Title field. In Open Office, none of these fields is populated from the original Word file after importing. Libre Office has fixed this problem and both Title and Description are correctly populated after import.
  • Word Perfect only has one alt text field, located in an "HTML Options" menu that's accessible from a context menu by right clicking on an image. This field is not populated by either of the alt text fields in the original Word document.

To summarize, Libre Office is the only tool that successfully imports all alt text from the original Word document.

Importing Table Headers

The Table Properties dialog in Google Docs does not include any options related to row or column headers, so this information is lost on import.

All other tools tested include a "Heading rows repeat" option or something similar, accessible within a Table Properties dialog. And for all of these tools, this setting survives the import from Word.

Importing Language

In Google Docs, it's possible to define the primary language of the document (using File > Language in the menu), but it is not possible to define the language of parts. Curiously, after importing my sample document, the primary document language was identified as French, apparently affected by the one French sentence at the bottom of the document. This doesn't seem to be due to position, though, as the same thing happens if the French sentence is moved to the top.

In WordPerfect, the language of the document is correctly identified as English, but the language of the French paragraph did not survive the import.

Open Office and Libre Office provide the best, most intuitive support for multi-lingual documents (better than Word). From the Tools menu, users can select the language for all text, for the current paragraph, and/or for the current selection. After importing my sample Word document, the language of all parts is accurately identified.

Screen shot of Language sub-menu in OpenOffice

Exporting to PDF

In Windows, when you create an accessible document using Microsoft Word 2010, then save as PDF, the default output is a "tagged PDF", which has most of its semantic structure and accessibility features intact. Unfortunately "tagged PDF" is not supported in Word 2011 for Mac. In Word for Mac you can Save as PDF, but the output is not a tagged PDF, and therefore does not have the underlying structure necessary to support accessibility. All of the original document's structure and accessibility features have been lost.

So to date my message has always been: You can create a reasonably accessible document in Word for Mac, but if you need to convert it to PDF you have to do that final step in Windows.

This message is typically met with groans of displeasure from Mac users. And for some, it isn't even an option because they don't have convenient access to Windows. Also, there are some technical problems with this approach as fonts and other styles don't always survive the migration from one platform to the other. So, we desperately need a Mac-only solution for exporting to accessible tagged PDF.

For comparison's sake, Word 2013 for Windows

Although Windows is off the table as an option, it's worth examining the tagged PDF that Word 2013 for Windows produces. It's not perfect, but it's pretty good. The heading structure is preserved, unordered lists are tagged as lists, images include alt text (using the Description field, not the Title field), and table headers are correctly tagged as TH. One problem is that language is not preserved for foreign language content within the document. In my sample document, the overall document is identified as English, but the French paragraph is not identified as French. This is a significant problem for mixed language documents.

The Other Contenders

Google Docs supports downloading to PDF, but the result is not a tagged PDF. All structure and accessibility features have been lost, just as with Word 2011 for Mac.
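A quick first check of whether an exported PDF is tagged can be scripted. In a tagged PDF, the document catalog contains a `/StructTreeRoot` entry and a `/MarkInfo` dictionary with `/Marked true`. The sketch below simply searches the raw bytes for those markers; the function name `looks_tagged` and the sample byte strings are hypothetical. It is a heuristic only: a PDF that stores its catalog in a compressed object stream can hide these markers, so a `False` result does not prove the file is untagged.

```python
import re


def looks_tagged(pdf_bytes):
    """Heuristic: True if the raw bytes mention a structure tree and /Marked true."""
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and is_marked


# Tiny hand-written catalog fragments standing in for real files.
tagged_sample = (
    b"1 0 obj\n<< /Type /Catalog /MarkInfo << /Marked true >> "
    b"/StructTreeRoot 5 0 R >>\nendobj"
)
untagged_sample = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj"

print(looks_tagged(tagged_sample), looks_tagged(untagged_sample))
```

For a real file, read it in binary mode (for example `open("export.pdf", "rb").read()`, where `export.pdf` is a placeholder name) and pass the bytes in. Even when the check passes, the quality of the tags still needs to be reviewed by hand, for instance in the Tags pane of Adobe Acrobat Pro.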

All other tools tested provide an option to export to tagged PDF. It's a checkbox located within the Settings dialog, which is accessible from the dialog that appears after you initiate the export process. In none of these tools is the option checked by default. Once checked, the output from Open Office and Libre Office is the most accessible output of all tools tested. It is comparable to the output produced by Word 2013 for Windows (heading structure is preserved, list structure is preserved, and table headers are identified). The output from these tools differs from Word's output in two significant ways:

  1. The Title field is passed to the PDF as alt text, whereas Word uses the Description field. I think the Title field is more appropriate for this purpose, although in any of these products only one of the two alt text fields is used; the other is lost (again, that's another blog post).
  2. Language is correctly identified for the entire document and for individual foreign language parts.

I should mention that in my sample document there is actually a problem in the heading structure of my Open Office output: The main heading was exported to PDF as H6 rather than H1, despite its clearly using a Heading 1 style within Open Office. This may be a bug that's unique to my document, perhaps a problem with the original Word file or introduced during the import; I tried creating a new test file from scratch and my Heading 1 was correctly exported to H1 in that PDF. That said, this was not a problem with my sample document in Libre Office, so Open Office does seem to be at least partly at fault.

Word Perfect's tagged PDF is not a good one. Heading structure is not passed to the PDF (the main heading is tagged as a <caption> and all others are tagged as paragraphs); alt text is not passed to the image; table headers are all tagged as TD; and the language for the entire document is identified as French. So, Word Perfect is not a serious contender. That said, for nostalgia buffs it's worth noting that it's the only tool that allows users to Save As WordPerfect 4.2, eleven versions of WordStar, and even Microsoft Word 1.0 (even Word doesn't provide that option!).

Screen shot of WordPerfect's Save As dialog, with dozens of legacy versions of classic tools supported

Exporting to HTML

The thought of authoring web pages using a word processing application may make some of you cringe, and rightfully so. Microsoft Word is particularly notorious for generating code that is loaded with seemingly endless lines of proprietary markup even for the simplest of documents. That said, I understand authors wanting to use word processing tools. They have many features that are specifically designed to support writing and editing, and are particularly well-suited for collaborative authoring. Once a document is finished though, how should authors publish it for online distribution? Most authors seem to prefer PDF. I encourage authors to consider exporting to HTML instead, as it's much easier to make accessible and responsive, and it's truly an open standard whereas PDF's openness is debatable. But if we're going to encourage authors to distribute their content in HTML, they need to be able to convert their document easily.

All of the tools tested, including Word 2013 for Windows and Word 2011 for Mac, now provide an option to export to a relatively clean HTML document, without proprietary markup. In recent versions of Word for Windows, this is the "Web page, filtered" option. In Word 2011 for Mac, the naming is a bit more confusing. There are two options for saving a document as a Web Page:

  1. Save entire file into HTML
  2. Save only display information into HTML

The second of these options produces the cleaner output.

In Open Office and Libre Office, there are two methods for saving to a web page:

  1. Select File > Export, then choose XHTML
  2. Select File > Save As, then choose "HTML Document (OpenOffice Writer) (.html)"

The first of these options (XHTML) produces a single file with image data embedded into the HTML. However, this is not well-implemented and from my testing yields buggy results that don't render well across browsers. Save As HTML seems to be the better of the two options.

In Word Perfect, selecting File > "Publish To" > HTML brings up the "Publish to HTML" dialog. This includes an "HTML version" dropdown which provides a list of doctypes including HTML5. There's also a "Use relative font sizing" option. Both of these features are unique among all the tools I tested.

Screen shot of WordPerfect Publish to HTML dialog

With an HTML5 doctype, the HTML that Word Perfect generates is 100% valid, the only tool tested that can make this claim. However, don't get too excited. All headings and lists are tagged as paragraphs; there are no <thead> or <th> elements in the tables; and there are no lang attributes. The only accessibility feature that survives the export to HTML is alt text for images. Valid code isn't necessarily good code.

All other tools generate HTML output with heading structure intact. It is worth noting that Open Office exports the main heading in my sample document as H1, not H6 as it did when it converted the same file to PDF.

Google, Open Office, and Libre Office also generate output with list structure intact. However, Microsoft Word does not. In both Word 2013 for Windows and Word 2011 for Mac, list items are tagged as paragraphs and contain symbol characters for bullets.
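One way to spot these "fake" lists after export is to look for paragraphs whose visible text begins with a bullet-like symbol. The Python sketch below does that with the standard library's `html.parser`. The class name `FakeListFinder` and the set of characters in `BULLET_CHARS` are assumptions for illustration; the exact symbols Word emits vary with the font and list style, so the set would need adjusting against real output.

```python
from html.parser import HTMLParser

# Assumed bullet-like characters: middle dot, bullet, white bullet, small square.
BULLET_CHARS = "\u00b7\u2022\u25e6\u25aa"


class FakeListFinder(HTMLParser):
    """Record <p> elements whose text starts with a bullet-like symbol."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.current_text = []
        self.suspects = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.current_text = []

    def handle_data(self, data):
        if self.in_paragraph:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            text = "".join(self.current_text).strip()
            if text and text[0] in BULLET_CHARS:
                self.suspects.append(text)
            self.in_paragraph = False


exported = (
    '<p><span style="font-family:Symbol">\u00b7</span> First item</p>'
    "<p>Plain paragraph</p>"
    "<ul><li>Real list item</li></ul>"
)

finder = FakeListFinder()
finder.feed(exported)
finder.close()
print(finder.suspects)
```

Each flagged paragraph is a candidate for conversion to a real `<li>` inside a `<ul>`, a step that still has to be done in an HTML editor or a more complete script.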

Regarding alt text, all tools other than Word Perfect have embraced the Microsoft model and offer two fields for alt text, Title and Description. However, there are differences in which fields they use for output and how they deliver them. This is not the same for HTML as it is for PDF. Here's the summary for HTML:

  • Word uses a mashup of Title and Description. The alt attribute includes both, with each prefaced by a label (i.e., "Title:" followed by the title, then "Description:" followed by the description).
  • Google uses the Description field for the alt attribute, and the Title field for the title attribute.
  • Open Office and Libre Office use the Title field only; the Description field is lost.

Word is the only tool that exports table headers as <th> elements. Google doesn't support table headers at all, and Open Office and Libre Office generate HTML in which the header rows are wrapped within a <thead> element, but the actual header cells are tagged as <td>, not <th>.
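Because Open Office and Libre Office at least wrap the header row in `<thead>`, the missing `<th>` elements can be repaired mechanically for simple tables. Here is a minimal Python sketch that rewrites `<td>` cells to `<th scope="col">` inside each `<thead>` block. The function name `promote_header_cells` is hypothetical, and the regular-expression approach assumes well-formed output with no tables nested inside a header; it is not a general HTML rewriter.

```python
import re


def promote_header_cells(html_text):
    """Inside each <thead>...</thead>, turn <td> cells into <th scope="col">."""

    def fix_thead(match):
        block = match.group(0)
        block = re.sub(r"<td\b", '<th scope="col"', block, flags=re.IGNORECASE)
        block = re.sub(r"</td\s*>", "</th>", block, flags=re.IGNORECASE)
        return block

    return re.sub(
        r"<thead\b.*?</thead\s*>",
        fix_thead,
        html_text,
        flags=re.IGNORECASE | re.DOTALL,
    )


before = (
    "<table><thead><tr><td>Name</td><td>Age</td></tr></thead>"
    "<tbody><tr><td>Ann</td><td>30</td></tr></tbody></table>"
)
after = promote_header_cells(before)
print(after)
```

Cells in `<tbody>` are left untouched. A table with multiple rows of nested headers needs more than `scope="col"`, so this only covers the simple case.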

Google is the only tool (other than Word Perfect) that does not support language (somewhat surprising for the maker of Google Translate). There are no lang attributes in Google's output. In contrast, Open Office, Libre Office, and Word each have a lang attribute on the <body> element and another lang attribute on the French paragraph.

Conclusions

Libre Office is the best of the tools tested at generating accessible tagged PDF. Its tagged PDF is even better than Word for Windows, given its support for language. So if you're a Mac user, or a user on any platform who needs to export multi-lingual documents from a word processing application to PDF, Libre Office may be your best bet. It also seems to import Word documents gracefully, so it might be a viable tool even in a collaborative environment where others are using Word. This would of course require more extensive testing.

Neither Google Docs nor Word Perfect exports to an accessible tagged PDF, so neither tool can be taken seriously.

For HTML output, either Libre Office or Microsoft Word could work. Open Office produced identical HTML output to Libre Office in my test, but I favor Libre Office of the two, given Open Office's clumsiness with importing my Word file and the apparent bug in which it exported my H1 as an H6 in the PDF test. If your documents include lists, Libre Office is better than Word since Word botches the lists. But if your documents include data tables, Word is the better choice since Libre Office doesn't mark up HTML table headers with <th>. If your documents include both lists and tables, pick a tool and resign yourself to the fact that you will have to do some work in an HTML editor to correct problems after exporting. And depending on which tool you use, note how it exports alt text and use the appropriate field(s) (again, watch for a future post specifically focusing on this issue).

Styles, Styles, and More Styles!

All of the tools tested produce HTML files that include gobs of inline or internal (embedded) CSS styles. But I don't want any styles! When publishing to HTML, authors often will be publishing their documents to an existing website where style is defined in a CSS file. This is a strong use case for separation of content and structure from presentation. It would be great if tools would provide an option to "export content and structure only" but so far I haven't found one.
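As a rough idea of what an "export content and structure only" step could look like, here is a Python sketch that removes `<style>` elements and `style`/`class` attributes while passing everything else through, including `alt`, `lang`, and `<th>`. It uses only the standard library; the names `StyleStripper`, `strip_styles`, and `DROPPED_ATTRS` are hypothetical, and the choice of which attributes count as presentational is an assumption. Comments are dropped as a side effect, and the output is not guaranteed to be byte-for-byte identical to the input outside the removed parts.

```python
from html import escape
from html.parser import HTMLParser

DROPPED_ATTRS = {"style", "class"}  # assumed to be purely presentational


class StyleStripper(HTMLParser):
    """Re-emit HTML without <style> elements or style/class attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_style = False

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")  # keep the doctype

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
            return
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name not in DROPPED_ATTRS
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        if tag != "style":
            self.handle_starttag(tag, attrs)  # emit <img ...> with no end tag

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_style:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_styles(html_text):
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


styled = (
    '<html lang="en"><head><style>p {color:red}</style></head><body>'
    '<h1 class="T" style="color:red">Hi &amp; bye</h1>'
    '<img src="a.png" alt="Chart" style="width:5px">'
    '<p lang="fr" class="x">Bonjour</p></body></html>'
)
cleaned = strip_styles(styled)
print(cleaned)
```

The point of the sketch is that stripping presentation and preserving accessibility markup are separable: nothing about removing styles requires discarding `alt`, `lang`, or heading and table structure.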

I've tried using filters such as Word2CleanHTML.com, Word HTML Cleaner, and Atrise ToHTML, but none of these tools produces accessible output. They do clean up Word's code, but they do so at the cost of accessibility.

Another tool that may have once offered a solution is Virtual 508. This tool was originally developed at the University of Illinois and was formerly known as the Accessible Web Publishing Wizard for Microsoft Office. Unfortunately it doesn't seem to be actively developed anymore, as the home page proudly boasts of a new release that now supports Office 2007. I sent them an inquiry to check on the status of the product but I never received a reply.

What about Word 2015 for Mac?

Maybe, just maybe, the much-anticipated next version of Word for Mac will export to tagged PDF and generate good clean HTML that minimizes the need for post-export processing. Please stand by, and hope (and advocate!) for a working solution from Microsoft.

My Test Files

In case you want to do some testing of your own, here are all the files I produced while doing this research:

5 comments on “Converting Word to PDF or HTML: Options for Accessibility”

  1. Thanks for this, T. I will be reading back through this again today and may have some specific comments and questions.

    The last time I did a similar review with MS Office for the Mac (v 11, SP1) was three years ago and my findings were similar to those reported here. I too recommended folks use LibreOffice on the Mac if they wanted to create accessible documents. I also recall getting very inconsistent results regarding the doc to pdf conversions. It is also "disappointing" that Office on the Mac does not work with VoiceOver.

    BTW, what did you use to "test" the PDFs for accessibility? There are some known bugs in the accessibility checker for Adobe Acrobat Pro that I do not believe have (yet) been fixed. At least one of them had to do with the language of the document (reporting it was missing when it wasn't).

    • Hi @John, to test the PDFs I reviewed the tag structure using the Tags Pane in Adobe Acrobat Pro. Regarding language specifically, I knew which paragraph had been marked as French, so I checked the properties for that paragraph tag, and checked the overall document language in File > Properties > Advanced.

  2. Yet another option for converting from Word to HTML is the "Paste from Word" button in CKEditor. After copying and pasting the content from my sample Word document, then viewing the source, I see that headings, lists, and table headers are preserved; and alt text is a Word-like mashup of Title and Description. The only thing that's missing is a lang attribute on the French paragraph.
