Cleaning up Word’s HTML with Regular Expressions

Today I'm celebrating Independence Day by declaring independence from presentational HTML markup!

In my previous blog post I explored strategies for converting Microsoft Word docs to accessible PDF and HTML. For HTML, I found that Word produces a relatively clean HTML file if you save to "Web page, filtered" in Word 2010 or 2013 for Windows or "Save only display information into HTML" in Word 2011 for Mac.

The HTML that Word produces preserves heading structure, includes image Title and Description together as a combined alt attribute, preserves the language of the document and of any foreign-language parts, and preserves table headers.

However, there are two problems with Word's output:

  1. Lists are tagged as paragraphs.
  2. There is no way to suppress style output; consequently the document contains embedded CSS plus width, align, and other presentational attributes that are designed to preserve the visual appearance of the document.

Regarding the latter problem, I don't want any presentational markup in my document! I plan to plug the content into a website that already has an external style sheet. I want complete separation of content and structure from presentation.

I tried several online Word to HTML converters without satisfaction. Even HTML Tidy (e.g., HTML Tidy Online) doesn't seem to have an option to achieve my desired level of tidiness. The only solution I've found is to manually strip out all the unwanted markup using an HTML editor. This sounds tedious, but most HTML editors let you search and replace using regular expressions, which greatly streamlines the process.
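To give a flavor of the kind of search and replace involved, here is a minimal sketch in Python rather than in an HTML editor. It is not the exact set of expressions from the full post. The list of attributes to strip and the function name strip_presentation are my own assumptions for illustration, and the patterns are deliberately simple: they assume no attribute value contains a ">" character, and they make no attempt to repair lists that Word tagged as paragraphs.

```python
import re

# Attributes treated as presentational. This list is an assumption for
# illustration; adjust it to match what your own Word output contains.
PRESENTATIONAL_ATTRS = (
    "style", "class", "width", "height", "align", "valign",
    "border", "cellpadding", "cellspacing", "bgcolor",
)

# An embedded <style>...</style> block, including its contents.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>", re.IGNORECASE | re.DOTALL)

# Any opening tag, e.g. <p class=MsoNormal align="center">.
# Assumes attribute values never contain ">".
OPENING_TAG = re.compile(r"<[a-zA-Z][^>]*>")

# One presentational attribute with a double-quoted, single-quoted,
# or unquoted value, together with the whitespace in front of it.
ATTRIBUTE = re.compile(
    r"\s+(?:" + "|".join(PRESENTATIONAL_ATTRS) + r")"
    r"""\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def strip_presentation(html):
    """Remove embedded CSS and presentational attributes from an HTML string."""
    html = STYLE_BLOCK.sub("", html)
    # Only touch text inside opening tags, so body text such as
    # "set width=100" is left alone.
    return OPENING_TAG.sub(lambda m: ATTRIBUTE.sub("", m.group(0)), html)
```

Calling strip_presentation on the contents of a filtered Word file should leave structural attributes such as lang and alt untouched, since they are not in the list. Regular expressions are not a real HTML parser, so it is worth checking the result by eye before publishing.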

Continue reading

Converting Word to PDF or HTML: Options for Accessibility

I write a lot. I'm writing this blog post in the rich text editor that's provided with WordPress, and I trust it will output nice clean HTML. This is a good way of working, but often my writing involves much lengthier documents, and often I'm writing in collaboration with others. The tool of choice for these projects tends to be Microsoft Word, and often the final document will be published either in PDF or HTML.

I also evangelize a lot, encouraging authors to take a few simple steps to ensure the documents they produce are accessible to readers with disabilities. Most authors I work with are writing and editing in Word, and when their document is finished they will publish and distribute it in PDF, motivated primarily by the desire to ensure consistency across platforms, and in some cases to protect it from modification.

I'm also a Mac user, as are many of the people I work with. This blog post documents my quest for Mac-only solutions for converting Microsoft Word documents to accessible tagged PDF and clean, accessible HTML, but my lessons learned on this quest might apply to users of other platforms as well.

Continue reading

Accessible Dropdown Menus Revisited

Back in March 2012 I wrote a blog post titled Accessible Dropdown Menus summarizing my observations of various accessible dropdown menu models, including Suckerfish, Son of Suckerfish, Superfish, Dropper Dropdown, UDM4, Simply Accessible, YUI MenuNav Node Plugin, and the Menubar widget example developed by the Open Ajax Alliance. Of all these, I liked the Open Ajax Alliance (OAA) example the best. Here's the original OAA menubar, and my customized OAA Menubar. Over the last 21 months, my thinking has evolved a bit, and I've done quite a lot of further experimenting. Since lots of folks have stumbled upon and are referencing my earlier tests, I thought I should post an update documenting how my thinking has changed. Here goes...

Continue reading

Choosing a Web Portal: NetVibes vs Protopage

Given the death of Google Reader in July and the imminent death of iGoogle in November, I've been shopping for alternatives. I need a single service that will serve as my dashboard and web portal, providing me with news updates, RSS feeds, and convenient access to bookmarked websites all from a single location. It needs to be cloud-based, browser-agnostic, free, and ad-free.

I've been combing over online reviews and soliciting input from trusted friends and colleagues for months, and a few weeks ago I finally narrowed my focus to two services, NetVibes and Protopage. Since then I've been using both together in separate tabs in various browsers. And finally, I've made my decision.

Here are my impressions based on the features that matter most to me.

Continue reading

Spam and Literature

I've long been an enthusiast of collage in various media: visual art, literature, music. This fall I hope to unveil my latest musical project, a grand musical mashup that I've been actively working on for over a year, but my interest in such things originated much earlier. Back in the '80s I was into William S. Burroughs's literary cut-ups, and as a computer guy I've come to appreciate the vast potential for digital creativity in this space. For example, check out The Morality Rock Story: Defending Urination, political art inspired (and produced) by Dragon NaturallySpeaking. There's a well-developed and fascinating course on this topic on remixthebook.com.

I've recently come to appreciate the contribution that spammers are making to this artistic genre. In an effort to slip past spam filters, they're producing digital cut-ups from a wide variety of source materials and combining them in ways that are sometimes quite intriguing, maybe even... beautiful. I'm not alone in thinking this. There's an entire movement of people who are into "spam lit", as examined in theguardian.com.

This blog post is inspired by the spam email I received this morning. The focus on college basketball, and the insightful quote from Hofstra's web design manager, are on target given my interests and profession. I wonder if that's coincidence or intelligence? Perhaps they're tapping into the same data about individuals that Google and Facebook are using to provide targeted advertising. Anyway, this is fascinating reading, and I feel compelled to share it. Hopefully the author won't sue me. The subject of the email was Introducing, but I'm opting to title this work...

Continue reading