Evolving the MediaWiki platform: Why we replaced Tidy with a HTML5 parser

We’ve replaced Tidy, a tool that fixes HTML errors, with a HTML5-based tool. Given a HTML5 library, replacing Tidy is pretty straightforward, but it took us three years to finally flip the switch. In this blog post, we’ll explore how the time was spent which also throws light on the complexities of making changes to certain pieces of the technical infrastructure powering Wikimedia wikis.

This switch was a collaborative effort between the Parsing, Platform, Community Liaison teams and other individuals (from the Product side) at the Wikimedia Foundation and volunteer editing communities on various wikis. Everyone involved played important and specific roles in getting us to this important milestone. In this post, the authors use “we” as a narrative convenience to refer to some subgroup of people above.

First: Some background

What is Tidy?

Tidy is a library that fixes HTML markup errors, among other things. It was developed in the 1990s when HTML4 was the standard and different browsers did not deal with ill-formed HTML markup identically leading to cross-browser rendering differences.

Badly formed markup is common on wiki pages when editors use HTML tags in templates and on the page itself. (Example: unclosed HTML tags, such as a <small> without a </small>, are common). In some cases, MediaWiki itself introduces HTML errors. In order to deal with this, Tidy was introduced into MediaWiki around 2004 to ensure the output of MediaWiki was well-formed HTML and to ensure it renders identically across different browsers.

Tidy played a very important role in MediaWiki’s early development by reducing the complexity of the core wikitext parser since the parser didn’t have to worry about generating clean markup and could instead focus on performance. It also freed us from having to write and maintain a custom solution which was important given that most MediaWiki development then relied on volunteers.

Why did we replace it?

The technological landscape of today is very different from 2004, the early days of MediaWiki. First, HTML5 is the standard today, and the parsing algorithm for HTML5 is clearly specified, which has led to compatible implementations across browsers and other libraries. This algorithm also clearly specifies how broken markup should be fixed.

HTML4 Tidy was no longer maintained for a number of years. It was based on HTML4, so it drifted away from the latest HTML5 recommendations. Additionally, it creates changes that are unrelated to fixing markup errors. For example, it removes empty elements and adds whitespace between HTML tags, which can sometimes change rendering.

While Tidy has now been revived as tidy-html5 to make it compatible with HTML5, this version continues to do additional fixups that are orthogonal to fixing up markup errors. Given that, there are compelling infrastructural reasons why we picked our home-grown solution.

We want to control the maintenance of this tool so that we choose when and how to upgrade it. This is important because we want to control changes to wikitext behavior and be prepared for that change.
In MediaWiki’s support for visual editing, when edited HTML is converted back to wikitext, spurious changes to wikitext have to be avoided. To support this, Parsoid (an alternate wikitext parser slated to be the default over the next couple years) uses a standard HTML5 parsing library and relies on being able to precisely track and map generated HTML to input wikitext. Tidy-html5, with the additional changes it makes to the HTML, complicates this and wouldn’t be a suitable HTML5 library for Parsoid.
As MediaWiki itself evolves, we want to use DOM-based strategies for parsing wikitext compared to the current string-based parsing. Tidy does not provide us a DOM currently.
Lastly, we want to retain the flexibility of being able to replace one HTML5 library with another without impacting correctness and functionality. Custom HTML changes can lock us in without easy replacement options (as this current project demonstrates).

So, while Tidy has served us well all these years, it was time to upgrade to a different solution that is more compatible with the technological path for MediaWiki.

What did we replace it with?

After a bunch of experimentation, we eventually replaced Tidy with RemexHtml, which is a PHP implementation of the HTML5 tree-building algorithm. This was developed and is being maintained by the Parsing and Platform teams at the Wikimedia Foundation. RemexHtml draws on (a) the domino node.js library that is used by Parsoid and (b) earlier work developing Tidy-compatibility passes in Html5Depurate which was the original Tidy replacement solution.

Since Tidy has been enabled on Wikimedia wikis since 2004, wiki markup on all these wikis have come to subtly depend on some Tidy functionality. In order to ease the transition out of Tidy, MediaWiki implements some Tidy-compatibility code on top of RemexHtml. For example, while RemexHtml does not strip empty elements, MediaWiki tags them with a CSS class so wikis can choose to hide them to mimic Tidy’s stripping.

What made this replacement difficult?

There were a bunch of issues that made this project fairly difficult:

We needed to identify a suitable HTML5 library to replace Tidy.
We had to build the testing infrastructure to accurately assess how this change would affect page rendering on wikimedia wikis.
We had to then address the impact of any changes we identified during testing.

Identifying a suitable replacement library

Firstly, there were no suitable implementations of the HTML5 tree building algorithm in PHP. The only real candidate was html5-php, but it doesn’t implement some key parts of the spec. So, we initially narrowly focused this on what would work for wikis on the Wikimedia cluster versus what would make sense for MediaWiki as a software package. By late 2015, we had Html5Depurate, which was a wrapper on top of validator.nu, a Java HTML5 library. T89331 documents this discussion for those interested in the details.

To run ahead of the narrative a bit, but to bring this topic to completion, in the 2016–2017 timeframe, two independent and somewhat orthogonal efforts on the Parsing team coalesced into RemexHtml, a PHP implementation of the HTML5 parsing algorithm. We adopted this as the Tidy replacement solution since it had good performance and wider applicability beyond Wikimedia.

Testing infrastructure to identify impacts on page rendering

Since wikis have come to depend on Tidy, if we were to replace Tidy with a HTML5 tool, we expected rendering on some pages would change in some way. To assess this impact, we set up a mass visual diffing infrastructure with two MediaWiki instances (running in Cloud Services): one using Tidy and another using Tidy’s replacement. These instances ran a multi-wiki setup with 60k+ pages from 40 wikis. On a third server, we fetched pages from both these VMs, used PhantomJS to snapshot these pages, and used UprightDiff to identify differences in rendering and assign that difference an actionable numeric score.

By end of May 2016, after some rounds of fixes and tests, we found multiple categories of rendering differences, and while 93% of pages were unaffected in our test subset, the 7% of pages affected was more than what we had anticipated. So, it was clear to us that wiki pages had to be fixed before we could actually make the switch. So, this now brought to the fore the third issue above: how do we actually make this happen?

What tools and support did we provide editors / wikis?

We were left with three problems to address:

Identify which pages need fixing
What needed fixing in those pages
How fixes could be verified

In practice, our work didn’t follow this clean narrative order, but nevertheless, we ended up providing editors two tools, ParserMigration and Linter, which addressed these problems.

ParserMigration

In order to let editors figure out how any particular page would be affected by the change, around July 2016 we developed the ParserMigration extension, which lets editors preview pages side-by-side with Tidy and with Tidy’s replacement (originally Html5Depurate, now RemexHtml). This lets them edit the page and verify that the edit eliminates any rendering differences by showing them updated previews.

Linter

Parsoid has the ability to analyze HTML and identify problematic output and then map it back to the wikitext that produced it. Based on this, a GSoC student had prototyped a linting tool in 2014, and in October 2016 we decided to develop that prototype into a production-ready solution for the Tidy replacement project. Through late 2016 and early 2017, we built the Linter extension to hook into MediaWiki, receive linter information from Parsoid, and display wikitext issues to editors on their wikis via the wiki’s Special:LintErrors page.

We analyzed the earlier visual diff test results and added linter categories in Parsoid to identify wikitext markup patterns that could cause those rendering differences. We started off with three linter categories in July 2017 and eventually ended up with nine high-priority linter categories by January 2018 based on additional testing and feedback from early deployment on some wikis.

Ongoing community engagement and phased deployment

In parallel with developing tooling, from late 2016 through mid-2017 we worked on a plan to engage editors on various wikis to fix pages and templates in preparation for Tidy replacement. We prepared a FAQ, started writing Linter help pages, developed a deployment plan and timeline, polished and deployed RemexHtml, Linter, ParserMigration, drafted a public announcement about this upcoming change and sent it out on a couple of mailing lists on July 6, 2017 and Tech News. We provided wikis a one-year window to start making changes to their wikis to prepare for the change.

We planned a phased replacement of Tidy on wikis based on status of potential rendering problems as identified by Linter. We continued engaging with wikis in their forums, on MediaWiki, and Phabricator. English Wikipedia tests helped us find a rendering change we had overlooked.

We identified Italian and German Wikipedias as two large early adopter wikis and with their consent, switched them over on December 5th 2017. This early deployment gave us very good feedback—both positive and negative. German deployment went flawlessly and Italian deployment exposed some gaps and let us identify additional Linter categories for flagging pages that needed fixing. Deployment to Russian Wikipedia in January 2018 forced us to change wikitext semantics around whitespace to reproduce Tidy behavior.

Editors and volunteers for their part developed help pages and additional tools to help fix pages. Starting April 2018, for English Wikipedia, we helped individual wikiprojects with lists of pages that need fixing, and we contacted wikis with the highest numbers of errors and helped connect them to volunteers and information about how to resolve the errors in advance. One of us even fixed 100s of templates on 10s of wikis (primarily small wikis) to nudge them along.

All along, we continued to gather weekly stats on how wikis were progressing with fixing pages, and also ran weekly visual diff tests on live wiki content to collect quantitative data about how this reduced rendering changes on pages. We continued to post occasional updates on the wikitech-ambassadors list and Tech News to keep everyone informed about progress.

Summary

This ongoing community engagement, communication, testing, monitoring, and phased deployment effort was crucial in letting us meet our one-year deployment window for wikis to switch over with a minimum of disruption to readers. Equally importantly, an active embrace by various wikis of this effort has let the Foundation make this much-needed and important upgrade of a key piece of our platform.

What next? What does this enable?

Since the preferred implementation of TemplateStyles has a dependency on RemexHtml, unblocking its deployment is the most immediate benefit to editors of the switch to RemexHtml. Some bug fixes in MediaWiki have benefited from RemexHtml’s real HTML5 parsing, and we hope that editors will find the use of standard HTML5 parsing rules a boon when chasing down rendering issues with their own wikitext. Going forward, there are two parallel efforts that benefit from RemexHtml: balanced templates, which could make output more predictable and faster for readers and editors, and a planned port of Parsoid from Node.js to PHP and the final replacement of the legacy Parser. Eventually, as we start thinking of ways to evolve wikitext for better tooling, performance, reasoning ability, and fewer errors, DOM-based solutions will be very important.

But all that is in the future. For now, we are happy to have successfully reached this milestone!

Subbu Sastry, Principal Software Engineer
Tim Starling, Lead Platform Architect
Wikimedia Foundation

Read more From the archives, HTML, MediaWiki, MediaWiki, Technology, Tidy, error, html5, technical infrastructure

News

Evolving the MediaWiki platform: Why we replaced Tidy with a HTML5 parser

First: Some background

What is Tidy?

Why did we replace it?

What did we replace it with?

What made this replacement difficult?

Identifying a suitable replacement library

Testing infrastructure to identify impacts on page rendering

What tools and support did we provide editors / wikis?

ParserMigration

Linter

Ongoing community engagement and phased deployment

Summary

What next? What does this enable?

Read further in the pursuit of knowledge

New interaction timeline improves investigation of harassment cases

What should journalists know about Wikipedia? A Poynter Institute NewsU webinar

California Supreme Court upholds critical intermediary liability provision

Help us unlock the world’s knowledge.

Contact us

Follow

Photo credits