Molly White
Software engineer, editor and arbitrator on Wikipedia, feminist, Twitter bot commander, unabashed cat lady.


The best way for me to learn a new programming language is to use it. I don’t mean use it to write some artificial programming exercise (sorry, FizzBuzz), but to use it to write something that I find interesting, challenging, and useful. When I decided to pick up Python, my first project was an IRC bot framework. I had a lovely time checking out other Python IRC bots[1], reading up on Python’s sockets module, and writing some fun commands for the bot. The bot is still in development (read: on the back burner), and my main focus has shifted to a new side project: a wikitext-to-LaTeX parser.

One day I was working on my pet proofreading project on the English Wikisource: the Pentagon Papers. I must have glanced at the “Download as PDF” link in the sidebar, because curiosity led me to try to determine what the project would look like in a nice PDF document. For those of you less familiar with the Wikisource project, Wikisource faces an unusual challenge in that it stores books, whereas most wikis store articles. Because of this, large projects sort of hack together a couple of pages into “chapters,” all stored as subpages of one main page that’s often used as a table of contents. Each of these subpages is just a portion of the entire work; larger works are not stored on a single page to prevent massive loading times and difficulty in editing the content.

I clicked the button and was surprised to find that the whole project (currently at about 626 proofread or verified pages)[2] rendered in a few seconds. “Download the file” showed me why: the Collection extension, which is used to generate these PDFs, had ever-so-helpfully provided me with a two-page-long PDF. The first page contains all of the text from the header, and… nothing else. Not even the two-page-long table of contents. The second page is the automatically generated license page that is appended to all of the Collection files. I assumed something had gone wrong, so I tested it by trying to download a PDF version of a couple of other multi-page documents. Sure enough, each one had the same issue. For wikis like Wikisource where multiple pages are transcluded into a single work, Collection is severely limited. If I wanted to create a PDF of the Pentagon Papers, I’d have to manually click “Add to book” for each of its 7,000+ pages.

This is where the idea for Wikisource-to-LaTeX was born. Originally I just thought about writing a script to download each of the pages of a Wikisource book. I wasn’t quite sure what I’d want to do with a folder full of wikitext, so I decided to combine that problem with my adoration of LaTeX and regular expressions, and perhaps a bit of masochism, to try my hand at building a parser. After working on it for two months, I’ve reached the following conclusion: parsing wikitext is HARD.

My original plan was simple. Build a huge collection of regular expression patterns, then find and replace. It didn’t take me terribly long to discover the flaws in this plan. One of the main issues is that there’s no good way to perform a large number of these operations at once. Instead, I’d have to pass through the document many times to perform all of the replacements, something that would be extremely inefficient.
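To make the inefficiency concrete, here is a hypothetical sketch of that first plan (the patterns and the LaTeX they emit are illustrative examples, not the project’s actual rules). Every rule costs a full pass over the entire document:

```python
import re

# A hypothetical find-and-replace table: wikitext pattern -> LaTeX.
# \uline assumes LaTeX's ulem package; these rules are examples only.
REPLACEMENTS = [
    (re.compile(r"'''(.*?)'''"), r"\\textbf{\1}"),      # bold first, so the
    (re.compile(r"''(.*?)''"), r"\\textit{\1}"),        # italics rule can't eat it
    (re.compile(r"\{\{u\|(.*?)\}\}"), r"\\uline{\1}"),  # underline template
]

def naive_convert(text):
    """Apply every rule -- one full pass over the text per pattern."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text
```

With dozens of patterns, each new rule means another trip through the whole document, and the ordering between rules quickly becomes fragile.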

After much Googling, I discovered Python Lex-Yacc, or PLY. This is a Python implementation of lex and yacc parsers[3] that I could use to convert my wikitext files to a token stream. I set to work writing some regular expressions to identify wikitext snippets and extract the information I needed. Here is an example of how it identifies an underlined phrase:
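As a self-contained sketch of the idea (using plain re in place of PLY’s token rules, so this is illustrative rather than the project’s actual code):

```python
import re

# The underline rule: match {{u|...}} or <u>...</u> and capture the
# inner text in the named group "word".
UNDERLINED = re.compile(r'(?:[{]{2}u\||<u>)(?P<word>.*?)(?:[}]{2}|</u>)')

def tokenize_underlined(text):
    """Yield (type, value) pairs, standing in for PLY's LexToken objects."""
    for match in UNDERLINED.finditer(text):
        yield ('UNDERLINED', match.group('word'))

tokens = list(tokenize_underlined('{{u|hello world}} and <u>again</u>'))
# Both the template form and the HTML form yield the same token type.
```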

The phrase is represented in wikitext as {{u|hello world}} or <u>hello world</u>. The lexer uses the regular expression pattern r'(?:[{]{2}u\||<u>)(?P<word>.*?)(?:[}]{2}|</u>)' to match this, capturing “hello world” in the “word” group. This creates a token, which is represented in string form as LexToken(UNDERLINED,'hello world',1,2812). This is a great start! I’ve removed the text I don’t want, and kept the text that’s important. Unfortunately, it’s not sufficient to just leave it at this. What if there is centered, underlined text? The lexer will pull it out as LexToken(CENTERED,'{{u|hello world}}',1,2812), which is not exactly what I’m going for. MediaWiki’s nested templates do not do much to make this project easy. I ended up solving the problem by adding a traverse() function, which finds all occurrences of {{ and }}. It then works from the inside out and tries to make sense of each template. It’s not perfect, but it’s getting better. It hangs out in the reparse module: a collection of functions to help deal with running header templates, left offsets, and other elements that require a bit more attention than the lexer can give. I also had to add entire wikitable and toc (table of contents) modules, because the syntax is very involved.
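The inside-out idea can be sketched as a small recursive descent over the brace pairs. This is a simplification of traverse(); the handler table and the LaTeX it emits are hypothetical stand-ins for what the reparse module actually does:

```python
def expand(body):
    """Hypothetical handler: turn one template body into LaTeX."""
    name, _, arg = body.partition('|')
    handlers = {
        'u': '\\uline{%s}',                     # underline (ulem package)
        'c': '\\begin{center}%s\\end{center}',  # centered text
    }
    return handlers.get(name, '%s') % arg

def read_template(text, i):
    """Consume text up to the matching '}}', expanding nested templates first."""
    out = []
    while i < len(text) and not text.startswith('}}', i):
        if text.startswith('{{', i):
            body, i = read_template(text, i + 2)
            out.append(expand(body))
        else:
            out.append(text[i])
            i += 1
    return ''.join(out), i + 2

def traverse(text):
    """Expand every {{...}} template, innermost templates first."""
    out, i = [], 0
    while i < len(text):
        if text.startswith('{{', i):
            body, i = read_template(text, i + 2)
            out.append(expand(body))
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)
```

Because nested bodies are expanded before their parents, centered, underlined text like {{c|{{u|hello world}}}} resolves cleanly instead of leaving a raw template sitting inside the CENTERED token.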

This whole project has given me new respect for the Parsoid team. My project is very specific to the Pentagon Papers—if someone tried to use it for anything more than the simplest Wikisource page, it would probably break immediately. I cannot even fathom trying to undertake a project to create a parser for all of MediaWiki.


1. Thanks Earwig!
2. Progress
3. See "What is Lex? What is Yacc?" for a quick overview; in the interest of brevity, I’m not going into detail here.
