Google Summer of Code: The Problem

by Molly White on Wednesday, June 5, 2013

I have exciting news! I found out a little over a week ago that I’ve been accepted to participate in Google Summer of Code 2013, working on a project for the Wikimedia Foundation. For those of you unfamiliar with Summer of Code, it’s a really cool program where students (undergraduate through Ph.D.) are paid by Google to write code for one of a large number of open source organizations. My project is “Improve support for book structures,” which is explained in detail in my proposal. That said, my proposal is aimed at people who already have a good understanding of the MediaWiki software and the Wikimedia Foundation’s projects. I thought I could use this blog post to explain exactly what problem I’m trying to solve, for those of you who may not know Wikimedia’s projects so well.

Wikipedia is a wiki, or a website that anyone can edit. It hosts millions of articles, each of which is written in a somewhat similar way to a classic print encyclopedia. However, the Wikimedia Foundation (the organization that runs Wikipedia) also runs a number of other projects: Wiktionary, Wikisource, Wikispecies, Wikibooks, Wikidata, and so on. They also follow the wiki design, but host different types of content. My Google Summer of Code project is intended to support Wikisource, Wikibooks, and other wikis like them.¹ Wikisource is “the free library that anyone can improve;” Wikibooks is “the open-content textbooks collection that anyone can edit.” The common thread here is that they each primarily host content that is more like a book than an individual article. These wikis need to break up their content in a different way from, say, Wikipedia. While Wikipedia uses section headers, Wikisource or Wikibooks typically cannot use these to separate chapters. If they did, their longer books would end up as one massive page that would be both difficult to load and difficult to read.

So, to avoid compiling all of their text onto a single page, Wikisource and Wikibooks create a system of subpages, one for each chapter.² As an example, take a look at The Interpretation of Dreams, a proofreading project I did on Wikisource a while ago. The landing page is the front matter of the book: the title page, various prefaces, and the table of contents. From there, you can navigate to each chapter. Works nicely, right? The answer is: ehhh, kind of. The end result is quite nice. But it’s not easy for the editors³ to create this. Each of those green header bars, with the “next” and “previous” chapter links, needs to be added manually. Each of those pages includes the chapter text by using a tag like this: <pages index="Freud - The interpretation of dreams.djvu" from=19 to=97/>. This tells the wiki to include pages 19–97 in that particular chapter. Compiling those pages isn’t terrible for works like The Interpretation of Dreams, which consists of a grand total of eight chapters, some front matter, and an index. Wikisource also hosts works like The Pentagon Papers, however, which can be massive.
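To make the manual work a little more concrete, here is a minimal Python sketch of what an editor currently has to work out by hand for each chapter: the `<pages>` tag and the "previous" and "next" links. This is only an illustration, not how any existing extension works. The `Chapter` class, the helper names, the chapter titles, and the second page range are all hypothetical; only the index file name and the 19 to 97 range come from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chapter:
    title: str       # subpage name (hypothetical titles below)
    first_page: int  # first scanned page included in the chapter
    last_page: int   # last scanned page included in the chapter


def pages_tag(index: str, chapter: Chapter) -> str:
    """Build the <pages> tag that pulls a chapter's scanned pages into one wiki page."""
    return f'<pages index="{index}" from={chapter.first_page} to={chapter.last_page}/>'


def neighbours(chapters: List[Chapter], i: int) -> Tuple[Optional[str], Optional[str]]:
    """Return the (previous, next) chapter titles for the header bar, or None at either end."""
    prev_title = chapters[i - 1].title if i > 0 else None
    next_title = chapters[i + 1].title if i < len(chapters) - 1 else None
    return prev_title, next_title


INDEX = "Freud - The interpretation of dreams.djvu"
# Hypothetical chapter list; the second page range is invented for illustration.
CHAPTERS = [
    Chapter("Chapter 1", 19, 97),
    Chapter("Chapter 2", 98, 150),
]

if __name__ == "__main__":
    for i, chapter in enumerate(CHAPTERS):
        print(chapter.title, neighbours(CHAPTERS, i), pages_tag(INDEX, chapter))
```

With eight chapters this bookkeeping is tolerable by hand; with hundreds of sections, a single list like `CHAPTERS` that drives everything else starts to look much more appealing.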

The makeshift way Wikisource and Wikibooks store their content also causes other problems. As a (completely hypothetical) example, what would happen if a Wikisource administrator decided that The Pentagon Papers was not appropriate for inclusion in Wikisource, and decided to delete it? Well, so far there are 35 subpages making up sections I.A. through IV.A. Great, so the admin would need to delete 35 pages. This is a pretty big pain, but not out of the question. But then you must consider that with a project like The Pentagon Papers, where the proofreading is being done from a scan of the source, each individual source page has its own wiki page (an example). In these first 35 sections, there are over 600 pages. Suddenly, deleting this work has become a whole lot more difficult. For a sense of scale, I am not even 10% done with that proofreading project; the final work will have over 7,000 individual wiki pages. The same issues arise when it comes to other actions, like moving a work to a new title, protecting a work from being edited, or “watching” a work for new edits.
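As a rough sketch of why these whole-work actions are awkward, here is how one might collect every subpage of a work from a plain list of page titles, relying only on MediaWiki's convention that subpages are named with a "/" after the parent title. The function name and all of the titles are hypothetical, and this ignores the separate per-scan pages entirely, which would need a similar pass of their own.

```python
from typing import Iterable, List


def pages_in_work(root: str, all_titles: Iterable[str]) -> List[str]:
    """Return the root page plus every subpage whose title starts with 'root/'."""
    prefix = root + "/"
    return [t for t in all_titles if t == root or t.startswith(prefix)]


# Hypothetical titles, for illustration only.
TITLES = [
    "The Pentagon Papers",
    "The Pentagon Papers/I.A",
    "The Pentagon Papers/IV.A",
    "The Pentagon Papers Revisited",  # different work: no "/" after the root title
    "The Interpretation of Dreams",
]

if __name__ == "__main__":
    for title in pages_in_work("The Pentagon Papers", TITLES):
        print(title)
```

Even with a list like this in hand, an administrator still has to act on each page one at a time, which is exactly the pain point.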

In short, I hope to improve the way wikis like this handle their content. I will be modifying an existing extension so that a user can easily enter information about the book’s structure and let the extension do the rest. It will automatically create those header bars, and if I have time, I will also work on auto-generating tables of contents and on making the whole-work actions described above (deleting, moving, protecting, or watching a book) possible in one click! In the interest of brevity, I will stop here and try to write another post soon that goes into detail on the solutions I’m hoping to implement.

Notes

1. Though I frequently refer to Wikisource and Wikibooks in this project, one of my goals is to add this support in a way that other, non-Wikimedia wikis can make use of it. MediaWiki is free software, so many, many people use it for various purposes.
2. Or section, or volume, or whatever other organization style the source uses.
3. Wikisourcerers!