somewhere to talk about random ideas and projects like everyone else


Offline Wiki: browse the full text of Wikipedia offline

In middle school, I had an iPhone, but it didn't actually have cell service, which led to more than a few expeditions in search of the fabled WiFi. My favorite app was Patrick Collison's Encyclopedia, which contained the full compressed contents of the English Wikipedia. It left a powerful impression on me: I felt the true power of a real-life Hitchhiker's Guide.


related posts

Offline Wiki Redux 30 December 2011

There's just something incredibly alluring about the concept of carrying the sum of human knowledge with you at all times. While near-ubiquitous connectivity alleviates the need to a certain extent, momentary lapses in networking are incredibly corrosive to an information-dependent mentality. Wikipedia never ceases to amaze me and, while I've tried in the past to encapsulate part of its sheer awesomeness, this marks a much more significant attempt.

The differences start even before the data gets to the application. The preprocessing toolchain was entirely rewritten for a multitude of reasons. First of all, it compresses not the entirety of the English Wikipedia, but rather its most popular subset. Two dumps are distributed at the time of writing: the top 1,000 articles and the top 300,000 articles, requiring approximately 10MB and 1GB, respectively. While ostensibly the top 300k articles are far too narrow a slice to delve deep into the long tail, this meager 1/25th of all articles consistently surprises me with its breadth. The advantage is that at 1GB, it's relatively easy to fit onto any system. The algorithm which strips extraneous content has also been made far more sophisticated than the original series of regular expressions, enabling greater compression and fewer accidental omissions.
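As a rough illustration (the function name and rules here are mine, not the actual toolchain's), a regex-based cleanup pass of the original flavor looks something like this:

```javascript
// Illustrative only: a naive regex-based WikiText cleanup, in the spirit of
// the original pass (the rewritten toolchain is more sophisticated than this).
function stripWikiText(text) {
  return text
    .replace(/\{\{[^{}]*\}\}/g, '')                // drop simple, non-nested {{templates}}
    .replace(/<ref[^>]*>[\s\S]*?<\/ref>/g, '')     // drop <ref>...</ref> citations
    .replace(/\[\[(?:File|Image):[^\]]*\]\]/g, '') // drop image and file links
    .replace(/\n{3,}/g, '\n\n');                   // collapse leftover blank lines
}
```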

On the application end, the decoder has switched from a GWT-compiled LZMA SDK to a speedy, pure-JavaScript one. This makes page loads significantly faster and allows greater compression ratios, since individual blocks can be made larger (256KB instead of 100KB). It also now uses the Typed Arrays introduced by WebGL to further speed things up, such as when sending data to and from the WebWorker thread.
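Sketched out, the hand-off looks roughly like this, with lzmaDecode standing in for the pure-JS decoder and renderArticle for the display side (both names are hypothetical):

```javascript
// worker.js — lzmaDecode() is a hypothetical stand-in for the pure-JS decoder.
self.onmessage = function (e) {
  var block = new Uint8Array(e.data);  // typed-array view over the ArrayBuffer
  self.postMessage(lzmaDecode(block)); // decode one 256KB block to WikiText
};

// main thread
var worker = new Worker('worker.js');
worker.onmessage = function (e) { renderArticle(e.data); }; // hypothetical
worker.postMessage(blockBuffer); // blockBuffer: an ArrayBuffer read from disk
```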

The interface was redesigned with CSS media queries to dynamically transition between different modes in response to different viewing environments. It consists of two regions: a fixed-position, recessed left panel which holds the page title, a search bar, controls and the page outline, and the article content itself. The panel automatically collapses down to a toolbar header when screen real estate is limited. It uses an Apple-esque noise texture background.

Downloads happen in little units called chunks (they're half a megabyte for the dump file and about four kilobytes for the index). The local file can be built up out of order. While online, all storage operations first check local storage: the virtual file, IndexedDB, or Web SQL database. If a chunk isn't there, the app transparently issues an XMLHttpRequest to fulfill the request and caches the result to disk in the respective persistence mechanism. A bitset keeps track of which chunks are already downloaded and which still need to be.
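A minimal sketch of that bookkeeping, with the persistence helpers (readChunkFromDisk, writeChunkToDisk) and dumpUrl/totalChunks left as hypothetical names:

```javascript
// One bit per chunk marks what is already cached locally.
var CHUNK_SIZE = 512 * 1024; // half a megabyte, as for the dump file
var bitset = new Uint8Array(Math.ceil(totalChunks / 8));

function hasChunk(n)  { return (bitset[n >> 3] >> (n & 7)) & 1; }
function markChunk(n) { bitset[n >> 3] |= 1 << (n & 7); }

function fetchChunk(n, callback) {
  if (hasChunk(n)) return readChunkFromDisk(n, callback); // already cached
  var xhr = new XMLHttpRequest();
  xhr.open('GET', dumpUrl, true);
  xhr.responseType = 'arraybuffer';
  // Request only this chunk's byte range, so chunks can arrive out of order.
  xhr.setRequestHeader('Range',
      'bytes=' + n * CHUNK_SIZE + '-' + ((n + 1) * CHUNK_SIZE - 1));
  xhr.onload = function () {
    writeChunkToDisk(n, xhr.response, function () {
      markChunk(n);
      callback(xhr.response);
    });
  };
  xhr.send();
}
```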

http://offline-wiki.googlecode.com/git/app.html


Offline Wiki Chrome Webstore App 23 April 2011

Offline Wiki for Chrome is now on the Web Store. The app started a few months ago and first took form as Offline Dictionary. The Chrome app version is much more refined, aesthetically and functionally. The search bar and autocomplete are much less obtrusive, and there are gradients and box shadows, a sure indication of progress :)

Description

The most significant difference is that it includes the entire English Wikipedia, a pretty close approximation of the sum of all human knowledge. Uncompressed, it's something like 30 gigabytes of raw text, and the compressed version that I've compiled for this app clocks in at around 3.7 gigabytes (3.4GB for the actual compressed dump and 0.3GB for the search index file). Hosting and serving up all those gigabytes does cost money, so that's why the app isn't free. But it's still a pretty good value considering the equivalent apps for iOS are twice the price.

It includes an index which lists every single article in Wikipedia, paginated and navigable through a scroll bar. There are something like 50,000 pages of the index (the exact number depends on the size of your screen), and articles are divided into columns. There’s a button which sends you to a random article, and a search bar.

Admittedly, the notion of this app is pretty strange: a web app which only serves its function when offline. However, the ease of installing and using it is somewhat unparalleled in the desktop space. It allows browsing the database as soon as the download has begun, whereas nearly all other such apps require the entire dump to be downloaded first. There's no conversion process. Everything pretty much hopefully just works.

HTML5 is awesome.

The interface is done entirely with CSS and HTML, no images except the obligatory xkcd reference. The toolbar is a CSS3 gradient, and the page is marked up with HTML5 tags like <article>, <header> and <footer>. The download progress is indicated by a native <progress> bar element. The index is navigable through a slider bar created using <input type=range>, and all the page titles are flowed automatically into columns by the CSS3 column layout properties.

The really cool stuff is what's in the JavaScript. Probably the most significant development which has enabled this app is an awesome thing called the FileSystem API, which exposes a special persistent sandboxed read/write directory that can be modified through the File API. When the page is first loaded, the download process begins: it uses the responseType attribute of XMLHttpRequest to get each downloaded chunk (it just sets a request header to specify a one-megabyte range) as an ArrayBuffer. ArrayBuffers are a part of the WebGL specification which enables the storage and manipulation of binary data. It then uses BlobBuilder to convert that ArrayBuffer into a Blob, which can then be written to the hard disk using the File API.
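Pieced together, that path looks roughly like the following sketch, using the era's APIs (webkitRequestFileSystem and BlobBuilder, the latter since superseded by the Blob constructor); dumpUrl is a hypothetical name:

```javascript
// Rough sketch of the download-and-persist path with the then-new APIs.
window.webkitRequestFileSystem(PERSISTENT, 4 * 1024 * 1024 * 1024, function (fs) {
  fs.root.getFile('enwiki.lzma', { create: true }, function (entry) {
    entry.createWriter(function (writer) {
      var xhr = new XMLHttpRequest();
      xhr.open('GET', dumpUrl, true);
      xhr.responseType = 'arraybuffer';
      xhr.setRequestHeader('Range', 'bytes=0-1048575'); // a one-megabyte chunk
      xhr.onload = function () {
        var bb = new BlobBuilder(); // prefixed WebKitBlobBuilder at the time
        bb.append(xhr.response);    // wrap the ArrayBuffer in a Blob
        writer.write(bb.getBlob()); // persist it to the sandboxed directory
      };
      xhr.send();
    });
  });
});
```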

To search, it reads a small portion of the index file by using a particular slice of the file, which is the equivalent of seeking on a disk. It implements binary search to quickly locate a certain article on the disk and then reads a chunk from the dump in much the same way. Then it uses a javascript implementation of the LZMA compression algorithm to decode a certain hundred-kilobyte block into raw WikiText which is then parsed into HTML.
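In sketch form, the seek-and-search loop looks something like this; WINDOW, readWindow and firstCompleteEntry are illustrative stand-ins, and modern code would use file.slice where webkitSlice was needed at the time:

```javascript
// Hypothetical sketch of the on-disk binary search over the sorted index file.
var WINDOW = 4096; // read a small, index-chunk-sized window at a time

function findArticle(indexFile, title, lo, hi, callback) {
  if (hi - lo <= WINDOW) {
    return readWindow(indexFile, lo, hi, callback); // scan the final window
  }
  var mid = Math.floor((lo + hi) / 2);
  var reader = new FileReader();
  reader.onload = function () {
    var entry = firstCompleteEntry(reader.result); // first full line in window
    if (title < entry.title) findArticle(indexFile, title, lo, mid, callback);
    else                     findArticle(indexFile, title, mid, hi, callback);
  };
  // Slicing the File reads just this window, like seeking on a disk.
  reader.readAsText(indexFile.slice(mid, mid + WINDOW));
}

// e.g. findArticle(indexFile, 'Alan Turing', 0, indexFile.size, showResult);
```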

Finally, the pushState and replaceState methods from the History API are used to handle the navigation of pages without reloading.
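For instance (render is a hypothetical stand-in for the decode-and-display path described above):

```javascript
// Sketch of reload-free navigation with the History API.
function openArticle(title) {
  history.pushState({ title: title }, title, '#' + encodeURIComponent(title));
  render(title);
}
window.onpopstate = function (e) { // back/forward re-renders from the dump
  if (e.state) render(e.state.title);
};
```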

Get it now

Offline Wiki for Chrome

HTML5 Offline Wikipedia Reader and Help me win a Cr-48 04 February 2011


So, let's begin with a plea for you to rate my submission (ideally favorably) for the LucidChart Chrome notebook contest, partly because it would be awesome, and partly because I need to test the aforementioned HTML5 offline Wikipedia dump reader.

Wikipedia is awesome, and on my service-less iPhone 2G, one of the most useful apps is one with an outdated (by something like three years) Wikipedia dump reader. There’s something very much like the Hitchhiker’s Guide in carrying the “sum of all human knowledge” in one’s pocket. Wikipedia grows really quickly and the English language dump (as of last month at time of writing) is over six gigs of bzipped XML.

Another reason for building an offline Wikipedia dump reader is simply that I can. There's a lot of cool and cutting-edge stuff in this. It uses the file input JS API, FileReader, Blobs, WebWorkers, the FileSystem API, and fast JavaScript to run a pure-JS implementation of the LZMA compression algorithm.

All this stuff has only been possible for a very short amount of time. In fact, the FileSystem API only arrived with the release of Chrome 9 today. It's pretty awesome to imagine that a web page, HTML and JavaScript, is processing multiple gigabytes of data in real time. Even in its current unoptimized state, the search results and the article show up within 100ms of a key press (though I admit it has yet to be tested on the actual English Wikipedia).

This app is almost ideally suited for Chrome OS. A hundred megabytes of data per month (with the Verizon deal, not to mention that it's limited to two years) isn't very much, and it would be a pity for the sixteen gigabytes of SSD space on a Cr-48 to be wasted. One of the dumps that I've compiled is a subset of the English Wiktionary which makes a great lightweight (8MB) dictionary with over sixty-five thousand words. The full Wiktionary is fairly large at 217MB, and the Simple English Wikipedia is 51MB.

I probably won't post the dumps on this server because of the massive bandwidth usage, though someone else could. Assuming you can somehow access them, the process for using the app is very simple. Download the file to your computer through HTTP or a torrent, or create it yourself. You might have noticed that the favicon is a picture of a carrot; that's the current codename. Well, not really a codename, but I was scrolling through some icon lists and saw one for a carrot on the first page. I don't even like carrots, but I digress. The dump loading screen has a picture of a bunny biting a file-upload carrot that you click on to select the file. Select the file, and it automagically copies it over to a persistent folder. Then you're off to searching and reading articles and definitions :)
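The copy itself is a small bit of FileSystem API glue, roughly like this sketch (fileInput and startReading are hypothetical names):

```javascript
// Sketch of the carrot import step: copy the user-selected dump into the
// app's persistent FileSystem folder.
fileInput.onchange = function () {
  var dump = fileInput.files[0];
  window.webkitRequestFileSystem(PERSISTENT, dump.size, function (fs) {
    fs.root.getFile('dump.lzma', { create: true }, function (entry) {
      entry.createWriter(function (writer) {
        writer.onwriteend = startReading; // ready to search once copied
        writer.write(dump); // a File is a Blob, so it can be written directly
      });
    });
  });
};
```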

I'm not done yet, or at least, I don't feel like packaging it up and submitting it to the Chrome Web Store yet, because I need to figure out a good way to host the database dumps.