
Lossy Text Compression

So I had this idea to make a lossy text compression system. For those who don't know the difference between lossy and lossless: lossless compression gives you back exactly the original data, while lossy compression throws some information away in exchange for a smaller file. The main rationale is Wikipedia. I have a jailbroken iPhone, and one of my favorite uses for it is Wiki2Touch (which is now defunct, with the development blog gone and the Google Code project deleted). Wikipedia grows at close to a gigabyte a year. Extrapolating that growth, my iPhone 2G with 8 GB of space will soon run out of room for Wikipedia. Right now the dump is almost 6 GB (with bzip2 compression). Soon it will approach the space I actually have free: 8 GB - (200 MB (root partition size) + 100 MB (music) + 0 MB (video) + 1 GB (apps)), which is about 6.7 GB.
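As a quick sanity check on those numbers, here is a back-of-the-envelope calculation in Python. The variable names are made up for illustration, "almost 6 GB" is rounded to 6.0, and the growth rate is taken as a flat 1 GB per year, so treat the result as a rough estimate only.

```python
# Back-of-the-envelope estimate using the figures from the post.
# Assumptions: "almost 6GB" is treated as 6.0 GB and growth as a flat 1 GB/year.
capacity_gb = 8.0
reserved_gb = 0.2 + 0.1 + 0.0 + 1.0  # root partition + music + video + apps
available_gb = capacity_gb - reserved_gb

current_dump_gb = 6.0
growth_gb_per_year = 1.0
years_left = (available_gb - current_dump_gb) / growth_gb_per_year

print(f"free for Wikipedia: {available_gb:.1f} GB")
print(f"time until it no longer fits: about {years_left:.1f} years")
```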

After googling the concept, I found that it isn't very original. Even the idea of compressing Wikipedia isn't original! But the current world-record-breaking systems have some limitations: they are very memory-intensive (won't work on an iPhone!) and also very slow (the Hitchhiker's Guide isn't/shouldn't be slow!).

So here are some somewhat inspired ideas for this lossy Wikipedia encoding. Since the Wikipedia text is quite regular, capitalization is mostly unnecessary and can be added back automatically later on. The encoder can look up words (or even short phrases) in a dictionary and store their indexes instead of the words themselves. Words that are not in the dictionary can be looked up in a large dictionary/thesaurus and replaced with an appropriate synonym that is. After that, the data could be compressed with bzip2 or 7-Zip.
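Here is a minimal Python sketch of that pipeline, just to make the steps concrete. Everything in it is an assumption for illustration: the tiny `DICTIONARY` and `SYNONYMS` tables, the 16-bit index format, the `(?)` placeholder for words with no usable synonym, and the very naive rule for restoring capitalization (uppercase the first letter of each sentence). A real version would need a far larger dictionary and smarter handling of names and punctuation.

```python
import bz2
import re
import struct

# Hypothetical toy word list; index 0 is a placeholder for unmappable words.
DICTIONARY = ["(?)", ".", "the", "large", "cat", "sat", "on", "mat"]
WORD_TO_INDEX = {word: i for i, word in enumerate(DICTIONARY)}

# Hypothetical thesaurus: out-of-dictionary word -> in-dictionary synonym.
SYNONYMS = {"huge": "large", "big": "large", "feline": "cat", "rug": "mat"}


def encode(text: str) -> bytes:
    """Lowercase, map words to dictionary indexes, then bzip2 the indexes."""
    tokens = re.findall(r"[a-z']+|\.", text.lower())
    indexes = []
    for token in tokens:
        token = SYNONYMS.get(token, token)          # lossy synonym substitution
        indexes.append(WORD_TO_INDEX.get(token, 0))  # 0 = unknown (assumption)
    packed = struct.pack(f"<{len(indexes)}H", *indexes)  # 16-bit indexes
    return bz2.compress(packed)


def decode(blob: bytes) -> str:
    """Decompress, look the indexes up again, and re-add capitalization."""
    packed = bz2.decompress(blob)
    indexes = struct.unpack(f"<{len(packed) // 2}H", packed)
    text = " ".join(DICTIONARY[i] for i in indexes).replace(" .", ".")
    # Naive recapitalization: uppercase the first letter of each sentence.
    return re.sub(
        r"(^|\.\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


if __name__ == "__main__":
    original = "The huge feline sat on the rug. The cat sat."
    print(decode(encode(original)))
```

Note that the round trip does not give back the original sentence: "huge feline" and "rug" come back as their dictionary synonyms, which is exactly the lossy part of the idea.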

Decoding should be quite fast: decompress each section with bzip2 or something similar, then just look up the words in the dictionary index (which can be made to use little memory; I did exactly that in a spell checker a few days ago).
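One way to keep per-section decoding cheap is to compress every section on its own and remember where each compressed chunk starts, so reading one section never touches the others. The sketch below shows that idea in Python. The function names `pack_sections` and `read_section` and the in-memory offset table are hypothetical, and it compresses plain UTF-8 text rather than dictionary indexes to keep the example short.

```python
import bz2


def pack_sections(sections):
    """Compress each section separately; return (blob, offsets).

    offsets[n] is a (start, length) pair locating section n inside blob.
    """
    blob = bytearray()
    offsets = []
    for text in sections:
        chunk = bz2.compress(text.encode("utf-8"))
        offsets.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), offsets


def read_section(blob, offsets, n):
    """Decompress only section n, leaving the rest of the blob untouched."""
    start, length = offsets[n]
    return bz2.decompress(blob[start:start + length]).decode("utf-8")


if __name__ == "__main__":
    sections = ["intro text", "history text", "references text"]
    blob, offsets = pack_sections(sections)
    print(read_section(blob, offsets, 1))
```

The trade-off is that small independent chunks compress worse than one big stream, so the section size would need some tuning against the memory and speed limits of the phone.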

