somewhere to talk about random ideas and projects like everyone else

stuff

Project Naptha: highlight, copy, and translate text from any image

Project Naptha started out as "Images as Text", the 2nd-place entry at the 2013 HackMIT Hackathon.

It launched 5 months later, reaching 200,000 users in a week, and was featured on the front page of Hacker News, Reddit, Engadget, Lifehacker, The Verge, and PCWorld.


related posts

Cooley-Tukey FFT + DCT + IDCT in under 1K of Javascript 30 May 2015

Technically this is a Hough transform and isn't at all related to the FFT*, but it looks a lot cooler than any of the actual FFT/DCT pictures I have, so I might as well stick it here

Almost a year ago, I wrote a simple script for computing perceptual hashes in Javascript (it’s going to be part of a forthcoming Naptha feature which allows it to recognize source images for memes and do perfect inpainting). One step of the process is computing the 2D Discrete Cosine Transform of a shrunken version of the image.

The problem is that this step was kind of ridiculously slow, and it’s not really hard to see why: there’s a series of four nested for loops— because that’s the obvious naive way to compute the DCT.
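
For reference, the naive version looks something like this (a sketch of the four-loop approach straight from the DCT-II definition, not the actual Naptha code):

function naiveDCT2D(s, N) {
    // s is an N x N array of rows; out[u][v] is the (unnormalized)
    // Type-II DCT coefficient, computed directly from the definition.
    var out = [];
    for (var u = 0; u < N; u++) {
        out.push([]);
        for (var v = 0; v < N; v++) {
            var sum = 0;
            for (var x = 0; x < N; x++)
                for (var y = 0; y < N; y++)
                    sum += s[x][y] *
                        Math.cos(Math.PI * (2 * x + 1) * u / (2 * N)) *
                        Math.cos(Math.PI * (2 * y + 1) * v / (2 * N));
            out[u].push(sum); // O(N^4) overall, hence the slowness
        }
    }
    return out;
}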

In practice, it’s actually not all that bad because you only have to do it once per image, and the dimensions of the source image are a fixed 32x32. But it still felt pretty bad, so when I was looking over the code again earlier this week, I tried doing some research to figure out a faster way to do the DCT.

Surely you’d imagine that there have to be extraordinarily efficient implementations of the DCT out there, given that it’s the very foundation of JPEG. And you’d be right, but they’ve generally unrolled the butterfly into a series of bit-shifts which are quite unlikely to generalize beyond the 8x8 block size.

I’m reasonably confident that FFTW3 has a fast DCT implementation which also generalizes beyond fixed block sizes, but it’s buried within quite a lot of code.

MiniFFT

This isn’t a super performant implementation, but it’s not horrendously slow and it doesn’t use recursion. It’s probably pretty trivial to make it quite a bit faster by precomputing sine and cosine tables (there’s a sketch of that variant after the usage example below). This uses the standard Cooley-Tukey radix-2 decimation-in-time algorithm, which means it only works on one-dimensional arrays whose length is a power of two.

It doesn’t check that the length is a power of two; it’ll just silently give you the wrong answer. You should definitely check that beforehand.
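
A bitwise check is enough to guard against that before calling the function below (the helper name is mine, not part of MiniFFT):

// A power of two has exactly one bit set, so N & (N - 1) is zero
// only for powers of two (the N > 0 guard rules out zero).
function isPowerOfTwo(N) {
    return N > 0 && (N & (N - 1)) === 0;
}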

function miniFFT(re, im) {
    var N = re.length;
    // Bit-reversal permutation: j is i with its bits reversed; swapping
    // re/im pairs into bit-reversed order lets the butterflies run in place.
    for (var i = 0; i < N; i++) {
        for (var j = 0, h = i, k = N; k >>= 1; h >>= 1)
            j = (j << 1) | (h & 1);
        if (j > i) {
            re[j] = [re[i], re[i] = re[j]][0];
            im[j] = [im[i], im[i] = im[j]][0];
        }
    }
    // Decimation-in-time butterflies: each pass merges pairs of
    // half-size transforms, doubling the span hN until it reaches N.
    for (var hN = 1; hN * 2 <= N; hN *= 2)
        for (var i = 0; i < N; i += hN * 2)
            for (var j = i; j < i + hN; j++) {
                var cos = Math.cos(Math.PI * (j - i) / hN),
                    sin = Math.sin(Math.PI * (j - i) / hN);
                var tre =  re[j + hN] * cos + im[j + hN] * sin,
                    tim = -re[j + hN] * sin + im[j + hN] * cos;
                re[j + hN] = re[j] - tre; im[j + hN] = im[j] - tim;
                re[j] += tre; im[j] += tim;
            }
}

The transformation happens in place, so it’ll modify whatever arrays you pass in.

var re = [1, 2, 3, 4],
    im = [0, 0, 0, 0];
miniFFT(re, im);
console.log(re, im); // [ 10, -2, -2, -2 ] [ 0, 2, 0, -2 ]
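
As mentioned above, the Math.cos and Math.sin calls in the innermost loop are the obvious hotspot. Hoisting them into per-pass tables would look something like this (my untested sketch, trading the tiny byte count for some speed):

function miniFFTTabled(re, im) {
    var N = re.length;
    // Same bit-reversal permutation as miniFFT.
    for (var i = 0; i < N; i++) {
        for (var j = 0, h = i, k = N; k >>= 1; h >>= 1)
            j = (j << 1) | (h & 1);
        if (j > i) {
            re[j] = [re[i], re[i] = re[j]][0];
            im[j] = [im[i], im[i] = im[j]][0];
        }
    }
    for (var hN = 1; hN * 2 <= N; hN *= 2) {
        // The twiddle angle only depends on (j - i), which ranges over
        // [0, hN), so each pass needs just hN cosines and sines.
        var cosT = new Float64Array(hN), sinT = new Float64Array(hN);
        for (var m = 0; m < hN; m++) {
            cosT[m] = Math.cos(Math.PI * m / hN);
            sinT[m] = Math.sin(Math.PI * m / hN);
        }
        for (var i = 0; i < N; i += hN * 2)
            for (var j = i; j < i + hN; j++) {
                var tre =  re[j + hN] * cosT[j - i] + im[j + hN] * sinT[j - i],
                    tim = -re[j + hN] * sinT[j - i] + im[j + hN] * cosT[j - i];
                re[j + hN] = re[j] - tre; im[j + hN] = im[j] - tim;
                re[j] += tre; im[j] += tim;
            }
    }
}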

Minified, it’s only 359 bytes. Unlike this other contender for the dubious prize of smallest javascript FFT implementation, it doesn’t require a special complex number library three times its size (though this doesn’t fit in a tweet— I’ll be really impressed if someone manages to do that).

I haven’t really golfed the code to the smallest size possible, but I think it’s at a reasonable balance of brevity while still being somewhat comprehensible.

It’s based on an in-place FFT from Princeton’s intro CS class and on Project Nayuki’s FFT.

MiniIFFT

Turns out that you can invert the FFT, modulo a scaling factor, just by swapping the real and imaginary arguments: swapping them on both input and output conjugates the transform, which turns the forward FFT into the inverse one (up to a factor of N).

function miniIFFT(re, im){
    miniFFT(im, re);
    for(var i = 0, N = re.length; i < N; i++){
        im[i] /= N;
        re[i] /= N;
    }
}
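
Round-tripping the example from above recovers the original input (up to floating point rounding):

var re = [10, -2, -2, -2],
    im = [0, 2, 0, -2];
miniIFFT(re, im);
console.log(re, im); // ≈ [ 1, 2, 3, 4 ] [ 0, 0, 0, 0 ]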

MiniDCT

This is an implementation of a Type-II DCT using MiniFFT. For details about Makhoul’s algorithm for transforming a DCT into an FFT, see this post.

function miniDCT(s){
    var N = s.length,
        K = -Math.PI / (2 * N),
        re = new Float64Array(N),
        im = new Float64Array(N);
    // Makhoul's permutation: even-indexed samples go to the front,
    // odd-indexed samples go, reversed, to the back.
    for (var i = 0, j = N; j > i; i++) {
        re[i] = s[i * 2];
        re[--j] = s[i * 2 + 1];
    }
    miniFFT(re, im);
    // Rotate each bin by exp(i * K * n) and keep twice the real part.
    for (var i = 0; i < N; i++)
        s[i] = 2 * re[i] * Math.cos(K * i) - 2 * im[i] * Math.sin(K * i);
}

Like MiniFFT, it operates in-place (well, not really, but it writes its output to the source array).

var arr = [3, 4, 1, 7];
miniDCT(arr);
console.log(arr); // [ 30, -5.09493566, 7.0710678118, -8.604744653 ]

It produces the same output as scipy’s fftpack, but Matlab’s DCT uses orthogonal normalization and produces a different result. However, it’s pretty simple to do it the Matlab way— just scale everything by 1/sqrt(2 * N) and divide the first element by sqrt(2).

function matlabDCT(arr){
    miniDCT(arr)
    arr[0] /= Math.sqrt(2);   
    for(var i = 0, N = arr.length; i < N; i++) arr[i] /= Math.sqrt(2 * N);
}
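
Running it on the same input as before gives the orthogonally normalized coefficients (values mine, rounded):

var arr = [3, 4, 1, 7];
matlabDCT(arr);
console.log(arr); // ≈ [ 7.5, -1.8013, 2.5, -3.0422 ]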

MiniIDCT

I really considered renaming this series of functions to TinyFFT and friends because of the predicament of MiniIDCT— yeah, exactly— the horror— repeating I’s.

This is an implementation of a Type-III DCT, also known as the IDCT because it happens to be the inverse of the Type-II DCT (frequently referred to as “the DCT”).

function miniIDCT(s){
    var N = s.length,
        K = Math.PI / (2 * N),
        im = new Float64Array(N),
        re = new Float64Array(N);
    // Rebuild the complex spectrum: pair s[i] with the reversed tail s[N - i],
    // rotate by exp(i * K * i), and fold in the 1/(2N) output scaling.
    re[0] = s[0] / N / 2;
    for (var i = 1; i < N; i++) {
        var im2 = Math.sin(i * K), re2 = Math.cos(i * K);
        re[i] = (s[N - i] * im2 + s[i] * re2) / N / 2;
        im[i] = (im2 * s[i] - s[N - i] * re2) / N / 2;
    }
    // Swapping the arguments runs the FFT as an (unscaled) inverse.
    miniFFT(im, re);
    // Undo Makhoul's permutation: the first half of the result holds the
    // even-indexed samples; the second half holds the odd ones, reversed.
    for (var i = 0; i < N / 2; i++) {
        s[2 * i] = re[i];
        s[2 * i + 1] = re[N - i - 1];
    }
}
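
Feeding it the DCT output from earlier recovers the original array (up to floating point rounding):

var arr = [30, -5.09493566, 7.0710678118, -8.604744653];
miniIDCT(arr);
console.log(arr); // ≈ [ 3, 4, 1, 7 ]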

This is based on Tom Grydeland’s IDL implementation of the IDCT.

Let’s start with an 8-element array of real numbers (DCT values).

[A, B, C, D, E, F, G, H]

From that it generates the corresponding complex values. Note that the real part is intact, and the imaginary part is the original array with its head chopped off and the rest of it flipped around and attached to a water balloon.

[A, B, C, D, E, F, G, H] - j[0, H, G, F, E, D, C, B]

Then the complex values are rotated by exp(i * pi / (2 * N)) to get spectral components. After an IFFT that gets transformed into:

[a, h, c, f, e, d, g, b]

It took me a while to see what was going on there, but the even indices map to the first half of the results, and the odd ones map to the second half in reverse.

Scrambled      a h c f e d g b
Even Indexed   a   c   e   g
Odd Indexed      h   f   d   b
Odd Reversed     b   d   f   h
Unscrambled    a b c d e f g h


Ocrad.js Pure Javascript OCR via Emscripten 31 December 2013


As with any minor stepping stone on the road to hell (that is, the relentless trajectory of Atwood’s Law), I probably don’t need to justify the existence of yet another “x, but now in Javascript!”, but I might as well try. After all, we would all like to think that there’s some ulterior motive to fulfilling that prophecy.

On tablets and other touchscreen devices, of which there are quite a number nowadays (as this is the New Year’s Eve post, I am obliged to include conjecture about the technological zeitgeist), a library such as Ocrad.js might be used to add handwriting input in a device- and operating-system-agnostic manner. Oftentimes, capturing the strokes and sending them to a server for processing entails unacceptably high latency. Maybe you’re working on an offline-capable note-taking app, or a browser extension which indexes all the doge memes that you stumble upon while prowling the dark corners of the internet.

If you’ve been following my trail of blog posts recently, you’d probably be able to tell that I’ve been scrambling to finish the program that I prototyped overnight at a hackathon many months ago. The idea of the extension was kind of simple and also kind of magical: a browser extension that allowed users to highlight, copy, and paste text from any image as if it were plain text. Of course, the implementation is a bit difficult and actually relies on the advent of a number of newfangled technologies.

If you try to search for some open source text recognition engine, the first thing that comes up is Tesseract. That isn’t a mistake, because it turns out that the competition is worlds away in terms of accuracy. It’s actually pretty sad that the state of the art hasn’t progressed substantially since the mid-nineties.

A month ago, I tried compiling Tesseract using Emscripten. Perhaps it was a bad thing to try first, but soon I learned that even if it did work out, it probably wouldn’t have been practical anyway. I had figured that all OCR engines had been powered by artificial neural networks, support vector machines, k-nearest-neighbors and their machine learning kin. It turns out that this is hardly the norm except in the realm of the actually-accurate, whose open source provinces live under the protection of Lord Tesseract.

GOCR and Ocrad are essentially the only other open source OCR engines (there’s technically also Cuneiform, but the source code is in a really, really big zip file on some Russian website, and it’s also really slow according to benchmarks). And something I didn’t realize until I had peered into the source code is that they are powered by (presumably) painstakingly written rules for each and every detectable glyph and variation. This kind of blew my mind.

Anyway, I tried to compile GOCR first and was immediately struck by how easy and painless it had been. I was on a roll, and decided to do Ocrad as well. It wasn’t particularly hard- sure it was slightly more involved but still hardly anything.

GOCR and Ocrad both only operate on NetPBM files (support for other formats is handled in typical unix fashion, by piping the output of programs that convert between file formats). Nobody really uses NetPBM anymore, so in order to handle a typical use case, I’d need some means of converting raw pixel values into the format. I Googled around for Javascript implementations of PBM/PGM/PNM and found nothing. Then I opened the Wikipedia page on the format and was pleased to find out why: the format is dirt simple.
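
For the curious, converting a canvas-style RGBA buffer into a binary PGM (the grayscale member of the NetPBM family) takes barely a dozen lines. This is my own sketch, not the code from GOCR.js or Ocrad.js:

function toPGM(rgba, width, height) {
    // P5 is the binary grayscale variant of NetPBM; the header is
    // plain ASCII: magic number, dimensions, and the maximum value.
    var header = "P5\n" + width + " " + height + "\n255\n";
    var data = new Uint8Array(header.length + width * height);
    for (var i = 0; i < header.length; i++)
        data[i] = header.charCodeAt(i);
    for (var p = 0; p < width * height; p++) {
        // Average R, G, and B as a crude luminance approximation.
        data[header.length + p] =
            (rgba[4 * p] + rgba[4 * p + 1] + rgba[4 * p + 2]) / 3 | 0;
    }
    return data;
}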

If you know me in person, you’ll probably know that I’m not a terribly decisive person. Oftentimes, I’ll delay the decision until there isn’t a choice left for me to make. Anyway, serially-indecisive-me strikes again, so I alternated between the development of GOCR.js and Ocrad.js, leading up to a simultaneous release.

But in the back of my mind, I knew that eventually I would have to pick one for building my image highlighting project. I had been leaning toward Ocrad the whole time because it seemed to be a bit faster and more accurate when it came to handwriting.

Anyway, I spent a while building the demo page. It’s pretty simple, but I wouldn’t describe it as ugly. There’s a canvas in the center which can be drawn to in arbitrary fonts pulled via the Google Font API. There’s a neat little thing which lets you draw things on the canvas. And to round out the experience, you can run the OCR engine on your own images, either by loading the image file onto a canvas (which means any format the browser supports: JPG, GIF, BMP, WebP, PNG, SVG) or by feeding the engine files directly in its native NetPBM formats.
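
The canvas trick is roughly the following (a sketch of the standard approach; the demo’s actual code may differ):

function imageToPixels(img) {
    // Draw the image onto an offscreen canvas so we can read back raw
    // RGBA bytes, regardless of the original file format.
    var canvas = document.createElement('canvas');
    canvas.width = img.width;
    canvas.height = img.height;
    var ctx = canvas.getContext('2d');
    ctx.drawImage(img, 0, 0);
    return ctx.getImageData(0, 0, img.width, img.height);
}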

What consistently amazes me about Optical Character Recognition isn’t its astonishing quality or lack thereof. Rather, it’s how utterly unpredictable the results can be. Sometimes there’ll be some barely legible block of text that comes through absolutely pristine, and other times there will be a perfectly clean input which outputs complete garbage. Maybe this is a testament to the sheer difficulty of computer vision, or to the incredible and under-appreciated abilities of the human visual cortex.

At one point, I was talking to someone and I distinctly remember (I know, all the best stories start this way) a sense of surprise when the person indicated that he had heard of Tesseract, the open source OCR engine. I had appraised it as somewhat more obscure than it evidently was. Some time later, I recounted the incident to a friend, who said something along the lines of “OCR is one of those fields that everyone comes across once”.

I guess I’ve kind of held onto that thought for a while now, and it certainly seems to have at least a grain of truth. Text embedded in the physical world is more or less the primary means we have for communication and expression. Technology is about building tools that augment human capacity, and it inevitably entails supplanting some human capability. Data input is a huge bottleneck, and we’re only sidestepping the problem with things like QR codes, which bring the digital world into the physical. OCR is just one of those fundamental enabling technologies which ought to be as broad in scope as the set of humans who have interacted with a keyboard.

I can’t help but feel that the rather large set of people who have interacted with the problem of character recognition have surveyed the available tools and reached the same conclusion as your miniature Magic 8 Ball desk ornament: “Try again later”. It doesn’t take long for one to discover an instance of perfectly crisp and legible type which results in line noise of such entropy that it’d give DUAL_EC_DRBG a run for its money. “No, there really isn’t any way for this to be the state of the art.” “Well, I guess if it is, then maybe it’ll improve in a few years— technology improves quickly, right?”

You would think that some analogue of Linus’s Law would hold true: “given enough eyeballs, all bugs are shallow”— especially if you’re dealing with literal eyeballs reading letters. But incidentally, the engine that absolutely everyone uses was developed three decades ago (it’s older than I am!), abandoned for a decade, and then acquired and released to the world (by our favorite benevolent overlords, Google).

In fact, what’s absolutely stunning is the sheer universality of Tesseract. Just about everything which claims to have text recognition as a feature is backed by it. At one point, I was hoping that Mathematica had some clever routine using morphology and symbolic new kinds of sciences and evolved automata pattern recognition. Nope! Nestled deep within the gigabytes of code lies the Chuck Testa of textadermies: Tesseract.