somewhere to talk about random ideas and projects like everyone else

Ocrad.js Pure Javascript OCR via Emscripten 31 December 2013

As with any minor stepping stone on the road to hell (also known as the relentless trajectory of Atwood’s Law), I probably don’t need to justify the existence of yet another “x, but now in Javascript!”, but I might as well try. After all, we all would like to think that there’s some ulterior motive to fulfilling that prophecy.

On tablets or other touchscreen devices, of which there are quite a number nowadays (as this is the New Year’s Eve post, I am obliged to include some conjecture about the technological zeitgeist), a library such as Ocrad.js might be used to add handwriting input in a device- and operating-system-agnostic manner. Oftentimes, capturing the strokes and sending them over to a server to process would entail unacceptably high latency. Maybe you’re working on an offline-capable note-taking app, or a browser extension which indexes all the doge memes that you stumble upon while prowling the dark corners of the internet.

If you’ve been following my trail of blog posts recently, you’d probably be able to tell that I’ve been scrambling to finish the program that I prototyped overnight at a hackathon many months ago. The idea of the extension was kind of simple and also kind of magical: a browser extension that allowed users to highlight, copy, and paste text from any image as if it were plain text. Of course, the implementation is a bit difficult and actually relies on the advent of a number of newfangled technologies.

If you try to search for some open source text recognition engine, the first thing that comes up is Tesseract. That isn’t a mistake, because it turns out that the competition is worlds away in terms of accuracy. It’s actually pretty sad that the state of the art hasn’t progressed substantially since the mid-nineties.

A month ago, I tried compiling Tesseract using Emscripten. Perhaps it was a bad thing to try first, but soon I learned that even if it did work out, it probably wouldn’t have been practical anyway. I had figured that all OCR engines had been powered by artificial neural networks, support vector machines, k-nearest-neighbors and their machine learning kin. It turns out that this is hardly the norm except in the realm of the actually-accurate, whose open source provinces live under the protection of Lord Tesseract.

GOCR and Ocrad are essentially the only other open source OCR engines (there’s technically also Cuneiform, but the source code is in a really, really big zip file from some website in Russian, and it’s also really slow according to benchmarks). And something I didn’t realize until I had peered into the source code is that they are powered by (presumably) painstakingly written rules for each and every detectable glyph and variation. This kind of blew my mind.

Anyway, I tried to compile GOCR first and was immediately struck by how easy and painless it had been. I was on a roll, and decided to do Ocrad as well. It wasn’t particularly hard; sure, it was slightly more involved, but still hardly anything.

GOCR and Ocrad both only operate on NetPBM files (supporting other formats is done in typical unix fashion by piping the outputs from programs that convert file formats). Nobody really uses NetPBM anymore, so in order to handle a typical use case, I’d need some means of converting from raw pixel values into the format. I Googled around for Javascript implementations of PBM/PGM/PNM, finding nothing. I opened the Wikipedia page on the format and was pleased to find out why: the format is dirt simple.
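
For reference, here’s a rough sketch of what that conversion can look like: packing a canvas’s ImageData into a binary PGM (P5) buffer. The helper name and the luminance weights are my own illustration, not anything lifted from GOCR.js or Ocrad.js.

```javascript
// Sketch: convert an ImageData object into a binary PGM (P5) grayscale image.
function imageDataToPGM(imageData) {
  var width = imageData.width, height = imageData.height;
  var header = 'P5\n' + width + ' ' + height + '\n255\n';
  var out = new Uint8Array(header.length + width * height);
  for (var i = 0; i < header.length; i++) out[i] = header.charCodeAt(i);
  var data = imageData.data; // RGBA, four bytes per pixel
  for (var p = 0, o = header.length; p < data.length; p += 4, o++) {
    // Collapse RGB into a rough luminance value, ignoring alpha.
    out[o] = (data[p] * 0.299 + data[p + 1] * 0.587 + data[p + 2] * 0.114) | 0;
  }
  return out;
}
```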

If you know me in person, you’ll probably know that I’m not a terribly decisive person. Oftentimes, I’ll delay the decision until there isn’t a choice left for me to make. Anyway, serially-indecisive-me strikes again, so I alternated between the development of GOCR.js and Ocrad.js, leading up to a simultaneous release.

But in the back of my mind, I knew that eventually I would have to pick one for building my image highlighting project. I had been leaning toward Ocrad the whole time because it seemed to be a bit faster and more accurate when it came to handwriting.

Anyway, I spent a while building the demo page. It’s pretty simple, but I wouldn’t describe it as ugly. There’s a canvas in the center which can be drawn to in arbitrary fonts pulled via the Google Font API, and a neat little widget which lets you draw on the canvas freehand. And to round out the experience, you can run the OCR engine on your own images, either by loading the image file onto a canvas (which supports any format the browser does: JPG, GIF, BMP, WebP, PNG, SVG) or by directly feeding the engine files in its native NetPBM formats.
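
Using it is about as simple as it gets. Here’s a minimal sketch, assuming the OCRAD() global that Ocrad.js exposes (it takes a canvas and returns the recognized text as a string); the element id is made up for illustration.

```javascript
// Sketch: draw an <img> element onto a canvas and hand it to Ocrad.js.
var img = document.getElementById('some-image'); // hypothetical element id
var canvas = document.createElement('canvas');
canvas.width = img.naturalWidth;
canvas.height = img.naturalHeight;
canvas.getContext('2d').drawImage(img, 0, 0);

var text = OCRAD(canvas); // the recognized text, as a plain string
console.log(text);
```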

What consistently amazes me about Optical Character Recognition isn’t its astonishing quality or lack thereof. Rather, it’s how utterly unpredictable the results can be. Sometimes there’ll be some barely legible block of text that comes through absolutely pristine, and some other time there will be a perfectly clean input which outputs complete garbage. Maybe this is a testament to the sheer difficulty of computer vision or the incredible and under-appreciated abilities of the human visual cortex.

At one point, I was talking to someone and I distinctly remembered (I know, all the best stories start this way) a sense of surprise when the person indicated that he had heard of Tesseract, the open source OCR engine. I had appraised it as somewhat more obscure than it evidently was. Some time later, I confided in a friend about the incident, and he said something along the lines of “OCR is one of those fields that everyone comes across once”.

I guess I’ve kind of held onto that thought for a while now, and it certainly seems to have at least a grain of truth. Text embedded in the physical world is more or less the primary means we have for communication and expression. Technology is about building tools that augment human capacity, and that inevitably entails supplanting some human capability. Data input is a huge bottleneck, and we’re only sort of sidestepping the problem with things like QR codes, which bring the digital world into the physical. OCR is just one of those fundamental enabling technologies which ought to be as broad in scope as the set of humans who have interacted with a keyboard.

I can’t help but feel that the rather large set of people who have interacted with the problem of character recognition have surveyed the available tools and reached the same conclusion as your miniature Magic 8 Ball desk ornament: “Try again later”. It doesn’t take long for one to discover an instance of perfectly crisp and legible type which results in line noise of such entropy that it’d give DUAL_EC_DRBG a run for its money. “No, there really isn’t any way for this to be the state of the art.” “Well, I guess if it is, then maybe it’ll improve in a few years; technology improves quickly, right?”

You would think that some analogue of Linus’s Law would hold true: “given enough eyeballs, all bugs are shallow”- especially if you’re dealing with literal eyeballs reading letters. But incidentally, the engine that absolutely everyone uses was developed three decades ago (It’s older than I am!), abandoned for a decade before being acquired and released to the world (by our favorite benevolent overlords, Google).

In fact, what’s absolutely stunning is the sheer universality of Tesseract. Just about everything which claims to have text recognition as a feature is backed by it. At one point, I was hoping that Mathematica had some clever routine using morphology and symbolic new kinds of sciences and evolved automata pattern recognition. Nope! Nestled deep within the gigabytes of code lies the Chuck Testa of textadermies: Tesseract.


hqx.js - pixel art scaling in the browser 31 March 2013

Every once in a while some gadget has the misfortune of epitomizing the next first world problem. I guess right now, this is owning a Retina (or equivalent) laptop or tablet (arguably a phone too, but most web pages are scaled out so it’s not that big of a problem) and being irked at the prevalence of badly scaled graphics. So there’s a new buzzword, “Retina Ready”, for websites, layouts and designs which support higher resolution graphics for devices which support them, often meaning lots of new files and new css rules. It’s this trend of high-pixel-density devices (the iPad 3, Retina Macbook Pro, Nexus 10 and Chromebook Pixel - though I for one don’t currently have any of them, just this old glitchy-albeit-functional first generation Chromebook) that is driving people to vector icon fonts.

But the problem of radical increases in resolution isn’t a new one. Old arcade games rarely exceeded 260x315, and the Game Boy Color had a paltry 160x144. While a few people still lug around game cabinets and dig out their dust-covered childhood handheld consoles for nostalgic sneezing fits, most of the old games are now played with emulators running on systems several orders of magnitude more sophisticated in every imaginable aspect. So that arcade monitor that once could engross a childhood (and maybe early manhood) now appears as nothing more than a two inch square on a twenty inch monitor. But luckily there is a surprisingly good solution to all of this in the form of algorithms designed specifically for scaling pixel art.

The most basic form of image scaling that exists is called nearest-neighbor interpolation, which is extra simple for retina devices because it means simply growing the size of each pixel by a factor of two along each axis. That leads to things which are blocky and, unless you’re part of an 8-bit retro-art project with a chiptune soundtrack, ugly.
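
As a rough sketch (the function name and fixed 2x factor are just for illustration, and the ImageData constructor assumes a reasonably modern browser), nearest-neighbor upscaling amounts to copying each source pixel into a 2x2 block of output pixels:

```javascript
// Sketch: 2x nearest-neighbor upscaling of an ImageData object.
function nearestNeighbor2x(src) {
  var dst = new ImageData(src.width * 2, src.height * 2);
  for (var y = 0; y < dst.height; y++) {
    for (var x = 0; x < dst.width; x++) {
      // Every destination pixel copies the nearest (halved-coordinate) source pixel.
      var si = ((y >> 1) * src.width + (x >> 1)) * 4;
      var di = (y * dst.width + x) * 4;
      for (var c = 0; c < 4; c++) dst.data[di + c] = src.data[si + c];
    }
  }
  return dst;
}
```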

The most common form of image scaling borrows a lot from the math and signal processing fields, with names like bilinear, bicubic, and Lanczos; essentially, these treat an image as some kind of composition of sinusoidal parts and try to ideally extrapolate and interpolate such that visible artifacts are marginalized. It’s all very mathy, but the result is kind of the opposite of nearest-neighbor because it has the tendency to make things blurry and fuzzy.

The thing is that the latter tries to reach some kind of mathematical ideal, because images taken by your friendly neighborhood DSLR-toting amateur (spider-powers optional) are actually samples of real world points of data— so this mathematical pursuit of purity works out very well. There’s still the factor-of-four information-theoretic gap that needs to be filled in with best-guesstimates, but there isn’t really any way to improve the way a photograph is scaled without using a higher-resolution version of said photograph. But most photographs that are taken already are sixteen-megapixel monsters and they usually still look acceptable when upscaled.

The problem arises with pixel art, little icons or buttons which someone painstakingly drew in Photoshop one lazy summer afternoon in the late 90s. They’re everywhere and each pixel isn’t captured and encoded by a sampling algorithm of some analog natural phenomena— each pixel was lovingly crafted and planted by some meticulous artist. There is no underlying analog signal to interpret, it’s a direct perceptual hookup to the mind of the creator— and that’s why bicubic sampling looks especially bad here.

Video games, before 3d graphics engines and math-aware anti-aliasing concerned with murdering jaggies, in the old civilized age of bit-blitting, were mostly constructed out of pixel art. Each color in that limited palette was placed there for a reason and could be exploited by specialized algorithms to construct higher-quality upscaled versions which remained sharp. These come with names like EPX, Scale2x, AdvMAME2x, Eagle, 2xSaI, Super 2xSaI, hqx, and most recently, Kopf-Lischinski. These algorithms could be applied in real time to emulator windows to acceptably scale a game to new sizes while eschewing jagged corners and blurry edges.

Anyway, the cool thing is that you can probably apply these algorithms in lieu of the nearest-neighbor or bilinear scaling algorithms used by browsers on retina platforms to effortlessly make old sites look shiny and smooth. With a few rough heuristics (detect if an image appears to be a sprite by testing for a limited palette, see if the image is small or a perfect square, detect if it has transparent pixels), this could be packed into a simple script include that website makers could easily inject into their pages to automagically upconvert old graphics to new shiny high-resolution ones without having to go through the actual effort of drawing new high resolution graphics and uploading them online. And this could also be packaged as a browser extension so that, once and forever after, this first-world nuisance shall be no more.
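
A crude version of that heuristic might look like the sketch below; the size cutoff and palette limit are arbitrary numbers I made up for illustration, not anything tuned or shipped anywhere.

```javascript
// Sketch: guess whether an <img> looks like a sprite worth pixel-art upscaling.
// Note: getImageData throws on cross-origin images without CORS headers.
function looksLikePixelArt(img) {
  if (img.naturalWidth > 128 || img.naturalHeight > 128) return false;
  var canvas = document.createElement('canvas');
  canvas.width = img.naturalWidth;
  canvas.height = img.naturalHeight;
  var ctx = canvas.getContext('2d');
  ctx.drawImage(img, 0, 0);
  var data = ctx.getImageData(0, 0, canvas.width, canvas.height).data;
  var palette = {}, colors = 0, hasTransparency = false;
  for (var i = 0; i < data.length; i += 4) {
    if (data[i + 3] < 255) hasTransparency = true;
    var key = data[i] + ',' + data[i + 1] + ',' + data[i + 2];
    if (!palette[key]) { palette[key] = true; colors++; }
  }
  // A small image with a tiny palette (or transparency) smells like pixel art.
  return colors <= 64 || hasTransparency;
}
```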

Before setting out to port hqx-java to javascript, I actually did some cursory googling to see if it had been done before. Midway through writing this post, I found out that it actually had been done before, in a better way, so I won’t even bother linking to my inferior version. But either way, the actual goal of this project was the part detailed in the last paragraph: an embeddable script or browser extension which could heuristically apply pixel-scaling algorithms— something I probably won’t bother trying to do until at least after I get my college laptop (which I anticipate will be a Retina Macbook Pro 15”). Nonetheless, I haven’t written an actual blog post in almost three months and it’s the last day of this month, and I guess this is better than having you all (though nobody’s probably going to read this now that Google Reader has died) assume that I’ve died. Anyway, now I’m probably going to retroactively publish old blog posts in previous months to feign continuity.


drag2up – Drag files onto any site 01 October 2010

Drag2Up is an app named horribly, as I suck at naming things. It’s very simple in terms of capabilities: no buttons, no interface, which is pretty much the ideal type of user experience, something that’s intuitive and lends focus to content, not chrome. Really, this post isn’t useful in any way. What you should do is install it and try it out for yourself (Chrome only; it would be possible to make a Firefox version, but I have no experience with Firefox extensions, and the new Jetpack isn’t quite out yet, though you could probably adapt the code relatively easily. Maybe a greasemonkey script by using GM_XHR). https://chrome.google.com/extensions/detail/bjgjolhpdlgebodaapdafhdnikagbfll

Drag and drop is probably one of the greatest forms of interaction ever; it’s intuitive and exemplifies the usefulness of a GUI. Recently, browsers have implemented drag and drop APIs (actually a feature of IE5.5+ that was later codified into part of HTML5). Some browsers, namely Chrome and Firefox, implement a dataTransfer property on drop events that allows one to access files dragged from the user’s desktop (or file browser) onto a web application. The application can read the file through the File API, and do whatever it chooses. Probably the most notable use is Gmail’s, which lets you drag files from the desktop onto a message to upload them as attachments. When it was announced, some people remarked how excited everyone was about a feature that had been present in desktop clients for several decades already.
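
In the simplest case (no extension machinery involved), reading a dropped file is only a few lines. This is a generic sketch of the API, not drag2up’s actual code:

```javascript
// Sketch: read a file dragged from the desktop onto the page.
document.addEventListener('dragover', function (e) {
  e.preventDefault(); // required, or the browser won't allow the drop
});
document.addEventListener('drop', function (e) {
  e.preventDefault();
  var file = e.dataTransfer.files[0];
  if (!file) return;
  var reader = new FileReader();
  reader.onload = function () {
    console.log('read', file.name, reader.result.byteLength, 'bytes');
    // ...hand the bytes off to an uploader from here...
  };
  reader.readAsArrayBuffer(file);
});
```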

Browser vendors have the role of enabling developers to do cool stuff. It’s up to a web developer or service to maintain an application, to update it and add new features. Undoubtedly, many do, living on the bleeding edge of innovation, but most don’t, and stay locked in a certain state for years if not longer. Extension developers have no obligation to be entirely objective in how they deal with websites, and many extensions are site-specific, adding features to sites that lack them, integrated into the interface, utilizing product-specific APIs or document structure. Browser vendors enable innovation, site developers integrate standards, and browser extensions add features to sites, either because the developer is unwilling or because of restrictions on the functionality of web apps (x-domain, persistence, etc).

Uploading images is a fairly common task, and the standard routine is a long chain of going to the file host site, clicking the attach button, hitting the browse button, navigating through folders, waiting for the page to reload after uploading, copying the link, switching tabs and pasting. Some applications like CloudApp look promising in alleviating this needlessly tedious process, but only reduce the interaction steps a little. The ideal would be making that third-party file upload service integrate seamlessly into all web applications, with a level of integration akin to Gmail’s drag and drop.

As awesome as it sounds, it’s not as easy to implement as one might like. The basic idea was to detect when a drag event started, and to iterate over all the elements, adding an absolutely positioned, semitransparent mask to each to indicate that it’s a drop target. A few issues come up with that approach, and I have a little writeup on them.
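
The core of that idea, stripped down (the class name and the choice of which elements to cover are purely illustrative):

```javascript
// Sketch: when a drag enters the page, cover likely drop targets
// (textareas, contenteditable regions) with semitransparent masks.
document.addEventListener('dragenter', function () {
  var targets = document.querySelectorAll('textarea, [contenteditable]');
  for (var i = 0; i < targets.length; i++) {
    var rect = targets[i].getBoundingClientRect();
    var mask = document.createElement('div');
    mask.className = 'drag2up-mask'; // hypothetical class name
    mask.style.cssText = 'position:absolute;background:rgba(0,128,255,0.3);' +
      'left:' + (rect.left + window.scrollX) + 'px;' +
      'top:' + (rect.top + window.scrollY) + 'px;' +
      'width:' + rect.width + 'px;height:' + rect.height + 'px;';
    document.body.appendChild(mask);
  }
}, true);
```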

A lot of WYSIWYG editors use iframes, and creating content scripts with all_frames: true doesn’t necessarily mean they run in all frames (for some very odd reason). And also for some reason, there’s no way to access frames from window.frames in content scripts. So I have this hack where, if there are frames, it injects some javascript into the real page, executed in the page’s context, to run within the iframe (if it’s the same domain, as it always is in wysiwyg editors). Then events get propagated from the sub-frame into the parent window through postMessage, which is subsequently handled by the content script, forwarded to a background page using chrome.extension.sendRequest, and it all happens in reverse once the server gives a response. As such, it’s a really bad idea to use this for large files, since the data needs to be passed around several times over several pages, and apparently V8 isn’t the best with big strings.
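
Stripped of all the details, the relay chain looks something like this sketch; the message shapes and function name are invented for illustration, and chrome.extension.sendRequest was the messaging API of the era (it has since been replaced by chrome.runtime.sendMessage).

```javascript
// 1. Script injected into the page context of the same-origin iframe:
//    forward the dropped data up to the top window.
function forwardDropToParent(fileContents) {
  window.top.postMessage({ type: 'drag2up-drop', payload: fileContents }, '*');
}

// 2. Content script in the parent page: catch the message and hand it to
//    the background page, which performs the cross-domain upload.
window.addEventListener('message', function (e) {
  if (!e.data || e.data.type !== 'drag2up-drop') return;
  chrome.extension.sendRequest({ upload: e.data.payload }, function (response) {
    // 3. The hosted URL travels back in reverse: content script -> page
    //    script -> iframe, again via postMessage.
    window.postMessage({ type: 'drag2up-url', url: response.url }, '*');
  });
});
```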

Aside from that, uploading to imgur for some odd reason takes a long time. I don’t like sites that don’t have simple JSON POST APIs, and there really aren’t many that fit that criterion. So with the initial release, images are uploaded to imgur and text files are all uploaded to a github gist.
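
To illustrate what “simple JSON POST API” means here, this is roughly all a gist upload takes against the current GitHub Gists API (the token and filename are placeholders; back in 2010 the API, and the possibility of anonymous gists, looked a bit different):

```javascript
// Sketch: upload a text file as a GitHub gist with a single JSON POST.
var GITHUB_TOKEN = 'placeholder-token'; // a personal access token is required nowadays
fetch('https://api.github.com/gists', {
  method: 'POST',
  headers: {
    'Authorization': 'token ' + GITHUB_TOKEN,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    description: 'uploaded via drag and drop',
    public: true,
    files: { 'dropped.txt': { content: 'file contents go here' } }
  })
}).then(function (res) { return res.json(); })
  .then(function (gist) { console.log(gist.html_url); });
```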


ShinyTouch Images 01 August 2009

This is the app running; notice that it hasn’t been calibrated yet.

Here is the auto-calibration process; it alternates between black and white.

This is part of Auto-Calibration.

This is some stuff from the command line:

This is just hovering over the screen. Notice it’s not touching, and the algorithm can distinctly recognize the lack of a touch because the reflection is separated from the finger by a significant gap. (Compare the top red bar.)

This is an actual touch; you can see that the red bar is far larger, and it’s very distinctly a touch event.

There’s a draw tool, and here is a primitive drawing of a smiley. The dots come from an issue with PIL/OpenCV or something that makes the image all chopped up and sends the point to an arbitrary point on the screen.

This is the magical sensor the whole thing is powered by: an unmodified Playstation 3 Eye on a tissue box with a pink Office Depot eraser on the back (because the camera is built at a tilt and the script can’t handle those tilts very well).

It’s not too insanely slow either. This is 31 frames per second coming from a pure python app, all from a scripting language. It’s nowhere near as fast as native apps, of course.