GOCR.js
Optical Character Recognition in Javascript

GOCR.js is a pure-javascript version of the GOCR program, automatically converted using Emscripten. It is a simple OCR (Optical Character Recognition) program that can convert scanned images of text back into text. Clocking in at a bit under a megabyte of Javascript with no hefty training data dependencies (looking at you, Tesseract), it's on the lighter end of the spectrum.

Below is a simple demo, which should hopefully demonstrate the capabilities but will more likely show the substantial limitations of the library. Hit the buttons on the left to reset the canvas or to draw some random text in a random font. You can also try drawing something yourself.

The GOCR.js API is really simple. First, include gocr.js, which is about 1 MB in size.

					<script src="gocr.js"></script>

This file exposes a single global function, GOCR, which takes an image as an argument and returns the recognized text as a string.

					var string = GOCR(image);
					alert(string);

The image argument can be a canvas element, a 2D rendering context (CanvasRenderingContext2D), or an ImageData instance.
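The accepted types listed above do not include `<img>` elements, so one option for recognizing an existing picture is to copy it onto a canvas first. Here is a minimal sketch of that idea, assuming the image has already finished loading. The helper name `recognizeImage` and its `ocr` and `doc` parameters are made up for this example and are not part of gocr.js; they default to the global `GOCR` function and the page's `document`.

```javascript
// Hypothetical helper (not part of gocr.js): run OCR on an already-loaded <img>
// by drawing it onto a temporary canvas and handing that canvas to GOCR.
// `ocr` and `doc` are optional and exist only so the helper can be stubbed.
function recognizeImage(img, ocr, doc) {
  ocr = ocr || GOCR;       // global function exposed by gocr.js
  doc = doc || document;   // the page's document

  // Size the canvas to the image's intrinsic dimensions when available.
  var canvas = doc.createElement('canvas');
  canvas.width = img.naturalWidth || img.width;
  canvas.height = img.naturalHeight || img.height;

  // Copy the pixels onto the canvas.
  var ctx = canvas.getContext('2d');
  ctx.drawImage(img, 0, 0);

  // Per the description above, passing `ctx` or
  // ctx.getImageData(0, 0, canvas.width, canvas.height) should work as well.
  return ocr(canvas);
}
```

In a page this might be called as `alert(recognizeImage(document.getElementById('scan')))`, where `scan` is a placeholder id. Note that reading pixels from a canvas that holds a cross-origin image is blocked by the browser unless the image was served with suitable CORS headers, so same-origin images are the safe case.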

What consistently amazes me about OCR isn't its astonishing quality or lack thereof. Rather, it's how utterly unpredictable the results can be. Sometimes a barely legible block of text comes through absolutely pristine, and other times a perfectly clean input produces complete garbage.

Aside from the relentless pursuit of Atwood's law, there are legitimate applications which might benefit from client-side OCR (I'd like to think that I'm currently working on one, and no, it's not solving the wavy squiggly letters blockading your attempts at building a spam empire). Arguably, it'd be best to port the best open source OCR engine in existence (looking at you, Tesseract). Unlike OCRAD and GOCR, which interestingly seem to be powered by painstakingly written rules for each recognizable glyph, Tesseract uses neural networks and the like to learn features common to different letters (which means it's extensible and multilingual). When you include the training data, Tesseract is actually kind of massive: a functional Emscripten port would probably be at least 30 times the size of GOCR.js!
