[Simh] OSs with accessible documentation

Sun Feb 7 07:25:49 EST 2016

If you want to get serious about OCRing documents, look at how Project Gutenberg does it.

After OCR, each page goes through 3 passes of cleanup and formatting.

-------- Original message --------
From Paul Koning <paulkoning at comcast.net> 
Date: 02/06/2016  2:05 PM  (GMT-07:00) 
To Tom Morris <tfmorris at gmail.com> 
Cc simh at trailing-edge.com 
Subject Re: [Simh] OSs with accessible documentation 

> On Feb 6, 2016, at 2:28 PM, Tom Morris <tfmorris at gmail.com> wrote:
> 
> ...
> I think Tesseract is pretty close to the quality of ABBYY.  Google has trained it on a very large corpus and it's used for Google Books, Google Drive OCR, etc., so it gets a fair amount of attention.  Of course, a lot of the training effort has gone into training it for over 100 languages, which isn't really relevant to old computer documentation, but even for plain English, it's received lots of training attention.

Is Tesseract open source?  It sounds vaguely like the one I tried, but I'm not sure; I remember something that felt more like a toolkit than like an application.

Google's OCR is pretty lousy in many cases.  Perhaps that's because they just feed it stuff without ever looking at the result.  There are plenty of Google books that have errors in the majority of the words.

paul
