<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><body >If you want to get serious about OCRing documents, look at how Project Gutenberg does it.<div><br></div><div>After OCR, each page goes through three passes of cleanup and formatting.</div><br><br><br>-------- Original message --------<br>From Paul Koning &lt;paulkoning@comcast.net&gt; <br>Date: 02/06/2016  2:05 PM  (GMT-07:00) <br>To Tom Morris &lt;tfmorris@gmail.com&gt; <br>Cc simh@trailing-edge.com <br>Subject Re: [Simh] OSs with accessible documentation <br> <br><br><br>&gt; On Feb 6, 2016, at 2:28 PM, Tom Morris &lt;tfmorris@gmail.com&gt; wrote:<br>&gt; <br>&gt; ...<br>&gt; I think Tesseract is pretty close to the quality of ABBYY.  Google has trained it on a very large corpus and it's used for Google Books, Google Drive OCR, etc., so it gets a fair amount of attention.  Of course, a lot of the training effort has gone into training it for over 100 languages, which isn't really relevant to old computer documentation, but even for plain English, it's received lots of training attention.<br><br>Is Tesseract open source?  It sounds vaguely like the one I tried, but I'm not sure; I remember something that felt more like a toolkit than like an application.<br><br>Google's OCR is pretty lousy in many cases.  Perhaps that's because they just feed it stuff without ever looking at the result.  There are plenty of Google books that have errors in the majority of the words.<br><br>     paul<br><br><br>_______________________________________________<br>Simh mailing list<br>Simh@trailing-edge.com<br>http://mailman.trailing-edge.com/mailman/listinfo/simh</body>