[Simh] OSs with accessible documentation

Timothe Litt litt at ieee.org
Sat Feb 6 16:13:29 EST 2016


On 06-Feb-16 16:05, Paul Koning wrote:
>> On Feb 6, 2016, at 2:28 PM, Tom Morris <tfmorris at gmail.com> wrote:
>>
>> ...
>> I think Tesseract is pretty close to the quality of ABBYY.  Google has trained it on a very large corpus and it's used for Google Books, Google Drive OCR, etc., so it gets a fair amount of attention.  Of course, much of that effort has gone into training it for over 100 languages, which isn't really relevant to old computer documentation, but even plain English has received a lot of training attention.
> Is Tesseract open source?  
Yes, it's open source.  https://github.com/tesseract-ocr

> It sounds vaguely like the one I tried, but I'm not sure; I remember something that felt more like a toolkit than like an application.
Yes, it's the engine.  There are various wrappers that provide more
polished interfaces.
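
To illustrate the engine/wrapper split: a wrapper can be as thin as a few
lines around the command-line program.  This is only a minimal sketch,
assuming the `tesseract` executable is installed and on your PATH; the
function names are made up for illustration and are not part of
Tesseract itself.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for the tesseract command-line program.

    Passing "stdout" as the output base asks tesseract to print the
    recognized text instead of writing it to a file.
    (Function name is hypothetical, for illustration only.)
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one scanned page and return the recognized text.

    Assumes the tesseract executable is installed and on PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,  # collect the recognized text from stdout
        text=True,            # decode output as str rather than bytes
        check=True,           # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling `ocr_image("page-001.png")` (a hypothetical file name) would return
the raw text for one scanned page; the more polished front ends add layout
handling, batch processing and proofreading on top of this.
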
> Google's OCR is pretty lousy in many cases.  Perhaps that's because they just feed it stuff without ever looking at the result.  There are plenty of Google books that have errors in the majority of the words.
The amazing thing about a talking dog is not how well it talks, but that
it talks at all.

For the volume of stuff they've scanned, it's pretty impressive.  If a
book is that bad, no one looked at the result and retrained.  What Tom
sent around earlier is fairly typical (in my limited experience).  It
would take someone a good hour or two to clean it up.

> 	paul
>
>

