[Simh] OSs with accessible documentation

Ron Young rly1 at embarqmail.com
Sun Feb 7 13:33:53 EST 2016


On Feb 7, 2016 4:25 AM, khandy21yo <khandy21yo at gmail.com> wrote:
>
> If you want to get serious about OCRing documents, look at how Project Gutenberg does it.
>
> After OCR, each page goes through 3 passes of cleanup and formatting.
>

When I was doing my master's, I worked on OCR/IR. It has been a while, but here are some things to consider:

Accuracy can be improved by proper training with known ground truth data. If the typeface of the RT11 manuals is the 'DEC' standard and matches the other scanned files, you can use that for training: produce images from the DOCUMENT output alongside the straight ASCII. That gives you the initial ground truth.
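
Once you have ground truth, you can also put a number on how well an engine is doing before and after training. A minimal sketch in Python, assuming you have the ground truth text and the OCR output for the same page as strings (the function names and the sample strings here are made up for illustration):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth: str, ocr_text: str) -> float:
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical sample: the engine read the letter O as a zero.
    print(char_error_rate("MOUNT DK0:", "M0UNT DK0:"))
```

A lower rate is better; comparing the rate on a held-out page before and after training shows whether the training actually helped.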

Tesseract is the OCR engine. There is also a project called OCRopus that provides layout analysis and other processing, using Tesseract for OCR.

You can improve accuracy by running multiple OCR engines and voting on the results.
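
A rough sketch of the voting idea in Python. It assumes each engine's output for a line has already been aligned so that all outputs split into the same number of whitespace-separated tokens; real output usually needs an alignment step first, which is left out here. The function name and sample strings are made up for illustration.

```python
from collections import Counter


def vote_tokens(outputs):
    """Majority vote, token by token, across several OCR outputs of one line.

    Assumes every output has the same number of tokens. Ties go to the
    engine listed first, since Counter keeps first-seen order among
    equal counts.
    """
    token_lists = [text.split() for text in outputs]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("outputs must have the same number of tokens")
    voted = []
    for candidates in zip(*token_lists):
        winner, _count = Counter(candidates).most_common(1)[0]
        voted.append(winner)
    return " ".join(voted)


if __name__ == "__main__":
    # Three hypothetical engine outputs, each with a different misread.
    print(vote_tokens([
        "RUN the pr0gram",
        "RUN tbe program",
        "RVN the program",
    ]))
```

Voting only helps when the engines make different mistakes, which is why mixing unrelated engines tends to work better than running one engine several times.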

Some packages that may help: Tesseract and Cuneiform (another OCR engine, from Russia). Unpaper is a package that can help clean up scanned images before OCRing.
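
For a command-line version of the clean-then-OCR flow, something like the following Python sketch may be a starting point. It assumes the unpaper and tesseract commands are installed and on the PATH, that the scan is in a PNM format (such as .pgm) that unpaper accepts, and that the defaults of both tools are acceptable. The file names and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(scan: Path, workdir: Path, lang: str = "eng"):
    """Return the unpaper and tesseract command lines for one scanned page."""
    cleaned = workdir / (scan.stem + "-clean" + scan.suffix)
    out_base = workdir / scan.stem  # tesseract appends ".txt" itself
    unpaper_cmd = ["unpaper", str(scan), str(cleaned)]
    tesseract_cmd = ["tesseract", str(cleaned), str(out_base), "-l", lang]
    return unpaper_cmd, tesseract_cmd


def ocr_page(scan: Path, workdir: Path, lang: str = "eng") -> Path:
    """Clean the scan with unpaper, then OCR it; return the text file path."""
    workdir.mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(scan, workdir, lang):
        subprocess.run(cmd, check=True)  # stop if either tool fails
    return Path(str(workdir / scan.stem) + ".txt")


if __name__ == "__main__":
    # Only show the commands; call ocr_page() to actually run them.
    for cmd in build_commands(Path("page001.pgm"), Path("out")):
        print(" ".join(cmd))
```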

Having said all of that: for my personal stuff I use gscan2pdf under Ubuntu since it includes most of the above packages in a GUI.

-ron

