[Simh] OSs with accessible documentation

Tom Morris tfmorris at gmail.com
Sat Feb 6 14:28:52 EST 2016


On Sat, Feb 6, 2016 at 2:01 PM, Paul Koning <paulkoning at comcast.net> wrote:

>
> > On Feb 5, 2016, at 6:10 PM, Timothe Litt <litt at ieee.org> wrote:
> >
> > Some of the PDFs on bitsavers are searchable.  It would be a good
> > project to OCR the rest into searchable pdfs - as that also means that
> > the text can be extracted.   OCR is getting good enough (finally) that
> > it's feasible.  I'm sure that they'd be accepted back into bitsavers  -
> > searchable is good for everyone.
>

To clarify, I'd be focusing on the PDFs which consist of scanned images
only, not those that already have a searchable text layer or those
which are "native" text PDFs like the RT-11 V5.6 docs.
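
A rough sketch of how that filter might look, assuming poppler's pdftotext
is installed and that an image-only scan yields essentially no extractable
text. The function names and the 20-character threshold are made up for
illustration, not taken from any existing tool:

```python
import subprocess
from pathlib import Path


def has_text_layer(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True if the extracted text has enough non-whitespace characters.

    min_chars is an arbitrary, assumed threshold: image-only scans usually
    yield nothing but form feeds and blank lines.
    """
    non_blank = "".join(extracted_text.split())
    return len(non_blank) >= min_chars


def extract_text(pdf: Path) -> str:
    """Run pdftotext on the file; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def needs_ocr(pdf: Path) -> bool:
    """A PDF is an OCR candidate when no usable text layer is found."""
    return not has_text_layer(extract_text(pdf))
```

Calling needs_ocr() on each PDF in a directory would give the list of
candidates. A scan with a very poor existing text layer would still be
skipped by this check, so it is only a first pass.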

> Some disapprove of OCR for reasons I don't really understand.
>

I'd be interested in hearing the reasons.  I can't see any downside.

> A problem with OCR is that it's hard to find a good one.  I dabbled with an
> OCR plugin that Adobe once offered (free, and worth about that).  I also
> once tried an open source OCR, which was vastly inferior still.
>
> But commercial OCR programs exist that do a decent job, especially if the
> scanned material is clean as is the case for much of what is on Bitsavers.
> I use Abbyy FineReader which I rather like, but I expect there are other
> good ones out there too.
>

I think Tesseract is pretty close to the quality of ABBYY.  Google has
trained it on a very large corpus and it's used for Google Books, Google
Drive OCR, etc., so it gets a fair amount of attention.  Of course, a lot of
that effort has gone into supporting over 100 languages, which isn't
really relevant to old computer documentation, but plain English has
received plenty of training attention too.


> One key point is that you typically need to spend some time "training" the
> program on the particular type of material -- typeface etc. -- that you're
> working with.  The default settings are rarely adequate.
>

I don't expect that to be true.  The Google training set includes a large
number of different fonts.  Do you have specific examples of documents that
are difficult to OCR that I could check?

Tom