[Simh] OSs with accessible documentation

Timothe Litt litt at ieee.org
Sat Feb 6 14:27:24 EST 2016


On 06-Feb-16 14:01, Paul Koning wrote:
>> On Feb 5, 2016, at 6:10 PM, Timothe Litt <litt at ieee.org> wrote:
>>
>> Some of the PDFs on bitsavers are searchable.  It would be a good
>> project to OCR the rest into searchable PDFs, since that also means
>> the text can be extracted.  OCR is (finally) getting good enough that
>> it's feasible.  I'm sure that they'd be accepted back into bitsavers;
>> searchable is good for everyone.
> Some disapprove of OCR for reasons I don't really understand.
In the preservation business, one doesn't want to lose bits.  But it's
possible to keep the scanned image and add searchable/extractable text. 
There's also no reason to throw the scanned version away; foo.pdf +
foo_ocr.pdf = not much expense in these days of multi-TB disk drives.
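
As a rough sketch of that side-by-side layout (not something anyone in
this thread has said they use): the snippet below leaves foo.pdf
untouched and writes a searchable foo_ocr.pdf next to it.  It assumes
the open source ocrmypdf command-line tool is installed; the helper
names are hypothetical, and any tool that lays a text layer over the
scanned image would serve equally well.

```python
from pathlib import Path
from typing import List, Optional
import shutil
import subprocess


def ocr_sibling(scan: Path) -> Path:
    """Name for the searchable copy: foo.pdf -> foo_ocr.pdf (same directory)."""
    return scan.with_name(f"{scan.stem}_ocr{scan.suffix}")


def build_command(scan: Path) -> List[str]:
    """Command line for ocrmypdf: input scan first, output file second.

    Assumption: ocrmypdf is the tool in use; it reads the input and
    writes a new PDF, so the original scan is never modified.
    """
    return ["ocrmypdf", str(scan), str(ocr_sibling(scan))]


def add_text_layer(scan: Path) -> Optional[Path]:
    """Run the OCR tool if it is available; return the new file's path."""
    if shutil.which("ocrmypdf") is None:
        return None  # tool not installed; nothing is written
    subprocess.run(build_command(scan), check=True)
    return ocr_sibling(scan)
```

Calling add_text_layer(Path("foo.pdf")) would leave the original bits
alone and, if the tool is present, produce foo_ocr.pdf alongside it.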

> A problem with OCR is that it's hard to find a good one.  I dabbled with an OCR plugin that Adobe once offered (free, and worth about that).  I also once tried an open source OCR, which was vastly inferior still.
>
> But commercial OCR programs exist that do a decent job, especially if the scanned material is clean as is the case for much of what is on Bitsavers.  I use Abbyy FineReader which I rather like, but I expect there are other good ones out there too.
I've used the one that came with my ~$150 printer/scanner/fax - and been
very surprised at the (high) quality.

Prior to that, I've been very disappointed.  But I haven't had need to
get seriously into OCR.  I have heard good things about Tesseract - once
you get over the hump of setup.  Apparently it has a lot of training
material available.  And (not as relevant here), it supports many
languages.  I think Google took it over from HP and has used it for its
various massive scanning projects.
> One key point is that you typically need to spend some time "training" the program on the particular type of material -- typeface etc. -- that you're working with.  The default settings are rarely adequate.
Yes, I know.  Although that's gotten less necessary.  One thing we have
going is that companies tend to have a stable/slowly-evolving brand
identity that dictates things like typeface.  So 90+% of all DEC manuals
produced in a 5-10 year period have the same typeface/layout style. 
Then a new era begins.  This tends to be true even of smaller
companies.  So even where training is necessary, it pays back over a
fair volume of material.

But there's no denying that it's a <Capital-P>roject.  And that there
are significant fixed costs that it takes a lot of material to amortize...
> 	paul
>



