[Simh] OSs with accessible documentation
Timothe Litt
litt at ieee.org
Sat Feb 6 14:27:24 EST 2016
On 06-Feb-16 14:01, Paul Koning wrote:
>> On Feb 5, 2016, at 6:10 PM, Timothe Litt <litt at ieee.org> wrote:
>>
>> Some of the PDFs on bitsavers are searchable. It would be a good
>> project to OCR the rest into searchable pdfs - as that also means that
>> the text can be extracted. OCR is getting good enough (finally) that
>> it's feasible. I'm sure that they'd be accepted back into bitsavers -
>> searchable is good for everyone.
> Some disapprove of OCR for reasons I don't really understand.
In the preservation business, one doesn't want to lose bits. But it's
possible to keep the scanned image and add searchable/extractable text.
There's also no reason to throw the scanned version away; foo.pdf +
foo_ocr.pdf = not much expense in these days of multi-TB disk drives.
> A problem with OCR is that it's hard to find a good one. I dabbled with an OCR plugin that Adobe once offered (free, and worth about that). I also once tried an open source OCR, which was vastly inferior still.
>
> But commercial OCR programs exist that do a decent job, especially if the scanned material is clean as is the case for much of what is on Bitsavers. I use Abbyy FineReader which I rather like, but I expect there are other good ones out there too.
I've used the one that came with my ~$150 printer/scanner/fax - and been
very surprised at the (high) quality.
Prior to that, I'd been very disappointed. But I haven't had need to
get seriously into OCR. I have heard good things about tesseract - once
you get over the hump of setup. Apparently it has a lot of training
material available. And (not as relevant here), many languages. I
think Google took it over from HP and has used it for its various
massive scanning projects.
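
For what it's worth, here is a rough sketch of the "keep the scan, add a
text layer" idea using tesseract's command-line tool. Assumptions (mine,
not tested against bitsavers material): tesseract is installed and on
PATH, the page exists as an image file such as foo.tif (tesseract reads
images, not PDFs), and the helper names and the foo_ocr naming are just
illustrative. The original scan is never modified.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_output_base(scan: Path) -> Path:
    """Map foo.tif to foo_ocr; tesseract appends '.pdf' to this base itself."""
    return scan.with_name(scan.stem + "_ocr")


def build_command(scan: Path, lang: str = "eng") -> list[str]:
    # `tesseract IMAGE OUTPUTBASE -l LANG pdf` writes OUTPUTBASE.pdf,
    # a PDF holding the page image plus a searchable text layer.
    return ["tesseract", str(scan), str(ocr_output_base(scan)), "-l", lang, "pdf"]


def add_text_layer(scan: Path, lang: str = "eng") -> Path:
    """Run tesseract on one page image and return the path of the new PDF."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(build_command(scan, lang), check=True)
    base = ocr_output_base(scan)
    # Append the suffix by hand so stems containing dots are left intact.
    return base.with_name(base.name + ".pdf")
```

Calling add_text_layer(Path("foo.tif")) should leave foo.tif alone and
write foo_ocr.pdf next to it. Recognition quality will of course depend
on the scan and on whatever training data is in use.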
> One key point is that you typically need to spend some time "training" the program on the particular type of material -- typeface etc. -- that you're working with. The default settings are rarely adequate.
Yes, I know. Although that's gotten less necessary. One thing we have
going is that companies tend to have a stable/slowly-evolving brand
identity that dictates things like typeface. So 90+% of all DEC manuals
produced in a 5-10 year period have the same typeface/layout style.
Then a new era begins. This tends to be true even of smaller
companies. So even where training is necessary, it pays back over a
fair volume of material.
But there's no denying that it's a <Capital-P>roject. And that there
are significant fixed costs that it takes a lot of material to amortize...
> paul
>