[Simh] PDP-15/76

Thu May 5 10:57:43 EDT 2016

On Thu, May 05, 2016 at 07:41:24AM -0400, Timothe Litt wrote:
> On 05-May-16 07:18, Mattis Lind wrote:
> >
> >
> >
> >     I didn't have much luck with tumble (some time ago); it tended to
> >     complain about the tiff input formats.
> >     That version/website hasn't been updated since 2003.
> >     I do have a more recent version in my archive; Don't recall where I
> >     found it, but it does somewhat better.  I posted it at
> >     https://drive.google.com/file/d/0B2g2SW-v7RFZWW1BS1E4eVk3cVU/view?usp=sharing
> >     for now, but it needs a permanent home...
> >
> >     ImageMagick (most distributions have it, or see
> >     http://www.imagemagick.org/) is my go-to tool for batch image
> >     conversion/basic manipulations - e.g. rotate, resize, flip, crop,
> >     dither, resample, etc.  It runs on Linux, Windows, OS X and iOS.
> >     You can also adjust the colormap size to shrink the files,
> >     depending on the input.
> >
> >        convert *.tiff manual.pdf
> >
> >
> >
> > It was ImageMagick and its convert tool that I ended up using for the
> > last file. But first, I have to scan the manual twice since it is
> > double sided (there is no duplexer, and if there were it would have
> > been extremely slow, I presume). The scanner program generates file
> > name numbering that I cannot control when scanning multiple pages, so
> > the tricky part is splicing everything together at the end. Second,
> > the scanner jammed at certain times, interrupting the number
> > sequence. I ended up doing it manually. Maybe there is a way to do it
> > more automatically. I will find out next time I scan a document.
> >  
> >
> >
> >     There are a bunch of tools for manipulating PDFs; some free, some not.
> >     Here are a couple.
> >     http://www.pdfsam.org/download-pdfsam-basic/
> >     https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
> >     http://pdfchain.sourceforge.net/
> >     https://wiki.gnome.org/Apps/PdfMod
> >
> >
> Others are probably more expert than I am, but here are a few techniques
> that I've learned:
> 
> Not much one can do about the scanner issues.   I usually don't even use
> the automatic feeder - a jam on an irreplaceable document can be a
> disaster.  Duplexers are even more dangerous since they have to run the
> paper around more sharp curves.
> 
> But tools like pdfmod allow you to rearrange pages in the PDF with drag
> and drop, so you can put the pages in order fairly quickly.
> 
> Another trick is to sort the files by modification time, oldest first
> (e.g. on unix: convert `ls -1tr *.tiff` doc.pdf). This will put them
> into the pdf in the same order that you scanned them.  If you have a few
> pages out of order due to rescans or jams, they can be fixed with
> pdfmod/pdftk.
> 
> For the duplexing issue:
> 
> pdfsam mix will merge odd and even pages.  So you can scan the odd pages
> in one directory & create a PDF with them, and the even pages in a
> second directory.  Then use pdfsam to interleave them into the final
> output file.
> 
> I find that this is quicker than turning pages over, even if I'm not
> using the automatic feeder.
> 
> The PDF tools also will rotate pages - which helps with landscape
> fold-out pages.  And the times that I accidentally scan a page
> upside-down :-)
> 
> Thanks for scanning your archives.
> 
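The odd/even merge described above can also be done on the file lists
before any PDF is built. A minimal sketch, assuming one file name per line
in two list files; the helper name interleave_lists and the list file names
are made up for illustration and are not part of pdfsam or any other tool
mentioned here:

```shell
#!/bin/sh
# interleave_lists ODD_LIST EVEN_LIST
# Prints lines alternately from the two list files:
# odd[1], even[1], odd[2], even[2], ...
# Hypothetical helper; assumes file names without embedded newlines.
interleave_lists() {
    # paste with a newline delimiter alternates lines from both inputs;
    # sed drops the empty lines paste emits when one list is shorter.
    paste -d '\n' "$1" "$2" | sed '/^$/d'
}

# Possible use (names must not contain spaces for the $(...) to work):
#   ls odd/*.tiff  > odd.txt
#   ls even/*.tiff > even.txt    # add -r if the backs were scanned last-first
#   convert $(interleave_lists odd.txt even.txt) manual.pdf
```

If the stack is simply flipped over for the second pass, the even pages
usually come out in reverse order, hence the ls -r remark in the comment.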

I eventually got sick & tired of keeping piles of paper (bills, receipts,
and other stuff) around and decided to just scan it all, shred most of it,
and keep only a few key documents on paper.

To this end, I've built a little tool chain:
 - scanpage: just scans via the attached USB scanner at 600 dpi, A4, in
   greyscale to raw unpacked TIFF; I usually name the files 1.tiff,
   2.tiff, 3.tiff ...
 - scan2page: does the heavy lifting
   - compress original raw scans as TIFF with compression mode ZIP for
     archival - just in case I ever need the original raw scans again
   - normalize & despeckle the scans
   - downconvert to monochrome
   - deskew the scans
   - compress the scans with TIFF G4 (most efficient compression for
     monochrome I've found)
   - create two different display/archival formats from the scan
     - djvu (very compact)
     - pdf (very portable)
   - finally prompt for the name of the three output files (tar, djvu, pdf)
 - git: all of my archival scans are kept in a git repository, giving me:
   - revision control (e.g. I know when a document was scanned or the scan
     redone)
   - trivial replication for redundancy (just git clone & git pull)
   - integrity checking, e.g. git fsck will find bit flips

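To make the scan2page steps listed above a bit more concrete, here is a
rough sketch of what the per-page commands could look like. This is not
the actual scan2page script. The tool choices (ImageMagick convert, cjb2
from DjVuLibre, tiff2pdf from libtiff), the 40% deskew threshold and the
raw-/g4- file names are all assumptions for illustration. The function only
prints the commands so they can be reviewed before anything is run:

```shell
#!/bin/sh
# page_commands N
# Print the commands a scan2page-like step might run for page N (N.tiff).
# Hypothetical sketch: tools, flags and output names are assumptions.
page_commands() {
    n=$1
    # 1. keep the raw scan, ZIP-compressed, for archival
    echo "convert $n.tiff -compress Zip raw-$n.tiff"
    # 2. normalize, despeckle, go monochrome, deskew, store as TIFF G4
    echo "convert $n.tiff -normalize -despeckle -monochrome -deskew 40% -compress Group4 g4-$n.tiff"
    # 3. the two display formats, both built from the G4 TIFF
    echo "cjb2 -dpi 600 g4-$n.tiff $n.djvu"
    echo "tiff2pdf -o $n.pdf g4-$n.tiff"
}

# Review first, then execute, e.g.:
#   page_commands 1
#   page_commands 1 | sh -x
```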
The whole tool chain is written for Linux (with Debian in mind, but it
will run fine elsewhere, and should run on *BSD as well, as long as the
tools are provided). The scan2page script checks that its kit is complete
before touching the scans, i.e. that it can find all the external tools it
will invoke later.

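Such a completeness check is easy to sketch in plain sh. The function
name missing_tools and the tool list in the example are made up for
illustration; the real script may well do this differently:

```shell
#!/bin/sh
# missing_tools TOOL...
# Print each named tool that cannot be found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        # command -v is the POSIX way to ask whether a command exists
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Example: refuse to start if anything is missing (tool list is assumed).
#   missing_tools convert cjb2 tiff2pdf tar || exit 1
```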
That, together with a (for me) reasonable directory structure, makes it
very easy for me to find old documents again - certainly much easier than
digging through a stack of physical folders.

One thing I have on the todo list for this is eventually hooking OCR into
the process.

links:
http://www.thangorodrim.ch/tmp/scanpage
http://www.thangorodrim.ch/tmp/scan2page

Kind regards,
            Alex.
-- 
"Opportunity is missed by most people because it is dressed in overalls and
 looks like work."                                      -- Thomas A. Edison