[Simh] best way to scan 172 column fanfold 80s printout?

Sun Feb 11 18:32:43 EST 2018

Almost every wide printer had adjustable widths. If this is a problem, 
try a different printer. Non-standard lengths may be more of an issue.

If you shoot an entire page at a time, line spacing is a problem for 
your OCR software, not for the paper transport.

With tractor feed, page length variability should not be an issue - every 
page will have the same number of holes. I almost always ran whole boxes 
of paper without adjusting top of form. On the big printers with dual 
tractors I could start a new box without adjusting top of form. Some 
printers may not repeatably feed exactly an integer number of holes each 
time -- I didn't experience this.
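
To make the "same number of holes" point concrete, here is a rough sketch 
of the arithmetic. It assumes the usual 1/2 inch tractor hole pitch and, 
purely as an example, an 11 inch form; the function name and the page 
lengths are illustrative, not taken from any particular printer.

```python
from fractions import Fraction

# Assumption: standard continuous-form tractor holes, one every 1/2 inch.
HOLE_PITCH_IN = Fraction(1, 2)


def page_geometry(page_length_in, lines_per_inch):
    """Return (tractor holes per page, full print lines per page).

    Raises ValueError if the page length is not a whole number of holes,
    since such a form could not keep top of form on a tractor feed.
    """
    length = Fraction(page_length_in)
    holes = length / HOLE_PITCH_IN
    if holes.denominator != 1:
        raise ValueError("page length is not a whole number of tractor holes")
    lines = length * lines_per_inch
    # Only whole lines fit on the page, so round down.
    return int(holes), lines.numerator // lines.denominator


if __name__ == "__main__":
    for lpi in (6, 8):
        holes, lines = page_geometry(11, lpi)
        print(f"11 in form at {lpi} LPI: {holes} holes, {lines} lines")
```

As long as the feed always advances that whole number of holes, top of 
form cannot drift; only the line count per frame changes with the pitch.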

Friction feed is another matter, and accumulated form feed errors will 
present exactly the problems you describe. Many interesting listings 
probably don't have tractor holes.

Problems feeding and stacking will prevent this from being an unattended 
operation. Perfs will be weak - both the perfs between the page and the 
tractor holes and the perfs between pages. A Data Products B1200 would 
sometimes break perfs on new paper. Some consumer model dot-matrix 
printers had gentle form feeds. Having a manual feed mode, requiring a 
button push for each page, would probably be a good idea.

Careful camera setup up front would be required. Proper camera selection 
and position should minimize keystone, pincushion, barrel, and similar 
distortion. 
Even lighting would be required - much as would be used with a copy 
stand. I am somewhat more worried that the paper would be hanging 
loosely from the printer, not sandwiched between glass. The camera will 
need a small enough aperture to keep the entire sheet in focus while the 
paper does whatever the heck it wants to. It gets into the basic 
principle that you have to get the analog part right if you want to 
successfully digitize.
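
For a feel of how much aperture buys you, the standard thin-lens depth of 
field formulas can be sketched as below. The focal length, shooting 
distance, and 0.03 mm circle of confusion are assumed example values, not 
measurements from any real setup, and the function name is made up for 
illustration.

```python
import math


def depth_of_field(focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Return (near_limit_mm, far_limit_mm) of acceptable focus.

    Uses the usual hyperfocal-distance approximation. coc_mm is the
    circle of confusion; 0.03 mm is a common full-frame assumption.
    """
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = (subject_mm * (hyperfocal - focal_mm)
            / (hyperfocal + subject_mm - 2 * focal_mm))
    if subject_mm >= hyperfocal:
        far = math.inf  # everything beyond the near limit is acceptably sharp
    else:
        far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return near, far


if __name__ == "__main__":
    # Hypothetical setup: 50 mm lens, paper about 1 m from the camera.
    for n in (2.8, 8, 16):
        near, far = depth_of_field(50, n, 1000)
        print(f"f/{n}: sharp from {near:.0f} mm to {far:.0f} mm")
```

Stopping down widens the zone the paper can wander in, at the cost of 
needing more light or a longer exposure.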

I probably did understate the effort. There would be work to do, but no 
part of it seems to be intractable to me. If worst came to worst and you 
had to ditch the printer's control electronics and drive the feed 
steppers directly, you would still be ahead by not having to build a 
paper transport from scratch.

I remember wanting one of those Thunderscan devices. In this case I 
think that approach would cause more problems than it would solve.

OCRing the result, however you collect the images, is likely the hardest 
part anyway.

On 02/11/2018 03:10 PM, Timothe Litt wrote:
> It's not that simple.  You need to deal with at least 2 common 
> vertical pitches (6 & 8 LPI), and a number of page lengths (and 
> widths).  These need to be setup per job; not all printers support all 
> these.  Plus, misalignment (as Al noted, crossing the perforations at 
> the bottom of a page is quite common).  The OP mentioned that his 
> listings have a hard crease; this will cause (at least) feed and 
> stacking problems.  Form feed causes a high-speed slew; this becomes 
> less reliable as the distance moved increases.  You're proposing an 
> entire page at a time - which means that the paper will jump off the 
> tractors frequently.[1] Old paper is fragile.  Over hundreds of pages, 
> dimensions may not be stable; it was not uncommon to have to re-adjust 
> TOF after a while.  There's a fair bit of error detection and recovery 
> to work out.
>
> Lighting is an issue, as is compensating for keystoning and other 
> misalignments.  Most cameras don't have a standard remote trigger 
> interface - one of the pointers I provided loads modified firmware 
> into cameras from one manufacturer to make this work.  If you look at 
> digital camera reviews, you'll see that the lenses have varying 
> degrees of artifacts, especially at the edges.  So you need to find 
> and zoom to an area that's relatively "flat" & doesn't need a lot of 
> correction.  While depth of field will help, it also will result in 
> apparent font size changes as paper sways forward and back.  If you 
> stop that, you simplify the OCR - and don't need as much depth of field.
>
> There are many backgrounds that need to be subtracted for OCR to 
> work.  (Printer paper was notorious for institutional logos, as well 
> as bars and other aids to human readers.)  Then there are the other 
> issues mentioned in my earlier note.
>
> It seems simple, but it is a P.roject.  That's a capital P. With a lot 
> of roject to work out.
>
> It's worthwhile, but it's not simple.  It's a pretty interesting 
> hardware (and software) project.  I don't mean to discourage anyone 
> who wants to work on it - but you need to go in with eyes open, or 
> you'll end up very, very frustrated.
>
> Thunderscan tried to scan line by line & retrieve grayscale; the 
> challenges were piecing together the adjacent lines with pixel 
> resolution.   The focal distance was constant because the camera was 
> on a carriage.  The idea here is to capture a page per frame.  So the 
> registration problems are quite different.  One could try the 
> thunderscan approach; it would trade one set of problems xxx 
> "challenges and opportunities" for another.
>