[Simh] EXT :Re: DEC Alpha Emulation

Mon Feb 5 15:10:12 EST 2018

On 05-Feb-18 13:36, Clem Cole wrote:
>
>
> Point taken, but DEC used the SPD as its primary defense for exactly
> this type of problem. It was the 'legal' definition of what was and
> was not allowed.   But as you point out, that behavior does not always
> make for happy customers or sr managers.
>
I started in the field, and consulted with the corporate flying squads. 
The SPDs' value as legal definition was of more interest to lawyers &
junior product managers than to those at the sharp end of the spear. 
Happiness, even at expense above and beyond legal technicalities, brought
more business than sticking to the letter of the law.  Unhappiness was
very, very expensive.  I have stories that run both ways...
>  
>
>      The truth is that at least Tru64 (I think it was Fred Knight - Mr.
>     SCSI) had code that detected when your SCSI bus was being shared.
>     It would have been easy to add a side lookup to check the
>     controller being used and, if it was not in the official table,
>     produce a boot message saying -- "/shared bus with unsupported
>     SCSI controller, please remove sharing or replace controller and
>     reboot."/
>
>
> But I could never get marketing to accept that.
>
I wish it were that simple.  In this case, Marketing's intuition covered
some technical challenges.  I had many a talk with Fred when I was in
the Tru64 group.  That 'table' would have to deal not only with
controller types, but with compatibility of firmware versions for every
device on the bus.  And the permutations of what worked (and didn't)
weren't static.  The sys_check maintainer made some efforts, as did the
SPEAR folks in CSSE.  But everything was a moving target. 

The trivial case of "don't ever use this controller in a cluster" isn't
all that hard to blacklist.  Of course, when the foobar-plus comes out
with a different device ID, but the same bug, you have to blacklist it
too.  Before any customer finds one at "American Used Computers" (Kenmore
Square, before eBay :-)  And don't forget that to find another
controller on the bus, you have to enumerate the bus.  This can have
side-effects with "bad" controllers.  The bugs weren't all limited to
fail-over.  IIRC tagging and command queuing had issues; at least one
controller created parity errors (and some undetected ones).
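
To make the device-ID problem concrete, here is a minimal sketch of
such a blacklist.  The IDs and the "foobar" names are hypothetical,
not taken from any real controller table; the only point is that a
lookup keyed on device ID says nothing about a new ID carrying the
same bug.

```python
# Hypothetical device IDs; none of these come from a real controller table.
FOOBAR = 0x1234       # known to misbehave on a shared bus
FOOBAR_PLUS = 0x1235  # same bug, new device ID, not yet listed

BLACKLIST = {FOOBAR}


def is_blacklisted(device_id: int) -> bool:
    """Return True if the controller's device ID is on the blacklist."""
    return device_id in BLACKLIST


print(is_blacklisted(FOOBAR))       # the known-bad controller is caught
print(is_blacklisted(FOOBAR_PLUS))  # the new one slips through
```

Until someone updates the table, the foobar-plus looks like any other
unlisted (and therefore permitted) controller.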

But maintaining a useful whitelist - with all the churn in the SCSI
space - would be a nightmare.  Disks have firmware & HW revs. 
Controllers too.  Blocking all 3rd party disks (despite the frequent
firmware issues) isn't viable.  Don't forget CD/DVD, tape, and even
ethernet.  Even getting customers to install patches was hard (patch
quality and interactions were one of my issues); patching to keep up with
hardware/firmware revs wasn't going to fly.  And you need this
information before you have a file system; preferably in the boot
driver.  So no, not a config file.  Maybe SRM console environment
variables...  Even in the relatively controlled environment that DEC was
able to impose, SCSI should have been called CHAOSnet - except that name
was taken.
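
A rough sketch of why the whitelist churns, assuming entries are exact
(model, hardware rev, firmware rev) combinations.  All names and
revisions below are made up for illustration, and this only checks
each device on its own; it does not even try to model which firmware
versions get along with each other on the same bus.

```python
from typing import List, NamedTuple


class Device(NamedTuple):
    kind: str    # "controller", "disk", "tape", ...
    model: str   # hypothetical model name
    hw_rev: str
    fw_rev: str


# Hypothetical whitelist: each entry is one exact combination.
WHITELIST = {
    Device("controller", "ctrl-a", "B", "1.2"),
    Device("disk", "disk-x", "A", "3.0"),
}


def unsupported(bus: List[Device]) -> List[Device]:
    """Return the devices on the bus that are not on the whitelist."""
    return [d for d in bus if d not in WHITELIST]


bus = [
    Device("controller", "ctrl-a", "B", "1.2"),
    Device("disk", "disk-x", "A", "3.1"),  # same disk, newer firmware
]
print(unsupported(bus))
```

One firmware bump on one disk and the whole bus is "unsupported" until
the table ships again, which is exactly the patching problem above.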

Worse, once you produce one error message in a problem space (e.g.
invalid HW config), suddenly NOT producing errors for all the other cases
that don't work becomes a bug.

> My point was that if we detected it (which was not that hard),
> then we could have at least said something.   And in practice if you
> still ignored it and it was in all those system logs, it would have
> been pretty easy to say to the end customer, /we told you not to do that/.
By the time it's in a system log, it's too late.  The logging disk is
probably on the SCSI bus.

"I told you so" - not a happy strategy.

For the simple case of only two machines sharing a bus: what do you mean
by "at boot time"?  The first machine powers up, and is "alone" with a
"good" controller.  Two weeks later, the owner of the second machine
(with a "bad" one) returns from vacation and turns his on.  His dog
brought him a magazine article on clusters, so why not jump in?  It
might, maybe, manage to boot to the point of noticing the first one
without polluting its transfers.  Note that at this point, the first
machine is undoubtedly doing disk writes; packet corruption is not as
"harmless" as when you have a ROFS.  And the second machine has to touch
the first's controller to query its versions.  And to find it, it
enumerates the entire bus.  Meantime, does the first machine repeat the
boot-time check? How does it notice?
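
The timing problem can be shown with a toy model.  This is only a
sketch: the bus is a plain dictionary, the machine names and the
"good"/"bad" labels are placeholders, and it generously assumes that
looking at the bus has no side effects, which is precisely what you
can't assume with a bad controller.

```python
GOOD, BAD = "good", "bad"


def boot_time_check(bus):
    """Pass only if every controller on the bus right now is good."""
    return all(kind == GOOD for kind in bus.values())


bus = {}

# Day 0: the first machine powers up alone and checks once.
bus["first"] = GOOD
first_verdict = boot_time_check(bus)

# Two weeks later: the second machine joins with a bad controller.
bus["second"] = BAD
second_verdict = boot_time_check(bus)

print(first_verdict)         # passed at boot, never re-evaluated
print(second_verdict)        # fails, if it gets far enough to check
print(boot_time_check(bus))  # what the first machine would see on a recheck
```

The first machine's answer was correct when it was computed and is
wrong two weeks later; nothing in a boot-time check tells it so.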

As I said, when something's wrong, logging to disk with an invalid
hardware configuration isn't going to fly.  Above the hardware level,
you're not in the cluster (yet), so how are you going to get the disk
bitmaps (and locks)?  And write to a ROFS?  Normally, these are queued
in memory (and retrieved for syslog by dmesg).  But with this
misconfiguration, the last thing you want to do is join the cluster &
remount the logging disk R/W.  So you can't log to disk.  You might want
to try to send to a network syslog - but that means you've gotten a LOT
further into kernel initialization; you have a file system, network
configuration, know where to send it, etc.  Besides the fact that your
network chip may be on the same SCSI bus, you've done a whole lot more
I/O to get this far.  With this kind of error, you want to make the test
and panic very, very early in initialization to minimize collateral
damage. 
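
As a sketch of the ordering argument: the stage names below are
invented, not the real Tru64 initialization sequence, and the
exception merely stands in for a panic.  The check runs at driver
initialization, reports to the console only, and stops before
anything that needs a file system, a network, or the cluster.

```python
# Hypothetical initialization stages, in order.
STAGES = ("driver_init", "mount_root", "network", "join_cluster")


class ConfigPanic(Exception):
    """Stands in for a kernel panic; records the stages already completed."""

    def __init__(self, message, completed):
        super().__init__(message)
        self.completed = completed


def boot(bus_ok, console):
    """Run the stages in order; panic at driver_init if the bus is bad."""
    completed = []
    for stage in STAGES:
        if stage == "driver_init" and not bus_ok:
            msg = "unsupported shared SCSI bus configuration"
            console.append(msg)  # console only: no disk, no network yet
            raise ConfigPanic(msg, completed)
        completed.append(stage)
    return completed


console = []
try:
    boot(False, console)
except ConfigPanic as panic:
    print(console, panic.completed)

print(boot(True, []))
```

With a bad bus, no stage completes and the only record is the console
line; every stage you let run before the check is more I/O on a bus
you don't trust.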

There are many more cases to cover.  This is one of the simpler.

It's really not that simple to verify hardware configurations, once you
dig into the problem space.  Fred's test was undoubtedly useful for
logging & cluster initialization - with supported controllers.  It might
have been a good reminder for engineering experiments.  I'd need to be
convinced that it could solve the issue that you wanted to address. 
"For every problem, there is a solution that is simple, obvious, and ...
wrong".

You're correct that some simple check at driver initialization that
stuck with console logging could probably be 80-90% effective.  But
getting the rest right, while an interesting engineering project, would
be a P.roject.  Sunshine with a slight chance of data corruption just
wasn't the DEC way :-)

As I said, a lot of fun for the engineers, but hard to justify in order
to save a few customers $100.
