<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">I hear you. My point was it only failed the corner cases of failover, so I think it would have made it to logs. And that should have been good enough. Not perfect, but in practice it would have worked and it would have simplified things immensely.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">And not having the Adaptec support was in fact a real problem: it added cost and really did not add value. As my last act, I was trying to build the $1K Alpha at the time (which I did prototype until Jessie killed it, but that's another story). Folks said the cheapest Alpha was $5K -- well there was a reason. When we took a $799 [end user Radio Shack priced] Compaq K7 based system and spliced a $200 EV6 into it and got Tru64 working (with an Adaptec chipset on the motherboard BTW), it worked and people wanted it!!! [I still have the motherboard at home and the EV6 from it is on my desk at Intel].</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Look, I tend to be a practical engineer. I always felt that DEC's building things 100% foolproof, where everything had to be perfect, was really what killed the golden goose - not Palmer et al. Not being able to understand what needed to be "you bet the company/farm/your life" and what could be "good enough for now and move to the next problem". I always thought that was one of the things Roger Gourd taught me -- how to differentiate between the two. I think DEC had that when the PDP-8 and PDP-11 were done and in the early development of the Vax (hey I programmed Vax serial #1 -- VMS V1.0 was buggy as could be). 
But with Vax becoming supreme, DEC lost its way/believed its own hype.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Alpha and Tru64 are great examples of the problem. BTW: I loved Alpha, bled for it, etc. Tru64 was the best UNIX implementation I have ever used, and I am proud to be a developer of TruClusters. But it took 3 extra years to get Tru64 out the door because it had to be perfect (and nobody got fired for it either).</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">I never understood that. Every subsystem that needed to be rewritten (TTY handler, memory, bulk I/O) did need to be done over from the original code from OSF for the 386 and PMAX. But I always felt DEC could have shipped OSF/1 on the Alpha pretty much as is, and started to get revenue and move the installed base. Then, subsystem by subsystem, replaced them with something better. I also argued with Supnik BTW (whom I adore and think the world of): the lack of 32 bits certainly made the engineering support easier, but again it cost us. We also basically paid the ISVs to fix their code so it would run on 64 bit SPARC (Judy Ward's errors are still the best I know for cleaning up 32 bit-isms. If I have old code I'm about to port, I often run it through my Alpha with Judy's compiler to tell me what is going to be troublesome).</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Yup 32 bit support would have been messy and we would have had to have 4 versions of the libraries just like MIPS, SPARC et al. It would have been a little ugly and not 'perfect' ... but it would have worked and been faster to market. And ISVs would have started to get some revenue. 
By the time we were 'done' - it was too little too late and folks had already started to look for an alternative - and guess what, Winders on a 386 was 'good enough.'</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">As I said to Jessie et al, SW is not written on $1M computers - it's written on the cheapest thing that gets the job done. Then moved upstream if it is valuable. </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">By the time the Sr Managers took the 'cut a deal with Microsoft and get their SW' strategy, the death spiral was well underway. And they misunderstood: they were never going to get 'big bucks' for 'commodity sw.' [Intel suffered that also with Itanium].</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">BTW: I'm watching Intel make more of the same mistakes.... sigh. As I say to folks here, I have those tee shirts, I know how this movie ends.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Clem</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">BTW: the best people in Intel IT are the Macs@Intel folks. They listen and understand I can help them. So they try to help me. It's a good arrangement. When they send me a note, I do try to help. I'm one of the few folks in engineering that has 'update access' to their tools library (biggest issue is it's all written in VB -- seriously for Macs -- don't ask). 
But when I find something, I do try to help. In return, I ran into an issue last Tues with my keyboard and called them that PM. I missed the Fedex time for Folsom, but on Thursday a new Mac was on my doorstep. Only reason I have not switched is because I had to travel Thursday for work - so I took them both. Got my job done and will switch systems completely tonight (I hope). </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Anyway -- back to work.... </div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Feb 5, 2018 at 3:10 PM, Timothe Litt <span dir="ltr"><<a href="mailto:litt@ieee.org" target="_blank">litt@ieee.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"><span>
On 05-Feb-18 13:36, Clem Cole wrote:<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote"><font color="#0000ff"><font face="arial, helvetica, sans-serif">Point taken, but
DEC used the SPD as its primary defense for exactly this
type of problem. It was the 'legal' definition of what
was and was not allowed. But as you point out,
that behavior does not always make for happy customers
or sr managers.</font></font>
<div><br>
</div>
</div>
</div>
</div>
</blockquote></span>
I started in the field, and consulted with the corporate flying
squads. The SPDs' value as legal definition was of more interest to
lawyers & junior product managers than to those at the sharp end
of the spear. Happiness, even at expense above and beyond legal
technicalities, brought more business than sticking to the letter of
the law. Unhappiness was very, very expensive. I have stories that
run both ways...<span><br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
<div class="gmail_default"> <font color="#0000ff"><font face="arial, helvetica, sans-serif">The truth is in
at least Tru64 (I think it was Fred Knight - Mr.
SCSI) had code that detected when your SCSI bus was
being shared. It would have been easy to add a
side lookup to check the controller being used and, if
it was not in the official table, produce a boot
message saying -- "<i>shared bus with unsupported
SCSI controller, please remove sharing or replace
controller and reboot."</i></font></font></div>
</blockquote>
<div><br>
</div>
<div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><font color="#0000ff">But I could never get marketing to
accept that.</font></div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><font color="#0000ff"><br>
</font></div>
</div>
</div>
</div>
</div>
</blockquote></span>
I wish it were that simple. In this case, Marketing's intuition
covered some technical challenges. I had many a talk with Fred when
I was in the Tru64 group. That 'table' would have to deal not only
with controller types, but with compatibility of firmware versions
for every device on the bus. And the permutations of what worked
(and didn't) weren't static. The sys_check maintainer made some
efforts, as did the SPEAR folks in CSSE. But everything was a
moving target. <br>
<br>
The trivial case of "don't ever use this controller in a cluster"
isn't all that hard to blacklist. Of course, when the foobar-plus
comes out with a different device ID, but the same bug, you have to
blacklist it too. Before any customer finds one at "American Used
Computers" (Kenmore Square, before e-bay:-) And don't forget that
to find another controller on the bus, you have to enumerate the
bus. This can have side-effects with "bad" controllers. The bugs
weren't all limited to fail-over. IIRC tagging and command queuing
had issues; at least one controller created parity errors (and some
undetected ones).<br>
<br>
But maintaining a useful whitelist - with all the churn in the SCSI
space - would be a nightmare. Disks have firmware & HW revs.
Controllers too. Blocking all 3rd party disks (despite the frequent
firmware issues) isn't viable. Don't forget CD/DVD, tape, and even
ethernet. Even getting customers to install patches was hard (patch
quality and interactions was one of my issues); patching to keep up
with hardware/firmware revs wasn't going to fly. And you need this
information before you have a file system; preferably in the boot
driver. So no, not a config file. Maybe SRM console environment
variables... Even in the relatively controlled environment that DEC
was able to impose, SCSI should have been called CHAOSnet - except
that name was taken.<br>
<br>
Worse, once you produce one error message in a problem space (e.g.
invalid HW config), suddenly NOT producing errors for all other
cases that don't work becomes a bug.<span><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><font color="#0000ff">My point was that if we detected it
(which was not that hard), then we could have at
least said something. And in practice if you still
ignored it and it was in all those system logs, it
would have been pretty easy to say to the end
customer, <i>we told you not to do that</i>.</font></div>
</div>
</div>
</div>
</div>
</blockquote></span>
By the time it's in a system log, it's too late. The logging disk
is probably on the SCSI bus.<br>
<br>
"I told you so" - not a happy strategy.<br>
<br>
For the simple case of only two machines sharing a bus: what do you
mean by "at boot time"? The first machine powers up, and is "alone"
with a "good" controller. Two weeks later, the owner of the second
machine (with a "bad" one) returns from vacation and turns his on.
His dog brought him a magazine article on clusters, so why not jump
in? It might, maybe, manage to boot to the point of noticing the
first one without polluting its transfers. Note that at this point,
the first machine is undoubtedly doing disk writes; packet
corruption is not as "harmless" as when you have a ROFS. And the
second machine has to touch the first's controller to query its
versions. And to find it, it enumerates the entire bus. Meantime,
does the first machine repeat the boot-time check? How does it
notice?<br>
<br>
As I said, when something's wrong, logging to disk with an invalid
hardware configuration isn't going to fly. Above the hardware level,
you're not in the cluster (yet), so how are you going to get the
disk bitmaps (and locks)? And write to a ROFS? Normally, these are
queued in memory (and retrieved for syslog by dmesg). But with this
misconfiguration, the last thing you want to do is join the cluster
& remount the logging disk R/W. So you can't log to disk. You
might want to try to send to a network syslog - but that means
you've gotten a LOT further into kernel initialization; you have a
file system, network configuration, know where to send it, etc.
Besides the fact that your network chip may be on the same SCSI bus,
you've done a whole lot more I/O to get this far. With this kind of
error, you want to make the test and panic very, very early in
initialization to minimize collateral damage. <br>
<br>
There are many more cases to cover. This is one of the simpler.<br>
<br>
It's really not that simple to verify hardware configurations, once
you dig in to the problem space. Fred's test was undoubtedly useful
for logging & cluster initialization - with supported
controllers. It might have been a good reminder for engineering
experiments. I'd need to be convinced that it could solve the issue
that you wanted to address. "For every problem, there is a solution
that is simple, obvious, and ... wrong".<br>
<br>
You're correct that some simple check at driver initialization that
stuck with console logging could probably be 80-90% effective. But
getting the rest right, while an interesting engineering project,
would be a <font size="+2">P</font>roject. Sunshine with a slight
chance of data corruption just wasn't the DEC way :-)<br>
<br>
As I said, a lot of fun for the engineers, but hard to justify in
order to save a few customers $100.<br>
<br>
</div>
</blockquote></div><br></div></div>