[Info-vax] openvms and xterm
Scott Dorsey
kludge at panix.com
Tue May 14 20:22:05 EDT 2024
Dan Cross <cross at spitfire.i.gajendra.net> wrote:
>Kludge writes:
>>People today just seem to think this is the normal way of doing business.
>>Surely we can do better.
>
>Because it's the law of large numbers, not any particular fault
>that can be reasonably addressed by an individual, organization,
>etc.
Yes, and this is why the key is to simplify things. I want to read my email,
which requires that I log in from my desktop computer through a VPN, which in
turn requires an external server to download some scripts to my machine. Once
I am on the VPN, I can log into the mail server, which depends on an Active
Directory server for authentication and three disk servers to host the mail.
Since it's running a Microsoft product that isn't safe to connect to the
outside world, there is also an additional isolation server between the mail
server and the outside world. All of these things need to be working for me
to be able to read my mail.
Given the complexity of this system, it's not surprising that it sometimes
isn't working. This seems ludicrous to me.
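To put a rough number on that, here is a minimal sketch of the arithmetic of
serial dependencies. The component names and availability figures are
illustrative guesses, not measurements of the actual setup:

components = {                      # every one of these must be up
    "desktop":          0.999,
    "vpn":              0.999,
    "script_server":    0.995,
    "active_directory": 0.999,
    "disk_server_1":    0.999,
    "disk_server_2":    0.999,
    "disk_server_3":    0.999,
    "mail_server":      0.995,
    "isolation_server": 0.995,
}

# In a serial chain, availabilities multiply: mail is readable only if
# every dependency is up at the same time.
chain = 1.0
for availability in components.values():
    chain *= availability

print(f"chain availability: {chain:.3f}")                      # ~0.979
print(f"expected downtime:  {(1 - chain) * 8760:.0f} h/year")  # ~180 hours

Even with every box at "three nines" or better, the chain as a whole is
noticeably worse than any single piece of it.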
>If a marginal solder joint mechanically weakened by a bumpy ride
>in a truck causes something to short, and that current draw on a
>bus spikes over some threshold, pulling a voltage regulator out
>of spec and causing voltage to sag by some nominal amount that
>pulls another component on a different server below its marginal
>threshold for a logic value and a bit flips, what are you, the
>software engineer, supposed to do to tolerate that, let alone
>recover from it? It's not a software bug, it's the confluence
>of a large number of factors that only emerge when you run at a
>scale with tens or hundreds of thousands of systems.
Yes, precisely. I don't want to be dependent on systems running at
that scale.
>Can we do better? Maybe. There were some lessons learned in
>that failure; in part, making sure that the battery room doesn't
>flood if the generator catches on fire (another part of the
>story). But the reliability of hyperscaler operations is
>already ridiculously high. They do it by using redundancy and
>designing in an _expectation_ of failure: multiple layers of
>redundant load balancers, sharding traffic across multiple
>backends, redundant storage in multiple geolocations, etc. But
>a single computer failing and rebooting? That's expected. The
>enterprise is, of course, much further behind, but I'd argue on
>balance even they do all right, all things considered.
Redundancy helps a lot, but part of the key is to look at the opposite
measure: the number of single points of failure.
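The two effects pull in opposite directions, and a small sketch makes the
contrast concrete. The function name and the figures here are assumptions for
illustration, not data from any real deployment:

# N independent replicas of a component with availability a:
# the service is down only when all N replicas are down at once.
def redundant_availability(a: float, n: int) -> float:
    return 1.0 - (1.0 - a) ** n

print(redundant_availability(0.99, 1))   # 0.99     -> ~3.7 days down per year
print(redundant_availability(0.99, 2))   # 0.9999   -> ~53 minutes per year
print(redundant_availability(0.99, 3))   # 0.999999 -> ~32 seconds per year

Redundancy multiplies the unavailabilities together, so each replica buys a
lot; a serial single point of failure does the reverse, multiplying the
availabilities down. Counting and removing single points of failure matters
at least as much as stacking redundancy behind them.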
--scott
--
"C'est un Nagra. C'est suisse, et tres, tres precis."