[Info-vax] openvms and xterm
Dan Cross
cross at spitfire.i.gajendra.net
Wed May 15 13:52:32 EDT 2024
In article <v20v7d$pvb$1 at panix2.panix.com>,
Scott Dorsey <kludge at panix.com> wrote:
>Dan Cross <cross at spitfire.i.gajendra.net> wrote:
>>Kludge writes:
>>>People today just seem to think this is the normal way of doing business.
>>>Surely we can do better.
>>
>>Because it's the law of large numbers, not any particular fault
>>that can be reasonably addressed by an individual, organization,
>>etc.
>
>Yes, this is why the key is to simplify things. I want to read my email,
>which requires I log in from my desktop computer through a VPN that requires
>an external server for downloading some scripts to my machine. Once I am
>on the VPN, I can log into the mail server which is dependent on an active
>directory server for authentication and three disk servers to host the mail.
>Since it's running a Microsoft product that isn't safe to connect to the
>outside world there is also an additional isolation server between the mail
>server and the outside world. All of these things need to be working for me
>to be able to read my mail.
>
>Given the complexity of this system, it's not surprising that sometimes
>it isn't working. This seems ludicrous to me.

But this conflates two things: scale and complexity. They are
not the same, but this treats them as if they were.
The fact is that services have to scale to meet demand, which is
tied to users, who are often tied to revenue; thus, use is a
desirable condition from a business standpoint. And while it is
true that there is some inherent complexity in a highly-scalable
service, it does not follow that the sort of Rube Goldberg
machine you just described for reading your email is the
necessary outcome.

Consider, instead, Google search. If you go to `google.com`,
you actually get a pretty simple interface: right now, for me,
it's just the Google logo, a text box, and a few buttons. I
enter my search term into the text box and click "Google Search"
and off I go. But! When I click that button, I am one of many
millions of users clicking it in that same second; in order to
serve all of them simultaneously, there is an enormous pile of
resources sitting
behind that simple web page that lets it scale. And when I
say enormous, I mean enormous: O(10^6) individual servers with
O(10^7) CPUs and many petabytes of RAM total, exabytes of
stable storage, and terabits of network bandwidth all
connecting them, in a constellation of globally distributed
data centers often built near redundant, high-capacity
electricity sources (near a hydroelectric dam, say).

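To give a flavor of what "behind that simple web page" means,
here is a toy fan-out sketch in Python. The shard hostnames and
the query_shard helper are invented for illustration, and a real
pipeline ranks and merges far more carefully; the point is only
that one "simple" query is split across many backends and the
partial answers merged before anything reaches the browser.

    # Toy sketch, not Google's pipeline: fan a query out to many
    # index shards in parallel, then merge the partial results.
    # Hostnames and the RPC stand-in are made up.
    from concurrent.futures import ThreadPoolExecutor

    INDEX_SHARDS = [f"index-shard-{n:03d}.example.internal" for n in range(100)]

    def query_shard(shard, query):
        """Stand-in for the RPC that asks one index shard for its best hits."""
        return [f"{shard}: hit for {query!r}"]      # fake result for the sketch

    def search(query, top_n=10):
        """Fan the query out to every shard in parallel, then merge the answers."""
        with ThreadPoolExecutor(max_workers=32) as pool:
            partials = list(pool.map(lambda s: query_shard(s, query), INDEX_SHARDS))
        merged = [hit for part in partials for hit in part]
        return merged[:top_n]       # a real merge ranks and deduplicates; this just truncates

    print(search("openvms xterm"))
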
But when you're working at that scale, things simply break, and
it's not because _you_ did anything wrong; it's just the nature
of systems that have been scaled to that size. Therefore, you
have two choices: architect your system to be resilient to the
breakage, or don't operate at that scale. Period. Those are
your only choices. "Make your system simpler" doesn't work;
even if it were dead stupid, things would _still_ break.
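
To put a little flesh on "architect your system to be resilient
to the breakage": the usual move is to assume that any single
replica can and will fail, and to make that failure boring. A
minimal sketch, with made-up hostnames, a made-up fetch() helper,
and an arbitrary 30% failure rate:

    # Minimal sketch, assuming three interchangeable replicas of
    # some service; everything here is illustrative.
    import random

    REPLICAS = ["svc-1.example.internal",
                "svc-2.example.internal",
                "svc-3.example.internal"]

    def fetch(host):
        """Stand-in for the real RPC; each replica fails 30% of the time."""
        if random.random() < 0.3:
            raise ConnectionError(f"{host} is down")
        return f"reply from {host}"

    def fetch_with_failover():
        """Try replicas in random order; one failure is routine, not fatal."""
        last_error = None
        for host in random.sample(REPLICAS, len(REPLICAS)):
            try:
                return fetch(host)
            except ConnectionError as err:
                last_error = err        # expected; note it and move on
        raise RuntimeError("all replicas failed") from last_error

    print(fetch_with_failover())

The particular policy doesn't matter; what matters is that the
failure path is part of the design rather than an afterthought.
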
>>If a marginal solder joint mechanically weakened by a bumpy ride
>>in a truck causes something to short, and that current draw on a
>>bus spikes over some threshold, pulling a voltage regulator out
>>of spec and causing voltage to sag by some nominal amount that
>>pulls another component on a different server below its marginal
>>threshold for a logic value and a bit flips, what are you, the
>>software engineer, supposed to do to tolerate that, let alone
>>recover from it? It's not a software bug, it's the confluence
>>of a large number of factors that only emerge when you run at a
>>scale with tens or hundreds of thousands of systems.
>
>Yes, precisely. I don't want to be dependent on systems running at
>that scale.

If you use the Internet, at all, you already are.

>>Can we do better? Maybe. There were some lessons learned in
>>that failure; in part, making sure that the battery room doesn't
>>flood if the generator catches on fire (another part of the
>>story). But the reliability of hyperscalar operations is
>>already ridiculously high. They do it by using redundancy and
>>designing in an _expectation_ of failure: multiple layers of
>>redundant load balancers, sharding traffic across multiple
>>backends, redundant storage in multiple geolocations, etc. But
>>a single computer failing and rebooting? That's expected. The
>>enterprise is, of course, much further behind, but I'd argue on
>>balance even they do all right, all things considered.
>
>Redundancy helps a lot, but part of the key is to look at the
>opposite, at the number of single points of failure.

This is overly simplistic, but let's take it on its face: I look
at the number of single points of failure, and then...do what?
Obviously, make them not be single points of failure...but what
does _that_ mean? Make them redundant? Rearchitect the system
to eliminate them entirely? And then what? What if I still
need to scale out resources to meet sheer demand?
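
For what it's worth, the arithmetic behind "make them redundant"
is easy to sketch; the 99.9% figure below is purely illustrative,
and the big caveat is that real failures are rarely independent:

    # Back-of-the-envelope: availability of n replicas of a component
    # that is independently up with probability p.
    def combined_availability(p, n):
        """Probability that at least one of n independent replicas is up."""
        return 1 - (1 - p) ** n

    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.999, n):.9f}")
    # prints roughly:
    #   1 0.999000000
    #   2 0.999999000
    #   3 0.999999999

And even then, redundancy buys availability, not capacity; it
does nothing for the sheer-demand side of the question.
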
- Dan C.