[Info-vax] openvms and xterm
Dan Cross
cross at spitfire.i.gajendra.net
Wed Apr 24 15:06:52 EDT 2024
In article <v0bhum$laq$1 at panix2.panix.com>,
Scott Dorsey <kludge at panix.com> wrote:
>Dan Cross <cross at spitfire.i.gajendra.net> wrote:
>>The thing is, when you're working at scale, managing services
>>across tens of thousands of machines, you quickly discover that
>>shit happens. Things sometimes crash randomly; often this is
>>due to a bug, but sometimes it's just because the OOM killer got
>>greedy due to the delayed effects of a poor scheduling decision,
>>or there was a dip on one of the voltage rails and a DIMM lost a
>>bit, or a job landed on a machine that's got some latent
>>hardware fault and it just happened to wiggle things in just the
>>right way so that a 1 turned into a 0 (or vice versa), or any
>>number of other things that may or may not have anything to do
>>with the service itself.
>
>Oh, I understand this completely. I have stood in the middle of a large
>colocation facility and listened to Windows reboot sounds every second or
>two coming from different places in the room each time.
>
>What I don't necessarily understand is why people consider this acceptable.
>People today just seem to think this is the normal way of doing business.
>Surely we can do better.
Because it's the law of large numbers, not any particular fault
that an individual, an organization, or anyone else can
reasonably address.
If a marginal solder joint, mechanically weakened by a bumpy ride
in a truck, causes something to short, and that current draw on a
bus spikes over some threshold, pulling a voltage regulator out
of spec and causing the voltage to sag by some nominal amount that
pulls another component on a different server below its marginal
threshold for a logic value and a bit flips, what are you, the
software engineer, supposed to do to tolerate that, let alone
recover from it? It's not a software bug; it's the confluence
of a large number of factors that only emerge when you run at
the scale of tens or hundreds of thousands of systems.
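To put rough numbers on the law-of-large-numbers point, here's a
back-of-the-envelope sketch in Python; the fleet size and fault
rate are illustrative assumptions, not figures from any real
operation:

    # Back-of-the-envelope sketch: a "one in a million" fault per
    # machine-hour across a large fleet.  Both numbers are assumptions.
    fleet_size = 100_000        # tens-to-hundreds of thousands of systems
    p_fault_per_hour = 1e-6     # one-in-a-million chance per machine-hour

    # Probability that at least one machine somewhere hits the fault
    # in any given hour:
    p_at_least_one = 1 - (1 - p_fault_per_hour) ** fleet_size
    print(f"P(>=1 fault in an hour) = {p_at_least_one:.3f}")   # ~0.095

    # Expected number of such faults across the fleet per day:
    print(f"expected faults/day = {fleet_size * p_fault_per_hour * 24:.1f}")  # 2.4

A failure mode you would never see on a single box shows up a
couple of times a day once the fleet is big enough.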
Google had a datacenter outage once, due to a diesel generator
catching on fire; there was a small leak, and the ambient
temperature in the room with the generator exceeded the fuel's
flashpoint. "How does that happen?" you ask, since the
autoignition temperature of diesel fuel is actually pretty high.
It turns out that figure applies to liquid fuel, and doesn't
hold once the fuel is aerosolized; the particular failure mode
here was a hairline fracture in the generator's fuel manifold.
Diesel forced through it was aerosolized, and the ambient
temperature hit the fuel's flashpoint. Whoops. Cummins had never
seen that failure mode before; it was literally one in a million.
The solution, incidentally, was to wrap a flash-resistant cloth
around the manifold, basically putting a diaper on it, to keep
any escaping fuel in liquid form.
Can we do better? Maybe. There were some lessons learned from
that failure; in part, making sure that the battery room doesn't
flood if the generator catches on fire (another part of the
story). But the reliability of hyperscaler operations is
already ridiculously high. They achieve it by using redundancy
and designing in an _expectation_ of failure: multiple layers of
redundant load balancers, traffic sharded across multiple
backends, redundant storage in multiple geographic locations,
and so on. But a single computer failing and rebooting? That's
expected. The enterprise is, of course, much further behind, but
I'd argue that on balance even they do all right, all things
considered.
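As a rough illustration of why that works (the availability
figures here are assumptions, not anything measured): replicate
a service and shard traffic across the replicas, and the odds of
every replica being down at once shrink geometrically.

    # Rough sketch with assumed numbers: availability of a service that
    # stays up as long as at least one replica is up and traffic can be
    # shifted to it.
    single_machine = 0.999      # assumed: ~8.8 hours of downtime a year

    def replicated(a, replicas):
        """1 - P(every replica is down at the same time)."""
        return 1 - (1 - a) ** replicas

    for n in (1, 2, 3):
        print(f"{n} replica(s): {replicated(single_machine, n):.9f}")
    # 1 replica : 0.999
    # 2 replicas: 0.999999
    # 3 replicas: 0.999999999

The individual machine stays as unreliable as ever; the design
simply stops caring.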
- Dan C.