[Info-vax] Beyond Open Source
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Tue May 12 09:59:06 EDT 2015
On 2015-05-12 12:24:11 +0000, Neil Rieck said:
> I was deliberately cagey in my original post for the following reason:
> I will post first-hand information; I might post second-hand
> information depending upon the source (many people are only able to
> pass on crud); I almost never post third-hand information because I
> have been burned too many times with hearsay.
> The "Linux curator" is an American company offering support to Big
> Enterprise. When I heard that they had blamed gcc I didn't know if they
> were talking about the product itself or how it was being used
> (compiler directives spring to mind). But I think we all agree that if
> something similar had happened with OpenVMS on multi-core Itanium that
> there would be a lot more people at "VMS Engineering" (whoever they are
> working for) who can spring into action because they wrote most of the
> code in question (as opposed to what happens with Linux where the broth
> may have too many chefs).
I'd infer that there is incomplete or missing information here, that
the configuration involved might not have been qualified and thus "best
effort" or other such wording, or potentially that the support company
has funded and optimized their support staff to what others might term
minimal or caretaker levels.
As for debugging these cases, it's often somebody that has source
access and that can figure out how the pieces fit together and how the
problem can be triggered, and that can back-track. If the
troubleshooting has identified a fault in gcc, then either a
replacement compiler or a reworked code-path ensues, depending on the
circumstances.
OpenVMS has had pool corruptions (which can be really nasty to find;
that NFS register corruptor is a marvelous example of this), and there
have been some disk-level corruptions on occasion, and there have been
some nasty compiler bugs (the
too-many-instructions-in-the-lock-sequence bug is one), too. Most of the
OpenVMS systems folks and engineering folks have done a whole lot of
troubleshooting in code that was outside their own area of expertise,
and variously in code that we'd never looked at before; that's part
of the job. Being responsible for dispatching crashes
when CCAT / Canasta kicked out an "unknown" does get an engineer a wide
view of the OS, but that's fodder for another discussion. With source
access, generating and inserting and enabling any latent diagnostic
tools is feasible, too. Some of the best folks for isolating crash
dumps were at the support center, and there were (are) some folks at
partners and at customer sites; troubleshooting expertise extends
beyond engineering.
How this particular "American company offering support" works, I do not know.
For those folks running their own or a third-party operating system or
any other complex task critical for your business, you're still
ultimately responsible for keeping the servers working, even if you've
outsourced support.
Now as for this particular case, allow me to spin a yarn, and a simple
yarn that easily fits all the known "facts" here. Some big company has
had a variety of folks working on their critical application over the
years, and has recently encountered an application hang on their
multiprocessor Linux server. The staff at the big company looks at
this hang, and cannot resolve it. The big company then calls their
Linux support provider, believing this hang to be an OS-level error.
The support provider organization digs through the hang and the Linux
code involved and the application trigger, and identifies some
incorrectly-written locking code or some other bug in the application
code secondary to some version of gcc, and hands the issue back to the
company to resolve. The code may well be a half-baked locking scheme,
or a race condition or threading error, or other vermin in the customer
application itself.
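
To make that yarn a little more concrete, here is a minimal sketch of
the sort of application-level locking involved. It is purely
hypothetical: the names (struct counter, worker, run_workers) and the
16-thread cap are invented for illustration, nothing here comes from
the actual case, and POSIX threads are assumed. The point is only that
the read-modify-write of shared data has to sit inside the lock. Drop
the lock/unlock pair and the code may run for years on a uniprocessor,
then lose updates on a multiprocessor.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared state; all names here are illustrative only. */
struct counter {
    pthread_mutex_t lock;
    long value;
    long iterations;
};

/* Each worker increments the shared counter. The increment is a
 * read-modify-write, so it must happen while holding the lock;
 * without the lock/unlock pair this is a data race. */
static void *worker(void *arg)
{
    struct counter *c = arg;

    assert(c != NULL);                  /* cheap instrumentation */
    for (long i = 0; i < c->iterations; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Run nthreads workers (1..16) and return the final count,
 * or -1 on bad arguments or if a thread could not be started. */
long run_workers(int nthreads, long iterations)
{
    pthread_t tid[16];
    struct counter c;
    int started = 0;
    long result;

    if (nthreads < 1 || nthreads > 16 || iterations < 0)
        return -1;
    if (pthread_mutex_init(&c.lock, NULL) != 0)
        return -1;
    c.value = 0;
    c.iterations = iterations;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &c) != 0)
            break;
        started++;
    }
    /* Join only the threads that were actually created. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    result = (started == nthreads) ? c.value : -1;
    pthread_mutex_destroy(&c.lock);
    return result;
}
```

With the lock in place, run_workers(4, 100000) comes back with 400000
every time. A harness like this is also the start of a reproducer:
something small that a support provider or a compiler maintainer can
run without the rest of the application.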
This same general sequence has happened with OpenVMS, too. (It has
even happened within OpenVMS engineering, where one engineer has handed
off a bug to another engineer, and later had the bug report handed
back to the originating engineer. But that's fodder for another
time.) My frequent recommendations around upgrading any remaining VAX
C code to C89 and to as much of C99 as is available, and around
enabling and addressing C diagnostics, and on replacing the less-safe
and crusty C calls with more modern calls, and on instrumenting your
code, and around creating a reproducer, all arise from having found
more than a few bugs in dodgy C code above the operating system: in my
own and older C code, and in the C code of others.
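
As a small sketch of what replacing a less-safe call can look like,
assuming a C99 snprintf() is available in the C library in use (worth
checking on older systems): the helper name format_label and its
"node:unit" output format are hypothetical, invented for this example.
The unbounded sprintf() is swapped for a bounded call whose return
value is checked, and an assert() serves as cheap instrumentation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "node:unit" into dst.
 * Returns 0 on success, -1 if the result would not fit. */
int format_label(char *dst, size_t dstsize, const char *node, int unit)
{
    int n;

    assert(dst != NULL && node != NULL);    /* cheap instrumentation */
    if (dstsize == 0)
        return -1;

    /* Older code often did: sprintf(dst, "%s:%d", node, unit);
     * which writes past the end of dst when node is longer than
     * the original author expected. */
    n = snprintf(dst, dstsize, "%s:%d", node, unit);
    if (n < 0 || (size_t)n >= dstsize) {
        dst[0] = '\0';      /* do not leave a truncated string behind */
        return -1;
    }
    return 0;
}
```

With gcc, building with something like -std=c99 -Wall -Wextra and then
actually addressing what the compiler reports is the "enabling and
addressing C diagnostics" part. Other compilers have their own
equivalents.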
--
Pure Personal Opinion | HoffmanLabs LLC