[Info-vax] Beyond Open Source

Tue May 12 09:59:06 EDT 2015

On 2015-05-12 12:24:11 +0000, Neil Rieck said:

> I was deliberately cagey in my original post for the following reason: 
> I will post first-hand information; I might post second-hand 
> information depending upon the source (many people are only able to 
> pass on crud); I almost never post third-hand information because I 
> have been burned too many times with hearsay.
> The "Linux curator" is an American company offering support to Big 
> Enterprise. When I heard that they had blamed gcc I didn't know if they 
> were talking about the product itself or how it was being used 
> (compiler directives spring to mind). But I think we all agree that if 
> something similar had happened with OpenVMS on multi-core Itanium that 
> there would be a lot more people at "VMS Engineering" (whoever they are 
> working for) who can spring into action because they wrote most of the 
> code in question (as opposed to what happens with Linux where the broth 
> may have too many chefs).

I'd infer that there is incomplete or missing information here, that 
the configuration involved might not have been qualified and was thus 
supported only on a "best effort" basis (or other such wording), or 
potentially that the support company has funded and staffed its support 
organization at what others might term minimal or caretaker levels.

As for debugging these cases, it's often somebody who has source 
access, who can figure out how the pieces fit together and how the 
problem can be triggered, and who can back-track from there.   If the 
troubleshooting has identified a fault in gcc, then either a 
replacement compiler or a reworked code-path ensues, depending on the 
circumstances.

OpenVMS has had pool corruptions (which can be really nasty to find; 
that NFS register corruptor is a marvelous example of this), and there 
have been some disk-level corruptions on occasion, and there have been 
some nasty compiler bugs (the too-many-instructions-in-the-lock-sequence 
bug is one), too.   Most of the OpenVMS systems folks and engineering 
folks have done a whole lot of troubleshooting in code outside our own 
areas of expertise, and variously in code that we'd never looked at 
before; that's part of the job.   Being responsible for dispatching 
crashes 
when CCAT / Canasta kicked out an "unknown" does get an engineer a wide 
view of the OS, but that's fodder for another discussion.   With source 
access, generating and inserting and enabling any latent diagnostic 
tools is feasible, too.  Some of the best folks for isolating crash 
dumps were at the support center, and there were (are) some folks at 
partners and at customer sites; troubleshooting expertise extends 
beyond engineering.

How this particular "American company offering support" works, I do not know.

If you're running your own or a third-party operating system, or any 
other complex task critical to your business, you're still ultimately 
responsible for keeping the servers working, even if you've outsourced 
support.

Now as for this particular case, allow me to spin a yarn, and a simple 
yarn that easily fits all the known "facts" here.  Some big company has 
had a variety of folks working on their critical application over the 
years, and has recently encountered an application hang on their 
multiprocessor Linux server.  The staff at the big company looks at 
this hang, and cannot resolve it.   The big company then calls their 
Linux support provider, believing this hang to be an OS-level error.   
The support provider organization digs through the hang and the Linux 
code involved and the application trigger, and identifies some 
incorrectly-written locking code or some other bug in the application 
code secondary to some version of gcc, and hands the issue back to the 
company to resolve.   The code may well be a half-baked locking scheme, 
or a race condition or threading error, or other vermin in the customer 
application itself.
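
For the sort of application-level locking involved in a yarn like this, 
here is a minimal sketch using POSIX threads.  The names (struct 
counter, worker, run_counter_demo, MAX_THREADS) are hypothetical and 
invented for illustration; none of this is from the application in 
question.  The increment is a read-modify-write, and without the mutex 
around it two threads can interleave and lose updates, which is the 
classic intermittent failure that only shows up on a multiprocessor.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16          /* arbitrary cap for this sketch */

struct counter {
    pthread_mutex_t lock;       /* protects value */
    long value;
};

struct worker_arg {
    struct counter *c;
    long iterations;
};

/* Each worker bumps the shared counter; the lock makes each
 * read-modify-write of value atomic with respect to other workers. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;
    for (long i = 0; i < arg->iterations; i++) {
        pthread_mutex_lock(&arg->c->lock);
        arg->c->value++;
        pthread_mutex_unlock(&arg->c->lock);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final count,
 * or -1 on a bad argument or a thread-creation failure. */
long run_counter_demo(int nthreads, long iterations)
{
    struct counter c;
    struct worker_arg arg;
    pthread_t tid[MAX_THREADS];
    int started = 0;
    long result;

    if (nthreads < 1 || nthreads > MAX_THREADS || iterations < 0)
        return -1;
    if (pthread_mutex_init(&c.lock, NULL) != 0)
        return -1;
    c.value = 0;
    arg.c = &c;
    arg.iterations = iterations;

    for (; started < nthreads; started++) {
        if (pthread_create(&tid[started], NULL, worker, &arg) != 0)
            break;
    }
    /* Join whatever was started before reading the result. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    result = (started == nthreads) ? c.value : -1;
    pthread_mutex_destroy(&c.lock);
    return result;
}
```

With the lock in place, run_counter_demo(4, 10000) returns 40000 every 
time.  Remove the lock and unlock calls and the total will usually come 
up short on a multi-core box, and by a different amount on each run.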

This same general sequence has happened with OpenVMS, too.  (It has 
even happened within OpenVMS engineering, where one engineer has handed 
off a bug to another engineer, and later had the bug report handed back 
to the originating engineer.  But that's fodder for another 
time.)   My frequent recommendations around upgrading any remaining VAX 
C code to C89 and to as much of C99 as is available, and around 
enabling and addressing C diagnostics, and on replacing the less-safe 
and crusty C calls with more modern calls, and on instrumenting your 
code, and around creating a reproducer, all arise from having found 
more than a few bugs in dodgy C code above the operating system: in my 
own and older C code, and in the C code of others.
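
As one small illustration of replacing a less-safe call, here is an 
unbounded sprintf swapped for the C99 snprintf, with the truncation 
case checked rather than ignored.  The build_path helper and its 
arguments are hypothetical, made up for this sketch, and it assumes a 
compiler and C library that provide snprintf.

```c
#include <stdio.h>

/* Hypothetical helper: formats "dir/name" into out.
 * Returns 0 on success, or -1 if the result would not fit
 * (or on an output error), leaving out as an empty string. */
int build_path(char *out, size_t outsz, const char *dir, const char *name)
{
    int n;

    if (out == NULL || outsz == 0)
        return -1;

    /* Older code often did: sprintf(out, "%s/%s", dir, name);
     * which writes past the end of out when the inputs are long. */
    n = snprintf(out, outsz, "%s/%s", dir, name);

    /* snprintf returns the length it wanted to write, so a value
     * at or beyond outsz means the output was truncated. */
    if (n < 0 || (size_t)n >= outsz) {
        out[0] = '\0';
        return -1;
    }
    return 0;
}
```

The point of returning an error on truncation is that a silently 
shortened path is its own bug waiting to happen.  Building with the 
compiler diagnostics turned up (gcc -Wall -Wextra, where gcc is the 
compiler in use) is a reasonable companion step.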

-- 
Pure Personal Opinion | HoffmanLabs LLC