[Info-vax] Alphaserver ES47: Suspected broken CPU, unable to stop/cpu 2

Wed Aug 8 10:06:58 EDT 2018

On Wednesday, 8 August 2018 15:43:45 UTC+2, Stephen Hoffman  wrote:
> On 2018-08-08 06:34:54 +0000, Robin Schrievers said:
> 
> > One of our AlphaServer ES47s has been crashing spontaneously for a 
> > few days. As soon as we start the queues and put the box to work, it 
> > crashes within the hour with a machine check.
> > ...
> > Any suggestions would be greatly appreciated
> 
> A remote system with no local hardware support, a very down-revision 
> and unsupported version of OpenVMS Alpha, and no spare server system 
> available for a wholesale failover?   That's.... auspicious.  Might 
> want to acquire an ES47 and air-freight that to the location, swap the 
> storage and get that going.
> 

We are not completely dead right now: we have a cluster of 4 and are currently running with 3 nodes instead of 4. This delays processing somewhat, but we can manage for now.
As for hardware support: we have no dedicated AlphaServer hardware support, but we do have general hardware support. This is pretty specific stuff, though, so that's the challenging part.
We are looking into where to get a spare box.

> Could well be memory, processor, interconnect or who-knows-what.  I've 
> had cables fail on Marvel-class boxes, for instance.  Without access to 
> the hardware diagnostics and particularly the error log entries and 
> based solely on the not-always-helpful OpenVMS footprint, I'd guess bad 
> CPU or maybe bad memory.  (The OpenVMS ELV tool and the Marvel 
> diagnostics documentation and the Marvel server gremlins don't always 
> agree on what's actually happening, either.)
> 
> Things to try?
> Disable everything except CPU 0 and try again.  MBM> SET CPU_ENABLED 00000001
> 
> Disable the possibly-failing Duo, as the failure of one CPU can cause 
> problems for the other.  MBM> SET CPU_ENABLED FFFFFCFF
> 
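
Read as plain 32-bit hexadecimal bitmasks, the two CPU_ENABLED values above differ only in which bits are set. The sketch below just does that arithmetic. It assumes one bit per CPU with bit 0 as the lowest-numbered CPU; how the MBM actually maps bit positions onto physical CPUs and Duos is not stated in this thread and would need checking against the CLI reference. The helper names are made up for illustration.

```python
# Sketch only: treats a CPU_ENABLED value as a plain 32-bit bitmask.
# ASSUMPTION: one bit per CPU, bit 0 = lowest-numbered CPU. The real
# bit-to-CPU mapping on an ES47 must be confirmed in the MBM CLI manual.

MASK_WIDTH = 32  # the values quoted above are 8 hex digits


def set_bits(mask_hex: str) -> list[int]:
    """Return the bit positions that are set (enabled) in the mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(MASK_WIDTH) if (mask >> bit) & 1]


def clear_bits(mask_hex: str) -> list[int]:
    """Return the bit positions that are clear (disabled) in the mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(MASK_WIDTH) if not (mask >> bit) & 1]


def mask_without(bits: list[int]) -> str:
    """Build an all-ones mask with the given bit positions cleared."""
    mask = (1 << MASK_WIDTH) - 1
    for bit in bits:
        mask &= ~(1 << bit)
    return f"{mask:08X}"


if __name__ == "__main__":
    print(set_bits("00000001"))    # bits left enabled by the first command
    print(clear_bits("FFFFFCFF"))  # bits disabled by the second command
    print(mask_without([8, 9]))    # rebuilds the second mask
```

Under that reading, 00000001 leaves only bit 0 enabled, and FFFFFCFF clears bits 8 and 9 while leaving everything else on.
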
> Might also need to try reconfiguring the memory.  Haven't tried 
> partitioning the system on an ES47, but that's one way that a 
> Marvel-class box can re-organize its memory.
> 
> If you can get somebody on-site, re-seat the processors, risers and 
> memory.  Issues with Marvel-class boxes can sometimes be cured with 
> that "simple" expedient.
> 
> If not and otherwise, somebody is going to need to run the fault 
> diagnostics and the ELV error log reports and isolate the hardware 
> error, and swap some boards.  Or as can happen with these sorts of 
> servers, swap around or swap out the boards most likely involved, and 
> see if the gremlins migrate.
> 
> Try scrounging a copy of the MBM CLI manual, too: "AlphaServer 
> ES47/ES80/GS1280 Server Management, Command Line Interface CLI 
> Reference", Version 3.0, October 2003, was the last version around.  The 
> filename used to be "cli_reference_v3.pdf".  That particular manual 
> is... hard to find.

I'll try and do that.
> 
> BTW, ftp.hpe.com has gone https-only (or maybe https and sftp, didn't 
> try that), and now you have to know the filename and full path of the 
> target file to fetch anything from that server.
> 
> 
> -- 
> Pure Personal Opinion | HoffmanLabs LLC