[Info-vax] Alphaserver ES47: Suspected broken CPU, unable to stop/cpu 2

Stephen Hoffman seaohveh at hoffmanlabs.invalid
Thu Aug 9 11:01:28 EDT 2018


On 2018-08-09 08:01:24 +0000, Robin Schrievers said:

> It looks like the zbox memory controller throws the errors which would 
> seem like rimm models being an issue.
> Any thoughts there? (Apart from some resocketing of the parts)

Re-seat processors, memory and risers.   Then start swapping.  I'd 
start with RDRAM/RIMM swappage, then maybe the ZBOX controller and the 
processor.  Or get yourself a spare server and swap that in, and gain 
some time to better troubleshoot the box.

Longer answer...  Decide if this is a "production" cluster, or a 
production cluster.  Right now, it's certainly looking like the former.
If it's decided that the latter is preferable, it's going to cost a 
little more.  If so, get one or two spare servers and some key spare 
parts, and get the spare servers either configured and running 
full-time or available as warm spares via remote management console, 
and get some on-site or near-site technical assistance and some remote 
escalation for outages.

Memory RAID is an option on the Marvel series, and it's not configured here.

Reading Marvel diagnostics and dumps is not particularly fun in my 
experience.  It can get quite tedious, as the docs, the errors and the 
displays don't always align.  The diagnostics can also require poking 
at the server from the console in addition to swapping some parts, as 
the Marvel hardware diagnostics can sometimes be a little... 
misleading... about the errors.  I'd expect to take an hour or so 
gathering and rummaging through the diagnostic reports for details, and 
then determining some candidate swaps (specific RIMMs or maybe the 
ZBOX) based on what's been reported.




-- 
Pure Personal Opinion | HoffmanLabs LLC 



