[Info-vax] Alphaserver ES47: Suspected broken CPU, unable to stop/cpu 2
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Thu Aug 9 11:01:28 EDT 2018
On 2018-08-09 08:01:24 +0000, Robin Schrievers said:
> It looks like the zbox memory controller throws the errors, which would
> seem to point to the RIMM modules being an issue.
> Any thoughts there? (Apart from some resocketing of the parts)
Re-seat processors, memory and risers. Then start swapping. I'd
start with RDRAM/RIMM swappage, then maybe the ZBOX controller and the
processor. Or get yourself a spare server and swap that in, and gain
some time to better troubleshoot the box.
Longer answer... Decide if this is a "production" cluster, or a
production cluster. Right now, it's certainly looking like the former.
If it's decided that the latter is preferable, it's going to cost a
little more. If so, get one or two spare servers and some key spare
parts, and get the spare servers either configured and running
full-time or available as warm spares via remote management console,
and get some on-site or near-site technical assistance and some remote
escalation for outages.
Memory RAID is an option on the Marvel series, and it's not configured here.
Reading Marvel diagnostics and dumps is not particularly fun in my
experience. It can get quite tedious, as the docs, the errors and the
displays don't always align. The Marvel diagnostics can also require
poking at the server from the console in addition to swapping some
parts, as the Marvel hardware diagnostics can sometimes be a
little... misleading... about the errors. I'd expect to take an hour
or so gathering and rummaging through the diagnostic reports for
details, and then determining some candidate swaps (specific RIMMs, or
maybe the ZBOX) based on what's been reported.
--
Pure Personal Opinion | HoffmanLabs LLC