[Info-vax] RX-2800 overheat
Rich Jordan
jordan at ccs4vms.com
Mon Mar 2 14:28:20 EST 2020
RX2800-i2. The machine is no longer in production and is only used for access to historical data and reports; the customer has had it turned off since last November because of an overheat error.
Well, they turned it back on last Thursday; it ran for 2 days, then failed with an overheat again. Can anyone tell from these error entries whether it's possibly a sensor issue, something a physical cleanout might help with, or an actual failing CPU (or apparently a separate 'power pod' module)?
We have asked them (they are remote) to have someone open up the box and verify it's not full of dust bunnies and crap; we are waiting on that response.
They claim that the environment is warm but within spec (around 80F in the server room), but the last time this occurred it was also over a weekend, so I wonder if the cooling is being turned off or turned down to save money...
Needless to say the server is no longer on support.
===========
System Health:
Status Information Details Part Number
Processor 0 major Failed 1600 MHz L3 Cache: 20 MB WL30608
===========
748 informational 2 ILO ACPI_SOFT_OFF 29 Feb 2020 14:13:31 ACPI state S5 (soft-off) Sensor: System ACPI Power State,
S5/G2 soft-off,
205E5A718B02049D FFFF056FFA220400
747 minor 3 ILO TEMP_WARNING 29 Feb 2020 14:13:24 A temperature inside the server has gone outside the factory specified range. Sensor: Temperature - CPU0 VR_THRMALRT,
transition to Non-Critical from OK,
205E5A718402049C FFFF010758010400
746 major 5 ILO TEMP_CRITICAL 29 Feb 2020 14:13:24 A temperature inside the server went far outside the factory specified range. Sensor: Temperature - CPU0 PROCHOT,
transition to Critical from less severe,
205E5A718402049B FFFF020756010400
745 critical 7 ILO TEMP_NON_RECOVERABLE 29 Feb 2020 14:13:24 A temperature inside the server went far outside the factory specified range. Sensor: Temperature - CPU0 THERMTRIP,
transition to Non-recoverable from less severe,
205E5A718402049A FFFF030755010400
744 major 5 ILO CPU_POWER_POD_FAILURE 29 Feb 2020 14:01:35 The CPU power pod is no longer powering the CPU. Physical Location Hard Failure - Processor (Processor Socket): Processor Socket 0,
BA80252800E10498 FFFFFFFFFF00FF11
743 informational 2 OS OS_BOOT_COMPLETE 27 Feb 2020 12:23:26 OS Boot Complete Major change in system state - Boot Complete,
548016E100E10496 0000000000000001
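For anyone reading the entries above, it may help to put them in time order, since the log lists the newest first. Below is a minimal Python sketch that does only that: the entry numbers, keywords and timestamps are copied by hand from the log above, while the names EVENTS, FMT and elapsed are made up for illustration. It assumes an English/C locale so that "Feb" parses as a month name.

```python
from datetime import datetime

# Entry number, keyword and timestamp, copied by hand from the event log above.
EVENTS = [
    (743, "OS_BOOT_COMPLETE", "27 Feb 2020 12:23:26"),
    (744, "CPU_POWER_POD_FAILURE", "29 Feb 2020 14:01:35"),
    (745, "TEMP_NON_RECOVERABLE", "29 Feb 2020 14:13:24"),
    (746, "TEMP_CRITICAL", "29 Feb 2020 14:13:24"),
    (747, "TEMP_WARNING", "29 Feb 2020 14:13:24"),
    (748, "ACPI_SOFT_OFF", "29 Feb 2020 14:13:31"),
]

# Timestamp layout used in the log; %b assumes English month abbreviations.
FMT = "%d %b %Y %H:%M:%S"

# Map each keyword to its parsed timestamp.
TIMES = {keyword: datetime.strptime(stamp, FMT) for _, keyword, stamp in EVENTS}


def elapsed(first, second):
    """Return the time between two logged events as a timedelta."""
    return TIMES[second] - TIMES[first]


if __name__ == "__main__":
    # Print the events oldest first, with the gap since the previous entry.
    ordered = sorted(EVENTS, key=lambda e: (TIMES[e[1]], e[0]))
    previous = None
    for number, keyword, _ in ordered:
        gap = TIMES[keyword] - TIMES[previous] if previous else None
        print(number, keyword, TIMES[keyword], "gap:", gap)
        previous = keyword
```

Going by the logged timestamps alone, the power pod failure (entry 744) was recorded a little under 12 minutes before the three temperature events, which all share one timestamp, and the soft-off followed 7 seconds after those. That ordering says nothing by itself about cause.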