[Info-vax] RX-2800 overheat

Rich Jordan jordan at ccs4vms.com
Mon Mar 2 14:28:20 EST 2020


RX2800-i2.  The machine is no longer in production, only used for access to historical data and reports, and the customer has had it turned off since last November because of an overheat error.

Well, they turned it back on last Thursday; it ran for 2 days, then failed with an overheat again.  Can anyone tell from these error entries whether it's possibly a sensor issue, something a physical cleanout might help with, or an actual failing CPU (or apparently a separate 'power pod' module)?

We have asked them (they are remote) to have someone open up the box and verify it's not full of dust bunnies and crap; we are waiting on that response.

They claim that the environment is warm but within spec (around 80F in the server room), but the last time this occurred it was also over a weekend, so I wonder if the cooling is being turned off or turned down to save money...

Needless to say the server is no longer on support.

===========

System Health:

Status	Information	Details	Part Number
Processor 0	major Failed	1600 MHz	L3 Cache: 20 MB	WL30608

===========

748	informational   2	ILO	ACPI_SOFT_OFF	29 Feb 2020 14:13:31	ACPI state S5 (soft-off)	Sensor: System ACPI Power State,
S5/G2 soft-off,
205E5A718B02049D FFFF056FFA220400

747	minor   3	ILO	TEMP_WARNING	29 Feb 2020 14:13:24	A temperature inside the server has gone outside the factory specified range.	Sensor: Temperature - CPU0 VR_THRMALRT,
transition to Non-Critical from OK,
205E5A718402049C FFFF010758010400

746	major   5	ILO	TEMP_CRITICAL	29 Feb 2020 14:13:24	A temperature inside the server went far outside the factory specified range.	Sensor: Temperature - CPU0 PROCHOT,
transition to Critical from less severe,
205E5A718402049B FFFF020756010400

745	critical   7	ILO	TEMP_NON_RECOVERABLE	29 Feb 2020 14:13:24	A temperature inside the server went far outside the factory specified range.	Sensor: Temperature - CPU0 THERMTRIP,
transition to Non-recoverable from less severe,
205E5A718402049A FFFF030755010400

744	major   5	ILO	CPU_POWER_POD_FAILURE	29 Feb 2020 14:01:35	The CPU power pod is no longer powering the CPU.	Physical Location Hard Failure - Processor (Processor Socket): Processor Socket 0,
BA80252800E10498 FFFFFFFFFF00FF11

743	informational   2	OS	OS_BOOT_COMPLETE	27 Feb 2020 12:23:26	OS Boot Complete	Major change in system state - Boot Complete,
548016E100E10496 0000000000000001
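
For anyone who wants the sequence laid out, here is a rough Python sketch that does nothing more than date arithmetic on the entries above. The entry numbers, keywords and timestamps are copied by hand from the log; the names `events`, `parse` and `gaps` are made up for this example, and it makes no attempt to decode the hex data words.

```python
from datetime import datetime

# (entry number, keyword, timestamp) copied by hand from the log above.
events = [
    (743, "OS_BOOT_COMPLETE", "27 Feb 2020 12:23:26"),
    (744, "CPU_POWER_POD_FAILURE", "29 Feb 2020 14:01:35"),
    (745, "TEMP_NON_RECOVERABLE", "29 Feb 2020 14:13:24"),
    (746, "TEMP_CRITICAL", "29 Feb 2020 14:13:24"),
    (747, "TEMP_WARNING", "29 Feb 2020 14:13:24"),
    (748, "ACPI_SOFT_OFF", "29 Feb 2020 14:13:31"),
]

def parse(ts):
    # %b expects English month abbreviations (true in the default C locale).
    return datetime.strptime(ts, "%d %b %Y %H:%M:%S")

def gaps(evts):
    # Order by entry number, then return the time elapsed between neighbours.
    ordered = sorted(evts, key=lambda e: e[0])
    return [
        (prev[1], cur[1], parse(cur[2]) - parse(prev[2]))
        for prev, cur in zip(ordered, ordered[1:])
    ]

for earlier, later, delta in gaps(events):
    print(f"{earlier} -> {later}: {delta}")
```

By the logged timestamps, the power pod failure (744) comes a bit under 12 minutes before the three temperature events (745-747), which all carry the same second, and the soft-off follows 7 seconds after those. Whether that ordering points at the power pod rather than airflow or a sensor is exactly what I can't tell.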


