[Info-vax] Alphaserver ES47: Suspected broken CPU, unable to stop/cpu 2

Robin Schrievers robin.schrievers at meteogroup.com
Wed Aug 8 08:44:30 EDT 2018


On Wednesday, 8 August 2018 14:10:45 UTC+2, Roy Omond  wrote:
> On 08/08/18 12:24, abrsvc wrote:
> > On Wednesday, August 8, 2018 at 3:09:57 AM UTC-4, Robin Schrievers wrote:
> >> On Wednesday, 8 August 2018 08:53:08 UTC+2, Hans Vlems  wrote:
> >>> Is it possible to physically remove cpu 3, pull the board?
> >>> Hans
> >>
> >> That will be a challenge. As the box is not near me, I can't do that myself. We might be able to have remote hands do it, but that would require a very detailed explanation.
> > 
> > Without getting into details, where is this machine located?  Perhaps there are some of us that are local to the machine that can assist.
> 
> I'd guess Wageningen, the Netherlands.

It's located in Telehouse DC, London

> 
> I'm rather surprised nobody's even suggested to actually analyse the
> dump file (+ probably the errorlog).  You need to figure out exactly
> what has caused the machine check in kernel mode.  It *might* be
> purely coincidence that it seems to only ever occur on CPU 2.
> 
> Robin, please start by posting the output from:
> 
> 	$ analyse/crash sys$system:
> 	SDA> clue crash
> 

SDA> clue crash
Crashdump Summary Information:
------------------------------
Crash Time:         6-AUG-2018 10:35:36.53
Bugcheck Type:     MACHINECHK, Machine check while in kernel mode
Node:              THMG03  (Cluster)
CPU Type:          hp AlphaServer ES47 7/1150
VMS Version:       V7.3-2
Current Process:   RMI_grib_srvr3
Current Image:     $30$DKA0:[SYS0.SYSCOMMON.][ZIPTOOLS.ZIP23XV.VMS-BINARIES]ZIP_CLI.AXP_EXE;1
Failing PC:        FFFFFFFF.80014FCC    EXE$SYSTEM_CORRECTED_ERROR_C+0074C
Failing PS:        20000000.00001F04
Module:            SYS$CPU_ROUTINES_270F    (Link Date/Time: 27-OCT-2004 02:59:15.32)
Offset:            00004FCC

Boot Time:          3-AUG-2018 07:49:07.00
System Uptime:               3 02:46:29.53
Crash/Primary CPU: 02/00
System/CPU Type:   270F
Saved Processes:   42
Pagesize:          8 KByte (8192 bytes)
Physical Memory:   32768 MByte (268435456 PFNs, discontiguous memory)

Dumpfile Pagelets: 769503 blocks
Dump Flags:        olddump,writecomp,errlogcomp
Dump Type:         compressed,selective,shared_mem
EXE$GL_FLAGS:      poolpging,init,bugdump
Paging Files:      1 Pagefile and 1 Swapfile installed

Stack Pointers:
KSP = 00000000.7FF87EE0   ESP = 00000000.7FF8C000   SSP = 00000000.7FF9CD00
USP = 00000000.7AE7B850

General Registers:
R0  = 00000000.00000001   R1  = FFFFFFFF.86BFC000   R2  = 00000000.00000210
R3  = 00000000.00000001   R4  = 00000000.00000000   R5  = 00000008.00002000
R6  = 00000000.0000001A   R7  = 00000000.000979C5   R8  = 00000000.00000000
R9  = 00000000.0007C100   R10 = 00000000.0008C230   R11 = 00000000.0008C110
R12 = 00000000.0005006C   R13 = 00000000.00020000   R14 = 00000000.7C09ED5C
R15 = 00000000.00000001   R16 = 00000000.00000215   R17 = 00000000.00000000
R18 = 00000000.00000210   R19 = 00000000.00000006   R20 = 00000000.00000040
R21 = 00000000.00000000   R22 = 00000000.00000000   R23 = 00000000.00000000

R24 = FFFFFFFF.86BFC000   AI  = 00000000.00000001   RA  = FFFFFFFF.80014FA4
PV  = FFFFFFFF.869D1AC0   R28 = FFFFFFFF.8006AA90   FP  = 00000000.7FF87EE0
PC  = FFFFFFFF.80014FD0   PS  = 20000000.00001F04

System Registers:
Page Table Base Register (PTBR)                           00000000.00035EB4
Processor Base Register (PRBR)                            FFFFFFFF.811C7400
Privileged Context Block Base (PCBB)                      00000000.6BD6A080
System Control Block Base (SCBB)                          00000000.00000F20
Software Interrupt Summary Register (SISR)                00000000.00000000
Address Space Number (ASN)                                00000000.0000007B
AST Summary / AST Enable (ASTSR_ASTEN)                    00000000.0000000F
Floating-Point Enable (FEN)                               00000000.00000001
Interrupt Priority Level (IPL)                            00000000.0000001F
Machine Check Error Summary (MCES)                        00000000.00000000
Virtual Page Table Base Register (VPTB)                   FFFFFEFA.00000000




Failing Instruction:
EXE$SYSTEM_CORRECTED_ERROR_C+0074C:  	BUGCHK

Instruction Stream (last 20 instructions):
EXE$SYSTEM_CORRECTED_ERROR_C+006FC:  	STW		R31,#XFFFA(R0)
EXE$SYSTEM_CORRECTED_ERROR_C+00700:  	STW		R31,#XFFFC(R0)
EXE$SYSTEM_CORRECTED_ERROR_C+00704:  	STW		R18,#XFFFE(R0)
EXE$SYSTEM_CORRECTED_ERROR_C+00708:  	LDL		R16,#X0010(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+0070C:  	LDQ_U		R31,(SP)
EXE$SYSTEM_CORRECTED_ERROR_C+00710:  	JSR		R26,(R26)
EXE$SYSTEM_CORRECTED_ERROR_C+00714:  	LDA		R16,#X0018(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+00718:  	LDA		R27,#XFE90(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+0071C:  	LDQ_U		R31,(SP)
EXE$SYSTEM_CORRECTED_ERROR_C+00720:  	BSR		R26,#X0005D3
EXE$SYSTEM_CORRECTED_ERROR_C+00724:  	BIS		R31,R0,R3
EXE$SYSTEM_CORRECTED_ERROR_C+00728:  	MFPR		MCES
EXE$SYSTEM_CORRECTED_ERROR_C+0072C:  	BIS		R0,#X01,R16
EXE$SYSTEM_CORRECTED_ERROR_C+00730:  	MTPR		MCES
EXE$SYSTEM_CORRECTED_ERROR_C+00734:  	LDQ		R18,#XFDA8(R2)

EXE$SYSTEM_CORRECTED_ERROR_C+00738:  	LDA		R3,#X0004(R3)
EXE$SYSTEM_CORRECTED_ERROR_C+0073C:  	LDQ_U		R31,(SP)
EXE$SYSTEM_CORRECTED_ERROR_C+00740:  	BEQ		R3,#X000003
EXE$SYSTEM_CORRECTED_ERROR_C+00744:  	ADDL		R31,R18,R2
EXE$SYSTEM_CORRECTED_ERROR_C+00748:  	BIS		R2,#X05,R16
EXE$SYSTEM_CORRECTED_ERROR_C+0074C:  	BUGCHK
EXE$SYSTEM_CORRECTED_ERROR_C+00750:  	BIS		R31,FP,SP
EXE$SYSTEM_CORRECTED_ERROR_C+00754:  	LDQ		R26,#X0030(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+00758:  	LDQ		R2,#X0038(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+0075C:  	LDQ		R3,#X0040(FP)
SDA>
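
A few of the values in the summary above can be cross-checked against each other before digging into the errorlog. The following is only a rough Python sketch run off-box, not anything SDA provides: the helper names (parse_vms_time, parse_quad) are made up for illustration, and the PS field positions (IPL in bits 12:8, current mode in bits 4:3) are my reading of the Alpha architecture's OpenVMS processor status layout, so treat them as an assumption.

```python
from datetime import datetime, timedelta

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def parse_vms_time(text):
    """Hypothetical helper: parse 'D-MMM-YYYY HH:MM:SS.CC' as printed by SDA."""
    date_part, time_part = text.split()
    day, mon, year = date_part.split("-")
    hh, mm, ss = time_part.split(":")
    sec, centi = ss.split(".")          # assumes two-digit hundredths
    return datetime(int(year), MONTHS.index(mon.upper()) + 1, int(day),
                    int(hh), int(mm), int(sec), int(centi) * 10000)

def parse_quad(text):
    """Hypothetical helper: turn 'FFFFFFFF.80014FCC' into a 64-bit integer."""
    return int(text.replace(".", ""), 16)

# Values copied from the CLUE CRASH output above.
boot = parse_vms_time("3-AUG-2018 07:49:07.00")
crash = parse_vms_time("6-AUG-2018 10:35:36.53")
uptime = crash - boot                   # should match "System Uptime"

failing_pc = parse_quad("FFFFFFFF.80014FCC")
saved_pc = parse_quad("FFFFFFFF.80014FD0")
module_offset = 0x00004FCC
module_base = failing_pc - module_offset   # where the module appears to be loaded
pc_delta = saved_pc - failing_pc           # saved PC is one instruction past BUGCHK

ps = parse_quad("20000000.00001F04")
ipl = (ps >> 8) & 0x1F                  # assumed field: PS<12:8>
current_mode = (ps >> 3) & 0x3          # assumed field: PS<4:3>, 0 = kernel

print("uptime      :", uptime)
print("module base : %016X" % module_base)
print("PC delta    :", pc_delta)
print("IPL / mode  :", ipl, current_mode)
```

Under those assumptions the numbers agree with each other: the computed uptime equals the reported 3 02:46:29.53, the saved PC sits 4 bytes after the failing BUGCHK instruction, and the IPL decoded from the PS matches the IPL register value of 1F. That says nothing about the root cause, only that the dump summary is internally consistent.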


> Next you'll probably need to:
> 
> 	$ analyse/crash sys$system:
> 	SDA> clue err
> 
> This will dump the errorlog buffers active at the time of the crash
> to a file sys$login:clue$errlog.sys
> 
> It will then depend on whether you have DECevent or WSEA installed
> on your system in order to analyse this file.  Try first:
> 
> 	$ dia sys$login:clue$errlog.sys

I already have the clue$errlog.sys file (I got that far already). Sadly, DIAGNOSE is not installed on the box.


