[Info-vax] Alphaserver ES47: Suspected broken CPU, unable to stop/cpu 2
Robin Schrievers
robin.schrievers at meteogroup.com
Wed Aug 8 08:44:30 EDT 2018
On Wednesday, 8 August 2018 14:10:45 UTC+2, Roy Omond wrote:
> On 08/08/18 12:24, abrsvc wrote:
> > On Wednesday, August 8, 2018 at 3:09:57 AM UTC-4, Robin Schrievers wrote:
> >> On Wednesday, 8 August 2018 08:53:08 UTC+2, Hans Vlems wrote:
> >>> Is it possible to physically remove cpu 3, pull the board?
> >>> Hans
> >>
> >> That will be a challenge. As the box is not near me, I can't do that myself. We might be able to have remote hands do it, but that would require a very detailed explanation.
> >
> > Without getting into details, where is this machine located? Perhaps there are some of us that are local to the machine that can assist.
>
> I'd guess Wageningen, the Netherlands.
It's located in the Telehouse DC, London.
>
> I'm rather surprised nobody's even suggested to actually analyse the
> dump file (+ probably the errorlog). You need to figure out exactly
> what has caused the machine check in kernel mode. It *might* be
> purely coincidence that it seems to only ever occur on CPU 2.
>
> Robin, please start by posting the output from:
>
> $ analyse/crash sys$system:
> SDA> clue crash
>
SDA> clue crash
Crashdump Summary Information:
------------------------------
Crash Time: 6-AUG-2018 10:35:36.53
Bugcheck Type: MACHINECHK, Machine check while in kernel mode
Node: THMG03 (Cluster)
CPU Type: hp AlphaServer ES47 7/1150
VMS Version: V7.3-2
Current Process: RMI_grib_srvr3
Current Image: $30$DKA0:[SYS0.SYSCOMMON.][ZIPTOOLS.ZIP23XV.VMS-BINARIES]ZIP_CLI.AXP_EXE;1
Failing PC: FFFFFFFF.80014FCC EXE$SYSTEM_CORRECTED_ERROR_C+0074C
Failing PS: 20000000.00001F04
Module: SYS$CPU_ROUTINES_270F (Link Date/Time: 27-OCT-2004 02:59:15.32)
Offset: 00004FCC
Boot Time: 3-AUG-2018 07:49:07.00
System Uptime: 3 02:46:29.53
Crash/Primary CPU: 02/00
System/CPU Type: 270F
Saved Processes: 42
Pagesize: 8 KByte (8192 bytes)
Physical Memory: 32768 MByte (268435456 PFNs, discontiguous memory)
Dumpfile Pagelets: 769503 blocks
Dump Flags: olddump,writecomp,errlogcomp
Dump Type: compressed,selective,shared_mem
EXE$GL_FLAGS: poolpging,init,bugdump
Paging Files: 1 Pagefile and 1 Swapfile installed
Stack Pointers:
KSP = 00000000.7FF87EE0 ESP = 00000000.7FF8C000 SSP = 00000000.7FF9CD00
USP = 00000000.7AE7B850
General Registers:
R0 = 00000000.00000001 R1 = FFFFFFFF.86BFC000 R2 = 00000000.00000210
R3 = 00000000.00000001 R4 = 00000000.00000000 R5 = 00000008.00002000
R6 = 00000000.0000001A R7 = 00000000.000979C5 R8 = 00000000.00000000
R9 = 00000000.0007C100 R10 = 00000000.0008C230 R11 = 00000000.0008C110
R12 = 00000000.0005006C R13 = 00000000.00020000 R14 = 00000000.7C09ED5C
R15 = 00000000.00000001 R16 = 00000000.00000215 R17 = 00000000.00000000
R18 = 00000000.00000210 R19 = 00000000.00000006 R20 = 00000000.00000040
R21 = 00000000.00000000 R22 = 00000000.00000000 R23 = 00000000.00000000
R24 = FFFFFFFF.86BFC000 AI = 00000000.00000001 RA = FFFFFFFF.80014FA4
PV = FFFFFFFF.869D1AC0 R28 = FFFFFFFF.8006AA90 FP = 00000000.7FF87EE0
PC = FFFFFFFF.80014FD0 PS = 20000000.00001F04
System Registers:
Page Table Base Register (PTBR) 00000000.00035EB4
Processor Base Register (PRBR) FFFFFFFF.811C7400
Privileged Context Block Base (PCBB) 00000000.6BD6A080
System Control Block Base (SCBB) 00000000.00000F20
Software Interrupt Summary Register (SISR) 00000000.00000000
Address Space Number (ASN) 00000000.0000007B
AST Summary / AST Enable (ASTSR_ASTEN) 00000000.0000000F
Floating-Point Enable (FEN) 00000000.00000001
Interrupt Priority Level (IPL) 00000000.0000001F
Machine Check Error Summary (MCES) 00000000.00000000
Virtual Page Table Base Register (VPTB) FFFFFEFA.00000000
Failing Instruction:
EXE$SYSTEM_CORRECTED_ERROR_C+0074C: BUGCHK
Instruction Stream (last 20 instructions):
EXE$SYSTEM_CORRECTED_ERROR_C+006FC: STW R31,#XFFFA(R0)
EXE$SYSTEM_CORRECTED_ERROR_C+00700: STW R31,#XFFFC(R0)
EXE$SYSTEM_CORRECTED_ERROR_C+00704: STW R18,#XFFFE(R0)
EXE$SYSTEM_CORRECTED_ERROR_C+00708: LDL R16,#X0010(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+0070C: LDQ_U R31,(SP)
EXE$SYSTEM_CORRECTED_ERROR_C+00710: JSR R26,(R26)
EXE$SYSTEM_CORRECTED_ERROR_C+00714: LDA R16,#X0018(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+00718: LDA R27,#XFE90(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+0071C: LDQ_U R31,(SP)
EXE$SYSTEM_CORRECTED_ERROR_C+00720: BSR R26,#X0005D3
EXE$SYSTEM_CORRECTED_ERROR_C+00724: BIS R31,R0,R3
EXE$SYSTEM_CORRECTED_ERROR_C+00728: MFPR MCES
EXE$SYSTEM_CORRECTED_ERROR_C+0072C: BIS R0,#X01,R16
EXE$SYSTEM_CORRECTED_ERROR_C+00730: MTPR MCES
EXE$SYSTEM_CORRECTED_ERROR_C+00734: LDQ R18,#XFDA8(R2)
EXE$SYSTEM_CORRECTED_ERROR_C+00738: LDA R3,#X0004(R3)
EXE$SYSTEM_CORRECTED_ERROR_C+0073C: LDQ_U R31,(SP)
EXE$SYSTEM_CORRECTED_ERROR_C+00740: BEQ R3,#X000003
EXE$SYSTEM_CORRECTED_ERROR_C+00744: ADDL R31,R18,R2
EXE$SYSTEM_CORRECTED_ERROR_C+00748: BIS R2,#X05,R16
EXE$SYSTEM_CORRECTED_ERROR_C+0074C: BUGCHK
EXE$SYSTEM_CORRECTED_ERROR_C+00750: BIS R31,FP,SP
EXE$SYSTEM_CORRECTED_ERROR_C+00754: LDQ R26,#X0030(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+00758: LDQ R2,#X0038(FP)
EXE$SYSTEM_CORRECTED_ERROR_C+0075C: LDQ R3,#X0040(FP)
SDA>
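The figures in the CLUE output above can be cross-checked with a little arithmetic. What follows is a minimal Python sketch, not something that was run on THMG03. The helper names (parse_vms_time, parse_sda_quad, fmt_quad) are made up for illustration, and the PS bit positions (IPL in bits 12:8, current mode in bits 4:3) are an assumption taken from the Alpha architecture's OpenVMS processor status layout. All input values are copied from the dump summary.

```python
from datetime import datetime

MONTHS = {m: i + 1 for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

def parse_vms_time(text):
    """Parse a VMS absolute time such as '6-AUG-2018 10:35:36.53'."""
    date_part, time_part = text.split()
    day, mon, year = date_part.split("-")
    hh, mm, ss = time_part.split(":")
    sec, hundredths = ss.split(".")
    return datetime(int(year), MONTHS[mon.upper()], int(day),
                    int(hh), int(mm), int(sec), int(hundredths) * 10000)

def parse_sda_quad(text):
    """Turn SDA's 'FFFFFFFF.80014FCC' quadword notation into an integer."""
    return int(text.replace(".", ""), 16)

def fmt_quad(value):
    """Format an integer back into SDA's high.low quadword notation."""
    return f"{value >> 32:08X}.{value & 0xFFFFFFFF:08X}"

# Uptime = crash time - boot time; CLUE reports "3 02:46:29.53".
uptime = parse_vms_time("6-AUG-2018 10:35:36.53") - parse_vms_time("3-AUG-2018 07:49:07.00")
print("uptime:", uptime)

# Failing PC minus the module offset gives the load address of
# SYS$CPU_ROUTINES_270F; minus the routine offset (+0074C) gives the
# start of EXE$SYSTEM_CORRECTED_ERROR_C.
failing_pc = parse_sda_quad("FFFFFFFF.80014FCC")
module_base = failing_pc - 0x00004FCC
routine_base = failing_pc - 0x0074C
print("module base: ", fmt_quad(module_base))
print("routine base:", fmt_quad(routine_base))

# The saved PC points at the instruction after the BUGCHK (4-byte instructions).
next_pc = failing_pc + 4
print("next PC:     ", fmt_quad(next_pc))

# Assumed PS layout: IPL = bits 12:8, current mode = bits 4:3.
ps = parse_sda_quad("20000000.00001F04")
ipl = (ps >> 8) & 0x1F
mode = ["kernel", "executive", "supervisor", "user"][(ps >> 3) & 0x3]
print(f"IPL: {ipl:#x}, mode: {mode}")
```

If the assumed PS layout is right, the decoded IPL should agree with the IPL system register in the dump (1F) and the mode with the "while in kernel mode" bugcheck text; the next PC should match the PC shown in the register display.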
> Next you'll probably need to:
>
> $ analyse/crash sys$system:
> SDA> clue err
>
> This will dump the errorlog buffers active at the time of the crash
> to a file sys$login:clue$errlog.sys
>
> It will then depend on whether you have DECevent or WSEA installed
> on your system in order to analyse this file. Try first:
>
> $ dia sys$login:clue$errlog.sys
I already have the clue$errlog.sys file (I got that far), but sadly DIAGNOSE (the dia command) is not installed on the box.