[Info-vax] MACHINECHK on my XP900...
Hans Vlems
hvlems at freenet.de
Sun Feb 13 15:10:24 EST 2011
On 13 feb, 17:48, Jan-Erik Soderholm <jan-erik.soderh... at telia.com>
wrote:
> Jan-Erik Soderholm wrote 2011-02-13 17:32:
>
>
>
> > Jan-Erik Soderholm wrote 2011-02-13 17:07:
> >> Hans Vlems wrote 2011-02-13 10:31:
> >>> On Feb 12, 5:57 pm, Jan-Erik Soderholm<jan-erik.soderh... at telia.com>
> >>> wrote:
> >>>> Hi.
>
> >>>> I have some trouble with my XP900 466 MHz system.
>
> >>>> Currently, when powered on, it boots (VMS 8.3) and runs
> >>>> for 5-10 minutes, then I get this on the console :
>
> >>>> --------------------------------------------------------------------------
>
> >>>> **** OpenVMS Alpha Operating System V8.3 - BUGCHECK ****
>
> >>>> ** Bugcheck code = 00000215: MACHINECHK, Machine check while in kernel
> >>>> mode
> >>>> ** Crash CPU: 00000000 Primary CPU: 00000000 Node Name: OSSBY1
> >>>> ** Supported CPU count: 00000001
> >>>> ** Active CPUs: 00000000.00000001
> >>>> ** Current Process: NULL
> >>>> ** Current PSB ID: 00000001
> >>>> ** Image Name:
>
> >>>> ** Dumping error log buffers to HBVS unit 0
>
> >>>> **** No supported device(s) found in DUMP_DEV
> >>>> **** No DUMP_DEV devices found
> >>>> **** Attempting to write the crash dump to the system disk
>
> >>>> --------------------------------------------------------------------------
>
> >>>> Before I begin fault-tracing, I thought I'd ask if there
> >>>> is anything in that message that "sticks out" ?
>
> >>>> I had the box opened before this started and I might have
> >>>> touched some RAM module or something like that. I do not know.
>
> >>>> Is the code = 00000215 trying to tell me something important ? :-)
>
> >>>> I was doing nothing in VMS, just booted and waited for the crash.
>
> >>>> Jan-Erik.
>
> >>> Good morning Jan-Erik,
> >>> very likely the code is trying to tell you/us something, but very few
> >>> people speak that language these days....
> >>> As you suggested, reseating modules (memory modules, cpu board and pci
> >>> controllers) is a good start.
> >>> Perhaps one of the memory modules has developed a hardware problem, so
> >>> reducing the memory is an option.
> >>> The system takes 5-10 minutes to crash, so it is not a straightforward
> >>> hardware problem.
> >>> If the system isn't doing anything then we can assume that there is no
> >>> VMS related problem, right?
> >>> So it may be temperature related, or an intermittent hardware
> >>> problem.
> >>> I never saw an XP900 (just an XP1000) but if the cpu has a fan on top
> >>> of it, check whether it rotates freely.
> >>> Other than that, strip the system to its minimum configuration (cd,
> >>> cpu and minimal memory) and run VMS off cd.
> >>> If nothing happens, add more hardware until the problem appears again.
> >>> If it does, well, I would defy Murphy and suspect the cheapest
> >>> components: memory ;-)
> >>> Does the XP900 have a memtest command in nvram?
> >>> Hans
>
> >> Hi again and thanks to those responding.
> >> Yes, there is a "memtest" command, but I can't make it do anything
> >> (I think); it silently returns to the >>> prompt.
>
> >> I found something else. SHOW POWER gives this output :
>
> >>>>> show power
>
> >> Status
> >> Power Supply good
> >> System Fan/PCI Fan good
> >> CPU Fan good
> >> Temperature good
>
> >> Current ambient temperature is 56 degrees C
> >> System shutdown temperature is set to 60 degrees C
>
> >> 8 Environmental events are logged in nvram
> >> Do you want to view the events? (Y/<N>) y
>
> >> Total Environmental Events: 8 (8 logged)
>
> >> 1 FEB 11 6:44 Temperature Failure
> >> 2 FEB 12 16:08 Temperature Failure
> >> 3 FEB 12 16:15 Temperature Failure
> >> 4 FEB 12 16:19 Temperature Failure
> >> 5 FEB 12 16:34 Temperature Failure
> >> 6 FEB 13 16:01 Temperature Failure
> >> 7 FEB 13 16:11 Temperature Failure
> >> 8 FEB 13 16:15 Temperature Failure
>
> >> These timestamps seem to be the same as the crashes
> >> I've had.
>
> >> I also saw that while powering on I got :
> >> "System Temperature is 59 degrees C".
>
> >> The system has been powered off during the night,
> >> so *it seems* as if something is weird with the temp
> >> measurement !?
>
> >> I have as a quick workaround done SET SHUTDOWN_TEMP 70
> >> and we'll see if it keeps running longer.
>
> >> Has anyone seen a temp-sensor gone bad in a DS10, XP900 ?
>
> >> Jan-Erik.
>
> > This is a very similar problem description :
> >http://forums13.itrc.hp.com/service/forums/questionanswer.do?threadId...
>
> > This page talks about temp sensor problems on DS10 :
> >http://h30097.www3.hp.com/docs/updates/V51B/html/ar01s06.html
>
> > Maybe time to upgrade to a newer Alpha... :-)
>
> Sorry for yet another post on this issue... :-)
>
> The temp can be read from within VMS, and it gives another value:
>
> $ temp = f$getsyi("temperature_vector")
> $ sh sym temp
> TEMP = "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF38"
> $
>
> The last two chars are the temp in degC; 38 is a rather
> sensible value.
>
> I made a quick shutdown to re-check the value in console mode
> and it still says "Current ambient temperature is 56 degrees C".
>
> Weird...
Weird, but I'd rather trust the output value and figure out why the box
is running that hot.
BTW ambient means "near the sensor", so somewhere inside the cabinet.
My XP1000 has a cpu with a passive heatsink. I'm typing this on
another system, a Digital Server 5305:
$ temp = f$getsyi("temperature_vector")
$ sh sym temp
TEMP = "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF2C"
$
And 44 degrees Celsius (2C hex) is not that hot.
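Reading the DCL output as hexadecimal rather than decimal changes the picture: 38 hex is 56 decimal, which is exactly what the console reports, and 2C hex is 44. A minimal Python sketch of that decoding follows. It assumes each pair of hex digits is one sensor byte and that FF means no sensor is present; the function names are made up for illustration, and negative temperatures are not handled.

```python
from typing import List, Optional


def decode_temperature_vector(vector: str) -> List[Optional[int]]:
    """Split a TEMPERATURE_VECTOR string into per-sensor readings in degrees C.

    Assumptions (not confirmed by the thread): two hex digits per sensor,
    and a byte of FF means there is no sensor in that position.
    """
    vector = vector.strip()
    if len(vector) % 2 != 0:
        raise ValueError("expected an even number of hex digits")
    readings: List[Optional[int]] = []
    for i in range(0, len(vector), 2):
        byte = int(vector[i:i + 2], 16)  # hex pair -> integer
        readings.append(None if byte == 0xFF else byte)
    return readings


def last_sensor_celsius(vector: str) -> Optional[int]:
    """Return the reading held in the last byte, as quoted in the thread."""
    return decode_temperature_vector(vector)[-1]


# Values shaped like the two quoted above: fifteen empty slots, one reading.
xp900 = "FF" * 15 + "38"
server = "FF" * 15 + "2C"
print(last_sensor_celsius(xp900))   # 56
print(last_sensor_celsius(server))  # 44
```

If that reading is right, VMS and the console agree on 56 degrees C for the XP900, and there is no second, lower value to explain.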
The last thing to do is increase the shutdown temperature!!
Hans