[Info-vax] MACHINECHK on my XP900...

Jan-Erik Soderholm jan-erik.soderholm at telia.com
Sun Feb 13 17:21:28 EST 2011


Hans Vlems wrote 2011-02-13 21:10:
> On 13 feb, 17:48, Jan-Erik Soderholm<jan-erik.soderh... at telia.com>
> wrote:
>> Jan-Erik Soderholm wrote 2011-02-13 17:32:
>>
>>
>>
>>> Jan-Erik Soderholm wrote 2011-02-13 17:07:
>>>> Hans Vlems wrote 2011-02-13 10:31:
>>>>> On Feb 12, 5:57 pm, Jan-Erik Soderholm<jan-erik.soderh... at telia.com>
>>>>> wrote:
>>>>>> Hi.
>>
>>>>>> I have some trouble with my XP900 466 MHz system.
>>
>>>>>> Currently, when powered on, it boots (VMS 8.3) and runs
>>>>>> for 5-10 minutes, then I get this on the console :
>>
>>>>>> --------------------------------------------------------------------------
>>
>>>>>> **** OpenVMS Alpha Operating System V8.3 - BUGCHECK ****
>>
>>>>>> ** Bugcheck code = 00000215: MACHINECHK, Machine check while in kernel mode
>>>>>> ** Crash CPU: 00000000 Primary CPU: 00000000 Node Name: OSSBY1
>>>>>> ** Supported CPU count: 00000001
>>>>>> ** Active CPUs: 00000000.00000001
>>>>>> ** Current Process: NULL
>>>>>> ** Current PSB ID: 00000001
>>>>>> ** Image Name:
>>
>>>>>> ** Dumping error log buffers to HBVS unit 0
>>
>>>>>> **** No supported device(s) found in DUMP_DEV
>>>>>> **** No DUMP_DEV devices found
>>>>>> **** Attempting to write the crash dump to the system disk
>>
>>>>>> --------------------------------------------------------------------------
>>
>>>>>> Before I begin fault-tracing, I thought I'd ask if there
>>>>>> is anything in that message that "sticks out" ?
>>
>>>>>> I had the box opened before this started and I might have
>>>>>> touched some RAM module or something like that. I do not know.
>>
>>>>>> Is the code = 00000215 trying to tell me something important ? :-)
>>
>>>>>> I was doing nothing in VMS, just booted and waited for the crash.
>>
>>>>>> Jan-Erik.
>>
>>>>> Good morning Jan-Erik,
>>>>> very likely the code is trying to tell you/us something, but very few
>>>>> people speak that language these days....
>>>>> As you suggested, reseating modules (memory modules, cpu board and pci
>>>>> controllers) is a good start.
>>>>> Perhaps one of the memory modules has developed a hardware problem, so
>>>>> reducing the memory is an option.
>>>>> The system takes 5-10 minutes to crash, so it is not a straightforward
>>>>> hardware problem.
>>>>> If the system isn't doing anything then we can assume that there is no
>>>>> VMS related problem, right?
>>>>> So it may be temperature related, or an intermittent hardware
>>>>> problem.
>>>>> I never saw an XP900 (just an XP1000) but if the cpu has a fan on top
>>>>> of it, check whether it rotates freely.
>>>>> Other than that, strip the system to its minimum configuration (cd,
>>>>> cpu and minimal memory) and run VMS off cd.
>>>>> If nothing happens, add more hardware until the problem appears again.
>>>>> If it does, well, I would defy Murphy and suspect the cheapest
>>>>> components: memory ;-)
>>>>> Does the XP900 have a memtest command in nvram?
>>>>> Hans
>>
>>>> Hi again and thanks to those responding.
>>>> Yes, there is a "memtest" command, but I can't make it do anything
>>>> (I think it silently returns to the >>> prompt).
>>
>>>> I found something else. SHOW POWER gives this output :
>>
>>>> >>> show power
>>
>>>> Status
>>>> Power Supply              good
>>>> System Fan/PCI Fan        good
>>>> CPU Fan                   good
>>>> Temperature               good
>>
>>>> Current ambient temperature is 56 degrees C
>>>> System shutdown temperature is set to 60 degrees C
>>
>>>> 8 Environmental events are logged in nvram
>>>> Do you want to view the events? (Y/<N>) y
>>
>>>> Total Environmental Events: 8 (8 logged)
>>
>>>> 1 FEB 11 6:44 Temperature Failure
>>>> 2 FEB 12 16:08 Temperature Failure
>>>> 3 FEB 12 16:15 Temperature Failure
>>>> 4 FEB 12 16:19 Temperature Failure
>>>> 5 FEB 12 16:34 Temperature Failure
>>>> 6 FEB 13 16:01 Temperature Failure
>>>> 7 FEB 13 16:11 Temperature Failure
>>>> 8 FEB 13 16:15 Temperature Failure
>>
>>>> These timestamps seem to match the crashes
>>>> I've had.
>>
>>>> I also saw that while powering on I got :
>>>> "System Temperature is 59 degrees C".
>>
>>>> The system has been powered off during the night,
>>>> so *it seems* as if something is weird with the temp
>>>> measurement!?
>>
>>>> I have as a quick workaround done SET SHUTDOWN_TEMP 70
>>>> and we'll see if it keeps running longer.
>>
>>>> Has anyone seen a temp sensor go bad in a DS10 or XP900?
>>
>>>> Jan-Erik.
>>
>>> This is a very similar problem description :
>>> http://forums13.itrc.hp.com/service/forums/questionanswer.do?threadId...
>>
>>> This page talks about temp sensor problems on DS10 :
>>> http://h30097.www3.hp.com/docs/updates/V51B/html/ar01s06.html
>>
>>> Maybe time to upgrade to a newer Alpha... :-)
>>
>> Sorry for yet another post on this issue... :-)
>>
>> The temp can be read from within VMS, and it gives another value:
>>
>> $ temp = f$getsyi("temperature_vector")
>> $ sh sym temp
>>     TEMP = "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF38"
>> $
>>
>> The last two chars are the temp in degC; 38 is a rather
>> sensible value.
>>
>> I made a quick shutdown to re-check the value in console mode
>> and it still says "Current ambient temperature is 56 degrees C".
>>
>> Weird...
>
> Weird, but I'd rather trust the output value and figure out why the box
> is running that hot.
> BTW ambient means "near the sensor", so somewhere inside the cabinet.
> My XP1000 has a cpu with a passive heatsink. I'm typing this on
> another system, a Digital Server 5305:
>
> $ temp = f$getsyi("temperature_vector")
> $ sh sym temp
>     TEMP = "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF2C"
> $
> And 44 degrees Celsius (hex 2C) is not that hot.
> The last thing to do is increase the shutdown temperature!!
> Hans

Now I'm quite sure that the temp measured by the box is way off.

The box had been shut down overnight and should not be close to
60 degC right at power-on. And when the measurement reported close to 60
degC I could very comfortably put my hand on the CPU heatsink (which
is fan-cooled, with its own fan on top of the heatsink on the XP900)
and it was not much more than body temp. There was nothing else in
the box with a temp far higher than body temp. 60 degC definitely hurts.

All fans in the box run just fine (CPU, front/memory and power supply
fans).

There is nothing else weird with it apart from the reported temp
being higher than is actually possible.

As I posted earlier, there have been other reports about the sensor
going nuts on DS10 boxes that are not that different from the XP900.

I just checked right now (about 7-8 hours uptime, so the temp
should have had time to stabilize) and it now reports 60 degC.

I would guess that the actual temp is somewhere around 40 degC
as is the case with the Digital Server 5305.
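
As a cross-check on the two TEMPERATURE_VECTOR strings quoted above, here is a minimal Python sketch (run on any machine, not on VMS) that decodes them. It assumes each pair of characters is one sensor byte written in hexadecimal and that FF means "no sensor present"; that reading is an assumption based on how the strings look, not something confirmed in this thread. The function name decode_temperature_vector is made up for the illustration.

```python
from typing import List


def decode_temperature_vector(vector: str) -> List[int]:
    """Return sensor readings in degrees C, rightmost byte first.

    Assumption: two hex digits per sensor, FF = no sensor present.
    """
    readings = []
    # Walk the string from the right-hand end, two characters at a time.
    for end in range(len(vector), 1, -2):
        value = int(vector[end - 2:end], 16)
        if value != 0xFF:  # skip the assumed "no sensor" marker
            readings.append(value)
    return readings


# The two strings as posted in this thread.
xp900 = decode_temperature_vector("FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF38")
server_5305 = decode_temperature_vector("FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF2C")
print("XP900:", xp900, "Digital Server 5305:", server_5305)
```

If the hexadecimal assumption holds, 38 decodes to 56 and 2C to 44, so the value seen from VMS would be the same 56 degC the console reports rather than an independent, lower reading.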

Anyway, I raised the limit by 10 degrees, so if something really does
happen (fans stopping) it will still shut down soon enough. The one thing
that troubles me now is that I really can't trust the temp sensor
at all. Hopefully it still "works", just giving too high a value.

And I think I will look out for a replacement box anyway, the
DS15 seems like a nice box...

Jan-Erik.









