[Info-vax] Home-grown application process dumps

Mon Jan 5 12:42:25 EST 2015

In article <fdbc5085-371e-44da-b6aa-878ae5f0ac61 at googlegroups.com>, RGB <11brvo at gmail.com> writes:
>On Monday, January 5, 2015 12:01:29 PM UTC-5, Stephen Hoffman wrote:
>> On 2015-01-05 16:23:40 +0000, RGB said:
>>
>> > Hi all and happy new year.
>> >
>> > We are currently running a home-grown application, which uses its own
>> > "database" on an Itanium rx2800 i2 cluster of 2 nodes running VMS v8.4.
>> > All ECO's are up to date on the cluster.  This home-grown application
>> > has various modules which do specific tasks.  Two of the modules have
>> > been crashing/dumping as of late and the developers, who wrote the
>> > code, claim it's a bug in VMS whereas I believe that the cause of the
>> > process dumps are coding issues.  I'm going to output a couple of the
>> > process dumps here with the hope that someone could give me their
>> > opinion on what might be causing these processes to dump like this.
>> > That's my hope anyway!
>>
>> Any claim of "it's a bug in VMS" is unfortunately immediately suspect,
>> without some supporting evidence and/or a reproducer.  While it might
>> well be a VMS bug, any programmers involved here should be working to
>> isolate the error, and to create a reproducer.   I've learned to
>> perform that with misbehavior in my own code, and creating a reproducer
>> can and variously does lead to the discovery of the bug somewhere in my
>> own code.  If this is a VMS bug, the reproducer is something you can
>> hand to HP support, too -- that usually makes getting a response and a
>> fix from HP Support all that much faster, as they're not wading through
>> your code.
>>
>> The relevant bits in that dump appear to be the following:
>>
>> %BAS-F-MEMMANVIO, Memory management violation
>>
>> and
>>
>> -BAS-I-USEPC_PSL, at user PC=84236620, PSL=0000001B
>> -SYSTEM-F-ACCVIO, access violation, reason mask=04, virtual
>> address=00000000002D0002, PC=FFFFFFFF84236620, PS=0000001B
>>
>> and
>>
>> %SYSTEM-F-OPCCUS, opcode reserved to customer fault at
>> PC=FFFFFFFF848DBB20, PS=0000001B
>>
>> On no particular evidence beyond that 00000000002D0002 virtual address
>> value looking rather bogus, I'd be looking for some BASIC code
>> somewhere that makes a call that passes a string descriptor by value,
>> and not by reference.   That particular 002D0002 value is the first
>> longword of a two-byte dynamic text string descriptor, after all:
>>
>> #define DSC$K_DTYPE_T 14                /* Character-coded text. A single 8-bit character */
>> #define DSC$K_CLASS_D 2                 /* Dynamic String Descriptor */
>>
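
To make the descriptor arithmetic above easier to check, here is a minimal C sketch that packs and unpacks the first longword of a string descriptor. It assumes the usual little-endian layout (16-bit length in the low word, then the dtype byte, then the class byte). The names desc_header, pack_desc_header and unpack_desc_header are made up for illustration and come from no VMS header, and the constants are spelled DSC_K_... because `$` is not portable in C identifiers. Under that assumed layout, 002D0002 unpacks to length 2, dtype 45, class 0, whereas a two-byte DTYPE_T/CLASS_D descriptor would pack to 020E0002, so the decode is worth checking against the actual dump before settling on the by-value theory.

```c
#include <stdint.h>

/* Values quoted above from the descriptor definitions. */
#define DSC_K_DTYPE_T 14u   /* character-coded text */
#define DSC_K_CLASS_D 2u    /* dynamic string descriptor */

/* Hypothetical holder for the three fields in a descriptor's first longword. */
typedef struct {
    uint16_t length;   /* string length in bytes */
    uint8_t  dtype;    /* data type code */
    uint8_t  dclass;   /* descriptor class code */
} desc_header;

/* ASSUMPTION: little-endian layout, length in bits 0-15,
 * dtype in bits 16-23, class in bits 24-31. */
static uint32_t pack_desc_header(desc_header h)
{
    return (uint32_t)h.length
         | ((uint32_t)h.dtype  << 16)
         | ((uint32_t)h.dclass << 24);
}

/* Inverse of pack_desc_header: split a longword back into its fields. */
static desc_header unpack_desc_header(uint32_t lw)
{
    desc_header h;
    h.length = (uint16_t)(lw & 0xFFFFu);
    h.dtype  = (uint8_t)((lw >> 16) & 0xFFu);
    h.dclass = (uint8_t)((lw >> 24) & 0xFFu);
    return h;
}
```

Feeding the low 32 bits of a suspicious faulting address through unpack_desc_header is a quick way to see whether it looks like descriptor contents rather than a real pointer.
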
>> Look around for what is located at virtual address FFFFFFFF84236620,
>> and at what subroutine calls lead up to the execution of the code
>> there, too -- that's some system space code that's involved, and quite
>> possibly the code that's behind a system service call.
>>
>> If that's not it, I'd start looking for a memory heap corruption, as
>> those can blow out all over the place, and with all sorts of odd
>> errors.  With BASIC, that's usually some system service call or similar
>> that exceeds the size of the string that's been presented to the system
>> service call -- system services generally don't re-size or extend
>> dynamic string descriptors, the calls just keep writing however much
>> they've been asked to write.  Ten pounds of bytes into a five-pound
>> string buffer makes for a corrupt heap, after all.
>>
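
The overflow described above can be sketched in portable C. This is an illustration only, not VMS code: fixed_buf and bounded_store are hypothetical names, and the struct merely imitates a fixed-length descriptor (a length plus a pointer). The point is that the buffer will not grow on its own, so whoever writes into it has to enforce the length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-length string descriptor. */
typedef struct {
    size_t length;   /* bytes available at pointer */
    char  *pointer;  /* caller-owned storage */
} fixed_buf;

/* Copy at most dst->length bytes from src into the buffer.
 * Returns the number of bytes that did NOT fit (0 means nothing was lost). */
static size_t bounded_store(fixed_buf *dst, const char *src, size_t n)
{
    size_t copy = (n < dst->length) ? n : dst->length;
    memcpy(dst->pointer, src, copy);
    return n - copy;
}
```

A caller that gets a nonzero return knows the data was truncated and can log it at the point of the call, rather than finding out much later from a corrupt heap.
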
>> It's usually easier to trace these sorts of bugs when the application
>> code contains its own integrated logging and tracing support, and
>> contains its own signal handler and dump support.  Looking at a process
>> dump is slightly tedious and rather further downstream from the error
>> and the application code, after all.  Having the ability to trigger the
>> debugger via SS$_DEBUG and generate some specific output, and maybe a
>> call to the traceback routine, can be helpful, too.
>>
>> Related:
>> <http://labs.hoffmanlabs.com/node/803>
>> <http://labs.hoffmanlabs.com/node/848>
>> <http://labs.hoffmanlabs.com/node/800>
>> <http://labs.hoffmanlabs.com/node/800#comment-2049>
>>
>> -- 
>> Pure Personal Opinion | HoffmanLabs LLC
>
>Hi Steve,
>
>Happy new year to you.  Thanks for your synopsis.  What I find interesting
>about the above is that these "bugs" can NOT be reproduced in our
>test/development/QA environments.  Said environments run on exactly the same
>hardware and config i.e., rx2800 i2 with 32GB RAM and VMS v8.4.  These
>processes dump ONLY in production but, then again,
>the modules are more heavily utilized in production than in the
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>aforementioned test/dev environments.
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I think you've answered your own questions here.

-- 
VAXman- A Bored Certified VMS Kernel Mode Hacker    VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.