[Info-vax] SHADDETINCON, SHADOWING detects inconsistent state

Thu Jan 1 10:50:42 EST 2009

On 1 jan, 12:56, hel... at astro.multiCLOTHESvax.de (Phillip Helbig---
remove CLOTHES to reply) wrote:
> My hobbyist cluster currently consists of:
>
>    VAX 4000-105A
>    VAXstation 4000-90A
>    DEC 3000 - M600
>
> Each system has a 2- or 3-member shadow set as its system disk.  There
> are some non-shadowed disks (including CD-ROMs) and some 2-member shadow
> sets distributed among the nodes (each member has a direct connection to
> only one machine).  In particular, DISK$USER has members on each of the
> VAXes.  I haven't changed much in 2 or 3 years.
>
> Starting several weeks ago, and becoming more frequent in the last
> couple of weeks, the VAX 4000-105A spontaneously reboots.  Even though
> SHADOW_MBR_TMO is set to 10 minutes and MVTIMEOUT to one hour
> (SHADOW_SYS_TMO is 2 minutes but that isn't relevant here), after such a
> reboot everything looks OK on the VAX 4000-105A but on (usually just one
> of) the other machines, the system-disk shadow set and the CD-ROM on the
> VAX 4000-105A and the DISK$USER shadow set have gone into mount-verify
> timeout.  This has always happened during the night, so I don't know how
> long the spontaneous reboot takes.  I can just dismount and remount the
> system-disk shadow set and the CD-ROM on the VAX 4000-105A from the
> other nodes, but since DISK$USER has gone into mount-verify timeout, I
> have to reboot the corresponding node.  (Note that SYSUAF etc are all on
> DISK$USER.)  I can't dismount it since it contains open files.  I
> haven't tried DISMOUNT/ABORT in such a situation.  Should I?  With
> DISK$USER inaccessible, various applications will fail.  A reboot is
> probably quicker than getting everything going again by hand.  (If it is
> the VAXstation 4000-90A which needs to be rebooted, then I can dismount
> and remount the member of DISK$USER on it from the Alpha, so that I get
> just a minicopy when the VAXstation 4000-90A comes back up.)
>
> Note that every time this has happened, DISK$USER was in the shadow-copy
> state, copying from the member on the VAXstation 4000-90A to the member
> on the VAX 4000-105A---even if DISK$USER as a shadow set isn't
> accessible to the VAXstation 4000-90A and its members show up only as
> remote shadow members.
>
> I doubt it is possible to avoid these problems without creating more as
> long as the spontaneous reboots are happening.  However, I want to get
> rid of the spontaneous reboots.  ANALYZE/CRASH says:
>
>       OpenVMS (TM) VAX System dump analyzer
>
>    Dump taken on  1-JAN-2009 06:04:26.14
>    SHADDETINCON, SHADOWING detects inconsistent state
>
> HELP/MESSAGE says:
>
>  SHADDETINCON,  SHADOWING detects inconsistent state
>
>   Facility:     BUGCHECK, System Bugcheck
>
>   Explanation:  The volume shadowing software reached an irrecoverable or
>                 inconsistent state because a shadow set failed an internal
>                 consistency check.
>
>   User Action:  Note the conditions leading to the error and contact a Compaq
>                 support representative. If the system is configured to produce
>                 a memory dump, retain the dump file.
>
> I don't see how I can "Note the conditions leading to the error".
>
> Since the hardware setup hasn't changed in years, and since I'm not
> seeing any additional errors, my assumption is that the VAX 4000-105A
> is acting up.  Fortunately, I have an identical spare (thanks Hans!), so
> I plan to swap the machines today.  If the problem goes away, then
> presumably there was a fault with the machine, but who knows what it
> could be.
>
> Actually, I can't swap out everything since I put all the memory for the
> VAX 4000-105A I have (128 MB) in the one currently in the cluster, so I
> will remove it and put it in the spare.  I don't think this is a problem
> with the memory.
>
> Any further suggestions?

Actually, I don't think power is the issue. Phillip lives in Germany,
where mains power is rather reliable.

My suggestion would be to have a good look at the network. The systems
mentioned are all 10 Mb/s systems, so they're connected with thinwire
coax or have UTP transceivers on their AUI ports. (Unless Phillip runs
a 10BASE5 "classic" Ethernet in his house, which I doubt :-)
Thinwire cables can go bad, and transceivers can fail too. As I read
the original post, the problem is getting more frequent over time,
which points to failing hardware.

Look at the error counters of other protocols, such as LAT, DECnet or
IP. LAT especially, since it is very sensitive to problems in the
physical layer and may provide you with clues.
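
To see whether the counters are actually climbing rather than just
non-zero, it helps to compare two snapshots taken some days apart. Below
is a minimal Python sketch of that comparison. It assumes you have
already copied the numbers by hand (for example from the NCP or LATCP
counter displays) into dictionaries; the counter names and values are
made up for illustration, and treating a counter that went down as
"reset by a reboot" is my assumption, not anything VMS reports.

```python
from typing import Dict


def counter_growth(earlier: Dict[str, int], later: Dict[str, int]) -> Dict[str, int]:
    """Return only the counters that grew between two snapshots.

    Assumption: a counter that is lower in the later snapshot was zeroed
    by a reboot in between, so its current value is taken as the growth
    since that reset.
    """
    growth = {}
    for name, new_value in later.items():
        old_value = earlier.get(name, 0)
        if new_value < old_value:
            delta = new_value          # counter was reset; count from zero
        else:
            delta = new_value - old_value
        if delta > 0:
            growth[name] = delta
    return growth


# Hypothetical snapshots, typed in by hand from two counter displays.
first = {"receive_failures": 3, "send_failures": 0, "collision_check_failures": 1}
second = {"receive_failures": 41, "send_failures": 0, "collision_check_failures": 9}

# Print the fastest-growing counters first.
for name, delta in sorted(counter_growth(first, second).items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{delta}")
```

A counter that grows steadily on one node but not on the others would
point at that node's cable or transceiver rather than at the network as
a whole.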
Happy New Year!
Hans