[Info-vax] Losing all LAT connections (More answered questions)
johnwallace4 at yahoo.co.uk
Sun Apr 19 17:36:51 EDT 2009
On Apr 17, 4:09 pm, JCamCMKRNL <jcam90... at earthlink.net> wrote:
> First, thanks to all who have responded. Your information has been
> very valuable.
> So far, we have not had another occurrence of this dropping of all
> LAT connections on one system. Just the original three occurrences in
> the past three weeks. The information in the LAT counters does seem to
> indicate that the problem will occur again. It is just a matter of
> when.
>
> Several of you asked some more questions about this issue, so I have
> gathered the questions and the answers below. I hope I have hit all of
> your queries. In particular, I think the very last question here and
> its answer is very important.
> ----------------
> > what does the current output of this show?
>
> MCR NCP SHOW COUNTER KNOW CIRC
>
> It is very clean:
> Known Circuit Counters as of 17-APR-2009 06:30:10
>
> Circuit = ISA-0
>
> >65534 Seconds since last zeroed
> 0 Terminating packets received
> 0 Originating packets sent
> 0 Terminating congestion loss
> 0 Transit packets received
> 0 Transit packets sent
> 0 Transit congestion loss
> 0 Circuit down
> 0 Initialization failure
> 0 Adjacency down
> 0 Peak adjacencies
> 28945 Data blocks sent
> 1447250 Bytes sent
> 0 Data blocks received
> 0 Bytes received
> 0 Unrecognized frame destination
> 0 User buffer unavailable
>
> > Make this MC NCP SHOW KNOWN LINE COUNTERS
>
> This is clean except some send failures/collisions:
> Known Line Counters as of 17-APR-2009 06:31:45
>
> Line = ISA-0
>
> >65534 Seconds since last zeroed
> 1691897 Data blocks received
> 25491 Multicast blocks received
> 0 Receive failure
> 78211496 Bytes received
> 1529460 Multicast bytes received
> 0 Data overrun
> 2240057 Data blocks sent
> 37989 Multicast blocks sent
> 87 Blocks sent, multiple collisions
> 102 Blocks sent, single collision
> 1173 Blocks sent, initially deferred
> 107990422 Bytes sent
> 1729968 Multicast bytes sent
> 8030 Send failure, including:
> Carrier check failed
> 8030 Collision detect check failure
> 0 Unrecognized frame destination
> 0 System buffer unavailable
> 0 User buffer unavailable
>
> > Jeff, you write that "the counters show no errors, but this was interesting from the MCR LATCP SHOW LINK/COUNT ...etc"
> > What counters show no errors?
>
> Here is the complete output of the LAT LINK counters:
> Link Name: LAT$LINK
> Device Name: _EZA4:
>
> Seconds Since Zeroed: 65535
> Messages Received: 1693146
> Multicast Msgs Received: 25517
> Bytes Received: 78269314
> Multicast Bytes Received: 1531020
> System Buffer Unavailable: 0
> Unrecognized Destination: 0
>
> Messages Sent: 2241710
> Multicast Msgs Sent: 38006
> Bytes Sent: 108119723
> Multicast Bytes Sent: 1730717
> User Buffer Unavailable: 0
> Data Overrun: 0
>
> Receive Errors -
> Block Check Error: No
> Framing Error: No
> Frame Too Long: No
> Frame Status Error: No
> Frame Length Error: No
>
> Transmit Errors -
> Excessive Collisions: No
> Carrier Check Failure: Yes
> Short Circuit: Yes
> Open Circuit: Yes
> Frame Too Long: Yes
> Remote Failure To Defer: No
> Transmit Underrun: Yes
> Transmit Failure: No
>
> CSMACD Specific Counters
> ------------------------
>
> Transmit CDC Failure: 8030
>
> Messages Transmitted -
> Single Collision: 102
> Multiple Collisions: 87
> Initially Deferred: 1173
>
> > The transceiver on the DELNI wasn't replaced recently was it?
>
> No. It is the original H4000 which was installed about 6 years ago.
>
> > You wrote that there is no 10BASE2 gear available. Is it possible to
> > use RJ45 transceivers (enable heartbeat), a low speed UTP switch or
> > hub with an AUI port on it?
>
> I do have a Black Box switch with one AUI port, and 8 10-BaseT RJ45
> ports available, but it requires change control paperwork to connect
> it to the DEC Network. I would like to avoid doing this.
>
> > If it turns out that it is not hardware, is it possible that there is a
> > PC (or other 100MB equipment) connected to the backbone somewhere?
>
> No. At this time all equipment on the DEC Network is 100% Digital
> Equipment (Not Compaq, not HP) hardware.
>
> > If you can get the DECserver counters easily ...
> > please post them; they may not add anything to the picture,
> > but they might.
>
> Here are the results from one of the many DECServers.
> Node LIMS is the VAX, all the others are PDP-11s running RSX.
>
> Local> SHOW NODE ALL COUNTERS
>
> Node: ALICE
> Seconds Since Zeroed: 1985926
> Messages Received: 1262
> Messages Transmitted: 1133
> Slots Received: 638
> Slots Transmitted: 864
> Bytes Received: 17554
> Bytes Transmitted: 768
>
> Multiple Node Addresses: 0
> Duplicates Received: 0
> Messages Re-transmitted: 6
> Illegal Messages Received: 0
> Illegal Slots Received: 0
> Solicitations Accepted: 0
> Solicitations Rejected: 0
>
> Node: IRV70A
> Seconds Since Zeroed: 2577531
> Messages Received: 0
> Messages Transmitted: 0
> Slots Received: 0
> Slots Transmitted: 0
> Bytes Received: 0
> Bytes Transmitted: 0
>
> Multiple Node Addresses: 0
> Duplicates Received: 0
> Messages Re-transmitted: 0
> Illegal Messages Received: 0
> Illegal Slots Received: 0
> Solicitations Accepted: 0
> Solicitations Rejected: 0
>
> Node: LIMS
> Seconds Since Zeroed: 2577490
> Messages Received: 122179
> Messages Transmitted: 88449
> Slots Received: 76864
> Slots Transmitted: 65984
> Bytes Received: 6861709
> Bytes Transmitted: 494411
>
> Multiple Node Addresses: 0
> Duplicates Received: 6
> Messages Re-transmitted: 1
> Illegal Messages Received: 0
> Illegal Slots Received: 0
> Solicitations Accepted: 0
> Solicitations Rejected: 0
>
> Node: MINNIE
> Seconds Since Zeroed: 2577505
> Messages Received: 69814
> Messages Transmitted: 66573
> Slots Received: 13149
> Slots Transmitted: 13931
> Bytes Received: 779022
> Bytes Transmitted: 15412
>
> Multiple Node Addresses: 0
> Duplicates Received: 0
> Messages Re-transmitted: 0
> Illegal Messages Received: 0
> Illegal Slots Received: 0
> Solicitations Accepted: 0
> Solicitations Rejected: 0
>
> > You have spares for all the active network kit, right?
>
> Yes. We are planning to start by swapping out the DELNI in the Data
> Center to see if this helps.
>
> > Same goes for the MicroVAX itself. Do you have a spare network card (DELQA) you could plug in?
>
> Yes. If the problem rears its head again after swapping out the
> DELNI then we will swap out the DELQA.
> If it continues to fail after that, then we will swap out the H4000
> transceiver.
>
> > Has anybody installed any significant new electrical kit on the
> > factory floor recently ?
>
> No. The last physical change to the network was performed 2 months
> before the first occurrence of this VAX/LAT connection dropout problem,
> and that change was just adding another DECServer 200 to another DELNI
> in a remote IDF closet.
>
> > How much is the disruption costing you? Enough to make it worthwhile upgrading the backbone to modern technology?
>
> The VAX system is used exclusively for interactive sessions by many of
> our Medical Labs for the entry of laboratory test results for QA/QC
> records for the FDA. As long as the interruptions are not prolonged,
> there are no serious impacts to production. The PDP-11s are
> considerably involved in production. They must stay up for production
> to continue but they do not use the network for production operations.
> The network for the PDP-11s is simply for production monitors to see
> what is going on and to pass data to the VAX. We can take the network
> down for planned upgrades or changes for up to 2-3 hours without
> significant impact to production.
>
> > Forgot to say: wrt user sessions being dropped, you do know you
> > can avoid that using VMS's "virtual terminal" feature, right? When the
> > LAT session is dropped, the VMS session continues and can in principle
> > be resumed from where it left off once the user reconnects their
> > session.
>
> I seem to remember this feature in VMS from a long time ago, but for
> some reason when the LAT connections are dropped, all the interactive
> processes are stopped and the users are logged out.
>
> > When the LAT service of the VMS machine had disappeared from the
> > DECservers, did you still try to connect to the service somehow (e.g.
> > SET H /LAT <VAX>)? If you did, was it successful?
>
> I just enabled outgoing LAT on the VAX, and I can do a SET HOST/LAT
> and it works now, but the problem of all LAT connections dropping has
> not happened again since I enabled outgoing LAT.
>
> > You know what they say, put a monkey in front of a keyboard and
> > eventually he'll come up with something intelligent.
>
> We have had hundreds of Monkeys using this network for about 30 years.
> So far no sign of intelligence.
> ==========================
> Thanks again for all of your input.
>
> Jeff Cameron
Thanks for the update.
Should we be pleased it hasn't happened again? Or did the fault
perhaps happen again anyway, just with less visible consequences?
The VAX line counters and the counters from the DECservers all see
carrier detect check failures; the DECservers even say open circuit
detected and short circuit detected. Something's broken or
misconfigured (well duh!) and because the interconnects are DELNIs and
coax there's no current way to localise the problem (no repeaters,
switches, or bridges dividing the network into separate fault
domains); wherever it is, it will likely be visible everywhere on that
LAN.
One suggestion I would make is that you zero the DECnet counters
periodically so that you have at least some idea of how old the
numbers are. The architected way of doing this is the counter timer
associated with the object, e.g.
$ MC NCP SET LINE ISA-0 COUNTER TIMER 14400 ! reset counters every four hours, or whatever
In order to make this useful you then also need to have DECnet logging
set up to record the counter values before they are zeroed. It can be
done reasonably simply, but right now I can't remember the details, so
if you haven't done that logging kind of thing before you may find a
simple DCL loop in a batch job easier to get working:
$ loop:
$ ! Log the current counter values, then zero them for the next interval
$ MC NCP SHOW LINE ISA-0 COUNTERS
$ MC NCP ZERO LINE ISA-0 COUNTERS
$ wait 01:00:00 ! or whatever
$ goto loop
If you're a stickler for tidiness you may want to subtract a couple of
seconds from your nice round interval so the sample times don't drift
as the days go by. You may also want to make the batch job restart
automatically, etc. Usual stuff.
Once you have this set up you can keep an eye on the counters, and if
the relevant ones are non-zero you know you have had a hiccup, and you
know very roughly when it occurred. That is useful in case the problem
sometimes occurs invisibly, without causing a complete collapse of LAT
sessions.
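As a rough illustration of that check, here is a minimal Python sketch
that scans a saved copy of the batch log for non-zero error counters.
It is meant to run on some other machine against a copy of the log,
not on the VAX itself. Assumptions, none of which come from a tested
setup: the "Line Counters as of ..." header and the counter names are
copied from the NCP output quoted above, the choice of which counters
count as errors is a guess, and nonzero_errors, ERROR_COUNTERS and
SAMPLE are hypothetical names.

```python
import re

# Counter names as they appear in the NCP line counter listing quoted
# above. Which of them should be treated as "errors" is an assumption.
ERROR_COUNTERS = {
    "Receive failure",
    "Data overrun",
    "Send failure",
    "Collision detect check failure",
    "System buffer unavailable",
    "User buffer unavailable",
}

# Each SHOW starts with a header such as
# "Known Line Counters as of 17-APR-2009 06:31:45" (format assumed).
HEADER = re.compile(r"Line Counters as of\s+(.+)")

# One counter per line: optional '>' (counter overflowed), value, name.
COUNTER_LINE = re.compile(r"^\s*>?(\d+)\s+(.+?)\s*$")


def nonzero_errors(log_text):
    """Return (timestamp, counter name, value) for each non-zero error counter."""
    hits = []
    timestamp = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            timestamp = header.group(1).strip()
            continue
        match = COUNTER_LINE.match(line)
        if not match:
            continue  # blank lines, "Line = ISA-0", failure reasons, etc.
        value = int(match.group(1))
        # "Send failure, including:" is reduced to "Send failure".
        name = match.group(2).split(",")[0]
        if name in ERROR_COUNTERS and value > 0:
            hits.append((timestamp, name, value))
    return hits


# Shortened copy of the line counters quoted earlier in this message.
SAMPLE = """\
Known Line Counters as of 17-APR-2009 06:31:45

Line = ISA-0

      >65534  Seconds since last zeroed
     1691897  Data blocks received
           0  Receive failure
     2240057  Data blocks sent
        8030  Send failure, including:
                Carrier check failed
        8030  Collision detect check failure
           0  User buffer unavailable
"""

if __name__ == "__main__":
    for ts, name, value in nonzero_errors(SAMPLE):
        print(f"{ts}  {name}: {value}")
```

Run against the sample, this reports the two send-failure counters
with the timestamp of the snapshot they came from. With hourly
snapshots in the log, that timestamp narrows a hiccup down to roughly
the hour in which it occurred.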
There's probably a way of logging individual carrier check failure
events *when they happen* too (or at least close to when they happen),
rather than checking counters every few hours, if you don't mind a bit
of programming; whether it's worth going to the trouble of doing that
is for you to decide, there may be other better uses for your time.
In an ideal world you'd do something equivalent for the DECserver
counters, but since everything's currently on the same LAN segment
(electrically) it probably doesn't matter.
By the way, nice to hear that you wouldn't be able to touch the
network hardware without getting change control authorisation. You
might (or might not) be amazed how many people don't bother with that
kind of thing.
Good luck.