[Info-vax] TCP/IP Services SSH and new router difficulties

Sun Oct 2 12:00:18 EDT 2011

On Oct 2, 2:12 pm, MG <marcog... at SPAMxs4all.nl> wrote:
> On 1-10-2011 19:30, John Wallace wrote:
>
> > This might be the time to make sure you understand the basics of IP
> > (TCP vs UDP, role of routers, significance of MTU, etc). Then learn a
> > bit about how to use TCPDUMP or wireshark, and then come back with a
> > bit more info.
>
> Thank you for the input, I guess.  However, your presumption is wrong.
> I *do* understand IP and a lot more than simply the basics.  I have
> already said, or indicated, that I strongly presume it's something
> specific to TCP/IP Services (hence the subject title).  I'm sorry if I
> somehow gave you the impression that I was looking for someone to give
> me a breakdown of IP.
>
> > Based more on intuition+experience than anything else, I'd guess you
> > may have a packet loss problem somewhere, and the intrinsic behaviour
> > of TCP (guaranteed delivery via timeout+retry) means that eventually
> > the lost packet and what followed it is retransmitted.
>
> Yes, I figured out that much.
>
> > In the absence of supporting evidence other than what's posted here,
> > I'd start by looking to see whether you have a Path MTU Discovery
> > problem somewhere along the way between the two end systems of
> > interest, a problem which is allowing smaller packets through but
> > fragmenting big ones, but one end isn't aware of the size limit along
> > the way. The consequence is that one end occasionally sends
> > unusually large packets which somebody along the way drops (because
> > they're too big).
>
> That's all nice and well, but (as I said before), it is only the VMS
> nodes that are causing problems with the forwarded SSH TCP ports (22)
> at the moment.
>
> So, again, as I said before: I'm wondering if there's perhaps an
> OpenVMS TCP/IP Services "ifconfig" variable that could be the culprit.
> In other words, if my problems sound familiar (to not have to reinvent
> the wheel, in terms of solving certain problems).
>
> > Circumstances like this can occur when a network administrator
> > foolishly blocks all flavours of "ping" (aka ICMP) packet; when that
> > happens it is impossible for Path MTU Discovery to work right, which
> > will lead to problems when perfectly valid (but big) TCP packets need
> > fragmenting (and re-assembling when received); a router's reply
> > message to the sender after dropping such a packet is an important
> > part of Path MTU discovery. The reply is an ICMP packet which silly
> > network admins (including those in commercial ISPs) may sometimes
> > block.
>
> How do you explain that all other forwarded SSH ports (like for iLO
> SSH and some other systems in my network) work fine, but just the
> VMS TCP/IP SSH ports budge?
>
> > Don't forget all the usual "systematic troubleshooting" stuff, e.g.
> > Has this ever worked?
>
> Yes, as I said several times before.
>
> > Grossly oversimplified, and from distant non-error-correcting memory,
> > apologies for any errors or significant omissions. I had great fun
> > doing this kind of thing when mass market broadband first arrived in
> > the UK ten years ago, and various consumer ISPs (and more importantly
> > their wholesaler connectivity provider) didn't really have much of a
> > clue.
>
> Sorry, but did you just say I don't have much of a clue...?  (I hope I
> am interpreting this incorrectly.)
>
>   - MG

No, I didn't say you don't have much of a clue (or at least didn't
intend to).

That being said, in many troubleshooting circumstances, first hand raw
evidence is a useful supplement to someone's interpretation of it,
however experienced they may be.

E.g. How are readers supposed to unambiguously interpret "It appears
as if the connection is cut off before it can even reach the system
and actually connect." which you wrote a few days ago?

As a subjective description it's a fine starting point, but it's far
from definitive from a network application troubleshooting point of
view. tcpdump/wireshark/etc are more definitive, though they often can
be overkill.

Anyway, two weeks or so have passed, you're now on the third router,
and there's still no tcpdump of the failing connection setup for an
expert (someone other than me?) to look at.

Nor is there yet a trace of the interesting new circumstances you
describe where a connection appears to be successfully established,
and some data can be successfully transferred, but symptoms
corresponding to packet loss (on oversize packets?) occur from time to
time, symptoms which you don't see when using different boxes or when
using a variety of different applications (e.g. iLO) between the same
boxes (but different ports, different IP addresses in the case of iLO,
and different protocols).

"How do you explain that all other forwarded SSH ports (like for iLO
SSH and some other systems in my network) work fine, but just the VMS
TCP/IP SSH ports budge?"

You admit you have no idea what might cause this hang and
subsequent recovery on large chunks of data. As I've already said but in
different words, in my past experience one possibility that should be
relatively easy to look into, and then rule out or investigate
further, might be that the working applications and their respective
IP stacks happen not to use big TCP packets (at least when you've been
testing them), whereas the failing application (in this case, DCL and
TCPIP services) occasionally does send a big TCP packet, which is
dropped somewhere along the way, maybe because Path MTU Discovery
didn't work right. But in the absence of hard evidence it can't be
anything more than a guess (and it may be a guess based on my
incorrect recollections).
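
To make that hypothesis concrete, here is a toy Python model of the
failure mode. It is only a sketch: the function name and the link MTUs
(1500, 1492, 1500) are made up for illustration and are not taken from
any trace of the network under discussion. It assumes every packet has
the Don't Fragment bit set, which is how PMTUD normally operates.

```python
from typing import List


def send_df_packet(size: int, link_mtus: List[int], icmp_blocked: bool) -> str:
    """Follow one IPv4 packet (DF bit set) across a list of link MTUs.

    Hypothetical model: each entry in link_mtus is the MTU of one hop.
    """
    for hop, mtu in enumerate(link_mtus, start=1):
        if size > mtu:
            # The router may not fragment (DF is set), so it drops the packet.
            if icmp_blocked:
                # The "fragmentation needed" ICMP reply never reaches the
                # sender, so the sender keeps retransmitting at the same size.
                return f"silently dropped at hop {hop}"
            return f"dropped at hop {hop}, sender told to use MTU {mtu}"
    return "delivered"


# Assumed path for illustration only: Ethernet, a 1492-byte link, Ethernet.
HYPOTHETICAL_PATH = [1500, 1492, 1500]

if __name__ == "__main__":
    for size in (100, 1492, 1500):
        for blocked in (False, True):
            print(size, blocked, send_df_packet(size, HYPOTHETICAL_PATH, blocked))
```

In this model the 100-byte and 1492-byte packets get through whether or
not ICMP is blocked, which is why small interactive traffic can look
healthy. Only the full-size packet is lost, and only with ICMP blocked
does the sender never find out why.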

There's been plenty of time for other readers with more knowledge of
TCPIP services than I have to make suggestions. It's not happened yet.
Nobody's even asked whether your TCPIP patches are up to date. Maybe
that's obvious from your SHOW VERSION output which just says "version
5.7"?

The TCPIP services command to force a given MTU is documented (is it
safe for me to assume you know what it is?). In case that's not a safe
assumption, the V5.4 manual says it's "SET PROTOCOL TCP /
MTU_SEGMENT_SIZE=".

You may also want to look into whether PMTUD is in fact enabled or not
on your IA64 box. Something along the lines of:
  TCPIP> sysconfig -q inet
  TCPIP> sysconfig -r inet pmtu_enabled=0 (or 1 to enable it)
The first (a query; -q rather than -c, if memory serves) will tell you
about the system-wide state of PMTUD, and the second changes it. There
are route-specific rather than system-wide variants of this in The Book
as well; you know where to look.

If I were in your position and experimentation was possible and safe,
I'd try a quick experiment with PMTUD disabled and MTU forced to
something ridiculously low (for a proper network) such as 512, and see
what happens. It can certainly prove my hypothesis was wrong, but
that's what experiments are for.
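
For a sense of what a forced MTU means for TCP, the arithmetic can be
sketched in a few lines of Python. This assumes plain IPv4 and TCP
headers of 20 bytes each with no options; real connections often carry
TCP options (e.g. timestamps) that shrink the payload further. The
function name is hypothetical, and 1492 is included only as a commonly
seen smaller-than-Ethernet value, not as a claim about this network.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
TCP_HEADER_BYTES = 20   # TCP header without options


def max_tcp_payload(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES
    if payload <= 0:
        raise ValueError(f"MTU {mtu} leaves no room for TCP data")
    return payload


if __name__ == "__main__":
    for mtu in (1500, 1492, 512):
        print(f"MTU {mtu}: at most {max_tcp_payload(mtu)} bytes of TCP data per packet")
```

With an MTU of 512 that works out to 472 bytes of data per packet, small
enough to pass any sane link, so if the hangs disappear at that setting
the oversize-packet theory gains some support.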

You say you know what you're doing with IP networking. Maybe other
readers who haven't seen the fun that can be had with PMTUD failures
might enjoy one of the many writeups of PMTUD failure modes, e.g.
the one at
http://www.netheaven.com/pmtu.html - again, no guarantees it's the
problem here, but the symptoms described are similar (but not
identical, e.g. there's no hang+recovery scenario described there).

Lots of words, partly because I may not be around much for the
next few days.

Good luck.