[Info-vax] OpenVMS async I/O, fast vs. slow
Dan Cross
cross at spitfire.i.gajendra.net
Sat Nov 4 10:14:47 EDT 2023
In article <ui59v2$rn5$3 at news.misty.com>,
Johnny Billquist <bqt at softjar.se> wrote:
>On 2023-11-03 22:45, Dan Cross wrote:
>> In article <ff7f2845-84cc-4c59-a007-1b388c82543fn at googlegroups.com>,
>> Jake Hamby (Solid State Jake) <jake.hamby at gmail.com> wrote:
>>> I've become a little obsessed with the question of how well
>>> OpenVMS performs relative to Linux inside a VM, under different
>>> conditions. My current obsession is the libuv library which
>>> provides a popular async I/O abstraction layer implemented for
>>> all the different flavors of UNIX that have async I/O, as well
>>> as for Windows. What might a VMS version look like? How many
>>> cores could it scale up to without too much synchronization
>>> overhead?
>>>
>>> Alternatively, for existing VMS apps, how might they be sped up
>>> on non-VAX hardware? Based on the mailbox copy driver loop in
>>> the VMS port of Perl that I spent some time piecing together,
>>> I've noticed a few patterns that can't possibly perform well on
>>> any hardware newer than Alpha, and maybe not on Alpha either.
>>>
>>> My first concern is the use of VAX interlocked queue
>>> instructions, because they're system calls on Itanium and x86
>>> (and PALcode on Alpha). They can't possibly run as fast as VMS
>>> programmers may be assuming based on what made sense on a VAX
>>> 30 years ago. The current hotness is to use lock-free data
>>> structures as much as possible.
>>
>> I don't know that that's the current hotness so much as trying
>> to structure problems so that the vast majority of accesses are
>> local to a core, so that either locks are unnecessary, or
>> taking a lock is just an uncontended write to an owned cache
>> line.
>>
>> Consider the problem of a per-core queue with work-stealing;
>> the specifics of what's in the queue don't matter so much as the
>> overall structure of the problem: items in the queue might
>> represent IO requests, or they may represent runnable threads,
>> or whatever. Anyway, a core will usually process things on
>> its own local queue, but if it runs out of things to do, it
>> may choose some victim and steal some work from it.
>> Conventionally, this will involve everyone taking locks on
>> these queues, but that's ok: we expect that work stealing is
>> pretty rare, and that usually a core is simply locking its own
>> queue, which will be an uncontended write on a cacheline it
>> already owns. In any event, _most of the time_ the overhead
>> of using a "lock" will be minimal (assuming relatively simple
>> spinlocks or MCS locks or something like that).
>
>I generally agree with everything you posted. But I think it's
>unfortunate that you bring cache lines into this, as it's somewhat
>incorrect to talk about "owning" a cache line, and cache lines are not
>really that relevant at all in this context.
This is an odd thing to say. It's quite common to talk about
core ownership of cache lines in both MESI and MOESI, which are
the cache coherency protocols in use on modern x86 systems from
Intel and AMD, respectively. Indeed, the "O" in "MOESI" is
for "owned" (https://en.wikipedia.org/wiki/MOESI_protocol).
Cache lines and their ownership are massively relevant in the
context of mutual exclusion, atomic operations, and working on
shared memory generally.
>I think you already know this - but multiple CPUs can have the same
>memory cached, and that's just fine. No one "owns" it. The exciting
>moment is when you write data. Cached or not (it doesn't matter). When
>writing, anything in any cache on any other CPU needs to be either
>invalidated or updated with the new data. Most common is that it just
>gets invalidated.
This is incorrect, or at least conflating several things that
make it sufficiently inaccurate overall as to be effectively
incorrect.
Note that when we talk about ownership in this context, we talk
about ownership of a _cache line_, which is different than
ownership of _memory_. Recalling that the context is x86, we
can review the cache coherency protocols in use on
multiprocessor x86 systems, MESI and MOESI, which are based on
driving a state machine around snooping on accesses to "cache
lines", which are in turn relatively fine divisions of aligned
RAM regions (e.g., 64 bytes or 128 bytes are common). As
different processors in the system operate on memory, the cache
units use the coherency protocol to synchronize the state of
each processor's cache with respect to others. In these
protocols, we may think of there being a dedicated bus for cache
coherency operations. In both MESI and MOESI, when a cache line
is in the "exclusive" state for a CPU, we say that that CPU
"owns" that cache line. In MOESI there is, as mentioned above,
also a dedicated "Owned" state.
So "ownership" is very much a part of the model, and is used in
both the Intel SDM and AMD APM when discussing cache coherency
protocols.
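To make that concrete, here is a small illustration in C11 (the
struct and field names are invented for this example, not taken
from any real code base): two per-CPU counters packed into one
64-byte line force that line to ping-pong between cores on every
write, whereas padding each counter out to its own line lets each
core keep its line in an exclusive (or Owned/Modified) state.

    #include <stdatomic.h>

    #define CACHE_LINE 64   /* typical x86 line size, as noted above */

    /* Bad: both counters share one line, so a write by either core
     * invalidates the other core's copy (false sharing). */
    struct counters_shared {
        atomic_long a;      /* updated by core 0 */
        atomic_long b;      /* updated by core 1 */
    };

    /* Better: each counter gets a whole line to itself, so each core
     * can hold its own line exclusively and write to it cheaply. */
    struct counters_padded {
        _Alignas(CACHE_LINE) atomic_long a;
        _Alignas(CACHE_LINE) atomic_long b;
    };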
>But that's all there is to it. Now, it is indeed very costly if you have
>many CPUs trying to spin-lock on the same data, because each one will be
>hitting the same memory, causing a big pressure on that memory address.
More accurately, you'll be pushing the associated cache line
through a huge number of state transitions as different cores
vie for exclusive ownership of the line in order to write the
lock value. Beyond a handful of cores, the generated cache
coherency traffic begins to dominate, and overall throughput
_decreases_.
This is of paramount importance when we start talking about
things like synchronization primitives, which are based on
atomic updates to shared memory; cache-inefficient algorithms
simply do not scale beyond a handful of actors, which is why
things like MCS locks or CLH locks are used on many-core
machines.
See, for example:
https://people.csail.mit.edu/nickolai/papers/boyd-wickizer-locks.pdf
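As a rough sketch of where that traffic comes from, consider a
test-and-test-and-set spinlock in C11 (spinlock_t, spin_lock, and
spin_unlock are names invented for this example): the inner loop
spins on plain loads of the core's cached copy of the line, and
only retries the atomic exchange, which demands exclusive
ownership of the line, once the lock looks free. A fully naive
lock that hammers the exchange in a tight loop generates far more
coherency traffic.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_bool held;
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        for (;;) {
            /* The atomic exchange needs the line in an exclusive
             * state; if we got false back, the lock is ours. */
            if (!atomic_exchange_explicit(&l->held, true,
                                          memory_order_acquire))
                return;
            /* Otherwise spin on ordinary loads, which can be served
             * from this core's cache in the Shared state, instead of
             * bouncing the line around with repeated exchanges. */
            while (atomic_load_explicit(&l->held, memory_order_relaxed))
                ;
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->held, false, memory_order_release);
    }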
This isn't to say that one cannot use naive spinlocks, simply
that one must be careful when doing so. For example, a spin
lock protecting a per-core work queue that is rarely contended
may be just fine if the only other writer is a single entity
that is dispatching work to available threads.
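A minimal sketch of that arrangement, using POSIX spin locks (all
names here are illustrative; error handling and queue
initialization are elided): a core normally pops from its own
queue, which is an uncontended lock acquisition on a line it
already owns, while the dispatcher's pushes and the rare steal
are the only operations that pull the line away.

    #include <pthread.h>
    #include <stddef.h>

    #define QUEUE_CAP 256

    struct work_item {
        void (*run)(void *arg);
        void *arg;
    };

    struct cpu_queue {
        pthread_spinlock_t lock;   /* usually taken only by its owner */
        size_t head, tail;         /* ring indices, head <= tail      */
        struct work_item items[QUEUE_CAP];
    };

    /* The dispatcher adds work to some core's queue; returns 0 if full. */
    static int queue_push(struct cpu_queue *q, struct work_item w)
    {
        pthread_spin_lock(&q->lock);
        int ok = (q->tail - q->head) < QUEUE_CAP;
        if (ok)
            q->items[q->tail++ % QUEUE_CAP] = w;
        pthread_spin_unlock(&q->lock);
        return ok;
    }

    /* Common case: a core pops from its own queue.  Taking the lock is
     * a write to a cache line this core almost certainly already owns. */
    static int queue_pop(struct cpu_queue *q, struct work_item *out)
    {
        pthread_spin_lock(&q->lock);
        int ok = (q->head != q->tail);
        if (ok)
            *out = q->items[q->head++ % QUEUE_CAP];
        pthread_spin_unlock(&q->lock);
        return ok;
    }

    /* Rare case: an idle core steals from a victim's queue.  Only here
     * (and in dispatch) do two cores contend for the same line. */
    static int queue_steal(struct cpu_queue *victim, struct work_item *out)
    {
        return queue_pop(victim, out);
    }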
>Of course, each CPU will then cache the memory address, but as soon as
>anyone tries to take it, then all other CPUs' cache entries will be
>invalidated, and each one will need to read out from main memory again,
>which scales very badly with many CPUs.
Eh, close. The description on Wikipedia for MESI isn't bad:
https://en.wikipedia.org/wiki/MESI_protocol
And the description of MOESI in the AMD APM is pretty good (sec. 7.3
of vol 2):
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
>So, lockless algorithms, or clever locks that spread out over many
>addresses, can help a lot here, since memory is so slow compared to
>cache. Cache invalidation is costly. But essential for correctly working
>software.
Yes and no. The point I have been trying to make is that
minimizing the need for cache coherency traffic scales;
"lockless" algorithms are still almost always based on atomic
operations and may collapse under load, as pointed out by
the Boyd-Wickizer paper. If you want scalability, you have to
come up with some way to avoid doing that. E.g., MCS locks,
while a fairly traditional mutual exclusion primitive, may be
far more efficient than a "lockless" protocol.
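For comparison, here is a hypothetical sketch of an MCS-style
queue lock in C11 atomics (the names are invented for this
example): each waiter spins on a flag in its own queue node, so
the only cross-core traffic is the tail exchange on acquire and
the hand-off at release, rather than every waiter pounding on one
shared lock word.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;        /* true while this waiter must spin */
    };

    typedef _Atomic(struct mcs_node *) mcs_lock;   /* tail of the queue */

    static void mcs_acquire(mcs_lock *lock, struct mcs_node *self)
    {
        atomic_store_explicit(&self->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&self->locked, true, memory_order_relaxed);

        /* Swap ourselves in as the new tail of the waiter queue. */
        struct mcs_node *prev =
            atomic_exchange_explicit(lock, self, memory_order_acq_rel);
        if (prev == NULL)
            return;                /* queue was empty: lock acquired */

        /* Link in behind the previous tail and spin on our own node. */
        atomic_store_explicit(&prev->next, self, memory_order_release);
        while (atomic_load_explicit(&self->locked, memory_order_acquire))
            ;                      /* local spin, no shared-line traffic */
    }

    static void mcs_release(mcs_lock *lock, struct mcs_node *self)
    {
        struct mcs_node *succ =
            atomic_load_explicit(&self->next, memory_order_acquire);
        if (succ == NULL) {
            /* No known successor: try to swing the tail back to NULL. */
            struct mcs_node *expected = self;
            if (atomic_compare_exchange_strong_explicit(
                    lock, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;            /* nobody was waiting */
            /* A successor is mid-enqueue; wait for it to link in. */
            while ((succ = atomic_load_explicit(
                        &self->next, memory_order_acquire)) == NULL)
                ;
        }
        /* Hand the lock directly to the successor. */
        atomic_store_explicit(&succ->locked, false, memory_order_release);
    }

The price is that callers have to supply a queue node per
acquisition, which is part of why MCS-style locks tend to live
inside kernels and runtimes rather than behind a plain mutex API.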
>(And then we had the PDP-11/74, which had to implement cache coherency
>between CPUs without any hardware support...)
Sounds rough. :-/
- Dan C.