[Info-vax] OpenVMS async I/O, fast vs. slow

Johnny Billquist bqt at softjar.se
Sat Nov 4 07:29:38 EDT 2023


On 2023-11-03 22:45, Dan Cross wrote:
> In article <ff7f2845-84cc-4c59-a007-1b388c82543fn at googlegroups.com>,
> Jake Hamby (Solid State Jake) <jake.hamby at gmail.com> wrote:
>> I've become a little obsessed with the question of how well
>> OpenVMS performs relative to Linux inside a VM, under different
>> conditions. My current obsession is the libuv library which
>> provides a popular async I/O abstraction layer implemented for
>> all the different flavors of UNIX that have async I/O, as well
>> as for Windows. What might a VMS version look like? How many
>> cores could it scale up to without too much synchronization
>> overhead?
>>
>> Alternatively, for existing VMS apps, how might they be sped up
>> on non-VAX hardware? Based on the mailbox copy driver loop in
>> the VMS port of Perl that I spent some time piecing together,
>> I've noticed a few patterns that can't possibly perform well on
>> any hardware newer than Alpha, and maybe not on Alpha either.
>>
>> My first concern is the use of VAX interlocked queue
>> instructions, because they're system calls on Itanium and x86
>> (and PALcode on Alpha).  They can't possibly run as fast as VMS
>> programmers may be assuming based on what made sense on a VAX
>> 30 years ago. The current hotness is to use lock-free data
>> structures as much as possible.
> 
> I don't know that that's the current hotness so much as trying
> to structure problems so that the vast majority of accesses are
> local to a core, so that either locks are unnecessary, or
> taking a lock is just an uncontended write to an owned cache
> line.
> 
> Consider the problem of a per-core queue with work-stealing;
> the specifics of what's in the queue don't matter so much as the
> overall structure of the problem: items in the queue might
> represent IO requests, or they may represent runnable threads,
> or whatever.  Anyway, a core will usually process things on
> its own local queue, but if it runs out of things to do, it
> may choose some victim and steal some work from it.
> Conventionally, this will involve everyone taking locks on
> these queues, but that's ok: we expect that work stealing is
> pretty rare, and that usually a core is simply locking its own
> queue, which will be an uncontended write on a cacheline it
> already owns.  In any event, _most of the time_ the overhead
> of using a "lock" will be minimal (assuming relatively simple
> spinlocks or MCS locks or something like that).
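
For concreteness, the per-core queue with work stealing that you
describe might look roughly like this in C. This is only a sketch,
assuming plain pthreads mutexes; all the names here are made up:

#include <pthread.h>
#include <stddef.h>

#define NCORES 8

struct work {
    struct work *next;
    /* ... an IO request, a runnable thread, whatever ... */
};

struct core_queue {
    pthread_mutex_t lock;   /* almost always taken by its own core */
    struct work *head;
};

static struct core_queue queues[NCORES];

/* Common case: a core pops from its own queue.  The lock is almost
   always uncontended, so taking it is cheap. */
static struct work *pop_queue(int core)
{
    struct core_queue *q = &queues[core];
    struct work *w;

    pthread_mutex_lock(&q->lock);
    w = q->head;
    if (w != NULL)
        q->head = w->next;
    pthread_mutex_unlock(&q->lock);
    return w;
}

/* Rare case: nothing local, so pick victims and steal.  Only here do
   two cores ever contend on the same lock. */
static struct work *get_work(int self)
{
    struct work *w = pop_queue(self);

    for (int victim = 0; w == NULL && victim < NCORES; victim++)
        if (victim != self)
            w = pop_queue(victim);
    return w;
}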

I generally agree with everything you posted. But I think it's 
unfortunate that you bring cache lines into this, as it's somewhat 
incorrect to talk about "owning" a cache line, and cache lines are not 
really that relevant in this context.

I think you already know this - but multiple CPUs can have the same 
memory cached, and that's just fine. No one "owns" it. The exciting 
moment is when you write data, cached or not (it doesn't matter). When 
you write, that address in any cache on any other CPU needs to be either 
invalidated or updated with the new data. Most commonly it just gets 
invalidated.
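
To see how much those invalidations can hurt even without any locks at
all, here is the classic "false sharing" demonstration, sketched in C
with pthreads (the counts are arbitrary): two threads write two
different variables, but the variables happen to sit in the same cache
line, so every increment by one thread invalidates the line in the
other one's cache:

#include <pthread.h>

static struct {
    long a;     /* written only by thread 1 */
    long b;     /* written only by thread 2, but same cache line */
} shared;

static void *bump_a(void *arg)
{
    for (long i = 0; i < 100000000L; i++)
        shared.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    for (long i = 0; i < 100000000L; i++)
        shared.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Pad the struct so a and b land in different lines, and the same
program typically runs several times faster on a multicore machine.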

But that's all there is to it. Now, it is indeed very costly if you have 
many CPUs trying to spin-lock on the same data, because each one will be 
hitting the same memory, putting heavy pressure on that memory address. 
Of course, each CPU will then cache the address, but as soon as anyone 
tries to take the lock, every other CPU's cache entry is invalidated, 
and each one will need to read the data from main memory again, which 
scales very badly with many CPUs.
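
This is also why even a simple spinlock is usually written to spin on
a read rather than on the atomic operation itself. A sketch with C11
atomics (not any particular implementation):

#include <stdatomic.h>

/* Naive version: every failed attempt is a write, so each spinning
   CPU keeps invalidating the line in all the other caches. */
static void spin_lock_naive(atomic_int *lock)
{
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
        ;   /* the exchange dirties the line even when it fails */
}

/* Test-and-test-and-set: spin on a plain read, so the line can stay
   shared in every waiter's cache until the lock is released; only
   then does anyone attempt the (invalidating) write. */
static void spin_lock_ttas(atomic_int *lock)
{
    for (;;) {
        while (atomic_load_explicit(lock, memory_order_relaxed))
            ;   /* read-only spin: no invalidations while we wait */
        if (!atomic_exchange_explicit(lock, 1, memory_order_acquire))
            return;
    }
}

static void spin_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}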

So, lockless algorithms, or clever locks that spread the contention out 
over many addresses, can help a lot here, since memory is so slow 
compared to cache. Cache invalidation is costly, but essential for 
correctly working software.
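
One common way of spreading the writes out over many addresses is a
per-CPU counter, with each slot padded out to its own cache line. A
sketch with C11 atomics; the line size and CPU count here are
assumptions, and a real version would need a proper per-CPU index:

#include <stdatomic.h>

#define NCPU 64
#define CACHE_LINE 64

/* One counter slot per CPU, each in its own cache line, so increments
   on different CPUs never touch the same line.  Reads sum the slots. */
struct percpu_counter {
    struct {
        atomic_long v;
        char pad[CACHE_LINE - sizeof(atomic_long)];
    } slot[NCPU];
};

static void counter_add(struct percpu_counter *c, int cpu, long n)
{
    atomic_fetch_add_explicit(&c->slot[cpu].v, n, memory_order_relaxed);
}

static long counter_read(struct percpu_counter *c)
{
    long sum = 0;

    for (int i = 0; i < NCPU; i++)
        sum += atomic_load_explicit(&c->slot[i].v, memory_order_relaxed);
    return sum;
}
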
(And then we had the PDP-11/74, which had to implement cache coherency 
between CPUs without any hardware support...)

   Johnny



