[Info-vax] OpenVMS async I/O, fast vs. slow
Johnny Billquist
bqt at softjar.se
Tue Nov 7 11:42:09 EST 2023
On 2023-11-04 15:14, Dan Cross wrote:
> In article <ui59v2$rn5$3 at news.misty.com>,
> Johnny Billquist <bqt at softjar.se> wrote:
>> On 2023-11-03 22:45, Dan Cross wrote:
>>> In article <ff7f2845-84cc-4c59-a007-1b388c82543fn at googlegroups.com>,
>>> Jake Hamby (Solid State Jake) <jake.hamby at gmail.com> wrote:
>>>> I've become a little obsessed with the question of how well
>>>> OpenVMS performs relative to Linux inside a VM, under different
>>>> conditions. My current obsession is the libuv library which
>>>> provides a popular async I/O abstraction layer implemented for
>>>> all the different flavors of UNIX that have async I/O, as well
>>>> as for Windows. What might a VMS version look like? How many
>>>> cores could it scale up to without too much synchronization
>>>> overhead?
>>>>
>>>> Alternatively, for existing VMS apps, how might they be sped up
>>>> on non-VAX hardware? Based on the mailbox copy driver loop in
>>>> the VMS port of Perl that I spent some time piecing together,
>>>> I've noticed a few patterns that can't possibly perform well on
>>>> any hardware newer than Alpha, and maybe not on Alpha either.
>>>>
>>>> My first concern is the use of VAX interlocked queue
>>>> instructions, because they're system calls on Itanium and x86
>>>> (and PALcode on Alpha). They can't possibly run as fast as VMS
>>>> programmers may be assuming based on what made sense on a VAX
>>>> 30 years ago. The current hotness is to use lock-free data
>>>> structures as much as possible.
>>>
>>> I don't know that that's the current hotness so much as trying
> to structure problems so that the vast majority of accesses are
>>> local to a core, so that either locks are unnecessary, or
>>> taking a lock is just an uncontended write to an owned cache
>>> line.
>>>
>>> Consider the problem of a per-core queue with work-stealing;
>>> the specifics of what's in the queue don't matter so much as the
>>> overall structure of the problem: items in the queue might
>>> represent IO requests, or they may represent runnable threads,
>>> or whatever. Anyway, a core will usually process things on
>>> its own local queue, but if it runs out of things to do, it
>>> may choose some victim and steal some work from it.
>>> Conventionally, this will involve everyone taking locks on
>>> these queues, but that's ok: we expect that work stealing is
>>> pretty rare, and that usually a core is simply locking its own
>>> queue, which will be an uncontended write on a cacheline it
>>> already owns. In any event, _most of the time_ the overhead
> of using a "lock" will be minimal (assuming relatively simple
>>> spinlocks or MCS locks or something like that).
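(To make the shape concrete for anyone following along, here is a rough
sketch of such a per-core queue in C with pthread spinlocks. The names and
the fixed-size array are mine, purely for illustration, and it ignores
queue growth and wraparound. The point is only that the local fast path
takes a lock that is almost never contended.)

#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 256

/* One of these per core. */
struct work_queue {
    pthread_spinlock_t lock;   /* nearly always taken by the owning core */
    void  *items[QUEUE_CAP];
    size_t head;               /* steal end */
    size_t tail;               /* local end */
};

void queue_init(struct work_queue *q)
{
    pthread_spin_init(&q->lock, PTHREAD_PROCESS_PRIVATE);
    q->head = q->tail = 0;
}

/* Fast path: the owning core pushes and pops at its own end.  The lock
   acquire is just a write to a line this core normally already holds. */
void push_local(struct work_queue *q, void *item)
{
    pthread_spin_lock(&q->lock);
    if (q->tail < QUEUE_CAP)        /* illustration only: no growth */
        q->items[q->tail++] = item;
    pthread_spin_unlock(&q->lock);
}

void *pop_local(struct work_queue *q)
{
    void *item = NULL;
    pthread_spin_lock(&q->lock);
    if (q->head < q->tail)
        item = q->items[--q->tail];
    pthread_spin_unlock(&q->lock);
    return item;
}

/* Slow path: an idle core steals from the other end of a victim's queue.
   This is the rare case that actually causes cross-core lock traffic. */
void *steal(struct work_queue *victim)
{
    void *item = NULL;
    pthread_spin_lock(&victim->lock);
    if (victim->head < victim->tail)
        item = victim->items[victim->head++];
    pthread_spin_unlock(&victim->lock);
    return item;
}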
>>
>> I generally agree with everything you posted. But I think it's
>> unfortunate that you bring in cache lines into this, as it's somewhat
>> incorrect to talk about "owning" a cache line, and cache lines are not
>> really that relevant at all in this context.
>
> This is an odd thing to say. It's quite common to talk about
> core ownership of cache lines in both MESI and MOESI, which are
> the cache coherency protocols in use on modern x86 systems from
> both Intel and AMD, respectively. Indeed, the "O" in "MOESI" is
> for "owned" (https://en.wikipedia.org/wiki/MOESI_protocol).
I think we're talking past each other.
> Cache lines and their ownership are massively relevant in the
> context of mutual exclusion, atomic operations, and working on
> shared memory generally.
You are then talking about it in the sense that if a CPU writes to memory,
and the data is in its cache, then yes, you get ownership properties at
that point. Primarily because you are not immediately writing back to main
memory, and other CPUs may meanwhile hold read-only copies. I sort of
covered that exact topic further down, about either invalidating or
updating the caches of other CPUs. It also follows that other CPUs cannot
freely write or update those memory cells while another CPU holds a dirty
cached copy of them.
But before the CPU writes the data, it never owns the cache line. So it
doesn't make sense to say "uncontended write on a cacheline it already
owns". Ownership only happens when you do the write. And once you've
written, you own it.
I'm probably trying to make it too simple when I write about it. Trying
to avoid going overly-technical has its risks...
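(A small illustration of where that distinction shows up in practice: the
classic test-and-test-and-set spinlock, sketched here in C11 atomics with
names of my own choosing. While the lock is held, waiters spin on plain
loads against a read-only copy of the line; only when it looks free do
they attempt the atomic write that forces the ownership change.)

#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool tts_lock;   /* false = free, true = held */

void tts_acquire(tts_lock *l)
{
    for (;;) {
        /* Read-only spin: every waiter can keep the line in a shared
           state, so this part generates no coherency traffic. */
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;
        /* The write is what demands exclusive ownership of the line,
           so only attempt it once the lock appears free. */
        if (!atomic_exchange_explicit(l, true, memory_order_acquire))
            return;
    }
}

void tts_release(tts_lock *l)
{
    atomic_store_explicit(l, false, memory_order_release);
}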
>> But that's all there is to it. Now, it is indeed very costly if you have
>> many CPUs trying to spin-lock on the same data, because each one will be
>> hitting the same memory, causing a big pressure on that memory address.
>
> More accurately, you'll be pushing the associated cache line
> through a huge number of state transitions as different cores
> vie for exclusive ownership of the line in order to write the
> lock value. Beyond a handful of cores, the generated cache
> coherency traffic begins to dominate, and overall throughput
> _decreases_.
>
> This is of paramount importance when we start talking about
> things like synchronization primitives, which are based on
> atomic updates to shared memory; cache-inefficient algorithms
> simply do not scale beyond a handful of actors, and why things
> like MCS locks or CLH locks are used on many-core machines.
> See, for example:
> https://people.csail.mit.edu/nickolai/papers/boyd-wickizer-locks.pdf
Yes. This is the scaling problem with spinlocks. Cache coherency traffic
starts becoming costly. Having algorithms or structures that avoid
concentrating all the locking on a single address is an important piece in
alleviating it.
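(Since MCS locks were mentioned: that is exactly the trick they pull. Each
waiter spins on a flag in its own queue node, i.e. on a line that only it
is touching, instead of everyone hammering the one lock word. A rough C11
sketch, with names of my own choosing, just to show the shape:)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

typedef _Atomic(struct mcs_node *) mcs_lock;   /* tail of the waiter queue */

void mcs_acquire(mcs_lock *lock, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    /* Swap ourselves in as the new tail of the queue. */
    struct mcs_node *prev =
        atomic_exchange_explicit(lock, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                 /* queue was empty: the lock is ours */

    /* Link in behind the previous tail and spin on our own node only. */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;                       /* local spin, no cross-core traffic */
}

void mcs_release(mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *next =
        atomic_load_explicit(&me->next, memory_order_acquire);
    if (next == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                lock, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* Someone is in the middle of enqueueing; wait for the link. */
        while ((next = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand the lock directly to the next waiter. */
    atomic_store_explicit(&next->locked, false, memory_order_release);
}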
>> (And then we had the PDP-11/74, which had to implement cache coherency
>> between CPUs without any hardware support...)
>
> Sounds rough. :-/
It is. The CPU was modified to always bypass the cache for one instruction
(ASRB - used for spin locks), and then you manually turn cache bypass on
and off, either on a per-page basis or for the CPU as a whole, depending
on what is being done, in order not to run into cache inconsistency
problems.
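(For the curious: ASRB shifts the byte right, and the bit that falls out
of the bottom lands in the carry flag. So, assuming the natural convention
of 1 = free and 0 = held, a single uncached ASRB both tests and takes the
lock in one go. The moral equivalent in modern C - just a sketch of the
idea, not the actual RSX code - is an atomic exchange with zero:)

#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_uchar asrb_lock;  /* assumed convention: 1 = free, 0 = held */

/* ASRB on a byte that is only ever 0 or 1 amounts to: atomically pull out
   the old value, leave 0 behind, and look at what you got.  Pulling out a
   1 means the lock was free and is now yours. */
static bool asrb_try_acquire(asrb_lock *l)
{
    return atomic_exchange_explicit(l, 0, memory_order_acquire) == 1;
}

static void asrb_release(asrb_lock *l)
{
    atomic_store_explicit(l, 1, memory_order_release);
}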
Johnny