[Info-vax] OpenVMS async I/O, fast vs. slow
Dan Cross
cross at spitfire.i.gajendra.net
Tue Nov 7 21:00:07 EST 2023
In article <uidpd1$6s2$4 at news.misty.com>,
Johnny Billquist <bqt at softjar.se> wrote:
>On 2023-11-04 15:14, Dan Cross wrote:
>>>>[snip]
>>>> Consider the problem of a per-core queue with work-stealing;
>>>> the specifics of what's in the queue don't matter so much as the
>>>> overall structure of the problem: items in the queue might
>>>> represent IO requests, or they may represent runnable threads,
>>>> or whatever. Anyway, a core will usually process things on
>>>> its own local queue, but if it runs out of things to do, it
>>>> may choose some victim and steal some work from it.
>>>> Conventionally, this will involve everyone taking locks on
>>>> these queues, but that's ok: we expect that work stealing is
>>>> pretty rare, and that usually a core is simply locking its own
>>>> queue, which will be an uncontended write on a cacheline it
>>>> already owns. In any event, _most of the time_ the overhead
>>>> of using a "lock" will be minimal (assuming relatively simply
>>>> spinlocks or MCS locks or something like that).
>>>
>>> I generally agree with everything you posted. But I think it's
>>> unfortunate that you bring cache lines into this, as it's somewhat
>>> incorrect to talk about "owning" a cache line, and cache lines are not
>>> really that relevant at all in this context.
>>
>> This is an odd thing to say. It's quite common to talk about
>> core ownership of cache lines in both MESI and MOESI, which are
>> the cache coherency protocols in use on modern x86 systems from
>> both Intel and AMD, respectively. Indeed, the "O" in "MOESI" is
>> for "owned" (https://en.wikipedia.org/wiki/MOESI_protocol).
>
>I think we're talking past each other.
Yes. See below.
>> Cache lines and their ownership are massively relevant in the
>> context of mutual exclusion, atomic operations, and working on
>> shared memory generally.
>
>You are then talking about it in the sense that if a CPU writes to
>memory, and it's in the cache, then yes, you get ownership properties
>then and there. Primarily because you are not immediately writing back to
>main memory, and other CPUs are then allowed to hold read-only copies. I
>sort of covered that exact topic further down here, about either
>invalidating or updating the caches of other CPUs. It also holds that
>other CPUs cannot freely write or update those memory cells while another
>CPU holds a dirty cached copy of them.
Well, no; in the protocols in use on x86, if you have exclusive
access to a cache line, in either the exclusive or modified
state, then any other cache's copy of that line is invalidated.
MOESI augments this with the OWNED state, which permits sharing
of dirty data via cache-to-cache transfers.
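
(For reference, here's a rough summary of the five MOESI states as a
comment-annotated C enum. This is just a paraphrase of the protocol
description at the Wikipedia link above, not anything tied to a
particular implementation.)

    /* The five MOESI cache-line states, roughly summarized. */
    enum moesi_state {
        MOESI_MODIFIED,  /* dirty; this cache holds the only valid copy */
        MOESI_OWNED,     /* dirty but shared; this cache must supply the
                            data (and eventually write it back)         */
        MOESI_EXCLUSIVE, /* clean; this cache holds the only copy       */
        MOESI_SHARED,    /* read-only copy; other caches may also hold
                            it (memory may be stale if one is OWNED)    */
        MOESI_INVALID    /* no usable copy in this cache                */
    };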
>But before the CPU writes the data, it never owns the cache line. So it
>doesn't make sense to say "uncontended write on a cacheline it already
>owns". Ownership only happens when you do the write. And once you've
>written, you own it.
This ignores the context; see what I wrote above.
To recap, I was hypothesizing a multiprocessor system with
per-CPU work queues with work stealing. In such a system, one
implementation is to use a spinlock for each CPU's queue: in
the rare cases where one wants to steal work, one must lock some
other CPU's queue, _BUT_, in the usual scenario when the per-CPU
queue is uncontended, the local worker will lock the queue,
remove an item from it, unlock the queue, do some work, _and
then repeat_. Note that once it acquires the lock on its queue
the CPU will own the corresponding cacheline; since we assume
work stealing is rare, it is likely _it will still own it on
subsequent iterations of this loop._ Hence, making an
uncontended write on a cache line it already owns; here, it owns
it from the last time it made that same write.
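
To make the shape of this concrete, here's a minimal sketch in C11.
All the names (cpu_queue, get_work, NCPUS, the victim argument) are
made up for illustration; a real implementation would need a proper
victim-selection policy, back-off, and so on.

    #include <stdatomic.h>
    #include <stddef.h>

    struct work_item {
        struct work_item *next;
        void (*run)(struct work_item *);
    };

    /* One of these per CPU.  The lock and the list head deliberately
       live together: the owning CPU touches both on every iteration,
       so in the common case the whole thing stays in its cache in the
       MODIFIED/EXCLUSIVE state. */
    struct cpu_queue {
        atomic_bool lock;
        struct work_item *head;
    };

    #define NCPUS 64
    static struct cpu_queue queues[NCPUS];

    static void q_lock(struct cpu_queue *q)
    {
        /* Usually an uncontended write to a line this CPU already
           owns. */
        while (atomic_exchange_explicit(&q->lock, true,
                                        memory_order_acquire))
            ;  /* spin */
    }

    static void q_unlock(struct cpu_queue *q)
    {
        atomic_store_explicit(&q->lock, false, memory_order_release);
    }

    /* Pop from our own queue; only if it is empty do we go and steal
       from some victim, which is when the cross-CPU coherency
       traffic actually happens. */
    static struct work_item *get_work(int self, int victim)
    {
        struct cpu_queue *q = &queues[self];
        struct work_item *w;

        q_lock(q);
        w = q->head;
        if (w != NULL)
            q->head = w->next;
        q_unlock(q);

        if (w == NULL && victim != self) {   /* rare path */
            q = &queues[victim];
            q_lock(q);
            w = q->head;
            if (w != NULL)
                q->head = w->next;
            q_unlock(q);
        }
        return w;
    }

The point is only where the coherency traffic goes: the self path
stays on lines this CPU already owns, while the steal path drags the
victim's line across the interconnect.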
>I'm probably trying to make it too simple when I write about it. Trying
>to avoid going overly-technical has its risks...
Yes.
>>> But that's all there is to it. Now, it is indeed very costly if you have
>>> many CPUs trying to spin-lock on the same data, because each one will be
>>> hitting the same memory, putting a lot of pressure on that memory address.
>>
>> More accurately, you'll be pushing the associated cache line
>> through a huge number of state transitions as different cores
>> vie for exclusive ownership of the line in order to write the
>> lock value. Beyond a handful of cores, the generated cache
>> coherency traffic begins to dominate, and overall throughput
>> _decreases_.
>>
>> This is of paramount importance when we start talking about
>> things like synchronization primitives, which are based on
>> atomic updates to shared memory; cache-inefficient algorithms
>> simply do not scale beyond a handful of actors, and why things
>> like MCS locks or CLH locks are used on many-core machines.
>> See, for example:
>> https://people.csail.mit.edu/nickolai/papers/boyd-wickizer-locks.pdf
>
>Yes. This is the scaling problem with spinlocks. Cache coherency starts
>becoming costly. Having algorithms or structures that avoid localizing
>locking to one address is an important piece of alleviating it.
Not just addresses, but _cache lines_, which is why this topic
is so important. Multiple simultaneous writers to distinct
memory locations in the same cache line can lead to contention
due to false sharing.
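
To illustrate false sharing with a hypothetical example: if per-CPU
locks are simply packed into an array, several of them can land in
the same cache line, and CPUs taking *different*, logically
uncontended locks still end up ping-ponging that one line between
them. A sketch in C11, assuming 64-byte lines; the names are made up:

    #include <stdatomic.h>

    /* Bad: with 64-byte cache lines, many of these one-byte locks
       share a line, so CPUs acquiring *different* locks still
       invalidate each other's copy of that line (false sharing). */
    static atomic_bool locks_packed[64];

    /* Better: give each lock its own line.  _Alignas(64) on the
       member raises the struct's alignment (and hence its size) to
       64, so each array element starts on a fresh line. */
    struct padded_lock {
        _Alignas(64) atomic_bool lock;
    };
    static struct padded_lock locks_padded[64];

Queue-based locks such as MCS push the same idea further: each
waiter spins on a flag in its own node, i.e. its own cache line, so
the only cross-CPU traffic is the hand-off when the lock is released.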
>>> (And then we had the PDP-11/74, which had to implement cache coherency
>>> between CPUs without any hardware support...)
>>
>> Sounds rough. :-/
>
>It is. The CPU was modified so that one instruction (ASRB - used for
>spin locks) always bypasses the cache, and then you manually turn cache
>bypass on and off on a per-page basis, or for the CPU as a whole,
>depending on what is being done, in order to avoid getting into cache
>inconsistency issues.
This implies that stores were in a total order, then, and
these uncached instructions were serializing with respect to
other CPUs?
- Dan C.