[Info-vax] OpenVMS async I/O, fast vs. slow

Dan Cross cross at spitfire.i.gajendra.net
Thu Nov 9 11:50:46 EST 2023


In article <uig3nn$2ke$2 at news.misty.com>,
Johnny Billquist  <bqt at softjar.se> wrote:
>On 2023-11-08 03:00, Dan Cross wrote:
>> [snip]
>> Yes.  See below.
>
>:-)
>
>And yes, I know how the cache coherency protocols work. Another thing 
>that was covered already when I was studying at University.

Cool.  MESI wasn't presented until 1984, so you must have seen
it pretty early on.

>[snip]
>> This ignores the context; see what I wrote above.
>> 
>> To recap, I was hypothesizing a multiprocessor system with
>> per-CPU work queues with work stealing.  In such a system, an
>> implementation may be to use a spinlock for each CPU's queue: in
>> the rare cases where one wants to steal work, one must lock some
>> other CPU's queue, _BUT_, in the usual scenario when the per-CPU
>> queue is uncontended, the local worker will lock the queue,
>> remove an item from it, unlock the queue, do some work, _and
>> then repeat_.  Note that once it acquires the lock on its queue
>> the CPU will own the corresponding cacheline; since we assume
>> work stealing is rare, it is likely _it will still own it on
>> subsequent iterations of this loop._  Hence, making an
>> uncontended write on a cache line it already owns; here, it owns
>> it from the last time it made that same write.
>
>That would assume that the cache has not been written back. Which is 
>likely if we talk about a short time after last updating the lock, but 
>rather unlikely most of the time.

This is making an assumption that may or may not hold.  It is
true that if the data is flushed back to main memory from the
cache we no longer own the line, but it is not necessarily true
that the line will be immediately flushed from cache.  The
hypothetical was positing a scenario in which a spinlock can
be ridiculously cheap: if it is almost never contended and its
cacheline remains owned by the current processor.
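
To make the shape of that concrete, here is a minimal C11 sketch of
the sort of thing I have in mind (the names, the CPU count, and the
structure are made up purely for illustration):

    #include <stdatomic.h>
    #include <stddef.h>

    #define NCPUS 8                 /* hypothetical CPU count        */

    struct work {
        struct work *next;
        /* ... payload ... */
    };

    struct work_queue {
        atomic_int   lock;          /* spinlock guarding the queue   */
        struct work *head;          /* singly-linked list of items   */
    };

    /* One queue per CPU; stealing from another CPU's queue is rare. */
    static struct work_queue queues[NCPUS];

    static void wq_lock(struct work_queue *q)
    {
        while (atomic_exchange_explicit(&q->lock, 1, memory_order_acquire))
            ;                       /* contended only by a rare thief */
    }

    static void wq_unlock(struct work_queue *q)
    {
        atomic_store_explicit(&q->lock, 0, memory_order_release);
    }

    /* The common path: the local worker pops from its own queue, over
     * and over.  Since nobody else has touched q->lock since the
     * previous iteration, this CPU still owns the line and taking the
     * lock is cheap.                                                  */
    static struct work *wq_pop(struct work_queue *q)
    {
        wq_lock(q);
        struct work *w = q->head;
        if (w != NULL)
            q->head = w->next;
        wq_unlock(q);
        return w;
    }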

>This boils down to patterns and timing (obviously). I believe that spin 
>locks are hit hard when you are trying to get a lock, but once you have 
>it, you will not be touching that thing for quite a while, and it will 
>quickly go out of cache.

Those are assumptions that may or may not hold.

>If you are a CPU trying to grab something from another CPU, with a spin 
>lock held by that other CPU, the spin lock mutex will most likely not 
>sit in the owner's cache. So the other CPU will be hitting it, getting 
>the value and it gets into the cache. Might be several CPUs as well. 

Yes.

>They'll get the value in the cache. Cached value will be flagged as 
>shared. No one will have it as owned.

Hmm, this depends entirely on what instructions are being used
to implement the lock.  A naive `XCHG` will pull the line into
MODIFIED state; something like a `LOCK CMPXCHG` might pull it
directly into EXCLUSIVE via a read-for-ownership cache op.
I'd have to look at the implementations in detail to understand
the exact behavior here.
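
For comparison, the two acquire styles might look roughly like this
in C11; how each maps onto `XCHG` or `LOCK CMPXCHG`, and what state
the line ends up in, is up to the compiler and the microarchitecture:

    #include <stdatomic.h>

    /* Exchange-based acquire: typically compiles to XCHG on x86,
     * which writes unconditionally and so requests the line for
     * ownership on every attempt.                                   */
    static void spin_lock_xchg(atomic_int *lock)
    {
        while (atomic_exchange_explicit(lock, 1, memory_order_acquire))
            ;
    }

    /* Compare-and-swap acquire: typically compiles to LOCK CMPXCHG
     * on x86; exactly how the line is brought in is a
     * microarchitectural detail, as noted above.                    */
    static void spin_lock_cas(atomic_int *lock)
    {
        int expected = 0;
        while (!atomic_compare_exchange_weak_explicit(
                   lock, &expected, 1,
                   memory_order_acquire, memory_order_relaxed))
            expected = 0;       /* CAS stored the observed value here */
    }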

>The owning CPU then tries to 
>release the lock, at which time it also access the data, writes it, at 
>which point the other CPUs will still have it in shared, and the owning 
>CPU gets it as owned.

Hey now?  Seems like most spinlock releases on x86 are simply
going to be a memory barrier followed by a store.  That'll push
the line in the local cache into MODIFIED and invalidate the
copies in all other caches.
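
Roughly, in C11 terms (the release ordering supplies the barrier; on
x86 it compiles down to a plain store):

    #include <stdatomic.h>

    /* Releasing the lock: a store with release ordering.  On x86
     * this is just a MOV; the store takes the line to MODIFIED in
     * this cache and invalidates any other cached copies.           */
    static void spin_unlock(atomic_int *lock)
    {
        atomic_store_explicit(lock, 0, memory_order_release);
    }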

>Other CPUs then try to get the lock, so they all 
>start kicking the owning CPU cache, in order to get the data committed to 
>main memory, so they can obtain ownership, and grab the lock. One of 
>them will succeed, the others again end up with a shared cache state. 

Actually, immediately after one succeeds, the copies in the other
caches will be in INVALID state until the line is read again;
then it may end up in EXCLUSIVE or SHARED, depending on how the
lock is implemented.

>The one CPU that managed to grab the lock will now be the owner, and all 
>other CPUs will hit it for the cached data, and eventually it will be 
>written to memory, and dropped from the owner's cache, since the owner is 
>not actually working on that data.

This seems like a very different hypothetical from the one I was
proposing, but seems more or less correct to me.  :-)

>Cache lines are typically something like 128 bytes or so, so even though 
>locality means there is some other data around, the owning CPU is 
>unlikely to care about anything in that cache line, but that is 
>speculation on my part.

Cache line size on current x86 processors is 64 bytes, but yeah.

>The really bad thing is if you have several spin locks in the same cache 
>line...

YUP.  Or any highly contended data.  For example, a lock and
a frequently updated counter that's unrelated to the lock.
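
The usual mitigation is to push such hot fields onto separate lines;
a small sketch, assuming 64-byte lines:

    #include <stdalign.h>
    #include <stdatomic.h>

    /* Bad: the lock and an unrelated, frequently updated counter
     * share a cache line, so counter updates and lock traffic fight
     * over the same line.                                            */
    struct stats_bad {
        atomic_int  lock;
        atomic_long counter;
    };

    /* Better: force each hot field onto its own 64-byte line.        */
    struct stats_better {
        alignas(64) atomic_int  lock;
        alignas(64) atomic_long counter;
    };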

>But now I tried to become a little more technical. ;-)
>
>But also maybe we don't need to kick this around any more. Seems like 
>we're drifting. I think we started out with the question of I/O 
>performance, and in this case specifically by using multiple threads in 
>VMS, and how Unix compatible layers seem to not get much performance, 
>which is no surprise to either of us, while VMS's own primitives can 
>deliver fairly ok performance.

Well, getting back to that....  One of the things that I found
rather odd was that the discussion seemed to conflate macro- and
micro-level optimizations.  I mean, it went from talking about
efficient ways to retrieve data from a secondary storage device
that is orders of magnitude slower than the CPU to atomic
operations and reducing mutex contention, which for most IO-bound
applications will be lost in the noise.

>Beyond that, I'm not sure if we are arguing about something much, or 
>basically nitpicking. :-)

This is true.  :-)  It's fun, though!  Also, I've found that not
understanding how this stuff works in detail can have really
profound performance implications.  Simply put, most programmers
don't have good intuition here.

>>>>> But that's all there is to it. Now, it is indeed very costly if you have
>>>>> many CPUs trying to spin-lock on the same data, because each one will be
>>>>> hitting the same memory, causing a big pressure on that memory address.
>>>>
>>>> More accurately, you'll be pushing the associated cache line
>>>> through a huge number of state transitions as different cores
>>>> vie for exclusive ownership of the line in order to write the
>>>> lock value.  Beyond a handful of cores, the generated cache
>>>> coherency traffic begins to dominate, and overall throughput
>>>> _decreases_.
>>>>
>>>> This is of paramount importance when we start talking about
>>>> things like synchronization primitives, which are based on
>>>> atomic updates to shared memory; cache-inefficient algorithms
>>>> simply do not scale beyond a handful of actors, and why things
>>>> like MCS locks or CLH locks are used on many-core machines.
>>>> See, for example:
>>>> https://people.csail.mit.edu/nickolai/papers/boyd-wickizer-locks.pdf
>>>
>>> Yes. This is the scaling problem with spinlocks. Cache coherency starts
>>> becoming costly. Having algorithms or structures that allow locking to
>>> not be localized to one address is an important piece to help alleviate it.
>> 
>> Not just addresses, but _cache lines_, which is why this topic
>> is so important.  Multiple simultaneous writers to multiple
>> memory locations in the same cache line can lead to contention
>> due to false sharing.
>
>That is true. Cache lines are really the smallest denominator here. I 
>should take care to put that correctly.
>It doesn't help if you have some clever algorithm that spreads the lock 
>out over multiple addresses, if they all hit the same cache line.

Yup.
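
Incidentally, that is exactly what the MCS locks mentioned above
attack: each waiter spins on a flag in its own cache-line-sized node,
so waiting generates no cross-CPU traffic and only the hand-off
touches a shared line.  A stripped-down sketch, not production code:

    #include <stdalign.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Each waiter supplies a node that sits on its own 64-byte line. */
    struct mcs_node {
        alignas(64) _Atomic(struct mcs_node *) next;
        atomic_bool locked;
    };

    typedef _Atomic(struct mcs_node *) mcs_lock;   /* queue tail */

    static void mcs_acquire(mcs_lock *lock, struct mcs_node *me)
    {
        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        struct mcs_node *prev =
            atomic_exchange_explicit(lock, me, memory_order_acq_rel);
        if (prev != NULL) {
            /* Queue was non-empty: link in, then spin on our own flag. */
            atomic_store_explicit(&prev->next, me, memory_order_release);
            while (atomic_load_explicit(&me->locked, memory_order_acquire))
                ;
        }
    }

    static void mcs_release(mcs_lock *lock, struct mcs_node *me)
    {
        struct mcs_node *next =
            atomic_load_explicit(&me->next, memory_order_acquire);
        if (next == NULL) {
            struct mcs_node *expected = me;
            if (atomic_compare_exchange_strong_explicit(
                    lock, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;                 /* no one was waiting        */
            do {                        /* a waiter is mid-enqueue   */
                next = atomic_load_explicit(&me->next, memory_order_acquire);
            } while (next == NULL);
        }
        atomic_store_explicit(&next->locked, false, memory_order_release);
    }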

>>>>> (And then we had the PDP-11/74, which had to implement cache coherency
>>>>> between CPUs without any hardware support...)
>>>>
>>>> Sounds rough. :-/
>>>
>>> It is. CPU was modified to actually always step around the cache for one
>>> instruction (ASRB - used for spin locks), and then you manually turn on
>>> and off cache bypass on a per-page basis, or in general of the CPU,
>>> depending on what is being done, in order to not get into issues of
>>> cache inconsistency.
>> 
>> This implies that stores were in a total order, then, and
>> these uncached instructions were serializing with respect to
>> other CPUs?
>
>The uncached instruction is basically there in order to be able to 
>implement a spin lock that works as you would expect. Once you have the 
>lock, then you either deal with data which is known to be shared, in 
>which case you need to run with cache disabled, or you are dealing with 
>data you know is not shared, in which case you can allow caching to work 
>as normal.
>
>No data access to shared resources are allowed to be done without 
>getting the lock first.

Sure.  But suppose I use the uncached instructions to implement
a lock around a shared data structure; I use the uncached instr
to grab a lock, I modify (say) a counter in "normal" memory with
a "normal" instruction and then I use the uncached instruction
to release the lock.  But what about the counter value?  Is its
new value --- the update to which was protected by the lock ---
immediately visible to all other CPUs?
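
In rough pseudo-C, with made-up primitives standing in for the
cache-bypassing access, the pattern I'm asking about is:

    /* The two primitives below are hypothetical, standing in for the
     * PDP-11's cache-bypassing access (e.g. ASRB) described above.   */
    extern int  uncached_test_and_set(volatile int *lock);
    extern void uncached_clear(volatile int *lock);

    static volatile int lock;
    static long counter;        /* ordinary, cacheable memory */

    void bump(void)
    {
        while (uncached_test_and_set(&lock))
            ;                   /* spin, bypassing the cache  */
        counter++;              /* "normal", cached access    */
        uncached_clear(&lock);
        /* The question: is the new counter value now visible to the
         * other CPUs, or can it still be sitting in this CPU's cache? */
    }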

>Hey, it's an old system by today's standards. Rather primitive. I think 
>it's cool DEC even made it, and it works, and gives pretty acceptable 
>performance. But it was noted that going much above 4 CPUs really gave 
>diminishing returns. So the 11/74 never supported more than 4 CPUs. And 
>it gave/gives about 3.5 times the performance of a single 11/70.

I can see that.  Very cool, indeed.

	- Dan C.



