[Info-vax] OpenVMS async I/O, fast vs. slow
Johnny Billquist
bqt at softjar.se
Thu Nov 9 13:51:56 EST 2023
On 2023-11-09 17:50, Dan Cross wrote:
> In article <uig3nn$2ke$2 at news.misty.com>,
> Johnny Billquist <bqt at softjar.se> wrote:
>> On 2023-11-08 03:00, Dan Cross wrote:
>>> [snip]
>>> Yes. See below.
>>
>> :-)
>>
>> And yes, I know how the cache coherency protocols work. Another thing
>> that was covered already when I was studying at University.
>
> Cool. MESI wasn't presented until 1984, so you must have seen
> it pretty early on.
I think I was looking at it in 1995 (what made you think I would have
looked at it close to 1984? How old do you think I am??? :) ). My
professor was specifically focused on CPU caches. He was previously the
chief architect for the high-end server division at SUN (Erik Hagersten,
if you want to look him up).
I didn't even start at University until 1989.
>> That would assume that the cache has not been written back. Which is
>> likely if we talk about a short time after last updating the lock, but
>> rather unlikely most of the time.
>
> This is making an assumption that may or may not hold. It is
> true that if the data is flushed back to main memory from the
> cache we no longer own the line, but it is not necessarily true
> that the line will be immediately flushed from cache. The
> hypothetical was positing a scenario in which a spinlock can
> be ridiculously cheap: if it is almost never contended and its
> cacheline remains owned by the current processor.
The cache line will most definitely not be flushed immediately. However,
it is also most likely not referenced again, and as we execute other
code, it will sooner or later get flushed.
It's unlikely that something sits in the cache for long when it's not
referenced. And once you have acquired a spin lock, the lock word is a
memory location you will no longer be accessing. And even though the
cache line holds more data, it's still unlikely that you'll be
referencing that small chunk of memory much.
But yes, it's not impossible. We're all down to speculation here.
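To put that in code, the acquire side I have in mind looks roughly like
this (a minimal sketch using C11 atomics, all names made up, nothing
VMS-specific):

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } spinlock_t;   /* made-up lock */

    static void spin_lock(spinlock_t *l)
    {
        for (;;) {
            /* Fast path: a single atomic exchange.  If this CPU still
               owns the cache line from its previous acquire/release,
               this generates no bus traffic at all. */
            if (!atomic_exchange_explicit(&l->locked, 1,
                                          memory_order_acquire))
                return;
            /* Contended path: spin on plain loads, so the waiters only
               hold the line shared instead of bouncing it around. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
        }
    }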
>> This boils down to patterns and timing (obviously). I believe that spin
>> locks are hit hard when you are trying to get a lock, but once you have
>> it, you will not be touching that thing for quite a while, and it will
>> quickly go out of cache.
>
> Those are assumptions that may or may not hold.
I think it's a fairly safe assumption that once you have acquired the
spin lock, you will not be hitting the cell that contains the state of
the spin lock. There is absolutely no point in continuing to hit that
one after this point. And it's also safe to assume that we will be
running through a bunch of code once we have acquired the lock, since
that's the whole point of the lock - once you have it, you can proceed
to do the stuff you wanted exclusive access for.
So that leaves the assumption that this will flush that line out. But I
think it's also fairly safe to say that that will happen in the near
future. Caches, after all, only contain a rather limited amount of data
compared to the full memory, or to what memory your code hits.
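In code terms the pattern is just this (a fragment continuing the
made-up sketch above; the queue and counter are invented, and
spin_unlock is the release store shown a bit further down):

    /* Typical shape of a critical section: the lock word is touched
       once on the way in and once on the way out, and everything in
       between references other memory entirely. */
    spin_lock(&queue_lock);
    item = queue_remove(&shared_queue);   /* hypothetical shared work */
    shared_count--;
    spin_unlock(&queue_lock);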
>> The owning CPU then tries to
>> release the lock, at which time it also accesses the data, writes it, at
>> which point the other CPUs will still have it in shared, and the owning
>> CPU gets it as owned.
>
> Hey now? Seems like most spinlock releases on x86 are simply
> going to be a memory barrier followed by a store. That'll push
> the line in the local into MODIFIED and invalidate all other
> caches.
Why would it invalidate other caches? It could just push the change into
their caches as well, at which point the local CPU would have the line
as "owned" and the others as "shared". But yes, it could also just
invalidate the others and move to "modified". Both are equally correct
states.
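In portable terms the release is, by the way, just a release store. As
far as I know, on x86 (with its TSO memory model) that single store
needs no explicit fence, and the coherency protocol then deals with the
other copies in whichever of those two ways the implementation chooses.
Continuing the same made-up sketch:

    static void spin_unlock(spinlock_t *l)
    {
        /* A release store is enough: earlier writes in the critical
           section become visible before the lock is seen as free.
           The coherency protocol then either invalidates the copies
           the waiting CPUs hold, or updates them, depending on the
           implementation. */
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }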
>> Other CPUs then try to get the lock, so they all
>> start kicking the owning CPU's cache, in order to get the data committed to
>> main memory, so they can obtain ownership, and grab the lock. One of
>> them will succeed, the others again end up with a shared cache state.
>
> Actually, immediately after one succeeds, the others will be in
> INVALID state until they're read again, then they may or may not
> be in EXCLUSIVE or SHARED; depending on how they are
> implemented.
You seem to assume that updates never get pushed from the cache of one
CPU to the others, but always just cause an invalidate. It's possible to
do either. Exactly what any specific implementation does, on the other
hand, I have no idea.
>> Cache lines are typically something like 128 bytes or so, so even though
>> locality means there is some other data around, the owning CPU is
>> unlikely to care about anything in that cache line, but that is
>> speculation on my part.
>
> Cache line size on current x86 processors is 64 bytes, but yeah.
I can't even keep track of that. It changes over time anyway, and is not
even CPU specific. It's an implementation detail for the memory
subsystem. :)
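If it actually matters, you don't hard-code the number anyway, you just
ask for the alignment. Something like this, where 64 is merely the
common x86 figure today and very much an assumption:

    #include <stdalign.h>
    #include <stdatomic.h>

    /* Keep the lock word on a cache line of its own, so unrelated data
       that the lock holder keeps touching cannot drag the lock's line
       back and forth between caches (false sharing).  The 64 is just
       an assumption; as said, it's an implementation detail. */
    struct padded_lock {
        alignas(64) atomic_int locked;   /* struct is padded out to 64 */
    };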
>> But now I tried to become a little more technical. ;-)
>>
>> But also maybe we don't need to kick this around any more. Seems like
>> we're drifting. I think we started out with the question of I/O
>> performance, and in this case specifically by using multiple threads in
>> VMS, and how Unix compatible layers seem to not get much performance,
>> which it seems is no surprise to either of us, while VMS's own primitives can
>> deliver fairly ok performance.
>
> Well getting back to that.... One of the things that I found
> rather odd was that that discussion seems to conflate macro- and
> micro-level optimizations in an odd way. I mean, it went from
> talking about efficient ways to retrieve data from a secondary
> storage device that is orders of magnitude slower than the CPU
> to atomic operations and reducing mutex contention, which seems
> like for most IO-bound applications will be in the noise.
True.
>> Beyond that, I'm not sure if we are arguing about something much, or
>> basically nitpicking. :-)
>
> This is true. :-) It's fun, though! Also, I've found that not
> understanding how this stuff works in detail can have really
> profound performance implications. Simply put, most programmers
> don't have good intuition here.
Maybe I should tell about the time when I worked at a company where we
were making network switching/routing equipment and did our own
hardware, and were very careful about exactly where each byte of
incoming packets ended up, so that all the data we were interested in
sat in the same cache line, to get the most speed out of it all...
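Roughly in this spirit (the field names here are invented; the real
layout was obviously specific to our hardware): the fields the
forwarding path actually looks at are placed together, so that a single
cache line fill brings in everything needed for the decision.

    #include <stdint.h>
    #include <stdalign.h>

    /* Invented example layout: the hot fields share one cache line,
       the cold ones come after. */
    struct rx_descriptor {
        alignas(64) uint8_t dst_mac[6];   /* hot: forwarding decision */
        uint8_t  src_mac[6];
        uint16_t vlan_tag;
        uint16_t ethertype;
        uint16_t pkt_len;
        uint32_t rx_port;
        uint64_t timestamp;               /* cold: slow path only */
        uint32_t checksum_status;
    };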
>>>>>> (And then we had the PDP-11/74, which had to implement cache coherency
>>>>>> between CPUs without any hardware support...)
>>>>>
>>>>> Sounds rough. :-/
>>>>
>>>> It is. CPU was modified to actually always step around the cache for one
>>>> instruction (ASRB - used for spin locks), and then you manually turn on
>>>> and off cache bypass on a per-page basis, or in general of the CPU,
>>>> depending on what is being done, in order to not get into issues of
>>>> cache inconsistency.
>>>
>>> This implies that stores were in a total order, then, and
>>> these uncached instructions were serializing with respect to
>>> other CPUs?
>>
>> The uncached instruction is basically there in order to be able to
>> implement a spin lock that works as you would expect. Once you have the
>> lock, then you either deal with data which is known to be shared, in
>> which case you need to run with cache disabled, or you are dealing with
>> data you know is not shared, in which case you can allow caching to work
>> as normal.
>>
>> No data access to shared resources are allowed to be done without
>> getting the lock first.
>
> Sure. But suppose I use the uncached instructions to implement
> a lock around a shared data structure; I use the uncached instr
> to grab a lock, I modify (say) a counter in "normal" memory with
> a "normal" instruction and then I use the uncached instruction
> to release the lock. But what about the counter value? Is its
> new value --- the update to which was protected by the lock ---
> immediately visible to all other CPUs?
Like I said - if you are dealing with shared data, even after you get
the lock, you then need to turn off the cache while working on it. So
basically, such updates would immediately hit main memory. And any reads
of that data would also be done with cache disabled, so you get the
actual data. So updates are always immediately visible to all CPUs.
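In (invented) C-ish terms the pattern looks something like this - ASRB
is the real primitive, everything else is just names I made up to show
the shape:

    /* Conceptual sketch only; all names are invented.  On the 11/74
       the atomic primitive is ASRB, which the hardware always runs
       around the cache, and "cache bypass" is done per page or for
       the CPU as a whole. */
    extern int  uncached_test_and_set(volatile int *lock);  /* nonzero = got it */
    extern void uncached_clear(volatile int *lock);
    extern void cache_bypass_on(void);
    extern void cache_bypass_off(void);

    volatile int shared_lock;
    int shared_counter;

    void bump_shared_counter(void)
    {
        while (!uncached_test_and_set(&shared_lock))
            ;                        /* spin until we own the lock */

        cache_bypass_on();           /* shared data: reads and writes go
                                        straight to main memory, so every
                                        CPU sees the update immediately */
        shared_counter++;
        cache_bypass_off();

        uncached_clear(&shared_lock);   /* release, again around the cache */
    }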
>> Hey, it's an old system by today's standards. Rather primitive. I think
>> it's cool DEC even made it, and it works, and gives pretty acceptable
>> performance. But it was noted that going much above 4 CPUs really gave
>> diminishing returns. So the 11/74 never supported more than 4 CPUs. And
>> it gave/gives about 3.5 times the performance of a single 11/70.
>
> I can see that. Very cool, indeed.
It was nice that they made it work, considering the limitations. Also,
they took an existing OS (RSX) and made rather small modifications to it
in order to make it work MP. Sure, you basically have one big lock, but
the system was for the most part designed in such a way that most things
were already safe.
Johnny