[Info-vax] OpenVMS async I/O, fast vs. slow

Dan Cross cross at spitfire.i.gajendra.net
Sat Nov 11 10:06:58 EST 2023


In article <uij9oc$st3$1 at news.misty.com>,
Johnny Billquist  <bqt at softjar.se> wrote:
>On 2023-11-09 17:50, Dan Cross wrote:
>> In article <uig3nn$2ke$2 at news.misty.com>,
>> Johnny Billquist  <bqt at softjar.se> wrote:
>>> On 2023-11-08 03:00, Dan Cross wrote:
>>>> [snip]
>>>> Yes.  See below.
>>>
>>> :-)
>>>
>>> And yes, I know how the cache coherency protocols work. Another thing
>>> that was covered already when I was studying at University.
>> 
>> Cool.  MESI wasn't presented until 1984, so you must have seen
>> it pretty early on.
>
>I think I was looking at it in 1995 (what made you think I would have 
>looked at it close to 1984? How old do you think I am??? :) ).

*cough* Uh, ahem...sorry.... *cough*

>My 
>professor was specifically focused on CPU caches. He was previously the 
>chief architect for the high-end server division at SUN (Erik Hagersten, 
>if you want to look him up).
>I didn't even start at University until 1989.

Gosh, I think like some other folks I just ASSumed you were
older given your PDP-11 interests.  Apologies!

>[snip]
>But yes, it's not impossible. We're all down to speculations here.

Agreed.

>> Those are assumptions that may or may not hold.
>
>I think it's a fairly safe assumption that once you have acquired the 
>spin lock, you will not be hitting the cell that contains the state of 
>the spin lock. There is absolutely no point in continuing to hit that 
>one after this point.

Well, you have to release the lock, which will be a store, which
will touch the line.  If you've got a very short critical
section, say just removing something from a list, which is a
handful of instructions, it is very unlikely that the line
holding the lock will be evicted before the end of the critical
section.

>And it's also safe to assume that we will be 
>running through a bunch of code once we have acquired the lock.

I don't think that's a safe assumption at all, especially for a
spin lock: minimizing the length of the critical section should
absolutely be a goal.
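
To make that concrete, here's a rough sketch in C11 atomics of the sort
of critical section I have in mind.  The node/list names are made up for
illustration; this is just a sketch, not any particular implementation:

#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type and shared list head, for illustration only. */
struct node {
    struct node *next;
};

static atomic_flag list_lock = ATOMIC_FLAG_INIT;
static struct node *list_head;

/* Pop the first node off a shared list under a spin lock.  The critical
 * section is only a handful of instructions, so the line holding
 * list_lock is very unlikely to be evicted before the releasing store. */
struct node *
pop_front(void)
{
    struct node *n;

    /* Acquire: spins on the lock's cache line until the test-and-set
     * succeeds. */
    while (atomic_flag_test_and_set_explicit(&list_lock,
                                             memory_order_acquire))
        ;

    n = list_head;
    if (n != NULL)
        list_head = n->next;

    /* Release: effectively a barrier followed by a store to the same
     * line, which is why the lock's line gets touched once more at the
     * end of the critical section. */
    atomic_flag_clear_explicit(&list_lock, memory_order_release);
    return n;
}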

>Since 
>that's the whole point of the lock - once you have it you can proceed 
>with the work you needed exclusive access to do.
>
>So that leaves the assumption that this will flush that line out. But I 
>think it's also fairly safe to say that it will happen in the near 
>future. Caches, after all, only contain a rather limited amount of data, 
>compared to the full memory, or to whatever memory your code hits.

This is true, but again, we're all speculating.

>>> The owning CPU then tries to
>>> release the lock, at which time it also accesses the data, writes it, at
>>> which point the other CPUs will still have it in shared, and the owning
>>> CPU gets it as owned.
>> 
>> Hey now?  Seems like most spinlock releases on x86 are simply
>> going to be a memory barrier followed by a store.  That'll push
>> the line in the local into MODIFIED and invalidate all other
>> caches.
>
>Why would it invalidate other caches?

Because that's what the protocol says it must do?  :-)

>It could just push the change into 
>their caches as well. At which point the local CPU would have "owned" 
>and others "shared".

Well, no....  Even in MOESI, a write hit puts an OWNED cache
line into MODIFIED state, and a probe write hit puts the line
into INVALID state from any state.  A probe read hit can move a
line from MODIFIED into OWNED (presumably that's when it pushes
its modified contents to other caches).
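
Just to spell out the transitions I mean, here's a little sketch for a
single line in a single cache.  It's a simplification for illustration,
not a model of any real implementation:

/* Simplified MOESI transitions for one line in one cache. */
enum moesi { MODIFIED, OWNED, EXCLUSIVE, SHARED, INVALID };

enum event { LOCAL_WRITE_HIT, PROBE_WRITE_HIT, PROBE_READ_HIT };

enum moesi
transition(enum moesi s, enum event e)
{
    switch (e) {
    case LOCAL_WRITE_HIT:
        /* A write hit moves EXCLUSIVE/SHARED/OWNED into MODIFIED
         * (invalidating the other caches' copies as needed). */
        return (s == INVALID) ? s : MODIFIED;
    case PROBE_WRITE_HIT:
        /* Another CPU writing the line invalidates our copy, from any
         * state. */
        return INVALID;
    case PROBE_READ_HIT:
        /* Another CPU reading a line we hold dirty: we supply the data
         * and drop to OWNED; a clean EXCLUSIVE copy drops to SHARED. */
        if (s == MODIFIED)
            return OWNED;
        if (s == EXCLUSIVE)
            return SHARED;
        return s;
    }
    return s;
}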

>But yes, it could just also invalidate the others, 
>and move to "modified". Both are equally correct states.

Yup; I agree.

>>> Other CPUs then try to get the lock, so they all
>>> start kicking the owning CPU's cache, in order to get the data committed to
>>> main memory, so they can obtain ownership, and grab the lock. One of
>>> them will succeed, the others again end up with a shared cache state.
>> 
>> Actually, immediately after one succeeds, the others will be in
>> INVALID state until they're read again, then they may or may not
>> be in EXCLUSIVE or SHARED, depending on how they are
>> implemented.
>
>You seem to assume that updates never get pushed from the cache on one 
>CPU to the others, but that it always just does an invalidate. It's 
>possible to do either. Exactly what any specific implementation does, on 
>the other hand, I have no idea.

Well, that is what the protocols both say that they do, though I
suppose it is theoretically possible that MOESI could avoid it.

>>> Cache lines are typically something like 128 bytes or so, so even though
>>> locality means there is some other data around, the owning CPU is
>>> unlikely to care about anything in that cache line, but that is
>>> speculation on my part.
>> 
>> Cache line size on current x86 processors is 64 bytes, but yeah.
>
>I can't even keep track of that. It changes over time anyway, and is not 
>even CPU specific. It's an implementation detail for the memory 
>subsystem. :)

What's worse is that there are big.LITTLE ARM configurations
that have different cache line sizes for the different CPUs in
the same SoC complex!
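
If you ever want to see what a particular machine actually reports,
something like this does it on Linux with glibc.  The
_SC_LEVEL1_DCACHE_LINESIZE knob is a glibc extension; other systems
expose the same information differently (sysctl, CTR_EL0 on ARM, etc.):

#include <stdio.h>
#include <unistd.h>

/* Print the L1 data cache line size as reported by the C library. */
int
main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    if (line > 0)
        printf("L1 dcache line size: %ld bytes\n", line);
    else
        printf("L1 dcache line size not reported\n");
    return 0;
}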

>>> But now I tried to become a little more technical. ;-)
>>>
>>> But also maybe we don't need to kick this around any more. Seems like
>>> we're drifting. I think we started out with the question of I/O
>>> performance, and in this case specifically by using multiple threads in
>>> VMS, and how Unix compatible layers seem to not get much performance,
>>> which is no surprise to either of us, while VMS's own primitives can
>>> deliver fairly OK performance.
>> 
>> Well, getting back to that....  One of the things that I found
>> rather odd was that that discussion seems to conflate macro- and
>> micro-level optimizations in an odd way.  I mean, it went from
>> talking about efficient ways to retrieve data from a secondary
>> storage device that is orders of magnitude slower than the CPU
>> to atomic operations and reducing mutex contention, which for
>> most IO-bound applications seems like it will be in the noise.
>
>True.
>
>>> Beyond that, I'm not sure if we are arguing about something much, or
>>> basically nitpicking. :-)
>> 
>> This is true.  :-)  It's fun, though!  Also, I've found that not
>> understanding how this stuff works in detail can have really
>> profound performance implications.  Simply put, most programmers
>> don't have good intuition here.
>
>Maybe I should tell you about the time when I worked at a company where 
>we were making network switching/routing gear and did our own hardware, 
>and were very careful about exactly where each byte of incoming packets 
>ended up, so that all the data we were interested in was sitting in the 
>same cache line, to get the most speed out of it all...

Nice.
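
That's a classic trick.  In C it tends to end up looking something like
the sketch below; the field names are invented, but the point is grouping
the fields the fast path touches and aligning the struct so they land in
a single 64-byte line (the aligned attribute is GCC/Clang-style):

#include <stdint.h>

/* Hypothetical packet metadata kept together in one 64-byte cache line.
 * Field names are invented; only the grouping and alignment matter. */
struct pkt_meta {
    uint32_t dst_ip;
    uint32_t src_ip;
    uint16_t dst_port;
    uint16_t src_port;
    uint8_t  proto;
    uint8_t  flags;
    uint16_t pkt_len;
    uint32_t flow_hash;
    /* ...whatever else the forwarding fast path needs... */
} __attribute__((aligned(64)));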

>>>>>>> (And then we had the PDP-11/74, which had to implement cache coherency
>>>>>>> between CPUs without any hardware support...)
>>>>>>
>>>>>> Sounds rough. :-/
>>>>>
>>>>> It is. The CPU was modified to always step around the cache for one
>>>>> instruction (ASRB - used for spin locks), and then you manually turn
>>>>> cache bypass on and off on a per-page basis, or globally for the CPU,
>>>>> depending on what is being done, in order to not get into issues of
>>>>> cache inconsistency.
>>>>
>>>> This implies that stores were in a total order, then, and
>>>> these uncached instructions were serializing with respect to
>>>> other CPUs?
>>>
>>> The uncached instruction is basically there in order to be able to
>>> implement a spin lock that works as you would expect. Once you have the
>>> lock, then you either deal with data which is known to be shared, in
>>> which case you need to run with cache disabled, or you are dealing with
>>> data you know is not shared, in which case you can allow caching to work
>>> as normal.
>>>
>>> No data access to shared resources is allowed to be done without
>>> getting the lock first.
>> 
>> Sure.  But suppose I use the uncached instructions to implement
>> a lock around a shared data structure; I use the uncached instr
>> to grab a lock, I modify (say) a counter in "normal" memory with
>> a "normal" instruction and then I use the uncached instruction
>> to release the lock.  But what about the counter value?  Is its
>> new value --- the update to which was protected by the lock ---
>> immediately visible to all other CPUs?
>
>Like I said - if you are dealing with shared data, even after you get 
>the lock, you then need to turn off the cache while working on it. So 
>basically, such updates would immediately hit main memory. And any reads 
>of that data would also be done with cache disabled, so you get the 
>actual data. So updates are always immediately visible to all CPUs.

Oh, I understand now: I had missed that if you wanted the data
protected by the lock to be immediately visible to the other
CPUs you had to disable caching (which, I presume, would flush
cache contents back to RAM).  Indeed, it's impressive that they
were able to do that back then.

>>> Hey, it's an old system by today's standards. Rather primitive. I think
>>> it's cool DEC even made it, and it works, and gives pretty acceptable
>>> performance. But it was noted that going much above 4 CPUs really gave
>>> diminishing returns. So the 11/74 never supported more than 4 CPUs. And
>>> it gave/gives about 3.5 times the performance of a single 11/70.
>> 
>> I can see that.  Very cool, indeed.
>
>It was nice that they made it work, considering the limitations. Also, 
>they took an existing OS (RSX) and made rather small modifications to it 
>in order to make it work MP. Sure, you basically have one big lock, but 
>the system was for the most part designed in such a way that most things 
>were already safe.

Cool.

	- Dan C.



