[Info-vax] OpenVMS async I/O, fast vs. slow
Jake Hamby (Solid State Jake)
jake.hamby at gmail.com
Thu Nov 9 19:10:10 EST 2023
On Thursday, November 9, 2023 at 10:52:00 AM UTC-8, Johnny Billquist wrote:
> >> Cache lines are typically something like 128 bytes or so, so even though
> >> locality means there is some other data around, the owning CPU is
> >> unlikely to care about anything in that cache line, but that is
> >> speculation on my part.
> >
> > Cache line size on current x86 processors is 64 bytes, but yeah.
> I can't even keep track of that. It changes over time anyway, and is not
> even CPU specific. It's an implementation detail for the memory
> subsystem. :)
One "fun" feature of the PowerPC is that it has a "zero cache line" instruction as well as the usual cache line flush. How big is the cache line? It depends on the CPU. All the 32-bit PowerPCs that Apple used had a 32-byte cache line size, until they switched to the G5, which, like other recent 64-bit POWER CPUs, has a 128-byte cache line size.
As you might guess, that change broke a lot of PowerPC Mac programs that blithely assumed the cache line size was 32 bytes. Or it would have, if Apple hadn't foreseen this issue and asked IBM to add a special processor flag to emulate a 32-byte cache size just for the dcbz instruction, which they enable when running 32-bit Mac programs.
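For the curious, here's what cache-line-size-agnostic zeroing looks like from C. This is a sketch of my own, assuming Linux on PowerPC, where the kernel reports the data cache block size in the aux vector (AT_DCACHEBSIZE) so you never have to hard-code 32 or 128:

    #include <stddef.h>
    #include <sys/auxv.h>   /* getauxval(), AT_DCACHEBSIZE (Linux/glibc) */

    /* Zero [p, p+len) one cache block at a time with dcbz.
       Assumes p and len are block-aligned and point at cacheable RAM. */
    static void zero_with_dcbz(void *p, size_t len)
    {
        /* 32 on the old 32-bit parts, 128 on the 970/G5 */
        unsigned long bsize = getauxval(AT_DCACHEBSIZE);

        for (char *cp = p; cp < (char *)p + len; cp += bsize)
            __asm__ volatile("dcbz 0,%0" : : "r"(cp) : "memory");
    }

Hard-code bsize as 32, the way those old Mac programs effectively did, and on a G5 each "32-byte" zero silently wipes 128 bytes.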
More trivia: the G5 (PPC 970) is the only PowerPC CPU I know of that can't run in little-endian mode; the endian-mode bit simply doesn't work. In other respects, it's roughly equivalent to a POWER4. IBM kept adding new CPU instructions, so it's hit-or-miss whether toolchains and JIT compilers will still generate code for a machine that old, and projects like Node.js, Go, Rust, etc. have explicitly decided to drop support for anything older than POWER7 or POWER8. Even AIX has just recently dropped support for older CPUs.
Linux has a very fancy version of KVM for PowerPC/POWER. On their most recent CPUs, there's an "ultravisor" above the hypervisor, and new interrupt controllers to push large amounts of data in and out of the guest VM.
What's more amusing for me, as a hobbyist, is that there was a lot of work on this feature circa 2005-2014, when PowerPC was more popular and being used in game consoles (including "Other OS" for the PS3, RIP), so there's a special KVM for PowerPC chips without a hypervisor mode, which emulates the MMU in software. In IBM fashion, it's called "KVM-PR" for "problem state" (i.e. "user mode"). It runs in kernel mode, not in hypervisor mode like "KVM-HV", the one you'd be using on POWER6 and later.
KVM (both versions) actually handles this 32-byte cache line issue when running in 32-bit mode either by flipping the compatibility bit on the G5, or, on every other 64-bit CPU, by rewriting those instructions into an illegal opcode whenever it loads a page to execute, and then zeroing 32 bytes (what dcbz would have done) when the CPU traps on the replacement instruction.
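Roughly, that rewrite-and-emulate trick looks like this in C. This is my own simplified sketch, not the actual KVM code; the opcode constants follow the Power ISA X-form encoding of dcbz (primary opcode 31, extended opcode 1014), and the trap-handler plumbing is left out:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define DCBZ_MASK   0xFC0007FEu   /* primary opcode + extended opcode fields */
    #define DCBZ_MATCH  0x7C0007ECu   /* dcbz: primary 31, XO 1014 */
    #define TRAP_MARKER 0x00000FECu   /* primary opcode 0 is illegal: our stand-in */

    /* Scan a guest page as it's mapped for execution and patch every
       dcbz so it traps instead of running natively. */
    static void patch_dcbz_page(uint32_t *insn, size_t n_insns)
    {
        for (size_t i = 0; i < n_insns; i++)
            if ((insn[i] & DCBZ_MASK) == DCBZ_MATCH)
                insn[i] = TRAP_MARKER | (insn[i] & 0x03FFF800u); /* keep RA/RB */
    }

    /* From the illegal-instruction trap: emulate dcbz with the 32-byte
       semantics the old 32-bit code expects. `ea` is the effective
       address recomputed from the preserved RA/RB fields. */
    static void emulate_dcbz32(char *ea)
    {
        memset((void *)((uintptr_t)ea & ~(uintptr_t)31), 0, 32);
    }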
> > Well getting back to that.... One of the things that I found
> > rather odd was that that discussion seems to conflate macro- and
> > micro-level optimizations in an odd way. I mean, it went from
> > talking about efficient ways to retrieve data from a secondary
> > storage device that is orders of magnitude slower than the CPU
> > to atomic operations and reducing mutex contention, which seems
> > like for most IO-bound applications will be in the noise.
> True.
> >> Beyond that, I'm not sure if we are arguing about something much, or
> >> basically nitpicking. :-)
> >
> > This is true. :-) It's fun, though! Also, I've found that not
> > understanding how this stuff works in detail can have really
> > profound performance implications. Simply put, most programmers
> > don't have good intuition here.
It's interesting to think about the problem at both long and short time scales. Hacking on the libuv code has been very educational for me, because it uses atomic ops correctly where they make sense, spinlocks where they make sense, and other forms of waiting when the event loop is blocked waiting for I/O completions to arrive.
There probably isn't much benefit to optimizing the callbacks in the event loop for a particular program, because it only loops once per batch of I/O events. That could mean minutes, hours, or days between iterations if you're not getting any events, or hundreds of thousands of iterations per second on a busy server, with each iteration handling all the I/O completions that arrived while the previous pass was working through the queue.
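To make the shape of that loop concrete, here's a minimal libuv program (my own illustration, not code from libuv itself). A timer stands in for whatever I/O a real server would be waiting on; uv_run() sleeps in the kernel and wakes once per batch of ready events:

    #include <stdio.h>
    #include <uv.h>

    static void on_tick(uv_timer_t *handle)
    {
        /* In a real server this would be an I/O completion callback.
           uv_run() invokes every callback whose event has arrived,
           then goes back to sleep until the next batch. */
        static int ticks;
        printf("tick %d\n", ++ticks);
        if (ticks == 3)
            uv_close((uv_handle_t *)handle, NULL); /* last handle closed => loop exits */
    }

    int main(void)
    {
        uv_loop_t loop;
        uv_timer_t timer;

        uv_loop_init(&loop);
        uv_timer_init(&loop, &timer);
        uv_timer_start(&timer, on_tick, 0, 100); /* fire now, then every 100 ms */

        uv_run(&loop, UV_RUN_DEFAULT); /* returns when no active handles remain */
        uv_loop_close(&loop);
        return 0;
    }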
It's taking a bit of creativity to craft my own data structures and write code to replace the UNIX and Windows versions (the UNIX code has generic and OS-specific functions and macros that fill out the data structs).
Getting back to I/O in general, I reread the DEC Ada manuals, partly for nostalgia but also to form an opinion about what did and didn't work in that language. The VMS implementation was quite good, but Ada 83 had some very awkward and tedious aspects that slowly got streamlined and improved. There's now an Ada 2022, a further refinement of Ada 2012.
When I was a student, Ada's I/O libraries caused everyone a lot of grief, because the language specifies that text and terminal I/O have end-of-line, end-of-page, and end-of-file markers, with End_Of_Line, End_Of_Page, and End_Of_File functions that return different boolean values as you work through a file.
The other really ugly language limitation was the lack of a heap-allocated, variable-length string type, so you had to know the length of each string, or set a maximum length and keep the actual length in a separate variable, and so on. That also means the Ada bindings to the VMS system routines can only handle Class S strings (static length) and not the dynamic-length variety. Ada 95 added unbounded-length strings that you can assign and modify the way you'd expect from a standard library as big as Ada's, so in theory at least, GNAT Ada could have a more elegant binding to OpenVMS than any of the other "layered product" languages.
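For anyone who hasn't used them, the Class S/Class D distinction lives in the VMS string descriptor itself. Here's a small C sketch using the usual <descrip.h> definitions and LIB$ string routines (the buffer and lengths are example values of mine):

    #include <descrip.h>
    #include <lib$routines.h>

    int main(void)
    {
        /* Class S: static length -- the caller owns a fixed buffer. */
        char buf[32] = "hello from a class S string";
        struct dsc$descriptor_s sdesc;
        sdesc.dsc$w_length  = 27;              /* actual length, tracked by hand */
        sdesc.dsc$b_dtype   = DSC$K_DTYPE_T;
        sdesc.dsc$b_class   = DSC$K_CLASS_S;
        sdesc.dsc$a_pointer = buf;

        /* Class D: dynamic length -- the RTL owns the storage, so a
           callee can hand back a string of any length. */
        struct dsc$descriptor_d ddesc = { 0, DSC$K_DTYPE_T, DSC$K_CLASS_D, 0 };

        lib$scopy_dxdx(&sdesc, &ddesc);  /* (re)allocates ddesc to fit */
        lib$sfree1_dd(&ddesc);           /* release the dynamic storage */
        return 0;
    }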
Ada is an ideal language for async I/O because it has had tasking, with synchronization and synchronized data structures, since the beginning, and they've improved with each revision. But that's not why I brought it up just now.
One quirk I remember from my student days: DEC Ada added its own "ADA$INPUT" and "ADA$OUTPUT" logicals, which it uses in preference to "SYS$INPUT" and "SYS$OUTPUT" when they're defined. That helped for testing Ada programs without redirecting any other I/O.
More relevant to the discussion, DEC Ada also used RMS services for all forms of I/O, including terminal I/O, which you can do. The RMS manual talks about using RMS for terminal I/O, and it seems like a bizarre thing to want to do, but for any code that has to work with both files and terminals, I can absolutely see why it's easier to route everything through the same API, and that API is going to be RMS if you want to support the RMS file and record I/O features.
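To show how uniform that is, here's a minimal RMS record-read loop in C (my own sketch, assuming the usual <rms.h> and <starlet.h> declarations). Point the FAB at SYS$INPUT: and it reads terminal lines; point it at a disk file and it reads records, with no change to the loop:

    #include <rms.h>
    #include <starlet.h>
    #include <stdio.h>

    int main(void)
    {
        char name[] = "SYS$INPUT:";   /* could just as well be a disk file */
        char buf[512];

        struct FAB fab = cc$rms_fab;  /* file access block */
        struct RAB rab = cc$rms_rab;  /* record access block */

        fab.fab$l_fna = name;
        fab.fab$b_fns = sizeof(name) - 1;

        rab.rab$l_fab = &fab;
        rab.rab$l_ubf = buf;          /* user buffer for sys$get() */
        rab.rab$w_usz = sizeof(buf);

        if (!(sys$open(&fab) & 1))    return 1;
        if (!(sys$connect(&rab) & 1)) return 1;

        /* One record per sys$get(): a line from the terminal, or a
           record from a file -- this loop can't tell the difference. */
        while (sys$get(&rab) & 1)
            printf("%.*s\n", rab.rab$w_rsz, rab.rab$l_rbf);

        sys$close(&fab);
        return 0;
    }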