[Info-vax] OpenVMS async I/O, fast vs. slow

Jake Hamby (Solid State Jake) jake.hamby at gmail.com
Sun Nov 5 17:25:50 EST 2023


On Saturday, November 4, 2023 at 2:44:30 PM UTC-7, Arne Vajhøj wrote:
> On 11/4/2023 7:11 AM, Johnny Billquist wrote: 
> > 
> > I'm not sure I have ever understood why people think memory mapped files 
> > would be faster than a QIO under VMS.
> Very few layers. 
> 
> Large degree of freedom to the OS about how to read.
> > With memory mapped I/O, what you essentially get is that I/O transfers 
> > go directly from/to disk to user memory with a single operation. There 
> > are no intermediate buffers, no additional copying. Which is what you 
> > pretty much always have otherwise on Unix systems. 
> > 
> > However, a QIO under VMS is already a direct communication between the 
> > physical memory and the device with no intermediate buffers, additional 
> > copying or whatever, unless I'm confused (and VMS changed compared to 
> > RSX here...).
> XFC?
> > So how would memory mapped I/O be any faster? You basically cannot be 
> > any faster than one DMA transfer. In fact, with memory mapped I/O, you 
> > might be also hitting the page fault handling, and a reading in of a 
> > full page, which might be more than you needed, causing some overhead as 
> > well.
> Fewer layers to go through. More freedom to read ahead.
> > Also, what do $IO_PERFORM do, that could possibly make it faster than QIO?
> $QIO(W) is original. $IO_PERFORM(W) was added much later. 
> 
> $IO_PERFORM(W) is called fast path IO. The name and the fact 
> that it was added later hint at it being faster. 
> 
> That name has always given me associations to a strategy of 
> doing lots of checks upfront and then skip layers 
> and checks when doing the actual reads/writes. But I 
> have no idea if that is actually what it does. 

I think you've summed up the open questions and tradeoffs with VMS I/O. Memory-mapped I/O could be much faster under certain access patterns but slower under others; it's great for random access patterns that can leverage the page cache. $IO_PERFORM(W) seems primarily to address $QIO having too many arguments to hold in registers (anything beyond 6 gets pushed onto the stack) and to eliminate a few spinlocks (as Stephen Hoffman quoted in his reply).

I pushed some work in progress to my GitHub repo for the VMS port of libuv that I've started. I'm not sure whether it should be considered a bug in DECthreads that it behaves so impossibly poorly on the async ping test, which writes a byte to a pipe to wake a child sleeping in poll() and then sleeps in poll() itself to wait for the reply on a pipe going the other way. I'm getting no more than 60 round trips/sec, even after commenting out the sched_yield() call before it goes back to sleep. With the POSIX APIs this unworkably slow, I didn't bother trying to run the test cases.
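
The pattern is roughly the following (my own reconstruction for illustration, not libuv's actual benchmark code): each side sleeps in poll() on one pipe and wakes the other side by writing a byte to the second pipe.

/* Reconstruction of the ping-pong pattern described above. */
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int ping[2], pong[2];  /* ping: main -> worker, pong: worker -> main */

static void *worker(void *arg)
{
    struct pollfd pfd;
    char c;

    (void)arg;
    pfd.fd = ping[0];
    pfd.events = POLLIN;
    for (;;) {
        poll(&pfd, 1, -1);      /* sleep until the main thread pokes us */
        read(ping[0], &c, 1);
        write(pong[1], &c, 1);  /* poke it back */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    struct pollfd pfd;
    char c = 'x';
    int i, rounds = 10000;

    pipe(ping);
    pipe(pong);
    pthread_create(&tid, NULL, worker, NULL);

    pfd.fd = pong[0];
    pfd.events = POLLIN;
    for (i = 0; i < rounds; i++) {
        write(ping[1], &c, 1);  /* wake the worker... */
        poll(&pfd, 1, -1);      /* ...and sleep until it answers */
        read(pong[0], &c, 1);
    }
    printf("%d round trips\n", rounds);
    return 0;
}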

My guess would be that VMS's user-mode thread scheduler doesn't immediately realize that one of the benchmark process's threads has signaled another to wake up from a poll() by writing to a pipe shared between them, so it goes to sleep on a timer before eventually waking up. The poll() version of the benchmark uses almost no CPU because it's mostly sleeping. You can get instant inter-thread wakeup with local event flags; as I mentioned at the start of this thread, I saw over 10K round trips/sec with a hacked-up version of libuv's async benchmark testing the do-nothing wake-up case.
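
The event-flag path is essentially this (untested sketch; LIB$GET_EF, SYS$SETEF, SYS$CLREF, and SYS$WAITFR are the real services, the wrapper names are mine):

#include <lib$routines.h>
#include <starlet.h>

static unsigned int wake_efn;

void loop_init(void)
{
    lib$get_ef(&wake_efn);      /* allocate a local event flag */
    sys$clref(wake_efn);
}

void loop_wake(void)            /* called from another thread (or an AST) */
{
    sys$setef(wake_efn);
}

void loop_block(void)
{
    sys$waitfr(wake_efn);       /* returns as soon as the flag is set */
    sys$clref(wake_efn);        /* re-arm for the next wakeup */
}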

What makes libuv such an interesting test is that the naive port using the C RTL's synchronous POSIX APIs is almost unusably bad, and even in the best case it has too many unneeded layers of emulation. But by using the Win32 and Linux io_uring implementations as a model, it's possible to get acceptable behavior using all native VMS APIs, with completion ASTs waking up the event loops via one local event flag per event loop. App servers like Node.js have either a single event loop or one per core, so VMS's limit of 48 (64 - 8 reserved) should be more than adequate.
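
The completion model I have in mind looks roughly like this (hypothetical sketch: the structs, names, and the function-pointer cast are mine; only SYS$QIO, SYS$SETEF, and IO$_READVBLK come from the system):

#include <starlet.h>
#include <iodef.h>

typedef struct loop_s {
    unsigned int efn;           /* allocated with LIB$GET_EF at loop init */
    /* ... queue of completed requests ... */
} loop_t;

typedef struct req_s {
    unsigned short iosb[4];     /* I/O status block for this request */
    struct loop_s *loop;
    /* ... buffer, offset, user callback ... */
} req_t;

static void on_io_done(req_t *req)      /* delivered as an AST */
{
    /* ... move req onto its loop's completion queue ... */
    sys$setef(req->loop->efn);          /* wake the event loop thread */
}

int start_read(loop_t *loop, req_t *req, unsigned short chan,
               void *buf, unsigned int len, unsigned int vbn)
{
    req->loop = loop;
    return sys$qio(0, chan, IO$_READVBLK,
                   (struct _iosb *)req->iosb,   /* completion status */
                   (void (*)())on_io_done, req, /* AST + its argument */
                   buf, len, vbn, 0, 0, 0);
}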

So that gets back to RMS vs. $QIO(W)/$IO_PERFORM(W). Since nobody in the UNIX world cares about anything other than Stream_LF sequential files, there's nothing advanced that we need or want from RMS as far as record handling goes, but we do want to perform reads and writes of any size starting at any offset. If you use $QIO(W) then, like $IO_PERFORM(W), you must access file data by virtual block number (starting at block 1, not 0), so you're effectively doing raw disk I/O, except that the OS maps only the blocks of the file you have access to through your channel.
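
A synchronous virtual-block read on such a channel would look something like this (sketch only; the caller is assumed to deal with block alignment and sizes, and the twelve arguments are the "too many for registers" problem I mentioned above):

#include <starlet.h>
#include <iodef.h>

int read_vblk(unsigned short chan, void *buf,
              unsigned int nbytes, unsigned int start_vbn)
{
    unsigned short iosb[4];
    int status;

    status = sys$qiow(0,                    /* event flag */
                      chan,                 /* channel from the UFO open */
                      IO$_READVBLK,         /* virtual block read */
                      (struct _iosb *)iosb, /* final status lands here */
                      0, 0,                 /* no AST */
                      buf,                  /* P1: buffer address */
                      nbytes,               /* P2: byte count */
                      start_vbn,            /* P3: starting block, 1-based */
                      0, 0, 0);             /* P4-P6: unused */
    if (status & 1)
        status = iosb[0];                   /* condition value from the IOSB */
    return status;
}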

The reason C RTL and everyone else go through RMS is so they can do byte-level file I/O instead of block-level, and to take advantage of XFC caching. Everyone keeps talking about raw throughput and database servers and not the much more common use cases for file I/O.

For the purposes of libuv, it's convenient that Windows has a flag you must set if you're going to mmap() a file, while VMS likewise has a flag you have to set before using $CRMPSC. The VMS flag is FAB$V_UFO, for "user file open", and setting it forces you to use $QIO(W) (or $IO_PERFORM(W)) to access the file contents, because RMS essentially opens the file as "foreign" and handles only open, create, and close. So you can't use RMS to read/write a file and also mmap() it, or at least not through the same open channel.
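
The UFO open itself is short; here's an untested sketch with error handling omitted. RMS hands the bare channel back in FAB$L_STV.

#include <rms.h>
#include <starlet.h>
#include <string.h>

int ufo_open(const char *name, unsigned short *chan)
{
    struct FAB fab = cc$rms_fab;            /* prototype FAB */
    int status;

    fab.fab$l_fna = (char *)name;           /* file name string */
    fab.fab$b_fns = (unsigned char)strlen(name);
    fab.fab$b_fac = FAB$M_GET | FAB$M_PUT;  /* read and write access */
    fab.fab$l_fop |= FAB$M_UFO;             /* user file open */

    status = sys$open(&fab);
    if (status & 1)
        *chan = (unsigned short)fab.fab$l_stv;  /* channel comes back in STV */
    return status;
}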

Coincidentally, both VMS and Windows have a convenient "delete temp file on close" flag, which the VMS port of Perl also uses, but VMS has the additional quirk that you can make temp files that don't even get directory entries.

VMS is closer to UNIX in some ways and closer to Windows in others, while being distinct from either. For an abstraction layer like libuv, which needs to provide a file-descriptor-like API (except async), it makes sense to use only the RMS file access APIs, in async mode: going the raw virtual block read/write route would mean a lot of code and a lot of bugs, and it bypasses the XFC block cache, which is almost certainly not what users want.
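
In RMS terms, "async mode" mostly means setting the ASY bit on the RAB. A simplified, untested sketch (it assumes the FAB/RAB are already opened and connected; in the real port a success completion routine passed to SYS$GET would set the loop's event flag instead of blocking in SYS$WAIT as shown here):

#include <rms.h>
#include <starlet.h>

int read_record_async(struct RAB *rab, void *buf, unsigned short bufsize)
{
    int status;

    rab->rab$l_ubf = (char *)buf;       /* user buffer */
    rab->rab$w_usz = bufsize;           /* buffer size in bytes */
    rab->rab$l_rop |= RAB$M_ASY;        /* asynchronous operation */

    status = sys$get(rab);              /* queues the read and returns */
    /* ... the event loop could do other work here ... */
    if (status & 1)
        status = sys$wait(rab);         /* completion status in rab$l_sts */
    return status;
}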

The last detail I want to mention: if I want to support libuv's mapping of stdio file handles and fds to its own API, which is probably not used often, the ideal situation would be to get the underlying I/O channel for a C RTL UNIX file descriptor, and vice versa. But the only API remotely like this was added by customer request (I found the forum thread requesting it), and it returns the I/O channel for a FILE *, not for an fd. That call is decc$get_channel(__FILE_ptr32 fp).

From the header files, I see that <socket.h> has decc$socket_fd(int __channel) to turn an open socket channel into an fd. That will definitely be useful, since every type of file needs different special handling in libuv anyway. Just below it is decc$get_sdc(int __descrip_no) to go the other direction. If there aren't equivalent APIs for files and mailboxes/pipes, the fallback would be an internal two-way pipe between them, like Perl uses for spawning subprocesses.
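
Usage would be something like this (the names come from the headers and the forum thread; the return types, and where decc$get_channel is declared, are my guesses, so double-check them):

#include <socket.h>     /* decc$socket_fd, decc$get_sdc */
#include <stdio.h>

/* Wrap an already-assigned socket device channel in a CRTL fd: */
int fd_from_socket_channel(int chan)
{
    return decc$socket_fd(chan);
}

/* Recover the socket device channel behind a CRTL socket fd: */
unsigned short channel_from_socket_fd(int fd)
{
    return (unsigned short)decc$get_sdc(fd);
}

/* And the channel behind a stdio stream (a FILE *, not a plain fd);
   I haven't verified which header declares decc$get_channel: */
unsigned short channel_from_file(FILE *fp)
{
    return (unsigned short)decc$get_channel(fp);
}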

Regards,
Jake Hamby


