[Info-vax] OpenVMS async I/O, fast vs. slow
Johnny Billquist
bqt at softjar.se
Sat Nov 4 07:11:59 EDT 2023
On 2023-11-03 15:08, Arne Vajhøj wrote:
> On 11/2/2023 9:02 PM, Jake Hamby (Solid State Jake) wrote:
>> I've become a little obsessed with the question of how well OpenVMS
>> performs relative to Linux inside a VM, under different conditions.
>> My current obsession is the libuv library which provides a popular
>> async I/O abstraction layer implemented for all the different flavors
>> of UNIX that have async I/O, as well as for Windows. What might a VMS
>> version look like? How many cores could it scale up to without too
>> much synchronization overhead?
>>
>> Alternatively, for existing VMS apps, how might they be sped up on
>> non-VAX hardware? Based on the mailbox copy driver loop in the VMS
>> port of Perl that I spent some time piecing together, I've noticed a
>> few patterns that can't possibly perform well on any hardware newer
>> than Alpha, and maybe not on Alpha either.
>
> The normal assumption regarding speed of disk IO would be that:
>
> RMS record IO ($GET and $PUT) < RMS block IO ($READ and $WRITE) <
> $QIO(W) < $IO_PERFORM(W) < memory mapped file
>
> (note that assumption and fact are spelled differently)
I'm not sure I have ever understood why people think memory mapped files
would be faster than a QIO under VMS.
With memory mapped I/O, what you essentially get is that I/O transfers
go directly between the disk and user memory in a single operation. There
are no intermediate buffers and no additional copying - which is otherwise
what you pretty much always have on Unix systems.
However, a QIO under VMS is already a direct transfer between physical
memory and the device, with no intermediate buffers or additional
copying, unless I'm confused (and VMS changed compared to RSX
here...).
So how would memory mapped I/O be any faster? You basically cannot be
any faster than one DMA transfer. In fact, with memory mapped I/O you
may also be paying for page fault handling, and for reading in a full
page when you needed less than that, which causes some overhead as
well.
Also, what does $IO_PERFORM do that could possibly make it faster than $QIO?
I'm really curious about this topic...
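Just to make "one operation" concrete, this is roughly the kind of thing
I mean (a minimal, untested C sketch; it assumes "chan" has already been
assigned to the disk/file with $ASSIGN, and the exact starlet.h
prototypes may want a cast or two):

/* One async virtual-block read straight into a user buffer via $QIO.
   Untested sketch; assumes the channel is already assigned. */
#include <starlet.h>
#include <iodef.h>
#include <iosbdef.h>

static IOSB iosb;
static char buf[8192];

void read_done(__int64 astprm)
{
    /* AST fires at completion; iosb.iosb$w_status and iosb.iosb$w_bcnt
       hold the final status and byte count. */
}

int start_read(unsigned short chan, unsigned int vbn)
{
    return sys$qio(0, chan, IO$_READVBLK,
                   &iosb, read_done, 0,
                   buf, sizeof buf,      /* P1/P2: buffer and length      */
                   vbn, 0, 0, 0);        /* P3: starting virtual block    */
}

The driver transfers straight into buf in one operation; there is no
extra copy for memory mapping to remove in the first place.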
> All of it can relatively easily be done async as VMS has done async IO
> since forever.
Indeed. And it does it in a most wonderful way. Unfortunately, most
people never make full use of it, and it's often not that easy to use
from higher level languages.
>> My first concern is the use of VAX interlocked queue instructions,
>> because they're system calls on Itanium and x86 (and PALcode on
>> Alpha). They can't possibly run as fast as VMS programmers may be
>> assuming based on what made sense on a VAX 30 years ago. The current
>> hotness is to use lock-free data structures as much as possible.
>
> These constructs are certainly going to be relatively
> more costly on newer systems than on VAX.
>
> But I doubt that it matters much for overall performance.
Agreed. I doubt much performance is tied up in these bits.
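For reference, what is usually meant by "lock-free" today is a
compare-and-swap loop like the following (a rough, portable C11 sketch;
nothing here is a VMS API, and the names are made up for illustration):

/* Lock-free LIFO push with C11 atomics - the sort of construct that
   replaces an interlocked queue insert on modern hardware. */
#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    /* payload... */
};

static _Atomic(struct node *) head = NULL;

void push(struct node *n)
{
    struct node *old = atomic_load_explicit(&head, memory_order_relaxed);
    do {
        n->next = old;   /* old is refreshed by the CAS on each failure */
    } while (!atomic_compare_exchange_weak_explicit(
                 &head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

An INSQHI gives you the same effect in one interlocked operation on a
VAX, while (as Jake notes) it goes through PALcode or a system service
on the later architectures. Whether any of that is visible outside a
microbenchmark is exactly what I doubt.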
>> I also think it does make sense to have a shared pool of IOSBs and to
>> avoid dynamic memory allocation as much as possible. malloc() vs.
>> stack memory allocation is another area I'm curious about. We're not
>> in the embedded era with 64KB stack sizes or anything like that, so
>> if you know that an array will be a fixed 4KB or 8KB or 64KB max,
>> then why not put it on the stack? That seems like it'd be helpful on
>> anything post-VAX. The circular buffers that io_uring uses are also a
>> good template.
>
> I have never seen VMS code dynamically allocate an IOSB on the heap -
> it is always either on the stack or static.
I have, but I don't think it matters much where your IOSB is. I'm not
sure what problem is being addressed here...
> But again, I don't think the location of an IOSB or the reuse of IOSBs
> will matter much for overall performance.
Me neither. But a common trick (at least in my corner of the world) is
to make the IOSB part of some larger structure holding context and
various other information. That way, when the AST is called, you
effectively get the address of this context, since the address of the
IOSB is what is passed on to the AST handler.
No need to have anything locked, managed or shared around this. Each
IOSB, as well as the whole data structure, can be pretty private from
the start.
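Roughly like this (a sketch only; the layout, field names and the "done"
callback are mine for illustration, not from any library, and it is
untested):

/* The IOSB lives first in a per-request context block, and its address
   is passed both as the IOSB and as the AST parameter, so the AST can
   recover the whole context with a simple cast. */
#include <starlet.h>
#include <iodef.h>
#include <iosbdef.h>

struct io_ctx {
    IOSB  iosb;                     /* first member: &ctx->iosb == ctx   */
    char  buf[4096];
    void (*done)(struct io_ctx *);  /* per-request completion routine    */
};

void io_ast(__int64 astprm)
{
    struct io_ctx *ctx = (struct io_ctx *) astprm;
    ctx->done(ctx);                 /* full context back, no lookup      */
}

int start_io(unsigned short chan, struct io_ctx *ctx, unsigned int vbn)
{
    return sys$qio(0, chan, IO$_READVBLK,
                   &ctx->iosb, io_ast, (__int64) &ctx->iosb,
                   ctx->buf, sizeof ctx->buf, vbn, 0, 0, 0);
}

Since each request owns its own context block, there is nothing that
needs to be locked or shared.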
Johnny