[Info-vax] OpenVMS async I/O, fast vs. slow

Ian Miller gxys at uk2.net
Fri Nov 3 06:17:04 EDT 2023


On Friday, November 3, 2023 at 1:02:33 AM UTC, Jake Hamby (Solid State Jake) wrote:
> I've become a little obsessed with the question of how well OpenVMS performs relative to Linux inside a VM, under different conditions. My current obsession is the libuv library, which provides a popular async I/O abstraction layer implemented for all the different flavors of UNIX that have async I/O, as well as for Windows. What might a VMS version look like? How many cores could it scale up to without too much synchronization overhead?
> 
> Alternatively, for existing VMS apps, how might they be sped up on non-VAX hardware? Based on the mailbox copy driver loop in the VMS port of Perl, which I spent some time piecing together, I've noticed a few patterns that can't possibly perform well on any hardware newer than Alpha, and maybe not on Alpha either.
> 
> My first concern is the use of VAX interlocked queue instructions, because they're system calls on Itanium and x86 (and PALcode on Alpha). They can't possibly run as fast as VMS programmers may be assuming based on what made sense on a VAX 30 years ago. The current hotness is to use lock-free data structures as much as possible. 
> 
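
For what it's worth, the kind of lock-free replacement being described can be sketched with C11 atomics instead of the interlocked-queue builtins. All of the names below are invented for the sketch; producers push with a compare-and-swap, and a single consumer detaches the whole list with one exchange, which also sidesteps the ABA problem:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Intrusive node; the payload would hang off of this. */
    struct node {
        struct node *next;
    };

    static _Atomic(struct node *) pending_head = NULL;

    /* Lock-free push, safe from multiple producers. */
    static void push(struct node *n)
    {
        struct node *old = atomic_load_explicit(&pending_head, memory_order_relaxed);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak_explicit(&pending_head, &old, n,
                     memory_order_release, memory_order_relaxed));
    }

    /* Single consumer detaches the whole list in one shot (newest first),
       avoiding the ABA hazard of popping nodes one at a time. */
    static struct node *pop_all(void)
    {
        return atomic_exchange_explicit(&pending_head, NULL, memory_order_acquire);
    }
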
> My next thought was that it makes sense to build your I/O around shared memory and atomic ops as much as possible. I think Linux's io_uring presents a good model for how a user-level VMS I/O completion buffer might look. There's no way to influence how the async I/O requests themselves are issued, so there's no equivalent of a circular submission buffer, but you have a lot of flexibility in handling the completion events.
> 
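
A user-level completion ring along those lines might be sketched like this, assuming one producer (ASTs delivered serially on the issuing kernel thread) and one consumer (the event loop). The structure and names are invented here, not anything VMS or libuv defines:

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 256               /* must be a power of two */

    struct completion {                 /* what the producer records per finished I/O */
        uint64_t user_data;
        uint32_t status;
    };

    struct cq_ring {
        struct completion slots[RING_SIZE];
        _Atomic uint32_t head;          /* consumer position */
        _Atomic uint32_t tail;          /* producer position */
    };

    /* Producer side (e.g. the completion AST): returns 0 if the ring is full. */
    static int cq_push(struct cq_ring *r, struct completion c)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return 0;
        r->slots[tail & (RING_SIZE - 1)] = c;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 1;
    }

    /* Consumer side (the event loop): returns 0 if the ring is empty. */
    static int cq_pop(struct cq_ring *r, struct completion *out)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return 0;
        *out = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 1;
    }
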
> I also think it does make sense to have a shared pool of IOSBs and to avoid dynamic memory allocation as much as possible. malloc() vs. stack memory allocation is another area I'm curious about. We're not in the embedded era with 64KB stack sizes or anything like that, so if you know that an array will be a fixed 4KB or 8KB or 64KB max, then why not put it on the stack? That seems like it'd be helpful on anything post-VAX. The circular buffers that io_uring uses are also a good template. 
> 
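
A rough sketch of that fixed-pool idea, assuming the IOSB typedef from iosbdef.h; all other names are invented, and a real pool would need interlocking (or one pool per event loop) if requests can be allocated from more than one thread:

    #include <iosbdef.h>    /* IOSB: the quadword I/O status block */

    #define MAX_INFLIGHT 64

    struct req_slot {
        IOSB   iosb;                          /* status block for this request */
        void (*on_done)(struct req_slot *);   /* completion callback */
        void  *ctx;                           /* caller context */
    };

    /* Everything preallocated: no malloc() on the I/O path. */
    static struct req_slot  slot_pool[MAX_INFLIGHT];
    static struct req_slot *free_list[MAX_INFLIGHT];
    static int free_top;

    static void pool_init(void)
    {
        for (int i = 0; i < MAX_INFLIGHT; i++)
            free_list[free_top++] = &slot_pool[i];
    }

    static struct req_slot *slot_alloc(void)
    {
        return free_top > 0 ? free_list[--free_top] : NULL;   /* NULL = back-pressure */
    }

    static void slot_free(struct req_slot *s)
    {
        free_list[free_top++] = s;
    }
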
> In general, what's the fastest native way to queue and handle async commands? I think setting the event flag to EFN$C_ENF (don't care) and using the AST and AST parameter would be the quickest way to match up a completed I/O to some useful data. There are a lot of possibilities for how to encode things into that 64-bit AST parameter so that it can post the right completion code to the right completion queue. 
> 
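
That pattern might look roughly like the following with $QIO: EFN$C_ENF as the event flag, and the AST parameter carrying a pointer to the request's slot so the AST can find its IOSB and context immediately. The req_slot structure is the invented one from the pool sketch above, and the exact casts needed depend on the starlet.h prototypes:

    #include <starlet.h>     /* sys$qio */
    #include <efndef.h>      /* EFN$C_ENF */
    #include <iodef.h>       /* IO$_READVBLK */
    #include <iosbdef.h>     /* IOSB */

    struct req_slot {                         /* matches the pool sketch above */
        IOSB   iosb;
        void (*on_done)(struct req_slot *);
        void  *ctx;
    };

    /* Completion AST: the AST parameter is the slot pointer passed to $QIO. */
    static void io_done_ast(struct req_slot *slot)
    {
        /* push (slot, slot->iosb status) onto the completion ring here,
           then wake the event loop if it is sleeping */
    }

    static unsigned int submit_read(unsigned short chan, void *buf,
                                    unsigned int len, unsigned int vbn,
                                    struct req_slot *slot)
    {
        return sys$qio(EFN$C_ENF,                  /* "don't care" event flag */
                       chan, IO$_READVBLK,
                       &slot->iosb,                /* per-request IOSB from the pool */
                       (void (*)()) io_done_ast,   /* completion AST */
                       (__int64) slot,             /* 64-bit AST parameter */
                       buf, len, vbn, 0, 0, 0);
    }
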
> The problem with ASTs, though, is that they don't play well with pthreads, and you might want to use threads with libuv (Node.js is essentially single-threaded, using the default event loop only, but libuv does provide for threads and a few of its calls are thread-safe).
> 
> The safest option on VMS, and a popular one, is to use local event flags. I've modified the simplest libuv async benchmark to use pairs of LEFs to ping-pong back and forth, and it looks like it's about 1/5 the speed of Linux in the same VM: roughly 10K per second instead of 50K. The same test on bare metal runs at around 250K per second, so the VM makes a huge difference.
> 
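
A stripped-down version of that ping-pong might look like the sketch below (two flags allocated with LIB$GET_EF, one thread echoing the other; error checking and the libuv plumbing omitted). One caveat: $WAITFR stalls the calling kernel thread, so the numbers will depend on how the threads library schedules the two threads onto kernel threads:

    #include <lib$routines.h>   /* lib$get_ef */
    #include <starlet.h>        /* sys$setef, sys$clref, sys$waitfr */
    #include <pthread.h>
    #include <stdio.h>

    static unsigned int ef_ping, ef_pong;

    static void *responder(void *arg)
    {
        (void) arg;
        for (;;) {
            sys$waitfr(ef_ping);        /* wait for the ping... */
            sys$clref(ef_ping);
            sys$setef(ef_pong);         /* ...and answer with a pong */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        lib$get_ef(&ef_ping);           /* allocate two local event flags */
        lib$get_ef(&ef_pong);
        pthread_create(&t, NULL, responder, NULL);

        for (int i = 0; i < 100000; i++) {
            sys$setef(ef_ping);
            sys$waitfr(ef_pong);
            sys$clref(ef_pong);
        }
        printf("100000 round trips done\n");
        return 0;
    }
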
> As an aside, I'd avoid KVM/QEMU in favor of VirtualBox to run OpenVMS, for performance reasons. I ran some of the Java benchmarks I posted here recently under KVM and, while I didn't compute exact numbers, most of them seem to run at around half the speed of VirtualBox. I suppose it's a good thing I started out with VirtualBox and only recently converted my .vdi files to .qcow2 format to test in KVM. I have no idea how fast VMware ESXi is.
> 
> I think there must be a safe way to wake a POSIX thread that's sleeping on a condition variable without possibly sleeping on the associated mutex. In general, for an async event loop, there'll be only one thread waiting for completion, so I'm thinking it may be safe to use an atomic variable as a counter and then wake the sleeping thread only if the previous value was 0. I'll have to think about the details. The AST callback can't ever block, especially since it's going to be occurring on the kernel thread that issued the request (that's my understanding, anyway), which has an event loop that may or may not be sleeping waiting for a completion event. 
> 
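
One way to sketch that wakeup, with the caveats just raised: the waker only signals when the counter goes from zero to non-zero, and the waiter uses a timed wait so that a wakeup lost in the window between its check and its wait costs at most a bounded stall. If the waker really is an AST, the threads library provides pthread_cond_signal_int_np for signaling from AST level, if memory serves; the plain pthread_cond_signal below is just for the sketch, and all names are invented:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <time.h>

    /* One waiter (the event loop); wakers are completion routines elsewhere.
       "pending" counts completions pushed but not yet consumed. */
    static _Atomic unsigned pending = 0;
    static pthread_mutex_t  wake_mtx  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t   wake_cond = PTHREAD_COND_INITIALIZER;

    /* Waker: never blocks on the mutex.  Signaling without holding the mutex
       is legal but can race with the waiter between its check and its wait;
       the timed wait below bounds the cost of such a lost wakeup. */
    static void wake_loop(void)
    {
        if (atomic_fetch_add_explicit(&pending, 1, memory_order_release) == 0)
            pthread_cond_signal(&wake_cond);
    }

    /* Waiter: called by the event loop when it has nothing else to do. */
    static void wait_for_work(void)
    {
        pthread_mutex_lock(&wake_mtx);
        while (atomic_load_explicit(&pending, memory_order_acquire) == 0) {
            struct timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            ts.tv_sec += 1;             /* re-check at least once a second */
            pthread_cond_timedwait(&wake_cond, &wake_mtx, &ts);
        }
        pthread_mutex_unlock(&wake_mtx);

        atomic_store_explicit(&pending, 0, memory_order_relaxed);
        /* drain the completion ring here */
    }
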
> The other detail constantly in my mind is that the DECthreads library doesn't have a 1:1 mapping between pthreads and kernel threads. In fact, it can only create kernel threads up to the number of active CPUs, and then it has to multiplex the rest on top of those. That's why I'm thinking it may be best to look at how Linux and others are doing their I/O, so as to avoid falling into patterns that made sense on a single-core VAX in the late 1990s.
> 
> I'll add more detail here as I learn more about what works and what doesn't. This rant and project were largely inspired by a talk Mark Russinovich gave in 2004 comparing the Linux and Windows kernels. Back then, Windows NT's I/O completion ports were seen as better than what Linux had to offer, because you can set up a thread pool and the OS wakes up one thread to handle each response. NT 3.5 introduced the feature in 1994, and Microsoft was granted a patent on it that didn't expire until 2014.
> 
> Ironically, the pendulum has since swung in the opposite direction, probably due to memory cache contention issues in SMP, toward having a single async I/O thread with everything inside its data structures being non-thread-safe, so Windows' IOCPs don't even seem relevant to me today. In theory, OpenVMS can support the modern paradigm quite well, because it was designed that way 45 years ago. In practice, I think good performance is achievable, especially in a few years when the x86 port is more optimized.
> 
> Regards, 
> Jake Hamby

Have you looked at the Fast I/O routines? https://docs.vmssoftware.com/vsi-openvms-io-user-s-reference-manual/#ch10.html


