[Info-vax] OpenVMS async I/O, fast vs. slow

Dan Cross cross at spitfire.i.gajendra.net
Fri Nov 3 17:45:11 EDT 2023


In article <ff7f2845-84cc-4c59-a007-1b388c82543fn at googlegroups.com>,
Jake Hamby (Solid State Jake) <jake.hamby at gmail.com> wrote:
>I've become a little obsessed with the question of how well
>OpenVMS performs relative to Linux inside a VM, under different
>conditions. My current obsession is the libuv library which
>provides a popular async I/O abstraction layer implemented for
>all the different flavors of UNIX that have async I/O, as well
>as for Windows. What might a VMS version look like? How many
>cores could it scale up to without too much synchronization
>overhead?
>
>Alternatively, for existing VMS apps, how might they be sped up
>on non-VAX hardware? Based on the mailbox copy driver loop in
>the VMS port of Perl that I spent some time piecing together,
>I've noticed a few patterns that can't possibly perform well on
>any hardware newer than Alpha, and maybe not on Alpha either.
>
>My first concern is the use of VAX interlocked queue
>instructions, because they're system calls on Itanium and x86
>(and PALcode on Alpha).  They can't possibly run as fast as VMS
>programmers may be assuming based on what made sense on a VAX
>30 years ago. The current hotness is to use lock-free data
>structures as much as possible.

I don't know that that's the current hotness so much as trying
to structure problems so that the vast majority of accesses are
local to a core, so that either locks are unnecessary, or
taking a lock is just an uncontended write to an owned cache
line.

Consider the problem of a per-core queue with work-stealing;
the specifics of what's in the queue don't matter so much as the
overall structure of the problem: items in the queue might
represent IO requests, or they may represent runnable threads,
or whatever.  Anyway, a core will usually process things on
its own local queue, but if it runs out of things to do, it
may choose some victim and steal some work from it.
Conventionally, this will involve everyone taking locks on
these queues, but that's ok: we expect that work stealing is
pretty rare, and that usually a core is simply locking its own
queue, which will be an uncontended write on a cacheline it
already owns.  In any event, _most of the time_ the overhead
of using a "lock" will be minimal (assuming relatively simple
spinlocks or MCS locks or something like that).
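
To make the shape of that concrete, here's a rough sketch in C
with POSIX threads; the names (struct cpu_queue, pop_local(),
steal()) are mine, and a real implementation does rather more
(victim selection, bounded deques, and so on):

    #include <pthread.h>
    #include <stddef.h>

    #define NCPU 8

    struct work {
        struct work *next;
        void (*fn)(void *);
        void *arg;
    };

    /* One queue per core; the lock is almost always taken by its
       own core, i.e. an uncontended write to an owned cache line. */
    struct cpu_queue {
        pthread_mutex_t lock;
        struct work *head;
    } queues[NCPU];

    void queues_init(void)
    {
        for (int i = 0; i < NCPU; i++)
            pthread_mutex_init(&queues[i].lock, NULL);
    }

    /* Common case: pop from our own queue. */
    struct work *pop_local(int cpu)
    {
        struct cpu_queue *q = &queues[cpu];
        pthread_mutex_lock(&q->lock);
        struct work *w = q->head;
        if (w != NULL)
            q->head = w->next;
        pthread_mutex_unlock(&q->lock);
        return w;
    }

    /* Rare case: our queue is empty, so steal from a victim. */
    struct work *steal(int cpu)
    {
        struct cpu_queue *q = &queues[(cpu + 1) % NCPU];  /* naive victim */
        pthread_mutex_lock(&q->lock);    /* contended only while stealing */
        struct work *w = q->head;
        if (w != NULL)
            q->head = w->next;
        pthread_mutex_unlock(&q->lock);
        return w;
    }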

>My next thought was that it makes sense to build your I/O
>around shared memory and atomic ops as much as possible. I
>think Linux's io_uring presents a good model for how a
>user-level VMS I/O completion buffer might look. There's no
>ability to influence how you issue the async I/O requests, so
>there's no way to make a circular buffer for submissions, but
>you have a lot of flexibility in handling the completion
>events.

I disagree with the first sentence.  Contended atomic
operations become a scalability bottleneck very quickly; this
is generally why spinlocks behave so poorly once you have more
than a handful of cores contending for one.  Better would be to
structure the problem to minimize interaction between components
in the first place; there are a number of techniques one can take
advantage of here.  For example, one might have a thread (or
async task or whatever) per IO device, set up a number of IOs in
some worker thread, and then transfer the descriptors for those
IOs to the per-device thread _en masse_; by doing so, one
amortizes the expensive step of synchronizing between components
across many operations.
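
A sketch of that hand-off, again with POSIX threads; struct iodesc
and struct dev_inbox are made-up names, and the point is only that
the lock is taken once per batch rather than once per IO:

    #include <pthread.h>
    #include <stddef.h>

    struct iodesc {
        struct iodesc *next;
        /* buffer, offset, length, ... */
    };

    /* Per-device inbox; initialize lock/nonempty with
       pthread_mutex_init()/pthread_cond_init() at startup. */
    struct dev_inbox {
        pthread_mutex_t lock;
        pthread_cond_t nonempty;
        struct iodesc *head;
    };

    /* Worker side: build a private list of descriptors, then splice
       the whole thing in under one lock acquisition. */
    void submit_batch(struct dev_inbox *ib, struct iodesc *first,
        struct iodesc *last)
    {
        pthread_mutex_lock(&ib->lock);
        last->next = ib->head;
        ib->head = first;
        pthread_cond_signal(&ib->nonempty);
        pthread_mutex_unlock(&ib->lock);
    }

    /* Device-thread side: take everything that has accumulated. */
    struct iodesc *drain(struct dev_inbox *ib)
    {
        pthread_mutex_lock(&ib->lock);
        while (ib->head == NULL)
            pthread_cond_wait(&ib->nonempty, &ib->lock);
        struct iodesc *all = ib->head;
        ib->head = NULL;
        pthread_mutex_unlock(&ib->lock);
        return all;
    }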

A thing to bear in mind with async systems is that they run the
risk of unbounded growth if there isn't some way to inject
backpressure into the system.  Without hysteresis, you hit an
inflection point where the amount of work queued balloons to
the point that the system becomes unresponsive.
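
A minimal way to get that, sketched with a counter and a
high-water mark (the threshold and names here are arbitrary):
producers stall once too much work is outstanding, and
completions let them go again.

    #include <pthread.h>

    #define HIGH_WATER 1024

    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
    static int outstanding;

    /* Producer: blocks while too much work is already queued. */
    void submit_throttled(void)
    {
        pthread_mutex_lock(&qlock);
        while (outstanding >= HIGH_WATER)
            pthread_cond_wait(&drained, &qlock);
        outstanding++;
        pthread_mutex_unlock(&qlock);
        /* ... actually enqueue the work here ... */
    }

    /* Completion path: lets a waiting producer proceed. */
    void complete_one(void)
    {
        pthread_mutex_lock(&qlock);
        outstanding--;
        pthread_cond_signal(&drained);
        pthread_mutex_unlock(&qlock);
    }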

>I also think it does make sense to have a shared pool of IOSBs
>and to avoid dynamic memory allocation as much as possible.
>malloc() vs. stack memory allocation is another area I'm
>curious about. We're not in the embedded era with 64KB stack
>sizes or anything like that, so if you know that an array will
>be a fixed 4KB or 8KB or 64KB max, then why not put it on the
>stack? That seems like it'd be helpful on anything post-VAX.
>The circular buffers that io_uring uses are also a good template.

I can think of several (potential) issues with this direction.
First, stack allocations are inherently transient; if you share
a pointer to something on your stack, then that pointer is
invalidated if you return from whatever function call owns the
frame that data is allocated on.  Which isn't to say that you
shouldn't share things allocated on your stack, just that one
must exercise caution when doing so, to ensure that you don't
return while there's still a live pointer to the data on the
relevant stack frame.
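
The classic mistake looks something like this (start_async_read()
is a stand-in for any asynchronous call that completes later and
writes through a pointer you handed it):

    struct iosb {
        unsigned short status;
        unsigned short count;
        unsigned int dev;
    };

    /* Hypothetical async call: completes later, writes *iosb. */
    extern void start_async_read(struct iosb *iosb, void *buf,
        unsigned len);

    void bad_submit(void *buf, unsigned len)
    {
        struct iosb iosb;                /* lives on this frame */
        start_async_read(&iosb, buf, len);
        /* BUG: returning here kills the frame; when the IO later
           completes, it scribbles on whatever reuses this stack
           space.  The IOSB needs storage that outlives the request. */
    }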

Second, just because it doesn't involve an explicit `malloc`
doesn't mean that stack allocation is free, particularly for
large objects: it may be that you incur the overhead of one
(or perhaps more than one) fault as the OS grows the stack on
your behalf.

I think the more general advice is, "minimize allocations in
the hot path."  There's nothing magical about heap memory versus
stack memory; they're both just memory.  If you preallocate
space in the heap, accessing it will be pretty fast just like
accessing space on an allocated stack frame (yeah yeah yeah,
we can talk about addressing modes and frame/sp-relative
accesses and so forth here, but that's really getting into the
weeds).  But if, say, you heap allocated a per-CPU "arena" for
temporary space when your program started up and treated it
kind of like a stack, I don't think you'd see any significant
performance variation versus just making stack allocations.  It
may even be slightly more efficient, particularly if you can
arrange for the OS to preallocate your memory and you wire it
into your address space (e.g., `mlock` or equivalent).
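
As a sketch, such an arena might look like the following; the
size and names are arbitrary, and mlock() may need privileges or
rlimit headroom:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define ARENA_SIZE (1 << 20)

    struct arena {
        char *base;
        size_t used;
    };

    /* Done once per CPU/thread at startup: fault the pages in and
       (if allowed) wire them so the hot path never takes a fault. */
    int arena_init(struct arena *a)
    {
        a->base = malloc(ARENA_SIZE);
        if (a->base == NULL)
            return -1;
        a->used = 0;
        memset(a->base, 0, ARENA_SIZE);
        return mlock(a->base, ARENA_SIZE);
    }

    /* Bump allocation: about as cheap as adjusting a stack pointer. */
    void *arena_alloc(struct arena *a, size_t n)
    {
        n = (n + 15) & ~(size_t)15;      /* keep 16-byte alignment */
        if (a->used + n > ARENA_SIZE)
            return NULL;
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    /* "Free" everything at once when the operation is done. */
    void arena_reset(struct arena *a)
    {
        a->used = 0;
    }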

>In general, what's the fastest native way to queue and handle
>async commands? I think setting the event flag to EFN$C_ENF
>(don't care) and using the AST and AST parameter would be the
>quickest way to match up a completed I/O to some useful data.
>There are a lot of possibilities for how to encode things into
>that 64-bit AST parameter so that it can post the right
>completion code to the right completion queue.
>
>The problem with ASTs, though, is they don't play well with
>pthreads, and you might want to use threads with libuv (Node.js
>is essentially single-threaded, using the default event queue
>only, but libuv does provide for threads and a few of its calls
>are threadsafe).

Generally, event-driven callbacks will be faster than
synchronous threading (at a minimum, you avoid register save and
restore and stack switching, and you may have better cache
locality, too).  However, there's no reason you can't multiplex
such callbacks onto multiple operating system-provided threads,
to get multicore parallelism.  Indeed, you almost have to,
since practically no OS gives you any other primitive.

In OS design, there is always a tension between interrupts and
polling in terms of which has more desirable characteristics
under different criteria; generally speaking, you'll have lower
latency with polling but at the expense of burning computational
resources (since you burn a core spinning on the polling
operation).  So if you have a core to spare, _a_ way to get
faster performance in an async system is to dedicate it to
polling for IO completions and marking tasks runnable (e.g.,
by handing them out to per-core executors or something). Task-
handling cores would just spin on their executor queues or
something like that.
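
Schematically, the dedicated poller is just a loop like the one
below; poll_completion() and executor_push() are placeholders for
whatever the platform actually provides (an io_uring completion
queue, a VMS completion mailbox, etc.):

    #include <stddef.h>

    struct task;

    /* Platform-specific: next finished IO, or NULL if none ready. */
    extern struct task *poll_completion(void);
    /* Hand a runnable task to the executor for a given core. */
    extern void executor_push(int cpu, struct task *t);
    extern int ncpu;

    void *poller_main(void *arg)
    {
        int next = 0;

        (void)arg;
        for (;;) {                  /* deliberately burns this core */
            struct task *t = poll_completion();
            if (t == NULL)
                continue;           /* keep spinning for low latency */
            executor_push(next, t); /* round-robin to executors */
            next = (next + 1) % ncpu;
        }
        return NULL;
    }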

>The safest option on VMS, and a popular one, is to use local
>event flags. I've modified the simplest libuv async benchmark
>to use pairs of LEF's to ping-pong back and forth and it looks
>like it's about 1/5 the speed of Linux in the same VM, or
>10K/sec instead of 50K. The same test on the bare metal is
>around 250K per second, so the VM makes a huge difference.

Is the code available somewhere to poke at?

>As an aside, I'd avoid KVM/QEMU in favor of VirtualBox to run
>OpenVMS, for performance reasons. I ran some of the Java
>benchmarks I posted here recently in KVM and I didn't compute
>the exact numbers, but most of them seem around half the speed
>of VirtualBox. I suppose then that it's a good thing I started
>out with VirtualBox and then only recently copied my .vdi files
>into .qcow2 format to test in KVM. I have no idea how fast
>VMware ESXi is.

I actually find this rather surprising; KVM at least is very
good and is the hypervisor of choice for the hyperscalers (well,
except MSFT, which uses Hyper-V).  ESXi is pretty zippy.

>I think there must be a safe way to wake a POSIX thread that's
>sleeping on a condition variable without possibly sleeping on
>the associated mutex.  In general, for an async event loop,
>there'll be only one thread waiting for completion, so I'm
>thinking it may be safe to use an atomic variable as a counter
>and then wake the sleeping thread only if the previous value
>was 0. I'll have to think about the details.
>
>The AST callback
>can't ever block, especially since it's going to be occurring
>on the kernel thread that issued the request (that's my
>understanding, anyway), which has an event loop that may or may
>not be sleeping waiting for a completion event.

If I understand this correctly, the concern is blocking in the
AST upcall, and so you don't want to race on the mutex that
protects the innards of the condvar itself (e.g., the waiter
list).

If you're really worried about this, perhaps condition variables
aren't the right tool for the job.  Semaphores have the property
you are describing; see, e.g., https://swtch.com/semaphore.pdf.
POSIX provides semaphores that work with pthreads.
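
The shape of it, in POSIX terms, is something like the sketch
below: the completion side (the AST, in your case) never takes a
mutex, and sem_post() is async-signal-safe.  Whether the same
holds for DECthreads semaphores in AST context is something you'd
want to verify on VMS.

    #include <semaphore.h>
    #include <stdatomic.h>

    static sem_t wakeup;
    static atomic_long pending;

    void init(void)
    {
        sem_init(&wakeup, 0, 0);
    }

    /* Completion side (AST or other thread): never blocks.  Post
       only on the 0 -> 1 transition, as you suggested. */
    void io_completed(void)
    {
        if (atomic_fetch_add(&pending, 1) == 0)
            sem_post(&wakeup);
    }

    /* Event-loop side: sleep until there is something to do, then
       take all accumulated completions at once. */
    void wait_for_work(void)
    {
        sem_wait(&wakeup);
        long n = atomic_exchange(&pending, 0);
        /* ... process n completions ... */
        (void)n;
    }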

But it sort of seems like mixing things in an odd way.  The
point of using something like a condition variable is that you
are blocking a thread until "something" happens, at which point
signaling the condvar will unblock that thread.  But in an
asynchronous system, you're trying to avoid that kind of
blocking entirely, right?  That is, why would you block a
pthread to wait on an asynchronous event when it could be
put to work doing something useful instead?  If you're going to
do that, why not just use a synchronous IO call and block on
that?

>The other detail constantly in my mind is that the DECthreads
>library doesn't have a 1:1 mapping between pthreads and kernel
>threads. In fact, it only can create kernel threads up to the
>number of active CPUs, and then it has to multiplex on top of
>those. That's why I'm thinking it may be best to look at how
>Linux and others are doing their I/O, so as to avoid falling
>into patterns that made sense on a single-core VAX in the late
>1990s.

Sadly, this is an area where the Unix folks just haven't kept up
and in some senses are suffering the consequences of design
decisions made in their very early days.  Unix was always meant
to shield the programmer from the "difficult" model of async
IO, which was only exposed to the programmer in a very limited
fashion (signal, kill; arguably fork and wait in the early
days).
(https://www.tuhs.org/pipermail/tuhs/2015-September/007509.html)

The side-effect this had was that pretty much every Unix program
was written assuming blocking IO, and so when muxing M green
threads onto N kernel-schedulable entities, you'd end up axing
1/N'th of your parallelism blocking on, say, `open()`, which has
no non-blocking analogue.  As a result, while there was a lot of
interest in scheduler activations and N:M threading models back
in the 90s, pretty much every modern Unix has eschewed that model
(that, apparently, DECthreads retains) and is now strictly 1:1.
Not to mention that the kernel thread scheduler would often fight
with the userspace scheduler, since neither had any real insight
into what the other was doing.  To the extent any modern Unix-y
system retains an N:M thread scheduler, it's mostly a vestige.
Well, at least until we start talking about things like fibers
and coroutines, but those have very different implementations.
(And there are some research systems that have very different
implementations, e.g., Akaros.)

An issue with threaded systems is that you really do want some
kind of kernel-managed thread pool that you inject work into;
again, this helps provide some backpressure, which in turn helps
with scalability by keeping work queues from growing unbounded.

Anyway, I guess the point is that looking at Linux for guidance
here is probably not terribly relevant to VMS.

>I'll add more detail here as I learn more about what works and
>what doesn't. This rant and project were largely inspired by
>watching a talk Mark Russinovich gave in 2004 on Linux vs.
>Windows kernels, and back then Windows NT's I/O completion
>ports were seen as better than what Linux had to offer, because
>you can set up a thread pool and the OS wakes up one thread to
>handle each response. NT 3.5 introduced the feature in 1994,
>and Microsoft was granted a patent on it, which didn't expire
>until 2014.

I don't think that's quite what he was saying.  He did give an
idealized model for IO in enterprise software: basically, you've
got one runnable thread per CPU, with all other threads blocked
waiting on IO.  Completion ports are just an API with
hooks into the scheduler for maintaining the status of
outstanding IO requests.  I think of it as sort of an
intermediary between IO requests and processes.

>Ironically, the pendulum today has now swung in the opposite
>direction, probably due to memory cache contention issues in
>SMP, to having a single async I/O thread and everything inside
>of its data structures being non-thread-safe, so Windows' IOCPs
>don't even seem relevant to me today. In theory, OpenVMS can
>support the modern paradigm quite well, because it was designed
>that way 45 years ago.

I don't know that that's the difference.  An issue these days
is that we're trying to drive large asynchronous runtimes
through a synchronous system call interface (at least, in the
case of POSIX-style APIs) that in turn muxes hardware resources
that are themselves highly asynchronous.  Things like io_uring
are designed to try and expose a more hardware-like IO
abstraction to userspace software, cutting out the middle.
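
For comparison, the io_uring flavor of this (via liburing; a
single read, with error handling elided) looks roughly like:

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int read_one(const char *path)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int fd;

        io_uring_queue_init(8, &ring, 0);       /* 8-entry rings */
        fd = open(path, O_RDONLY);

        sqe = io_uring_get_sqe(&ring);          /* slot in the SQ ring */
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_sqe_set_data(sqe, buf);        /* user tag, much like
                                                   an AST parameter */
        io_uring_submit(&ring);                 /* hand batch to kernel */

        io_uring_wait_cqe(&ring, &cqe);         /* reap a completion */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }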

>In practice, I think good performance is achievable, especially
>in a few years when the x86 port is more optimized.

Agreed.

	- Dan C.



