[Info-vax] OpenVMS async I/O, fast vs. slow
Arne Vajhøj
arne at vajhoej.dk
Sun Nov 5 21:07:48 EST 2023
On 11/5/2023 5:25 PM, Jake Hamby (Solid State Jake) wrote:
> I pushed some work in progress to my GitHub repo for the VMS port of
> libuv that I started to work on. I'm not sure if it should be
> considered a bug in DECthreads that it behaves so impossibly poorly
> on the async ping test that writes a byte to a pipe to wake a child
> sleeping on a poll() and then sleeps on poll() to wait for the reply
> on a pipe going the other way. I'm getting no more than 60 round
> trips / sec, even after commenting out the sched_yield() call before
> it goes back to sleep. Being so unworkably slow using the POSIX APIs,
> I didn't bother to try to run the test cases.
>
> My guess would be that VMS's user-mode thread scheduler doesn't
> realize immediately that one of the benchmark process's threads has
> signaled another one to wake up from a poll() by writing to a pipe
> shared between them, and it's going to sleep on a timer before
> eventually waking up. The poll() version of the benchmark uses almost
> no CPU because it's mostly sleeping. You can get instant inter-thread
> wakeup with local event flags, and as I mentioned at the start of
> this thread, I did see above 10K round-trips/sec with a hacked-up
> version of libuv's async benchmark to test the do-nothing wake-up
> case.
It is not that unusual for performance to suffer when trying to
emulate OS A on OS B.
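
The native wakeup with local event flags that Jake describes is
basically this (just a sketch of mine, assuming kernel threads so
$WAITFR only stalls the calling thread; error handling omitted):

/* Sketch only: one thread sleeps on a local event flag, another
 * sets it. */
#include <pthread.h>
#include <stdio.h>
#include <lib$routines.h>   /* lib$get_ef */
#include <starlet.h>        /* sys$setef, sys$clref, sys$waitfr */

static unsigned int efn;    /* local event flag number */

static void *waiter(void *arg)
{
    sys$waitfr(efn);        /* wakes as soon as sys$setef is called */
    puts("woken by event flag");
    return NULL;
}

int main(void)
{
    pthread_t t;

    lib$get_ef(&efn);       /* allocate a flag from the local pool */
    sys$clref(efn);         /* make sure it starts cleared */
    pthread_create(&t, NULL, waiter, NULL);

    /* ... do some work, then wake the waiter ... */
    sys$setef(efn);

    pthread_join(t, NULL);
    return 0;
}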
> What makes libuv such an interesting test is that the naive port
> using the C RTL synchronous POSIX APIs is almost unusably bad, and in
> the best-case scenario has too many unneeded layers of emulation, but
> by using the Win32 and Linux io_uring implementations as a model,
> it's possible to get acceptable behavior using all native VMS APIs,
> with completion ASTs waking up the event loops using one local event
> flag per event loop.
That optimal performance requires native OS features follows from the above.
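
A sketch of that model as I read it (not Jake's code - the loop and
request types and the completion queue are invented for illustration,
error handling omitted):

/* One event loop woken by $QIO completion ASTs setting a single
 * local event flag. */
#include <efndef.h>      /* EFN$C_ENF: no event flag for the $QIO itself */
#include <iodef.h>       /* IO$_READVBLK */
#include <iosbdef.h>     /* struct _iosb */
#include <starlet.h>     /* sys$qio, sys$setef, sys$clref, sys$waitfr */

typedef struct {
    unsigned int efn;        /* the loop's wakeup flag (from lib$get_ef) */
    /* ... queue of completed requests would live here ... */
} loop_t;

typedef struct {
    loop_t *loop;
    struct _iosb iosb;
    char buf[512];
} req_t;

/* Completion AST: runs when the I/O finishes; just wake the loop. */
static void io_done_ast(req_t *req)
{
    /* ... enqueue req on req->loop's completion queue ... */
    sys$setef(req->loop->efn);
}

/* Start a one-block read; completion is signalled by the AST, not an EF. */
static void start_read(req_t *req, unsigned short chan, unsigned int vbn)
{
    sys$qio(EFN$C_ENF, chan, IO$_READVBLK, &req->iosb,
            (void (*)())io_done_ast, (__int64)req,
            req->buf, sizeof req->buf, vbn, 0, 0, 0);
}

static void run_loop(loop_t *loop)
{
    for (;;) {
        sys$waitfr(loop->efn);    /* sleep until an AST sets the flag */
        sys$clref(loop->efn);
        /* ... drain the completion queue and run the callbacks ... */
    }
}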
> App servers like Node.js either have only a
> single event loop, or one per core, so VMS's limit of 48 (64 - 8
> reserved) should be more than adequate.
First, I am not sure how you get 64 - 8 to be 48. My calculator
claims 56.
:-)
But secondly, the number of threads in an app server varies a lot
from one app server to another.
If you go to the Java world (servlet containers like Tomcat
or full application servers like WildFly), then expect to see
a lot more threads: from around 100 for a development server to
around 1000 for a production server.
> So that gets back to RMS vs. $QIO(W)/$IO_PERFORM(W). Since nobody
> cares about anything other than Stream_LF sequential files in UNIX
> world, there's nothing advanced that we need or want from RMS as far
> as record handling,
Not so sure about that.
Applications that only work with STREAM_LF files are a PITA. Unfortunately
not unheard of. But there is no need to produce any more of them.
> but we do want to perform reads/writes of any
> size starting from any offset. If you use $QIO(W), then, like
> $IO_PERFORM(W), you must access file data by virtual block offset
> (starting at block 1, not 0), and then you're effectively doing raw
> disk I/O, except with the OS mapping only the blocks in the file you
> have access to through your channel.
>
> The reason C RTL and everyone else go through RMS is so they can do
> byte-level file I/O instead of block-level, and to take advantage of
> XFC caching.
I don't think XFC requires RMS.
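
And the block-offset bookkeeping Jake mentions is something a
$QIO-based path has to do itself, roughly like this (my illustration,
512 byte blocks, VBNs numbered from 1):

/* Sketch: translating a byte range into the 512-byte virtual blocks
 * that $QIO/$IO_PERFORM work in.  Assumes length > 0. */
#include <stdint.h>

#define VMS_BLOCK_SIZE 512

typedef struct {
    uint32_t start_vbn;       /* first virtual block to read (1-based)    */
    uint32_t block_count;     /* whole blocks covering the byte range     */
    uint32_t skip_in_first;   /* bytes to discard at the front of block 1 */
} vbn_range;

static vbn_range byte_range_to_vbn(uint64_t offset, uint32_t length)
{
    vbn_range r;
    uint64_t first_block = offset / VMS_BLOCK_SIZE;               /* 0-based */
    uint64_t last_block  = (offset + length - 1) / VMS_BLOCK_SIZE;

    r.start_vbn     = (uint32_t)(first_block + 1);                /* 1-based */
    r.block_count   = (uint32_t)(last_block - first_block + 1);
    r.skip_in_first = (uint32_t)(offset % VMS_BLOCK_SIZE);
    return r;
}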
> Everyone keeps talking about raw throughput and database
> servers and not the much more common use cases for file I/O.
On VMS today a lot of applications use either index-sequential files
or some custom access to sequential files.
But in the general IT industry today, performance usually depends
on some sort of database (RDBMS or NoSQL), not flat files.
(Index-sequential files are really a NoSQL database, but code-wise
they live together with the rest of RMS on VMS.)
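
As an example of that style of access, a bare-bones keyed $GET looks
roughly like this (file name, key layout and buffer sizes are made up,
most status checking omitted):

/* Look up one record by primary key in an indexed (ISAM) file. */
#include <stdio.h>
#include <string.h>
#include <rms.h>
#include <starlet.h>     /* sys$open, sys$connect, sys$get, ... */

int lookup(const char *key, unsigned short keylen)
{
    static char fname[] = "customers.idx";
    char record[512];

    struct FAB fab = cc$rms_fab;            /* file access block   */
    struct RAB rab = cc$rms_rab;            /* record access block */

    fab.fab$l_fna = fname;
    fab.fab$b_fns = (unsigned char)strlen(fname);
    fab.fab$b_fac = FAB$M_GET;

    if (!(sys$open(&fab) & 1))
        return -1;

    rab.rab$l_fab = &fab;
    if (!(sys$connect(&rab) & 1)) {
        sys$close(&fab);
        return -1;
    }

    /* Keyed access: hand RMS the key, get the matching record back. */
    rab.rab$b_rac = RAB$C_KEY;
    rab.rab$b_krf = 0;                      /* primary key */
    rab.rab$l_kbf = (char *)key;
    rab.rab$b_ksz = (unsigned char)keylen;
    rab.rab$l_ubf = record;
    rab.rab$w_usz = sizeof record;

    if (sys$get(&rab) & 1)
        printf("found %u byte record\n", (unsigned)rab.rab$w_rsz);

    sys$disconnect(&rab);
    sys$close(&fab);
    return 0;
}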
Arne