[Info-vax] OpenVMS async I/O, fast vs. slow

Dan Cross cross at spitfire.i.gajendra.net
Sun Nov 5 13:03:59 EST 2023


In article <ui8dkv$20n6$1 at dont-email.me>,
Craig A. Berry <craigberry at nospam.mac.com> wrote:
>
>On 11/4/23 4:41 PM, Dan Cross wrote:
>> In article <ui64is$3go9i$1 at dont-email.me>,
>> Craig A. Berry <craigberry at nospam.mac.com> wrote:
>>> For disk I/O, yes, it's
>>> almost certain that using virtual memory primitives to synchronize
>>> integral pages between disk and memory will be faster than any other I/O
>>> method; that's why pretty much every database product on every platform
>>> does it.
>> 
>> Everyone starts out thinking that, but most are wrong:
>> https://db.cs.cmu.edu/mmap-cidr2022/
>
>Interesting article.  I don't buy its conclusions as an indication of
>what most databases do.  Its main point seems to be that mapped memory
>makes transactions difficult because, "due to transparent paging, the OS
>can flush a dirty page to secondary storage at any time, irrespective of
>whether the writing transaction has committed."  Well, yeah, you have to
>lock pages in memory, and apparently mlock() doesn't guarantee that it
>does what is says on the tin.

That's not what `mlock()` does; `mlock` makes sure that pages
associated with the locked region are physically present in RAM;
it says nothing about when those pages, if dirtied, are flushed
back to disk.  Note that flushing the contents of a page back to
secondary storage doesn't automatically evict that page from
memory; it just ensures that the page contents on durable
storage match those in RAM.

Furthermore, lots of databases are too large to fit entirely
into physical memory, thus making `mlock` on their backing file
stores impractical.

Also, that was just one of the four problems they identified.
The others are unpredictable I/O stalls, issues with error
handling, and performance issues due to outdated implementation
assumptions.

>They also seem to be surprised by the fact
>that paging to a remote storage system is slow.  Well, duh.  So there
>are complexities and difficulties arising from using mapped memory.

I think you are confusing what the paper means by "remote" with
something unrelated: the authors do not mean paging to a remote
_system_, as in a totally separate computer across a network like
an Ethernet, but rather the problems inherent in SMP systems and
the need to synchronize with "remote cores" sharing an address
space within the same physical machine.  Here, a "remote core" is
simply any CPU other than the local one.  This comes up in their
discussion of page-table synchronization (and page-table
contention) and the need to issue TLB shootdowns to other cores
that likely hold cached translations for a page being evicted
from an address space.  Such shootdowns, and multiple
simultaneous modifications to a virtual address space generally,
are a well-known performance bottleneck in virtual memory
systems.
For more details on both the problem space and an example
implementation that addresses some of the issues, see:
https://people.csail.mit.edu/nickolai/papers/clements-radixvm-2014-08-05.pdf

>But there are also complexities and difficulties with their recommended
>alternative of "traditional buffer pools" and libaio or io_uring.  Stack
>Overflow is full of articles about problems, including broken
>implementations and security problems, with those APIs. I don't doubt
>they can be useful if the complexities and difficulties are managed
>correctly, but the same is true of mapped memory.

If one is going to implement a DBMS, I would hope one would have
enough experience and knowledge to avoid the sorts of problems
one would reach to Stack Overflow for help with.  :-/

Regardless, I believe the authors' point is that, once one takes
the complexities they identify with mmap into consideration, it
becomes much less attractive: all of the details one had hoped to
avoid (or rather, throw over the wall to the OS) with respect to
manual buffer management must be confronted anyway.  Other
mechanisms give more precise control over these behaviors;
combined with _other_ services provided by the OS (io_uring, aio,
etc.), one can get all of the benefits naively expected from mmap
without the downsides, at a similar level of complexity.

>Also noteworthy is that in the long list of databases whose
>implementations they considered, SQLite is the only major player they
>mention. SQLite has an unusual storage model (data typing imposed at run
>time, IIRC) and was never intended as a multi-user OLTP system, so its
>I/O choices aren't much of a guide to what databases in general do.
>
>If the authors of the article could show evidence that the research
>teams at Oracle, Microsoft, and IBM have come to the same conclusions
>they did and don't use mapped memory in their products anymore, that
>would be interesting, but they make only an oblique reference to SQL
>Server in a context that implies it does use mapped memory.

SQLite is public domain software, whereas most databases have
restrictive licenses that prohibit authors from mentioning them
by name.  So no, they can't really show that directly.

>I suspect all of the major databases use any and every I/O mechanism
>available in different situations, chosen by a variety of engine
>choices, run-time heuristics, and configuration options. Mapped memory
>may not be the only game in town besides file I/O like it once was, but
>I'm just not buying that it's been entirely eclipsed.

See above.

	- Dan C.



