[Info-vax] Rust as a HS language, was: Re: Quiet?
Dan Cross
cross at spitfire.i.gajendra.net
Tue Apr 5 15:59:57 EDT 2022
In article <jb3isvF8h79U1 at mid.individual.net>,
Bill Gunshannon <bill.gunshannon at gmail.com> wrote:
>On 4/5/22 15:04, Dan Cross wrote:
>> [snip]
>> A committee driven standards document (which, to reiterate yet
>> again, _will_ come) isn't a talisman against incompatibility.
>> Again, I bring up how pervasive use of UB in C means that
>> programs written 30 years ago to the then-current standard will
>> fail today when compiled with modern toolchains.
>
>Maybe I'm just confused, but I don't see what time has to do with
>anyone writing code relying on the results of UB. I would expect
>there would be no guarantee of repeated results on the same hardware
>using the same compiler and then running the program twice. Or am
>I missing just what "UB" actually means here.
Well, consider this code from sec 2.3 of the paper I linked earlier
(https://people.csail.mit.edu/nickolai/papers/wang-undef-2012-08-21.pdf):
01. int do_fallocate(..., loff_t offset, loff_t len)
02. {
03.     struct inode *inode = ...;
04.     if (offset < 0 || len <= 0)
05.         return -EINVAL;
06.     /* Check for wrap through zero too */
07.     if ((offset + len > inode->i_sb->s_maxbytes)
08.         || (offset + len < 0))
09.         return -EFBIG;
10.     ...
$n. }
(This code was originally taken from the Linux kernel)
As a result of the `if` statement on line 04, the compiler
"knows" that both `offset` and `len` are non-negative (indeed,
it even "knows" that `len` is positive! But I digress). But,
in C, signed integer overflow is UB, so the compiler is free to
assume that the condition on line 08 _cannot happen_. Thus, it
is free to elide that part of the condition in the `if` on 07,
simplifying the entire condition to,
`if (offset + len > inode->i_sb->s_maxbytes) return -EFBIG;`.
That is, it can assume that `do_fallocate` is _always_ called
in such a way that `offset+len` _never_ overflows. But of
course, in the real world, that's just not true; something
_could_ call `do_fallocate` in such a way that `offset+len`
overflows.
The issue in production arose because a programmer relied on the
compiler not eliding the overflow check in the conditional; that
is, the code relied on the compiler ignoring the UB and doing
the right thing. I'd imagine that at the time the code was
written, this was probably true; but then a new version of the
compiler came along that included the elision as an optimization
and hilarity ensued.
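For illustration, one way to express the check on lines 07-08
without depending on signed overflow is to compare `len` against
the remaining headroom instead of forming the sum. This is only a
minimal sketch, not the kernel's actual code: `range_exceeds` and
`max_bytes` are hypothetical names, `int64_t` stands in for
`loff_t`, and it assumes the test on line 04 has already
established `offset >= 0` and `len > 0`.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if offset + len would exceed max_bytes, without
 * ever computing offset + len.
 *
 * Assumptions (hypothetical helper, not from the kernel):
 * offset >= 0, len > 0, max_bytes >= 0.
 */
static bool
range_exceeds(int64_t offset, int64_t len, int64_t max_bytes)
{
    if (offset > max_bytes)
        return true;
    /* Both operands are non-negative, so this cannot overflow. */
    return len > max_bytes - offset;
}
```

Since no expression here can overflow, there is no UB for the
optimizer to reason from, and the check cannot be elided on those
grounds. GCC and Clang also provide `__builtin_add_overflow`,
which is another way to get a checked addition.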
The point is that, even though C has a standard, it is riddled
with almost inescapable UB, and compiler writers can, and WILL,
take advantage of that over time. New versions of compilers may
well introduce UB-based optimizations that fundamentally alter
the behavior and indeed the correctness of programs without the
programs themselves changing.
There is _tons_ of this floating around with respect to the
memory model, which wasn't even specified until C11, even though
we've been writing multithreaded C programs since the 70s (e.g.
the Unix kernel).
And UB is extraordinarily easy to trip over in C. For example,
consider this almost trivial function:
uint16_t
mul(uint16_t a, uint16_t b)
{
    return a * b;
}
Is that well-defined? Sadly the answer is, "it depends, but
probably not." In particular, on a machine/toolchain where,
say, `int` is 32 bits wide and `uint16_t` is `unsigned short`,
the answer is "no." The "usual arithmetic conversions" will be
applied to the operands prior to the multiplication operation;
since `unsigned short` has lesser "rank" than `int`, and the
range of `uint16_t` is fully expressible as a _signed_ int, the
operands will automatically be promoted to _signed_ ints prior
to the multiplication. Since there exist uint16_t's a and b such
that a*b is greater than the largest signed int (e.g. 65535 *
65535 = 4294836225, while a 32-bit INT_MAX is 2147483647), we're
back in UB land. So the compiler is free to assume it never
happens; how may that manifest itself in a real program? The
compiler, for whatever reason, may decide to use e.g. a
saturating multiplication instruction to implement the
operation, which would likely yield surprising results. Again,
this is the sort of landmine that may lie hidden in a codebase
for _years_ until a new, sufficiently aggressive compiler steps
on it.
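The promotion itself can be observed without triggering any
overflow. The following is a small sketch using C11 `_Generic`,
which inspects the type of an expression without evaluating it.
The helper names are made up for illustration, and the results in
the comments assume a toolchain where `int` is 32 bits wide.

```c
#include <limits.h>
#include <stdint.h>

/* 1 if the expression has type (signed) int, else 0.
 * The operand of _Generic is not evaluated. */
#define IS_SIGNED_INT(e) _Generic((e), int: 1, default: 0)

/* Hypothetical helper: is the product of two uint16_t's an int? */
static int
product_is_signed_int(void)
{
    uint16_t a = 0, b = 0;
    return IS_SIGNED_INT(a * b);
}

/* Hypothetical helper: can that product exceed INT_MAX?
 * The multiplication is done in long long to avoid the UB. */
static int
product_can_exceed_int_max(void)
{
    long long big = (long long)UINT16_MAX * UINT16_MAX;
    return big > INT_MAX;
}
```

With a 32-bit `int`, both helpers return 1: the product has type
`int`, and its mathematical value can be larger than `INT_MAX`.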
Btw, the `mul` function above might be "fixed" by writing it as:
uint16_t
mul(uint16_t a, uint16_t b)
{
    unsigned int aa = a;
    unsigned int bb = b;
    return aa * bb;
}
Thus, the multiplication will be performed as an _unsigned_
operation, which has well-defined overflow semantics, and the
truncation to the narrower type is similarly well defined.
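As a quick sanity check of that claim, here is the same fix next
to a reference that widens to 32 bits and masks explicitly.
`mul_ref` is a hypothetical helper added only for comparison, and
the cast in `mul` just makes the truncation explicit.

```c
#include <stdint.h>

/* Same fix as above: multiply as unsigned int, then truncate. */
static uint16_t
mul(uint16_t a, uint16_t b)
{
    unsigned int aa = a;
    unsigned int bb = b;
    return (uint16_t)(aa * bb);
}

/* Hypothetical reference: widen explicitly, keep the low 16 bits. */
static uint16_t
mul_ref(uint16_t a, uint16_t b)
{
    uint32_t wide = (uint32_t)a * (uint32_t)b;
    return (uint16_t)(wide & 0xFFFFu);
}
```

For example, 65535 * 65535 = 4294836225 = 0xFFFE0001, so both
functions return 1; likewise 256 * 256 = 65536 yields 0, and
300 * 300 = 90000 yields 90000 - 65536 = 24464.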
In Rust, we'd write it as:
fn mul(a: u16, b: u16) -> u16 {
    a.wrapping_mul(b)
}
In Rust, arithmetic overflow is never UB: by default it panics
in debug builds (and wraps in release builds unless overflow
checks are enabled), but we can explicitly request modular
arithmetic via `wrapping_mul`.
- Dan C.