[Info-vax] Rust as a HS language, was: Re: Quiet?

Tue Apr 5 16:21:03 EDT 2022

On 4/5/22 15:59, Dan Cross wrote:
> In article <jb3isvF8h79U1 at mid.individual.net>,
> Bill Gunshannon  <bill.gunshannon at gmail.com> wrote:
>> On 4/5/22 15:04, Dan Cross wrote:
>>> [snip]
>>> A committee driven standards document (which, to reiterate yet
>>> again, _will_ come) isn't a talisman against incompatibility.
>>> Again, I bring up how pervasive use of UB in C means that
>>> programs written 30 years ago to the then-current standard will
>>> fail today when compiled with modern toolchains.
>>
>> Maybe I'm just confused, but I don't see what time has to do with
>> anyone writing code relying on the results of UB.  I would expect
>> there would be no guarantee of repeated results on the same hardware
>> using the same compiler and then running the program twice.  Or am
I missing just what "UB" actually means here?
> 
> Well, consider this code from sec 2.3 of the paper I linked earlier
> (https://people.csail.mit.edu/nickolai/papers/wang-undef-2012-08-21.pdf):
> 
> 01. int do_fallocate(..., loff_t offset, loff_t len)
> 02. {
> 03. 	struct inode *inode = ...;
> 04. 	if (offset < 0 || len <= 0)
> 05. 		return -EINVAL;
> 06. 	/* Check for wrap through zero too */
> 07. 	if ((offset + len > inode->i_sb->s_maxbytes)
> 08. 	    || (offset + len < 0))
> 09. 		return -EFBIG;
> 10. 	...
> $n. }
> 
> (This code was originally taken from the Linux kernel)
> 
> As a result of the `if` statement on line 04, the compiler
> "knows" that both `offset` and `len` are non-negative (indeed,
> it even "knows" that `len` is positive!  But I digress).  But,
> in C, signed integer overflow is UB, so the compiler is free to
> assume that the condition on line 08 _cannot happen_.  Thus, it
> is free to elide that part of the condition in the `if` on 07,
> simplifying the entire condition to,
> `if (offset + len > inode->i_sb->s_maxbytes) return -EFBIG;`.
> That is, it can assume that `do_fallocate` is _always_ called
> in such a way that `offset+len` _never_ overflows.  But of
> course, in the real world, that's just not true; something
> _could_ call `do_fallocate` in such a way that `offset+len`
> overflows.
> 
> The issue in production arose because a programmer relied on the
> compiler not eliding the overflow check in the conditional; that
> is, the code relied on the compiler ignoring the UB and doing
> the right thing.  I'd imagine that at the time the code was
> written, this was probably true; but then a new version of the
> compiler came along that included the elision as an optimization
> and hilarity ensued.
> 
> The point is that, even though C has a standard, it is riddled
> with almost inescapable UB, and compiler writers can, and WILL,
> take advantage of that over time.  New versions of compilers may
> well introduce UB-based optimizations that fundamentally alter
> the behavior and indeed the correctness of programs without the
> programs themselves changing.

But, that was my point exactly.  A definition of "UB" means that
particular case should never be used because the results are "UB".
Not the fault of the language.  Not the fault of the compiler.
Strictly the fault of truly bad programmers.
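
For what it's worth, the wrap check quoted above can be written so that
no overflowing sum is ever evaluated. This is only a minimal sketch, not
the actual kernel fix: `check_range`, `loff_t_sketch`, and the `maxbytes`
parameter are hypothetical names standing in for the elided parts of
`do_fallocate`, and `loff_t` is assumed to be a signed 64-bit type.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for loff_t; assumed to be signed 64-bit. */
typedef int64_t loff_t_sketch;

/*
 * Returns 0 if offset + len stays at or below maxbytes, otherwise a
 * negative errno value. The subtraction maxbytes - offset is only
 * evaluated once 0 <= offset <= maxbytes is known, so no expression
 * here can overflow and the compiler has no UB to reason from.
 */
int check_range(loff_t_sketch offset, loff_t_sketch len,
                loff_t_sketch maxbytes)
{
	if (offset < 0 || len <= 0)
		return -EINVAL;
	if (offset > maxbytes || len > maxbytes - offset)
		return -EFBIG;
	return 0;
}

/* A few illustrative calls, including one whose naive sum would wrap. */
void check_range_examples(void)
{
	assert(check_range(90, 10, 100) == 0);
	assert(check_range(91, 10, 100) == -EFBIG);
	assert(check_range(INT64_MAX, 1, INT64_MAX) == -EFBIG);
}
```

The comparison `len > maxbytes - offset` is the same test as
`offset + len > maxbytes`, just rearranged so that the arithmetic stays
in range.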

> 
> There is _tons_ of this floating around with respect to the
> memory model, which wasn't even specified until C11, even though
> we've been writing multithreaded C programs since the 70s (e.g.
> the Unix kernel).
> 
> And UB is extraordinarily easy to trip over in C.  For example,
> consider this almost trivial function:
> 
> uint16_t
> mul(uint16_t a, uint16_t b)
> {
> 	return a * b;
> }

OK, I have to ask.  "UB" means "undefined behaviour", right?
Just so I know we are talking about the same thing here.

> 
> Is that well-defined?  Sadly the answer is, "it depends, but
> probably not."  In particular, on a machine/toolchain where,
> say, `int` is 32 bits wide and `uint16_t` is `unsigned short`,
> the answer is "no."  The reason is that the "usual arithmetic
> conversions" will be applied to the operands prior to the
> multiplication operation; on a machine where `short` has lesser
> "rank" than `int`, and the range of `uint16_t` is fully
> expressible as a _signed_ integer, then the operands will
> automatically be promoted to _signed_ ints prior to the
> multiplication.  Since there exist uint16_t values a and b such
> that a*b is greater than the largest signed int, we're back in
> UB land.  So the compiler is free to assume it never happens;
> how may that manifest itself in a real program?  The compiler,
> for whatever reason, may decide to use e.g. a saturating
> multiplication instruction to implement the operation, which
> would likely yield surprising results.  Again, this is the sort
> of landmine that may lie hidden in a codebase for _years_ until
> a new, sufficiently aggressive compiler steps on it.
> 
> Btw, this bit of code might be "fixed" by writing it as:
> 
> uint16_t
> mul(uint16_t a, uint16_t b)
> {
> 	unsigned int aa = a;
> 	unsigned int bb = b;
> 	return aa * bb;
> }
> 
> Thus, the multiplication will be performed as an _unsigned_
> operation, which has well-defined overflow semantics, and the
> truncation to the narrower type is similarly well defined.

Exactly.  A good programmer would have done it correctly and not
written it relying on UB.  That's my point.
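
To put numbers on the quoted fix: with a 32-bit `int`, 65535 * 65535 is
4294836225, which is larger than INT_MAX (2147483647), so the promoted
signed multiply overflows. The sketch below repeats the unsigned version
under the hypothetical name `mul_wrapping` and checks a few values; the
explicit cast is only there to make the narrowing visible.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Multiply in unsigned int, where overflow wraps by definition, then
 * narrow to uint16_t, which reduces the result modulo 65536.
 */
uint16_t mul_wrapping(uint16_t a, uint16_t b)
{
	unsigned int aa = a;
	unsigned int bb = b;
	return (uint16_t)(aa * bb);
}

/* Illustrative values: a small product and two that exceed 16 bits. */
void mul_wrapping_examples(void)
{
	assert(mul_wrapping(3, 7) == 21);
	assert(mul_wrapping(256, 256) == 0);     /* 65536 mod 65536 */
	assert(mul_wrapping(65535, 65535) == 1); /* 4294836225 mod 65536 */
}
```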

> 
> In Rust, we'd write it as:
> 
> fn mul(a: u16, b: u16) -> u16 {
>      a.wrapping_mul(b)
> }
> 
> In Rust, by default, arithmetic overflow panics in debug builds
> (release builds wrap unless overflow checks are enabled), but
> we can explicitly request modular arithmetic via `wrapping_mul`.

Well, I don't know Rust so I really have no idea what that means.
But, once again, it looks like we are trying to create languages
whose intent is to stop programmers from shooting themselves in
the foot.  A better idea would be teaching the programmer not to
shoot himself in the foot because if you rely on a language to do
it, programmers will continue to shoot themselves in the foot.

bill