[Info-vax] Itanium support is back in GCC 15
Dan Cross
cross at spitfire.i.gajendra.net
Tue Feb 25 16:35:46 EST 2025
In article <c662907ab88c91b96d12e78e0444d8626f421ebf at i2pn2.org>,
John Reagan <johnrreagan at earthlink.net> wrote:
>On 2/24/2025 4:43 PM, Arne Vajhøj wrote:
>> On 2/24/2025 4:22 PM, Michael S wrote:
>>> On Mon, 24 Feb 2025 15:08:57 -0500
>>> Arne Vajhøj <arne at vajhoej.dk> wrote:
>>>> On 2/24/2025 12:42 PM, Michael S wrote:
>>
>>>> C++ on VMS x86-64 is clang, which with the (older) clang version
>>>> used should mean C++14, while C++ on VMS Itanium is very, very old
>>>> (like C++98 old).
>>>>
>>>>> According to the benchmarks that you posted here several months
>>>>> (a year?) ago, the VMS x86-64 compilers are quite awful compared
>>>>> to the x86-64 compilers available on Windows/Linux/BSD.
>>>>> Are you saying that the VMS Itanium compilers are worse?
>>>>
>>>> I believe the conclusion was that the VMS x86-64 compilers, except
>>>> for C++, were slower than C/C++ on other OSes and than C++ on VMS.
>>>
>>> Somehow I got the impression that the C++ compilers were also
>>> significantly slower than C++ compilers on other platforms.
>>> Do I misremember?
>>
>> I don't even remember posting non-VMS numbers here. Age! :-)
>>
>> But I just checked VMS C++ latest (CXX/OPT=LEVEL:5 and clang -O3) vs a
>> random Windows GCC 14.1 (g++ -O3):
>>
>> VMS is a little faster for integer
>> they are about the same for floating point
>> Windows is a lot faster for string
>>
>> And given that this is a micro-benchmark that in reality is just an
>> inner loop evaluating a single expression, which means huge
>> uncertainty, I don't see this as proof of a significant difference.
>>
>> Arne
>>
>We are aware of the string/char performance issues.
>
>On Alpha and Itanium, the low-level routines inside LIBOTS for things
>like OTS$MOVE, string compare, memmove, etc. are all written in
>hand-crafted assembly. For x86, we are still using a set of simple
>BLISS reference code. Plus, the LIBOTS we all have on our systems was
>compiled with a non-optimizing BLISS cross-compiler.
Hmm. It strikes me that LLVM has intrinsics for `memmove` that
would also work for OTS$MOVE3; I would think that would be the
most efficient approach, since for small moves it could lower
directly to a couple of loads and/or stores.
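
For illustration, here's a rough sketch of the kind of lowering I have
in mind, using the standard C memmove (which clang/LLVM treat as an
intrinsic). Whether OTS$MOVE3 could be hooked into the same machinery
is speculation on my part, and copy_u64_field is just a made-up
example name:

/* Copy a small, fixed-size field through memmove.  Because the size
 * is a compile-time constant, clang at -O1 and above lowers this to a
 * single 64-bit load and store -- no library call, and overlap is
 * handled for free since the load happens before the store. */
#include <string.h>

void copy_u64_field(void *dst, const void *src)
{
    memmove(dst, src, 8);
}

Last I checked on Compiler Explorer, that comes out as a plain mov
load/store pair on x86-64.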
>We are currently playing with natively compiled LIBOTS code and doing
>some benchmarks. Besides the brain-dead BLISS code, we have versions
>that loop over larger chunks of data, which are even faster. The
>fastest we've seen so far is a native assembly version that uses the
>REP instruction prefix on MOVSB. That version didn't check for
>overlapping source/dest, however, so any real version gets a little
>slower. I'm not sure when we can incorporate these, but I'm trying to
>push them out as soon as possible.
Yeah, Intel made `REP MOVSB`/`REP STOSB` actually fast a few
uarchs ago. Good stuff, though startup overhead still dominates
below 128 bytes or so, and having to muck with the DF flag
remains a bummer.
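
To make the overlap point concrete, here's a rough sketch (mine, not
VSI's; move3_sketch is a made-up name) of what a REP MOVSB-based move
with an overlap check might look like, in C with GCC/clang-style
inline asm for x86-64:

#include <stddef.h>
#include <stdint.h>

/* memmove-style copy: REP MOVSB for the forward case, byte-at-a-time
 * backward copy when the destination overlaps the tail of the source
 * (where a forward copy would clobber bytes before reading them).
 * Assumes DF is already clear, which the SysV x86-64 ABI guarantees
 * at function entry, so no STD/CLD dance is needed here. */
void *move3_sketch(void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;

    if (d <= s || d >= s + len) {
        /* No destructive overlap: forward copy.  RDI/RSI/RCX are
         * consumed and updated by REP MOVSB, hence the "+" constraints. */
        void *di = dst;
        const void *si = src;
        size_t n = len;
        __asm__ volatile ("rep movsb"
                          : "+D"(di), "+S"(si), "+c"(n)
                          :
                          : "memory");
    } else {
        /* dst overlaps the end of src: copy backwards a byte at a
         * time (STD + REP MOVSB would also work, but that's the DF
         * mucking mentioned above). */
        unsigned char *dp = (unsigned char *)dst + len;
        const unsigned char *sp = (const unsigned char *)src + len;
        while (len--)
            *--dp = *--sp;
    }
    return dst;
}

A real version would presumably also branch to a load/store path for
small sizes, since that's where the REP startup cost hurts.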
>A fun reference to read is
>
>https://cdrdv2-public.intel.com/814198/248966-Optimization-Reference-Manual-V1-050.pdf
Agner Fog's optimization guides can also be a useful resource
for things like this: https://www.agner.org/optimize/
- Dan C.