[Info-vax] x86-64 VMS executable image sizes and memory requirements ?
already5chosen at yahoo.com
Sun Dec 22 12:05:26 EST 2019
On Sunday, December 22, 2019 at 6:01:44 PM UTC+2, John Reagan wrote:
> On Saturday, December 21, 2019 at 4:30:27 PM UTC-5, osuv... at gmail.com wrote:
> > On Saturday, December 21, 2019 at 11:16:59 AM UTC-5, already... at yahoo.com wrote:
> > > Few numbers for clang/LLVM on x86-64 (not VMS, of course)
> > >
> > > Option .text size
> > > -O0 75456
> > > -Oz 45772
> > > -Os 44992
> > > -O1 56836
> > > -O2 45220
> > > -O3 46084
> > >
> >
> > Clearly, DEC C isn't clang. Building the sqlite3shr shareable image on alpha
> > with different /optimization levels gives (size in disk blocks):
> >
> > sqlite3shr-O0.exe;1 2676
> > sqlite3shr-O1.exe;1 2405
> > sqlite3shr-O2.exe;1 3513
> > sqlite3shr-O3.exe;1 3710
> > sqlite3shr-O4.exe;1 3710
> > sqlite3shr-O5.exe;1 3820
> >
> > If I link /notrace, however, the file sizes change to:
> > sqlite3shr-O0.exe;1 2099
> > sqlite3shr-O1.exe;1 1835
> > sqlite3shr-O2.exe;1 1837
> > sqlite3shr-O3.exe;1 1962
> > sqlite3shr-O4.exe;1 1962
> > sqlite3shr-O5.exe;1 2059
> >
> > I presume line number tracking is a lot more complicated for optimized code.
> >
> > The respective $CODE$ sizes are:
> > sqlite3shr-O0.map;1 880492
> > sqlite3shr-O1.map;1 749156
> > sqlite3shr-O2.map;1 773240
> > sqlite3shr-O3.map;1 837496
> > sqlite3shr-O4.map;1 837496
> > sqlite3shr-O5.map;1 886824
>
> Alpha doesn't have complex addressing modes the way x86 does. Optimizations on Alpha tend to be code motion to hoist work out of loops, strength reduction to avoid multiplies, common sub-expression elimination, etc., most of which have little impact on code size, just on speed. And at O2, routine inlining kicks in, which is the size increase you see there. You can control that with various keywords on /OPTIMIZE.
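In C terms, the kind of transformation John describes looks roughly like this (a sketch, not actual GEM output; the function is made up for illustration):

void scale(int *dst, const int *src, int n, int k)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * (k * 8);   /* k*8 recomputed in the loop when not optimizing */
}

/* After loop-invariant code motion and strength reduction, conceptually: */
void scale_opt(int *dst, const int *src, int n, int k)
{
    int k8 = k * 8;                  /* hoisted out of the loop; the *8 becomes a shift */
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] * k8;
}

Same amount of code either way, just less work per iteration.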
>
> On x86, you'll see all sorts of clever addressing mode tricks from clang.
>
> For example, I just tried a "* 15" with -Ofast in clang and got:
>
> int square(int num) {
>     return num * 15;
> }
>
> square(int):                            # @square(int)
>         lea     eax, [rdi + 4*rdi]
>         lea     eax, [rax + 2*rax]
>         ret
>
> This is the kind of stuff you'd see in VAX BLISS or VAX Pascal. Note, it isn't much smaller, just faster.
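For reference, the two LEAs factor the multiply as 15 = 5 * 3:

        lea     eax, [rdi + 4*rdi]      # eax = num + 4*num       = 5*num
        lea     eax, [rax + 2*rax]      # eax = 5*num + 2*(5*num) = 15*num
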
Not necessarily faster, either.
For example, on AMD Zen, an LEA with a non-unit scale has a latency of 2, so the two dependent LEAs in your example have a combined latency of 4. That's one cycle longer than IMUL.
If you want something that is fast on both Zen and Skylake, then try:

        mov     eax, edi
        shl     eax, 4
        sub     eax, edi

Both Zen and Skylake are smart enough to eliminate the first mov at the front end, making it zero-latency, so both will execute the whole sequence with a latency of 2.
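In C terms, that sequence is just x*16 - x. A minimal sketch (the function name is mine; clang will normally canonicalize this back to a multiply by 15 and then pick whatever sequence it prefers for the target):

unsigned mul15(unsigned num)
{
    /* num*16 - num == num*15; unsigned avoids the signed-shift corner
       cases. On Zen and Skylake this can come out as mov + shl + sub,
       with the mov eliminated at register rename. */
    return (num << 4) - num;
}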