[Info-vax] x86-64 data alignment / faulting

Sat Feb 26 23:50:51 EST 2022

On Saturday, February 26, 2022 at 4:36:58 PM UTC-5, Arne Vajhøj wrote:
> On 2/25/2022 11:37 PM, Bob Gezelter wrote: 
> > On Friday, February 25, 2022 at 7:12:55 PM UTC-5, Arne Vajhøj wrote: 
> >> On 2/25/2022 6:57 PM, Mark Daniel wrote: 
> >>> On 26/2/22 8:23 am, Mark Daniel wrote: 
> >>>> Alpha and Itanium had data alignment requirements with 
> >>>> penalties for faulting. Does x86-64? Is 
> >>>> sys$start_align_fault_report() et al. still relevant? 
> >>> 
> >>> Hmmm. Using an alignment fault generator and reporter I'm seeing 
> >>> plenty on Alpha and Itanium; zero on x86-64. 
> >> I had an old Fortran program testing alignment overhead and I just 
> >> ran it on Windows x86-64 and it showed absolutely no overhead for 
> >> bad alignment of REAL*8 arrays (and there is a lot of overhead on 
> >> VMS Alpha and Itanium). 
> >> 
> >> I guess we can say welcome back to CISC. :-) 
> >
> > With all due respect, the performance penalty for non-aligned 
> > references is still very real, speaking as one who did a lot of work 
> > on non-faulting IBM System/370 processors back in the day. The same 
> > was true with VAX CPUs. They did not fault, but they paid a 
> > performance penalty. 
> > 
> > There is a difference in context from the days of the System/370 and 
> > the VAX: multi-level large caches. 
> > 
> > The caches close to the processing core are very fast. This obscures 
> > the loss of performance due to non-aligned references. Second, all 
> > loads/stores to/from a cache are, almost by definition, aligned. 
> > 
> > A program designed to produce alignment faults is also very unlikely 
> > to stress the memory system in a way that exposes the misaligned-data 
> > penalty. Faults, which are synchronous interrupts, have overhead 
> > orders of magnitude more than a double memory fetch, particularly 
> > when sequential elements are referenced (sequential elements may well 
> > be in the same cache line, even if not aligned on the proper 
> > boundary). 
> > 
> > If I had the spare time to play with it, I would write a program to 
> > randomly address a storage area beyond total cache size, so that 
> > every memory reference is a cache miss. Run aligned and unaligned 
> > data references and compare the result. 
> > 
> > It is easy for a benchmark to measure the incorrect phenomenon.
> There are lies, damn lies and benchmarks. 
> 
> :-) 
> 
> I tested on a 2 MB array. 
> 
> And I admit that the results can be due to many things. 
> 
> But the numbers sure show a big difference! 
> 
> Fortran/VMS/Itanium: 
> 
> OFFSET 0 : 590 ms 
> OFFSET 1 : 197510 ms 
> OFFSET 2 : 197510 ms 
> OFFSET 3 : 197520 ms 
> OFFSET 4 : 197510 ms 
> OFFSET 5 : 197510 ms 
> OFFSET 6 : 197510 ms 
> OFFSET 7 : 197510 ms 
> OFFSET 8 : 590 ms 
> OFFSET 9 : 197510 ms 
> OFFSET 10 : 197520 ms 
> OFFSET 11 : 197520 ms 
> OFFSET 12 : 197520 ms 
> OFFSET 13 : 197520 ms 
> OFFSET 14 : 197520 ms 
> OFFSET 15 : 197520 ms 
> OFFSET 16 : 580 ms 
> 
> GFortran/Windows/x86-64 (100x more reps): 
> 
> OFFSET 0 : 7473 ms 
> OFFSET 1 : 7285 ms 
> OFFSET 2 : 7301 ms 
> OFFSET 3 : 7301 ms 
> OFFSET 4 : 7269 ms 
> OFFSET 5 : 7208 ms 
> OFFSET 6 : 7191 ms 
> OFFSET 7 : 7192 ms 
> OFFSET 8 : 7519 ms 
> OFFSET 9 : 7285 ms 
> OFFSET 10 : 7270 ms 
> OFFSET 11 : 7285 ms 
> OFFSET 12 : 7270 ms 
> OFFSET 13 : 7207 ms 
> OFFSET 14 : 7176 ms 
> OFFSET 15 : 7176 ms 
> OFFSET 16 : 7473 ms 
> 
> Arne
Arne,

One needs to analyze beyond raw performance. In this case, I would start with the cache organization and related structure. If you "break" the cache by forcing every reference to be a cache miss, you will essentially see the 2x performance loss.

If the cache is able to satisfy any of the references, it will mask the penalty and skew the numbers.
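
A minimal sketch of such a cache-defeating test, in C rather than Fortran. Everything named here (make_arena, random_walk_sum, time_walk_ms, the xorshift generator, the 0x3f fill byte) is hypothetical and chosen for illustration; it is not the program that produced the numbers quoted above. It assumes the arena is sized well beyond the last-level cache, so that pseudo-random 8-byte reads mostly miss, and it uses memcpy for the unaligned reads because dereferencing a misaligned double pointer is undefined behaviour in C (compilers for x86-64 typically turn the memcpy into a single load).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* xorshift64 generator: cheap enough that it should not dominate the timing. */
static uint64_t next_rand(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Allocate room for `slots` doubles plus one spare, so that a read at
   byte offset 1..7 from the last slot stays in bounds. Every byte is set
   to 0x3f: this touches each page, and it makes the 8 bytes read at any
   offset decode to the same small positive double. */
unsigned char *make_arena(size_t slots)
{
    size_t bytes = (slots + 1) * sizeof(double);
    unsigned char *buf = malloc(bytes);
    if (buf != NULL)
        memset(buf, 0x3f, bytes);
    return buf;
}

/* Read `reps` doubles from pseudo-random slots, each shifted by `offset`
   bytes (0 = naturally aligned, 1..7 = misaligned), and return their sum
   so the compiler cannot discard the loads. */
double random_walk_sum(const unsigned char *buf, size_t slots,
                       size_t offset, size_t reps, uint64_t seed)
{
    assert(slots > 0 && offset < sizeof(double));
    uint64_t state = seed ? seed : 1;   /* xorshift state must be nonzero */
    double sum = 0.0;
    for (size_t i = 0; i < reps; i++) {
        size_t slot = (size_t)(next_rand(&state) % slots);
        double v;
        memcpy(&v, buf + slot * sizeof(double) + offset, sizeof v);
        sum += v;
    }
    return sum;
}

/* Time one walk with clock() and return CPU milliseconds; the sum is
   handed back through `sum_out`. */
double time_walk_ms(const unsigned char *buf, size_t slots, size_t offset,
                    size_t reps, uint64_t seed, double *sum_out)
{
    clock_t t0 = clock();
    *sum_out = random_walk_sum(buf, slots, offset, reps, seed);
    clock_t t1 = clock();
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

A driver might allocate something like 1 << 25 slots (256 MiB, assumed here to exceed the total cache) and print time_walk_ms for offsets 0 through 7 with the same seed, mirroring the OFFSET tables quoted above. Note that the random indices do not depend on the loaded data, so the hardware can overlap several misses; this measures throughput, not the latency of an individual reference.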

- Bob Gezelter, http://www.rlgsc.com