[Info-vax] x86-64 data alignment / faulting
Bob Gezelter
gezelter at rlgsc.com
Sat Feb 26 23:50:51 EST 2022
On Saturday, February 26, 2022 at 4:36:58 PM UTC-5, Arne Vajhøj wrote:
> On 2/25/2022 11:37 PM, Bob Gezelter wrote:
> > On Friday, February 25, 2022 at 7:12:55 PM UTC-5, Arne Vajhøj wrote:
> >> On 2/25/2022 6:57 PM, Mark Daniel wrote:
> >>> On 26/2/22 8:23 am, Mark Daniel wrote:
> >>>> Alpha and Itanium had data alignment requirements with
> >>>> penalties for faulting. Does x86-64? Is
> >>>> sys$start_align_fault_report() et al. still relevant?
> >>>
> >>> Hmmm. Using an alignment fault generator and reporter I'm seeing
> >>> plenty on Alpha and Itanium; zero on x86-64.
> >> I had an old Fortran program testing alignment overhead and I just
> >> ran it on Windows x86-64 and it showed absolutely no overhead for
> >> bad alignment of REAL*8 arrays (and there is a lot of overhead on
> >> VMS Alpha and Itanium).
> >>
> >> I guess we can say welcome back to CISC. :-)
> >
> > With all due respect, the performance penalty for non-aligned
> > references is still very real, speaking as one who did a lot of work
> > on non-faulting IBM System/370 processors back in the day. The same
> > was true with VAX CPUs. They did not fault, but they paid a
> > performance penalty.
> >
> > There is a difference in context from the days of the System/370 and
> > the VAX: multi-level large caches.
> >
> > The caches close to the processing core are very fast. This obscures
> > the loss of performance due to non-aligned references. Second, all
> > loads/stores to/from a cache are, almost by definition, aligned.
> >
> > A program designed to produce alignment faults is also very likely to
> > not abuse the memory system in a way to detect the mis-aligned data
> > penalty. Faults, which are synchronous interrupts, have overhead
> > orders of magnitude more than a double memory fetch, particularly
> > when sequential elements are referenced (sequential elements may well
> > be in the same cache line, even if not aligned on the proper
> > boundary).
> >
> > If I had the spare time to play with it, I would write a program to
> > randomly address a storage area beyond total cache size, so that
> > every memory reference is a cache miss. Run aligned and unaligned
> > data references and compare the result.
> >
> > It is easy for a benchmark to measure the incorrect phenomenon.
> There are lies, damn lies and benchmarks.
>
> :-)
>
> I tested on a 2 MB array.
>
> And I admit that the results can be due to many things.
>
> But the numbers sure show a big difference!
>
> Fortran/VMS/Itanium:
>
> OFFSET 0 : 590 ms
> OFFSET 1 : 197510 ms
> OFFSET 2 : 197510 ms
> OFFSET 3 : 197520 ms
> OFFSET 4 : 197510 ms
> OFFSET 5 : 197510 ms
> OFFSET 6 : 197510 ms
> OFFSET 7 : 197510 ms
> OFFSET 8 : 590 ms
> OFFSET 9 : 197510 ms
> OFFSET 10 : 197520 ms
> OFFSET 11 : 197520 ms
> OFFSET 12 : 197520 ms
> OFFSET 13 : 197520 ms
> OFFSET 14 : 197520 ms
> OFFSET 15 : 197520 ms
> OFFSET 16 : 580 ms
>
> GFortran/Windows/x86-64 (100x more reps):
>
> OFFSET 0 : 7473 ms
> OFFSET 1 : 7285 ms
> OFFSET 2 : 7301 ms
> OFFSET 3 : 7301 ms
> OFFSET 4 : 7269 ms
> OFFSET 5 : 7208 ms
> OFFSET 6 : 7191 ms
> OFFSET 7 : 7192 ms
> OFFSET 8 : 7519 ms
> OFFSET 9 : 7285 ms
> OFFSET 10 : 7270 ms
> OFFSET 11 : 7285 ms
> OFFSET 12 : 7270 ms
> OFFSET 13 : 7207 ms
> OFFSET 14 : 7176 ms
> OFFSET 15 : 7176 ms
> OFFSET 16 : 7473 ms
>
> Arne
Arne,
One needs to analyze beyond raw performance. In this case, I start with the cache organization and related structure. If you "break" the cache by forcing every reference to be a cache miss, you will essentially see the 2x performance loss, since a misaligned reference that spans two cache lines costs two memory fetches instead of one.
If the cache is able to gain anything, it will skew the numbers.
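A minimal sketch, in C, of the experiment described above: read doubles at random positions in a buffer much larger than the cache, once per byte offset, and compare the times. Everything named here is an assumption made for illustration, not anyone's actual test program: the function name time_random_reads, the 256 MB buffer, the 64-byte line size, the xorshift generator, and the ALIGN_BENCH_MAIN switch. The reads go through memcpy so the code is well-defined C; on x86-64 compilers normally turn that into a single unaligned load, while on Alpha or Itanium a compiler may emit byte-wise code instead, so this would not reproduce the faulting numbers quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Small xorshift PRNG: cheap, repeatable, and the same cost for every offset. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/*
 * Read n_reads doubles from random 8-byte slots of a buf_bytes buffer, each
 * read starting `offset` bytes (0..7) past an 8-byte boundary.
 * Stores the sum of the values read in *sink (every slot holds 1.0) and
 * returns the elapsed CPU time in seconds, or -1.0 on bad arguments or
 * allocation failure.
 */
double time_random_reads(size_t buf_bytes, size_t offset, size_t n_reads,
                         double *sink)
{
    if (offset >= 8 || buf_bytes < 16 || sink == NULL)
        return -1.0;

    unsigned char *raw = malloc(buf_bytes + 64);
    if (raw == NULL)
        return -1.0;
    /* Round up to a 64-byte boundary (assumed cache line size). */
    unsigned char *base = raw + (64 - (uintptr_t)raw % 64) % 64;

    /* One slot fewer than fits, so slot*8 + offset + 8 stays in bounds. */
    size_t slots = buf_bytes / 8 - 1;
    const double one = 1.0;
    for (size_t i = 0; i < slots; i++)
        memcpy(base + i * 8 + offset, &one, sizeof one);

    uint64_t state = 0x9E3779B97F4A7C15ULL;   /* any nonzero seed works */
    double sum = 0.0;
    clock_t t0 = clock();
    for (size_t i = 0; i < n_reads; i++) {
        size_t slot = (size_t)(xorshift64(&state) % slots);
        double v;
        memcpy(&v, base + slot * 8 + offset, sizeof v);  /* possibly unaligned read */
        sum += v;
    }
    clock_t t1 = clock();

    free(raw);
    *sink = sum;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

#ifdef ALIGN_BENCH_MAIN
int main(void)
{
    const size_t buf = (size_t)256 << 20;  /* 256 MB, assumed to exceed total cache */
    const size_t n = 20000000;
    for (size_t off = 0; off < 8; off++) {
        double sink = 0.0;
        double secs = time_random_reads(buf, off, n, &sink);
        if (secs < 0.0) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        assert(sink == (double)n);  /* also keeps the reads from being optimized away */
        printf("OFFSET %zu : %.0f ms\n", off, secs * 1000.0);
    }
    return 0;
}
#endif
```

Build with something like `cc -O2 -DALIGN_BENCH_MAIN align_bench.c` (the file name is hypothetical). Two caveats on reading the results. With 8-byte slots and 64-byte lines, a nonzero offset makes only one slot in eight straddle a line boundary, so the average penalty will be well under 2x even if each straddling read costs two fetches. And because the random reads are independent of each other, the hardware can overlap several misses, which hides part of the latency; a dependent pointer chase would be a stricter test.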
- Bob Gezelter, http://www.rlgsc.com