[Info-vax] Microsoft: Alpha architecture responsible for poor Windows file compression

johnwallace4 at yahoo.co.uk johnwallace4 at yahoo.co.uk
Thu Nov 3 14:42:21 EDT 2016


On Thursday, 3 November 2016 15:58:32 UTC, Bob Gezelter  wrote:
> On Thursday, November 3, 2016 at 11:13:55 AM UTC-4, Chris wrote:
> > On 11/03/16 06:14, Bob Gezelter wrote:
> > > Actually, setting/clearing a bit is straightforwardly accomplished
> > > using a shift (or table lookup) and the standard logical instructions
> > > (e.g., AND, OR, NOT, XOR).
> > >
> > > Spent a fair amount of time deriving what would now probably be
> > > called "templates" for so-called "bit twiddling" on IBM System/360
> > > and DEC PDP-11, neither of which had purposely designed instructions.
> > >
> > > For example, on a PDP-11, one could do a bit count for a 16-bit
> > > word with two registers in seven instructions without branches with
> > > a 16-word auxiliary table. On the System/360, one could do a 32-bit
> > > word in a few more instructions.
> > >
> > > The non-aligned operations added to Alpha make some things slightly
> > > more efficient, but I would not want to place any bets on how much
> > > faster in the referenced example.
> > >
> > > - Bob Gezelter, http://www.rlgsc.com
> > 
> > I guess you could just write macros in asm or C to encapsulate that
> > for you. At least for embedded work, it's often much faster
> > to use table lookup methods to do bit translation. For example, a 256
> > byte table to translate an input bit pattern to required output, with
> > the input pattern as the address into the table. It may have seemed
> > a bit extravagant in the old days of small memory, but not now, and
> > it is much more flexible than trying to do the same thing
> > algorithmically...
> > 
> > Chris
> 
> Chris,
> 
> Yes, typically one writes macros to do the actual operation. However, even macros do not solve the entire problem.
> 
> According to the information in the underlying blog entry (at https://blogs.msdn.microsoft.com/oldnewthing/20161101-00/?p=94615) and the discussion in comp.arch, the problem is of a different sort, namely an attempt to use the same "portable" code on vastly different processors (e.g., 16-bit through 64-bit processors). 
> 
> Reading the discussion, the portable code worked in 16-bit units. Alpha best processes memory loads/stores in quadwords (64-bits; aligned). While compilers have gotten very good at optimizing, there are limits. Whether one is hitting a limit can only be determined by reading (and analyzing) the resulting assembler listing.
> 
> In this case, it would be interesting to see if the original compression scheme (the one with "poor" Alpha performance) had been re-coded in MACRO-64 by a proficient MACRO-64 programmer. My experience on other architectures in similar situations is that re-coding such an "inner loop" construct in optimized assembler (religiously eliminating non-64-bit memory references) would have closed the gap and likely then some.
> 
> - Bob Gezelter, http://www.rlgsc.com

As I suspect you know (especially given your "inner loop" 
reference), performance on stuff like this can be
significantly affected not only by choice of algorithm
and software implementation details, but by memory system
design and implementation, especially locality of reference
and similar cache-type considerations. Poor locality of 
reference can have a horrendous performance impact. Lots of
folks don't have to worry about this, but a few do care.
