## Chapter 2A World Apart

### The Unique Nature of Assembly Language Optimization

As I showed in the previous chapter, optimization is by no means always a matter of “dropping into assembly.” In fact, in performance tuning high-level language code, assembly should be used rarely, and then only after you’ve made sure a badly chosen or clumsily implemented algorithm isn’t eating you alive. Certainly if you use assembly at all, make absolutely sure you use it right. The potential of assembly code to run slowly is poorly understood by a lot of people, but that potential is great, especially in the hands of the ignorant.

Truly great optimization, however, happens only at the assembly level, and it happens in response to a set of dynamics that is totally different from that governing C/C++ or Pascal optimization. I’ll be speaking of assembly-level optimization time and again in this book, but when I do, I think it will be helpful if you have a grasp of those assembly specific dynamics.

As usual, the best way to wade in is to present a real-world example.

### Instructions: The Individual versus the Collective

Some time ago, I was asked to work over a critical assembly subroutine in order to make it run as fast as possible. The task of the subroutine was to construct a nibble out of four bits read from different bytes, rotating and combining the bits so that they ultimately ended up neatly aligned in bits 3-0 of a single byte. (In case you’re curious, the object was to construct a 16-color pixel from bits scattered over 4 bytes.) I examined the subroutine line by line, saving a cycle here and a cycle there, until the code truly seemed to be optimized. When I was done, the key part of the code looked something like this:

```LoopTop:
lodsb            ;get the next byte to extract a bit from
and   al,ah      ;isolate the bit we want
rol   al,cl      ;rotate the bit into the desired position
or    bl,al      ;insert the bit into the final nibble
dec   cx         ;the next bit goes 1 place to the right
dec   dx         ;count down the number of bits
jnz   LoopTop    ;process the next bit, if any
```

Now, it’s hard to write code that’s much faster than seven instructions, only one of which accesses memory, and most programmers would have called it a day at this point. Still, something bothered me, so I spent a bit of time going over the code again. Suddenly, the answer struck me—the code was rotating each bit into place separately, so that a multibit rotation was being performed every time through the loop, for a total of four separate time-consuming multibit rotations!

 While the instructions themselves were individually optimized, the overall approach did not make the best possible use of the instructions.

I changed the code to the following:

```LoopTop:
lodsb            ;get the next byte to extract a bit from
and   al,ah      ;isolate the bit we want
or    bl,al      ;insert the bit into the final nibble
rol   bl,1       ;make room for the next bit
dec   dx         ;count down the number of bits
jnz   LoopTop    ;process the next bit, if any
rol   bl,cl      ;rotate all four bits into their final
; positions at the same time
```

This moved the costly multibit rotation out of the loop so that it was performed just once, rather than four times. While the code may not look much different from the original, and in fact still contains exactly the same number of instructions, the performance of the entire subroutine improved by about 10 percent from just this one change. (Incidentally, that wasn’t the end of the optimization; I eliminated the DEC and JNJ instructions by expanding the four iterations of the loop—but that’s a tale for another chapter.)

The point is this: To write truly superior assembly programs, you need to know what the various instructions do and which instructions execute fastest...and more. You must also learn to look at your programming problems from a variety of perspectives so that you can put those fast instructions to work in the most effective ways.

### Assembly Is Fundamentally Different

Is it really so hard as all that to write good assembly code for the PC? Yes! Thanks to the decidedly quirky nature of the x86 family CPUs, assembly language differs fundamentally from other languages, and is undeniably harder to work with. On the other hand, the potential of assembly code is much greater than that of other languages, as well.

To understand why this is so, consider how a program gets written. A programmer examines the requirements of an application, designs a solution at some level of abstraction, and then makes that design come alive in a code implementation. If not handled properly, the transformation that takes place between conception and implementation can reduce performance tremendously; for example, a programmer who implements a routine to search a list of 100,000 sorted items with a linear rather than binary search will end up with a disappointingly slow program.

Graphics Programming Black Book © 2001 Michael Abrash