Listing 63.2 1 L63-2.ASM

```; unoptimized dot product; 17 cycles
fld     [vec0+0]        ;starts & ends on cycle 0
fmul    [vec1+0]        ;starts on cycle 1
fld     [vec0+4]        ;starts & ends on cycle 2
fmul    [vec1+4]        ;starts on cycle 3
fld     [vec0+8]        ;starts & ends on cycle 4
fmul    [vec1+8]        ;starts on cycle 5
;stalls for cycles 6-7
faddp   st(1),st(0)     ;starts on cycle 8
;stalls for cycles 9-10
faddp   st(1),st(0)     ;starts on cycle 11
;stalls for cycles 12-14
fstp    [dot]           ;starts on cycle 15,
; ends on cycle 16
```

Listing 63.3 L63-3.ASM

```;optimized dot product; 15 cycles
fld     [vec0+0]        ;starts & ends on cycle 0
fmul    [vec1+0]        ;starts on cycle 1
fld     [vec0+4]        ;starts & ends on cycle 2
fmul    [vec1+4]        ;starts on cycle 3
fld     [vec0+8]        ;starts & ends on cycle 4
fmul    [vec1+8]        ;starts on cycle 5
fxch    st(1)           ;no cost
faddp   st(2),st(0)     ;starts on cycle 6
;stalls for cycles 7-8
faddp   st(1),st(0)     ;starts on cycle 9
;stalls for cycles 10-12
fstp    [dot]           ;starts on cycle 13,
; ends on cycle 14
```

### The Cross Product

When last we looked at the cross product, we found that it’s handy for generating a vector that’s normal to two other vectors. The cross product is calculated as [u2v3-u3v2 u3v1-u1v3 u1v2-u2v1]. The theoretical minimum cycle count for the cross product is 21 cycles. Listing 63.4 shows a straightforward implementation that calculates each component of the result separately, losing 15 cycles to stalls.

Listing 63.4 L63-4.ASM

```;unoptimized cross product; 36 cycles
fld     [vec0+4]         ;starts & ends on cycle 0
fmul    [vec1+8]         ;starts on cycle 1
fld     [vec0+8]         ;starts & ends on cycle 2
fmul    [vec1+4]         ;starts on cycle 3
;stalls for cycles 4-5
fsubrp  st(1),st(0)      ;starts on cycle 6
;stalls for cycles 7-9
fstp    [vec2+0]         ;starts on cycle 10,
; ends on cycle 11
fld     [vec0+8]         ;starts & ends on cycle 12
fmul    [vec1+0]         ;starts on cycle 13
fld     [vec0+0]         ;starts & ends on cycle 14
fmul    [vec1+8]         ;starts on cycle 15
;stalls for cycles 16-17
fsubrp  st(1),st(0)      ;starts on cycle 18
;stalls for cycles 19-21
fstp    [vec2+4]         ;starts on cycle 22,
; ends on cycle 23

fld     [vec0+0]         ;starts & ends on cycle 24
fmul    [vec1+4]         ;starts on cycle 25
fld     [vec0+4]         ;starts & ends on cycle 26
fmul    [vec1+0]         ;starts on cycle 27
;stalls for cycles 28-29
fsubrp  st(1),st(0)      ;starts on cycle 30
;stalls for cycles 31-33
fstp    [vec2+8]         ;starts on cycle 34,
; ends on cycle 35
```

We couldn’t get rid of many of the stalls in the dot product code because with six inputs and one output, it was impossible to interleave all the operations. However, the cross product, with three outputs, is much more amenable to optimization. In fact, three is the magic number; because we have three calculation streams and the latency of FADD, FSUB, and FMUL is 3 cycles, we can eliminate almost every single stall in the cross-product calculation, as shown in Listing 63.5. Listing 63.5 loses only one cycle to a stall, the cycle before the first FST; the relevant FSUB has just finished on the preceding cycle, so we run into the extra cycle of latency associated with FST. Listing 63.5 is more than 60 percent faster than Listing 63.4, a striking illustration of the power of properly managing the Pentium’s FP pipeline.

Listing 63.5 L63-5.ASM

```;optimized cross product; 22 cycles
fld       [vec0+4]        ;starts & ends on cycle 0
fmul      [vec1+8]        ;starts on cycle 1
fld       [vec0+8]        ;starts & ends on cycle 2
fmul      [vec1+0]        ;starts on cycle 3
fld       [vec0+0]        ;starts & ends on cycle 4
fmul      [vec1+4]        ;starts on cycle 5
fld       [vec0+8]        ;starts & ends on cycle 6
fmul      [vec1+4]        ;starts on cycle 7
fld       [vec0+0]        ;starts & ends on cycle 8
fmul      [vec1+8]        ;starts on cycle 9
fld       [vec0+4]        ;starts & ends on cycle 10
fmul      [vec1+0]        ;starts on cycle 11
fxch      st(2)           ;no cost
fsubrp    st(5),st(0)     ;starts on cycle 12
fsubrp    st(3),st(0)     ;starts on cycle 13
fsubrp    st(1),st(0)     ;starts on cycle 14
fxch      st(2)           ;no cost
;stalls for cycle 15
fstp      [vec2+0]        ;starts on cycle 16,
; ends on cycle 17
fstp      [vec2+4]        ;starts on cycle 18,
; ends on cycle 19
fstp      [vec2+8]        ;starts on cycle 20,
; ends on cycle 21
```

### Transformation

Transforming a point, for example from worldspace to viewspace, is one of the most heavily used FP operations in realtime 3-D. Conceptually, transformation is nothing more than three dot products and three additions, as I will discuss in Chapter 61. (Note that I’m talking about a subset of a general 4×4 transformation matrix, where the fourth row is always implicitly [0 0 0 1]. This limited form suffices for common transformations, and does 25 percent less work than a full 4×4 transformation.)

Transformation is calculated as:

```-  -    -                -  -   -
v1      m11 m12 m13 m14    u1
v2  =   m21 m22 m23 m24    u2
v3      m31 m32 m33 m34    u3
1       0   0   0   1     1
-  -    -                -  -   -
```

or

```v1 = m11u1 + m12u2 + m13u3 + m14
v2 = m21u1 + m22u2 + m23u3 + m24
v3 = m31u1 + m32u2 + m33u3 + m34.
```

Graphics Programming Black Book © 2001 Michael Abrash