AMD x86 Typewriter User Manual

AMD Athlon™ Processor x86 Code Optimization Guide
© 1999 Advanced Micro Devices, Inc. All rights reserved.
The contents of this document are provided in connection with Advanced
Micro Devices, Inc. (“AMD”) products. AMD makes no representations or
warranties with respect to the accuracy or completeness of the contents of
this publication and reserves the right to make changes to specifications and
product descriptions at any time without notice. No license, whether express,
implied, arising by estoppel or otherwise, to any intellectual property rights
is granted by this publication. Except as set forth in AMD’s Standard Terms
and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims
any express or implied warranty, relating to its products including, but not
limited to, the implied warranty of merchantability, fitness for a particular
purpose, or infringement of any intellectual property right.
AMD’s products are not designed, intended, authorized or warranted for use
as components in systems intended for surgical implant into the body, or in
other applications intended to support or sustain life, or in any other application in which the failure of AMD’s product could create a situation where personal injury, death, or severe property or environmental damage may occur.
AMD reserves the right to discontinue or make changes to its products at any
time without notice.
Trademarks
AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks,
and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc.
Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
MMX is a trademark and Pentium is a registered trademark of Intel Corporation.
Other product names used in this publication are for identification purposes only and may be trademarks of
their respective companies.
AMD Athlon™ Processor x86 Code Optimization
22007E/0—November 1999
Contents
Revision History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
About this Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
AMD Athlon™ Processor Family. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
AMD Athlon Processor Microarchitecture Summary . . . . . . . . . . . . . 4
2 Top Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Optimization Star . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Group I Optimizations — Essential Optimizations . . . . . . . . . . . . . . . 8
Memory Size and Alignment Issues . . . . . . . . . . . . . . . . . . . . . . 8
Use the 3DNow!™ PREFETCH and PREFETCHW
Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Select DirectPath Over VectorPath Instructions . . . . . . . . . . . 9
Group II Optimizations—Secondary Optimizations . . . . . . . . . . . . . . 9
Load-Execute Instruction Usage. . . . . . . . . . . . . . . . . . . . . . . . . 9
Take Advantage of Write Combining. . . . . . . . . . . . . . . . . . . . 10
Use 3DNow! Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Avoid Branches Dependent on Random Data . . . . . . . . . . . . . 10
Avoid Placing Code and Data in the Same
64-Byte Cache Line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 C Source Level Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Ensure Floating-Point Variables and Expressions
are of Type Float . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Use 32-Bit Data Types for Integer Code . . . . . . . . . . . . . . . . . . . . . . . 13
Consider the Sign of Integer Operands . . . . . . . . . . . . . . . . . . . . . . . 14
Use Array Style Instead of Pointer Style Code . . . . . . . . . . . . . . . . . 15
Completely Unroll Small Loops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Avoid Unnecessary Store-to-Load Dependencies . . . . . . . . . . . . . . . 18
Consider Expression Order in Compound Branch Conditions . . . . . 20
Switch Statement Usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Optimize Switch Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Use Prototypes for All Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Use Const Type Qualifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Generic Loop Hoisting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Generalization for Multiple Constant Control Code. . . . . . . . 23
Declare Local Functions as Static . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Dynamic Memory Allocation Consideration . . . . . . . . . . . . . . . . . . . 25
Introduce Explicit Parallelism into Code . . . . . . . . . . . . . . . . . . . . . . 25
Explicitly Extract Common Subexpressions . . . . . . . . . . . . . . . . . . . 26
C Language Structure Component Considerations . . . . . . . . . . . . . . 27
Sort Local Variables According to Base Type Size . . . . . . . . . . . . . . 28
Accelerating Floating-Point Divides and Square Roots . . . . . . . . . . 29
Avoid Unnecessary Integer Division. . . . . . . . . . . . . . . . . . . . . . . . . . 31
Copy Frequently De-referenced Pointer Arguments to
Local Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Instruction Decoding Optimizations . . . . . . . . . . . . . . . . . . . . . . . . 33
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Select DirectPath Over VectorPath Instructions. . . . . . . . . . . . . . . . 34
Load-Execute Instruction Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Use Load-Execute Integer Instructions . . . . . . . . . . . . . . . . . . 34
Use Load-Execute Floating-Point Instructions with
Floating-Point Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Avoid Load-Execute Floating-Point Instructions with
Integer Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Align Branch Targets in Program Hot Spots . . . . . . . . . . . . . . . . . . . 36
Use Short Instruction Lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Avoid Partial Register Reads and Writes. . . . . . . . . . . . . . . . . . . . . . 37
Replace Certain SHLD Instructions with Alternative Code. . . . . . . 38
Use 8-Bit Sign-Extended Immediates . . . . . . . . . . . . . . . . . . . . . . . . . 38
Use 8-Bit Sign-Extended Displacements. . . . . . . . . . . . . . . . . . . . . . . 39
Code Padding Using Neutral Code Fillers . . . . . . . . . . . . . . . . . . . . . 39
Recommendations for the AMD Athlon Processor . . . . . . . . . 40
Recommendations for AMD-K6® Family and
AMD Athlon Processor Blended Code . . . . . . . . . . . . . . . . . . . 41
5 Cache and Memory Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Memory Size and Alignment Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Avoid Memory Size Mismatches . . . . . . . . . . . . . . . . . . . . . . . . 45
Align Data Where Possible . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Use the 3DNow! PREFETCH and PREFETCHW Instructions. . . . . 46
Take Advantage of Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . 50
Avoid Placing Code and Data in the Same 64-Byte Cache Line. . . . 50
Store-to-Load Forwarding Restrictions. . . . . . . . . . . . . . . . . . . . . . . . 51
Store-to-Load Forwarding Pitfalls—True Dependencies. . . . 51
Summary of Store-to-Load Forwarding Pitfalls to Avoid . . . . 54
Stack Alignment Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Align TBYTE Variables on Quadword Aligned Addresses. . . . . . . . 55
C Language Structure Component Considerations . . . . . . . . . . . . . . 55
Sort Variables According to Base Type Size . . . . . . . . . . . . . . . . . . . 56
6 Branch Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Avoid Branches Dependent on Random Data . . . . . . . . . . . . . . . . . . 57
AMD Athlon Processor Specific Code . . . . . . . . . . . . . . . . . . . 58
Blended AMD-K6 and AMD Athlon Processor Code . . . . . . . 58
Always Pair CALL and RETURN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Replace Branches with Computation in 3DNow! Code . . . . . . . . . . . 60
Muxing Constructs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Sample Code Translated into 3DNow! Code . . . . . . . . . . . . . . 61
Avoid the Loop Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Avoid Far Control Transfer Instructions . . . . . . . . . . . . . . . . . . . . . . 65
Avoid Recursive Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7 Scheduling Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Schedule Instructions According to their Latency . . . . . . . . . . . . . . 67
Unrolling Loops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Complete Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Partial Loop Unrolling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Use Function Inlining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Always Inline Functions if Called from One Site . . . . . . . . . . 72
Always Inline Functions with Fewer than 25 Machine
Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Avoid Address Generation Interlocks. . . . . . . . . . . . . . . . . . . . . . . . . 72
Use MOVZX and MOVSX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Minimize Pointer Arithmetic in Loops . . . . . . . . . . . . . . . . . . . . . . . . 73
Push Memory Data Carefully. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8 Integer Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Replace Divides with Multiplies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Multiplication by Reciprocal (Division) Utility . . . . . . . . . . . 77
Unsigned Division by Multiplication of Constant. . . . . . . . . . 78
Signed Division by Multiplication of Constant . . . . . . . . . . . . 79
Use Alternative Code When Multiplying by a Constant. . . . . . . . . . 81
Use MMX™ Instructions for Integer-Only Work . . . . . . . . . . . . . . . . 83
Repeated String Instruction Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Latency of Repeated String Instructions. . . . . . . . . . . . . . . . . 84
Guidelines for Repeated String Instructions . . . . . . . . . . . . . 84
Use XOR Instruction to Clear Integer Registers . . . . . . . . . . . . . . . . 86
Efficient 64-Bit Integer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Efficient Implementation of Population Count Function . . . . . . . . . 91
Derivation of Multiplier Used for Integer Division
by Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Unsigned Derivation for Algorithm, Multiplier, and
Shift Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Signed Derivation for Algorithm, Multiplier, and
Shift Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9 Floating-Point Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Ensure All FPU Data is Aligned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Use Multiplies Rather than Divides . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Use FFREEP Macro to Pop One Register from the FPU Stack . . . . 98
Floating-Point Compare Instructions . . . . . . . . . . . . . . . . . . . . . . . . . 98
Use the FXCH Instruction Rather than FST/FLD Pairs . . . . . . . . . . 99
Avoid Using Extended-Precision Data . . . . . . . . . . . . . . . . . . . . . . . . 99
Minimize Floating-Point-to-Integer Conversions . . . . . . . . . . . . . . . 100
Floating-Point Subexpression Elimination. . . . . . . . . . . . . . . . . . . . 103
Check Argument Range of Trigonometric Instructions
Efficiently . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Take Advantage of the FSINCOS Instruction . . . . . . . . . . . . . . . . . 105
10 3DNow!™ and MMX™ Optimizations . . . . . . . . . . . . . . . . . . . . . . 107
Use 3DNow! Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Use FEMMS Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Use 3DNow! Instructions for Fast Division . . . . . . . . . . . . . . . . . . . 108
Optimized 14-Bit Precision Divide . . . . . . . . . . . . . . . . . . . . . 108
Optimized Full 24-Bit Precision Divide . . . . . . . . . . . . . . . . . 108
Pipelined Pair of 24-Bit Precision Divides. . . . . . . . . . . . . . . 109
Newton-Raphson Reciprocal . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Use 3DNow! Instructions for Fast Square Root and
Reciprocal Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Optimized 15-Bit Precision Square Root . . . . . . . . . . . . . . . . 110
Optimized 24-Bit Precision Square Root . . . . . . . . . . . . . . . . 110
Newton-Raphson Reciprocal Square Root. . . . . . . . . . . . . . . 111
Use MMX PMADDWD Instruction to Perform
Two 32-Bit Multiplies in Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3DNow! and MMX Intra-Operand Swapping . . . . . . . . . . . . . . . . . . 112
Fast Conversion of Signed Words to Floating-Point . . . . . . . . . . . . 113
Use MMX PXOR to Negate 3DNow! Data . . . . . . . . . . . . . . . . . . . . 113
Use MMX PCMP Instead of 3DNow! PFCMP. . . . . . . . . . . . . . . . . . 114
Use MMX Instructions for Block Copies and Block Fills . . . . . . . . 115
Use MMX PXOR to Clear All Bits in an MMX Register . . . . . . . . . 118
Use MMX PCMPEQD to Set All Bits in an MMX Register . . . . . . . 119
Use MMX PAND to Find Absolute Value in 3DNow! Code . . . . . . 119
Optimized Matrix Multiplication. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Efficient 3D-Clipping Code Computation Using
3DNow! Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation . . . . . 123
Stream of Packed Unsigned Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Complex Number Arithmetic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
11 General x86 Optimization Guidelines . . . . . . . . . . . . . . . . . . . . . 127
Short Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Register Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Stack Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Appendix A AMD Athlon™ Processor Microarchitecture . . . . . . . . . . 129
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
AMD Athlon Processor Microarchitecture . . . . . . . . . . . . . . . . . . . . 130
Superscalar Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Instruction Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Predecode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Branch Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Early Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Instruction Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Integer Scheduler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Integer Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Floating-Point Scheduler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Floating-Point Execution Unit . . . . . . . . . . . . . . . . . . . . . . . . 137
Load-Store Unit (LSU). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
L2 Cache Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
AMD Athlon System Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Appendix B Pipeline and Execution Unit Resources Overview . . . . 141
Fetch and Decode Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Integer Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Floating-Point Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Execution Unit Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Integer Pipeline Operations . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Floating-Point Pipeline Operations . . . . . . . . . . . . . . . . . . . . 150
Load/Store Pipeline Operations . . . . . . . . . . . . . . . . . . . . . . . 151
Code Sample Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Appendix C Implementation of Write Combining . . . . . . . . . . . . . . . 155
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Write-Combining Definitions and Abbreviations . . . . . . . . . . . . . . 156
What is Write Combining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Programming Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Write-Combining Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Sending Write-Buffer Data to the System . . . . . . . . . . . . . . . 159
Appendix D Performance-Monitoring Counters . . . . . . . . . . . . . . . . . 161
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Performance Counter Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
PerfEvtSel[3:0] MSRs
(MSR Addresses C001_0000h–C001_0003h) . . . . . . . . . . . . . 162
PerfCtr[3:0] MSRs
(MSR Addresses C001_0004h–C001_0007h) . . . . . . . . . . . . . 167
Starting and Stopping the Performance-Monitoring
Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Event and Time-Stamp Monitoring Software. . . . . . . . . . . . . . . . . . 168
Monitoring Counter Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Appendix E Programming the MTRR and PAT . . . . . . . . . . . . . . . . . . 171
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Memory Type Range Register (MTRR) Mechanism . . . . . . . . . . . . 171
Page Attribute Table (PAT). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Appendix F Instruction Dispatch and Execution Resources . . . . . . . 187
Appendix G DirectPath versus VectorPath Instructions . . . . . . . . . . 219
Select DirectPath Over VectorPath Instructions. . . . . . . . . . . . . . . 219
DirectPath Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
VectorPath Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
List of Figures
Figure 1. AMD Athlon™ Processor Block Diagram . . . . . . . . . . . 131
Figure 2. Integer Execution Pipeline . . . . . . . . . . . . . . . . . . . . . . . 135
Figure 3. Floating-Point Unit Block Diagram . . . . . . . . . . . . . . . . 137
Figure 4. Load/Store Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware . . . . . . . . 142
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages . . . . . . . . . . . 142
Figure 7. Integer Execution Pipeline . . . . . . . . . . . . . . . . . . . . . . . 144
Figure 8. Integer Pipeline Stages . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Figure 9. Floating-Point Unit Block Diagram . . . . . . . . . . . . . . . . 146
Figure 10. Floating-Point Pipeline Stages . . . . . . . . . . . . . . . . . . . . 146
Figure 11. PerfEvtSel[3:0] Registers . . . . . . . . . . . . . . . . . . . . . . . . 162
Figure 12. MTRR Mapping of Physical Memory . . . . . . . . . . . . . . . 173
Figure 13. MTRR Capability Register Format . . . . . . . . . . . . . . . . 174
Figure 14. MTRR Default Type Register Format . . . . . . . . . . . . . . 175
Figure 15. Page Attribute Table (MSR 277h) . . . . . . . . . . . . . . . . . 177
Figure 16. MTRRphysBasen Register Format . . . . . . . . . . . . . . . . . 183
Figure 17. MTRRphysMaskn Register Format . . . . . . . . . . . . . . . . 184
List of Tables
Table 1. Latency of Repeated String Instructions . . . . . . . . . . . . . 84
Table 2. Integer Pipeline Operation Types . . . . . . . . . . . . . . . . . 149
Table 3. Integer Decode Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Table 4. Floating-Point Pipeline Operation Types . . . . . . . . . . . 150
Table 5. Floating-Point Decode Types . . . . . . . . . . . . . . . . . . . . . 150
Table 6. Load/Store Unit Stages . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Table 7. Sample 1 – Integer Register Operations . . . . . . . . . . . . 153
Table 8. Sample 2 – Integer Register and Memory Load Operations . . 154
Table 9. Write Combining Completion Events . . . . . . . . . . . . . . 158
Table 10. AMD Athlon™ System Bus Commands Generation Rules . . . 159
Table 11. Performance-Monitoring Counters . . . . . . . . . . . . . . . . 164
Table 12. Memory Type Encodings . . . . . . . . . . . . . . . . . . . . . . . . 174
Table 13. Standard MTRR Types and Properties . . . . . . . . . . . . . 176
Table 14. PATi 3-Bit Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Table 15. Effective Memory Type Based on PAT and MTRRs . . . . 179
Table 16. Final Output Memory Types . . . . . . . . . . . . . . . . . . . . . 180
Table 17. MTRR Fixed Range Register Format . . . . . . . . . . . . . . 182
Table 18. MTRR-Related Model-Specific Register (MSR) Map . . . 185
Table 19. Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Table 20. MMX™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Table 21. MMX Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Table 22. Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . 212
Table 23. 3DNow!™ Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Table 24. 3DNow! Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Table 25. DirectPath Integer Instructions . . . . . . . . . . . . . . . . . . 220
Table 26. DirectPath MMX Instructions . . . . . . . . . . . . . . . . . . . . 227
Table 27. DirectPath MMX Extensions . . . . . . . . . . . . . . . . . . . . . 228
Table 28. DirectPath Floating-Point Instructions . . . . . . . . . . . . 229
Table 29. VectorPath Integer Instructions . . . . . . . . . . . . . . . . . . 231
Table 30. VectorPath MMX Instructions . . . . . . . . . . . . . . . . . . . 234
Table 31. VectorPath MMX Extensions . . . . . . . . . . . . . . . . . . . . 234
Table 32. VectorPath Floating-Point Instructions . . . . . . . . . . . . 235
Revision History
Date: Nov. 1999
Rev: E
Description:
Added “About this Document” on page 1.
Further clarification of “Consider the Sign of Integer Operands” on page 14.
Added the optimization, “Use Array Style Instead of Pointer Style Code” on page 15.
Added the optimization, “Accelerating Floating-Point Divides and Square Roots” on page 29.
Clarified examples in “Copy Frequently De-referenced Pointer Arguments to Local Variables” on page 31.
Further clarification of “Select DirectPath Over VectorPath Instructions” on page 34.
Further clarification of “Align Branch Targets in Program Hot Spots” on page 36.
Further clarification of REP instruction as a filler in “Code Padding Using Neutral Code Fillers” on page 39.
Further clarification of “Use the 3DNow!™ PREFETCH and PREFETCHW Instructions” on page 46.
Modified examples 1 and 2 of “Unsigned Division by Multiplication of Constant” on page 78.
Added the optimization, “Efficient Implementation of Population Count Function” on page 91.
Further clarification of “Use FFREEP Macro to Pop One Register from the FPU Stack” on page 98.
Further clarification of “Minimize Floating-Point-to-Integer Conversions” on page 100.
Added the optimization, “Check Argument Range of Trigonometric Instructions Efficiently” on page 103.
Added the optimization, “Take Advantage of the FSINCOS Instruction” on page 105.
Further clarification of “Use 3DNow!™ Instructions for Fast Division” on page 108.
Further clarification of “Use FEMMS Instruction” on page 107.
Further clarification of “Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root” on page 110.
Clarified “3DNow!™ and MMX™ Intra-Operand Swapping” on page 112.
Corrected PCMPGT information in “Use MMX™ PCMP Instead of 3DNow!™ PFCMP” on page 114.
Added the optimization, “Use MMX™ Instructions for Block Copies and Block Fills” on page 115.
Modified the rule for “Use MMX™ PXOR to Clear All Bits in an MMX™ Register” on page 118.
Modified the rule for “Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register” on page 119.
Added the optimization, “Optimized Matrix Multiplication” on page 119.
Added the optimization, “Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions” on page 122.
Added the optimization, “Complex Number Arithmetic” on page 126.
Added Appendix E, “Programming the MTRR and PAT”.
Rearranged the appendices.
Added Index.
1
Introduction
The AMD Athlon™ processor is the newest microprocessor in
the AMD K86™ family of microprocessors. The advances in the
AMD Athlon processor take superscalar operation and
out-of-order execution to a new level. The AMD Athlon
processor has been designed to efficiently execute code written
for previous-generation x86 processors. However, to enable the
fastest code execution with the AMD Athlon processor,
programmers should write software that includes specific code
optimization techniques.
About this Document
This document contains information to assist programmers in
creating optimized code for the AMD Athlon processor. It is
intended for compiler and assembler designers as well as for
C and assembly language programmers writing
execution-sensitive code sequences.
This document assumes that the reader possesses in-depth
knowledge of the x86 instruction set, the x86 architecture
(registers, programming modes, etc.), and the IBM PC-AT
platform.
This guide has been written specifically for the AMD Athlon
processor, but it includes considerations for
previous-generation processors and describes how those
optimizations are applicable to the AMD Athlon processor. This
guide contains the following chapters:
Chapter 1: Introduction. Outlines the material covered in this
document. Summarizes the AMD Athlon microarchitecture.
Chapter 2: Top Optimizations. Provides convenient descriptions of
the most important optimizations a programmer should take
into consideration.
Chapter 3: C Source Level Optimizations. Describes optimizations that
C/C++ programmers can implement.
Chapter 4: Instruction Decoding Optimizations. Describes methods that
will make the most efficient use of the three sophisticated
instruction decoders in the AMD Athlon processor.
Chapter 5: Cache and Memory Optimizations. Describes optimizations
that make efficient use of the large L1 caches and
high-bandwidth buses of the AMD Athlon processor.
Chapter 6: Branch Optimizations. Describes optimizations that
improve branch prediction and minimize branch penalties.
Chapter 7: Scheduling Optimizations. Describes optimizations that
improve code scheduling for efficient execution resource
utilization.
Chapter 8: Integer Optimizations. Describes optimizations that
improve integer arithmetic and make efficient use of the
integer execution units in the AMD Athlon processor.
Chapter 9: Floating-Point Optimizations. Describes optimizations that
make maximum use of the superscalar and pipelined
floating-point unit (FPU) of the AMD Athlon processor.
Chapter 10: 3DNow!™ and MMX™ Optimizations. Describes guidelines
for Enhanced 3DNow! and MMX code optimization techniques.
Chapter 11: General x86 Optimization Guidelines. Lists generic
optimization techniques applicable to x86 processors.
Appendix A: AMD Athlon Processor Microarchitecture. Describes in
detail the microarchitecture of the AMD Athlon processor.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes
in detail the execution units and their relation to the instruction
pipeline.
Appendix C: Implementation of Write Combining. Describes the
algorithm used by the AMD Athlon processor to write combine.
Appendix D: Performance-Monitoring Counters. Describes the usage of
the performance counters available in the AMD Athlon
processor.
Appendix E: Programming the MTRR and PAT. Describes the steps
needed to program the Memory Type Range Registers and the
Page Attribute Table.
Appendix F: Instruction Dispatch and Execution Resources. Lists the
execution resource usage of each instruction.
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86
instructions that are DirectPath and VectorPath instructions.
AMD Athlon™ Processor Family
The AMD Athlon processor family uses state-of-the-art
decoupled decode/execution design techniques to deliver
next-generation performance with x86 binary software
compatibility. This next-generation processor family advances
x86 code execution by using flexible instruction predecoding,
wide and balanced decoders, aggressive out-of-order execution,
parallel integer execution pipelines, parallel floating-point
execution pipelines, deep pipelined execution for higher
delivered operating frequency, dedicated backside cache
memory, and a new high-performance double-rate 64-bit local
bus. As an x86 binary-compatible processor, the AMD Athlon
processor implements the industry-standard x86 instruction set
by decoding and executing the x86 instructions using a
proprietary microarchitecture. This microarchitecture allows
the delivery of maximum performance when running x86-based
PC software.
AMD Athlon™ Processor Microarchitecture Summary
The AMD Athlon processor brings superscalar performance
and high operating frequency to PC systems running
industry-standard x86 software. A brief summary of the
next-generation design features implemented in the
AMD Athlon processor is as follows:
■ High-speed double-rate local bus interface
■ Large, split 128-Kbyte level-one (L1) cache
■ Dedicated backside level-two (L2) cache
■ Instruction predecode and branch detection during cache
  line fills
■ Decoupled decode/execution core
■ Three-way x86 instruction decoding
■ Dynamic scheduling and speculative execution
■ Three-way integer execution
■ Three-way address generation
■ Three-way floating-point execution
■ 3DNow!™ technology and MMX™ single-instruction
  multiple-data (SIMD) instruction extensions
■ Super data forwarding
■ Deep out-of-order integer and floating-point execution
■ Register renaming
■ Dynamic branch prediction
The AMD Athlon processor communicates through a
next-generation high-speed local bus that is beyond the current
Socket 7 or Super7™ bus standard. The local bus can transfer
data at twice the rate of the bus operating frequency by using
both the rising and falling edges of the clock (see
“AMD Athlon™ System Bus” on page 139 for more
information).
To reduce on-chip cache miss penalties and to avoid subsequent
data load or instruction fetch stalls, the AMD Athlon processor
has a dedicated high-speed backside L2 cache. The large
128-Kbyte L1 on-chip cache and the backside L2 cache allow the
AMD Athlon execution core to achieve and sustain maximum
performance.
As a decoupled decode/execution processor, the AMD Athlon
processor makes use of a proprietary microarchitecture, which
defines the heart of the AMD Athlon processor. With the
inclusion of all these features, the AMD Athlon processor is
capable of decoding, issuing, executing, and retiring multiple
x86 instructions per cycle, resulting in superior scalable
performance.
The AMD Athlon processor includes both the industry-standard
MMX SIMD integer instructions and the 3DNow! SIMD
floating-point instructions that were first introduced in the
AMD-K6®-2 processor. The design of 3DNow! technology was
based on suggestions from leading graphics and independent
software vendors (ISVs). Using SIMD format, the AMD Athlon
processor can generate up to four 32-bit, single-precision
floating-point results per clock cycle.
The 3DNow! execution units allow for high-performance
floating-point vector operations, which can replace x87
instructions and enhance the performance of 3D graphics and
other floating-point-intensive applications. Because the
3DNow! architecture uses the same registers as the MMX
instructions, switching between MMX and 3DNow! has no
penalty.
The AMD Athlon processor designers took another innovative
step by carefully integrating the traditional x87 floating-point,
MMX, and 3DNow! execution units into one operational engine.
With the introduction of the AMD Athlon processor, the
switching overhead between x87, MMX, and 3DNow!
technology is virtually eliminated. The AMD Athlon processor
combined with 3DNow! technology brings a better multimedia
experience to mainstream PC users while maintaining
backwards compatibility with all existing x86 software.
Although the AMD Athlon processor can extract code
parallelism on-the-fly from off-the-shelf, commercially available
x86 software, specific code optimization for the AMD Athlon
processor can result in even higher delivered performance. This
document describes the proprietary microarchitecture in the
AMD Athlon processor and makes recommendations for
optimizing execution of x86 software on the processor.
The coding techniques for achieving peak performance on the
AMD Athlon processor include, but are not limited to, those for
the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium
II processors. However, many of these optimizations are not
necessary for the AMD Athlon processor to achieve maximum
performance. Due to the more flexible pipeline control and
aggressive out-of-order execution, the AMD Athlon processor is
not as sensitive to instruction selection and code scheduling.
This flexibility is one of the distinct advantages of the
AMD Athlon processor.
The AMD Athlon processor uses the latest in processor
microarchitecture design techniques to provide the highest x86
performance for today’s PC. In short, the AMD Athlon
processor offers true next-generation performance with x86
binary software compatibility.
2
Top Optimizations
This chapter contains concise descriptions of the best
optimizations for improving the performance of the
AMD Athlon™ processor. Subsequent chapters contain more
detailed descriptions of these and other optimizations. The
optimizations in this chapter are divided into two groups and
listed in order of importance.
Group I: Essential Optimizations
Group I contains essential optimizations. Users should follow
these critical guidelines closely. The optimizations in Group I
are as follows:
■ Memory Size and Alignment Issues: avoid memory size
  mismatches; align data where possible
■ Use the 3DNow!™ PREFETCH and PREFETCHW
  Instructions
■ Select DirectPath Over VectorPath Instructions
Group II: Secondary Optimizations
Group II contains secondary optimizations that can
significantly improve the performance of the AMD Athlon
processor. The optimizations in Group II are as follows:
■ Load-Execute Instruction Usage: use load-execute
  instructions; avoid load-execute floating-point instructions
  with integer operands
■ Take Advantage of Write Combining
■ Use 3DNow! Instructions
■ Avoid Branches Dependent on Random Data
■ Avoid Placing Code and Data in the Same 64-Byte Cache
  Line
Optimization Star
✩
TOP
The top optimizations described in this chapter are flagged
with a star. In addition, the star appears beside the more
detailed descriptions found in subsequent chapters.
Group I Optimizations — Essential Optimizations
Memory Size and Alignment Issues
See “Memory Size and Alignment Issues” on page 45 for more
details.
Avoid Memory Size Mismatches
✩
TOP
Avoid memory size mismatches when instructions operate on
the same data. For instructions that store and reload the same
data, keep operands aligned and keep the loads/stores of each
operand the same size.
Align Data Where Possible
✩
TOP
Avoid misaligned data references. A misaligned store or load
operation suffers a minimum one-cycle penalty in the
AMD Athlon processor load/store pipeline.
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
✩
TOP
For code that can take advantage of prefetching, use the
3DNow! PREFETCH and PREFETCHW instructions to increase
the effective bandwidth to the AMD Athlon processor, which
significantly improves performance. All the prefetch
instructions are essentially integer instructions and can be used
anywhere, in any type of code (integer, x87, 3DNow!, MMX,
etc.). Use the following formula to determine prefetch distance:
Prefetch Length = 200 (DS/C)
■ Round up to the nearest cache line.
■ DS is the data stride per loop iteration.
■ C is the number of cycles per loop iteration when hitting in
  the L1 cache.
See “Use the 3DNow!™ PREFETCH and PREFETCHW
Instructions” on page 46 for more details.
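As an illustration only (this code is not part of the original
guide), the prefetch-distance formula can be evaluated in C roughly
as follows. The function name prefetch_distance_bytes and the macro
CACHE_LINE_BYTES are hypothetical; the 64-byte line size is the one
this chapter gives for the AMD Athlon processor.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64u   /* cache-line size stated in this guide */

/* Prefetch Length = 200 * (DS / C), rounded up to a whole cache line.
   ds_bytes: data stride per loop iteration, in bytes
   cycles:   cycles per loop iteration when hitting in L1 (must be > 0) */
size_t prefetch_distance_bytes(size_t ds_bytes, size_t cycles)
{
    /* ceil(200 * DS / C) in integer arithmetic */
    size_t raw = (200u * ds_bytes + cycles - 1u) / cycles;

    /* round up to the next multiple of the cache-line size */
    return ((raw + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES)
           * CACHE_LINE_BYTES;
}
```

For example, a loop that consumes 8 bytes per iteration in 10 cycles
gives 200 * 8 / 10 = 160 bytes, which rounds up to 192 bytes (three
cache lines).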
Select DirectPath Over VectorPath Instructions
✩
TOP
Use DirectPath instructions rather than VectorPath
instructions. DirectPath instructions are optimized to decode
and execute efficiently by minimizing the number of operations
per x86 instruction. Three DirectPath instructions can be
decoded in parallel. Using VectorPath instructions will block
DirectPath instructions from decoding simultaneously.
See Appendix G, “DirectPath versus VectorPath Instructions”
on page 219 for a list of DirectPath and VectorPath instructions.
Group II Optimizations—Secondary Optimizations
Load-Execute Instruction Usage
See “Load-Execute Instruction Usage” on page 34 for more
details.
Use Load-Execute Instructions
✩
TOP
Wherever possible, use load-execute instructions to increase
code density with the one exception described below. The
split-instruction form of load-execute instructions can be used
to avoid scheduler stalls for longer executing instructions and
to explicitly schedule the load and execute operations.
Avoid Load-Execute Floating-Point Instructions with Integer Operands
✩
TOP
Do not use load-execute floating-point instructions with integer
operands. The floating-point load-execute instructions with
integer operands are VectorPath and generate two OPs in a
cycle, while the discrete equivalent enables a third DirectPath
instruction to be decoded in the same cycle.
Take Advantage of Write Combining
✩
TOP
This guideline applies only to operating system, device driver,
and BIOS programmers. In order to improve system
performance, the AMD Athlon processor aggressively combines
multiple memory-write cycles of any data size that address
locations within a 64-byte cache line aligned write buffer.
See Appendix C, “Implementation of Write Combining” on
page 155 for more details.
Use 3DNow!™ Instructions
✩
TOP
Unless accuracy requirements dictate otherwise, perform
floating-point computations using the 3DNow! instructions
instead of x87 instructions. The SIMD nature of 3DNow!
instructions achieves twice the number of FLOPs that are
achieved through x87 instructions. 3DNow! instructions also
provide for a flat register file instead of the stack-based
approach of x87 instructions.
See Table 23 on page 217 for a list of 3DNow! instructions. For
information about instruction usage, see the 3DNow!™
Technology Manual, order# 21928.
Avoid Branches Dependent on Random Data
✩
TOP
Avoid data-dependent branches around a single instruction.
Data-dependent branches acting upon basically random data
can cause the branch prediction logic to mispredict the branch
about 50% of the time. Design branch-free alternative code
sequences, which results in shorter average execution time.
See “Avoid Branches Dependent on Random Data” on page 57
for more details.
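The following C sketch (not from the original guide) shows one way
to replace a data-dependent branch around a single instruction with
a branch-free sequence. The function names are hypothetical, the
mask trick assumes two's-complement integers, and whether either
version actually compiles to a branch depends on the compiler.

```c
/* Branching version: the "if" guards a single add and its outcome
   depends directly on the (possibly random) data. */
int sum_positive_branchy(const int *v, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        if (v[i] > 0)
            sum += v[i];
    }
    return sum;
}

/* Branch-free version: (v[i] > 0) is 0 or 1, so negating it yields
   an all-zeros or all-ones mask (two's complement assumed). */
int sum_positive_branchfree(const int *v, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        int mask = -(v[i] > 0);
        sum += v[i] & mask;     /* adds v[i] or 0, no branch needed */
    }
    return sum;
}
```

Both functions return the same result; only the control flow inside
the loop differs.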
Avoid Placing Code and Data in the Same 64-Byte Cache Line
✩
TOP
Consider that the AMD Athlon processor cache line is twice the
size of previous processors. Code and data should not be shared
in the same 64-byte cache line, especially if the data ever
becomes modified. In order to maintain cache coherency, the
AMD Athlon processor may thrash its caches, resulting in lower
performance.
In general the following should be avoided:
■ Self-modifying code
■ Storing data in code segments
See “Avoid Placing Code and Data in the Same 64-Byte Cache
Line” on page 50 for more details.
3
C Source Level Optimizations
This chapter details C programming practices for optimizing
code for the AMD Athlon™ processor. Guidelines are listed in
order of importance.
Ensure Floating-Point Variables and Expressions are of
Type Float
For compilers that generate 3DNow!™ instructions, make sure
that all floating-point variables and expressions are of type
float. Pay special attention to floating-point constants. These
require a suffix of “F” or “f” (for example, 3.14f) in order to be
of type float, otherwise they default to type double. To avoid
automatic promotion of float arguments to double, always use
function prototypes for all functions that accept float
arguments.
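A minimal sketch of these points, with hypothetical function names
(not taken from the original guide): a prototype is in scope before
any call, and the constant carries the "f" suffix so the whole
expression stays in type float.

```c
/* Prototype in scope before any call, so a float argument is passed
   as float rather than being promoted to double. */
float scale_by_half(float x);

float scale_by_half(float x)
{
    return x * 0.5f;            /* 0.5f is float: expression is float */
}

/* For contrast: without the suffix the constant is a double, so x is
   converted to double and the product is computed as a double. */
float scale_by_half_promoted(float x)
{
    return (float)(x * 0.5);
}
```

The type of each expression can be checked with sizeof: x * 0.5f has
the size of a float, while x * 0.5 has the size of a double.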
Use 32-Bit Data Types for Integer Code
Use 32-bit data types for integer code. Compiler
implementations vary, but typically the following data types are
included: int, signed, signed int, unsigned, unsigned int, long,
signed long, long int, signed long int, unsigned long, and unsigned
long int.
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines
whether a signed or an unsigned integer type is appropriate.
For example, to record the weight of a person in pounds, no
negative numbers are required so an unsigned type is
appropriate. However, recording temperatures in degrees
Celsius may require both positive and negative numbers so a
signed type is needed.
Where there is a choice of using either a signed or an unsigned
type, it should be considered that certain operations are faster
with unsigned types while others are faster for signed types.
Integer-to-floating-point conversion using integers larger than
16-bit is faster with signed types, as the x86 FPU provides
instructions for converting signed integers to floating-point, but
has no instructions for converting unsigned integers. In a
typical case, a 32-bit integer is converted as follows:
Example 1 (Avoid):
double x;             ====>  MOV  [temp+4], 0
unsigned int i;              MOV  EAX, i
                             MOV  [temp], eax
x = i;                       FILD QWORD PTR [temp]
                             FSTP QWORD PTR [x]
This code is slow not only because of the number of instructions
but also because a size mismatch prevents store-to-load
forwarding to the FILD instruction.
Example (Preferred):
double x;             ====>  FILD DWORD PTR [i]
int i;                       FSTP QWORD PTR [x]

x = i;
Computing quotients and remainders in integer division by
constants is faster when performed on unsigned types. In a
typical case, a 32-bit integer is divided by four as follows:
Example (Avoid):
int i;                ====>  MOV  EAX, i
                             CDQ
i = i / 4;                   AND  EDX, 3
                             ADD  EAX, EDX
                             SAR  EAX, 2
                             MOV  i, EAX
Example (Preferred):
unsigned int i;       ====>  SHR  i, 2

i = i / 4;
In summary:
Use unsigned types for:
■ Division and remainders
■ Loop counters
■ Array indexing
Use signed types for:
■ Integer-to-float conversion
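The summary above might be applied at the source level as in the
following sketch. The function names are hypothetical and the
comments describe what a compiler can typically do, not what any
particular compiler is guaranteed to emit.

```c
/* Unsigned operand: division by 4 can become a single logical shift. */
unsigned int quarter(unsigned int count)
{
    return count / 4;
}

/* Unsigned loop counter that is also used as the array index. */
unsigned int sum_bytes(const unsigned char *buf, unsigned int len)
{
    unsigned int k, total = 0;
    for (k = 0; k < len; k++)
        total += buf[k];
    return total;
}

/* Signed operand: the conversion maps directly onto the x87 FILD
   instruction, which only accepts signed integers. */
double to_double(int value)
{
    return value;
}
```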
Use Array Style Instead of Pointer Style Code
The use of pointers in C makes work difficult for the optimizers
in C compilers. Without detailed and aggressive pointer
analysis, the compiler has to assume that writes through a
pointer can write to any place in memory. This includes storage
allocated to other variables, creating the issue of aliasing, i.e.,
the same block of memory is accessible in more than one way.
In order to help the optimizer of the C compiler in its analysis,
avoid the use of pointers where possible. One example where
this is trivially possible is in the access of data organized as
arrays. C allows the use of either the array operator [] or
pointers to access the array. Using array-style code makes the
task of the optimizer easier by reducing possible aliasing.
For example, x[0] and x[2] cannot possibly refer to the same
memory location, while *p and *q could. It is highly
recommended to use the array style, as significant performance
advantages can be achieved with most compilers.
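To make the aliasing problem concrete, consider the following small
sketch (the function scale_pair is hypothetical and not part of the
original guide). Because dst and src are plain pointers, the
compiler must allow for the case where they refer to the same
storage, and so it has to reload src[0] after the first store.

```c
/* dst and src may or may not overlap; the compiler cannot tell. */
void scale_pair(float *dst, const float *src)
{
    dst[0] = src[0] * 2.0f;
    dst[1] = src[0] * 3.0f;   /* src[0] was overwritten if dst == src */
}
```

Called with distinct arrays and src[0] equal to 1, the results are 2
and 3. Called with dst == src, the second product uses the value
just stored, giving 2 and 6. The compiler must generate code that is
correct for both calls, which rules out keeping src[0] in a
register.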
Note that source code transformations will interact with a
compiler’s code generator and that it is difficult to control the
generated machine code from the source level. It is even
possible that source code transformations for improving
performance and compiler optimizations "fight" each other.
Depending on the compiler and the specific source code it is
therefore possible that pointer style code will be compiled into
machine code that is faster than that generated from equivalent
array style code. It is advisable to check the performance after
any source code transformation to see whether performance
indeed increased.
Example 1 (Avoid):
typedef struct {
    float x,y,z,w;
} VERTEX;
typedef struct {
    float m[4][4];
} MATRIX;
void XForm (float *res, const float *v, const float *m, int numverts)
{
    float dp;
    int i;
    const VERTEX* vv = (VERTEX *)v;
    for (i = 0; i < numverts; i++) {
        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;        /* write transformed x */
        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;        /* write transformed y */
        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;        /* write transformed z */
        dp  = vv->x * *m++;
        dp += vv->y * *m++;
        dp += vv->z * *m++;
        dp += vv->w * *m++;
        *res++ = dp;        /* write transformed w */
        ++vv;               /* next input vertex */
        m -= 16;            /* reset to start of transform matrix */
    }
}
Example 2 (Preferred):
typedef struct {
    float x,y,z,w;
} VERTEX;
typedef struct {
    float m[4][4];
} MATRIX;
void XForm (float *res, const float *v, const float *m, int numverts)
{
    int i;
    const VERTEX* vv = (VERTEX *)v;
    const MATRIX* mm = (MATRIX *)m;
    VERTEX* rr = (VERTEX *)res;
    for (i = 0; i < numverts; i++) {
        rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
                vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
        rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
                vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
        rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
                vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
        rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
                vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
        ++rr;               /* next output vertex */
        ++vv;               /* next input vertex */
    }
}
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor’s large, 64-Kbyte
instruction cache and completely unroll small loops. Unrolling
loops can be beneficial to performance, especially if the loop
body is small which makes the loop overhead significant. Many
compilers are not aggressive at unrolling loops. For loops that
have a small fixed loop count and a small loop body, completely
unrolling the loops at the source level is recommended.
Example 1 (Avoid):
// 3D-transform: multiply vector V by 4x4 transform matrix M
for (i=0; i<4; i++) {
r[i] = 0;
for (j=0; j<4; j++) {
r[i] += M[j][i]*V[j];
}
}
Example 2 (Preferred):
// 3D-transform: multiply vector V by 4x4 transform matrix M
r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] +
       M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] +
       M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] +
       M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] +
       M[3][3]*V[3];
Avoid Unnecessary Store-to-Load Dependencies
A store-to-load dependency exists when data is stored to
memory, only to be read back shortly thereafter. See
“Store-to-Load Forwarding Restrictions” on page 51 for more
details. The AMD Athlon processor contains hardware to
accelerate such store-to-load dependencies, allowing the load to
obtain the store data before it has been written to memory.
However, it is still faster to avoid such dependencies altogether
and keep the data in an internal register.
Avoiding store-to-load dependencies is especially important if
they are part of a long dependency chain, as might occur in a
recurrence computation. If the dependency occurs while
operating on arrays, many compilers are unable to optimize the
code in a way that avoids the store-to-load dependency. In some
instances the language definition may prohibit the compiler
from using code transformations that would remove the
store-to-load dependency. It is therefore recommended that the
programmer remove the dependency manually, e.g., by
introducing a temporary variable that can be kept in a register.
This can result in a significant performance increase. The
following is an example of this.
Example 1 (Avoid):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;
for (k = 1; k < VECLEN; k++) {
x[k] = x[k-1] + y[k];
}
for (k = 1; k < VECLEN; k++) {
x[k] = z[k] * (y[k] - x[k-1]);
}
Example 2 (Preferred):
double x[VECLEN], y[VECLEN], z[VECLEN];
unsigned int k;
double t;
t = x[0];
for (k = 1; k < VECLEN; k++) {
t = t + y[k];
x[k] = t;
}
t = x[0];
for (k = 1; k < VECLEN; k++) {
t = z[k] * (y[k] - t);
x[k] = t;
}
Consider Expression Order in Compound Branch
Conditions
Branch conditions in C programs are often compound
conditions consisting of multiple boolean expressions joined by
the boolean operators && and ||. C guarantees a short-circuit
evaluation of these operators. This means that in the case of ||,
the first operand to evaluate to TRUE terminates the
evaluation, i.e., following operands are not evaluated at all.
Similarly for &&, the first operand to evaluate to FALSE
terminates the evaluation. Because of this short-circuit
evaluation, it is not always possible to swap the operands of ||
and &&. This is especially the case when the evaluation of one
of the operands causes a side effect. However, in most cases the
exchange of operands is possible.
When used to control conditional branches, expressions
involving || and && are translated into a series of conditional
branches. The ordering of the conditional branches is a function
of the ordering of the expressions in the compound condition,
and can have a significant impact on performance. It is
unfortunately not possible to give an easy, closed-form formula
on how to order the conditions. Overall performance is a
function of the following factors:
■ probability of a branch mispredict for each of the branches
  generated
■ additional latency incurred due to a branch mispredict
■ cost of evaluating the conditions controlling each of the
  branches generated
■ amount of parallelism that can be extracted in evaluating
  the branch conditions
■ data stream consumed by an application (mostly due to the
  dependence of mispredict probabilities on the nature of the
  incoming data in data dependent branches)
It is therefore recommended to experiment with the ordering of
expressions in compound branch conditions in the most active
areas of a program (so called hot spots) where most of the
execution time is spent. Such hot spots can be found through
the use of profiling. A "typical" data stream should be fed to
the program while doing the experiments.
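The effect of operand order under short-circuit evaluation can be
observed with a small experiment such as the one below. This is only
a sketch: the predicates and the call counters are hypothetical
instrumentation, not something the original guide prescribes.

```c
static unsigned int cheap_calls, costly_calls;   /* evaluation counters */

/* Stand-in for an inexpensive condition. */
static int cheap_test(int x)  { cheap_calls++;  return x > 0; }

/* Stand-in for an expensive condition. */
static int costly_test(int x) { costly_calls++; return x % 7 == 0; }

/* Same logical result; the order decides which operand can be skipped. */
int accept_cheap_first(int x)  { return cheap_test(x) && costly_test(x); }
int accept_costly_first(int x) { return costly_test(x) && cheap_test(x); }
```

For a negative multiple of 7, accept_cheap_first stops after the
cheap test, while accept_costly_first evaluates both operands. Which
order wins in a real program depends on the factors listed above, so
the counters should be collected on a representative data stream.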
Switch Statement Usage
Optimize Switch Statements
Switch statements are translated using a variety of algorithms.
The most common of these are jump tables and comparison
chains/trees. It is recommended to sort the cases of a switch
statement according to the probability of occurrences, with the
most probable first. This will improve performance when the
switch is translated as a comparison chain. It is further
recommended to make the case labels small, contiguous
integers, as this will allow the switch to be translated as a jump
table.
Example 1 (Avoid):
int days_in_month, short_months, normal_months, long_months;
switch (days_in_month) {
case 28:
case 29: short_months++; break;
case 30: normal_months++; break;
case 31: long_months++; break;
default: printf ("month has fewer than 28 or more than 31 days\n");
}
Example 2 (Preferred):
int days_in_month, short_months, normal_months, long_months;
switch (days_in_month) {
case 31: long_months++; break;
case 30: normal_months++; break;
case 28:
case 29: short_months++; break;
default: printf ("month has fewer than 28 or more than 31 days\n");
}
Use Prototypes for All Functions
In general, use prototypes for all functions. Prototypes can
convey additional information to the compiler that might
enable more aggressive optimizations.
Use Const Type Qualifier
Use the “const” type qualifier as much as possible. This
optimization makes code more robust and may enable higher
performance code to be generated due to the additional
information available to the compiler. For example, the C
standard allows compilers to not allocate storage for objects
that are declared “const”, if their address is never taken.
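A short sketch of both uses of the qualifier follows. The names
SCALE and scaled_sum are hypothetical, and whether storage for SCALE
is actually omitted depends on the compiler.

```c
/* A const object whose address is never taken: the compiler is free
   to fold the value into the code instead of allocating storage. */
static const int SCALE = 3;

/* "const int *" documents that this function only reads through
   data, which also lets the compiler reject accidental writes. */
int scaled_sum(const int *data, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += data[i] * SCALE;
    return sum;
}
```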
Generic Loop Hoisting
To improve the performance of inner loops, it is beneficial to
reduce redundant constant calculations (i.e., loop invariant
calculations). However, this idea can be extended to invariant
control structures.
The first case is that of a constant “if()” statement in a “for()”
loop.
Example 1:
for( i ... ) {
    if( CONSTANT0 ) {
        DoWork0( i );    // does not affect CONSTANT0
    } else {
        DoWork1( i );    // does not affect CONSTANT0
    }
}
The above loop should be transformed into:
if( CONSTANT0 ) {
for( i ... ) {
DoWork0( i );
}
} else {
for( i ... ) {
DoWork1( i );
}
}
This makes the inner loops tighter by avoiding repeated
evaluation of a known “if()” control structure. Although the
branch would be easily predicted, hoisting it saves the extra
instructions and the decode limitations imposed by branching,
which is usually well worth the effort.
Generalization for Multiple Constant Control Code
To generalize this further for multiple constant control code
some more work may have to be done to create the proper outer
loop. Enumeration of the constant cases will reduce this to a
simple switch statement.
Example 2:
for(i ... ) {
    if( CONSTANT0 ) {
        DoWork0( i );    //does not affect CONSTANT0
                         // or CONSTANT1
    } else {
        DoWork1( i );    //does not affect CONSTANT0
                         // or CONSTANT1
    }
    if( CONSTANT1 ) {
        DoWork2( i );    //does not affect CONSTANT0
                         // or CONSTANT1
    } else {
        DoWork3( i );    //does not affect CONSTANT0
                         // or CONSTANT1
    }
}
The above loop should be transformed into:
#define combine( c1, c2 ) (((c1) << 1) + (c2))
switch( combine( CONSTANT0!=0, CONSTANT1!=0 ) ) {
case combine( 0, 0 ):       // CONSTANT0 false, CONSTANT1 false
    for( i ... ) {
        DoWork1( i );
        DoWork3( i );
    }
    break;
case combine( 1, 0 ):       // CONSTANT0 true, CONSTANT1 false
    for( i ... ) {
        DoWork0( i );
        DoWork3( i );
    }
    break;
case combine( 0, 1 ):       // CONSTANT0 false, CONSTANT1 true
    for( i ... ) {
        DoWork1( i );
        DoWork2( i );
    }
    break;
case combine( 1, 1 ):       // CONSTANT0 true, CONSTANT1 true
    for( i ... ) {
        DoWork0( i );
        DoWork2( i );
    }
    break;
default:
    break;
}
The trick here is that there is some up-front work involved in
generating all the combinations for the switch constant and the
total amount of code has doubled. However, it is also clear that
the inner loops are "if()-free". In ideal cases where the
“DoWork*()” functions are inlined, the successive functions
will have greater overlap leading to greater parallelism than
would be possible in the presence of intervening “if()”
statements.
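One way to gain confidence in such a transformation is to run both
forms on every combination of the constants and compare the results.
The sketch below does this with hypothetical stand-ins: work() plays
the role of DoWork0 through DoWork3 and folds a tag and the loop
index into an order-sensitive accumulator. None of these names come
from the original guide.

```c
static unsigned int acc;   /* order-sensitive record of the work done */

/* Stand-in for DoWork<tag>( i ). */
static void work(unsigned int tag, unsigned int i)
{
    acc = acc * 31u + tag * 100u + i;
}

/* Original form: both conditions are re-tested on every iteration. */
unsigned int run_in_loop(int c0, int c1, unsigned int n)
{
    unsigned int i;
    acc = 0;
    for (i = 0; i < n; i++) {
        if (c0) work(0, i); else work(1, i);
        if (c1) work(2, i); else work(3, i);
    }
    return acc;
}

#define combine(a, b) (((a) << 1) + (b))

/* Hoisted form: the conditions are tested once, outside the loops. */
unsigned int run_hoisted(int c0, int c1, unsigned int n)
{
    unsigned int i;
    acc = 0;
    switch (combine(c0 != 0, c1 != 0)) {
    case combine(1, 1):
        for (i = 0; i < n; i++) { work(0, i); work(2, i); }
        break;
    case combine(1, 0):
        for (i = 0; i < n; i++) { work(0, i); work(3, i); }
        break;
    case combine(0, 1):
        for (i = 0; i < n; i++) { work(1, i); work(2, i); }
        break;
    default:                /* combine(0, 0) */
        for (i = 0; i < n; i++) { work(1, i); work(3, i); }
        break;
    }
    return acc;
}
```

Comparing run_in_loop and run_hoisted for all four (c0, c1) pairs
checks that each case label is paired with the right loop body.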
The same idea can be applied to constant “switch()”
statements, or combinations of “switch()” statements and “if()”
statements inside of “for()” loops. The method for combining
the input constants gets more complicated but will be worth it
for the performance benefit.
However, the number of inner loops can also substantially
increase. If the number of inner loops is prohibitively high, then
only the most common cases need to be dealt with directly, and
the remaining cases can fall back to the old code in a "default:"
clause for the “switch()” statement.
This typically comes up when the programmer is considering
runtime generated code. While runtime generated code can
lead to similar levels of performance improvement, it is much
harder to maintain, and the developer must do their own
optimizations for their code generation without the help of an
available compiler.
Declare Local Functions as Static
Functions that are not used outside the file in which they are
defined should always be declared static, which forces internal
linkage. Otherwise, such functions default to external linkage,
which might inhibit certain optimizations with some
compilers (for example, aggressive inlining).
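A minimal sketch of the guideline, with hypothetical function names:
the helper is used only inside its own file and is therefore
declared static, while the function that other files call keeps the
default external linkage.

```c
/* Used only in this file: "static" gives it internal linkage, so the
   compiler can see every caller and is free to inline it. */
static int square(int x)
{
    return x * x;
}

/* Called from other files: default (external) linkage. */
int sum_of_squares(const int *v, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += square(v[i]);
    return sum;
}
```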
Dynamic Memory Allocation Consideration
Dynamic memory allocation (‘malloc’ in C language) should
always return a pointer that is suitably aligned for the largest
base type (quadword alignment). Where this aligned pointer
cannot be guaranteed, use the technique shown in the following
code to make the pointer quadword aligned, if needed. This
code assumes the pointer can be cast to a long.
Example:
double* p;
double* np;
p = (double *)malloc(sizeof(double)*number_of_doubles+7L);
np = (double *)((((long)(p))+7L) & (-8L));
Then use ‘np’ instead of ‘p’ to access the data. ‘p’ is still needed
in order to deallocate the storage.
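Where the assumption that a pointer fits in a long does not hold,
the same rounding can be sketched with the C99 type uintptr_t from
<stdint.h>, assuming the compiler provides it. The function name
alloc_doubles_aligned8 and its interface are hypothetical and are
not part of the original guide.

```c
#include <stdint.h>   /* uintptr_t (C99, optional but widely available) */
#include <stdlib.h>

/* Returns a quadword-aligned pointer with room for "count" doubles,
   or NULL on failure. *raw receives the pointer that must later be
   passed to free(). */
double *alloc_doubles_aligned8(size_t count, void **raw)
{
    uintptr_t addr;

    *raw = malloc(count * sizeof(double) + 7u);   /* 7 bytes of slack */
    if (*raw == NULL)
        return NULL;

    /* round the address up to the next multiple of 8 */
    addr = ((uintptr_t)*raw + 7u) & ~(uintptr_t)7u;
    return (double *)addr;
}
```

As in the example above, the aligned pointer is used to access the
data and the raw pointer is kept only for deallocation.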
Introduce Explicit Parallelism into Code
Where possible, long dependency chains should be broken into
several independent dependency chains which can then be
executed in parallel exploiting the pipeline execution units.
This is especially important for floating-point code, whether it
is mapped to x87 or 3DNow! instructions because of the longer
latency of floating-point operations. Since most languages,
including ANSI C, guarantee that floating-point expressions are
not re-ordered, compilers cannot usually perform such
optimizations unless they offer a switch to allow ANSI
non-compliant reordering of floating-point expressions according
to algebraic rules.
Note that re-ordered code that is algebraically identical to the
original code does not necessarily deliver identical
computational results due to the lack of associativity of
floating-point operations. There are well-known numerical
considerations in applying these optimizations (consult a book
on numerical analysis). In some cases, these optimizations may
lead to unexpected results. Fortunately, in the vast majority of
cases, the final result will differ only in the least significant
bits.
Example 1 (Avoid):
double a[100],sum;
int i;
sum = 0.0;
for (i=0; i<100; i++) {
sum += a[i];
}
Example 2 (Preferred):
double a[100],sum1,sum2,sum3,sum4,sum;
int i;
sum1 = 0.0;
sum2 = 0.0;
sum3 = 0.0;
sum4 = 0.0;
for (i=0; i<100; i+=4) {
sum1 += a[i];
sum2 += a[i+1];
sum3 += a[i+2];
sum4 += a[i+3];
}
sum = (sum4+sum3)+(sum1+sum2);
Notice that the 4-way unrolling was chosen to exploit the 4-stage
fully pipelined floating-point adder. Each stage of the
floating-point adder is occupied on every clock cycle, ensuring
maximal sustained utilization.
Explicitly Extract Common Subexpressions
In certain situations, C compilers are unable to extract common
subexpressions from floating-point expressions due to the
guarantee against reordering of such expressions in the ANSI
standard. Specifically, the compiler cannot re-arrange the
computation according to algebraic equivalencies before
extracting common subexpressions. In such cases, the
programmer should manually extract the common
subexpression. It should be noted that re-arranging the
expression may result in different computational results due to
the lack of associativity of floating-point operations, but the
results usually differ in only the least significant bits.
Example 1

Avoid:
   double a,b,c,d,e,f;

   e = b*c/d;
   f = b/d*a;

Preferred:
   double a,b,c,d,e,f,t;

   t = b/d;
   e = c*t;
   f = a*t;

Example 2

Avoid:
   double a,b,c,e,f;

   e = a/c;
   f = b/c;

Preferred:
   double a,b,c,e,f,t;

   t = 1/c;
   e = a*t;
   f = b*t;
C Language Structure Component Considerations
Many compilers have options that allow padding of structures
to make their size multiples of words, doublewords, or
quadwords, in order to achieve better alignment for structures.
In addition, to improve the alignment of structure members,
some compilers might allocate structure elements in an order
that differs from the order in which they are declared. However,
some compilers might not offer any of these features, or their
implementation might not work properly in all situations.
Therefore, to achieve the best alignment of structures and
structure members while minimizing the amount of padding
regardless of compiler optimizations, the following methods are
suggested.
Sort by Base Type Size
Sort structure members according to their base type size,
declaring members with a larger base type size ahead of
members with a smaller base type size.

Pad by Multiple of Largest Base Type Size
Pad the structure to a multiple of the largest base type size of
any member. In this fashion, if the first member of a structure is
naturally aligned, all other members are naturally aligned as
well. The padding of the structure to a multiple of the largest
base type size allows, for example, arrays of structures to be
perfectly aligned.
The following example demonstrates the reordering of
structure member declarations:
Original ordering (Avoid):
   struct {
      char   a[5];
      long   k;
      double x;
   } baz;

New ordering, with padding (Preferred):
   struct {
      double x;
      long   k;
      char   a[5];
      char   pad[7];
   } baz;
See “C Language Structure Component Considerations” on
page 55 for a different perspective.
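The effect of the reordering can be checked with offsetof and sizeof. The following is a minimal sketch: the struct tag baz_sorted and the helper baz_layout_ok are hypothetical names, and int32_t is used in place of long as an assumption so that the layout is the same on ABIs where long is wider than 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Members sorted by decreasing base type size, with explicit
   padding. int32_t stands in for a 32-bit long (an assumption). */
struct baz_sorted {
    double  x;       /* 8 bytes, offset 0  */
    int32_t k;       /* 4 bytes, offset 8  */
    char    a[5];    /* 5 bytes, offset 12 */
    char    pad[7];  /* pads the total size to 24 bytes */
};

/* Returns 1 if the layout matches the offsets listed above and
   the size is a multiple of the largest base type size. */
int baz_layout_ok(void)
{
    return offsetof(struct baz_sorted, x) == 0
        && offsetof(struct baz_sorted, k) == 8
        && offsetof(struct baz_sorted, a) == 12
        && offsetof(struct baz_sorted, pad) == 17
        && sizeof(struct baz_sorted) % sizeof(double) == 0;
}
```

Because the padding is explicit, every element of an array of such structures starts at a multiple of 8 bytes from the start of the array.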
Sort Local Variables According to Base Type Size
When a compiler allocates local variables in the same order in
which they are declared in the source code, it can be helpful to
declare local variables in such a manner that variables with a
larger base type size are declared ahead of the variables with
smaller base type size. Then, if the first variable is allocated so
that it is naturally aligned, all other variables are allocated
contiguously in the order they are declared, and are naturally
aligned without any padding.
Some compilers do not allocate variables in the order they are
declared. In these cases, the compiler should automatically
allocate variables in such a manner as to make them naturally
aligned with the minimum amount of padding. In addition,
some compilers do not guarantee that the stack is aligned
suitably for the largest base type (that is, they do not guarantee
quadword alignment), so that quadword operands might be
misaligned, even if this technique is used and the compiler does
allocate variables in the order they are declared.
The following example demonstrates the reordering of local
variable declarations:
Original ordering (Avoid):
   short  ga, gu, gi;
   long   foo, bar;
   double x, y, z[3];
   char   a, b;
   float  baz;

Improved ordering (Preferred):
   double z[3];
   double x, y;
   long   foo, bar;
   float  baz;
   short  ga, gu, gi;
   char   a, b;
See “Sort Variables According to Base Type Size” on page 56 for
more information from a different perspective.
Accelerating Floating-Point Divides and Square Roots
Divides and square roots have a much longer latency than other
floating-point operations, even though the AMD Athlon
processor provides significant acceleration of these two
operations. In some code, these operations occur so often as to
seriously impact performance. In these cases, it is
recommended to port the code to 3DNow! inline assembly or to
use a compiler that can generate 3DNow! code. If code has hot
spots that use single-precision arithmetic only (i.e., all
computation involves data of type float) and for some reason
cannot be ported to 3DNow!, the following technique may be
used to improve performance.
The x87 FPU has a precision-control field as part of the FPU
control word. The precision-control setting determines what
precision results get rounded to. It affects the basic arithmetic
operations, including divides and square roots. AMD Athlon
and AMD-K6® family processors implement divide and square
root in such fashion as to only compute the number of bits
necessary for the currently selected precision. This means that
setting precision control to single precision (versus the Win32
default of double precision) lowers the latency of those
operations.
The Microsoft® Visual C environment provides functions to
manipulate the FPU control word and thus the precision
control. Note that these functions are not very fast, so changes
of precision control should be inserted where they create little
overhead, such as outside a computation-intensive loop.
Otherwise the overhead created by the function calls outweighs
the benefit from reducing the latencies of divide and square
root operations.
The following example shows how to set the precision control to
single precision and later restore the original settings in the
Microsoft Visual C environment.
Example:
   /* prototype for _controlfp() function */
   #include <float.h>

   unsigned int orig_cw;

   /* Get current FPU control word and save it */
   orig_cw = _controlfp (0,0);

   /* Set precision control in FPU control word to single
      precision. This reduces the latency of divide and square
      root operations.
   */
   _controlfp (_PC_24, _MCW_PC);

   /* restore original FPU control word */
   _controlfp (orig_cw, 0xfffff);
Avoid Unnecessary Integer Division
Integer division is the slowest of all integer arithmetic
operations and should be avoided wherever possible. One
way to reduce the number of integer divisions applies to
multiple (chained) divisions, in which one division can be
replaced with a multiplication as shown in the following
examples. This replacement is possible only if no overflow
occurs during the computation of the product. This can be
determined by considering the possible ranges of the divisors.
Example 1 (Avoid):
   int i,j,k,m;

   m = i / j / k;

Example 2 (Preferred):
   int i,j,k,m;

   m = i / (j * k);
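When the ranges of the divisors are not known in advance, the overflow condition can be checked explicitly. The following is a minimal sketch under stated assumptions: the function name div_by_product is hypothetical, unsigned operands are used for simplicity, both divisors are assumed to be nonzero, and unsigned long long is assumed to be wider than unsigned int. The run-time check adds its own cost, so a static argument about the divisor ranges is preferable where one is available.

```c
#include <limits.h>

/* Sketch: computes i / j / k with a single division when the
   product j * k is representable, and falls back to two
   divisions otherwise. div_by_product is a hypothetical name;
   j and k are assumed to be nonzero. */
unsigned int div_by_product(unsigned int i, unsigned int j,
                            unsigned int k)
{
    /* Form the product in a wider type to detect overflow. */
    unsigned long long prod = (unsigned long long)j * k;

    if (prod <= UINT_MAX) {
        return i / (unsigned int)prod;   /* one division */
    }
    return i / j / k;                    /* overflow: two divisions */
}
```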
Copy Frequently De-referenced Pointer Arguments to Local
Variables
Avoid frequently de-referencing pointer arguments inside a
function. Since the compiler has no knowledge of whether
aliasing exists between the pointers, such de-referencing
cannot be optimized away by the compiler. This prevents data
from being kept in registers and significantly increases memory
traffic.
Note that many compilers have an “assume no aliasing”
optimization switch. This allows the compiler to assume that
two different pointers always have disjoint contents, in which
case copying pointer arguments to local variables is not
required. Otherwise, copy the data pointed to by the pointer
arguments to local variables at the start of the function and, if
necessary, copy them back at the end of the function.
Example 1 (Avoid):
   //assumes pointers are different and q!=r
   void isqrt ( unsigned long a,
                unsigned long *q,
                unsigned long *r)
   {
      *q = a;
      if (a > 0)
      {
         while (*q > (*r = a / *q))
         {
            *q = (*q + *r) >> 1;
         }
      }
      *r = a - *q * *q;
   }

Example 2 (Preferred):
   //assumes pointers are different and q!=r
   void isqrt ( unsigned long a,
                unsigned long *q,
                unsigned long *r)
   {
      unsigned long qq, rr;

      qq = a;
      if (a > 0)
      {
         while (qq > (rr = a / qq))
         {
            qq = (qq + rr) >> 1;
         }
      }
      rr = a - qq * qq;
      *q = qq;
      *r = rr;
   }
4
Instruction Decoding Optimizations
This chapter discusses ways to maximize the number of
instructions decoded by the instruction decoders in the
AMD Athlon™ processor. Guidelines are listed in order of
importance.
Overview
The AMD Athlon processor instruction fetcher reads 16-byte
aligned code windows from the instruction cache. The
instruction bytes are then merged into a 24-byte instruction
queue. On each cycle, the in-order front-end engine selects for
decode up to three x86 instructions from the instruction-byte
queue.
All instructions (x86, x87, 3DNow!™, and MMX™) are
classified into two types of decodes: DirectPath and
VectorPath (see “DirectPath Decoder” and “VectorPath
Decoder” on page 133 for more information). DirectPath
instructions are common instructions that are decoded directly
in hardware. VectorPath instructions are more complex
instructions that require the use of a sequence of multiple
operations issued from an on-chip ROM.
Up to three DirectPath instructions can be selected for decode
per cycle. Only one VectorPath instruction can be selected for
decode per cycle. DirectPath instructions and VectorPath
instructions cannot be simultaneously decoded.
Select DirectPath Over VectorPath Instructions
✩
TOP
Use DirectPath instructions rather than VectorPath
instructions. DirectPath instructions are optimized for decode
and execute efficiently by minimizing the number of operations
per x86 instruction, which includes ‘register ← register op
memory’ as well as ‘register ← register op register’ forms of
instructions. Up to three DirectPath instructions can be
decoded per cycle. VectorPath instructions will block the
decoding of DirectPath instructions.
The vast majority of instructions used by a compiler have
been implemented as DirectPath instructions in the
AMD Athlon processor. Assembly writers must still take into
consideration the usage of DirectPath versus VectorPath
instructions.
See Appendix F, “Instruction Dispatch and Execution
Resources” on page 187 and Appendix G, “DirectPath versus
VectorPath Instructions” on page 219 for tables of DirectPath
and VectorPath instructions.
Load-Execute Instruction Usage
Use Load-Execute Integer Instructions
✩
TOP
Most load-execute integer instructions are DirectPath
decodable and can be decoded at the rate of three per cycle.
Splitting a load-execute integer instruction into two separate
instructions—a load instruction and a “reg, reg” instruction—
reduces decoding bandwidth and increases register pressure,
which results in lower performance. The split-instruction form
can be used to avoid scheduler stalls for longer executing
instructions and to explicitly schedule the load and execute
operations.
Use Load-Execute Floating-Point Instructions with Floating-Point
Operands
✩
TOP
When operating on single-precision or double-precision
floating-point data, wherever possible use floating-point
load-execute instructions to increase code density.
Note: This optimization applies only to floating-point instructions
with floating-point operands and not with integer operands,
as described in the next optimization.
This coding style helps in two ways. First, denser code allows
more work to be held in the instruction cache. Second, the
denser code generates fewer internal OPs and, therefore, the
FPU scheduler holds more work, which increases the chances of
extracting parallelism from the code.
Example 1 (Avoid):
   FLD   QWORD PTR [TEST1]
   FLD   QWORD PTR [TEST2]
   FMUL  ST, ST(1)

Example 2 (Preferred):
   FLD   QWORD PTR [TEST1]
   FMUL  QWORD PTR [TEST2]
Avoid Load-Execute Floating-Point Instructions with Integer Operands
✩
TOP
Do not use load-execute floating-point instructions with integer
operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR,
FICOM, and FICOMP. Remember that floating-point
instructions can have integer operands while integer
instructions cannot have floating-point operands.
Floating-point computations involving integer-memory
operands should use separate FILD and arithmetic instructions.
This optimization has the potential to increase decode
bandwidth and OP density in the FPU scheduler. The
floating-point load-execute instructions with integer operands are
VectorPath and generate two OPs in a cycle, while the discrete
equivalent enables a third DirectPath instruction to be decoded
in the same cycle. In some situations this optimization can also
reduce execution time if the FILD can be scheduled several
instructions ahead of the arithmetic instruction in order to
cover the FILD latency.
Example 1 (Avoid):
   FLD    QWORD PTR [foo]
   FIMUL  DWORD PTR [bar]
   FIADD  DWORD PTR [baz]

Example 2 (Preferred):
   FILD   DWORD PTR [bar]
   FILD   DWORD PTR [baz]
   FLD    QWORD PTR [foo]
   FMULP  ST(2), ST
   FADDP  ST(1), ST
Align Branch Targets in Program Hot Spots
In program hot spots (i.e., innermost loops in the absence of
profiling data), place branch targets at or near the beginning of
16-byte aligned code windows. This technique helps to
maximize the number of instructions that are filled into the
instruction-byte queue while preserving I-cache space in
branch-intensive code.
Use Short Instruction Lengths
Assemblers and compilers should generate the tightest code
possible to optimize use of the I-cache and increase average
decode rate. Wherever possible, use instructions with shorter
lengths. Using shorter instructions increases the number of
instructions that can fit into the instruction-byte queue. For
example, use 8-bit displacements as opposed to 32-bit
displacements. In addition, use the single-byte format of simple
integer instructions whenever possible, as opposed to the
2-byte opcode ModR/M format.
Example 1 (Avoid):
   81 C0 78 56 34 12   add eax, 12345678h  ;uses 2-byte opcode
                                           ; form (with ModR/M)
   81 C3 FB FF FF FF   add ebx, -5         ;uses 32-bit
                                           ; immediate
   0F 84 05 00 00 00   jz  $label1         ;uses 2-byte opcode,
                                           ; 32-bit immediate

Example 2 (Preferred):
   05 78 56 34 12      add eax, 12345678h  ;uses single byte
                                           ; opcode form
   83 C3 FB            add ebx, -5         ;uses 8-bit sign
                                           ; extended immediate
   74 05               jz  $label1         ;uses 1-byte opcode,
                                           ; 8-bit immediate
Avoid Partial Register Reads and Writes
In order to handle partial register writes, the AMD Athlon
processor execution core implements a data-merging scheme.
In the execution unit, an instruction writing a partial register
merges the modified portion with the current state of the
remainder of the register. Therefore, the dependency hardware
can potentially force a false dependency on the most recent
instruction that writes to any part of the register.
Example 1 (Avoid):
   MOV  AL, 10   ;inst 1
   MOV  AH, 12   ;inst 2 has a false dependency on
                 ; inst 1
                 ;inst 2 merges new AH with current
                 ; EAX register value forwarded
                 ; by inst 1
In addition, an instruction that has a read dependency on any
part of a given architectural register has a read dependency on
the most recent instruction that modifies any part of the same
architectural register.
Example 2 (Avoid):
   MOV  BX, 12h  ;inst 1
   MOV  BL, DL   ;inst 2, false dependency on
                 ; completion of inst 1
   MOV  BH, CL   ;inst 3, false dependency on
                 ; completion of inst 2
   MOV  AL, BL   ;inst 4, depends on completion of
                 ; inst 2
Replace Certain SHLD Instructions with Alternative Code
Certain instances of the SHLD instruction can be replaced by
alternative code using SHR and LEA. The alternative code has
lower latency and requires fewer execution resources. SHR and
LEA (32-bit version) are DirectPath instructions, while SHLD is
a VectorPath instruction. Using SHR and LEA preserves decode
bandwidth as it potentially enables the decoding of a third
DirectPath instruction.
Example 1
(Avoid):
   SHLD  REG1, REG2, 1
(Preferred):
   SHR   REG2, 31
   LEA   REG1, [REG1*2 + REG2]

Example 2
(Avoid):
   SHLD  REG1, REG2, 2
(Preferred):
   SHR   REG2, 30
   LEA   REG1, [REG1*4 + REG2]

Example 3
(Avoid):
   SHLD  REG1, REG2, 3
(Preferred):
   SHR   REG2, 29
   LEA   REG1, [REG1*8 + REG2]
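That both sequences leave the same value in REG1 can be seen from a small C model of the 32-bit register arithmetic. This is a sketch only: the function names are hypothetical, the registers are modeled as uint32_t values, and the shift count n is assumed to be 1, 2, or 3 as in the examples above. Note that the SHR in the replacement sequence overwrites REG2, so the substitution applies only when the original contents of REG2 are no longer needed.

```c
#include <stdint.h>

/* Models SHLD REG1, REG2, n: shift reg1 left by n and fill the
   vacated low bits with the top n bits of reg2 (n = 1, 2, or 3). */
uint32_t shld_model(uint32_t reg1, uint32_t reg2, unsigned n)
{
    return (reg1 << n) | (reg2 >> (32 - n));
}

/* Models the replacement sequence:
      SHR REG2, 32-n
      LEA REG1, [REG1*(2^n) + REG2]                              */
uint32_t shr_lea_model(uint32_t reg1, uint32_t reg2, unsigned n)
{
    uint32_t shifted = reg2 >> (32 - n);   /* SHR result, below 2^n */
    return reg1 * (1u << n) + shifted;     /* LEA scale and add */
}
```

The addition in the LEA model cannot carry into the upper bits, because the scaled REG1 value has its low n bits clear and the shifted REG2 value occupies only those n bits.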
Use 8-Bit Sign-Extended Immediates
Using 8-bit sign-extended immediates improves code density
with no negative effects on the AMD Athlon processor. For
example, ADD BX, –5 should be encoded “83 C3 FB” and not
“81 C3 FB FF”.
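A code generator can decide between the two encodings with a simple range check. The following is a minimal sketch; the function name fits_in_imm8 is a hypothetical helper, not part of any assembler.

```c
#include <stdint.h>

/* Returns 1 if value can be encoded as an 8-bit sign-extended
   immediate, i.e., if it lies in the range -128..127. */
int fits_in_imm8(int32_t value)
{
    return value >= -128 && value <= 127;
}
```

For the example above, fits_in_imm8(-5) is true, so the short encoding applies.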
Use 8-Bit Sign-Extended Displacements
Use 8-bit sign-extended displacements for conditional
branches. Using short, 8-bit sign-extended displacements for
conditional branches improves code density with no negative
effects on the AMD Athlon processor.
Code Padding Using Neutral Code Fillers
Occasionally a need arises to insert neutral code fillers into the
code stream, e.g., for code alignment purposes or to space out
branches. Since this filler code can be executed, it should take
up as few execution resources as possible, not diminish decode
density, and not modify any processor state other than
advancing EIP. A one-byte padding can easily be achieved using
the NOP instruction (XCHG EAX, EAX; opcode 0x90). In the
x86 architecture, there are several multi-byte "NOP"
instructions available that do not change processor state other
than EIP:
■ MOV REG, REG
■ XCHG REG, REG
■ CMOVcc REG, REG
■ SHR REG, 0
■ SAR REG, 0
■ SHL REG, 0
■ SHRD REG, REG, 0
■ SHLD REG, REG, 0
■ LEA REG, [REG]
■ LEA REG, [REG+00]
■ LEA REG, [REG*1+00]
■ LEA REG, [REG+00000000]
■ LEA REG, [REG*1+00000000]
Not all of these instructions are equally suitable for purposes of
code padding. For example, SHLD/SHRD are microcoded, which
reduces decode bandwidth and takes up execution resources.
Recommendations for the AMD Athlon™ Processor
For code that is optimized specifically for the AMD Athlon
processor, the optimal code fillers are NOP instructions (opcode
0x90) with up to two REP prefixes (0xF3). In the AMD Athlon
processor, a NOP with up to two REP prefixes can be handled
by a single decoder with no overhead. As the REP prefixes are
redundant and meaningless, they get discarded, and NOPs are
handled without using any execution resources. The three
decoders of the AMD Athlon processor can handle up to three
NOPs, each with up to two REP prefixes, in a single cycle,
for a neutral code filler of up to nine bytes.
Note: When used as a filler instruction, REP/REPNE prefixes can
be used in conjunction only with NOPs. REP/REPNE has
undefined behavior when used with instructions other than
a NOP.
If a larger amount of code padding is required, it is
recommended to use a JMP instruction to jump across the
padding region. The following assembly language macros show
this:
NOP1_ATHLON  TEXTEQU <DB 090h>
NOP2_ATHLON  TEXTEQU <DB 0F3h, 090h>
NOP3_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h>
NOP4_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 090h>
NOP5_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 090h>
NOP6_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h>
NOP7_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h, 090h>
NOP8_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h, 0F3h, 090h>
NOP9_ATHLON  TEXTEQU <DB 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h, 0F3h, 0F3h, 090h>
NOP10_ATHLON TEXTEQU <DB 0EBh, 008h, 90h, 90h, 90h, 90h, 90h, 90h, 90h, 90h>
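The byte patterns of the macros above follow a regular rule that a code generator could apply to any padding length. The following C sketch emits the same bytes as the macros for lengths 1 through 10. The function name athlon_fill is hypothetical, and extending the JMP form beyond 10 bytes (up to the reach of an 8-bit displacement) is an assumption made here, since the macros only cover the 10-byte case.

```c
#include <stddef.h>

/* Sketch: writes n bytes of AMD Athlon-specific neutral filler
   into buf. athlon_fill is a hypothetical name.
   Returns 0 on success, or -1 if n is too large for a JMP with
   an 8-bit displacement (n > 129). */
int athlon_fill(unsigned char *buf, size_t n)
{
    size_t i = 0;

    if (n > 129) {
        return -1;            /* JMP rel8 skips at most 127 bytes */
    }
    if (n >= 10) {
        buf[i++] = 0xEB;      /* JMP rel8 */
        buf[i++] = (unsigned char)(n - 2);
        while (i < n) {
            buf[i++] = 0x90;  /* NOPs that are jumped over */
        }
        return 0;
    }
    while (n - i >= 3) {      /* REP REP NOP */
        buf[i++] = 0xF3;
        buf[i++] = 0xF3;
        buf[i++] = 0x90;
    }
    if (n - i == 2) {         /* REP NOP */
        buf[i++] = 0xF3;
        buf[i++] = 0x90;
    } else if (n - i == 1) {  /* plain NOP */
        buf[i++] = 0x90;
    }
    return 0;
}
```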
Recommendations for AMD-K6® Family and AMD Athlon™ Processor
Blended Code
On x86 processors other than the AMD Athlon processor
(including the AMD-K6 family of processors), the REP prefix
and especially multiple prefixes cause decoding overhead, so
the above technique is not recommended for code that has to
run well both on the AMD Athlon processor and other x86
processors (blended code). In such cases the instructions and
instruction sequences below are recommended. For neutral
code fillers longer than eight bytes in length, the JMP
instruction can be used to jump across the padding region.
Note that each of the instructions and instruction sequences
below utilizes an x86 register. To avoid performance
degradation, the register used in the padding should be
selected so as to not lengthen existing dependency chains, i.e.,
one should select a register that is not used by instructions in
the vicinity of the neutral code filler. Note that certain
instructions use registers implicitly. For example, PUSH, POP,
CALL, and RET all make implicit use of the ESP register. The
5-byte filler sequence below consists of two instructions. If flag
changes across the code padding are acceptable, the following
instructions may be used as single-instruction, 5-byte code
fillers:
■ TEST EAX, 0FFFF0000h
■ CMP EAX, 0FFFF0000h
The following assembly language macros show the
recommended neutral code fillers for code optimized for the
AMD Athlon processor that also has to run well on other x86
processors. Note that for some padding lengths, versions using
ESP or EBP are missing due to the lack of fully generalized
addressing modes.
NOP2_EAX TEXTEQU <DB 08Bh,0C0h> ;mov eax, eax
NOP2_EBX TEXTEQU <DB 08Bh,0DBh> ;mov ebx, ebx
NOP2_ECX TEXTEQU <DB 08Bh,0C9h> ;mov ecx, ecx
NOP2_EDX TEXTEQU <DB 08Bh,0D2h> ;mov edx, edx
NOP2_ESI TEXTEQU <DB 08Bh,0F6h> ;mov esi, esi
NOP2_EDI TEXTEQU <DB 08Bh,0FFh> ;mov edi, edi
NOP2_ESP TEXTEQU <DB 08Bh,0E4h> ;mov esp, esp
NOP2_EBP TEXTEQU <DB 08Bh,0EDh> ;mov ebp, ebp

NOP3_EAX TEXTEQU <DB 08Dh,004h,020h> ;lea eax, [eax]
NOP3_EBX TEXTEQU <DB 08Dh,01Ch,023h> ;lea ebx, [ebx]
NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]
NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;lea edx, [edx]
NOP3_ESI TEXTEQU <DB 08Dh,034h,026h> ;lea esi, [esi]
NOP3_EDI TEXTEQU <DB 08Dh,03Ch,027h> ;lea edi, [edi]
NOP3_ESP TEXTEQU <DB 08Dh,024h,024h> ;lea esp, [esp]
NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h> ;lea ebp, [ebp]

NOP4_EAX TEXTEQU <DB 08Dh,044h,020h,000h> ;lea eax, [eax+00]
NOP4_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h> ;lea ebx, [ebx+00]
NOP4_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h> ;lea ecx, [ecx+00]
NOP4_EDX TEXTEQU <DB 08Dh,054h,022h,000h> ;lea edx, [edx+00]
NOP4_ESI TEXTEQU <DB 08Dh,074h,026h,000h> ;lea esi, [esi+00]
NOP4_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h> ;lea edi, [edi+00]
NOP4_ESP TEXTEQU <DB 08Dh,064h,024h,000h> ;lea esp, [esp+00]

;lea eax, [eax+00];nop
NOP5_EAX TEXTEQU <DB 08Dh,044h,020h,000h,090h>
;lea ebx, [ebx+00];nop
NOP5_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h,090h>
;lea ecx, [ecx+00];nop
NOP5_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h,090h>
;lea edx, [edx+00];nop
NOP5_EDX TEXTEQU <DB 08Dh,054h,022h,000h,090h>
;lea esi, [esi+00];nop
NOP5_ESI TEXTEQU <DB 08Dh,074h,026h,000h,090h>
;lea edi, [edi+00];nop
NOP5_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h,090h>
;lea esp, [esp+00];nop
NOP5_ESP TEXTEQU <DB 08Dh,064h,024h,000h,090h>
;lea eax, [eax+00000000]
NOP6_EAX TEXTEQU <DB 08Dh,080h,0,0,0,0>
;lea ebx, [ebx+00000000]
NOP6_EBX TEXTEQU <DB 08Dh,09Bh,0,0,0,0>
;lea ecx, [ecx+00000000]
NOP6_ECX TEXTEQU <DB 08Dh,089h,0,0,0,0>
;lea edx, [edx+00000000]
NOP6_EDX TEXTEQU <DB 08Dh,092h,0,0,0,0>
;lea esi, [esi+00000000]
NOP6_ESI TEXTEQU <DB 08Dh,0B6h,0,0,0,0>
;lea edi ,[edi+00000000]
NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0>
;lea ebp ,[ebp+00000000]
NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0>
;lea eax,[eax*1+00000000]
NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0>
;lea ebx,[ebx*1+00000000]
NOP7_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0>
;lea ecx,[ecx*1+00000000]
NOP7_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0>
;lea edx,[edx*1+00000000]
NOP7_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0>
;lea esi,[esi*1+00000000]
NOP7_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0>
;lea edi,[edi*1+00000000]
NOP7_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0>
;lea ebp,[ebp*1+00000000]
NOP7_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0>
;lea eax,[eax*1+00000000] ;nop
NOP8_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0,90h>
;lea ebx,[ebx*1+00000000] ;nop
NOP8_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0,90h>
;lea ecx,[ecx*1+00000000] ;nop
NOP8_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0,90h>
;lea edx,[edx*1+00000000] ;nop
NOP8_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0,90h>
;lea esi,[esi*1+00000000] ;nop
NOP8_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0,90h>
;lea edi,[edi*1+00000000] ;nop
NOP8_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0,90h>
;lea ebp,[ebp*1+00000000] ;nop
NOP8_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0,90h>
;JMP
NOP9 TEXTEQU <DB 0EBh,007h,90h,90h,90h,90h,90h,90h,90h>
5
Cache and Memory Optimizations
This chapter describes code optimization techniques that take
advantage of the large L1 caches and high-bandwidth buses of
the AMD Athlon™ processor. Guidelines are listed in order of
importance.
Memory Size and Alignment Issues
Avoid Memory Size Mismatches
✩
TOP
Avoid memory size mismatches when instructions operate on
the same data. For instructions that store and reload the same
data, keep operands aligned and keep the loads/stores of each
operand the same size. The following code examples result in a
store-to-load-forwarding (STLF) stall:
Example 1 (Avoid):
   MOV  DWORD PTR [FOO], EAX
   MOV  DWORD PTR [FOO+4], EDX
   FLD  QWORD PTR [FOO]

Avoid large-to-small mismatches, as shown in the following
code:

Example 2 (Avoid):
   FST  QWORD PTR [FOO]
   MOV  EAX, DWORD PTR [FOO]
   MOV  EDX, DWORD PTR [FOO+4]
Align Data Where Possible
✩
TOP
In general, avoid misaligned data references. All data whose
size is a power of 2 is considered aligned if it is naturally
aligned. For example:
■ QWORD accesses are aligned if they access an address
divisible by 8.
■ DWORD accesses are aligned if they access an address
divisible by 4.
■ WORD accesses are aligned if they access an address
divisible by 2.
■ TBYTE accesses are aligned if they access an address
divisible by 8.
A misaligned store or load operation suffers a minimum
one-cycle penalty in the AMD Athlon processor load/store
pipeline. In addition, using misaligned loads and stores
increases the likelihood of encountering a store-to-load
forwarding pitfall. For a more detailed discussion of
store-to-load forwarding issues, see “Store-to-Load Forwarding
Restrictions” on page 51.
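The alignment rules listed above amount to a divisibility test on the address. A minimal C sketch follows; the function name is_aligned is hypothetical, and the alignment argument is assumed to be a nonzero power of 2 (8 for QWORD and TBYTE, 4 for DWORD, 2 for WORD).

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is divisible by alignment. The alignment is
   assumed to be a nonzero power of 2, so the test reduces to
   checking that the low address bits are zero. */
int is_aligned(uintptr_t addr, size_t alignment)
{
    return (addr & (uintptr_t)(alignment - 1)) == 0;
}
```

A debug build could use such a check in an assertion on buffer addresses before entering a hot loop.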
Use the 3DNow!™ PREFETCH and PREFETCHW Instructions
✩
TOP
For code that can take advantage of prefetching, use the
3DNow! PREFETCH and PREFETCHW instructions to
increase the effective bandwidth to the AMD Athlon processor.
The PREFETCH and PREFETCHW instructions take
advantage of the AMD Athlon processor’s high bus bandwidth
to hide long latencies when fetching data from system memory.
The prefetch instructions are essentially integer instructions
and can be used anywhere, in any type of code (integer, x87,
3DNow!, MMX, etc.).
Large data sets typically require unit-stride access to ensure
that all data pulled in by PREFETCH or PREFETCHW is
actually used. If necessary, algorithms or data structures should
be reorganized to allow unit-stride access.
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
The PREFETCHNTA/T0/T1/T2 instructions in the MMX
extensions are processor implementation dependent. To
maintain compatibility with the 25 million AMD-K6®-2 and
AMD-K6-III processors already sold, use the 3DNow!
PREFETCH/W instructions instead of the various prefetch
flavors in the new MMX extensions.
PREFETCHW Usage
Code that intends to modify the cache line brought in through
prefetching should use the PREFETCHW instruction. While
PREFETCHW works the same as a PREFETCH on the
AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a
hint to the AMD Athlon processor of an intent to modify the
cache line. The AMD Athlon processor will mark the cache line
being brought in by PREFETCHW as Modified. Using
PREFETCHW can save an additional 15 to 25 cycles compared to
a PREFETCH and the subsequent cache state change caused by
a write to the prefetched cache line.
Multiple Prefetches
Programmers can initiate multiple outstanding prefetches on
the AMD Athlon processor. While the AMD-K6-2 and
AMD-K6-III processors can have only one outstanding prefetch,
the AMD Athlon processor can have up to six outstanding
prefetches. When all six buffers are filled by various memory
read requests, the processor will simply ignore any new
prefetch requests until a buffer frees up. Multiple prefetch
requests are essentially handled in-order. If data is needed first,
then that data should be prefetched first.
The example below shows how to initiate multiple prefetches
when traversing more than one array.
Example (Multiple Prefetches):
.CODE
.K3D

; original C code
;
; #define LARGE_NUM 65536
;
; double array_a[LARGE_NUM];
; double array_b[LARGE_NUM];
; double array_c[LARGE_NUM];
; int i;
;
; for (i = 0; i < LARGE_NUM; i++) {
;    array_a[i] = array_b[i] * array_c[i];
; }

   MOV   ECX, (-LARGE_NUM)      ;used biased index
   MOV   EAX, OFFSET array_a    ;get address of array_a
   MOV   EDX, OFFSET array_b    ;get address of array_b
   MOV   EBX, OFFSET array_c    ;get address of array_c

$loop:
   PREFETCHW [EAX+196]          ;two cachelines ahead
   PREFETCH  [EDX+196]          ;two cachelines ahead
   PREFETCH  [EBX+196]          ;two cachelines ahead

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE]     ;b[i]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE]     ;b[i]*c[i]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE]     ;a[i] = b[i]*c[i]

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+8]   ;b[i+1]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+8]   ;b[i+1]*c[i+1]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+8]   ;a[i+1] =
                                            ; b[i+1]*c[i+1]

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+16]  ;b[i+2]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+16]  ;b[i+2]*c[i+2]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+16]  ;a[i+2] =
                                            ; b[i+2]*c[i+2]

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+24]  ;b[i+3]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+24]  ;b[i+3]*c[i+3]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+24]  ;a[i+3] =
                                            ; b[i+3]*c[i+3]

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+32]  ;b[i+4]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+32]  ;b[i+4]*c[i+4]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+32]  ;a[i+4] =
                                            ; b[i+4]*c[i+4]

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+40]  ;b[i+5]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+40]  ;b[i+5]*c[i+5]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+40]  ;a[i+5] =
                                            ; b[i+5]*c[i+5]

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+48]  ;b[i+6]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+48]  ;b[i+6]*c[i+6]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+48]  ;a[i+6] =
                                            ; b[i+6]*c[i+6]

   FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+56]  ;b[i+7]
   FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+56]  ;b[i+7]*c[i+7]
   FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+56]  ;a[i+7] =
                                            ; b[i+7]*c[i+7]

   ADD   ECX, 8                 ;next 8 products
   JNZ   $loop                  ;until none left

END
The following optimization rules were applied to this example.
■ Loops should be unrolled to make sure that the data stride
  per loop iteration is equal to the length of a cache line. This
  avoids overlapping PREFETCH instructions and thus makes
  optimal use of the available number of outstanding
  PREFETCHes.
■ Since the array "array_a" is written rather than read,
  PREFETCHW is used instead of PREFETCH to avoid
  overhead for switching cache lines to the correct MESI
  state. The PREFETCH lookahead has been optimized such
  that each loop iteration is working on three cache lines
  while six active PREFETCHes bring in the next six cache
  lines.
■ Index arithmetic has been reduced to a minimum by use of
  complex addressing modes and biasing of the array base
  addresses in order to cut down on loop overhead.
Determining Prefetch Distance
Given the latency of a typical AMD Athlon processor system
and expected processor speeds, the following formula should be
used to determine the prefetch distance in bytes for a single
array:
Prefetch Distance = 200 (DS/C) bytes
■ Round up to the nearest 64-byte cache line.
■ The number 200 is a constant based upon expected
  AMD Athlon processor clock frequencies and typical system
  memory latencies.
■ DS is the data stride in bytes per loop iteration.
■ C is the number of cycles for one loop to execute entirely
  from the L1 cache.
The prefetch distance for multiple arrays is typically even
longer.
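The formula above can be sketched in C as follows. This is a minimal illustration, not code from this guide: the function names prefetch_distance and mul_arrays are hypothetical, the cycle counts passed in are made-up inputs, and __builtin_prefetch is a GCC/Clang builtin (assumed available) standing in for the PREFETCH and PREFETCHW instructions.

```c
#include <stddef.h>

#define CACHE_LINE 64u   /* AMD Athlon cache-line size in bytes */

/* Prefetch Distance = 200 * (DS / C) bytes, rounded up to a cache line.
 * ds_bytes: data stride per loop iteration; cycles: loop time from L1. */
static size_t prefetch_distance(size_t ds_bytes, size_t cycles)
{
    size_t raw = (200u * ds_bytes + cycles - 1u) / cycles;   /* ceiling */
    return (raw + CACHE_LINE - 1u) / CACHE_LINE * CACHE_LINE;
}

/* a[i] = b[i] * c[i], eight doubles (one cache line) per iteration. */
static void mul_arrays(double *a, const double *b, const double *c,
                       size_t n, size_t dist_bytes)
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
#if defined(__GNUC__)
        size_t ahead = dist_bytes / sizeof(double);
        if (i + ahead < n) {                       /* stay inside the arrays */
            __builtin_prefetch(&a[i + ahead], 1);  /* prefetch for write */
            __builtin_prefetch(&b[i + ahead], 0);  /* prefetch for read */
            __builtin_prefetch(&c[i + ahead], 0);  /* prefetch for read */
        }
#else
        (void)dist_bytes;                          /* no prefetch available */
#endif
        for (size_t j = 0; j < 8; j++)
            a[i + j] = b[i + j] * c[i + j];
    }
    for (; i < n; i++)                             /* leftover elements */
        a[i] = b[i] * c[i];
}
```

For example, with a 64-byte stride and an assumed loop time of 20 cycles, prefetch_distance(64, 20) gives 640 bytes, that is, ten cache lines ahead.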
Prefetch at Least 64 Bytes Away from Surrounding Stores
The PREFETCH and PREFETCHW instructions can be
affected by false dependencies on stores. If there is a store to an
address that matches a request, that request (the PREFETCH
or PREFETCHW instruction) may be blocked until the store is
written to the cache. Therefore, code should prefetch data that
is located at least 64 bytes away from any surrounding store’s
data address.
Take Advantage of Write Combining
✩
TOP
Operating system and device driver programmers should take
advantage of the write-combining capabilities of the
AMD Athlon processor. The AMD Athlon processor has a very
aggressive write-combining algorithm, which improves
performance significantly.
See Appendix C, “Implementation of Write Combining” on
page 155 for more details.
Avoid Placing Code and Data in the Same 64-Byte Cache Line
✩
TOP
Sharing code and data in the same 64-byte cache line may cause
the L1 caches to thrash (unnecessary castout of code/data) in
order to maintain coherency between the separate instruction
and data caches. The AMD Athlon processor has a cache-line
size of 64 bytes, which is twice the size of previous processors.
Programmers must be aware that code and data should not be
shared within this larger cache line, especially if the data
becomes modified.
For example, programmers should consider that a memory
indirect JMP instruction may have the data for the jump table
residing in the same 64-byte cache line as the JMP instruction,
which would result in lower performance.
Although this situation is rare, do not place critical code at
the border between 32-byte aligned code segments and data
segments. The code at the start or end of your data segment
should be as rarely executed as possible or simply padded with
garbage.
In general, the following should be avoided:
■ self-modifying code
■ storing data in code segments
Store-to-Load Forwarding Restrictions
Store-to-load forwarding refers to the process of a load reading
(forwarding) data from the store buffer (LS2). There are
instances in the AMD Athlon processor load/store architecture
when either a load operation is not allowed to read needed data
from a store in the store buffer, or a load OP detects a false data
dependency on a store in the store buffer.
In either case, the load cannot complete (load the needed data
into a register) until the store has retired out of the store buffer
and written to the data cache. A store-buffer entry cannot retire
and write to the data cache until every instruction before the
store has completed and retired from the reorder buffer.
The implication of this restriction is that all instructions in the
reorder buffer, up to and including the store, must complete
and retire out of the reorder buffer before the load can
complete. Effectively, the load has a false dependency on every
instruction up to the store.
The following sections describe store-to-load forwarding
examples that are acceptable and those that should be avoided.
Store-to-Load Forwarding Pitfalls—True Dependencies
A load is allowed to read data from the store-buffer entry only if
all of the following conditions are satisfied:
■ The start address of the load matches the start address of
  the store.
■ The load operand size is equal to or smaller than the store
  operand size.
■ Neither the load nor the store is misaligned.
■ The store data is not from a high-byte register (AH, BH, CH,
  or DH).
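The four conditions above can be modeled as a small C predicate. This is only an illustrative sketch: the mem_op type and the can_forward function are hypothetical names, and the model checks nothing beyond the four listed conditions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one load or store. */
typedef struct {
    uint32_t addr;       /* start address of the access */
    uint32_t size;       /* operand size in bytes: 1, 2, 4, or 8 */
    bool     high_byte;  /* store data comes from AH, BH, CH, or DH */
} mem_op;

/* An access is aligned when its address is a multiple of its size. */
static bool is_aligned(const mem_op *op)
{
    return (op->addr % op->size) == 0;
}

/* True when the load may read its data from the store-buffer entry. */
static bool can_forward(const mem_op *store, const mem_op *load)
{
    return load->addr == store->addr       /* same start address */
        && load->size <= store->size       /* load not wider than store */
        && is_aligned(store)               /* neither access misaligned */
        && is_aligned(load)
        && !store->high_byte;              /* not a high-byte store */
}
```

Under this model a word store to 10h followed by a doubleword load from 10h fails the size condition, while an aligned quadword store followed by a doubleword load from the same address passes.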
The following sections describe common-case scenarios to avoid
whereby a load has a true dependency on a LS2-buffered store
but cannot read (forward) data from a store-buffer entry.
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a
narrow-to-wide store-buffer data forwarding restriction:
■ The operand size of the store data is smaller than the
  operand size of the load data.
■ The range of addresses spanned by the store data covers
  some sub-region of the range of addresses spanned by the
  load data.
Avoid the type of code shown in the following two examples.
Example 1 (Avoid):
MOV EAX, 10h
MOV WORD PTR [EAX], BX       ;word store
...
MOV ECX, DWORD PTR [EAX]     ;doubleword load
                             ;cannot forward upper
                             ; byte from store buffer
Example 2 (Avoid):
MOV EAX, 10h
MOV BYTE PTR [EAX + 3], BL   ;byte store
...
MOV ECX, DWORD PTR [EAX]     ;doubleword load
                             ;cannot forward upper byte
                             ; from store buffer
Wide-to-Narrow Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a
wide-to-narrow store-buffer data forwarding restriction:
■ The operand size of the store data is greater than the
  operand size of the load data.
■ The start address of the store data does not match the start
  address of the load.
Example 3 (Avoid):
MOV EAX, 10h
ADD DWORD PTR [EAX], EBX     ;doubleword store
MOV CX, WORD PTR [EAX + 2]   ;word load-cannot forward high
                             ; word from store buffer
Use example 5 instead of example 4.
Example 4 (Avoid):
MOVQ      [foo], MM1     ;store upper and lower half
...
ADD       EAX, [foo]     ;fine
ADD       EDX, [foo+4]   ;uh-oh!
Example 5 (Preferred):
MOVD      [foo], MM1     ;store lower half
PUNPCKHDQ MM1, MM1       ;get upper half into lower half
MOVD      [foo+4], MM1   ;store lower half
...
ADD       EAX, [foo]     ;fine
ADD       EDX, [foo+4]   ;fine
Misaligned Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a misaligned
store-buffer data forwarding restriction:
■ The store or load address is misaligned. For example, a
  quadword store is not aligned to a quadword boundary, a
  doubleword store is not aligned to a doubleword boundary,
  etc.
A common case of misaligned store-data forwarding involves
the passing of misaligned quadword floating-point data on the
doubleword-aligned integer stack. Avoid the type of code shown
in the following example.
Example 6 (Avoid):
MOV  ESP, 24h
FSTP QWORD PTR [ESP]   ;esp=24h
.                      ;store occurs to quadword
.                      ; misaligned address
.
FLD  QWORD PTR [ESP]   ;quadword load cannot forward
                       ; from quadword misaligned
                       ; ‘fstp[esp]’ store OP
High-Byte Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a high-byte
store-buffer data forwarding restriction:
■ The store data is from a high-byte register (AH, BH, CH,
  DH).
Avoid the type of code shown in the following example.
Example 7 (Avoid):
MOV EAX, 10h
MOV [EAX], BH    ;high-byte store
.
MOV DL, [EAX]    ;load cannot forward from
                 ; high-byte store
One Supported Store-to-Load Forwarding Case
There is one case of mismatched store-to-load forwarding that
is supported by the AMD Athlon processor. Forwarding the lower
32 bits of an aligned QWORD write into a DWORD read is
allowed.
Example 8 (Allowed):
MOVQ [AlignedQword], mm0
...
MOV  EAX, [AlignedQword]
Summary of Store-to-Load Forwarding Pitfalls to Avoid
To avoid store-to-load forwarding pitfalls, code should conform
to the following guidelines:
■ Maintain consistent use of operand size across all loads and
  stores. Preferably, use doubleword or quadword operand
  sizes.
■ Avoid misaligned data references.
■ Avoid narrow-to-wide and wide-to-narrow forwarding cases.
■ When using word or byte stores, avoid loading data from
  anywhere in the same doubleword of memory other than the
  identical start addresses of the stores.
Stack Alignment Considerations
Make sure the stack is suitably aligned for the local variable
with the largest base type. Then, using the technique described
in “C Language Structure Component Considerations” on page
55, all variables can be properly aligned with no padding.
Extend to 32 Bits Before Pushing onto Stack
Function arguments smaller than 32 bits should be extended to
32 bits before being pushed onto the stack, which ensures that
the stack is always doubleword aligned on entry to a function.
If a function has no local variables with a base type larger than
doubleword, no further work is necessary. If the function does
have local variables whose base type is larger than a
doubleword, additional code should be inserted to ensure
proper alignment of the stack. For example, the following code
achieves quadword alignment:
Example (Preferred):
Prolog:
PUSH EBP
MOV  EBP, ESP
SUB  ESP, SIZE_OF_LOCALS   ;size of local variables
AND  ESP, -8
;push registers that need to be preserved
Epilog:
;pop registers that needed to be preserved
MOV  ESP, EBP
POP  EBP
RET
With this technique, function arguments can be accessed via
EBP, and local variables can be accessed via ESP. In order to
free EBP for general use, it needs to be saved and restored
between the prolog and the epilog.
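The effect of the AND ESP, -8 step in the prolog can be illustrated with plain integer arithmetic in C. This is a sketch only; align_down is a hypothetical helper name, and align is assumed to be a power of two.

```c
#include <stdint.h>

/* Round addr down to a multiple of align (a power of two).
 * For align == 8 this is the same bit operation as AND ESP, -8,
 * because -8 and ~(8 - 1) have the same two's-complement pattern. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    return addr & ~(align - 1);
}
```

Because the stack grows toward lower addresses, rounding the stack pointer down can only enlarge the area reserved for locals; it never overlaps data already on the stack. For instance, align_down(0x24, 8) yields 0x20.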
Align TBYTE Variables on Quadword Aligned Addresses
Align variables of type TBYTE on quadword aligned addresses.
To keep every element of an array of TBYTE variables aligned,
space the array elements 16 bytes apart. In general, TBYTE variables
should be avoided. Use double-precision variables instead.
C Language Structure Component Considerations
Structures (‘struct’ in C language) should be made the size of a
multiple of the largest base type of any of their components. To
meet this requirement, padding should be used where
necessary.
Language definitions permitting, to minimize padding,
structure components should be sorted and allocated such that
the components with a larger base type are allocated ahead of
those with a smaller base type. For example, consider the
following code:
Example:
struct
{
    char   a[5];
    long   k;
    double x;
} baz;
The structure components should be allocated (lowest to
highest address) as follows:
x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0
See “C Language Structure Component Considerations” on
page 27 for more information from a C source code perspective.
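The effect of member ordering on padding can be checked with sizeof and offsetof. The following C sketch uses two hypothetical structures (unsorted and sorted) with the same members; the exact sizes depend on the compiler and ABI, so none are claimed here beyond the relation that the sorted layout needs no more padding than the unsorted one.

```c
#include <stddef.h>
#include <stdint.h>

/* Declaration order mixes small and large members, forcing padding. */
struct unsorted {
    char    c;
    double  d;
    int16_t s;
    int32_t i;
};

/* Same members, largest base type first. */
struct sorted {
    double  d;
    int32_t i;
    int16_t s;
    char    c;
};

/* Total bytes occupied by the members themselves. */
#define MEMBER_BYTES \
    (sizeof(char) + sizeof(double) + sizeof(int16_t) + sizeof(int32_t))

/* Padding bytes the compiler added to each layout. */
static size_t padding_unsorted(void)
{
    return sizeof(struct unsorted) - MEMBER_BYTES;
}

static size_t padding_sorted(void)
{
    return sizeof(struct sorted) - MEMBER_BYTES;
}
```

On ABIs that align double to 8 bytes, the unsorted layout typically occupies 24 bytes and the sorted one 16, but this should be verified with sizeof on the target compiler.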
Sort Variables According to Base Type Size
Sort local variables according to their base type size and
allocate variables with larger base type size ahead of those with
smaller base type size. Assuming the first variable allocated is
naturally aligned, all other variables are naturally aligned
without any padding. The following example is a declaration of
local variables in a C function:
Example:
short  ga, gu, gi;
long   foo, bar;
double x, y, z[3];
char   a, b;
float  baz;
Allocate in the following order from left to right (from higher to
lower addresses):
x, y, z[2], z[1], z[0], foo, bar, baz, ga, gu, gi, a, b;
See “Sort Local Variables According to Base Type Size” on page
28 for more information from a C source code perspective.
6
Branch Optimizations
While the AMD Athlon™ processor contains a very
sophisticated branch unit, certain optimizations increase the
effectiveness of the branch prediction unit. This chapter
discusses rules that improve branch prediction and minimize
branch penalties. Guidelines are listed in order of importance.
Avoid Branches Dependent on Random Data
✩
TOP
Avoid conditional branches depending on random data, as these
are difficult to predict. For example, a piece of code receives a
random stream of characters “A” through “Z” and branches if
the character is before “M” in the collating sequence.
Data-dependent branches acting upon basically random data
cause the branch prediction logic to mispredict the branch
about 50% of the time.
If possible, design branch-free alternative code sequences,
which results in shorter average execution time. This technique
is especially important if the branch body is small. Examples 1
and 2 illustrate this concept using the CMOV instruction. Note
that the AMD-K6 ® processor does not support the CMOV
instruction. Therefore, blended AMD-K6 and AMD Athlon
processor code should use examples 3 and 4.
AMD Athlon™ Processor Specific Code
Example 1 — Signed integer ABS function (X = labs(X)):
MOV   ECX, [X]    ;load value
MOV   EBX, ECX    ;save value
NEG   ECX         ;-value
CMOVS ECX, EBX    ;if -value is negative, select value
MOV   [X], ECX    ;save labs result
Example 2 — Unsigned integer min function (z = x < y ? x : y):
MOV    EAX, [X]    ;load X value
MOV    EBX, [Y]    ;load Y value
CMP    EAX, EBX    ;EBX<=EAX ? CF=0 : CF=1
CMOVNC EAX, EBX    ;EAX=(EBX<=EAX) ? EBX:EAX
MOV    [Z], EAX    ;save min (X,Y)
Blended AMD-K6® and AMD Athlon™ Processor Code
Example 3 — Signed integer ABS function (X = labs(X)):
MOV ECX, [X]    ;load value
MOV EBX, ECX    ;save value
SAR ECX, 31     ;x < 0 ? 0xffffffff : 0
XOR EBX, ECX    ;x < 0 ? ~x : x
SUB EBX, ECX    ;x < 0 ? (~x)+1 : x
MOV [X], EBX    ;x < 0 ? -x : x
Example 4 — Unsigned integer min function (z = x < y ? x : y):
MOV EAX, [x]    ;load x
MOV EBX, [y]    ;load y
SUB EAX, EBX    ;x < y ? CF : NC ; x - y
SBB ECX, ECX    ;x < y ? 0xffffffff : 0
AND ECX, EAX    ;x < y ? x - y : 0
ADD ECX, EBX    ;x < y ? x - y + y : y
MOV [z], ECX    ;x < y ? x : y
Example 5 — Hexadecimal to ASCII conversion
(y = x < 10 ? x + 0x30 : x + 0x37):
MOV AL, [X]     ;load X value
CMP AL, 10      ;if x is less than 10, set carry flag
SBB AL, 69h     ;0..9 -> 96h..9Fh, Ah..Fh -> A1h..A6h
DAS             ;0..9: subtract 66h, Ah..Fh: subtract 60h
MOV [Y], AL     ;save conversion in y
Example 6 — Increment Ring Buffer Offset:
//C Code
char buf[BUFSIZE];
int a;
if (a < (BUFSIZE-1)) {
    a++;
} else {
    a = 0;
}
;-------------
;Assembly Code
MOV EAX, [a]             ; old offset
CMP EAX, (BUFSIZE-1)     ; a < (BUFSIZE-1) ? CF : NC
INC EAX                  ; a++
SBB EDX, EDX             ; a < (BUFSIZE-1) ? 0xffffffff : 0
AND EAX, EDX             ; a < (BUFSIZE-1) ? a++ : 0
MOV [a], EAX             ; store new offset
Example 7 — Integer Signum Function:
//C Code
int a, s;
if (!a) {
    s = 0;
} else if (a < 0) {
    s = -1;
} else {
    s = 1;
}
;-------------
;Assembly Code
MOV EAX, [a]    ;load a
CDQ             ;t = a < 0 ? 0xffffffff : 0
CMP EDX, EAX    ;a > 0 ? CF : NC
ADC EDX, 0      ;a > 0 ? t+1 : t
MOV [s], EDX    ;signum(x)
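The mask-based techniques used in the assembly examples above can also be written in C. The sketch below is illustrative only: the function names are hypothetical, unsigned arithmetic is used so that the shifts and wraparound are well defined, and whether a given compiler actually emits branch-free machine code for these expressions depends on the compiler and its options.

```c
#include <stdint.h>

/* labs(x) via sign mask, mirroring the SAR/XOR/SUB sequence. */
static int32_t abs_branchless(int32_t x)
{
    uint32_t ux   = (uint32_t)x;
    uint32_t mask = 0u - (ux >> 31);        /* x < 0 ? 0xffffffff : 0 */
    return (int32_t)((ux ^ mask) - mask);   /* x < 0 ? -x : x */
}

/* Unsigned min, mirroring the SUB/SBB/AND/ADD sequence. */
static uint32_t min_branchless(uint32_t x, uint32_t y)
{
    uint32_t mask = 0u - (uint32_t)(x < y); /* x < y ? 0xffffffff : 0 */
    return y + ((x - y) & mask);            /* x < y ? x : y */
}

/* Ring buffer offset increment, mirroring CMP/INC/SBB/AND. */
static uint32_t ring_next(uint32_t a, uint32_t bufsize)
{
    uint32_t mask = 0u - (uint32_t)(a < bufsize - 1u);
    return (a + 1u) & mask;                 /* wraps to 0 at the end */
}

/* Integer signum: -1, 0, or 1. */
static int32_t signum(int32_t a)
{
    return (int32_t)(a > 0) - (int32_t)(a < 0);
}
```

As with the labs() assembly examples, abs_branchless leaves the most negative 32-bit value unchanged, since its magnitude is not representable.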
Always Pair CALL and RETURN
When the 12-entry return address stack gets out of
synchronization, the latency of returns increases. The return
address stack becomes out of sync when:
■ calls and returns do not match
■ the depth of the return stack is exceeded because of too
  many levels of nested function calls
Replace Branches with Computation in 3DNow!™ Code
Branches negatively impact the performance of 3DNow! code.
Branches can operate only on one data item at a time, i.e., they
are inherently scalar and inhibit the SIMD processing that
makes 3DNow! code superior. Also, branches based on 3DNow!
comparisons require data to be passed to the integer units,
which requires either transport through memory, or the use of
“MOVD reg, MMreg” instructions. If the body of the branch is
small, one can achieve higher performance by replacing the
branch with computation. The computation simulates
predicated execution or conditional moves. The principal tools
for this are the following instructions: PCMPGT, PFCMPGT,
PFCMPGE, PFMIN, PFMAX, PAND, PANDN, POR, PXOR.
Muxing Constructs
The most important construct for avoiding branches in
3DNow!™ and MMX™ code is a 2-way muxing construct that is
equivalent to the ternary operator “?:” in C and C++. It is
implemented using the PCMP/PFCMP, PAND, PANDN, and
POR instructions. To maximize performance, it is important to
apply the PAND and PANDN instructions in the proper order.
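The 2-way muxing construct can be modeled on a single 32-bit lane in scalar C. This sketch is for illustration only: mux32 and select_lt are hypothetical names, and the all-ones or all-zeros mask plays the role of the result of a PCMP or PFCMP instruction.

```c
#include <stdint.h>

/* Select bits from if_set where mask is 1 and from if_clear where it
 * is 0: the AND / AND-NOT / OR pattern of PAND, PANDN, and POR. */
static uint32_t mux32(uint32_t mask, uint32_t if_set, uint32_t if_clear)
{
    return (mask & if_set) | (~mask & if_clear);
}

/* r = (x < y) ? b : a for one signed 32-bit lane. */
static uint32_t select_lt(int32_t x, int32_t y, uint32_t a, uint32_t b)
{
    uint32_t mask = 0u - (uint32_t)(y > x);  /* y > x ? 0xffffffff : 0 */
    return mux32(mask, b, a);
}
```

Note that mux32 reads the mask twice without modifying it. The ordering concern discussed for PAND and PANDN arises because the two-operand PANDN instruction overwrites its destination, which a C expression does not model.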
Example 1 (Avoid):
; r = (x < y) ? b : a
;
; in:  mm0 a
;      mm1 b
;      mm2 x
;      mm3 y
; out: mm1 r
PCMPGTD MM3, MM2    ; y > x ? 0xffffffff : 0
MOVQ    MM4, MM3    ; duplicate mask
PANDN   MM3, MM0    ; y > x ? 0 : a
PAND    MM1, MM4    ; y > x ? b : 0
POR     MM1, MM3    ; r = y > x ? b : a
Because the use of PANDN destroys the mask created by PCMP,
the mask needs to be saved, which requires an additional
register. This adds an instruction, lengthens the dependency
chain, and increases register pressure. Therefore 2-way muxing
constructs should be written as follows.
Example 2 (Preferred):
; r = (x < y) ? b : a
;
; in:  mm0 a
;      mm1 b
;      mm2 x
;      mm3 y
; out: mm1 r
PCMPGTD MM3, MM2    ; y > x ? 0xffffffff : 0
PAND    MM1, MM3    ; y > x ? b : 0
PANDN   MM3, MM0    ; y > x ? 0 : a
POR     MM1, MM3    ; r = y > x ? b : a
Sample Code Translated into 3DNow!™ Code
The following examples use scalar code translated into 3DNow!
code. Note that it is not recommended to use 3DNow! SIMD
instructions for scalar code, because the advantage of 3DNow!
instructions lies in their “SIMDness”. These examples are
meant to demonstrate general techniques for translating source
code with branches into branchless 3DNow! code. Scalar source
code was chosen to keep the examples simple. These techniques
work in an identical fashion for vector code.
Each example shows the C code and the resulting 3DNow! code.
Example 1:
C code:
float x,y,z;
if (x < y) {
    z += 1.0;
}
else {
    z -= 1.0;
}
3DNow! code:
;in:  MM0 = x
;     MM1 = y
;     MM2 = z
;out: MM0 = z
MOVQ    MM3, MM0    ;save x
MOVQ    MM4, one    ;1.0
PFCMPGE MM0, MM1    ;x < y ? 0 : 0xffffffff
PSLLD   MM0, 31     ;x < y ? 0 : 0x80000000
PXOR    MM0, MM4    ;x < y ? 1.0 : -1.0
PFADD   MM0, MM2    ;x < y ? z+1.0 : z-1.0
Example 2:
C code:
float x,z;
z = fabs(x);
if (z >= 1) {
    z = 1/z;
}
3DNow! code:
;in:  MM0 = x
;out: MM0 = z
MOVQ     MM5, mabs    ;0x7fffffff
PAND     MM0, MM5     ;z=abs(x)
PFRCP    MM2, MM0     ;1/z approx
MOVQ     MM1, MM0     ;save z
PFRCPIT1 MM0, MM2     ;1/z step
PFRCPIT2 MM0, MM2     ;1/z final
PFMIN    MM0, MM1     ;z = z < 1 ? z : 1/z
Example 3:
C code:
float x,z,r,res;
z = fabs(x);
if (z < 0.575) {
    res = r;
}
else {
    res = PI/2 - 2*r;
}
3DNow! code:
;in:  MM0 = x
;     MM1 = r
;out: MM0 = res
MOVQ    MM7, mabs    ;mask for absolute value
PAND    MM0, MM7     ;z = abs(x)
MOVQ    MM2, bnd     ;0.575
PCMPGTD MM2, MM0     ;z < 0.575 ? 0xffffffff : 0
MOVQ    MM3, pio2    ;pi/2
MOVQ    MM0, MM1     ;save r
PFADD   MM1, MM1     ;2*r
PFSUBR  MM1, MM3     ;pi/2 - 2*r
PAND    MM0, MM2     ;z < 0.575 ? r : 0
PANDN   MM2, MM1     ;z < 0.575 ? 0 : pi/2 - 2*r
POR     MM0, MM2     ;z < 0.575 ? r : pi/2 - 2*r
Example 4:
C code:
#define PI 3.14159265358979323
float x,z,r,res;
/* 0 <= r <= PI/4 */
z = fabs(x);
if (z < 1) {
    res = r;
}
else {
    res = PI/2-r;
}
3DNow! code:
;in:  MM0 = x
;     MM1 = r
;out: MM1 = res
MOVQ    MM5, mabs    ; mask to clear sign bit
MOVQ    MM6, one     ; 1.0
PAND    MM0, MM5     ; z=abs(x)
PCMPGTD MM6, MM0     ; z < 1 ? 0xffffffff : 0
MOVQ    MM4, pio2    ; pi/2
PFSUB   MM4, MM1     ; pi/2-r
PANDN   MM6, MM4     ; z < 1 ? 0 : pi/2-r
PFMAX   MM1, MM6     ; res = z < 1 ? r : pi/2-r
Example 5:
C code:
#define PI 3.14159265358979323
float x,y,xa,ya,r,res;
int xs,df;
xs = x < 0 ? 1 : 0;
xa = fabs(x);
ya = fabs(y);
df = (xa < ya);
if (xs && df) {
    res = PI/2 + r;
}
else if (xs) {
    res = PI - r;
}
else if (df) {
    res = PI/2 - r;
}
else {
    res = r;
}
3DNow! code:
;in:  MM0 = r
;     MM1 = y
;     MM2 = x
;out: MM0 = res
MOVQ    MM7, sgn      ;mask to extract sign bit
MOVQ    MM6, sgn      ;mask to extract sign bit
MOVQ    MM5, mabs     ;mask to clear sign bit
PAND    MM7, MM2      ;xs = sign(x)
PAND    MM1, MM5      ;ya = abs(y)
PAND    MM2, MM5      ;xa = abs(x)
MOVQ    MM6, MM1      ;y
PCMPGTD MM6, MM2      ;df = (xa < ya) ? 0xffffffff : 0
PSLLD   MM6, 31       ;df = bit<31>
MOVQ    MM5, MM7      ;xs
PXOR    MM7, MM6      ;xs^df ? 0x80000000 : 0
MOVQ    MM3, npio2    ;-pi/2
PXOR    MM5, MM3      ;xs ? pi/2 : -pi/2
PSRAD   MM6, 31       ;df ? 0xffffffff : 0
PANDN   MM6, MM5      ;xs ? (df ? 0 : pi/2) : (df ? 0 : -pi/2)
PFSUB   MM6, MM3      ;pr = pi/2 + (xs ? (df ? 0 : pi/2) :
                      ; (df ? 0 : -pi/2))
POR     MM0, MM7      ;ar = xs^df ? -r : r
PFADD   MM0, MM6      ;res = ar + pr
Avoid the Loop Instruction
The LOOP instruction in the AMD Athlon processor requires
eight cycles to execute. Use the preferred code shown below:
Example 1 (Avoid):
LOOP LABEL
Example 2 (Preferred):
DEC  ECX
JNZ  LABEL
Avoid Far Control Transfer Instructions
Avoid using far control transfer instructions. Far control
transfer branches cannot be predicted by the branch target
buffer (BTB).
Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the
return address stack. Convert end-recursive functions to
iterative code. A function is end-recursive when its call to
itself is at the end of the code.
Example 1 (Avoid):
long fac(long a)
{
    if (a==0) {
        return (1);
    } else {
        return (a*fac(a-1));
    }
}
Example 2 (Preferred):
long fac(long a)
{
    long t=1;
    while (a > 0) {
        t *= a;
        a--;
    }
    return (t);
}
7
Scheduling Optimizations
This chapter describes how to code instructions for efficient
scheduling. Guidelines are listed in order of importance.
Schedule Instructions According to their Latency
The AMD Athlon™ processor can execute up to three x86
instructions per cycle, with each x86 instruction possibly having
a different latency. The AMD Athlon processor has flexible
scheduling, but for absolute maximum performance, schedule
instructions, especially FPU and 3DNow!™ instructions,
according to their latency. Dependent instructions will then not
have to wait on instructions with longer latencies.
See Appendix F, “Instruction Dispatch and Execution
Resources” on page 187 for a list of latency numbers.
Unrolling Loops
Complete Loop Unrolling
Make use of the large AMD Athlon processor 64-Kbyte
instruction cache and unroll loops to get more parallelism and
reduce loop overhead, even with branch prediction. Complete
unrolling reduces register pressure by removing the loop
counter. To completely unroll a loop, remove the loop control
and replicate the loop body N times. In addition, completely
unrolling a loop increases scheduling opportunities.
However, unrolling very large code loops can result in
inefficient use of the L1 instruction cache. Loops can be
unrolled completely if all of the following conditions are true:
■ The loop is in a frequently executed piece of code.
■ The loop count is known at compile time.
■ The loop body, once unrolled, is less than 100 instructions,
  which is approximately 400 bytes of code.
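As a small illustration of complete unrolling, the following C sketch shows a four-element dot product in rolled and fully unrolled form. The function names are hypothetical and the example is deliberately tiny; it only shows that the loop counter and loop control disappear once the trip count is known at compile time.

```c
/* Rolled: the trip count (4) is known at compile time. */
static float dot4_rolled(const float a[4], const float b[4])
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Completely unrolled: loop control removed, body replicated 4 times. */
static float dot4_unrolled(const float a[4], const float b[4])
{
    return a[0] * b[0]
         + a[1] * b[1]
         + a[2] * b[2]
         + a[3] * b[3];
}
```

Both versions add the products in the same left-to-right order. Many compilers perform this transformation on their own at higher optimization levels, so the manual form matters mainly for hand-written assembly.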
Partial Loop Unrolling
Partial loop unrolling can increase register pressure, which can
make it inefficient due to the small number of registers in the
x86 architecture. However, in certain situations, partial
unrolling can be efficient due to the performance gains
possible. Partial loop unrolling should be considered if the
following conditions are met:
■ Spare registers are available
■ Loop body is small, so that loop overhead is significant
■ Number of loop iterations is likely > 10
Consider the following piece of C code:
double a[MAX_LENGTH], b[MAX_LENGTH];
for (i=0; i< MAX_LENGTH; i++) {
    a[i] = a[i] + b[i];
}
Without loop unrolling, the code looks like the following:
Without Loop Unrolling:
MOV  ECX, MAX_LENGTH
MOV  EAX, OFFSET A
MOV  EBX, OFFSET B
$add_loop:
FLD  QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD  EAX, 8
ADD  EBX, 8
DEC  ECX
JNZ  $add_loop
The loop consists of seven instructions. The AMD Athlon
processor can decode/retire three instructions per cycle, so it
cannot execute faster than three iterations in seven cycles, or
3/7 floating-point adds per cycle. However, the pipelined
floating-point adder allows one add every cycle. In the following
code, the loop is partially unrolled by a factor of two, which
creates potential endcases that must be handled outside the
loop:
With Partial Loop Unrolling:
MOV  ECX, MAX_LENGTH
MOV  EAX, offset A
MOV  EBX, offset B
SHR  ECX, 1               ;iterations = MAX_LENGTH / 2
JNC  $add_loop            ;even count: no endcase needed
FLD  QWORD PTR [EAX]      ;odd count: handle one element first
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD  EAX, 8
ADD  EBX, 8
$add_loop:
FLD  QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
FLD  QWORD PTR [EAX+8]
FADD QWORD PTR [EBX+8]
FSTP QWORD PTR [EAX+8]
ADD  EAX, 16
ADD  EBX, 16
DEC  ECX
JNZ  $add_loop
Now the loop consists of 10 instructions. Based on the
decode/retire bandwidth of three OPs per cycle, this loop goes
no faster than three iterations in 10 cycles, or 6/10
floating-point adds per cycle, or 1.4 times as fast as the original
loop.
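The throughput bound quoted above follows from simple arithmetic, which the C sketch below reproduces. The function name adds_per_cycle is hypothetical, and the model considers only the decode/retire width, ignoring every other pipeline limit.

```c
/* Upper bound on floating-point adds per cycle when the loop is limited
 * only by decode/retire width:
 *   adds/cycle = adds_per_iter * width / instr_per_iter */
static double adds_per_cycle(int adds_per_iter, int instr_per_iter, int width)
{
    return (double)adds_per_iter * (double)width / (double)instr_per_iter;
}

/* Ratio of the unrolled bound (2 adds, 10 instructions) to the rolled
 * bound (1 add, 7 instructions) at a width of 3 per cycle. */
static double unroll_speedup(void)
{
    return adds_per_cycle(2, 10, 3) / adds_per_cycle(1, 7, 3);
}
```

The rolled loop is bounded by 3/7 adds per cycle and the unrolled loop by 6/10, so the ratio is (6/10) / (3/7) = 1.4.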
Deriving Loop Control For Partially Unrolled Loops
A frequently used loop construct is a counting loop. In a typical
case, the loop count starts at some lower bound lo, increases by
some fixed, positive increment inc for each iteration of the
loop, and may not exceed some upper bound hi. The following
example shows how to partially unroll such a loop by an
unrolling factor of fac, and how to derive the loop control for
the partially unrolled version of the loop.
Example 1 (rolled loop):
for (k = lo; k <= hi; k += inc) {
    x[k] = ...
}
Example 2 (partially unrolled loop):
for (k = lo; k <= (hi - (fac-1)*inc); k += fac*inc) {
    x[k] = ...
    x[k+inc] = ...
    ...
    x[k+(fac-1)*inc] = ...
}
/* handle end cases */
for (k = k; k <= hi; k += inc) {
    x[k] = ...
}
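A concrete, runnable instance of the loop control derived above is sketched below for an unrolling factor of fac = 4. The function names and the loop body (storing k*k) are hypothetical choices made only so the two versions can be compared.

```c
/* Rolled reference loop: x[k] = k*k for k = lo, lo+inc, ..., <= hi. */
static void fill_rolled(int *x, int lo, int hi, int inc)
{
    for (int k = lo; k <= hi; k += inc)
        x[k] = k * k;
}

/* Same loop partially unrolled by fac = 4, with end-case handling. */
static void fill_unrolled(int *x, int lo, int hi, int inc)
{
    int k;

    /* Main loop runs only while all four elements are within hi. */
    for (k = lo; k <= hi - 3 * inc; k += 4 * inc) {
        x[k]           = k * k;
        x[k + inc]     = (k + inc) * (k + inc);
        x[k + 2 * inc] = (k + 2 * inc) * (k + 2 * inc);
        x[k + 3 * inc] = (k + 3 * inc) * (k + 3 * inc);
    }

    /* Handle end cases: at most three remaining iterations. */
    for (; k <= hi; k += inc)
        x[k] = k * k;
}
```

The main loop bound hi - (fac-1)*inc guarantees that the last unrolled statement never touches an index beyond hi, and the second loop picks up wherever the main loop stopped.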
Use Function Inlining
Overview
Make use of the AMD Athlon processor’s large 64-Kbyte
instruction cache by inlining small routines to avoid
procedure-call overhead. Consider the cost of possible
increased register usage, which can increase load/store
instructions for register spilling.
Function inlining has the advantage of eliminating function call
overhead and allowing better register allocation and
instruction scheduling at the site of the function call. The
disadvantage is decreasing code locality, which can increase
execution time due to instruction cache misses. Therefore,
function inlining is an optimization that has to be used
judiciously.
In general, due to its very large instruction cache, the
AMD Athlon processor is less susceptible than other processors
to the negative side effect of function inlining. Function call
overhead on the AMD Athlon processor can be low because
calls and returns are executed at high speed due to the use of
prediction mechanisms. However, there is still overhead due to
passing function arguments through memory, which creates
STLF (store-to-load-forwarding) dependencies. Some compilers
allow for a reduction of this overhead by allowing arguments to
be passed in registers in one of their calling conventions, which
has the drawback of constraining register allocation in the
function and at the site of the function call.
In general, function inlining works best if the compiler can
utilize feedback from a profiler to identify the function call
sites most frequently executed. If such data is not available, a
reasonable heuristic is to concentrate on function calls inside
loops. Functions that are directly recursive should not be
considered candidates for inlining. However, if they are
end-recursive, the compiler should convert them to an iterative
equivalent to avoid potential overflow of the AMD Athlon
processor return prediction mechanism (return stack) during
deep recursion. For best results, a compiler should support
function inlining across multiple source files. In addition, a
compiler should provide inline templates for commonly used
library functions, such as sin(), strcmp(), or memcpy().
Always Inline Functions if Called from One Site
A function should always be inlined if it can be established that
it is called from just one site in the code. For the C language,
determination of this characteristic is made easier if functions
are explicitly declared static unless they require external
linkage. This case occurs quite frequently, as functionality that
could be concentrated in a single large function is split across
multiple small functions for improved maintainability and
readability.
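The role of static in the paragraph above can be illustrated with a short C sketch. The function names are hypothetical. Declaring the helper static gives it internal linkage, so the compiler can see every call site within the translation unit; the inline keyword is only a hint, and the actual inlining decision remains with the compiler.

```c
/* Small helper with internal linkage and a single call site. */
static inline int sum_of_squares(int a, int b)
{
    return a * a + b * b;
}

/* The only caller; a compiler can substitute the helper's body here
 * and drop the out-of-line copy entirely. */
static int dist_squared(int x0, int y0, int x1, int y1)
{
    return sum_of_squares(x1 - x0, y1 - y0);
}
```

Splitting the work this way keeps the source readable while leaving the compiler free to remove the call overhead.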
Always Inline Functions with Fewer than 25 Machine Instructions
In addition, functions that create fewer than 25 machine
instructions once inlined should always be inlined because it is
likely that the function call overhead is close to or more than
the time spent executing the function body. For large functions,
the benefits of reduced function call overhead give
diminishing returns. Therefore, a function that results in the
insertion of more than 500 machine instructions at the call site
should probably not be inlined. Some larger functions might
consist of multiple, relatively short paths that are negatively
affected by function overhead. In such a case, it can be
advantageous to inline larger functions. Profiling information is
the best guide in determining whether to inline such large
functions.
Avoid Address Generation Interlocks
Loads and stores are scheduled by the AMD Athlon processor to
access the data cache in program order. Newer loads and stores
with their addresses calculated can be blocked by older loads
and stores whose addresses are not yet calculated – this is
known as an address generation interlock. Therefore, it is
advantageous to schedule loads and stores that can calculate
their addresses quickly, ahead of loads and stores that require
the resolution of a long dependency chain in order to generate
their addresses. Consider the following code examples.
Example 1 (Avoid):
ADD EBX, ECX                    ;inst 1
MOV EAX, DWORD PTR [10h]        ;inst 2 (fast address calc.)
MOV ECX, DWORD PTR [EAX+EBX]    ;inst 3 (slow address calc.)
MOV EDX, DWORD PTR [24h]        ;this load is stalled from
                                ; accessing data cache due
                                ; to long latency for
                                ; generating address for
                                ; inst 3
Example 2 (Preferred):
ADD EBX, ECX                    ;inst 1
MOV EAX, DWORD PTR [10h]        ;inst 2
MOV EDX, DWORD PTR [24h]        ;place load above inst 3
                                ; to avoid address
                                ; generation interlock stall
MOV ECX, DWORD PTR [EAX+EBX]    ;inst 3
Use MOVZX and MOVSX
Use the MOVZX and MOVSX instructions to zero-extend and
sign-extend byte-size and word-size operands to doubleword
length. For example, typical code for zero extension creates a
superset dependency when the zero-extended value is used, as
in the following code:
Example 1 (Avoid):
XOR   EAX, EAX
MOV   AL, [MEM]
Example 2 (Preferred):
MOVZX EAX, BYTE PTR [MEM]
Minimize Pointer Arithmetic in Loops
Minimize pointer arithmetic in loops, especially if the loop
body is small. In this case, the pointer arithmetic would cause
significant overhead. Instead, take advantage of the complex
addressing modes to utilize the loop counter to index into
memory arrays. Using complex addressing modes does not have
any negative impact on execution speed, but the reduced
number of instructions preserves decode bandwidth.
Example 1 (Avoid):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < MAXSIZE; i++) {
c[i] = a[i] + b[i];
}
MOV ECX, MAXSIZE        ;initialize loop counter
XOR ESI, ESI            ;initialize offset into array a
XOR EDI, EDI            ;initialize offset into array b
XOR EBX, EBX            ;initialize offset into array c
$add_loop:
MOV EAX, [ESI + a]      ;get element a
MOV EDX, [EDI + b]      ;get element b
ADD EAX, EDX            ;a[i] + b[i]
MOV [EBX + c], EAX      ;write result to c
ADD ESI, 4              ;increment offset into a
ADD EDI, 4              ;increment offset into b
ADD EBX, 4              ;increment offset into c
DEC ECX                 ;decrement loop count
JNZ $add_loop           ;until loop count 0
Example 2 (Preferred):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < MAXSIZE; i++) {
c[i] = a[i] + b[i];
}
MOV ECX, MAXSIZE-1      ;initialize loop counter
$add_loop:
MOV EAX, [ECX*4 + a]    ;get element a
MOV EDX, [ECX*4 + b]    ;get element b
ADD EAX, EDX            ;a[i] + b[i]
MOV [ECX*4 + c], EAX    ;write result to c
DEC ECX                 ;decrement index
JNS $add_loop           ;until index negative
Note that the code in example 2 traverses the arrays in a
downward direction (i.e., from higher addresses to lower
addresses), whereas the original code in example 1 traverses
the arrays in an upward direction. Such a change in the
direction of the traversal is possible if each loop iteration is
completely independent of all other loop iterations, as is the
case here.
In code where the direction of the array traversal can’t be
switched, it is still possible to minimize pointer arithmetic by
appropriately biasing base addresses and using an index
variable that starts with a negative value and reaches zero when
the loop expires. Note that if the base addresses are held in
registers (e.g., when the base addresses are passed as
arguments of a function) biasing the base addresses requires
additional instructions to perform the biasing at run time and a
small amount of additional overhead is incurred. In the
examples shown here the base addresses are used in the
displacement portion of the address and biasing is
accomplished at compile time by simply modifying the
displacement.
Example 3 (Preferred):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < MAXSIZE; i++) {
c[i] = a[i] + b[i];
}
MOV ECX, (-MAXSIZE)                 ;initialize index
$add_loop:
MOV EAX, [ECX*4 + a + MAXSIZE*4]    ;get a element
MOV EDX, [ECX*4 + b + MAXSIZE*4]    ;get b element
ADD EAX, EDX                        ;a[i] + b[i]
MOV [ECX*4 + c + MAXSIZE*4], EAX    ;write result to c
INC ECX                             ;increment index
JNZ $add_loop                       ;until index==0
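The biasing technique can also be expressed at the source level. The following is a minimal sketch in C, assuming the base addresses arrive as function arguments (the case in which the biasing costs a few run-time instructions). The function name add_arrays_biased and its parameters are hypothetical and do not appear elsewhere in this guide; whether a given compiler turns this form into the scaled-index addressing shown in example 3 is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n),
   traversing the arrays in the upward direction. */
void add_arrays_biased(const int *a, const int *b, int *c, ptrdiff_t n)
{
    /* Bias each base address to one element past the end of its array. */
    const int *a_end = a + n;
    const int *b_end = b + n;
    int *c_end = c + n;
    ptrdiff_t i;

    /* The index starts at -n and reaches zero when the loop expires,
       so the increment itself produces the loop-exit condition. */
    for (i = -n; i != 0; i++) {
        c_end[i] = a_end[i] + b_end[i];
    }
}
```

Only one variable (the index) changes per iteration, which is the property the assembly examples rely on.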
Push Memory Data Carefully
Carefully choose the best method for pushing memory data. To
reduce register pressure and code dependencies, follow
example 2 below.
Example 1 (Avoid):
MOV  EAX, [MEM]
PUSH EAX
Example 2 (Preferred):
PUSH [MEM]
8
Integer Optimizations
This chapter describes ways to improve integer performance
through optimized programming techniques. The guidelines are
listed in order of importance.
Replace Divides with Multiplies
Replace integer division by constants with multiplication by
the reciprocal. Because the AMD Athlon™ processor has a very
fast integer multiply (5–9 cycles signed, 4–8 cycles unsigned)
and the integer division delivers only one bit of quotient per
cycle (22–47 cycles signed, 17–41 cycles unsigned), the
equivalent code is much faster. The user can follow the
examples in this chapter that illustrate the use of integer
division by constants, or access the executables in the
opt_utilities directory in the AMD documentation CD-ROM
(order# 21860) to find alternative code for dividing by a
constant.
Multiplication by Reciprocal (Division) Utility
The code for the utilities can be found at “Derivation of
Multiplier Used for Integer Division by Constants” on page 93.
All utilities were compiled for the Microsoft Windows ® 95,
Windows 98, and Windows NT® environments. All utilities are
provided ‘as is’ and are not supported by AMD.
Signed Division Utility
In the opt_utilities directory of the AMD documentation
CD-ROM, run sdiv.exe in a DOS window to find the fastest code
for signed division by a constant. The utility displays the code
after the user enters a signed constant divisor. Type “sdiv >
example.out” to output the code to a file.
Unsigned Division Utility
In the opt_utilities directory of the AMD documentation
CD-ROM, run udiv.exe in a DOS window to find the fastest code
for unsigned division by a constant. The utility displays the code
after the user enters an unsigned constant divisor. Type “udiv >
example.out” to output the code to a file.
Unsigned Division by Multiplication of Constant
Algorithm: Divisors 1 <= d < 2^31, Odd d
The following code shows an unsigned division using a constant
value multiplier.
;In:  d = divisor, 1 <= d < 2^31, odd d
;Out: a = algorithm
;     m = multiplier
;     s = shift factor
;algorithm 0
MOV EDX, dividend
MOV EAX, m
MUL EDX
SHR EDX, s        ;EDX=quotient
;algorithm 1
MOV EDX, dividend
MOV EAX, m
MUL EDX
ADD EAX, m
ADC EDX, 0
SHR EDX, s        ;EDX=quotient
Derivation of a, m, s
The derivation for the algorithm (a), multiplier (m), and shift
count (s), is found in the section “Unsigned Derivation for
Algorithm, Multiplier, and Shift Factor” on page 93.
Algorithm: Divisors 2^31 <= d < 2^32
For divisors 2^31 <= d < 2^32, the possible quotient values are
either 0 or 1. This makes it easy to establish the quotient by
simple comparison of the dividend and divisor. In cases where
the dividend needs to be preserved, example 1 below is
recommended.
Example 1:
;In:  EAX = dividend
;Out: EDX = quotient
XOR EDX, EDX    ;0
CMP EAX, d      ;CF = (dividend < divisor) ? 1 : 0
SBB EDX, -1     ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
In cases where the dividend does not need to be preserved, the
division can be accomplished without the use of an additional
register, thus reducing register pressure. This is shown in
example 2 below:
Example 2:
;In:  EDX = dividend
;Out: EAX = quotient
CMP EDX, d      ;CF = (dividend < divisor) ? 1 : 0
MOV EAX, 0      ;0
SBB EAX, -1     ;quotient = 0+1-CF = (dividend < divisor) ? 0 : 1
Simpler Code for Restricted Dividend
Integer division by a constant can be made faster if the range of
the dividend is limited, which removes a shift associated with
most divisors. For example, for a divide by 10 operation, use the
following code if the dividend is less than 40000005h:
MOV EAX, dividend
MOV EDX, 01999999Ah
MUL EDX
MOV quotient, EDX
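The restricted divide-by-10 sequence can be modeled in C to check the stated dividend limit. This is a minimal sketch; the function name div10_restricted is hypothetical, and the 64-bit product stands in for the EDX:EAX result of MUL.

```c
#include <stdint.h>

/* Quotient n / 10 taken from the high half of a 32x32 -> 64-bit
   product, mirroring MUL followed by a read of EDX.
   Valid only for n < 0x40000005, the restriction given in the text. */
uint32_t div10_restricted(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x1999999Au) >> 32);
}
```

The multiplier 01999999Ah is 2^32/10 rounded up, so the product slightly overestimates n/10; the overestimate first reaches a wrong quotient at 40000005h, which is why no shift or correction step is needed below that limit.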
Signed Division by Multiplication of Constant
Algorithm: Divisors 2 <= d < 2^31
These algorithms work if the divisor is positive. If the divisor is
negative, use abs(d) instead of d, and append a ‘NEG EDX’ to
the code. The code makes use of the fact that n/–d = –(n/d).
;IN:  d = divisor, 2 <= d < 2^31
;OUT: a = algorithm
;     m = multiplier
;     s = shift count
;algorithm 0
MOV  EAX, m
MOV  EDX, dividend
MOV  ECX, EDX
IMUL EDX
SHR  ECX, 31
SAR  EDX, s
ADD  EDX, ECX     ;quotient in EDX
;algorithm 1
MOV  EAX, m
MOV  EDX, dividend
MOV  ECX, EDX
IMUL EDX
ADD  EDX, ECX
SHR  ECX, 31
SAR  EDX, s
ADD  EDX, ECX     ;quotient in EDX
Derivation for a, m, s
The derivation for the algorithm (a), multiplier (m), and shift
count (s), is found in the section “Signed Derivation for
Algorithm, Multiplier, and Shift Factor” on page 95.
Signed Division By 2
;IN:  EAX = dividend
;OUT: EAX = quotient
CMP EAX, 80000000h    ;CY = 1, if dividend >= 0
SBB EAX, -1           ;Increment dividend if it is < 0
SAR EAX, 1            ;Perform a right shift
Signed Division By 2^n
;IN:  EAX = dividend
;OUT: EAX = quotient
CDQ                   ;Sign extend into EDX
AND EDX, (2^n-1)      ;Mask correction (use divisor - 1)
ADD EAX, EDX          ;Apply correction if necessary
SAR EAX, (n)          ;Perform right shift by
                      ; log2 (divisor)
Signed Division By -2
;IN:  EAX = dividend
;OUT: EAX = quotient
CMP EAX, 80000000h    ;CY = 1, if dividend >= 0
SBB EAX, -1           ;Increment dividend if it is < 0
SAR EAX, 1            ;Perform right shift
NEG EAX               ;Use (x/-2) == -(x/2)
Signed Division By -(2^n)
;IN:  EAX = dividend
;OUT: EAX = quotient
CDQ                   ;Sign extend into EDX
AND EDX, (2^n-1)      ;Mask correction (-divisor - 1)
ADD EAX, EDX          ;Apply correction if necessary
SAR EAX, (n)          ;Right shift by log2(-divisor)
NEG EAX               ;Use (x/-(2^n)) == (-(x/2^n))
Remainder of Signed Integer 2 or -2
;IN:  EAX = dividend
;OUT: EAX = remainder
CDQ                   ;Sign extend into EDX
AND EAX, 1            ;Compute remainder
XOR EAX, EDX          ;Negate remainder if
SUB EAX, EDX          ; dividend was < 0
MOV [remainder], EAX
Remainder of Signed Integer 2^n or -(2^n)
;IN:  EAX = dividend
;OUT: EAX = remainder
CDQ                   ;Sign extend into EDX
AND EDX, (2^n-1)      ;Mask correction (abs(divisor) - 1)
ADD EAX, EDX          ;Apply pre-correction
AND EAX, (2^n-1)      ;Mask out remainder (abs(divisor) - 1)
SUB EAX, EDX          ;Apply pre-correction, if necessary
MOV [remainder], EAX
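The correction step in the power-of-two sequences is easier to follow in C. The sketch below mirrors the CDQ/AND/ADD/SAR and CDQ/AND/ADD/AND/SUB patterns for a divisor of 2^n. The function names sdiv_pow2 and srem_pow2 are hypothetical, n is assumed to be in the range 1 to 30, and the code assumes that right-shifting a negative int32_t is an arithmetic shift, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Truncating signed division by 2^n, 1 <= n <= 30 (assumed range). */
int32_t sdiv_pow2(int32_t x, unsigned n)
{
    int32_t mask = (int32_t)((1u << n) - 1u);   /* divisor - 1             */
    int32_t sign = -(int32_t)(x < 0);           /* like CDQ: -1 if x < 0   */
    int32_t bias = sign & mask;                 /* correction, 0 if x >= 0 */
    return (x + bias) >> n;                     /* assumes arithmetic shift */
}

/* Remainder of signed division by 2^n; the sign follows the dividend. */
int32_t srem_pow2(int32_t x, unsigned n)
{
    int32_t mask = (int32_t)((1u << n) - 1u);
    int32_t bias = -(int32_t)(x < 0) & mask;    /* pre-correction          */
    return ((x + bias) & mask) - bias;          /* mask, then undo the bias */
}
```

Without the bias, a plain arithmetic shift rounds negative quotients toward minus infinity rather than toward zero, so -7 shifted right by 1 would give -4 instead of the -3 that IDIV produces.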
Use Alternative Code When Multiplying by a Constant
A 32-bit integer multiply by a constant has a latency of five
cycles. Therefore, use alternative code when multiplying by
certain constants. In addition, because there is just one
multiply unit, the replacement code may provide better
throughput.
The following code samples are designed such that the original
source also receives the final result. Other sequences are
possible if the result is in a different register. Adds have been
favored over shifts to keep code size small. Generally, there is a
fast replacement if the constant has very few 1 bits in binary.
More constants are found in the file multiply_by_constants.txt
located in the same directory where this document is located in
the SDK.
by 2:   ADD REG1, REG1              ;1 cycle
by 3:   LEA REG1, [REG1*2+REG1]     ;2 cycles
by 4:   SHL REG1, 2                 ;1 cycle
by 5:   LEA REG1, [REG1*4+REG1]     ;2 cycles
by 6:   LEA REG2, [REG1*4+REG1]     ;3 cycles
        ADD REG1, REG2
by 7:   MOV REG2, REG1              ;2 cycles
        SHL REG1, 3
        SUB REG1, REG2
by 8:   SHL REG1, 3                 ;1 cycle
by 9:   LEA REG1, [REG1*8+REG1]     ;2 cycles
by 10:  LEA REG2, [REG1*8+REG1]     ;3 cycles
        ADD REG1, REG2
by 11:  LEA REG2, [REG1*8+REG1]     ;3 cycles
        ADD REG1, REG1
        ADD REG1, REG2
by 12:  SHL REG1, 2                 ;3 cycles
        LEA REG1, [REG1*2+REG1]
by 13:  LEA REG2, [REG1*2+REG1]     ;3 cycles
        SHL REG1, 4
        SUB REG1, REG2
by 14:  LEA REG2, [REG1*4+REG1]     ;3 cycles
        LEA REG1, [REG1*8+REG1]
        ADD REG1, REG2
by 15:  MOV REG2, REG1              ;2 cycles
        SHL REG1, 4
        SUB REG1, REG2
by 16:  SHL REG1, 4                 ;1 cycle
by 17:  MOV REG2, REG1              ;2 cycles
        SHL REG1, 4
        ADD REG1, REG2
by 18:  ADD REG1, REG1              ;3 cycles
        LEA REG1, [REG1*8+REG1]
by 19:  LEA REG2, [REG1*2+REG1]     ;3 cycles
        SHL REG1, 4
        ADD REG1, REG2
by 20:  SHL REG1, 2                 ;3 cycles
        LEA REG1, [REG1*4+REG1]
by 21:  LEA REG2, [REG1*4+REG1]     ;3 cycles
        SHL REG1, 4
        ADD REG1, REG2
by 22:  use IMUL
by 23:  LEA REG2, [REG1*8+REG1]     ;3 cycles
        SHL REG1, 5
        SUB REG1, REG2
by 24:  SHL REG1, 3                 ;3 cycles
        LEA REG1, [REG1*2+REG1]
by 25:  LEA REG2, [REG1*8+REG1]     ;3 cycles
        SHL REG1, 4
        ADD REG1, REG2
by 26:  use IMUL
by 27:  LEA REG2, [REG1*4+REG1]     ;3 cycles
        SHL REG1, 5
        SUB REG1, REG2
by 28:  MOV REG2, REG1              ;3 cycles
        SHL REG1, 3
        SUB REG1, REG2
        SHL REG1, 2
by 29:  LEA REG2, [REG1*2+REG1]     ;3 cycles
        SHL REG1, 5
        SUB REG1, REG2
by 30:  MOV REG2, REG1              ;3 cycles
        SHL REG1, 4
        SUB REG1, REG2
        ADD REG1, REG1
by 31:  MOV REG2, REG1              ;2 cycles
        SHL REG1, 5
        SUB REG1, REG2
by 32:  SHL REG1, 5                 ;1 cycle
Use MMX™ Instructions for Integer-Only Work
In many programs it can be advantageous to use MMX
instructions to do integer-only work, especially if the function
already uses 3DNow!™ or MMX code. Using MMX instructions
relieves register pressure on the integer registers. As long as
data is simply loaded/stored, added, shifted, etc., MMX
instructions are good substitutes for integer instructions.
Integer registers are freed up with the following results:
■ May be able to reduce the number of integer registers to be
saved/restored on function entry/exit.
■ Free up integer registers for pointers, loop counters, etc., so
that they do not have to be spilled to memory, which
reduces memory traffic and latency in dependency chains.
Be careful with regards to passing data between MMX and
integer registers and of creating mismatched store-to-load
forwarding cases. See “Unrolling Loops” on page 67.
In addition, using MMX instructions increases the available
parallelism. The AMD Athlon processor can issue three integer
OPs and two MMX OPs per cycle.
Repeated String Instruction Usage
Latency of Repeated String Instructions
Table 1 shows the latency for repeated string instructions on the
AMD Athlon processor.
Table 1. Latency of Repeated String Instructions
Instruction   ECX=0 (cycles)   DF = 0 (cycles)   DF = 1 (cycles)
REP MOVS      11               15 + (4/3*c)      25 + (4/3*c)
REP STOS      11               14 + (1*c)        24 + (1*c)
REP LODS      11               15 + (2*c)        15 + (2*c)
REP SCAS      11               15 + (5/2*c)      15 + (5/2*c)
REP CMPS      11               16 + (10/3*c)     16 + (10/3*c)
Note: c = value of ECX, (ECX > 0)
Table 1 lists the latencies with the direction flag (DF) = 0
(increment) and DF = 1. In addition, these latencies are
assumed for aligned memory operands. Note that for
MOVS/STOS, when DF = 1 (DOWN), the overhead portion of the
latency increases significantly. However, these types are less
commonly found. The user should use the formula and round up
to the nearest integer value to determine the latency.
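As a worked instance of the rounding rule, the sketch below evaluates the REP MOVS entry of Table 1 for DF = 0 and aligned operands. The function name rep_movs_latency_df0 is hypothetical and the result is only the table's estimate, not a measurement.

```c
/* Estimated REP MOVS latency in cycles for DF = 0 and aligned operands,
   using the Table 1 formula 15 + (4/3 * c), rounded up.
   c is the value of ECX; the table gives 11 cycles for ECX = 0. */
unsigned rep_movs_latency_df0(unsigned c)
{
    if (c == 0)
        return 11;
    return 15 + (4 * c + 2) / 3;   /* integer ceiling of 4c/3 */
}
```

For example, a count of 100 gives 15 + 133.33..., which rounds up to 149 cycles.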
Guidelines for Repeated String Instructions
To help achieve good performance, this section contains
guidelines for the careful scheduling of VectorPath repeated
string instructions.
Use the Largest Possible Operand Size
Always move data using the largest operand size possible. For
example, use REP MOVSD rather than REP MOVSW and REP
MOVSW rather than REP MOVSB. Use REP STOSD rather than
REP STOSW and REP STOSW rather than REP STOSB.
Ensure DF=0 (UP)
Always make sure that DF = 0 (UP) (after execution of CLD) for
REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for
certain cases of overlapping REP MOVS (for example, source
and destination overlap).
While string instructions with DF = 1 (DOWN) are slower, only
the overhead part of the cycle equation is larger and not the
throughput part. See Table 1, “Latency of Repeated String
Instructions,” on page 84 for additional latency numbers.
Align Source and Destination with Operand Size
For REP MOVS, make sure that both source and destination are
aligned with regard to the operand size. Handle the end case
separately, if necessary. If either source or destination cannot
be aligned, make the destination aligned and the source
misaligned. For REP STOS, make the destination aligned.
Inline REP String with Low Counts
Expand REP string instructions into equivalent sequences of
simple x86 instructions, if the repeat count is constant and less
than eight. Use an inline sequence of loads and stores to
accomplish the move. Use a sequence of stores to emulate REP
STOS. This technique eliminates the setup overhead of REP
instructions and increases instruction throughput.
Use Loop for REP String with Low Variable Counts
If the repeat count is variable, but is likely less than eight,
use a simple loop to move/store the data. This technique avoids
the overhead of REP MOVS and REP STOS.
Using MOVQ and MOVNTQ for Block Copy/Fill
To fill or copy blocks of data that are larger than 512 bytes, or
where the destination is in uncacheable memory, it is
recommended to use the MMX instructions MOVQ/MOVNTQ
instead of REP STOS and REP MOVS in order to achieve
maximum performance. (See the guideline, “Use MMX™
Instructions for Block Copies and Block Fills” on page 115.)
Use XOR Instruction to Clear Integer Registers
To clear an integer register to all 0s, use “XOR reg, reg”. The
AMD Athlon processor is able to avoid the false read
dependency on the XOR instruction.
Example 1 (Acceptable):
MOV REG, 0
Example 2 (Preferred):
XOR REG, REG
Efficient 64-Bit Integer Arithmetic
This section contains a collection of code snippets and
subroutines showing the efficient implementation of 64-bit
arithmetic. Addition, subtraction, negation, and shifts are best
handled by inline code. Multiplies, divides, and remainders are
less common operations and should usually be implemented as
subroutines. If these subroutines are used often, the
programmer should consider inlining them. Except for division
and remainder, the code presented works for both signed and
unsigned integers. The division and remainder code shown
works for unsigned integers, but can easily be extended to
handle signed integers.
Example 1 (Addition):
;add operand in ECX:EBX to operand EDX:EAX, result in
; EDX:EAX
ADD EAX, EBX
ADC EDX, ECX
Example 2 (Subtraction):
;subtract operand in ECX:EBX from operand EDX:EAX, result in
; EDX:EAX
SUB EAX, EBX
SBB EDX, ECX
Example 3 (Negation):
;negate operand in EDX:EAX
NOT EDX
NEG EAX
SBB EDX, -1     ;fixup: increment hi-word if low-word was 0
Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (count
; applied modulo 64)
SHLD EDX, EAX, CL     ;first apply shift count
SHL  EAX, CL          ; mod 32 to EDX:EAX
TEST ECX, 32          ;need to shift by another 32?
JZ   $lshift_done     ;no, done
MOV  EDX, EAX         ;left shift EDX:EAX
XOR  EAX, EAX         ; by 32 bits
$lshift_done:
Example 5 (Right shift):
;shift operand in EDX:EAX right, shift count in ECX (count
; applied modulo 64)
SHRD EAX, EDX, CL     ;first apply shift count
SHR  EDX, CL          ; mod 32 to EDX:EAX
TEST ECX, 32          ;need to shift by another 32?
JZ   $rshift_done     ;no, done
MOV  EAX, EDX         ;right shift EDX:EAX
XOR  EDX, EDX         ; by 32 bits
$rshift_done:
Example 6 (Multiplication):
;_llmul computes the low-order half of the product of its
; arguments, two 64-bit integers
;
;INPUT:    [ESP+8]:[ESP+4]    multiplicand
;          [ESP+16]:[ESP+12]  multiplier
;
;OUTPUT:   EDX:EAX            (multiplicand * multiplier) % 2^64
;
;DESTROYS: EAX,ECX,EDX,EFlags
_llmul PROC
MOV  EDX, [ESP+8]             ;multiplicand_hi
MOV  ECX, [ESP+16]            ;multiplier_hi
OR   EDX, ECX                 ;one operand >= 2^32?
MOV  EDX, [ESP+12]            ;multiplier_lo
MOV  EAX, [ESP+4]             ;multiplicand_lo
JNZ  $twomul                  ;yes, need two multiplies
MUL  EDX                      ;multiplicand_lo * multiplier_lo
RET                           ;done, return to caller
$twomul:
IMUL EDX, [ESP+8]             ;p3_lo = multiplicand_hi*multiplier_lo
IMUL ECX, EAX                 ;p2_lo = multiplier_hi*multiplicand_lo
ADD  ECX, EDX                 ; p2_lo + p3_lo
MUL  DWORD PTR [ESP+12]       ;p1=multiplicand_lo*multiplier_lo
ADD  EDX, ECX                 ;p1+p2_lo+p3_lo = result in EDX:EAX
RET                           ;done, return to caller
_llmul ENDP
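The decomposition used by _llmul can be written out in C to show why only the low halves of the two cross products are needed. This is a minimal sketch; the function name llmul_sketch is hypothetical, and unsigned int is assumed to be 32 bits wide so that the cross products wrap modulo 2^32.

```c
#include <stdint.h>

/* Low 64 bits of a 64x64-bit product, built from 32-bit halves. */
uint64_t llmul_sketch(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t p1 = (uint64_t)a_lo * b_lo;          /* full 64-bit product */
    uint32_t cross;

    if ((a_hi | b_hi) == 0)                       /* both operands < 2^32 */
        return p1;                                /* one multiply suffices */

    /* The cross products land in bits 32..95, so only their low
       32 bits can affect the low 64 bits of the result. */
    cross = a_hi * b_lo + a_lo * b_hi;            /* wraps mod 2^32 */
    return p1 + ((uint64_t)cross << 32);
}
```

The product of the two high halves is shifted left by 64 bits and therefore never contributes, which is why the routine needs at most three multiplies.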
Example 7 (Division):
;_ulldiv divides two unsigned 64-bit integers, and returns
; the quotient.
;
;INPUT:    [ESP+8]:[ESP+4]    dividend
;          [ESP+16]:[ESP+12]  divisor
;
;OUTPUT:   EDX:EAX            quotient of division
;
;DESTROYS: EAX,ECX,EDX,EFlags
_ulldiv PROC
PUSH EBX                  ;save EBX as per calling convention
MOV  ECX, [ESP+20]        ;divisor_hi
MOV  EBX, [ESP+16]        ;divisor_lo
MOV  EDX, [ESP+12]        ;dividend_hi
MOV  EAX, [ESP+8]         ;dividend_lo
TEST ECX, ECX             ;divisor > 2^32-1?
JNZ  $big_divisor         ;yes, divisor > 2^32-1
CMP  EDX, EBX             ;only one division needed? (ECX = 0)
JAE  $two_divs            ;need two divisions
DIV  EBX                  ;EAX = quotient_lo
MOV  EDX, ECX             ;EDX = quotient_hi = 0 (quotient in
                          ; EDX:EAX)
POP  EBX                  ;restore EBX as per calling convention
RET                       ;done, return to caller
$two_divs:
MOV  ECX, EAX             ;save dividend_lo in ECX
MOV  EAX, EDX             ;get dividend_hi
XOR  EDX, EDX             ;zero extend it into EDX:EAX
DIV  EBX                  ;quotient_hi in EAX
XCHG EAX, ECX             ;ECX = quotient_hi, EAX = dividend_lo
DIV  EBX                  ;EAX = quotient_lo
MOV  EDX, ECX             ;EDX = quotient_hi (quotient in EDX:EAX)
POP  EBX                  ;restore EBX as per calling convention
RET                       ;done, return to caller
$big_divisor:
PUSH EDI                  ;save EDI as per calling convention
MOV  EDI, ECX             ;save divisor_hi
SHR  EDX, 1               ;shift both divisor and dividend right
RCR  EAX, 1               ; by 1 bit
ROR  EDI, 1
RCR  EBX, 1
BSR  ECX, ECX             ;ECX = number of remaining shifts
SHRD EBX, EDI, CL         ;scale down divisor and dividend
SHRD EAX, EDX, CL         ; such that divisor is
SHR  EDX, CL              ; less than 2^32 (i.e. fits in EBX)
ROL  EDI, 1               ;restore original divisor_hi
DIV  EBX                  ;compute quotient
MOV  EBX, [ESP+12]        ;dividend_lo
MOV  ECX, EAX             ;save quotient
IMUL EDI, EAX             ;quotient * divisor hi-word
                          ; (low only)
MUL  DWORD PTR [ESP+20]   ;quotient * divisor lo-word
ADD  EDX, EDI             ;EDX:EAX = quotient * divisor
SUB  EBX, EAX             ;dividend_lo - (quot.*divisor)_lo
MOV  EAX, ECX             ;get quotient
MOV  ECX, [ESP+16]        ;dividend_hi
SBB  ECX, EDX             ;subtract divisor * quot. from dividend
SBB  EAX, 0               ;adjust quotient if remainder negative
XOR  EDX, EDX             ;clear hi-word of quot(EAX<=FFFFFFFFh)
POP  EDI                  ;restore EDI as per calling convention
POP  EBX                  ;restore EBX as per calling convention
RET                       ;done, return to caller
_ulldiv ENDP
Example 8 (Remainder):
;_ullrem divides two unsigned 64-bit integers, and returns
; the remainder.
;
;INPUT:    [ESP+8]:[ESP+4]    dividend
;          [ESP+16]:[ESP+12]  divisor
;
;OUTPUT:   EDX:EAX            remainder of division
;
;DESTROYS: EAX,ECX,EDX,EFlags
_ullrem PROC
PUSH EBX                  ;save EBX as per calling convention
MOV  ECX, [ESP+20]        ;divisor_hi
MOV  EBX, [ESP+16]        ;divisor_lo
MOV  EDX, [ESP+12]        ;dividend_hi
MOV  EAX, [ESP+8]         ;dividend_lo
TEST ECX, ECX             ;divisor > 2^32-1?
JNZ  $r_big_divisor       ;yes, divisor > 2^32-1
CMP  EDX, EBX             ;only one division needed? (ECX = 0)
JAE  $r_two_divs          ;need two divisions
DIV  EBX                  ;EAX = quotient_lo
MOV  EAX, EDX             ;EAX = remainder_lo
MOV  EDX, ECX             ;EDX = remainder_hi = 0
POP  EBX                  ;restore EBX as per calling convention
RET                       ;done, return to caller
$r_two_divs:
MOV  ECX, EAX             ;save dividend_lo in ECX
MOV  EAX, EDX             ;get dividend_hi
XOR  EDX, EDX             ;zero extend it into EDX:EAX
DIV  EBX                  ;EAX = quotient_hi, EDX = intermediate
                          ; remainder
MOV  EAX, ECX             ;EAX = dividend_lo
DIV  EBX                  ;EAX = quotient_lo
MOV  EAX, EDX             ;EAX = remainder_lo
XOR  EDX, EDX             ;EDX = remainder_hi = 0
POP  EBX                  ;restore EBX as per calling convention
RET                       ;done, return to caller
$r_big_divisor:
PUSH EDI                  ;save EDI as per calling convention
MOV  EDI, ECX             ;save divisor_hi
SHR  EDX, 1               ;shift both divisor and dividend right
RCR  EAX, 1               ; by 1 bit
ROR  EDI, 1
RCR  EBX, 1
BSR  ECX, ECX             ;ECX = number of remaining shifts
SHRD EBX, EDI, CL         ;scale down divisor and dividend such
SHRD EAX, EDX, CL         ; that divisor is less than 2^32
SHR  EDX, CL              ; (i.e. fits in EBX)
ROL  EDI, 1               ;restore original divisor_hi
DIV  EBX                  ;compute quotient
MOV  EBX, [ESP+12]        ;dividend lo-word
MOV  ECX, EAX             ;save quotient
IMUL EDI, EAX             ;quotient * divisor hi-word (low only)
MUL  DWORD PTR [ESP+20]   ;quotient * divisor lo-word
ADD  EDX, EDI             ;EDX:EAX = quotient * divisor
SUB  EBX, EAX             ;dividend_lo - (quot.*divisor)_lo
MOV  ECX, [ESP+16]        ;dividend_hi
MOV  EAX, [ESP+20]        ;divisor_lo
SBB  ECX, EDX             ;subtract divisor * quot. from
                          ; dividend
SBB  EDX, EDX             ;(remainder < 0)? 0xFFFFFFFF : 0
AND  EAX, EDX             ;(remainder < 0)? divisor_lo : 0
AND  EDX, [ESP+24]        ;(remainder < 0)? divisor_hi : 0
ADD  EAX, EBX             ;remainder += (remainder < 0)?
ADC  EDX, ECX             ; divisor : 0
POP  EDI                  ;restore EDI as per calling convention
POP  EBX                  ;restore EBX as per calling convention
RET                       ;done, return to caller
_ullrem ENDP
Efficient Implementation of Population Count Function
Population count is an operation that determines the number of
set bits in a bit string. For example, this can be used to
determine the cardinality of a set. The following example code
shows how to efficiently implement a population count
operation for 32-bit operands. The example is written for the
inline assembler of Microsoft Visual C.
Function popcount() implements a branchless computation of
the population count. It is based on a O(log(n)) algorithm that
successively groups the bits into groups of 2, 4, 8, 16, and 32,
while maintaining a count of the set bits in each group. The
algorithm consists of the following steps:
Step 1
Partition the integer into groups of two bits. Compute the
population count for each 2-bit group and store the result in the
2-bit group. This calls for the following transformation to be
performed for each 2-bit group:
00b -> 00b
01b -> 01b
10b -> 01b
11b -> 10b
If the original value of a 2-bit group is v, then the new value will
be v - (v >> 1). In order to handle all 2-bit groups simultaneously,
it is necessary to mask appropriately to prevent spilling from
one bit group to the next lower bit group. Thus:
w = v - ((v >> 1) & 0x55555555)
Step 2
Add the population counts of adjacent 2-bit groups and store the
sum to the 4-bit group resulting from merging these adjacent
2-bit groups. To do this simultaneously to all groups, mask out
the odd numbered groups, mask out the even numbered groups,
and then add the odd numbered groups to the even numbered
groups:
x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
Each 4-bit field now has value 0000b, 0001b, 0010b, 0011b, or
0100b.
Step 3
For the first time, the value in each k-bit field is small enough
that adding two k-bit fields results in a value that still fits in the
k-bit field. Thus the following computation is performed:
y = (x + (x >> 4)) & 0x0F0F0F0F
The result is four 8-bit fields whose lower half has the desired
sum and whose upper half contains "junk" that has to be
masked out. In a symbolic form:
x      = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum    = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
The WWWW, XXXX, YYYY, and ZZZZ values are the
interesting sums with each at most 1000b, or 8 decimal.
Step 4
The four 4-bit sums can now be rapidly accumulated by means
of a multiply with a "magic" multiplier. This can be derived
from looking at the following chart of partial products:
0p0q0r0s * 01010101 =
:0p0q0r0s
0p:0q0r0s
0p0q:0r0s
0p0q0r:0s
000pxxww:vvuutt0s
Here p, q, r, and s are the 4-bit sums from the previous step, and
vv is the final result in which we are interested. Thus, the final
result:
z = (y * 0x01010101) >> 24
Example:
unsigned int popcount(unsigned int v)
{
unsigned int retVal;
__asm {
MOV  EAX, [v]          ;v
MOV  EDX, EAX          ;v
SHR  EAX, 1            ;v >> 1
AND  EAX, 055555555h   ;(v >> 1) & 0x55555555
SUB  EDX, EAX          ;w = v - ((v >> 1) & 0x55555555)
MOV  EAX, EDX          ;w
SHR  EDX, 2            ;w >> 2
AND  EAX, 033333333h   ;w & 0x33333333
AND  EDX, 033333333h   ;(w >> 2) & 0x33333333
ADD  EAX, EDX          ;x = (w & 0x33333333) + ((w >> 2) &
                       ; 0x33333333)
MOV  EDX, EAX          ;x
SHR  EAX, 4            ;x >> 4
ADD  EAX, EDX          ;x + (x >> 4)
AND  EAX, 00F0F0F0Fh   ;y = (x + (x >> 4)) & 0x0F0F0F0F
IMUL EAX, 001010101h   ;y * 0x01010101
SHR  EAX, 24           ;population count = (y *
                       ; 0x01010101) >> 24
MOV  retVal, EAX       ;store result
}
return (retVal);
}
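For compilers without the Microsoft inline assembler, the four steps can be written directly in C. This is a minimal sketch that follows the formulas for w, x, y, and z given in steps 1 through 4; the function name popcount32 is hypothetical, and a compiler is not guaranteed to generate the same instruction sequence as the assembly above.

```c
#include <stdint.h>

/* Branchless population count of a 32-bit operand. */
uint32_t popcount32(uint32_t v)
{
    /* Step 1: per 2-bit group, count = v - (v >> 1), masked per group. */
    uint32_t w = v - ((v >> 1) & 0x55555555u);
    /* Step 2: add adjacent 2-bit counts into 4-bit fields. */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    /* Step 3: add adjacent 4-bit counts; keep the low nibble of each byte. */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;
    /* Step 4: the multiply sums the four byte counts into the top byte. */
    return (y * 0x01010101u) >> 24;
}
```

The multiply in step 4 wraps modulo 2^32, which is harmless because the sum of interest (at most 32) occupies bits 24 through 31 of the product.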
Derivation of Multiplier Used for Integer Division by
Constants
Unsigned Derivation for Algorithm, Multiplier, and Shift Factor
The utility udiv.exe was compiled using the code shown in this
section.
The following code derives the multiplier value used when
performing integer division by constants. The code works for
unsigned integer division and for odd divisors between 1 and
2^31–1, inclusive. For divisors of the form d = d’*2^n, the multiplier
is the same as for d’ and the shift factor is s + n.
/* Code snippet to determine algorithm (a), multiplier (m),
and shift factor (s) to perform division on unsigned 32-bit
integers by constant divisor. Code is written for the
Microsoft Visual C compiler. */
/*
In:  d = divisor, 1 <= d < 2^31, d odd
Out: a = algorithm
     m = multiplier
     s = shift factor
;algorithm 0
MOV EDX, dividend
MOV EAX, m
MUL EDX
SHR EDX, s        ;EDX=quotient
;algorithm 1
MOV EDX, dividend
MOV EAX, m
MUL EDX
ADD EAX, m
ADC EDX, 0
SHR EDX, s        ;EDX=quotient
*/
typedef unsigned __int64 U64;
typedef unsigned long U32;
U32 d, l, s, m, a, r;
U64 m_low, m_high, j, k;
U32 log2 (U32 i)
{
U32 t = 0;
i = i >> 1;
while (i) {
i = i >> 1;
t++;
}
return (t);
}
/* Generate m, s for algorithm 0. Based on: Granlund, T.;
Montgomery, P.L.: “Division by Invariant Integers using
Multiplication”. SIGPLAN Notices, Vol. 29, June 1994, page
61. */
l      = log2(d) + 1;
j      = (((U64)(0xffffffff)) % ((U64)(d)));
k      = (((U64)(1)) << (32+l)) / ((U64)(0xffffffff-j));
m_low  = (((U64)(1)) << (32+l)) / d;
m_high = ((((U64)(1)) << (32+l)) + k) / d;
while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {
   m_low  = m_low >> 1;
   m_high = m_high >> 1;
   l      = l - 1;
}
if ((m_high >> 32) == 0) {
   m = ((U32)(m_high));
   s = l;
   a = 0;
}
/* Generate m, s for algorithm 1. Based on: Magenheimer,
D.J.; et al: “Integer Multiplication and Division on the HP
Precision Architecture”. IEEE Transactions on Computers, Vol
37, No. 8, August 1988, page 980. */
else {
   s     = log2(d);
   m_low = (((U64)(1)) << (32+s)) / ((U64)(d));
   r     = ((U32)((((U64)(1)) << (32+s)) % ((U64)(d))));
   m     = (r < ((d>>1)+1)) ? ((U32)(m_low)) : ((U32)(m_low))+1;
   a     = 1;
}
/* Reduce multiplier/shift factor for either algorithm to
smallest possible */
while (!(m&1)) {
   m = m >> 1;
   s--;
}
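The derivation above can be exercised end to end. The following is a minimal sketch in portable C99, assuming <stdint.h> types in place of the Microsoft-specific __int64. The names udiv_params, udiv_derive, and udiv_apply are hypothetical and are not taken from udiv.exe. udiv_apply models the two instruction sequences (algorithm 0 and algorithm 1) with 64-bit arithmetic, and the sketch rejects d = 1, which needs no multiplier.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t a, m, s; } udiv_params;  /* hypothetical container */

static uint32_t ilog2(uint32_t i)
{
    uint32_t t = 0;
    for (i >>= 1; i; i >>= 1) t++;
    return t;
}

/* Derive algorithm, multiplier, and shift factor for an odd divisor d > 1. */
static udiv_params udiv_derive(uint32_t d)
{
    udiv_params p;
    uint32_t l;
    uint64_t j, k, m_low, m_high;

    assert(d > 1 && (d & 1) && d < 0x80000000u);
    l      = ilog2(d) + 1;
    j      = 0xffffffffULL % d;
    k      = (1ULL << (32 + l)) / (0xffffffffULL - j);
    m_low  = (1ULL << (32 + l)) / d;
    m_high = ((1ULL << (32 + l)) + k) / d;
    while ((m_low >> 1) < (m_high >> 1) && l > 0) {
        m_low >>= 1; m_high >>= 1; l--;
    }
    if ((m_high >> 32) == 0) {            /* algorithm 0 */
        p.m = (uint32_t)m_high; p.s = l; p.a = 0;
    } else {                              /* algorithm 1 */
        uint32_t r;
        p.s   = ilog2(d);
        m_low = (1ULL << (32 + p.s)) / d;
        r     = (uint32_t)((1ULL << (32 + p.s)) % d);
        p.m   = (r < (d >> 1) + 1) ? (uint32_t)m_low : (uint32_t)m_low + 1;
        p.a   = 1;
    }
    while (!(p.m & 1)) { p.m >>= 1; p.s--; }  /* reduce to smallest form */
    return p;
}

/* Model of the generated code: quotient = x / d using m, s, and a. */
static uint32_t udiv_apply(uint32_t x, udiv_params p)
{
    uint64_t prod = (uint64_t)x * p.m;    /* MUL EDX */
    if (p.a) prod += p.m;                 /* ADD EAX, m / ADC EDX, 0 */
    return (uint32_t)(prod >> 32) >> p.s; /* SHR EDX, s */
}
```

For d = 3 the sketch yields algorithm 0 with m = AAAAAAABh and s = 1; for d = 7 it selects algorithm 1.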
Signed Derivation for Algorithm, Multiplier, and Shift Factor
The utility sdiv.exe was compiled using the following code.
/* Code snippet to determine algorithm (a), multiplier (m),
and shift count (s) for 32-bit signed integer division,
given divisor d. Written for Microsoft Visual C compiler. */
/*
IN:   d = divisor, 2 <= d < 2^31
OUT:  a = algorithm
      m = multiplier
      s = shift count

;algorithm 0
MOV   EAX, m
MOV   EDX, dividend
MOV   ECX, EDX
IMUL  EDX
SHR   ECX, 31
SAR   EDX, s
ADD   EDX, ECX        ; quotient in EDX

;algorithm 1
MOV   EAX, m
MOV   EDX, dividend
MOV   ECX, EDX
IMUL  EDX
ADD   EDX, ECX
SHR   ECX, 31
SAR   EDX, s
ADD   EDX, ECX        ; quotient in EDX
*/
typedef unsigned __int64 U64;
typedef unsigned long    U32;
U32 log2 (U32 i)
{
U32 t = 0;
i = i >> 1;
while (i) {
i = i >> 1;
t++;
}
return (t);
}
U32 d, l, s, m, a;
U64 m_low, m_high, j, k;
/* Determine algorithm (a), multiplier (m), and shift count
(s) for 32-bit signed integer division. Based on: Granlund,
T.; Montgomery, P.L.: “Division by Invariant Integers using
Multiplication”. SIGPLAN Notices, Vol. 29, June 1994, page
61. */
l      = log2(d);
j      = (((U64)(0x80000000)) % ((U64)(d)));
k      = (((U64)(1)) << (32+l)) / ((U64)(0x80000000-j));
m_low  = (((U64)(1)) << (32+l)) / d;
m_high = ((((U64)(1)) << (32+l)) + k) / d;
while (((m_low >> 1) < (m_high >> 1)) && (l > 0)) {
   m_low  = m_low >> 1;
   m_high = m_high >> 1;
   l      = l - 1;
}
m = ((U32)(m_high));
s = l;
a = (m_high >> 31) ? 1 : 0;
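The signed derivation can be checked the same way. This is a minimal sketch in portable C99, not the sdiv.exe source: sdiv_params, sdiv_derive, sdiv_apply, and the helper sar64 are hypothetical names. sdiv_apply follows the two instruction sequences listed in the comment above, using 64-bit intermediates so that no step depends on signed overflow or on how the compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t a, m, s; } sdiv_params;  /* hypothetical container */

static uint32_t ilog2(uint32_t i)
{
    uint32_t t = 0;
    for (i >>= 1; i; i >>= 1) t++;
    return t;
}

/* Arithmetic shift right (floor division by 2^n), written portably. */
static int64_t sar64(int64_t v, unsigned n)
{
    return v >= 0 ? v >> n : ~((~v) >> n);
}

static sdiv_params sdiv_derive(uint32_t d)
{
    sdiv_params p;
    uint32_t l;
    uint64_t j, k, m_low, m_high;

    assert(d >= 2 && d < 0x80000000u);
    l      = ilog2(d);
    j      = 0x80000000ULL % d;
    k      = (1ULL << (32 + l)) / (0x80000000ULL - j);
    m_low  = (1ULL << (32 + l)) / d;
    m_high = ((1ULL << (32 + l)) + k) / d;
    while ((m_low >> 1) < (m_high >> 1) && l > 0) {
        m_low >>= 1; m_high >>= 1; l--;
    }
    p.m = (uint32_t)m_high;
    p.s = l;
    p.a = (m_high >> 31) ? 1 : 0;
    return p;
}

/* Model of the generated code: quotient = x / d, truncating toward zero. */
static int32_t sdiv_apply(int32_t x, sdiv_params p)
{
    /* IMUL treats m as signed; for algorithm 1 that value is m - 2^32. */
    int64_t ms = (int64_t)p.m - (p.a ? 0x100000000LL : 0);
    int64_t hi = sar64(ms * x, 32);   /* EDX after IMUL EDX */
    if (p.a) hi += x;                 /* ADD EDX, ECX (algorithm 1 only) */
    hi = sar64(hi, p.s);              /* SAR EDX, s */
    return (int32_t)(hi + (x < 0));   /* ADD EDX, ECX with ECX = sign bit */
}
```

For d = 3 the sketch yields algorithm 0 with m = 55555556h and s = 0; for d = 7 it yields algorithm 1 with m = 92492493h and s = 2.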
9
Floating-Point Optimizations
This chapter details the methods used to optimize
floating-point code for the pipelined floating-point unit (FPU).
Guidelines are listed in order of importance.
Ensure All FPU Data is Aligned
As discussed in “Memory Size and Alignment Issues” on page
45, floating-point data should be naturally aligned. That is,
words should be aligned on word boundaries, doublewords on
doubleword boundaries, and quadwords on quadword
boundaries. Misaligned memory accesses reduce the available
memory bandwidth.
Use Multiplies Rather than Divides
If accuracy requirements allow, floating-point division by a
constant should be converted to a multiply by the reciprocal.
Divisors that are powers of two, and their reciprocals, are
exactly representable, so the conversion causes no accuracy
issue except in the rare case that the reciprocal overflows or
underflows. Unless such an overflow or underflow occurs, a
division by a power of two should always be converted to a
multiply. Although the AMD Athlon™ processor has
high-performance division, multiplies are significantly faster
than divides.
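As a small illustration of the power-of-two case, the following C sketch divides by 2^n by multiplying with the exactly representable reciprocal 2^-n. The function name div_pow2_by_multiply is hypothetical, and the sketch assumes IEEE-754 single precision with no underflow in the product.

```c
#include <assert.h>

/* Divide x by 2^n using a multiply. The reciprocal 2^-n is exactly
   representable, so the result matches x / 2^n unless it underflows. */
static float div_pow2_by_multiply(float x, unsigned n)
{
    float recip;

    assert(n < 31);                     /* keeps 1u << n well defined */
    recip = 1.0f / (float)(1u << n);    /* computed once for a constant n */
    return x * recip;
}
```

A compiler or programmer would normally fold the reciprocal into a constant, for example writing x * 0.125f instead of x / 8.0f.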
Use FFREEP Macro to Pop One Register from the FPU Stack
In FPU intensive code, frequently accessed data is often
pre-loaded at the bottom of the FPU stack before processing
floating-point data. After completion of processing, it is
desirable to remove the pre-loaded data from the FPU stack as
quickly as possible. The classical way to clean up the FPU stack
is to use either of the following instructions:
FSTP   ST(0)     ;removes one register from stack
FCOMPP           ;removes two registers from stack
On the AMD Athlon processor, a faster alternative is to use the
FFREEP instruction below. Note that the FFREEP instruction,
although insufficiently documented in the past, is supported by
all 32-bit x86 processors. The opcode bytes for FFREEP ST(i)
are listed in Table 22 on page 212.
FFREEP ST(0)     ;removes one register from stack
FFREEP ST(i) works like FFREE ST(i) except that it
increments the FPU top-of-stack after doing the FFREE work.
In other words, FFREEP ST(i) marks ST(i) as empty, then
increments the x87 stack pointer. On the AMD Athlon
processor, the FFREEP instruction converts to an internal NOP,
which can go down any pipe with no dependencies.
Many assemblers do not support the FFREEP instruction. In
these cases, a simple text macro can be created to facilitate use
of the FFREEP ST(0).
FFREEP_ST0 TEXTEQU <DB 0DFh, 0C0h>
Floating-Point Compare Instructions
For branches that are dependent on floating-point comparisons,
use the following instructions:
■ FCOMI
■ FCOMIP
■ FUCOMI
■ FUCOMIP
These instructions are much faster than the classical approach
using FSTSW, because FSTSW is essentially a serializing
instruction on the AMD Athlon processor. When FSTSW cannot
be avoided (for example, backward compatibility of code with
older processors), no FPU instruction should occur between an
FCOM[P], FICOM[P], FUCOM[P], or FTST and a dependent
FSTSW. This optimization allows the use of a fast forwarding
mechanism for the FPU condition codes internal to the
AMD Athlon processor FPU and increases performance.
Use the FXCH Instruction Rather than FST/FLD Pairs
Increase parallelism by breaking up dependency chains or by
evaluating multiple dependency chains simultaneously by
explicitly switching execution between them. Although the
AMD Athlon processor FPU has a deep scheduler, which in
most cases can extract sufficient parallelism from existing code,
long dependency chains can stall the scheduler while issue slots
are still available. The maximum dependency chain length that
the scheduler can absorb is about six 4-cycle instructions.
To switch execution between dependency chains, use of the
FXCH instruction is recommended because it has an apparent
latency of zero cycles and generates only one OP. The
AMD Athlon processor FPU contains special hardware to
handle up to three FXCH instructions per cycle. Using FXCH is
preferred over the use of FST/FLD pairs, even if the FST/FLD
pair works on a register. An FST/FLD pair adds two cycles of
latency and consists of two OPs.
Avoid Using Extended-Precision Data
Store data as either single-precision or double-precision
quantities. Loading and storing extended-precision data is
comparatively slower.
Minimize Floating-Point-to-Integer Conversions
C++, C, and Fortran define floating-point-to-integer conversions
as truncating. This creates a problem because the active
rounding mode in an application is typically
round-to-nearest-even. The classical way to do a double-to-int
conversion therefore works as follows:
Example 1 (Fast):
FLD   QWORD PTR [X]           ;load double to be converted
FSTCW [SAVE_CW]               ;save current FPU control word
MOVZX EAX, WORD PTR [SAVE_CW] ;retrieve control word
OR    EAX, 0C00h              ;rounding control field = truncate
MOV   WORD PTR [NEW_CW], AX   ;new FPU control word
FLDCW [NEW_CW]                ;load new FPU control word
FISTP DWORD PTR [I]           ;do double->int conversion
FLDCW [SAVE_CW]               ;restore original control word
The AMD Athlon processor contains special acceleration
hardware to execute such code as quickly as possible. In most
situations, the above code is therefore the fastest way to
perform floating-point-to-integer conversion and the conversion
is compliant both with programming language standards and
the IEEE-754 standard.
According to the recommendations for inlining (see “Always
Inline Functions with Fewer than 25 Machine Instructions” on
page 72), the above code should not be put into a separate
subroutine (e.g., ftol). It should rather be inlined into the main
code.
In some codes, floating-point numbers are converted to an
integer and the result is immediately converted back to
floating-point. In such cases, the FRNDINT instruction should
be used for maximum performance instead of FISTP in the code
above. FRNDINT delivers the integral result directly to an FPU
register in floating-point form, which is faster than first using
FISTP to store the integer result and then converting it back to
floating-point with FILD.
If there are multiple, consecutive floating-point-to-integer
conversions, the cost of FLDCW operations should be
minimized by saving the current FPU control word, forcing the
FPU into truncating mode, and performing all of the
conversions before restoring the original control word.
The speed of the above code is somewhat dependent on the
nature of the code surrounding it. For applications in which the
speed of floating-point-to-integer conversions is extremely
critical for application performance, experiment with either of
the following substitutions, which may or may not be faster than
the code above.
The first substitution simulates a truncating floating-point to
integer conversion provided that there are no NaNs, infinities,
and overflows. This conversion is therefore not IEEE-754
compliant. This code works properly only if the current FPU
rounding mode is round-to-nearest-even, which is usually the
case.
Example 2 (Potentially faster):
FLD   QWORD PTR [X]     ;load double to be converted
FST   DWORD PTR [TX]    ;store X because sign(X) is needed
FIST  DWORD PTR [I]     ;store rndint(x) as default result
FISUB DWORD PTR [I]     ;compute DIFF = X - rndint(X)
FSTP  DWORD PTR [DIFF]  ;store DIFF as we need sign(DIFF)
MOV   EAX, [TX]         ;X
MOV   EDX, [DIFF]       ;DIFF
TEST  EDX, EDX          ;DIFF == 0 ?
JZ    $DONE             ;default result is OK, done
XOR   EDX, EAX          ;need correction if sign(X) != sign(DIFF)
SAR   EAX, 31           ;(X<0) ? 0xFFFFFFFF : 0
SAR   EDX, 31           ;sign(X)!=sign(DIFF)?0xFFFFFFFF:0
LEA   EAX, [EAX+EAX+1]  ;(X<0) ? 0xFFFFFFFF : 1
AND   EDX, EAX          ;correction: -1, 0, 1
SUB   [I], EDX          ;trunc(X)=rndint(X)-correction
$DONE:
The second substitution simulates a truncating floating-point to
integer conversion using only integer instructions and therefore
works correctly independent of the FPU's current rounding
mode. It does not handle NaNs, infinities, and overflows
according to the IEEE-754 standard. Note that the first
instruction of this code may cause an STLF size mismatch
resulting in performance degradation if the variable to be
converted has been stored recently.
Example 3 (Potentially faster):
MOV   ECX, DWORD PTR [X+4]  ;get upper 32 bits of double
XOR   EDX, EDX              ;i = 0
MOV   EAX, ECX              ;save sign bit
AND   ECX, 07FF00000h       ;isolate exponent field
CMP   ECX, 03FF00000h       ;if abs(x) < 1.0
JB    $DONE2                ; then i = 0
MOV   EDX, DWORD PTR [X]    ;get lower 32 bits of double
SHR   ECX, 20               ;extract exponent
SHRD  EDX, EAX, 21          ;extract mantissa
NEG   ECX                   ;compute shift factor for extracting
ADD   ECX, 1054             ;non-fractional mantissa bits
OR    EDX, 080000000h       ;set integer bit of mantissa
SAR   EAX, 31               ;x < 0 ? 0xffffffff : 0
SHR   EDX, CL               ;i = trunc(abs(x))
XOR   EDX, EAX              ;i = x < 0 ? ~i : i
SUB   EDX, EAX              ;i = x < 0 ? -i : i
$DONE2:
MOV   [I], EDX              ;store result
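The integer-only technique of Example 3 can also be followed in C. The sketch below is an illustrative translation, not code from this guide: the name trunc_by_bits is hypothetical, and it assumes IEEE-754 doubles and an argument with a magnitude below 2^31. Like the assembly version, it does not handle NaNs, infinities, or overflows.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Truncate x toward zero using only integer operations on its bits. */
static int32_t trunc_by_bits(double x)
{
    uint64_t bits;
    uint32_t hi, lo, exp_field, mant, shift, mag;

    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    hi = (uint32_t)(bits >> 32);                /* sign, exponent, top mantissa */
    lo = (uint32_t)bits;                        /* low 32 mantissa bits */

    exp_field = hi & 0x7FF00000u;               /* isolate exponent field */
    if (exp_field < 0x3FF00000u)
        return 0;                               /* abs(x) < 1.0 */

    mant  = (lo >> 21) | (hi << 11);            /* SHRD EDX, EAX, 21 */
    mant |= 0x80000000u;                        /* set integer bit */
    shift = (1054u - (exp_field >> 20)) & 31u;  /* CL is taken modulo 32 */
    mag   = mant >> shift;                      /* trunc(abs(x)) */
    return (hi >> 31) ? -(int32_t)mag : (int32_t)mag;
}
```

The constant 1054 is 1023 + 31: the exponent bias plus the position of the integer bit after the mantissa has been left-justified in 32 bits.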
For applications which can tolerate a floating-point-to-integer
conversion that is not compliant with existing programming
language standards (but is IEEE-754 compliant), perform the
conversion using the rounding mode that is currently in effect
(usually round-to-nearest-even).
Example 4 (Fastest):
FLD   QWORD PTR [X]   ; get double to be converted
FISTP DWORD PTR [I]   ; store integer result
Some compilers offer an option to use the code from example 4
for floating-point-to-integer conversion, using the default
rounding mode.
Lastly, consider setting the rounding mode throughout an
application to truncate and using the code from example 4 to
perform extremely fast conversions that are compliant with
language standards and IEEE-754. This mode is also provided
as an option by some compilers. Note that use of this technique
also changes the rounding mode for all other FPU operations
inside the application, which can lead to significant changes in
numerical results and even program failure (for example, due to
lack of convergence in iterative algorithms).
Floating-Point Subexpression Elimination
There are cases which do not require an FXCH instruction after
every instruction to allow access to two new stack entries. In the
cases where two instructions share a source operand, an FXCH
is not required between the two instructions. When there is an
opportunity for subexpression elimination, reduce the number
of superfluous FXCH instructions by putting the shared source
operand at the top of the stack. For example, using the function:
func( (x*y), (x+z) )
Example 1 (Avoid):
FLD   Z
FLD   Y
FLD   X
FADD  ST, ST(2)
FXCH  ST(1)
FMUL  ST, ST(2)
CALL  FUNC
FSTP  ST(0)
Example 2 (Preferred):
FLD   Z
FLD   Y
FLD   X
FMUL  ST(1), ST
FADDP ST(2), ST
CALL  FUNC
Check Argument Range of Trigonometric Instructions Efficiently
The transcendental instructions FSIN, FCOS, FPTAN, and
FSINCOS are architecturally restricted in their argument
range. Only arguments with a magnitude of <= 2^63 can be
evaluated. If the argument is out of range, the C2 bit in the FPU
status word is set, and the argument is returned as the result.
Software needs to guard against such (extremely infrequent)
cases.
If an “argument out of range” is detected, a range reduction
subroutine is invoked which reduces the argument to less than
2^63 before the instruction is attempted again. While an
argument > 2^63 is unusual, it often indicates a problem
elsewhere in the code and the code may completely fail in the
absence of a properly guarded trigonometric instruction. For
example, in the case of FSIN or FCOS generated from a sin() or
cos() function invocation in the HLL, the downstream code
might reasonably expect that the returned result is in the range
[-1,1].
A naive solution for guarding a trigonometric instruction may
check the C2 bit in the FPU status word after each FSIN, FCOS,
FPTAN, and FSINCOS instruction, and take appropriate action
if it is set (indicating an argument out of range).
Example 1 (Avoid):
FLD   QWORD PTR [x]   ;argument
FSIN                  ;compute sine
FSTSW AX              ;store FPU status word to AX
TEST  AX, 0400h       ;is the C2 bit set?
JZ    $in_range       ;nope, argument was in range, all OK
CALL  $reduce_range   ;reduce argument in ST(0) to < 2^63
FSIN                  ;compute sine (in-range argument
                      ; guaranteed)
$in_range:
Such a solution is inefficient since the FSTSW instruction is
serializing with respect to all x87/3DNow!/MMX instructions
and should thus be avoided (see the section “Floating-Point
Compare Instructions” on page 98). Use of FSTSW in the above
fashion slows down the common path through the code.
Instead, it is advisable to check the argument before one of the
trigonometric instructions is invoked.
Example 2 (Preferred):
FLD    QWORD PTR [x]              ;argument
FLD    DWORD PTR [two_to_the_63]  ;2^63
FCOMIP ST, ST(1)                  ;argument <= 2^63 ?
JAE    $in_range                  ;Yes, it is in range.
CALL   $reduce_range              ;reduce argument in ST(0) to < 2^63
$in_range:
FSIN                              ;compute sine (in-range argument
                                  ; guaranteed)
Since out-of-range arguments are extremely uncommon, the
conditional branch will be perfectly predicted, and the other
instructions used to guard the trigonometric instruction can
execute in parallel to it.
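The same guard can be expressed at the source level before a sine or cosine is requested. The following C sketch only shows the range test; the name fsin_arg_in_range is hypothetical, and the range reduction routine itself is outside the scope of the sketch. It checks the magnitude in both directions and treats a NaN as out of range.

```c
#include <assert.h>

/* Return nonzero when x can be passed to FSIN, FCOS, FPTAN, or FSINCOS
   directly, that is, when its magnitude does not exceed 2^63. */
static int fsin_arg_in_range(double x)
{
    const double limit = 9223372036854775808.0;   /* 2^63 */
    return x >= -limit && x <= limit;             /* false for NaN */
}
```

A caller would invoke its range reduction routine only when the function returns zero, which keeps the common path free of any status word access.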
Take Advantage of the FSINCOS Instruction
Frequently, a piece of code that needs to compute the sine of an
argument also needs to compute the cosine of that same
argument. In such cases, the FSINCOS instruction should be
used to compute both trigonometric functions concurrently,
which is faster than using separate FSIN and FCOS instructions
to accomplish the same task.
Example 1 (Avoid):
FLD    QWORD PTR [x]
FLD    DWORD PTR [two_to_the_63]
FCOMIP ST, ST(1)
JAE    $in_range
CALL   $reduce_range
$in_range:
FLD    ST(0)
FCOS
FSTP   QWORD PTR [cosine_x]
FSIN
FSTP   QWORD PTR [sine_x]
Example 2 (Preferred):
FLD    QWORD PTR [x]
FLD    DWORD PTR [two_to_the_63]
FCOMIP ST, ST(1)
JAE    $in_range
CALL   $reduce_range
$in_range:
FSINCOS
FSTP   QWORD PTR [cosine_x]
FSTP   QWORD PTR [sine_x]
10
3DNow!™ and MMX™ Optimizations
This chapter describes 3DNow! and MMX code optimization
techniques for the AMD Athlon™ processor. Guidelines are
listed in order of importance. 3DNow! porting guidelines can be
found in the 3DNow!™ Instruction Porting Guide, order# 22621.
Use 3DNow!™ Instructions
✩
TOP
Unless accuracy requirements dictate otherwise, perform
floating-point computations using the 3DNow! instructions
instead of x87 instructions. The SIMD nature of 3DNow!
achieves twice the number of FLOPs that are achieved through
x87 instructions. 3DNow! instructions provide for a flat register
file instead of the stack-based approach of x87 instructions.
See the 3DNow!™ Technology Manual, order# 21928 for
information on instruction usage.
Use FEMMS Instruction
Though there is no penalty for switching between x87 FPU and
3DNow!/MMX instructions in the AMD Athlon processor, the
FEMMS instruction should be used to ensure the same code
also runs optimally on AMD-K6® family processors. The
FEMMS instruction is supported for backward compatibility
with AMD-K6 family processors, and is aliased to the EMMS
instruction.
3DNow! and MMX instructions are designed to be used
concurrently with no switching issues. Likewise, enhanced
3DNow! instructions can be used simultaneously with MMX
instructions. However, x87 and 3DNow! instructions share the
same architectural registers so there is no easy way to use them
concurrently without cleaning up the register file in between
using FEMMS/EMMS.
Use 3DNow!™ Instructions for Fast Division
3DNow! instructions can be used to compute a very fast, highly
accurate reciprocal or quotient.
Optimized 14-Bit Precision Divide
This divide operation executes with a total latency of seven
cycles, assuming that the program hides the latency of the first
MOVD/MOVQ instructions within preceding code.
Example:
MOVD  MM0, [MEM]   ;  0  |  W
PFRCP MM0, MM0     ; 1/W | 1/W  (approximate)
MOVQ  MM2, [MEM]   ;  Y  |  X
PFMUL MM2, MM0     ; Y/W | X/W
Optimized Full 24-Bit Precision Divide
This divide operation executes with a total latency of 15 cycles,
assuming that the program hides the latency of the first
MOVD/MOVQ instructions within preceding code.
Example:
MOVD      MM0, [W]     ;  0  |  W
PFRCP     MM1, MM0     ; 1/W | 1/W  (approximate)
PUNPCKLDQ MM0, MM0     ;  W  |  W   (MMX instr.)
PFRCPIT1  MM0, MM1     ; 1/W | 1/W  (refine)
MOVQ      MM2, [X_Y]   ;  Y  |  X
PFRCPIT2  MM0, MM1     ; 1/W | 1/W  (final)
PFMUL     MM2, MM0     ; Y/W | X/W
Pipelined Pair of 24-Bit Precision Divides
This divide operation executes with a total latency of 21 cycles,
assuming that the program hides the latency of the first
MOVD/MOVQ instructions within preceding code.
Example:
MOVQ      MM0, [DIVISORS]   ;  y  |  x
PFRCP     MM1, MM0          ; 1/x | 1/x  (approximate)
MOVQ      MM2, MM0          ;  y  |  x
PUNPCKHDQ MM0, MM0          ;  y  |  y
PFRCP     MM0, MM0          ; 1/y | 1/y  (approximate)
PUNPCKLDQ MM1, MM0          ; 1/y | 1/x  (approximate)
MOVQ      MM0, [DIVIDENDS]  ;  z  |  w
PFRCPIT1  MM2, MM1          ; 1/y | 1/x  (intermediate)
PFRCPIT2  MM2, MM1          ; 1/y | 1/x  (final)
PFMUL     MM0, MM2          ; z/y | w/x
Newton-Raphson Reciprocal
Consider the quotient q = a/b. An (on-chip) ROM-based table
lookup can be used to quickly produce a 14-to-15-bit precision
approximation of 1/b using just one PFRCP instruction. A full
24-bit precision reciprocal can then be quickly computed from
this approximation using a Newton-Raphson algorithm.
The general Newton-Raphson recurrence for the reciprocal is as
follows:
Z_{i+1} = Z_i • (2 - b • Z_i)
Given that the initial approximation is accurate to at least 14
bits, and that a full IEEE single-precision mantissa contains 24
bits, just one Newton-Raphson iteration is required. The
following sequence shows the 3DNow! instructions that produce
the initial reciprocal approximation, compute the full precision
reciprocal from the approximation, and finally, complete the
desired divide of a/b.
X0 = PFRCP(b)
X1 = PFRCPIT1(b,X0)
X2 = PFRCPIT2(X1,X0)
q  = PFMUL(a,X2)
The 24-bit final reciprocal value is X2. In the AMD Athlon
processor 3DNow! technology implementation, the operand X2
contains the correct round-to-nearest single-precision
reciprocal for approximately 99% of all arguments.
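The effect of a single Newton-Raphson step can be seen in a scalar C model. This is only a sketch of the recurrence: recip_seed is a hypothetical stand-in that cuts a reciprocal down to roughly 14 to 15 bits and does not reproduce the processor's ROM table, and recip_refine and recip_error are likewise illustrative names. The model is not bit-exact with PFRCPIT1 and PFRCPIT2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PFRCP: a reciprocal with the low 9 mantissa bits cleared. */
static float recip_seed(float b)
{
    float z;
    uint32_t bits;

    assert(b != 0.0f);
    z = 1.0f / b;
    memcpy(&bits, &z, sizeof bits);
    bits &= 0xFFFFFE00u;               /* keep 14 of the 23 mantissa bits */
    memcpy(&z, &bits, sizeof z);
    return z;
}

/* One Newton-Raphson step: Z1 = Z0 * (2 - b * Z0). */
static float recip_refine(float b, float z0)
{
    return z0 * (2.0f - b * z0);
}

/* Relative error of z as an approximation of 1/b, measured in double. */
static double recip_error(float b, float z)
{
    double e = (double)b * (double)z - 1.0;
    return e < 0.0 ? -e : e;
}
```

Because the step roughly squares the relative error, a seed that is good to about 2^-14 is refined to the limit of single precision in one iteration, which is why only one PFRCPIT1/PFRCPIT2 pair is needed.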
Use 3DNow!™ Instructions for Fast Square Root and Reciprocal Square Root
3DNow! instructions can be used to compute a very fast, highly
accurate square root and reciprocal square root.
Optimized 15-Bit Precision Square Root
This square root operation can be executed in only 7 cycles,
assuming a program hides the latency of the first MOVD
instruction within previous code. The reciprocal square root
operation requires four fewer cycles than the square root
operation.
Example:
MOVD      MM0, [MEM]   ;     0     |     a
PFRSQRT   MM1, MM0     ; 1/sqrt(a) | 1/sqrt(a)  (approximate)
PUNPCKLDQ MM0, MM0     ;     a     |     a      (MMX instr.)
PFMUL     MM0, MM1     ;  sqrt(a)  |  sqrt(a)
Optimized 24-Bit Precision Square Root
This square root operation can be executed in only 19 cycles,
assuming a program hides the latency of the first MOVD
instruction within previous code. The reciprocal square root
operation requires four fewer cycles than the square root
operation.
Example:
MOVD      MM0, [MEM]   ;     0     |     a
PFRSQRT   MM1, MM0     ; 1/sqrt(a) | 1/sqrt(a)  (approx.)
MOVQ      MM2, MM1     ; X_0 = 1/(sqrt a)       (approx.)
PFMUL     MM1, MM1     ; X_0 * X_0 | X_0 * X_0  (step 1)
PUNPCKLDQ MM0, MM0     ;     a     |     a      (MMX instr)
PFRSQIT1  MM1, MM0     ; (intermediate)         (step 2)
PFRCPIT2  MM1, MM2     ; 1/sqrt(a) | 1/sqrt(a)  (step 3)
PFMUL     MM0, MM1     ;  sqrt(a)  |  sqrt(a)
Newton-Raphson Reciprocal Square Root
The general Newton-Raphson reciprocal square root recurrence
is:
Z_{i+1} = 1/2 • Z_i • (3 - b • Z_i^2)
To reduce the number of iterations, the initial approximation is
read from a table. The 3DNow! reciprocal square root
approximation is accurate to at least 15 bits. Accordingly, to
obtain a single-precision 24-bit reciprocal square root of an
input operand b, one Newton-Raphson iteration is required,
using the following sequence of 3DNow! instructions:
X0 = PFRSQRT(b)
X1 = PFMUL(X0,X0)
X2 = PFRSQIT1(b,X1)
X3 = PFRCPIT2(X2,X0)
X4 = PFMUL(b,X3)
The 24-bit final reciprocal square root value is X3. In the
AMD Athlon processor 3DNow! implementation, the estimate
contains the correct round-to-nearest value for approximately
87% of all arguments. The remaining arguments differ from the
correct round-to-nearest value by one unit-in-the-last-place. The
square root (X4) is formed in the last step by multiplying by the
input operand b.
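The recurrence and the final multiply can be written as a scalar C model. This is a sketch under stated assumptions: the seed z0 is supplied by the caller instead of coming from PFRSQRT, the names rsqrt_refine and sqrt_from_rsqrt are hypothetical, and the arithmetic is ordinary single precision, not a bit-exact model of PFRSQIT1 and PFRCPIT2.

```c
#include <assert.h>

/* One Newton-Raphson step for 1/sqrt(b): Z1 = 1/2 * Z0 * (3 - b * Z0^2). */
static float rsqrt_refine(float b, float z0)
{
    assert(b > 0.0f);
    return 0.5f * z0 * (3.0f - b * z0 * z0);
}

/* Square root from the reciprocal square root: sqrt(b) = b * (1/sqrt(b)). */
static float sqrt_from_rsqrt(float b, float z)
{
    return b * z;
}
```

For example, with b = 4 and a seed of 0.5003 (relative error about 6 * 10^-4), one step brings the estimate to within about 10^-6 of 0.5.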
Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
The MMX PMADDWD instruction can be used to perform two
signed 16x16→32 bit multiplies in parallel, with much higher
performance than can be achieved using the IMUL instruction.
The PMADDWD instruction is designed to perform four
16x16→32 bit signed multiplies and accumulate the results
pairwise. By making one of the results in a pair a zero, there are
now just two multiplies. The following example shows how to
multiply 16-bit signed numbers a,b,c,d into signed 32-bit
products a×c and b×d:
Example:
PXOR      MM2, MM2    ;  0  |  0
MOVD      MM0, [ab]   ; 0 0 | b a
MOVD      MM1, [cd]   ; 0 0 | d c
PUNPCKLWD MM0, MM2    ; 0 b | 0 a
PUNPCKLWD MM1, MM2    ; 0 d | 0 c
PMADDWD   MM0, MM1    ; b*d | a*c
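The reason the zero interleave works is easier to see in a scalar model. The following C sketch is illustrative only: pmaddwd_model and two_products are hypothetical names, arrays are indexed from the least significant word, and the model ignores the one wraparound case of PMADDWD (all four inputs equal to 8000h), which cannot occur once every second word is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PMADDWD: four signed 16x16 products, summed pairwise. */
static void pmaddwd_model(const int16_t x[4], const int16_t y[4],
                          int32_t out[2])
{
    out[0] = (int32_t)x[0] * y[0] + (int32_t)x[1] * y[1];
    out[1] = (int32_t)x[2] * y[2] + (int32_t)x[3] * y[3];
}

/* Interleave the inputs with zeros (PUNPCKLWD against a zeroed register),
   so each pairwise sum contains exactly one nonzero product. */
static void two_products(int16_t a, int16_t b, int16_t c, int16_t d,
                         int32_t out[2])
{
    const int16_t mm0[4] = { a, 0, b, 0 };
    const int16_t mm1[4] = { c, 0, d, 0 };
    pmaddwd_model(mm0, mm1, out);    /* out[0] = a*c, out[1] = b*d */
}
```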
3DNow!™ and MMX™ Intra-Operand Swapping
AMD Athlon™ Specific Code
If the swapping of MMX register halves is necessary, use the
PSWAPD instruction, which is a new AMD Athlon 3DNow! DSP
extension. Use of this instruction should only be for
AMD Athlon specific code. “PSWAPD MMreg1, MMreg2”
performs the following operation:
mmreg1[63:32] = mmreg2[31:0]
mmreg1[31:0] = mmreg2[63:32]
See the AMD Extensions to the 3DNow! and MMX Instruction Set
Manual, order #22466 for more usage information.
Blended Code
Otherwise, for blended code, which needs to run well on
AMD-K6 and AMD Athlon family processors, the following code
is recommended:
Example 1 (Preferred, faster):
;MM1 = SWAP (MM0), MM0 destroyed
MOVQ      MM1, MM0    ;make a copy
PUNPCKLDQ MM0, MM0    ;duplicate lower half
PUNPCKHDQ MM1, MM0    ;combine lower halves
Example 2 (Preferred, fast):
;MM1 = SWAP (MM0), MM0 preserved
MOVQ      MM1, MM0    ;make a copy
PUNPCKHDQ MM1, MM1    ;duplicate upper half
PUNPCKLDQ MM1, MM0    ;combine upper halves
Both examples accomplish the swapping, but the first example
should be used if the original contents of the register do not
need to be preserved. The first example is faster due to the fact
that the MOVQ and PUNPCKLDQ instructions can execute in
parallel. The instructions in the second example are dependent
on one another and take longer to execute.
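Both blended sequences can be traced with a scalar C model of the two unpack instructions. This is a sketch for following the data movement only; the function names are hypothetical and a 64-bit integer stands in for an MMX register.

```c
#include <assert.h>
#include <stdint.h>

/* PUNPCKLDQ dst, src: dst[63:32] = src[31:0], dst[31:0] unchanged. */
static uint64_t punpckldq(uint64_t dst, uint64_t src)
{
    return (src << 32) | (dst & 0xFFFFFFFFu);
}

/* PUNPCKHDQ dst, src: dst[31:0] = dst[63:32], dst[63:32] = src[63:32]. */
static uint64_t punpckhdq(uint64_t dst, uint64_t src)
{
    return (src & 0xFFFFFFFF00000000ULL) | (dst >> 32);
}

/* Example 1: MM0 is destroyed; the first two steps are independent. */
static uint64_t swap_example1(uint64_t mm0)
{
    uint64_t mm1 = mm0;              /* MOVQ      MM1, MM0 */
    mm0 = punpckldq(mm0, mm0);       /* PUNPCKLDQ MM0, MM0 */
    return punpckhdq(mm1, mm0);      /* PUNPCKHDQ MM1, MM0 */
}

/* Example 2: MM0 is preserved; each step depends on the previous one. */
static uint64_t swap_example2(uint64_t mm0)
{
    uint64_t mm1 = mm0;              /* MOVQ      MM1, MM0 */
    mm1 = punpckhdq(mm1, mm1);       /* PUNPCKHDQ MM1, MM1 */
    return punpckldq(mm1, mm0);      /* PUNPCKLDQ MM1, MM0 */
}

/* PSWAPD for comparison: exchange the two 32-bit halves directly. */
static uint64_t pswapd_model(uint64_t mm)
{
    assert(sizeof mm == 8);
    return (mm << 32) | (mm >> 32);
}
```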
Fast Conversion of Signed Words to Floating-Point
In many applications there is a need to quickly convert data
consisting of packed 16-bit signed integers into floating-point
numbers. The following two examples show how this can be
accomplished efficiently on AMD processors.
The first example shows how to do the conversion on a processor
that supports AMD's 3DNow! extensions, such as the
AMD Athlon processor. It demonstrates the increased
efficiency from using the PI2FW instruction. Use of this
instruction should only be for AMD Athlon processor specific
code. See the AMD Extensions to the 3DNow!™ and MMX™
Instruction Set Manual, order #22466 for more information on
this instruction.
The second example demonstrates how to accomplish the same
task in blended code that achieves good performance on the
AMD Athlon processor as well as on the AMD-K6 family
processors that support 3DNow! technology.
Example 1 (AMD Athlon specific code using 3DNow! DSP extension):
MOVD      MM0, [packed_sword]   ;0 0 | b a
PUNPCKLWD MM0, MM0              ;b b | a a
PI2FW     MM0, MM0              ;xb=float(b) | xa=float(a)
MOVQ      [packed_float], MM0   ;store xb | xa
Example 2 (AMD-K6 Family and AMD Athlon processor blended code):
MOVD      MM1, [packed_sword]   ;0 0 | b a
PXOR      MM0, MM0              ;0 0 | 0 0
PUNPCKLWD MM0, MM1              ;b 0 | a 0
PSRAD     MM0, 16               ;sign extend: b | a
PI2FD     MM0, MM0              ;xb=float(b) | xa=float(a)
MOVQ      [packed_float], MM0   ;store xb | xa
Use MMX™ PXOR to Negate 3DNow!™ Data
For both the AMD Athlon and AMD-K6 processors, it is
recommended that code use the MMX PXOR instruction to
change the sign bit of 3DNow! operands instead of the 3DNow!
PFMUL instruction. On the AMD Athlon processor, using
PXOR allows for more parallelism, as it can execute in either
the FADD or FMUL pipes. PXOR has an execution latency of
two, but because it is an MMX instruction, there is an initial
one cycle bypassing penalty, and another one cycle penalty if
the result goes to a 3DNow! operation. The PFMUL execution
latency is four; therefore, in the worst case, the PXOR and
PFMUL instructions are the same in terms of latency. On the
AMD-K6 processor, there is only a one cycle latency for PXOR,
versus a two cycle latency for the 3DNow! PFMUL instruction.
Use the following code to negate 3DNow! data:
msgn  DQ 8000000080000000h
PXOR  MM0, [msgn]     ;toggle sign bit
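The sign-bit toggle works because an IEEE-754 single-precision value is negated by flipping bit 31 alone. A scalar C sketch of the same operation on one element follows; the name negate_via_xor is hypothetical, and the sketch assumes a 32-bit IEEE-754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Negate x by toggling its sign bit, as PXOR with 80000000h does per element. */
static float negate_via_xor(float x)
{
    uint32_t bits;

    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    bits ^= 0x80000000u;             /* flip the sign bit only */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```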
Use MMX™ PCMP Instead of 3DNow!™ PFCMP
Use the MMX PCMP instruction instead of the 3DNow! PFCMP
instruction. On the AMD Athlon processor, the PCMP has a
latency of two cycles while the PFCMP has a latency of four
cycles. In addition to the shorter latency, PCMP can be issued to
either the FADD or the FMUL pipe, while PFCMP is restricted
to the FADD pipe.
Note: The PFCMP instruction has a ‘GE’ (greater or equal)
version (PFCMPGE) that is missing from PCMP.
Both Numbers Positive
If both arguments are positive, PCMP always works.
One Negative, One Positive
If one number is negative and the other is positive, PCMP still
works, except when one number is a positive zero and the other
is a negative zero.
Both Numbers Negative
Be careful when performing integer comparison using PCMPGT
on two negative 3DNow! numbers. The result is the inverse of
the PFCMPGT floating-point comparison. For example:
-2 = C0000000
-4 = C0800000
PCMPGT gives C0800000 > C0000000, but -4 < -2. To address
this issue, simply reverse the comparison by swapping the
source operands.
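The three cases can be summarized in a scalar C model that compares the bit patterns as signed integers, the way PCMPGTD does. This is an illustrative sketch: float_bits and gt_via_int are hypothetical names, a 32-bit IEEE-754 float is assumed, and the mixed-sign zero case described above is deliberately left unhandled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a signed 32-bit integer. */
static int32_t float_bits(float x)
{
    int32_t i;

    assert(sizeof i == sizeof x);
    memcpy(&i, &x, sizeof i);
    return i;
}

/* a > b decided by a signed integer compare of the bit patterns.
   When both values are negative the operands are swapped. */
static int gt_via_int(float a, float b)
{
    int32_t ia = float_bits(a), ib = float_bits(b);
    if (ia < 0 && ib < 0)
        return ib > ia;              /* both negative: reversed compare */
    return ia > ib;
}
```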
Use MMX™ Instructions for Block Copies and Block Fills
For moving or filling small blocks of data (e.g., less than 512
bytes) between cacheable memory areas, the REP MOVS and
REP STOS families of instructions deliver good performance
and are straightforward to use. For moving and filling larger
blocks of data, or to move/fill blocks of data where the
destination is in non-cacheable space, it is recommended to
make use of MMX instructions and MMX extensions. The
following examples all use quadword-aligned blocks of data. In
cases where memory blocks are not quadword aligned,
additional code is required to handle end cases as needed.
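The loop structure used by the examples in this section (64 bytes per iteration, block size shifted right by 6) can be sketched in C together with one possible treatment of the end case. The function name block_copy_q is hypothetical, and the sketch assumes quadword-aligned pointers and a size that is a multiple of 8 bytes; it illustrates the iteration count and tail handling, not the performance of the MMX code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 'bytes' bytes in 64-byte chunks, then finish the leftover quadwords. */
static void block_copy_q(uint64_t *dst, const uint64_t *src, size_t bytes)
{
    size_t chunks = bytes >> 6;          /* same count as SHR ECX, 6 */
    size_t tail   = (bytes & 63) >> 3;   /* quadwords not covered by chunks */

    assert((bytes & 7) == 0);
    while (chunks--) {
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
        dst += 8;
        src += 8;
    }
    while (tail--)
        *dst++ = *src++;
}
```

Note that the assembly examples that follow perform only the chunked part, so a block whose size is not a multiple of 64 bytes needs a tail loop of this kind.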
AMD-K6® and AMD Athlon™ Processor Blended Code
The following example code, written for the inline assembler of
Microsoft Visual C, is suitable for moving/filling a large
quadword-aligned block of data in the following situations:
■ Blended code, i.e., code that needs to perform well on both
AMD Athlon and AMD-K6 family processors
■ AMD Athlon processor specific code where the destination
is in cacheable memory and immediate data re-use of the
data at the destination is expected
■ AMD-K6 family specific code where the destination is in
non-cacheable memory
Example 1:
/* block copy (source and destination QWORD aligned) */
__asm {
   mov   eax, [src_ptr]
   mov   edx, [dst_ptr]
   mov   ecx, [blk_size]
   shr   ecx, 6
   align 16
$xfer:
   movq  mm0, [eax]
   add   edx, 64
   movq  mm1, [eax+8]
   add   eax, 64
   movq  mm2, [eax-48]
   movq  [edx-64], mm0
   movq  mm0, [eax-40]
   movq  [edx-56], mm1
   movq  mm1, [eax-32]
   movq  [edx-48], mm2
   movq  mm2, [eax-24]
   movq  [edx-40], mm0
   movq  mm0, [eax-16]
   movq  [edx-32], mm1
   movq  mm1, [eax-8]
   movq  [edx-24], mm2
   movq  [edx-16], mm0
   dec   ecx
   movq  [edx-8], mm1
   jnz   $xfer
   femms
}
/* block fill (destination QWORD aligned) */
__asm {
   mov   edx, [dst_ptr]
   mov   ecx, [blk_size]
   shr   ecx, 6
   movq  mm0, [fill_data]
   align 16
$fill:
   movq  [edx], mm0
   movq  [edx+8], mm0
   movq  [edx+16], mm0
   movq  [edx+24], mm0
   movq  [edx+32], mm0
   movq  [edx+40], mm0
   add   edx, 64
   movq  [edx-16], mm0
   dec   ecx
   movq  [edx-8], mm0
   jnz   $fill
   femms
}
AMD Athlon™ Processor Specific Code
The following example code, written for the inline assembler of
Microsoft Visual C, is suitable for moving/filling a
quadword-aligned block of data in the following situations:
■ AMD Athlon processor specific code where the destination
of the block copy is in non-cacheable memory space
■ AMD Athlon processor specific code where the destination
of the block copy is in cacheable space, but no immediate
data re-use of the data at the destination is expected.
Example 2:
/* block copy (source and destination QWORD aligned) */
__asm {
    mov     eax, [src_ptr]
    mov     edx, [dst_ptr]
    mov     ecx, [blk_size]
    shr     ecx, 6
    align 16
$xfer_nc:
    prefetchnta [eax+256]
    movq    mm0, [eax]
    add     edx, 64
    movq    mm1, [eax+8]
    add     eax, 64
    movq    mm2, [eax-48]
    movntq  [edx-64], mm0
    movq    mm0, [eax-40]
    movntq  [edx-56], mm1
    movq    mm1, [eax-32]
    movntq  [edx-48], mm2
    movq    mm2, [eax-24]
    movntq  [edx-40], mm0
    movq    mm0, [eax-16]
    movntq  [edx-32], mm1
    movq    mm1, [eax-8]
    movntq  [edx-24], mm2
    movntq  [edx-16], mm0
    dec     ecx
    movntq  [edx-8], mm1
    jnz     $xfer_nc
    femms
    sfence
}
/* block fill (destination QWORD aligned) */
__asm {
    mov     edx, [dst_ptr]
    mov     ecx, [blk_size]
    shr     ecx, 6
    movq    mm0, [fill_data]
    align 16
$fill_nc:
    movntq  [edx], mm0
    movntq  [edx+8], mm0
    movntq  [edx+16], mm0
    movntq  [edx+24], mm0
    movntq  [edx+32], mm0
    movntq  [edx+40], mm0
    movntq  [edx+48], mm0
    movntq  [edx+56], mm0
    add     edx, 64
    dec     ecx
    jnz     $fill_nc
    femms
    sfence
}
Use MMX™ PXOR to Clear All Bits in an MMX™ Register
To clear all the bits in an MMX register to zero, use:
    PXOR    MMreg, MMreg
Note that PXOR MMreg, MMreg is dependent on previous
writes to MMreg. Therefore, using PXOR in the manner
described can lengthen dependency chains, which in turn
may lead to reduced performance. An alternative in such cases
is to use:
    zero    DD 0
    MOVD    MMreg, DWORD PTR [zero]
i.e., to load a zero from a statically initialized and properly
aligned memory location. However, loading the data from
memory runs the risk of cache misses. Cases where MOVD is
superior to PXOR are therefore rare and PXOR should be used
in general.
Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register
To set all the bits in an MMX register to one, use:
    PCMPEQD MMreg, MMreg
Note that PCMPEQD MMreg, MMreg is dependent on previous
writes to MMreg. Therefore, using PCMPEQD in the manner
described can lengthen dependency chains, which in turn
may lead to reduced performance. An alternative in such cases
is to use:
    ones    DQ 0FFFFFFFFFFFFFFFFh
    MOVQ    MMreg, QWORD PTR [ones]
i.e., to load a quadword of 0xFFFFFFFFFFFFFFFF from a
statically initialized and properly aligned memory location.
However, loading the data from memory runs the risk of cache
misses. Cases where MOVQ is superior to PCMPEQD are
therefore rare and PCMPEQD should be used in general.
Use MMX™ PAND to Find Absolute Value in 3DNow!™ Code
Use the following to compute the absolute value of 3DNow!
floating-point operands:
    mabs    DQ 7FFFFFFF7FFFFFFFh
    PAND    MM0, [mabs]            ;mask out sign bit
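As a scalar illustration of why the mask works, the following C sketch clears bit 31 of one single-precision value, which is what the PAND above does to each of the two packed elements. The function name abs_by_mask is hypothetical, and the sketch assumes IEEE-754 single-precision floats.

```c
#include <stdint.h>
#include <string.h>

/* Clears bit 31 (the sign bit) of a single-precision value, which is
   the per-element effect of PAND with the mask 7FFFFFFFh. */
float abs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bits */
    bits &= 0x7FFFFFFFu;              /* mask out sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```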
Optimized Matrix Multiplication
The multiplication of a 4x4 matrix with a 4x1 vector is
commonly used in 3D graphics for geometry transformation.
This routine serves to translate, scale, rotate, and apply
perspective to 3D coordinates represented in homogeneous
coordinates. The following code sample is a 3DNow! optimized,
general 3D vertex transformation routine that completes in 16
cycles on the AMD Athlon processor:
/* Function XForm performs a fully generalized 3D transform on an array
   of vertices pointed to by "v" and stores the transformed vertices in
   the location pointed to by "res". Each vertex consists of four floats.
   The 4x4 transform matrix is pointed to by "m". The matrix elements are
   also floats. The argument "numverts" indicates how many vertices have
   to be transformed. The computation performed for each vertex is:

   res->x = v->x*m[0][0] + v->y*m[1][0] + v->z*m[2][0] + v->w*m[3][0]
   res->y = v->x*m[0][1] + v->y*m[1][1] + v->z*m[2][1] + v->w*m[3][1]
   res->z = v->x*m[0][2] + v->y*m[1][2] + v->z*m[2][2] + v->w*m[3][2]
   res->w = v->x*m[0][3] + v->y*m[1][3] + v->z*m[2][3] + v->w*m[3][3]
*/

#define M00  0
#define M01  4
#define M02  8
#define M03 12
#define M10 16
#define M11 20
#define M12 24
#define M13 28
#define M20 32
#define M21 36
#define M22 40
#define M23 44
#define M30 48
#define M31 52
#define M32 56
#define M33 60

void XForm (float *res, const float *v, const float *m, int numverts)
{
  _asm {
    MOV     EDX, [v]         ;EDX = source vector ptr
    MOV     EAX, [m]         ;EAX = matrix ptr
    MOV     EBX, [res]       ;EBX = destination vector ptr
    MOV     ECX, [numverts]  ;ECX = number of vertices to transform

    ;3DNow! version of fully general 3D vertex transformation.
    ;Optimal for AMD Athlon (completes in 16 cycles)

    FEMMS                    ;clear MMX state
    ALIGN   16               ;for optimal branch alignment

$$xform:
    ADD       EBX, 16                   ;res++
    MOVQ      MM0, QWORD PTR [EDX]      ;v->y | v->x
    MOVQ      MM1, QWORD PTR [EDX+8]    ;v->w | v->z
    ADD       EDX, 16                   ;v++
    MOVQ      MM2, MM0                  ;v->y | v->x
    MOVQ      MM3, QWORD PTR [EAX+M00]  ;m[0][1] | m[0][0]
    PUNPCKLDQ MM0, MM0                  ;v->x | v->x
    MOVQ      MM4, QWORD PTR [EAX+M10]  ;m[1][1] | m[1][0]
    PFMUL     MM3, MM0                  ;v->x*m[0][1] | v->x*m[0][0]
    PUNPCKHDQ MM2, MM2                  ;v->y | v->y
    PFMUL     MM4, MM2                  ;v->y*m[1][1] | v->y*m[1][0]
    MOVQ      MM5, QWORD PTR [EAX+M02]  ;m[0][3] | m[0][2]
    MOVQ      MM7, QWORD PTR [EAX+M12]  ;m[1][3] | m[1][2]
    MOVQ      MM6, MM1                  ;v->w | v->z
    PFMUL     MM5, MM0                  ;v->x*m[0][3] | v->x*m[0][2]
    MOVQ      MM0, QWORD PTR [EAX+M20]  ;m[2][1] | m[2][0]
    PUNPCKLDQ MM1, MM1                  ;v->z | v->z
    PFMUL     MM7, MM2                  ;v->y*m[1][3] | v->y*m[1][2]
    MOVQ      MM2, QWORD PTR [EAX+M22]  ;m[2][3] | m[2][2]
    PFMUL     MM0, MM1                  ;v->z*m[2][1] | v->z*m[2][0]
    PFADD     MM3, MM4                  ;v->x*m[0][1]+v->y*m[1][1] |
                                        ; v->x*m[0][0]+v->y*m[1][0]
    MOVQ      MM4, QWORD PTR [EAX+M30]  ;m[3][1] | m[3][0]
    PFMUL     MM2, MM1                  ;v->z*m[2][3] | v->z*m[2][2]
    PFADD     MM5, MM7                  ;v->x*m[0][3]+v->y*m[1][3] |
                                        ; v->x*m[0][2]+v->y*m[1][2]
    MOVQ      MM1, QWORD PTR [EAX+M32]  ;m[3][3] | m[3][2]
    PUNPCKHDQ MM6, MM6                  ;v->w | v->w
    PFADD     MM3, MM0                  ;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1] |
                                        ; v->x*m[0][0]+v->y*m[1][0]+v->z*m[2][0]
    PFMUL     MM4, MM6                  ;v->w*m[3][1] | v->w*m[3][0]
    PFMUL     MM1, MM6                  ;v->w*m[3][3] | v->w*m[3][2]
    PFADD     MM5, MM2                  ;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3] |
                                        ; v->x*m[0][2]+v->y*m[1][2]+v->z*m[2][2]
    PFADD     MM3, MM4                  ;v->x*m[0][1]+v->y*m[1][1]+v->z*m[2][1]+
                                        ; v->w*m[3][1] | v->x*m[0][0]+v->y*m[1][0]+
                                        ; v->z*m[2][0]+v->w*m[3][0]
    MOVQ      [EBX-16], MM3             ;store res->y | res->x
    PFADD     MM5, MM1                  ;v->x*m[0][3]+v->y*m[1][3]+v->z*m[2][3]+
                                        ; v->w*m[3][3] | v->x*m[0][2]+v->y*m[1][2]+
                                        ; v->z*m[2][2]+v->w*m[3][2]
    MOVQ      [EBX-8], MM5              ;store res->w | res->z
    DEC       ECX                       ;numverts--
    JNZ       $$xform                   ;until numverts == 0

    FEMMS                               ;clear MMX state
  }
}
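For checking the 3DNow! routine against a known-good result, a plain scalar version of the same computation can be useful. The following is a minimal sketch, not part of the original guide: the name xform_ref is hypothetical, and because it sums the four products in a different order than the packed code, results may differ in the last bit for inputs that are not exactly representable.

```c
/* Scalar reference for the computation described in the XForm comment.
   m is a row-major 4x4 matrix; each vertex is four floats (x, y, z, w). */
void xform_ref(float *res, const float *v, const float *m, int numverts)
{
    for (int i = 0; i < numverts; i++) {
        for (int col = 0; col < 4; col++) {
            res[col] = v[0] * m[0 * 4 + col]
                     + v[1] * m[1 * 4 + col]
                     + v[2] * m[2 * 4 + col]
                     + v[3] * m[3 * 4 + col];
        }
        res += 4;   /* res++ */
        v += 4;     /* v++ */
    }
}
```

With an identity matrix whose last row holds a translation, for example (10, 20, 30, 1), the vertex (1, 2, 3, 1) maps to (11, 22, 33, 1).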
Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
Clipping is one of the major activities occurring in a 3D
graphics pipeline. In many instances, this activity is split into
two parts which do not necessarily have to occur consecutively:
■ Computation of the clip code for each vertex, where each
  bit of the clip code indicates whether the vertex is outside
  the frustum with regard to a specific clip plane.
■ Examination of the clip code for a vertex and clipping if the
  clip code is non-zero.
The following example shows how to use 3DNow! instructions to
efficiently implement a clip code computation for a frustum
that is defined by:
■ -w <= x <= w
■ -w <= y <= w
■ -w <= z <= w
.DATA
RIGHT   EQU 01h
LEFT    EQU 02h
ABOVE   EQU 04h
BELOW   EQU 08h
BEHIND  EQU 10h
BEFORE  EQU 20h

ALIGN 8
ABOVE_RIGHT     DD RIGHT
                DD ABOVE
BELOW_LEFT      DD LEFT
                DD BELOW
BEHIND_BEFORE   DD BEFORE
                DD BEHIND

.CODE
;; Generalized computation of 3D clip code (out code)
;;
;; Register usage:  IN        MM5   y | x
;;                            MM6   w | z
;;
;;                  OUT       MM2   clip code (out code)
;;
;;                  DESTROYS  MM0,MM1,MM2,MM3,MM4

    PXOR      MM0, MM0                       ; 0 | 0
    MOVQ      MM1, MM6                       ; w | z
    MOVQ      MM4, MM5                       ; y | x
    PUNPCKHDQ MM1, MM1                       ; w | w
    MOVQ      MM3, MM6                       ; w | z
    MOVQ      MM2, MM5                       ; y | x
    PFSUBR    MM3, MM0                       ; -w | -z
    PFSUBR    MM2, MM0                       ; -y | -x
    PUNPCKLDQ MM3, MM6                       ; z | -z
    PFCMPGT   MM4, MM1                       ; y>w?FFFFFFFF:0 | x>w?FFFFFFFF:0
    MOVQ      MM0, QWORD PTR [ABOVE_RIGHT]   ; ABOVE | RIGHT
    PFCMPGT   MM3, MM1                       ; z>w?FFFFFFFF:0 | -z>w?FFFFFFFF:0
    PFCMPGT   MM2, MM1                       ; -y>w?FFFFFFFF:0 | -x>w?FFFFFFFF:0
    MOVQ      MM1, QWORD PTR [BEHIND_BEFORE] ; BEHIND | BEFORE
    PAND      MM4, MM0                       ; y > w ? ABOVE:0 | x > w ? RIGHT:0
    MOVQ      MM0, QWORD PTR [BELOW_LEFT]    ; BELOW | LEFT
    PAND      MM3, MM1                       ; z > w ? BEHIND:0 | -z > w ? BEFORE:0
    PAND      MM2, MM0                       ; -y > w ? BELOW:0 | -x > w ? LEFT:0
    POR       MM2, MM4                       ; BELOW,ABOVE | LEFT,RIGHT
    POR       MM2, MM3                       ; BELOW,ABOVE,BEHIND | LEFT,RIGHT,BEFORE
    MOVQ      MM1, MM2                       ; BELOW,ABOVE,BEHIND | LEFT,RIGHT,BEFORE
    PUNPCKHDQ MM2, MM2                       ; BELOW,ABOVE,BEHIND | BELOW,ABOVE,BEHIND
    POR       MM2, MM1                       ; zclip, yclip, xclip = clip code
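The six comparisons that the 3DNow! code performs in parallel can be written out one by one in C. This is a minimal scalar sketch for reference only; the function name clip_code_ref is hypothetical, and the bit values are the EQU constants from the listing above.

```c
/* Clip-plane bits, matching the EQU constants in the assembly listing. */
enum { RIGHT = 0x01, LEFT = 0x02, ABOVE = 0x04,
       BELOW = 0x08, BEHIND = 0x10, BEFORE = 0x20 };

/* Returns the clip code for one vertex against the frustum
   -w <= x <= w, -w <= y <= w, -w <= z <= w. */
unsigned clip_code_ref(float x, float y, float z, float w)
{
    unsigned code = 0;
    if ( x > w) code |= RIGHT;
    if (-x > w) code |= LEFT;
    if ( y > w) code |= ABOVE;
    if (-y > w) code |= BELOW;
    if ( z > w) code |= BEHIND;
    if (-z > w) code |= BEFORE;
    return code;
}
```

A vertex on a boundary plane, such as (1, 1, 1, 1), yields a clip code of zero because every comparison is strict.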
Use 3DNow!™ PAVGUSB for MPEG-2 Motion Compensation
Use the 3DNow! PAVGUSB instruction for MPEG-2 motion
compensation. The PAVGUSB instruction produces the rounded
averages of the eight unsigned 8-bit integer values in the source
operand (an MMX register or a 64-bit memory location) and the
eight corresponding unsigned 8-bit integer values in the
destination operand (an MMX register). The PAVGUSB
instruction is extremely useful in DVD (MPEG-2) decoding
where motion compensation performs a lot of byte averaging
between and within macroblocks. The PAVGUSB instruction
helps speed up these operations. In addition, PAVGUSB can
free up some registers and make unrolling the averaging loops
possible.
The following code fragment uses original MMX code to
perform averaging between the source macroblock and
destination macroblock:
Example 1 (Avoid):
    MOV     ESI, DWORD PTR Src_MB
    MOV     EDI, DWORD PTR Dst_MB
    MOV     EDX, DWORD PTR SrcStride
    MOV     EBX, DWORD PTR DstStride
    MOVQ    MM7, QWORD PTR [ConstFEFE]
    MOVQ    MM6, QWORD PTR [Const0101]
    MOV     ECX, 16
L1:
    MOVQ    MM0, [ESI]      ;MM0=QWORD1
    MOVQ    MM1, [EDI]      ;MM1=QWORD3
    MOVQ    MM2, MM0
    MOVQ    MM3, MM1
    PAND    MM2, MM6
    PAND    MM3, MM6
    PAND    MM0, MM7        ;MM0 = QWORD1 & 0xfefefefe
    PAND    MM1, MM7        ;MM1 = QWORD3 & 0xfefefefe
    POR     MM2, MM3        ;calculate adjustment
    PSRLQ   MM0, 1          ;MM0 = (QWORD1 & 0xfefefefe)/2
    PSRLQ   MM1, 1          ;MM1 = (QWORD3 & 0xfefefefe)/2
    PAND    MM2, MM6
    PADDB   MM0, MM1        ;MM0 = QWORD1/2 + QWORD3/2 w/o
                            ; adjustment
    PADDB   MM0, MM2        ;add lsb adjustment
    MOVQ    [EDI], MM0
    MOVQ    MM4, [ESI+8]    ;MM4=QWORD2
    MOVQ    MM5, [EDI+8]    ;MM5=QWORD4
    MOVQ    MM2, MM4
    MOVQ    MM3, MM5
    PAND    MM2, MM6
    PAND    MM3, MM6
    PAND    MM4, MM7        ;MM4 = QWORD2 & 0xfefefefe
    PAND    MM5, MM7        ;MM5 = QWORD4 & 0xfefefefe
    POR     MM2, MM3        ;calculate adjustment
    PSRLQ   MM4, 1          ;MM4 = (QWORD2 & 0xfefefefe)/2
    PSRLQ   MM5, 1          ;MM5 = (QWORD4 & 0xfefefefe)/2
    PAND    MM2, MM6
    PADDB   MM4, MM5        ;MM4 = QWORD2/2 + QWORD4/2 w/o
                            ; adjustment
    PADDB   MM4, MM2        ;add lsb adjustment
    MOVQ    [EDI+8], MM4
    ADD     ESI, EDX
    ADD     EDI, EBX
    LOOP    L1
The following code fragment uses the 3DNow! PAVGUSB
instruction to perform averaging between the source
macroblock and destination macroblock:
Example 2 (Preferred):
    MOV     EAX, DWORD PTR Src_MB
    MOV     EDI, DWORD PTR Dst_MB
    MOV     EDX, DWORD PTR SrcStride
    MOV     EBX, DWORD PTR DstStride
    MOV     ECX, 16
L1:
    MOVQ    MM0, [EAX]      ;MM0=QWORD1
    MOVQ    MM1, [EAX+8]    ;MM1=QWORD2
    PAVGUSB MM0, [EDI]      ;(QWORD1 + QWORD3)/2 with
                            ; adjustment
    PAVGUSB MM1, [EDI+8]    ;(QWORD2 + QWORD4)/2 with
                            ; adjustment
    ADD     EAX, EDX
    MOVQ    [EDI], MM0
    MOVQ    [EDI+8], MM1
    ADD     EDI, EBX
    LOOP    L1
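To see that the two examples compute the same value, the per-byte arithmetic can be written out in C. The sketch below is illustrative only and the function names are hypothetical: avg_round is the rounded average (a + b + 1) >> 1 that PAVGUSB produces for each byte, and avg_round_masked rebuilds it from the mask, shift, and lsb adjustment steps of Example 1.

```c
#include <stdint.h>

/* Rounded average of one pair of unsigned bytes: (a + b + 1) >> 1. */
uint8_t avg_round(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* The same result from masks and shifts only, as in Example 1. */
uint8_t avg_round_masked(uint8_t a, uint8_t b)
{
    uint8_t half_a = (uint8_t)((a & 0xFE) >> 1);  /* (a & 0xfe)/2 */
    uint8_t half_b = (uint8_t)((b & 0xFE) >> 1);  /* (b & 0xfe)/2 */
    uint8_t lsb    = (uint8_t)((a | b) & 0x01);   /* lsb adjustment */
    return (uint8_t)(half_a + half_b + lsb);
}

/* Applies the byte average to eight packed bytes (one quadword). */
void pavgusb_ref(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = avg_round(dst[i], src[i]);
}
```

The masking in Example 1 exists only because MMX has no per-byte shift; clearing the low bit of each byte keeps the quadword shift from leaking bits across byte boundaries.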
Stream of Packed Unsigned Bytes
The following code is an example of how to process a stream of
packed unsigned bytes (like RGBA information) with faster
3DNow! instructions.
Example:
outside loop:
    PXOR      MM0, MM0
inside loop:
    MOVD      MM1, [VAR]    ;        0      | v[3],v[2],v[1],v[0]
    PUNPCKLBW MM1, MM0      ;0,v[3],0,v[2]  | 0,v[1],0,v[0]
    MOVQ      MM2, MM1      ;0,v[3],0,v[2]  | 0,v[1],0,v[0]
    PUNPCKLWD MM1, MM0      ;  0,0,0,v[1]   | 0,0,0,v[0]
    PUNPCKHWD MM2, MM0      ;  0,0,0,v[3]   | 0,0,0,v[2]
    PI2FD     MM1, MM1      ; float(v[1])   | float(v[0])
    PI2FD     MM2, MM2      ; float(v[3])   | float(v[2])
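The effect of the unpack and convert sequence on one group of four bytes can be stated in C as follows. This is a minimal scalar sketch with a hypothetical function name; it shows what the sequence computes, not how fast it does so.

```c
#include <stdint.h>

/* Scalar equivalent of the sequence above: zero-extend each of four
   packed unsigned bytes to 32 bits, then convert to float. */
void bytes_to_floats(const uint8_t v[4], float out[4])
{
    for (int i = 0; i < 4; i++) {
        int32_t widened = (int32_t)v[i];  /* unpack against a zero register */
        out[i] = (float)widened;          /* PI2FD: integer to float */
    }
}
```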
Complex Number Arithmetic
Complex numbers have a “real” part and an “imaginary” part.
Multiplying complex numbers (e.g., 3 + 4i) is an integral part of
many algorithms such as Discrete Fourier Transform (DFT) and
complex FIR filters. Complex number multiplication is shown
below:
(src0.real + src0.imag) * (src1.real + src1.imag) = result
result = (result.real + result.imag)
result.real <= src0.real*src1.real - src0.imag*src1.imag
result.imag <= src0.real*src1.imag + src0.imag*src1.real
Example:
(1+2i) * (3+4i) => result.real + result.imag
result.real <= 1*3 - 2*4 = -5
result.imag <= 1*4i + 2i*3 = 10i
result = -5 +10i
Assuming that complex numbers are represented as two
element vectors [v.real, v.imag], one can see the need for
swapping the elements of src1 to perform the multiplies for
result.imag, and the need for a mixed positive/negative
accumulation to complete the parallel computation of
result.real and result.imag.
PSWAPD performs the swapping of elements for src1 and
PFPNACC performs the mixed positive/negative accumulation
to complete the computation. The code example below
summarizes the computation of a complex number multiply.
Example:
    ;MM0 = s0.imag | s0.real                ;reg_hi | reg_lo
    ;MM1 = s1.imag | s1.real
    PSWAPD  MM2, MM0    ;MM2 = s0.real | s0.imag
    PFMUL   MM0, MM1    ;MM0 = s0.imag*s1.imag | s0.real*s1.real
    PFMUL   MM1, MM2    ;MM1 = s0.real*s1.imag | s0.imag*s1.real
    PFPNACC MM0, MM1    ;MM0 = res.imag | res.real
PSWAPD supports independent source and result operands, which
enables it to also perform a copy function. In the above
example, this eliminates the need for a separate “MOVQ MM2,
MM0” instruction.
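The four-instruction sequence can be traced in C by modeling each register as a pair of floats. The sketch below is illustrative: the pair type and the helper names are hypothetical stand-ins for the instructions, with lo holding the real part and hi the imaginary part.

```c
/* One 64-bit 3DNow! register modeled as two floats. */
typedef struct { float lo, hi; } pair;   /* lo = real, hi = imag */

/* PSWAPD: result = source with its two elements exchanged. */
pair pswapd(pair a)          { pair r = { a.hi, a.lo }; return r; }

/* PFMUL: element-wise multiply. */
pair pfmul(pair a, pair b)   { pair r = { a.lo * b.lo, a.hi * b.hi }; return r; }

/* PFPNACC: low = dst.lo - dst.hi, high = src.lo + src.hi. */
pair pfpnacc(pair d, pair s) { pair r = { d.lo - d.hi, s.lo + s.hi }; return r; }

/* Complex multiply following the instruction order of the example. */
pair complex_mul(pair s0, pair s1)
{
    pair mm2 = pswapd(s0);        /* s0.real | s0.imag */
    pair mm0 = pfmul(s0, s1);     /* s0.imag*s1.imag | s0.real*s1.real */
    pair mm1 = pfmul(s1, mm2);    /* s0.real*s1.imag | s0.imag*s1.real */
    return pfpnacc(mm0, mm1);     /* res.imag | res.real */
}
```

For the worked example (1+2i) * (3+4i), this yields lo = -5 and hi = 10.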
11
General x86 Optimization Guidelines
This chapter describes general code optimization techniques
specific to superscalar processors (that is, techniques common
to the AMD-K6® processor, AMD Athlon™ processor, and
Pentium® family processors). In general, all optimization
techniques used for the AMD-K6 processor, Pentium, and
Pentium Pro processors either improve the performance of the
AMD Athlon processor or are not required and have a neutral
effect (usually due to fewer coding restrictions with the
AMD Athlon processor).
Short Forms
Use shorter forms of instructions to increase the effective
number of instructions that can be examined for decoding at
any one time. Use 8-bit displacements and jump offsets where
possible.
Example 1 (Avoid):
    CMP     REG, 0
Example 2 (Preferred):
    TEST    REG, REG
Although both of these instructions have an execute latency of
one, fewer opcode bytes need to be examined by the decoders
for the TEST instruction.
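The size difference can be made concrete by writing out the machine code. The sketch below assumes REG is EAX with 32-bit operand size and that the assembler picks the sign-extended 8-bit immediate form of CMP; the array and function names are hypothetical.

```c
#include <stddef.h>

/* IA-32 encodings with REG = EAX (32-bit operand size). */
const unsigned char cmp_eax_0[]    = { 0x83, 0xF8, 0x00 };  /* CMP EAX, 0 (imm8 form) */
const unsigned char test_eax_eax[] = { 0x85, 0xC0 };        /* TEST EAX, EAX */

/* Number of opcode bytes the TEST form saves over the CMP form. */
size_t bytes_saved(void)
{
    return sizeof cmp_eax_0 - sizeof test_eax_eax;
}
```

The saving would be larger if an assembler chose a full 32-bit immediate for the CMP.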
Dependencies
Spread out true dependencies to increase the opportunities for
parallel execution. Anti-dependencies and output
dependencies do not impact performance.
Register Operands
Maintain frequently used values in registers rather than in
memory. This technique avoids the comparatively long latencies
for accessing memory.
Stack Allocation
When allocating space for local variables and/or outgoing
parameters within a procedure, adjust the stack pointer and
use moves rather than pushes. This method of allocation allows
random access to the outgoing parameters so that they can be
set up when they are calculated instead of being held
somewhere else until the procedure call. In addition, this
method reduces ESP dependencies and uses fewer execution
resources.
Appendix A
AMD Athlon™ Processor Microarchitecture
Introduction
When discussing processor design, it is important to understand
the following terms—architecture, microarchitecture, and design
implementation. The term architecture refers to the instruction
set and features of a processor that are visible to software
programs running on the processor. The architecture
determines what software the processor can run. The
architecture of the AMD Athlon processor is the
industry-standard x86 instruction set.
The term microarchitecture refers to the design techniques used
in the processor to reach the target cost, performance, and
functionality goals. The AMD Athlon processor
microarchitecture is a decoupled decode/execution design
approach. In other words, the decoders essentially operate
independently of the execution units, and the execution core uses
a small number of instructions and simplified circuit design for
fast single-cycle execution and fast operating frequencies.
The term design implementation refers to the actual logic and
circuit designs from which the processor is created according to
the microarchitecture specifications.
AMD Athlon™ Processor Microarchitecture
The innovative AMD Athlon processor microarchitecture
approach implements the x86 instruction set by processing
simpler operations (OPs) instead of complex x86 instructions.
These OPs are specially designed to include direct support for
the x86 instructions while observing the high-performance
principles of fixed-length encoding, regularized instruction
fields, and a large register set. Instead of executing complex
x86 instructions, which have lengths from 1 to 15 bytes, the
AMD Athlon processor executes the simpler fixed-length OPs,
while maintaining the instruction coding efficiencies found in
x86 programs. The enhanced microarchitecture used in the
AMD Athlon processor enables higher processor core
performance and promotes straightforward extendibility for
future designs.
Superscalar Processor
The AMD Athlon processor is an aggressive, out-of-order,
three-way superscalar x86 processor. It can fetch, decode, and
issue up to three x86 instructions per cycle with a centralized
instruction control unit (ICU) and two independent instruction
schedulers — an integer scheduler and a floating-point
scheduler. These two schedulers can simultaneously issue up to
nine OPs to the three general-purpose integer execution units
(IEUs), three address-generation units (AGUs), and three
floating-point/3DNow!™/MMX™ execution units. The
AMD Athlon moves integer instructions down the integer
execution pipeline, which consists of the integer scheduler and
the IEUs, as shown in Figure 1 on page 131. Floating-point
instructions are handled by the floating-point execution
pipeline, which consists of the floating-point scheduler and the
x87/3DNow!/MMX execution units.
[Block diagram: 2-way, 64-Kbyte instruction cache with 24-entry L1
TLB/256-entry L2 TLB; predecode cache; branch prediction table;
fetch/decode control; 3-way x86 instruction decoders; instruction
control unit (72-entry); integer scheduler (18-entry) feeding IEU0-2
and AGU0-2; FPU stack map/rename, FPU scheduler (36-entry), and FPU
register file (88-entry) feeding FADD (MMX™, 3DNow!™), FMUL (MMX,
3DNow!), and FSTORE; load/store queue unit; 2-way, 64-Kbyte data
cache with 32-entry L1 TLB/256-entry L2 TLB; L2 cache controller;
bus interface unit; system interface; L2 SRAMs.]
Figure 1. AMD Athlon™ Processor Block Diagram
Instruction Cache
The out-of-order execute engine of the AMD Athlon processor
contains a very large 64-Kbyte L1 instruction cache. The L1
instruction cache is organized as a 64-Kbyte, two-way,
set-associative array. Each line in the instruction array is 64
bytes long. Functions associated with the L1 instruction cache
are instruction loads, instruction prefetching, instruction
predecoding, and branch prediction. Requests that miss in the
L1 instruction cache are fetched from the backside L2 cache or,
subsequently, from the local memory using the bus interface
unit (BIU).
The instruction cache generates fetches on the naturally
aligned 64 bytes containing the instructions and the next
sequential line of 64 bytes (a prefetch). The principle of
program spatial locality makes data prefetching very effective
and avoids or reduces execution stalls due to the amount of
time wasted reading the necessary data. Cache line
replacement is based on a least-recently used (LRU)
replacement algorithm.
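The stated geometry (64 Kbytes, two-way, 64-byte lines) implies 512 sets, and the arithmetic can be sketched in C. This is an illustration under assumptions not confirmed by this appendix: the names are hypothetical, and the sketch takes the set index directly from the address bits above the line offset, without saying whether those are linear or physical address bits.

```c
#include <stdint.h>

enum {
    CACHE_BYTES = 64 * 1024,
    WAYS        = 2,
    LINE_BYTES  = 64,
    NUM_SETS    = CACHE_BYTES / (WAYS * LINE_BYTES)   /* 512 sets */
};

/* Byte offset within the 64-byte line: address bits 5:0. */
unsigned line_offset(uint32_t addr)
{
    return addr & (LINE_BYTES - 1);
}

/* Set index (assumed): the next 9 address bits select one of 512 sets. */
unsigned set_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Start of the next sequential line, i.e., the line the prefetch targets. */
uint32_t next_line(uint32_t addr)
{
    return (addr & ~(uint32_t)(LINE_BYTES - 1)) + LINE_BYTES;
}
```

Under this mapping, two addresses that differ by a multiple of 32 Kbytes fall in the same set and compete for its two ways.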
The L1 instruction cache has an associated two-level translation
look-aside buffer (TLB) structure. The first-level TLB is fully
associative and contains 24 entries (16 that map 4-Kbyte pages
and eight that map 2-Mbyte or 4-Mbyte pages). The second-level
TLB is four-way set associative and contains 256 entries, which
can map 4-Kbyte pages.
Predecode
Predecoding begins as the L1 instruction cache is filled.
Predecode information is generated and stored alongside the
instruction cache. This information is used to help efficiently
identify the boundaries between variable length x86
instructions, to distinguish DirectPath from VectorPath
early-decode instructions, and to locate the opcode byte in each
instruction. In addition, the predecode logic detects code
branches such as CALLs, RETURNs and short unconditional
JMPs. When a branch is detected, predecoding begins at the
target of the branch.
Branch Prediction
The fetch logic accesses the branch prediction table in parallel
with the instruction cache and uses the information stored in
the branch prediction table to predict the direction of branch
instructions.
The AMD Athlon processor employs combinations of a branch
target address buffer (BTB), a global history bimodal counter
(GHBC) table, and a return address stack (RAS) hardware in
order to predict and accelerate branches. Predicted-taken
branches incur only a single-cycle delay to redirect the
instruction fetcher to the target instruction. In the event of a
mispredict, the minimum penalty is ten cycles.
The BTB is a 2048-entry table that caches in each entry the
predicted target address of a branch.
In addition, the AMD Athlon processor implements a 12-entry
return address stack to predict return addresses from a near or
far call. As CALLs are fetched, the next EIP is pushed onto the
return stack. Subsequent RETs pop a predicted return address
off the top of the stack.
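The push-on-CALL, pop-on-RET behavior can be modeled with a small fixed-size stack. The following C sketch is a software illustration only: the type and function names are hypothetical, and the behavior when more than 12 calls are outstanding (wrapping and overwriting the oldest entry) is an assumption, since the text does not describe it.

```c
#include <stdint.h>

#define RAS_ENTRIES 12

typedef struct {
    uint32_t entry[RAS_ENTRIES];
    unsigned top;     /* index of the next free slot */
    unsigned count;   /* number of valid entries, at most RAS_ENTRIES */
} ras_t;

void ras_init(ras_t *r)
{
    r->top = 0;
    r->count = 0;
}

/* On a CALL: push the EIP of the instruction following the CALL. */
void ras_push(ras_t *r, uint32_t next_eip)
{
    r->entry[r->top] = next_eip;
    r->top = (r->top + 1) % RAS_ENTRIES;   /* assumed: wrap, overwriting oldest */
    if (r->count < RAS_ENTRIES)
        r->count++;
}

/* On a RET: pop the predicted return address.
   Returns 1 with *predicted set, or 0 if the stack holds no prediction. */
int ras_pop(ras_t *r, uint32_t *predicted)
{
    if (r->count == 0)
        return 0;
    r->top = (r->top + RAS_ENTRIES - 1) % RAS_ENTRIES;
    *predicted = r->entry[r->top];
    r->count--;
    return 1;
}
```

In this model, call chains nested deeper than 12 levels lose the predictions for the outermost returns, which is one reason to keep CALL and RET properly paired.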
Early Decoding
The DirectPath and VectorPath decoders perform
early-decoding of instructions into MacroOPs. A MacroOP is a
fixed length instruction which contains one or more OPs. The
outputs of the early decoders keep all (DirectPath or
VectorPath) instructions in program order. Early decoding
produces three MacroOPs per cycle from either path. The
outputs of both decoders are multiplexed together and passed
to the next stage in the pipeline, the instruction control unit.
When the target 16-byte instruction window is obtained from
the instruction cache, the predecode data is examined to
determine which type of basic decode should occur,
DirectPath or VectorPath.
DirectPath Decoder
DirectPath instructions can be decoded directly into a
MacroOP, and subsequently into one or two OPs in the final
issue stage. A DirectPath instruction is limited to those x86
instructions that can be further decoded into one or two OPs.
The length of the x86 instruction does not determine DirectPath
instructions. A maximum of three DirectPath x86 instructions
can occupy a given aligned 8-byte block. 16-bytes are fetched at
a time. Therefore, up to six DirectPath x86 instructions can be
passed into the DirectPath decode pipeline.
VectorPath Decoder
Uncommon x86 instructions requiring two or more MacroOPs
proceed down the VectorPath pipeline. The sequence of
MacroOPs is produced by an on-chip ROM known as the MROM.
The VectorPath decoder can produce up to three MacroOPs per
cycle. Decoding a VectorPath instruction may prevent the
simultaneous decode of a DirectPath instruction.
Instruction Control Unit
The instruction control unit (ICU) is the control center for the
AMD Athlon processor. The ICU controls the following
resources—the centralized in-flight reorder buffer, the integer
scheduler, and the floating-point scheduler. In turn, the ICU is
responsible for the following functions — MacroOP dispatch,
MacroOP retirement, register and flag dependency resolution
and renaming, execution resource management, interrupts,
exceptions, and branch mispredictions.
The ICU takes the three MacroOPs per cycle from the early
decoders and places them in a centralized, fixed-issue reorder
buffer. This buffer is organized into 24 lines of three MacroOPs
each. The reorder buffer allows the ICU to track and monitor up
to 72 in-flight MacroOPs (whether integer or floating-point) for
maximum instruction throughput. The ICU can simultaneously
dispatch multiple MacroOPs from the reorder buffer to both the
integer and floating-point schedulers for final decode, issue,
and execution as OPs. In addition, the ICU handles exceptions
and manages the retirement of MacroOPs.
Data Cache
The L1 data cache contains two 64-bit ports. It is a
write-allocate and writeback cache that uses an LRU
replacement policy. The data cache and instruction cache are
both two-way set-associative and 64-Kbytes in size. The data
cache is divided into 8 banks where each bank is 8 bytes wide.
In addition, this cache supports the MOESI (Modified, Owner,
Exclusive, Shared, and Invalid) cache coherency protocol and
data parity.
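Eight banks of 8 bytes each span exactly one 64-byte line, which suggests a simple address-to-bank mapping. The C sketch below assumes that consecutive quadwords map to consecutive banks (bank = address bits 5:3); that mapping, the function names, and the collision condition shown are assumptions for illustration and are not stated in this section.

```c
#include <stdint.h>

/* Assumed mapping: bank = address bits 5:3
   (8 banks x 8 bytes = one 64-byte line). */
unsigned dcache_bank(uint32_t addr)
{
    return (addr >> 3) & 7u;
}

/* Under the assumed mapping, two accesses issued together may compete
   when they select the same bank but lie in different 64-byte lines. */
int same_bank_different_line(uint32_t a, uint32_t b)
{
    return dcache_bank(a) == dcache_bank(b) && (a >> 6) != (b >> 6);
}
```

For example, addresses 0 and 64 select bank 0 in different lines, while addresses 0 and 8 select different banks.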
The L1 data cache has an associated two-level TLB structure.
The first-level TLB is fully associative and contains 32 entries
(24 that map 4-Kbyte pages and eight that map 2-Mbyte or
4-Mbyte pages). The second-level TLB is four-way set
associative and contains 256 entries, which can map 4-Kbyte
pages.
Integer Scheduler
The integer scheduler is based on a three-wide queuing system
(also known as a reservation station) that feeds three integer
execution positions or pipes. The reservation stations are six
entries deep, for a total queuing system of 18 integer
MacroOPs. Each reservation station divides the MacroOPs into
integer and address generation OPs, as required.
Integer Execution Unit
The integer execution pipeline consists of three identical
pipes — 0, 1, and 2. Each integer pipe consists of an integer
execution unit (IEU) and an address generation unit (AGU).
The integer execution pipeline is organized to match the three
MacroOP dispatch pipes in the ICU as shown in Figure 2 on
page 135. MacroOPs are broken down into OPs in the
schedulers. OPs issue when their operands are available either
from the register file or result buses.
OPs are executed when their operands are available. OPs from
a single MacroOP can execute out-of-order. In addition, a
particular integer pipe can be executing two OPs from different
MacroOPs (one in the IEU and one in the AGU) at the same
time.
[Figure: the instruction control unit and register files dispatch
MacroOPs to the integer scheduler (18-entry), which feeds three pipes,
IEU0/AGU0, IEU1/AGU1, and IEU2/AGU2, across pipeline stages 7 and 8,
together with the integer multiply (IMUL) unit.]
Figure 2. Integer Execution Pipeline
Each of the three IEUs is general purpose in that each
performs logic functions, arithmetic functions, conditional
functions, divide step functions, status flag multiplexing, and
branch resolutions. The AGUs calculate the logical addresses
for loads, stores, and LEAs. A load and store unit reads and
writes data to and from the L1 data cache. The integer
scheduler sends a completion status to the ICU when the
outstanding OPs for a given MacroOP are executed.
All integer operations can be handled within any of the three
IEUs with the exception of multiplies. Multiplies are handled
by a pipelined multiplier that is attached to the pipeline at pipe
0. See Figure 2 on page 135. Multiplies always issue to integer
pipe 0, and the issue logic creates results bus bubbles for the
multiplier in integer pipes 0 and 1 by preventing non-multiply
OPs from issuing at the appropriate time.
Floating-Point Scheduler
The AMD Athlon processor floating-point logic is a
high-performance, fully-pipelined, superscalar, out-of-order
execution unit. It is capable of accepting three MacroOPs of any
mixture of x87 floating-point, 3DNow! or MMX operations per
cycle.
The floating-point scheduler handles register renaming and has
a dedicated 36-entry scheduler buffer organized as 12 lines of
three MacroOPs each. It also performs OP issue and
out-of-order execution. The floating-point scheduler
communicates with the ICU to retire a MacroOP, to manage
comparison results from the FCOMI instruction, and to back
out results from a branch misprediction.
Floating-Point Execution Unit
The floating-point execution unit (FPU) is implemented as a
coprocessor that has its own out-of-order control in addition to
the data path. The FPU handles all register operations for x87
instructions, all 3DNow! operations, and all MMX operations.
The FPU consists of a stack renaming unit, a register renaming
unit, a scheduler, a register file, and three parallel execution
units. Figure 3 shows a block diagram of the dataflow through
the FPU.
[Figure: dataflow through the FPU by pipeline stage. The instruction
control unit feeds the stack map (stage 7) and register rename
(stage 8), followed by the scheduler (36-entry) and the FPU register
file (88-entry), and then the three execution pipes through stages
12 to 15: FADD (MMX™ ALU, 3DNow!™), FMUL (MMX ALU, MMX Mul, 3DNow!),
and FSTORE.]
Figure 3. Floating-Point Unit Block Diagram
As shown in Figure 3 on page 137, the floating-point logic uses
three separate execution positions or pipes for superscalar x87,
3DNow! and MMX operations. The first of the three pipes is
generally known as the adder pipe (FADD), and it contains
3DNow! add, MMX ALU/shifter, and floating-point add
execution units. The second pipe is known as the multiplier
(FMUL). It contains a 3DNow!/MMX multiplier/reciprocal unit,
an MMX ALU and a floating-point multiplier/divider/square
root unit. The third pipe is known as the floating-point
load/store (FSTORE), which handles floating-point constant
loads (FLDZ, FLDPI, etc.), stores, FILDs, as well as many OP
primitives used in VectorPath sequences.
Load-Store Unit (LSU)
The load-store unit (LSU) manages data load and store accesses
to the L1 data cache and, if required, to the backside L2 cache
or system memory. The 44-entry LSU provides a data interface
for both the integer scheduler and the floating-point scheduler.
It consists of two queues—a 12-entry queue for L1 cache load
and store accesses and a 32-entry queue for L2 cache or system
memory load and store accesses. The 12-entry queue can
request a maximum of two L1 cache loads and two 32-bit L1
cache stores per cycle. The 32-entry queue effectively holds
requests that missed in the L1 cache probe by the 12-entry
queue. Finally, the LSU ensures that the architectural load and
store ordering rules are preserved (a requirement for x86
architecture compatibility).
[Figure: block diagram with the labels Operand Buses, Result Buses from Core, Data Cache (2-way, 64 Kbytes), LSU (44-entry), and Store Data to BIU.]
Figure 4. Load/Store Unit
L2 Cache Controller
The AMD Athlon processor contains a very flexible onboard L2
controller. It uses an independent backside bus to access up to
8 Mbytes of industry-standard SRAMs. There are full on-chip
tags for a 512-Kbyte cache, while larger sizes use a partial tag
system. In addition, there is a two-level data TLB structure. The
first-level TLB is fully associative and contains 32 entries (24
that map 4-Kbyte pages and eight that map 2-Mbyte or 4-Mbyte
pages). The second-level TLB is four-way set associative and
contains 256 entries, which can map 4-Kbyte pages.
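The amount of memory each TLB level can map with 4-Kbyte pages follows directly from the entry counts above. A minimal sketch of that arithmetic; the function and variable names are illustrative assumptions, not part of any AMD interface:

```python
KBYTE = 1024

def tlb_reach_bytes(entries, page_bytes):
    """Memory covered when every entry maps a distinct page of the given size."""
    return entries * page_bytes

# Entry counts and page sizes are taken from the paragraph above.
l1_reach_4k = tlb_reach_bytes(24, 4 * KBYTE)    # first-level entries for 4-Kbyte pages
l2_reach_4k = tlb_reach_bytes(256, 4 * KBYTE)   # second-level entries
print(l1_reach_4k // KBYTE, "Kbytes;", l2_reach_4k // KBYTE, "Kbytes")
```

Under these figures, a working set that touches more 4-Kbyte pages than the second-level TLB holds can be expected to incur TLB misses.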
Write Combining
See Appendix C, “Implementation of Write Combining” on
page 155 for detailed information about write combining.
AMD Athlon™ System Bus
The AMD Athlon system bus is a high-speed bus that consists of
a pair of unidirectional 13-bit address and control channels and
a bidirectional 64-bit data bus. The AMD Athlon system bus
supports low-voltage swing, multiprocessing, clock forwarding,
and fast data transfers. The clock forwarding technique is used
to deliver data on both edges of the reference clock, therefore
doubling the transfer speed. A four-entry 64-byte write buffer is
integrated into the BIU. The write buffer improves bus
utilization by combining multiple writes into a single large
write cycle. By using the AMD Athlon system bus, the
AMD Athlon processor can transfer data on the 64-bit data bus
at 200 MHz, which yields an effective throughput of 1.6 Gbytes
per second.
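The throughput figure follows from the bus width and the transfer rate. A minimal sketch of the arithmetic, assuming 1 Gbyte is counted as 10^9 bytes; the function name is an illustrative assumption:

```python
def bus_throughput_bytes_per_second(data_bus_bits, transfers_per_second):
    """Peak throughput = bytes moved per transfer * transfers per second."""
    return (data_bus_bits // 8) * transfers_per_second

# 64-bit data bus delivering data at an effective 200 million transfers per second
peak = bus_throughput_bytes_per_second(64, 200_000_000)
print(peak / 1e9, "Gbytes per second")
```

This is a peak figure; sustained throughput depends on the mix of bus commands actually issued.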
Appendix B
Pipeline and Execution Unit
Resources Overview
The AMD Athlon™ processor contains two independent
execution pipelines — one for integer operations and one for
floating-point operations. The integer pipeline manages x86
integer operations and the floating-point pipeline manages all
x87, 3DNow!™ and MMX™ instructions. This appendix
describes the operation and functionality of these pipelines.
Fetch and Decode Pipeline Stages
Figure 5 on page 142 and Figure 6 on page 142 show the
AMD Athlon processor instruction fetch and decoding pipeline
stages. The pipeline consists of one cycle for instruction fetches
and four cycles of instruction alignment and decoding. The
three ports in stage 5 provide a maximum bandwidth of three
MacroOPs per cycle for dispatching to the instruction control
unit (ICU).
[Figure: fetch and decode hardware by pipeline stage. Stage 1 FETCH; stage 2 SCAN; stage 3 ALIGN1/MECTL; stage 4 ALIGN2/MEROM; stage 5 EDEC/MEDEC; stage 6 IDEC. Block labels: I-CACHE, 16 bytes, Quadword Queue, Entry Point Decode, MROM, VectorPath, DirectPath, Decode, 3 MacroOps.]
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
The most common x86 instructions flow through the DirectPath
pipeline stages and are decoded by hardware. The less common
instructions, which require microcode assistance, flow through
the VectorPath. Although the DirectPath decodes the common
x86 instructions, it also contains VectorPath instruction data,
which allows it to maintain dispatch order at the end of cycle 5.
[Figure: DirectPath stages are FETCH (1), SCAN (2), ALIGN1 (3), ALIGN2 (4), EDEC (5), and IDEC (6). VectorPath stages are MECTL (3), MEROM (4), and MESEQ (5).]
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages
Cycle 1–FETCH
The FETCH pipeline stage calculates the address of the next
x86 instruction window to fetch from the processor caches or
system memory.
Cycle 2–SCAN
SCAN determines the start and end pointers of instructions.
SCAN can send up to six aligned instructions (DirectPath and
VectorPath) to ALIGN1 and only one VectorPath instruction to
the microcode engine (MENG) per cycle.
Cycle 3 (DirectPath)–ALIGN1
Because each 8-byte buffer (quadword queue) can contain up to
three instructions, ALIGN1 can buffer up to a maximum of nine
instructions, or 24 instruction bytes. ALIGN1 tries to send three
instructions from an 8-byte buffer to ALIGN2 per cycle.
Cycle 3 (VectorPath)–MECTL
For VectorPath instructions, the microcode engine control
(MECTL) stage of the pipeline generates the microcode entry
points.
Cycle 4 (DirectPath)–ALIGN2
ALIGN2 prioritizes prefix bytes, determines the opcode,
ModR/M, and SIB bytes for each instruction and sends the
accumulated prefix information to EDEC.
Cycle 4 (VectorPath)–MEROM
In the microcode engine ROM (MEROM) pipeline stage, the
entry-point generated in the previous cycle, MECTL, is used to
index into the MROM to obtain the microcode lines necessary
to decode the instruction sent by SCAN.
Cycle 5 (DirectPath)–EDEC
The early decode (EDEC) stage decodes information from the
DirectPath stage (ALIGN2) and VectorPath stage (MEROM)
into MacroOPs. In addition, EDEC determines register
pointers, flag updates, immediate values, displacements, and
other information. EDEC then selects either MacroOPs from
the DirectPath or MacroOPs from the VectorPath to send to the
instruction decoder (IDEC) stage.
Cycle 5 (VectorPath)–MEDEC/MESEQ
The microcode engine decode (MEDEC) stage converts x86
instructions into MacroOPs. The microcode engine sequencer
(MESEQ) performs the sequence controls (redirects and
exceptions) for the MENG.
Cycle 6–IDEC/Rename
At the instruction decoder (IDEC)/rename stage, integer and
floating-point MacroOPs diverge in the pipeline. Integer
MacroOPs are scheduled for execution in the next cycle.
Floating-point MacroOPs have their floating-point stack
operands mapped to registers. Both integer and floating-point
MacroOPs are placed into the ICU.
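The cycle-by-cycle descriptions above can be summarized as a small lookup table. This is only a reading aid built from the stage names in the text; the dictionary and function names are illustrative assumptions:

```python
# Stage names per cycle, transcribed from the cycle descriptions above.
FETCH_DECODE_STAGES = {
    1: {"DirectPath": "FETCH", "VectorPath": "FETCH"},
    2: {"DirectPath": "SCAN", "VectorPath": "SCAN"},
    3: {"DirectPath": "ALIGN1", "VectorPath": "MECTL"},
    4: {"DirectPath": "ALIGN2", "VectorPath": "MEROM"},
    5: {"DirectPath": "EDEC", "VectorPath": "MEDEC/MESEQ"},
    6: {"DirectPath": "IDEC/Rename", "VectorPath": "IDEC/Rename"},
}

def stages_for(path):
    """Return the stage names an instruction on the given path passes through, in cycle order."""
    return [FETCH_DECODE_STAGES[cycle][path] for cycle in sorted(FETCH_DECODE_STAGES)]

print(" -> ".join(stages_for("DirectPath")))
print(" -> ".join(stages_for("VectorPath")))
```

The two paths share cycles 1, 2, and 6 and differ only in cycles 3 through 5.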
Integer Pipeline Stages
The integer execution pipeline consists of four or more stages
for scheduling and execution and, if necessary, accessing data
in the processor caches or system memory. There are three
integer pipes associated with the three IEUs.
[Figure: block diagram with the labels Instruction Control Unit and Register Files, MacroOPs, Integer Scheduler (18-entry), IEU0/AGU0, IEU1/AGU1, IEU2/AGU2, and Integer Multiply (IMUL), annotated with pipeline stages 7 and 8.]
Figure 7. Integer Execution Pipeline
Figure 7 and Figure 8 show the integer execution resources and
the pipeline stages, which are described in the following
sections.
[Figure: stage 7 SCHED; stage 8 EXEC; stage 9 ADDGEN; stage 10 DCACC; stage 11 RESP.]
Figure 8. Integer Pipeline Stages
Cycle 7–SCHED
In the scheduler (SCHED) pipeline stage, the scheduler buffers
can contain MacroOPs that are waiting for integer operands
from the ICU or the IEU result bus. When all operands are
received, SCHED schedules the MacroOP for execution and
issues the OPs to the next stage, EXEC.
Cycle 8–EXEC
In the execution (EXEC) pipeline stage, the OP and its
associated operands are processed by an integer pipe (either
the IEU or the AGU). If addresses must be calculated to access
data necessary to complete the operation, the OP proceeds to
the next stages, ADDGEN and DCACC.
Cycle 9–ADDGEN
In the address generation (ADDGEN) pipeline stage, the load
or store OP calculates a linear address, which is sent to the data
cache TLBs and caches.
Cycle 10–DCACC
In the data cache access (DCACC) pipeline stage, the address
generated in the previous pipeline stage is used to access the
data cache arrays and TLBs. Any OP waiting in the scheduler
for this data captures it and proceeds to the EXEC stage
(assuming all other operands were available).
Cycle 11–RESP
In the response (RESP) pipeline stage, the data cache returns
hit/miss status and data for the request from DCACC.
Floating-Point Pipeline Stages
The floating-point unit (FPU) is implemented as a coprocessor
that has its own out-of-order control in addition to the data
path. The FPU handles all register operations for x87
instructions, all 3DNow! operations, and all MMX operations.
The FPU consists of a stack renaming unit, a register renaming
unit, a scheduler, a register file, and three parallel execution
units. Figure 9 shows a block diagram of the dataflow through
the FPU.
[Figure: FPU dataflow by pipeline stage. Instruction Control Unit; Stack Map (stage 7); Register Rename (stage 8); Scheduler, 36-entry (stages 9 and 10); FPU Register File, 88-entry (stage 11); execution pipes FADD (MMX ALU, 3DNow!), FMUL (MMX ALU, MMX Mul, 3DNow!), and FSTORE (stages 12 to 15).]
Figure 9. Floating-Point Unit Block Diagram
The floating-point pipeline stages 7–15 are shown in Figure 10
and described in the following sections. Note that the
floating-point pipe and integer pipe separate at cycle 7.
[Figure: stage 7 STKREN; stage 8 REGREN; stage 9 SCHEDW; stage 10 SCHED; stage 11 FREG; stages 12 to 15 FEXE1 to FEXE4.]
Figure 10. Floating-Point Pipeline Stages
Cycle 7–STKREN
The stack rename (STKREN) pipeline stage in cycle 7 receives
up to three MacroOPs from IDEC and maps stack-relative
register tags to virtual register tags.
Cycle 8–REGREN
The register renaming (REGREN) pipeline stage in cycle 8 is
responsible for register renaming. In this stage, virtual register
tags are mapped into physical register tags. Likewise, each
destination is assigned a new physical register. The MacroOPs
are then sent to the 36-entry FPU scheduler.
Cycle 9–SCHEDW
The scheduler write (SCHEDW) pipeline stage in cycle 9 can
receive up to three MacroOPs per cycle.
Cycle 10–SCHED
The schedule (SCHED) pipeline stage in cycle 10 schedules up
to three MacroOPs per cycle from the 36-entry FPU scheduler
to the FREG pipeline stage to read register operands.
MacroOPs are sent when their operands and/or tags are
obtained.
Cycle 11–FREG
The register file read (FREG) pipeline stage reads the
floating-point register file for any register source operands of
MacroOPs. The register file read is done before the MacroOPs
are sent to the floating-point execution pipelines.
Cycle 12–15–Floating-Point Execution (FEXEC1–4)
The FPU has three logical pipes—FADD, FMUL, and FSTORE.
Each pipe may have several associated execution units. MMX
execution is in both the FADD and FMUL pipes, with the
exception of MMX instructions involving multiplies, which are
limited to the FMUL pipe. The FMUL pipe has special support
for long latency operations.
DirectPath/VectorPath operations are dispatched to the FPU
during cycle 6, but are not acted upon until they receive
validation from the ICU in cycle 7.
Execution Unit Resources
Terminology
The execution units operate with two types of register values—
operands and results. There are three operand types and two
result types, which are described in this section.
Operands
The three types of operands are as follows:
■ Address register operands: used for address calculations of load and store instructions
■ Data register operands: used for register instructions
■ Store data register operands: used for memory stores
Results
The two types of results are as follows:
■ Data register results: produced by load or register instructions
■ Address register results: produced by LEA or PUSH instructions
Examples
The following examples illustrate the operand and result
definitions:
    ADD EAX, EBX
The ADD instruction has two data register operands (EAX
and EBX) and one data register result (EAX).
    MOV EBX, [ESP+4*ECX+8] ;Load
The Load instruction has two address register operands
(ESP and ECX as base and index registers, respectively)
and a data register result (EBX).
    MOV [ESP+4*ECX+8], EAX ;Store
The Store instruction has a data register operand (EAX)
and two address register operands (ESP and ECX as base
and index registers, respectively).
    LEA ESI, [ESP+4*ECX+8]
The LEA instruction has address register operands (ESP
and ECX as base and index registers, respectively), and an
address register result (ESI).
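The address register operands in the examples above are simply the registers named inside the bracketed memory operand. A minimal sketch that extracts them from the operand text; the register set and the function name are illustrative assumptions, and the parser handles only the 32-bit general-purpose register names used in these examples:

```python
import re

# 32-bit general-purpose registers, enough for the examples in this section.
REGISTERS = {"EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP"}

def address_registers(memory_operand):
    """Return the registers named inside a bracketed memory operand, in order."""
    inner = memory_operand.strip()[1:-1]          # drop the surrounding brackets
    tokens = re.findall(r"[A-Za-z]+", inner)      # runs of letters only
    return [tok.upper() for tok in tokens if tok.upper() in REGISTERS]

print(address_registers("[ESP+4*ECX+8]"))
```

For the Load, Store, and LEA examples this yields the base and index registers; scale factors and displacements are constants and do not count as register operands.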
Integer Pipeline Operations
Table 2 shows the category or type of operations handled by the
integer pipeline. Table 3 shows examples of the decode type.
Table 2. Integer Pipeline Operation Types

Category                                   Execution Unit
Integer Memory Load or Store Operations    L/S
Address Generation Operations              AGU
Integer Execution Unit Operations          IEU
Integer Multiply Operations                IMUL

Table 3. Integer Decode Types

x86 Instruction    Decode Type    OPs
MOV CX, [SP+4]     DirectPath     AGU, L/S
ADD AX, BX         DirectPath     IEU
CMP CX, [AX]       VectorPath     AGU, L/S, IEU
JZ Addr            DirectPath     IEU

As shown in Table 3, the MOV instruction early decodes in the
DirectPath decoder and requires two OPs—an address
generation operation for the indirect address and a data load
from memory into a register. The ADD instruction early
decodes in the DirectPath decoder and requires a single OP
that can be executed in one of the three IEUs. The CMP
instruction early decodes in the VectorPath and requires three
OPs—an address generation operation for the indirect address,
a data load from memory, and a compare to CX using an IEU.
The final JZ instruction is a simple operation that early decodes
in the DirectPath decoder and requires a single OP. Not shown
is a load-op-store instruction, which translates into only one
MacroOP (one AGU OP, one IEU OP, and one L/S OP).
Floating-Point Pipeline Operations
Table 4 shows the category or type of operations handled by the
floating-point execution units. Table 5 shows examples of the
decode types.
Table 4. Floating-Point Pipeline Operation Types

Category                                                   Execution Unit
FPU/3DNow!/MMX Load/store or Miscellaneous Operations      FSTORE
FPU/3DNow!/MMX Multiply Operation                          FMUL
FPU/3DNow!/MMX Arithmetic Operation                        FADD

Table 5. Floating-Point Decode Types

x86 Instruction    Decode Type    OPs
FADD ST, ST(i)     DirectPath     FADD
FSIN               VectorPath     various
PFACC              DirectPath     FADD
PFRSQRT            DirectPath     FMUL

As shown in Table 5, the FADD register-to-register instruction
generates a single MacroOP targeted for the floating-point
scheduler. FSIN is considered a VectorPath instruction because
it is a complex instruction with long execution times, as
compared to the more common floating-point instructions. The
3DNow! PFACC instruction is DirectPath decodable and
generates a single MacroOP targeted for the arithmetic
operation execution pipeline in the floating-point logic. Just
like PFACC, a single MacroOP is early decoded for the 3DNow!
PFRSQRT instruction, but it is targeted for the multiply
operation execution pipeline.
Load/Store Pipeline Operations
The AMD Athlon processor decodes any instruction that
references memory into primitive load/store operations. For
example, consider the following code sample:
    MOV  AX, [EBX]     ;1 load MacroOP
    PUSH EAX           ;1 store MacroOP
    POP  EAX           ;1 load MacroOP
    ADD  [EAX], EBX    ;1 load/store and 1 IEU MacroOPs
    FSTP [EAX]         ;1 store MacroOP
    MOVQ [EAX], MM0    ;1 store MacroOP
As shown in Table 6, the load/store unit (LSU) consists of a
three-stage data cache lookup.
Table 6. Load/Store Unit Stages

Stage 1 (Cycle 8)                 Stage 2 (Cycle 9)                  Stage 3 (Cycle 10)
Address Calculation / LS1 Scan    Transport Address to Data Cache    Data Cache Access / LS2 Data Forward

Loads and stores are first dispatched in order into a 12-entry
deep reservation queue called LS1. LS1 holds loads and stores
that are waiting to enter the cache subsystem. Loads and stores
are allocated into LS1 entries at dispatch time in program
order, and are required by LS1 to probe the data cache in
program order. Because the AGUs can calculate addresses out of
program order, LS1 acts as an address reorder buffer.
When a load or store is scanned out of the LS1 queue (Stage 1),
it is deallocated from the LS1 queue and inserted into the data
cache probe pipeline (Stage 2 and Stage 3). Up to two memory
operations can be scheduled (scanned out of LS1) to access the
data cache per cycle. The LSU can handle the following:
■ Two 64-bit loads per cycle or
■ One 64-bit load and one 64-bit store per cycle or
■ Two 32-bit stores per cycle
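The three combinations listed above can be written as a simple per-cycle check. This is one reading of the list, not a description of the hardware logic; the function name, the tuple format, and the treatment of a single operation as always acceptable are assumptions:

```python
def lsu_can_issue(ops):
    """Check one cycle's memory operations against the combinations listed above.

    ops is a list of (kind, bits) tuples, where kind is "load" or "store".
    """
    if len(ops) > 2:                       # at most two operations scanned out of LS1 per cycle
        return False
    if any(bits > 64 for _, bits in ops):  # the list covers accesses up to 64 bits
        return False
    stores = [bits for kind, bits in ops if kind == "store"]
    if len(stores) == 2:                   # two stores are listed only at 32 bits each
        return all(bits <= 32 for bits in stores)
    return True                            # two loads, or one load plus one store

print(lsu_can_issue([("load", 64), ("store", 64)]))
print(lsu_can_issue([("store", 64), ("store", 64)]))
```

Code that writes memory with back-to-back 64-bit stores can therefore expect at most one of them to access the data cache per cycle under this reading.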
Code Sample Analysis
The samples in Table 7 on page 153 and Table 8 on page 154
show the execution behavior of several series of instructions as
a function of decode constraints, dependencies, and execution
resource constraints.
The sample tables show the x86 instructions, the decode pipe in
the integer execution pipeline, the decode type, the clock
counts, and a description of the events occurring within the
processor. The decode pipe gives the specific IEU used (see
Figure 7 on page 144). The decode type specifies either
VectorPath (VP) or DirectPath (DP).
The following nomenclature is used to describe the current
location of a particular operation:
■ D: Dispatch stage (Allocate in ICU, reservation stations, load-store (LS1) queue)
■ I: Issue stage (Schedule operation for AGU or FU execution)
■ E: Integer Execution Unit (IEU number corresponds to decode pipe)
■ &: Address Generation Unit (AGU number corresponds to decode pipe)
■ M: Multiplier Execution
■ S: Load/Store pipe stage 1 (Schedule operation for load/store pipe)
■ A: Load/Store pipe stage 2 (1st stage of data cache/LS2 buffer access)
■ $: Load/Store pipe stage 3 (2nd stage of data cache/LS2 buffer access)
Note: Instructions execute more efficiently (that is, without
delays) when scheduled apart by suitable distances based on
dependencies. In general, the samples in this section show
poorly scheduled code in order to illustrate the resultant
effects.
Table 7. Sample 1 – Integer Register Operations

                        Decode Decode Clocks
Num  Instruction        Pipe   Type   1  2  3  4  5  6  7  8
1    IMUL EAX, ECX      0      VP     D  I  M  M  M  M
2    INC ESI            0      DP        D  I  E
3    MOV EDI, 0x07F4    1      DP        D  I  E
4    ADD EDI, EBX       2      DP        D     I  E
5    SHL EAX, 8         0      DP           D        I  E
6    OR EAX, 0x0F       1      DP           D           I  E
7    INC EBX            2      DP           D     I  E
8    ADD ESI, EDX       0      DP              D  I  E

Comments for Each Instruction Number
1. The IMUL is a VectorPath instruction. It cannot be decoded or paired with other operations and, therefore,
dispatches alone in pipe 0. The multiply latency is four cycles.
2. The simple INC operation is paired with instructions 3 and 4. The INC executes in IEU0 in cycle 4.
3. The MOV executes in IEU1 in cycle 4.
4. The ADD operation depends on instruction 3. It executes in IEU2 in cycle 5.
5. The SHL operation depends on the multiply result (instruction 1). The MacroOP waits in a reservation
station and is eventually scheduled to execute in cycle 7 after the multiply result is available.
6. This operation executes in cycle 8 in IEU1.
7. This simple operation has a resource contention for execution in IEU2 in cycle 5. Therefore, the operation
does not execute until cycle 6.
8. The ADD operation executes immediately in IEU0 after dispatching.
Table 8. Sample 2 – Integer Register and Memory Load Operations

Num  Instruction              Decode Pipe  Decode Type
1    DEC EDX                  0            DP
2    MOV EDI, [ECX]           1            DP
3    SUB EAX, [EDX+20]        2            DP
4    SAR EAX, 5               0            DP
5    ADD ECX, [EDI+4]         1            DP
6    AND EBX, 0x1F            2            DP
7    MOV ESI, [0x0F100]       0            DP
8    OR ECX, [ESI+EAX*4+8]    1            DP

[Clock-by-clock stage columns (clocks 1 to 12) omitted; the comments below give the timing of each instruction.]

Comments for Each Instruction Number
1. The ALU operation executes in IEU0.
2. The load operation generates the address in AGU1 and is simultaneously scheduled for the load/store pipe in cycle 3. In
cycles 4 and 5, the load completes the data cache access.
3. The load-execute instruction accesses the data cache in tandem with instruction 2. After the load portion completes, the
subtraction is executed in cycle 6 in IEU2.
4. The shift operation executes in IEU0 (cycle 7) after instruction 3 completes.
5. This operation is stalled on its address calculation waiting for instruction 2 to update EDI. The address is calculated in
cycle 6. In cycle 7/8, the cache access completes.
6. This simple operation executes quickly in IEU2.
7. The address for the load is calculated in cycle 5 in AGU0. However, the load is not scheduled to access the data cache
until cycle 6. The load is blocked for scheduling to access the data cache for one cycle by instruction 5. In cycles 7 and 8,
instruction 7 accesses the data cache concurrently with instruction 5.
8. The load execute instruction accesses the data cache in cycles 10/11 and executes the ‘OR’ operation in IEU1 in cycle 12.
Appendix C
Implementation of
Write Combining
Introduction
This appendix describes the memory write-combining feature
as implemented in the AMD Athlon™ processor family. The
AMD Athlon processor supports the memory type and range
register (MTRR) and the page attribute table (PAT) extensions,
which allow software to define ranges of memory as either
writeback (WB), write-protected (WP), writethrough (WT),
uncacheable (UC), or write-combining (WC).
Defining the memory type for a range of memory as WC or WT
allows the processor to conditionally combine data from
multiple write cycles that are addressed within this range into a
merge buffer. Merging multiple write cycles into a single write
cycle reduces processor bus utilization and processor stalls,
thereby increasing the overall system performance.
To understand the information presented in this appendix, the
reader should possess a knowledge of K86™ processors, the x86
architecture, and programming requirements.
Write-Combining Definitions and Abbreviations
This appendix uses the following definitions and abbreviations:
■ UC: Uncacheable memory type
■ WC: Write-combining memory type
■ WT: Writethrough memory type
■ WP: Write-protected memory type
■ WB: Writeback memory type
■ One Byte: 8 bits
■ One Word: 16 bits
■ Longword: 32 bits (same as an x86 doubleword)
■ Quadword: 64 bits or 2 longwords
■ Octaword: 128 bits or 2 quadwords
■ Cache Block: 64 bytes or 4 octawords or 8 quadwords
What is Write Combining?
Write combining is the merging of multiple memory write
cycles that target locations within the address range of a write
buffer. The AMD Athlon processor combines multiple
memory-write cycles to a 64-byte buffer whenever the memory
address is within a WC or WT memory type region. The
processor continues to combine writes to this buffer without
writing the data to the system, as long as certain rules apply
(see Table 9 on page 158 for more information).
Programming Details
The steps required for programming write combining on the
AMD Athlon processor are as follows:
1. Verify the presence of an AMD Athlon processor by using
the CPUID instruction to check for the instruction family
code and vendor identification of the processor. Standard
function 0 on AMD processors returns a vendor
identification string of “AuthenticAMD” in registers EBX,
EDX, and ECX. Standard function 1 returns the processor
signature in register EAX, where EAX[11–8] contains the
instruction family code. For the AMD Athlon processor, the
instruction family code is six.
2. In addition, the presence of the MTRRs is indicated by bit
12 and the presence of the PAT extension is indicated by bit
16 of the extended features bits returned in the EDX
register by CPUID function 8000_0001h. See the AMD
Processor Recognition Application Note, order# 20734 for
more details on the CPUID instruction.
3. Write combining is controlled by the MTRRs and PAT.
Write combining should be enabled for the appropriate
memory ranges. The AMD Athlon processor MTRRs and
PAT are compatible with the Pentium® II.
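The checks in steps 1 and 2 reduce to a few bit-field tests on the register values that CPUID returns. The sketch below decodes values that have already been read by some other means (Python cannot execute CPUID itself); the function names and the sample register values are illustrative assumptions, while the bit positions come from the steps above:

```python
import struct

def vendor_string(ebx, edx, ecx):
    """Assemble the 12-character vendor string from CPUID function 0 (EBX, EDX, ECX order)."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")

def instruction_family(eax):
    """Extract EAX[11:8] from the processor signature returned by CPUID function 1."""
    return (eax >> 8) & 0xF

def has_mtrr(ext_edx):
    """Bit 12 of EDX from CPUID function 8000_0001h."""
    return bool((ext_edx >> 12) & 1)

def has_pat(ext_edx):
    """Bit 16 of EDX from CPUID function 8000_0001h."""
    return bool((ext_edx >> 16) & 1)

# Sample values for illustration only: the three dwords spell "AuthenticAMD",
# and the signature 0x621 is a made-up example with family code 6.
print(vendor_string(0x68747541, 0x69746E65, 0x444D4163))
print(instruction_family(0x00000621))
```

Software should confirm the vendor string before interpreting the extended feature bits, because function 8000_0001h is vendor specific.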
Write-Combining Operations
In order to improve system performance, the AMD Athlon
processor aggressively combines multiple memory-write cycles
of any data size that address locations within a 64-byte write
buffer that is aligned to a cache-line boundary. The data sizes
can be bytes, words, longwords, or quadwords.
WC memory type writes can be combined in any order up to a
full 64-byte sized write buffer.
WT memory type writes can only be combined up to a fully
aligned quadword in the 64-byte buffer, and must be combined
contiguously in ascending order. Combining may be opened at
any byte boundary in a quadword, but is closed by a write that is
either not “contiguous and ascending” or fills byte 7.
All other memory types for stores that go through the write
buffer (UC and WP) cannot be combined.
Combining is able to continue until interrupted by one of the
conditions listed in Table 9 on page 158. When combining is
interrupted, one or more bus commands are issued to the
system for that write buffer, as described by Table 10 on
page 159.
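The WT rule above (contiguous, ascending, and closed by a write that fills byte 7) can be illustrated with a small grouping function. This is a simplified sketch: it ignores the time-out and every other completion event, it rejects writes that cross a quadword boundary, and the function name and the (address, size) tuple format are assumptions:

```python
def wt_combine_groups(writes):
    """Group a sequence of WT writes the way the contiguous-ascending rule would combine them.

    writes is a list of (address, size_in_bytes) tuples in program order.
    """
    groups = []
    current = []        # writes in the currently open combining sequence
    next_addr = None    # address the next write must start at to keep combining
    for addr, size in writes:
        if (addr % 8) + size > 8:
            raise ValueError("sketch does not model writes that cross a quadword boundary")
        if current and addr != next_addr:
            groups.append(current)      # not contiguous and ascending: close combining
            current = []
        current.append((addr, size))
        next_addr = addr + size
        if next_addr % 8 == 0:          # this write filled byte 7 of the quadword
            groups.append(current)
            current = []
            next_addr = None
    if current:
        groups.append(current)
    return groups

print(wt_combine_groups([(0, 4), (4, 4), (8, 2), (12, 2)]))
```

In the sample sequence the first two writes combine into one full quadword, while the last two stay separate because the write at offset 12 does not continue where the write at offset 8 ended.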
Table 9. Write Combining Completion Events

Event / Comment

Non-WB write outside of current buffer
    The first non-WB write to a different cache block address
    closes combining for previous writes. WB writes do not affect
    write combining. Only one line-sized buffer can be open for
    write combining at a time. Once a buffer is closed for write
    combining, it cannot be reopened for write combining.
I/O Read or Write
    Any IN/INS or OUT/OUTS instruction closes combining. The
    implied memory type for all IN/OUT instructions is UC,
    which cannot be combined.
Serializing instructions
    Any serializing instruction closes combining. These
    instructions include: MOVCRx, MOVDRx, WRMSR, INVD,
    INVLPG, WBINVD, LGDT, LLDT, LIDT, LTR, CPUID, IRET, RSM,
    INIT, HALT.
Flushing instructions
    Any flush instruction causes the WC to complete.
Locks
    Any instruction or processor operation that requires a cache
    or bus lock closes write combining before starting the lock.
    Writes within a lock can be combined.
Uncacheable Read
    A UC read closes write combining. A WC read closes
    combining only if a cache block address match occurs
    between the WC read and a write in the write buffer.
Different memory type
    Any WT write while write combining for WC memory or any
    WC write while write combining for WT memory closes write
    combining.
Buffer full
    Write combining is closed if all 64 bytes of the write buffer
    are valid.
WT time-out
    If 16 processor clocks have passed since the most recent
    write for WT write combining, write combining is closed.
    There is no time-out for WC write combining.
WT write fills byte 7
    Write combining is closed if a write fills the most significant
    byte of a quadword, which includes writes that are
    misaligned across a quadword boundary. In the misaligned
    case, combining is closed by the LS part of the misaligned
    write and combining is opened by the MS part of the
    misaligned store.
WT Nonsequential
    If a subsequent WT write is not in ascending sequential
    order, the write combining completes. WC writes have no
    addressing constraints within the 64-byte line being
    combined.
TLB AD bit set
    Write combining is closed whenever a TLB reload sets the
    accessed (A) or dirty (D) bits of a Pde or Pte.

Sending Write-Buffer Data to the System
Once write combining is closed for a 64-byte write buffer, the
contents of the write buffer are eligible to be sent to the system
as one or more AMD Athlon system bus commands. Table 10
lists the rules for determining what system commands are
issued for a write buffer, as a function of the alignment of the
valid buffer data.
Table 10. AMD Athlon™ System Bus Commands Generation Rules
1. If all eight quadwords are either full (8 bytes valid) or empty (0 bytes valid), a
Write-Quadword system command is issued, with an 8-byte mask representing
which of the eight quadwords are valid. If this case is true, do not proceed to the
next rule.
2. If all longwords are either full (4 bytes valid) or empty (0 bytes valid), a
Write-Longword system command is issued for each 32-byte buffer half that
contains at least one valid longword. The mask for each Write-Longword system
command indicates which longwords are valid in that 32-byte write buffer half. If
this case is true, do not proceed to the next rule.
3. Sequence through all eight quadwords of the write buffer, from quadword 0 to
quadword 7. Skip over a quadword if no bytes are valid. Issue a Write-Quad system
command if all bytes are valid, asserting one mask bit. Issue a Write-Longword
system command if the quadword contains one aligned longword, asserting one
mask bit. Otherwise, issue a Write-Byte system command if there is at least one
valid byte, asserting a mask bit for each valid byte.
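The three rules in Table 10 can be modeled as a function of the byte-valid pattern of the write buffer. This is a sketch for reasoning about which store patterns produce the fewest bus commands, not a description of the hardware; the function name, the tuple format, the handling of an empty buffer, and the mask encodings used under rule 3 are assumptions:

```python
QUADWORDS = 8    # 8-byte units in the 64-byte write buffer
LONGWORDS = 16   # 4-byte units in the 64-byte write buffer

def bus_commands(valid):
    """Return (command, unit, mask) tuples for a 64-bit byte-valid bitmap.

    Bit i of `valid` is 1 when byte i of the write buffer holds valid data.
    """
    quads = [(valid >> (8 * q)) & 0xFF for q in range(QUADWORDS)]
    longs = [(valid >> (4 * n)) & 0xF for n in range(LONGWORDS)]
    if valid == 0:
        return []  # assumption: an empty buffer sends nothing

    # Rule 1: every quadword is full or empty -> one command, one mask bit per quadword.
    if all(q in (0x00, 0xFF) for q in quads):
        mask = sum(1 << i for i, q in enumerate(quads) if q == 0xFF)
        return [("Write-Quadword", None, mask)]

    # Rule 2: every longword is full or empty -> one command per non-empty 32-byte half.
    if all(n in (0x0, 0xF) for n in longs):
        commands = []
        for half in range(2):
            mask = sum(1 << i for i in range(8) if longs[8 * half + i] == 0xF)
            if mask:
                commands.append(("Write-Longword", half, mask))
        return commands

    # Rule 3: walk the quadwords from 0 to 7 and pick the smallest fitting command.
    commands = []
    for index, q in enumerate(quads):
        if q == 0x00:
            continue
        if q == 0xFF:
            commands.append(("Write-Quadword", index, 0b1))
        elif q in (0x0F, 0xF0):  # exactly one aligned longword is valid
            commands.append(("Write-Longword", index, 0b01 if q == 0x0F else 0b10))
        else:
            commands.append(("Write-Byte", index, q))
    return commands

print(bus_commands((1 << 64) - 1))       # completely filled buffer
print(bus_commands(0xFF | (0x03 << 8)))  # one full quadword plus two stray bytes
```

The comparison shows why filling whole quadwords or longwords matters: a completely valid buffer needs a single command, while stray bytes force the buffer down to rule 3 and one command per touched quadword.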
Appendix D
Performance-Monitoring
Counters
This appendix describes how to use the AMD Athlon™ processor
performance monitoring counters.
Overview
The AMD Athlon processor provides four 48-bit performance
counters, which allow four types of events to be monitored
simultaneously. These counters can either count events or
measure duration. When counting events, a counter is
incremented each time a specified event takes place or a
specified number of events takes place. When measuring
duration, a counter counts the number of processor clocks that
occur while a specified condition is true. The counters can
count events or measure durations that occur at any privilege
level. Table 11 on page 164 lists the events that can be counted
with the performance monitoring counters.
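Because the counters are 48 bits wide, a measurement taken as the difference of two readings has to allow for the counter wrapping between them. A minimal sketch, assuming the counter wraps at most once between the two readings; the names are illustrative:

```python
COUNTER_BITS = 48  # counter width stated above

def counter_delta(start, end):
    """Events counted between two readings, allowing for a single wraparound."""
    return (end - start) % (1 << COUNTER_BITS)

print(counter_delta(1_000, 5_000))
print(counter_delta((1 << COUNTER_BITS) - 1, 4))  # reading taken just after a wrap
```

Taking the difference modulo 2^48 gives the same result whether or not the wrap occurred.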
Performance Counter Usage
The performance monitoring counters are supported by eight
MSRs — PerfEvtSel[3:0] are the performance event select
MSRs, and PerfCtr[3:0] are the performance counter MSRs.
These registers can be read from and written to using the
RDMSR and WRMSR instructions, respectively.
The PerfEvtSel[3:0] registers are located at MSR locations
C001_0000h to C001_0003h. The PerfCtr[3:0] registers are
located at MSR locations C001_0004h to C001_0007h and are
64-bit registers.
The PerfEvtSel[3:0] registers can be accessed using the
RDMSR/WRMSR instructions only when operating at privilege
level 0. The PerfCtr[3:0] MSRs can be read from any privilege
level using the RDPMC (read performance-monitoring
counters) instruction, if the PCE flag in CR4 is set.
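The MSR addresses above form two consecutive blocks of four, so the address for counter n can be computed rather than tabulated. A small sketch; the constant and function names are illustrative assumptions, and the base addresses are the ones stated above:

```python
PERF_EVT_SEL_BASE = 0xC0010000  # PerfEvtSel0
PERF_CTR_BASE = 0xC0010004      # PerfCtr0
NUM_COUNTERS = 4

def perf_evt_sel_msr(n):
    """MSR address of PerfEvtSel[n] for n in 0..3."""
    if not 0 <= n < NUM_COUNTERS:
        raise ValueError("counter index must be 0 to 3")
    return PERF_EVT_SEL_BASE + n

def perf_ctr_msr(n):
    """MSR address of PerfCtr[n] for n in 0..3."""
    if not 0 <= n < NUM_COUNTERS:
        raise ValueError("counter index must be 0 to 3")
    return PERF_CTR_BASE + n

for n in range(NUM_COUNTERS):
    print(n, hex(perf_evt_sel_msr(n)), hex(perf_ctr_msr(n)))
```

These values are what a privilege-level-0 driver would place in ECX before RDMSR or WRMSR.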
PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h–C001_0003h)
The PerfEvtSel[3:0] MSRs, shown in Figure 11, control the
operation of the performance-monitoring counters, with one
register used to set up each counter. These MSRs specify the
events to be counted, how they should be counted, and the
privilege levels at which counting should take place. The
functions of the flags and fields within these MSRs are
described in the following sections.
Bits     Field
31–24    Counter Mask
23       INV
22       EN
21       Reserved
20       INT
19       PC
18       E
17       OS
16       USR
15–8     Unit Mask
7–0      Event Mask

Symbol   Description             Bit
USR      User Mode               16
OS       Operating System Mode   17
E        Edge Detect             18
PC       Pin Control             19
INT      APIC Interrupt Enable   20
EN       Enable Counter          22
INV      Invert Mask             23
Figure 11. PerfEvtSel[3:0] Registers
Event Select Field (Bits 0–7)
These bits are used to select the event to be monitored. See
Table 11 on page 164 for a list of event masks and their 8-bit
codes.
Unit Mask Field (Bits 8–15)
These bits are used to further qualify the event selected in the
event select field. For example, for some cache events, the mask
is used as a MESI-protocol qualifier of cache states. See
Table 11 on page 164 for a list of unit masks and their 8-bit
codes.
USR (User Mode) Flag (Bit 16)
Events are counted only when the processor is operating at
privilege levels 1, 2 or 3. This flag can be used in conjunction
with the OS flag.
OS (Operating System Mode) Flag (Bit 17)
Events are counted only when the processor is operating at
privilege level 0. This flag can be used in conjunction with the
USR flag.
E (Edge Detect) Flag (Bit 18)
When this flag is set, edge detection of events is enabled. The
processor counts the number of negated-to-asserted transitions
of any condition that can be expressed by the other fields. The
mechanism is limited in that it does not permit back-to-back
assertions to be distinguished. This mechanism allows software
to measure not only the fraction of time spent in a particular
state, but also the average length of time spent in such a state
(for example, the time spent waiting for an interrupt to be
serviced).
PC (Pin Control) Flag (Bit 19)
When this flag is set, the processor toggles the PMi pins when
the counter overflows. When this flag is clear, the processor
toggles the PMi pins and increments the counter when
performance monitoring events occur. The toggling of a pin is
defined as assertion of the pin for one bus clock followed by
negation.
INT (APIC Interrupt Enable) Flag (Bit 20)
When this flag is set, the processor generates an interrupt
through its local APIC on counter overflow.
EN (Enable Counter) Flag (Bit 22)
This flag enables/disables the PerfEvtSeln MSR. When set,
performance counting is enabled for this counter. When clear,
this counter is disabled.
INV (Invert) Flag (Bit 23)
By inverting the Counter Mask Field, this flag inverts the result
of the counter comparison, allowing both greater than and less
than comparisons.
Counter Mask Field (Bits 31–24)
For events which can have multiple occurrences within one
clock, this field is used to set a threshold. If the field is non-zero,
the counter increments each time the number of events is
greater than or equal to the counter mask. Otherwise if this
field is zero, then the counter increments by the total number of
events.
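The field layout of the PerfEvtSel[3:0] MSRs (Figure 11) can be illustrated with a short sketch that packs the fields into the low 32 bits of a register value. This is only an illustration in Python, not driver code: the names perf_evt_sel, PERF_EVT_SEL_BASE, and the flag constants are assumptions made for the example, and the value would still have to be written with WRMSR at privilege level 0.

```python
# Bit positions follow Figure 11. All names here are illustrative.
USR = 1 << 16   # count at privilege levels 1, 2 and 3
OS = 1 << 17    # count at privilege level 0
EDGE = 1 << 18  # edge detect
PC = 1 << 19    # pin control
INT = 1 << 20   # APIC interrupt on counter overflow
EN = 1 << 22    # enable counter
INV = 1 << 23   # invert the counter-mask comparison

PERF_EVT_SEL_BASE = 0xC0010000  # PerfEvtSel0; PerfEvtSel1-3 follow


def perf_evt_sel(event, unit_mask=0, flags=0, counter_mask=0):
    """Pack the PerfEvtSel fields into a 32-bit value."""
    for name, field in (("event", event), ("unit_mask", unit_mask),
                        ("counter_mask", counter_mask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(name + " is an 8-bit field")
    if flags & ~(USR | OS | EDGE | PC | INT | EN | INV):
        raise ValueError("flags touch reserved bits")
    return event | (unit_mask << 8) | flags | (counter_mask << 24)


# Data cache misses (event 41h), counted in user and OS mode.
dc_misses = perf_evt_sel(0x41, flags=USR | OS | EN)

# Data cache refills (event 42h) restricted to the Modified state
# (unit mask xxx1_xxxxb), counter enabled.
dc_refills_m = perf_evt_sel(0x42, unit_mask=0x10, flags=EN)
```

Keeping the flags as separate constants makes it easy to combine USR and OS, which the text allows, and the range checks reject values that would spill into a neighboring field.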
Table 11. Performance-Monitoring Counters

Event    Source
Number   Unit     Event Description
                    Notes / Unit Mask (bits 15–8)

20h      LS       Segment register loads
                    1xxx_xxxxb = reserved
                    x1xx_xxxxb = HS
                    xx1x_xxxxb = GS
                    xxx1_xxxxb = FS
                    xxxx_1xxxb = DS
                    xxxx_x1xxb = SS
                    xxxx_xx1xb = CS
                    xxxx_xxx1b = ES
21h      LS       Stores to active instruction stream
40h      DC       Data cache accesses
41h      DC       Data cache misses
42h      DC       Data cache refills
                    xxx1_xxxxb = Modified (M)
                    xxxx_1xxxb = Owner (O)
                    xxxx_x1xxb = Exclusive (E)
                    xxxx_xx1xb = Shared (S)
                    xxxx_xxx1b = Invalid (I)
43h      DC       Data cache refills from system
                    xxx1_xxxxb = Modified (M)
                    xxxx_1xxxb = Owner (O)
                    xxxx_x1xxb = Exclusive (E)
                    xxxx_xx1xb = Shared (S)
                    xxxx_xxx1b = Invalid (I)
44h      DC       Data cache writebacks
                    xxx1_xxxxb = Modified (M)
                    xxxx_1xxxb = Owner (O)
                    xxxx_x1xxb = Exclusive (E)
                    xxxx_xx1xb = Shared (S)
                    xxxx_xxx1b = Invalid (I)
45h      DC       L1 DTLB misses and L2 DTLB hits
46h      DC       L1 and L2 DTLB misses
47h      DC       Misaligned data references
64h      BU       DRAM system requests
65h      BU       System requests with the selected type
                    1xxx_xxxxb = reserved
                    x1xx_xxxxb = WB
                    xx1x_xxxxb = WP
                    xxx1_xxxxb = WT
                    bits 11–10 = reserved
                    xxxx_xx1xb = WC
                    xxxx_xxx1b = UC
73h      BU       Snoop hits
                    bits 15–11 = reserved
                    xxxx_x1xxb = L2 (L2 hit and no DC hit)
                    xxxx_xx1xb = Data cache
                    xxxx_xxx1b = Instruction cache
74h      BU       Single-bit ECC errors detected/corrected
                    bits 15–10 = reserved
                    xxxx_xx1xb = L2 single bit error
                    xxxx_xxx1b = System single bit error
75h      BU       Internal cache line invalidates
                    bits 15–12 = reserved
                    xxxx_1xxxb = I invalidates D
                    xxxx_x1xxb = I invalidates I
                    xxxx_xx1xb = D invalidates D
                    xxxx_xxx1b = D invalidates I
76h      BU       Cycles processor is running (not in HLT or STPCLK)
79h      BU       L2 requests
                    1xxx_xxxxb = Data block write from the L2 (TLB RMW)
                    x1xx_xxxxb = Data block write from the DC
                    xx1x_xxxxb = Data block write from the system
                    xxx1_xxxxb = Data block read data store
                    xxxx_1xxxb = Data block read data load
                    xxxx_x1xxb = Data block read instruction
                    xxxx_xx1xb = Tag write
                    xxxx_xxx1b = Tag read
7Ah      BU       Cycles that at least one fill request waited to use the L2
80h      PC       Instruction cache fetches
81h      PC       Instruction cache misses
82h      PC       Instruction cache refills from L2
83h      PC       Instruction cache refills from system
84h      PC       L1 ITLB misses (and L2 ITLB hits)
85h      PC       (L1 and) L2 ITLB misses
86h      PC       Snoop resyncs
87h      PC       Instruction fetch stall cycles
88h      PC       Return stack hits
89h      PC       Return stack overflow
C0h      FR       Retired instructions (includes exceptions, interrupts, resyncs)
C1h      FR       Retired Ops
C2h      FR       Retired branches (conditional, unconditional, exceptions, interrupts)
C3h      FR       Retired branches mispredicted
C4h      FR       Retired taken branches
C5h      FR       Retired taken branches mispredicted
C6h      FR       Retired far control transfers
C8h      FR       Retired near returns
C9h      FR       Retired near returns mispredicted
CAh      FR       Retired indirect branches with target mispredicted
CDh      FR       Interrupts masked cycles (IF=0)
CEh      FR       Interrupts masked while pending cycles (INTR while IF=0)
CFh      FR       Number of taken hardware interrupts
D0h      FR       Instruction decoder empty
D1h      FR       Dispatch stalls (event masks D2h through DAh below combined)
D2h      FR       Branch abort to retire
D3h      FR       Serialize
D4h      FR       Segment load stall
D5h      FR       ICU full
D6h      FR       Reservation stations full
D7h      FR       FPU full
D8h      FR       LS full
D9h      FR       All quiet stall
DAh      FR       Far transfer or resync branch pending
DCh      FR       Breakpoint matches for DR0
DDh      FR       Breakpoint matches for DR1
DEh      FR       Breakpoint matches for DR2
DFh      FR       Breakpoint matches for DR3
PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h)
The performance-counter MSRs contain the event or duration
counts for the selected events being counted. The RDPMC
instruction can be used by programs or procedures running at
any privilege level and in virtual-8086 mode to read these
counters. The PCE flag in control register CR4 (bit 8) allows the
use of this instruction to be restricted to only programs and
procedures running at privilege level 0.
The RDPMC instruction is not serializing or ordered with other
instructions. Therefore, it does not necessarily wait until all
previous instructions have been executed before reading the
counter. Similarly, subsequent instructions can begin execution
before the RDPMC instruction operation is performed.
Only the operating system, executing at privilege level 0, can
directly manipulate the performance counters, using the
RDMSR and WRMSR instructions. A secure operating system
would clear the PCE flag during system initialization, which
disables direct user access to the performance-monitoring
counters but provides a user-accessible programming interface
that emulates the RDPMC instruction.
The WRMSR instruction cannot arbitrarily write to the
performance-monitoring counter MSRs (PerfCtr[3:0]). Instead,
the value should be treated as 64-bit sign extended, which
allows writing both positive and negative values to the
performance counters. The performance counters may be
initialized using a 64-bit signed integer in the range -2^47 to
+2^47. Negative values are useful for generating an interrupt
after a specific number of events.
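As a rough illustration of the negative-preset idea, the sketch below computes the EDX:EAX pair that a WRMSR to a PerfCtr MSR would carry so that the counter starts at -N. It assumes that overflow occurs when the count passes from -1 to 0, which is the usual reading of the paragraph above but is not spelled out there; the function name overflow_preset is hypothetical.

```python
MASK64 = (1 << 64) - 1


def overflow_preset(n_events):
    """Return (edx, eax) encoding -n_events as a 64-bit signed value."""
    if not 1 <= n_events <= 1 << 47:
        raise ValueError("preset must lie within the -2^47 limit")
    value = (-n_events) & MASK64         # two's complement, sign extended
    return value >> 32, value & 0xFFFFFFFF


# Request an overflow after 1000 events.
edx, eax = overflow_preset(1000)
```

Splitting the value into two 32-bit halves mirrors how WRMSR takes its operand in EDX:EAX.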
Starting and Stopping the Performance-Monitoring Counters
The performance-monitoring counters are started by writing
valid setup information in one or more of the PerfEvtSel[3:0]
MSRs and setting the enable counters flag in the PerfEvtSel0
MSR. If the setup is valid, the counters begin counting
following the execution of a WRMSR instruction, which sets the
enable counter flag. The counters can be stopped by clearing
the enable counters flag or by clearing all the bits in the
PerfEvtSel[3:0] MSRs.
Event and Time-Stamp Monitoring Software
For applications to use the performance-monitoring counters
and time-stamp counter, the operating system needs to provide
an event-monitoring device driver. This driver should include
procedures for handling the following operations:
■ Feature checking
■ Initialize and start counters
■ Stop counters
■ Read the event counters
■ Reading of the time stamp counter
The event monitor feature determination procedure must
determine whether the current processor supports the
performance-monitoring counters and time-stamp counter. This
procedure compares the family and model of the processor
returned by the CPUID instruction with those of processors
known to support performance monitoring. In addition, the
procedure checks the MSR and TSC flags returned to register
EDX by the CPUID instruction to determine if the MSRs and
the RDTSC instruction are supported.
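The feature check described above can be sketched as a pure function over the values that CPUID function 1 returns in EAX and EDX. The bit positions used (TSC in EDX bit 4, MSR in EDX bit 5, family in EAX bits 11:8, model in EAX bits 7:4) are the standard CPUID layout rather than something stated in this appendix, and the set of known family/model pairs is a placeholder that a real driver would have to fill in.

```python
CPUID_EDX_TSC = 1 << 4  # RDTSC supported
CPUID_EDX_MSR = 1 << 5  # RDMSR/WRMSR supported


def supports_event_monitoring(eax, edx, known_processors):
    """Check family/model against a known list, then the MSR and TSC flags."""
    family = (eax >> 8) & 0xF
    model = (eax >> 4) & 0xF
    if (family, model) not in known_processors:
        return False
    return bool(edx & CPUID_EDX_MSR) and bool(edx & CPUID_EDX_TSC)


# Hypothetical list of supported (family, model) pairs.
KNOWN = {(6, 1), (6, 2)}

ok = supports_event_monitoring(0x612, CPUID_EDX_MSR | CPUID_EDX_TSC, KNOWN)
```

Passing the register values in as arguments keeps the check testable without executing CPUID.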
The initialization and start counters procedure sets the
PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be
counted and the method used to count them and initializes the
counter MSRs (PerfCtr[3:0]) to starting counts. The stop
counters procedure stops the performance counters. (See
“Starting and Stopping the Performance-Monitoring Counters”
on page 168 for more information about starting and stopping
the counters.)
The read counters procedure reads the values in the
PerfCtr[3:0] MSRs, and a read time-stamp counter procedure
reads the time-stamp counter. These procedures can be used
instead of enabling the RDTSC and RDPMC instructions, which
allow application code to read the counters directly.
Monitoring Counter Overflow
The AMD Athlon processor provides the option of generating a
debug interrupt when a performance-monitoring counter
overflows. This mechanism is enabled by setting the interrupt
enable flag in one of the PerfEvtSel[3:0] MSRs. The primary
use of this option is for statistical performance sampling.
To use this option, the operating system should do the
following:
■ Provide an interrupt routine for handling the counter
  overflow as an APIC interrupt
■ Provide an entry in the IDT that points to a stub exception
  handler that returns without executing any instructions
■ Provide an event monitor driver that provides the actual
  interrupt handler and modifies the reserved IDT entry to
  point to its interrupt routine
When interrupted by a counter overflow, the interrupt handler
needs to perform the following actions:
■ Save the instruction pointer (EIP register), code segment
  selector, TSS segment selector, counter values and other
  relevant information at the time of the interrupt
■ Reset the counter to its initial setting and return from the
  interrupt
An event monitor application utility or another application
program can read the collected performance information of the
profiled application.
Appendix E
Programming the MTRR and PAT
Introduction
The AMD Athlon™ processor includes a set of memory type
and range registers (MTRRs) to control cacheability and access
to specified memory regions. The processor also includes the
Page Attribute Table (PAT) for defining attributes of pages. This
chapter documents the use and capabilities of these features.
The purpose of the MTRRs is to provide system software with
the ability to manage the memory mapping of the hardware.
Both the BIOS software and operating systems utilize this
capability. The AMD Athlon processor’s implementation is
compatible with the Pentium® II. Prior to the MTRR mechanism,
chipsets usually provided this capability.
Memory Type Range Register (MTRR) Mechanism
The memory type and range registers allow the processor to
determine cacheability of various memory locations prior to
bus access and to optimize access to the memory system. The
AMD Athlon processor implements the MTRR programming
model in a manner compatible with Pentium II.
There are two types of address ranges: fixed and variable. (See
Figure 12.) For each address range, there is a memory type. For
each 4K, 16K or 64K segment within the first 1 Mbyte of
memory, there is one fixed address MTRR. The fixed address
ranges all exist in the first 1 Mbyte. There are eight variable
address ranges above 1 Mbyte. Each is programmed to a
specific memory starting address, size and alignment. If a
variable range overlaps the lower 1 Mbyte and the fixed
MTRRs are enabled, then the fixed-memory type dominates.
The address regions have the following priority with respect to
each other:
1. Fixed address ranges
2. Variable address ranges
3. Default memory type (UC at reset)
FFFFFFFFh
            SMM TSeg
            0-8 Variable Ranges (2^12 to 2^32)
100000h
            64 Fixed Ranges (4 Kbytes each)     256 Kbytes
C0000h
            16 Fixed Ranges (16 Kbytes each)    256 Kbytes
80000h
            8 Fixed Ranges (64 Kbytes each)     512 Kbytes
0
Figure 12. MTRR Mapping of Physical Memory
Memory Types
Five standard memory types are defined by the AMD Athlon
processor: writethrough (WT), writeback (WB), write-protect
(WP), write-combining (WC), and uncacheable (UC). These are
described in Table 12 on page 174.
Table 12. Memory Type Encodings

Type Number   Type Name              Type Description
00h           UC (Uncacheable)       Uncacheable for reads or writes. Cannot be combined. Must be
                                     non-speculative for reads or writes.
01h           WC (Write-Combining)   Uncacheable for reads or writes. Can be combined. Can be
                                     speculative for reads. Writes can never be speculative.
04h           WT (Writethrough)      Reads allocate on a miss, but only to the S-state. Writes do not
                                     allocate on a miss and, for a hit, writes update the cached entry
                                     and main memory.
05h           WP (Write-Protect)     WP is functionally the same as the WT memory type, except stores
                                     do not actually modify cached data and do not cause an exception.
06h           WB (Writeback)         Reads will allocate on a miss, and will allocate to:
                                       S state if returned with a ReadDataShared command.
                                       M state if returned with a ReadDataDirty command.
                                     Writes allocate to the M state, if the read allows the line to be
                                     marked E.

MTRR Capability Register Format
The MTRR capability register is a read-only register that
defines the specific MTRR capability of the processor and is
defined as follows.
Symbol   Description                       Bits
WC       Write Combining Memory Type       10
FIX      Fixed Range Registers             8
VCNT     No. of Variable Range Registers   7–0
Bits 63–11 and bit 9 are reserved.
Figure 13. MTRR Capability Register Format
For the AMD Athlon processor, the MTRR capability register
should contain 0508h (write-combining, fixed MTRRs
supported, and eight variable MTRRs defined).
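To make the 0508h value concrete, the following sketch decodes the three fields named in Figure 13. The function name decode_mtrr_cap and the dictionary keys are illustrative choices, not part of any AMD interface.

```python
def decode_mtrr_cap(value):
    """Split an MTRR capability register value into its fields."""
    return {
        "wc": bool(value & (1 << 10)),   # write-combining type supported
        "fix": bool(value & (1 << 8)),   # fixed-range registers supported
        "vcnt": value & 0xFF,            # number of variable-range registers
    }


athlon_cap = decode_mtrr_cap(0x0508)
```

For 0508h this yields write-combining supported, fixed-range registers supported, and a variable-range count of eight, matching the description in the text.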
MTRR Default Type Register Format. The MTRR default type register
is defined as follows.
Symbol   Description           Bits
E        MTRRs Enabled         11
FE       Fixed Range Enabled   10
Type     Default Memory Type   7–0
Bits 63–12 and 9–8 are reserved.
Figure 14. MTRR Default Type Register Format
E
MTRRs are enabled when set. All MTRRs (both fixed and
variable range) are disabled when clear, and all of
physical memory is mapped as uncacheable memory
(reset state = 0).
FE
Fixed-range MTRRs are enabled when set. All fixed-range
MTRRs are disabled when clear. When the fixed-range MTRRs
are enabled and an overlap occurs with a variable-range
MTRR, the fixed-range MTRR takes priority (reset state
= 0).
Type Defines the default memory type (reset state = 0). See
Table 13 for more details.
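The interaction of the E and FE flags can be sketched as follows. The example assumes, based on the description of the E flag, that the fixed ranges only take effect when both E and FE are set; the function name decode_mtrr_def_type is hypothetical.

```python
E_BIT = 1 << 11   # MTRRs enabled
FE_BIT = 1 << 10  # fixed-range MTRRs enabled


def decode_mtrr_def_type(value):
    """Return (mtrrs_enabled, fixed_ranges_active, default_type)."""
    enabled = bool(value & E_BIT)
    # With E clear, every MTRR is disabled, so FE has no effect.
    fixed_active = enabled and bool(value & FE_BIT)
    return enabled, fixed_active, value & 0xFF


reset_state = decode_mtrr_def_type(0x000)      # all disabled, type 0 (UC)
typical = decode_mtrr_def_type(0xC06)          # E and FE set, default WB
```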
Table 13. Standard MTRR Types and Properties

                        Encoding   Internally                Writeback   Allows Speculative   Memory Ordering
Memory Type             in MTRR    Cacheable                 Cacheable   Reads                Model
Uncacheable (UC)        0          No                        No          No                   Strong ordering
Write Combining (WC)    1          No                        No          Yes                  Weak ordering
Reserved                2          -                         -           -                    -
Reserved                3          -                         -           -                    -
Writethrough (WT)       4          Yes                       No          Yes                  Speculative ordering
Write Protected (WP)    5          Yes, reads; No, writes    No          Yes                  Speculative ordering
Writeback (WB)          6          Yes                       Yes         Yes                  Speculative ordering
Reserved                7-255      -                         -           -                    -
Note that if two or more variable memory ranges match then
the interactions are defined as follows:
1. If the memory types are identical, then that memory type is
used.
2. If one or more of the memory types is UC, the UC memory
type is used.
3. If one or more of the memory types is WT and the only other
matching memory type is WB then the WT memory type is
used.
4. Otherwise, if the combination of memory types is not listed
above then the behavior of the processor is undefined.
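The four rules for overlapping variable ranges can be expressed as a small function. This is a sketch of the rules as listed, using the type encodings from Table 12; returning None for the undefined case is a choice made for the example, not documented processor behavior.

```python
UC, WC, WT, WP, WB = 0x00, 0x01, 0x04, 0x05, 0x06


def resolve_overlap(types):
    """Combine the memory types of all matching variable ranges."""
    kinds = set(types)
    if not kinds:
        raise ValueError("at least one matching range is required")
    if len(kinds) == 1:        # rule 1: identical types
        return kinds.pop()
    if UC in kinds:            # rule 2: UC dominates
        return UC
    if kinds == {WT, WB}:      # rule 3: WT wins over WB
        return WT
    return None                # rule 4: undefined behavior


result = resolve_overlap([WB, WT, WB])
```

System software would normally avoid programming ranges that reach the last case at all.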
MTRR Overlapping
The Intel documentation (P6/PII) states that the mapping of
large pages into regions that are mapped with differing memory
types can result in undefined behavior. However, testing shows
that these processors decompose these large pages into 4-Kbyte
pages.
When a large page (2 Mbytes/4 Mbytes) mapping covers a
region that contains more than one memory type (as mapped by
the MTRRs), the AMD Athlon processor does not suppress the
caching of that large page mapping and only caches the
mapping for just that 4-Kbyte piece in the 4-Kbyte TLB.
Therefore, the AMD Athlon processor does not decompose
large pages under these conditions. The fixed range MTRRs are
not affected by this issue, only the variable range (and MTRR
DefType) registers are affected.
Page Attribute Table (PAT)
The Page Attribute Table (PAT) is an extension of the page
table entry format, which allows the specification of memory
types to regions of physical memory based on the linear
address. The PAT provides the same functionality as MTRRs
with the flexibility of the page tables. It allows the operating
system and applications to determine the desired memory
type for optimal performance. PAT support is detected in the
feature flags (bit 16) of the CPUID instruction.
MSR Access
The PAT is located in a 64-bit MSR at location 277h. It is
illustrated in Figure 15. Each of the eight PAn fields can contain
the memory type encodings as described in Table 12 on
page 174. An attempt to write an undefined memory type
encoding into the PAT will generate a GP fault.
Field   Bits
PA7     58–56
PA6     50–48
PA5     42–40
PA4     34–32
PA3     26–24
PA2     18–16
PA1     10–8
PA0     2–0
All other bits are reserved.
Figure 15. Page Attribute Table (MSR 277h)
Accessing the PAT
A 3-bit index consisting of the PATi, PCD, and PWT bits of the
page table entry is used to select one of the eight PAT register
fields to acquire the memory type for the desired page (PATi is
defined as bit 7 for 4-Kbyte PTEs and bit 12 for PDEs which
map to 2-Mbyte or 4-Mbyte pages). The memory type from the
PAT is used instead of the PCD and PWT for the effective
memory type.
A 2-bit index consisting of the PCD and PWT bits of the page table
entry is used to select one of four PAT register fields when PAE
(physical address extensions) is enabled, or when the PDE doesn’t
describe a large page. In the latter case, the PATi bit for a PTE
(bit 7) corresponds to the page size bit in a PDE. Therefore, the
OS should only use PA0-3 when setting the memory type for a
page table that is also used as a page directory. See Table 14 on
page 178.
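The 3-bit index selection can be sketched as below. The PATi positions come from the text; the PWT (bit 3) and PCD (bit 4) positions are the standard x86 page-table layout and are not restated in this appendix. The PAT MSR value used in the example is an arbitrary illustration, not the reset value, and both function names are hypothetical.

```python
def pat_index(entry, large_page=False):
    """Build the 3-bit PAT index from a page table or directory entry."""
    pwt = (entry >> 3) & 1
    pcd = (entry >> 4) & 1
    # PATi is bit 7 in a 4-Kbyte PTE and bit 12 in a large-page PDE.
    pati = (entry >> (12 if large_page else 7)) & 1
    return (pati << 2) | (pcd << 1) | pwt


def pat_memory_type(pat_msr, index):
    """Extract field PAn (3 bits at bit position 8*n) from the PAT MSR."""
    return (pat_msr >> (8 * index)) & 0x7


example_pat = 0x0605040100000406        # arbitrary illustrative contents
idx = pat_index(0x88)                   # PATi=1, PCD=0, PWT=1
mem_type = pat_memory_type(example_pat, idx)
```

For entries where only the 2-bit index applies, the same helpers can be used with the PATi bit masked off so that only PA0 to PA3 are selected.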
Table 14. PATi 3-Bit Encodings

PATi   PCD   PWT   PAT Entry   Reset Value
0      0     0     0
0      0     1     1
0      1     0     2
0      1     1     3
1      0     0     4
1      0     1     5
1      1     0     6
1      1     1     7

MTRRs and PAT
The processor contains MTRRs as described earlier which
provide a limited way of assigning memory types to specific
regions. However, the page tables allow memory types to be
assigned to the pages used for linear to physical translation.
The memory types as defined by the PAT and MTRRs are combined
to determine the effective memory type as listed in Table 15
and Table 16. Shaded areas indicate reserved settings.
Table 15. Effective Memory Type Based on PAT and MTRRs

PAT Memory Type   MTRR Memory Type   Effective Memory Type
UC-               WB, WT, WP, WC     UC-Page
                  UC                 UC-MTRR
WC                x                  WC
WT                WB, WT             WT
                  UC                 UC
                  WC                 CD
                  WP                 CD
WP                WB, WP             WP
                  UC                 UC-MTRR
                  WC, WT             CD
WB                WB                 WB
                  UC                 UC
                  WC                 WC
                  WT                 WT
                  WP                 WP
Notes:
1. UC-MTRR indicates that the UC attribute came from the MTRRs and that the processor caches
should not be probed for performance reasons.
2. UC-Page indicates that the UC attribute came from the page tables and that the processor
caches must be probed due to page aliasing.
3. All reserved combinations default to CD.
Table 16. Final Output Memory Types
WrMem
Effective. MType
forceCD5
RdMem
WrMem
MemType
●
●
UC
-
●
●
UC
1
●
●
CD
-
●
●
CD
1
●
●
WC
-
●
●
WC
1
●
●
WT
-
●
●
WT
1
●
●
WP
-
●
●
WP
1
●
●
WB
-
●
●
WB
●
●
-
●
●
●
CD
●
UC
-
●
UC
●
CD
-
●
CD
●
WC
-
●
WC
●
WT
-
●
CD
3
●
WP
-
●
WP
1
●
WB
-
●
CD
3
●
-
●
●
CD
2
●
UC
-
●
UC
●
CD
-
●
CD
●
WC
-
●
WC
●
WT
-
●
CD
6
●
WP
-
●
CD
6
●
WB
-
●
CD
6
●
-
●
●
CD
2
●
UC
-
●
UC
●
180
Output Memory Type
RdMem
Input Memory Type
AMD-751
Note
1, 2
Page Attribute Table (PAT)
AMD Athlon™ Processor x86 Code Optimization
22007E/0—November 1999
Table 16. Final Output Memory Types (Continued)
WrMem
Effective. MType
forceCD5
RdMem
WrMem
MemType
Output Memory Type
RdMem
Input Memory Type
●
●
CD
-
●
●
CD
●
●
WC
-
●
●
WC
●
●
WT
-
●
●
WT
●
●
WP
-
●
●
WP
●
●
WB
-
●
●
WT
4
●
●
-
●
●
●
CD
2
AMD-751
Note
Notes:
1. WP is not functional for RdMem/WrMem.
2. ForceCD must cause the MTRR memory type to be ignored in order to avoid x’s.
3. D-I should always be WP because the BIOS will only program RdMem-WrIO for WP. CD
is forced to preserve the write-protect intent.
4. Since cached IO lines cannot be copied back to IO, the processor forces WB to WT to
prevent cached IO from going dirty.
5. ForceCD. The memory type is forced CD due to (1) CR0[CD]=1, (2) memory type is for
the ITLB and the I-Cache is disabled or for the DTLB and the D-Cache is disabled, (3)
when clean victims must be written back and RdIO and WrIO and WT, WB, or WP, or
(4) access to Local APIC space.
6. The processor does not support this memory type.
MTRR Fixed-Range Register Format
The memory types defined for memory segments defined in
each of the MTRR fixed-range registers are defined in Table 17
(also see “Standard MTRR Types and Properties” on
page 176).
Table 17. MTRR Fixed Range Register Format

Address Range (in hexadecimal)
63:56         55:48         47:40         39:32         31:24         23:16         15:8          7:0           Register
70000–7FFFF   60000–6FFFF   50000–5FFFF   40000–4FFFF   30000–3FFFF   20000–2FFFF   10000–1FFFF   00000–0FFFF   MTRR_fix64K_00000
9C000–9FFFF   98000–9BFFF   94000–97FFF   90000–93FFF   8C000–8FFFF   88000–8BFFF   84000–87FFF   80000–83FFF   MTRR_fix16K_80000
BC000–BFFFF   B8000–BBFFF   B4000–B7FFF   B0000–B3FFF   AC000–AFFFF   A8000–ABFFF   A4000–A7FFF   A0000–A3FFF   MTRR_fix16K_A0000
C7000–C7FFF   C6000–C6FFF   C5000–C5FFF   C4000–C4FFF   C3000–C3FFF   C2000–C2FFF   C1000–C1FFF   C0000–C0FFF   MTRR_fix4K_C0000
CF000–CFFFF   CE000–CEFFF   CD000–CDFFF   CC000–CCFFF   CB000–CBFFF   CA000–CAFFF   C9000–C9FFF   C8000–C8FFF   MTRR_fix4K_C8000
D7000–D7FFF   D6000–D6FFF   D5000–D5FFF   D4000–D4FFF   D3000–D3FFF   D2000–D2FFF   D1000–D1FFF   D0000–D0FFF   MTRR_fix4K_D0000
DF000–DFFFF   DE000–DEFFF   DD000–DDFFF   DC000–DCFFF   DB000–DBFFF   DA000–DAFFF   D9000–D9FFF   D8000–D8FFF   MTRR_fix4K_D8000
E7000–E7FFF   E6000–E6FFF   E5000–E5FFF   E4000–E4FFF   E3000–E3FFF   E2000–E2FFF   E1000–E1FFF   E0000–E0FFF   MTRR_fix4K_E0000
EF000–EFFFF   EE000–EEFFF   ED000–EDFFF   EC000–ECFFF   EB000–EBFFF   EA000–EAFFF   E9000–E9FFF   E8000–E8FFF   MTRR_fix4K_E8000
F7000–F7FFF   F6000–F6FFF   F5000–F5FFF   F4000–F4FFF   F3000–F3FFF   F2000–F2FFF   F1000–F1FFF   F0000–F0FFF   MTRR_fix4K_F0000
FF000–FFFFF   FE000–FEFFF   FD000–FDFFF   FC000–FCFFF   FB000–FBFFF   FA000–FAFFF   F9000–F9FFF   F8000–F8FFF   MTRR_fix4K_F8000
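The mapping in Table 17 is regular enough to compute. The sketch below finds, for a physical address in the first 1 Mbyte, which fixed-range MSR covers it and which 8-bit field within that MSR holds its memory type. The MSR addresses are the ones listed in Table 18; the function names and the list FIXED_RANGES are illustrative.

```python
# (first address covered, bytes per 8-bit field, MSR address)
FIXED_RANGES = [
    (0x00000, 0x10000, 0x250),   # MTRR_fix64K_00000
    (0x80000, 0x04000, 0x258),   # MTRR_fix16K_80000
    (0xA0000, 0x04000, 0x259),   # MTRR_fix16K_A0000
] + [(0xC0000 + i * 0x8000, 0x1000, 0x268 + i) for i in range(8)]


def fixed_mtrr_field(addr):
    """Return (msr, field); field n occupies bits 8n+7..8n of the MSR."""
    if not 0 <= addr < 0x100000:
        raise ValueError("fixed ranges cover only the first 1 Mbyte")
    for base, granule, msr in reversed(FIXED_RANGES):
        if addr >= base:
            return msr, (addr - base) // granule


def fixed_mtrr_type(msr_value, field):
    """Extract the memory type stored in the given field."""
    return (msr_value >> (8 * field)) & 0xFF


msr, field = fixed_mtrr_field(0xB8000)
```

Address B8000h, for example, resolves to MSR 259h, field 6, which is the 55:48 column of the MTRR_fix16K_A0000 row.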
Variable-Range MTRRs
A variable MTRR can be programmed to start at address
0000_0000h because the fixed MTRRs always override the
variable ones. However, it is recommended not to create an
overlap.
The upper two variable MTRRs should not be used by the BIOS
and are reserved for operating system use.
Variable-Range MTRR Register Format
The variable address range is power of 2 sized and aligned. The
range of supported sizes is from 2^12 to 2^36 in powers of 2. The
AMD Athlon processor does not implement A[35:32].
Symbol          Description                     Bits
Physical Base   Base address in Register Pair   35–12
Type            See MTRR Types and Properties   7–0
Bits 63–36 and 11–8 are reserved.
Figure 16. MTRRphysBasen Register Format
Note: A software attempt to write to reserved bits will generate a
general protection exception.
Physical Base   Specifies a 24-bit value which is extended by 12
                bits to form the base address of the region defined
                in the register pair.
Type            See “Standard MTRR Types and Properties” on
                page 176.
Symbol          Description                            Bits
Physical Mask   24-Bit Mask                            35–12
V               Variable Range Register Pair Enabled   11
                (V = 0 at reset)
Bits 63–36 and 10–0 are reserved.
Figure 17. MTRRphysMaskn Register Format
Note: A software attempt to write to reserved bits will generate a
general protection exception.
Physical Mask   Specifies a 24-bit mask to determine the range of
                the region defined in the register pair.
V               Enables the register pair when set (V = 0 at reset).
Mask values can represent discontinuous ranges (when the
mask defines a lower significant bit as zero and a higher
significant bit as one). In a discontinuous range, the memory
area not mapped by the mask value is set to the default type.
Discontinuous ranges should not be used.
The range that is mapped by the variable-range MTRR register
pair must meet the following range size and alignment rule:
■ Each defined memory range must have a size equal to 2^n (11
  < n < 36).
■ The base address for the address pair must be aligned to a
  similar 2^n boundary.
An example of a variable MTRR pair is as follows:
To map the address range from 8 Mbytes (0080_0000h) to
16 Mbytes (00FF_FFFFh) as writeback memory, the base
register should be loaded with 80_0006h, and the mask
should be loaded with F_FF80_0800h.
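The arithmetic behind this example can be sketched as follows, assuming the 36-bit physical address layout shown in Figures 16 and 17. The function names variable_mtrr_pair and in_variable_range are hypothetical; the sketch only computes register values and does not write any MSR.

```python
ADDR_MASK = (1 << 36) - 1          # 36-bit physical address space
FIELD_MASK = ADDR_MASK & ~0xFFF    # bits 35-12 hold base and mask
V_BIT = 1 << 11                    # register pair enabled


def variable_mtrr_pair(base, size, mem_type):
    """Return (MTRRphysBase, MTRRphysMask) for a power-of-2 range."""
    if size < (1 << 12) or size & (size - 1):
        raise ValueError("size must be a power of 2 of at least 4 Kbytes")
    if base & (size - 1):
        raise ValueError("base must be aligned to the range size")
    if base + size > ADDR_MASK + 1:
        raise ValueError("range exceeds the 36-bit address space")
    phys_base = (base & FIELD_MASK) | mem_type
    phys_mask = (~(size - 1) & FIELD_MASK) | V_BIT
    return phys_base, phys_mask


def in_variable_range(addr, phys_base, phys_mask):
    """An address matches when it agrees with the base under the mask."""
    mask = phys_mask & FIELD_MASK
    return bool(phys_mask & V_BIT) and (addr & mask) == (phys_base & mask)


# 8 Mbytes starting at 8 Mbytes, writeback (type 06h).
pair = variable_mtrr_pair(0x00800000, 0x00800000, 0x06)
```

The result reproduces the base and mask values given in the example above, and the alignment checks reject the discontinuous or misaligned ranges that the text warns against.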
MTRR MSR Format
This table defines the model-specific registers related to the
memory type range register implementation. All MTRRs are
defined to be 64 bits.
Table 18. MTRR-Related Model-Specific Register (MSR) Map

Register Address   Register Name      Description
0FEh               MTRRcap            See “MTRR Capability Register Format” on page 174.
200h               MTRR Base0         See “MTRRphysBasen Register Format” on page 183.
201h               MTRR Mask0         See “MTRRphysMaskn Register Format” on page 184.
202h               MTRR Base1
203h               MTRR Mask1
204h               MTRR Base2
205h               MTRR Mask2
206h               MTRR Base3
207h               MTRR Mask3
208h               MTRR Base4
209h               MTRR Mask4
20Ah               MTRR Base5
20Bh               MTRR Mask5
20Ch               MTRR Base6
20Dh               MTRR Mask6
20Eh               MTRR Base7
20Fh               MTRR Mask7
250h               MTRRFIX64k_00000   See “MTRR Fixed-Range Register Format” on page 182.
258h               MTRRFIX16k_80000
259h               MTRRFIX16k_A0000
268h               MTRRFIX4k_C0000
269h               MTRRFIX4k_C8000
26Ah               MTRRFIX4k_D0000
26Bh               MTRRFIX4k_D8000
26Ch               MTRRFIX4k_E0000
26Dh               MTRRFIX4k_E8000
26Eh               MTRRFIX4k_F0000
26Fh               MTRRFIX4k_F8000
2FFh               MTRRdefType        See “MTRR Default Type Register Format” on page 175.
Appendix F
Instruction Dispatch and Execution Resources
This chapter describes the MacroOPs generated by each
decoded instruction, along with the relative static execution
latencies of these groups of operations. Tables 19 through 24
starting on page 188 define the integer, MMX™, MMX
extensions, floating-point, 3DNow!™, and 3DNow! extensions
instructions, respectively.
The first column in these tables indicates the instruction
mnemonic and operand types with the following notations:
■ reg8—byte integer register defined by instruction byte(s) or
  bits 5, 4, and 3 of the modR/M byte
■ mreg8—byte integer register defined by bits 2, 1, and 0 of
  the modR/M byte
■ reg16/32—word and doubleword integer register defined by
  instruction byte(s) or bits 5, 4, and 3 of the modR/M byte
■ mreg16/32—word and doubleword integer register defined
  by bits 2, 1, and 0 of the modR/M byte
■ mem8—byte memory location
■ mem16/32—word or doubleword memory location
■ mem32/48—doubleword or 6-byte memory location
■ mem48—48-bit integer value in memory
■ mem64—64-bit value in memory
■ imm8/16/32—8-bit, 16-bit or 32-bit immediate value
■ disp8—8-bit displacement value
■ disp16/32—16-bit or 32-bit displacement value
■ disp32/48—32-bit or 48-bit displacement value
■ eXX—register width depending on the operand size
■ mem32real—32-bit floating-point value in memory
■ mem64real—64-bit floating-point value in memory
■ mem80real—80-bit floating-point value in memory
■ mmreg—MMX/3DNow! register
■ mmreg1—MMX/3DNow! register defined by bits 5, 4, and 3
  of the modR/M byte
■ mmreg2—MMX/3DNow! register defined by bits 2, 1, and 0
  of the modR/M byte
The second and third columns list all applicable encoding
opcode bytes.
The fourth column lists the modR/M byte used by the
instruction. The modR/M byte defines the instruction as
register or memory form. If mod bits 7 and 6 are documented as
mm (memory form), mm can only be 10b, 01b, or 00b.
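The modR/M notation used in the tables (for example 11-xxx-xxx or mm-xxx-xxx) can be illustrated by splitting a byte into its three fields. The helper names below are illustrative only.

```python
def split_modrm(byte):
    """Return (mod, reg, rm): bits 7-6, bits 5-3, and bits 2-0."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("modR/M is a single byte")
    return byte >> 6, (byte >> 3) & 0x7, byte & 0x7


def is_register_form(byte):
    """mod = 11b selects the register form; 00b, 01b, 10b are memory forms."""
    return split_modrm(byte)[0] == 0b11


fields = split_modrm(0xC8)   # 11-001-000
```

In the register form, bits 5 to 3 name the reg operand and bits 2 to 0 name the mreg operand, which is how the reg8/mreg8 notation in the first column is meant to be read.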
The fifth column lists the type of instruction decode, either
DirectPath or VectorPath (see “DirectPath Decoder” on page
133 and “VectorPath Decoder” on page 133 for more
information). The AMD Athlon™ processor enhanced decode
logic can process three instructions per clock.
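One way to use the decode-type column is to flag VectorPath instructions in a candidate sequence. The sketch below is only an illustration: the dictionary holds a handful of rows copied from Table 19, not the full table, and the function name is hypothetical.

```python
# A few rows copied from Table 19 (mnemonic -> decode type); not the full table.
DECODE_TYPE = {
    "CLC": "DirectPath",
    "CLD": "VectorPath",
    "DEC ECX": "DirectPath",
    "INC EAX": "DirectPath",
    "LEAVE": "VectorPath",
    "LOOP disp8": "VectorPath",
    "POP EAX": "VectorPath",
    "PUSH EAX": "DirectPath",
}


def vectorpath_instructions(sequence):
    """Return the entries of `sequence` that the excerpt lists as VectorPath.

    A mnemonic missing from the excerpt raises KeyError instead of being guessed.
    """
    return [insn for insn in sequence if DECODE_TYPE[insn] == "VectorPath"]
```

Calling `vectorpath_instructions(["PUSH EAX", "LOOP disp8", "CLC", "LEAVE"])` returns `["LOOP disp8", "LEAVE"]`.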
The FPU, MMX, and 3DNow! instruction tables have an
additional column that lists the possible FPU execution
pipelines available for use by any particular DirectPath
decoded operation. Typically, VectorPath instructions require
more than one execution pipe resource.
Table 19. Integer Instructions
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
AAA | 37h | | | VectorPath
AAD | D5h | 0Ah | | VectorPath
AAM | D4h | 0Ah | | VectorPath
AAS | 3Fh | | | VectorPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
ADC mreg8, reg8 | 10h | | 11-xxx-xxx | DirectPath
ADC mem8, reg8 | 10h | | mm-xxx-xxx | DirectPath
ADC mreg16/32, reg16/32 | 11h | | 11-xxx-xxx | DirectPath
ADC mem16/32, reg16/32 | 11h | | mm-xxx-xxx | DirectPath
ADC reg8, mreg8 | 12h | | 11-xxx-xxx | DirectPath
ADC reg8, mem8 | 12h | | mm-xxx-xxx | DirectPath
ADC reg16/32, mreg16/32 | 13h | | 11-xxx-xxx | DirectPath
ADC reg16/32, mem16/32 | 13h | | mm-xxx-xxx | DirectPath
ADC AL, imm8 | 14h | | | DirectPath
ADC EAX, imm16/32 | 15h | | | DirectPath
ADC mreg8, imm8 | 80h | | 11-010-xxx | DirectPath
ADC mem8, imm8 | 80h | | mm-010-xxx | DirectPath
ADC mreg16/32, imm16/32 | 81h | | 11-010-xxx | DirectPath
ADC mem16/32, imm16/32 | 81h | | mm-010-xxx | DirectPath
ADC mreg16/32, imm8 (sign extended) | 83h | | 11-010-xxx | DirectPath
ADC mem16/32, imm8 (sign extended) | 83h | | mm-010-xxx | DirectPath
ADD mreg8, reg8 | 00h | | 11-xxx-xxx | DirectPath
ADD mem8, reg8 | 00h | | mm-xxx-xxx | DirectPath
ADD mreg16/32, reg16/32 | 01h | | 11-xxx-xxx | DirectPath
ADD mem16/32, reg16/32 | 01h | | mm-xxx-xxx | DirectPath
ADD reg8, mreg8 | 02h | | 11-xxx-xxx | DirectPath
ADD reg8, mem8 | 02h | | mm-xxx-xxx | DirectPath
ADD reg16/32, mreg16/32 | 03h | | 11-xxx-xxx | DirectPath
ADD reg16/32, mem16/32 | 03h | | mm-xxx-xxx | DirectPath
ADD AL, imm8 | 04h | | | DirectPath
ADD EAX, imm16/32 | 05h | | | DirectPath
ADD mreg8, imm8 | 80h | | 11-000-xxx | DirectPath
ADD mem8, imm8 | 80h | | mm-000-xxx | DirectPath
ADD mreg16/32, imm16/32 | 81h | | 11-000-xxx | DirectPath
ADD mem16/32, imm16/32 | 81h | | mm-000-xxx | DirectPath
ADD mreg16/32, imm8 (sign extended) | 83h | | 11-000-xxx | DirectPath
ADD mem16/32, imm8 (sign extended) | 83h | | mm-000-xxx | DirectPath
AND mreg8, reg8 | 20h | | 11-xxx-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
AND mem8, reg8 | 20h | | mm-xxx-xxx | DirectPath
AND mreg16/32, reg16/32 | 21h | | 11-xxx-xxx | DirectPath
AND mem16/32, reg16/32 | 21h | | mm-xxx-xxx | DirectPath
AND reg8, mreg8 | 22h | | 11-xxx-xxx | DirectPath
AND reg8, mem8 | 22h | | mm-xxx-xxx | DirectPath
AND reg16/32, mreg16/32 | 23h | | 11-xxx-xxx | DirectPath
AND reg16/32, mem16/32 | 23h | | mm-xxx-xxx | DirectPath
AND AL, imm8 | 24h | | | DirectPath
AND EAX, imm16/32 | 25h | | | DirectPath
AND mreg8, imm8 | 80h | | 11-100-xxx | DirectPath
AND mem8, imm8 | 80h | | mm-100-xxx | DirectPath
AND mreg16/32, imm16/32 | 81h | | 11-100-xxx | DirectPath
AND mem16/32, imm16/32 | 81h | | mm-100-xxx | DirectPath
AND mreg16/32, imm8 (sign extended) | 83h | | 11-100-xxx | DirectPath
AND mem16/32, imm8 (sign extended) | 83h | | mm-100-xxx | DirectPath
ARPL mreg16, reg16 | 63h | | 11-xxx-xxx | VectorPath
ARPL mem16, reg16 | 63h | | mm-xxx-xxx | VectorPath
BOUND | 62h | | | VectorPath
BSF reg16/32, mreg16/32 | 0Fh | BCh | 11-xxx-xxx | VectorPath
BSF reg16/32, mem16/32 | 0Fh | BCh | mm-xxx-xxx | VectorPath
BSR reg16/32, mreg16/32 | 0Fh | BDh | 11-xxx-xxx | VectorPath
BSR reg16/32, mem16/32 | 0Fh | BDh | mm-xxx-xxx | VectorPath
BSWAP EAX | 0Fh | C8h | | DirectPath
BSWAP ECX | 0Fh | C9h | | DirectPath
BSWAP EDX | 0Fh | CAh | | DirectPath
BSWAP EBX | 0Fh | CBh | | DirectPath
BSWAP ESP | 0Fh | CCh | | DirectPath
BSWAP EBP | 0Fh | CDh | | DirectPath
BSWAP ESI | 0Fh | CEh | | DirectPath
BSWAP EDI | 0Fh | CFh | | DirectPath
BT mreg16/32, reg16/32 | 0Fh | A3h | 11-xxx-xxx | DirectPath
BT mem16/32, reg16/32 | 0Fh | A3h | mm-xxx-xxx | VectorPath
BT mreg16/32, imm8 | 0Fh | BAh | 11-100-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
BT mem16/32, imm8 | 0Fh | BAh | mm-100-xxx | DirectPath
BTC mreg16/32, reg16/32 | 0Fh | BBh | 11-xxx-xxx | VectorPath
BTC mem16/32, reg16/32 | 0Fh | BBh | mm-xxx-xxx | VectorPath
BTC mreg16/32, imm8 | 0Fh | BAh | 11-111-xxx | VectorPath
BTC mem16/32, imm8 | 0Fh | BAh | mm-111-xxx | VectorPath
BTR mreg16/32, reg16/32 | 0Fh | B3h | 11-xxx-xxx | VectorPath
BTR mem16/32, reg16/32 | 0Fh | B3h | mm-xxx-xxx | VectorPath
BTR mreg16/32, imm8 | 0Fh | BAh | 11-110-xxx | VectorPath
BTR mem16/32, imm8 | 0Fh | BAh | mm-110-xxx | VectorPath
BTS mreg16/32, reg16/32 | 0Fh | ABh | 11-xxx-xxx | VectorPath
BTS mem16/32, reg16/32 | 0Fh | ABh | mm-xxx-xxx | VectorPath
BTS mreg16/32, imm8 | 0Fh | BAh | 11-101-xxx | VectorPath
BTS mem16/32, imm8 | 0Fh | BAh | mm-101-xxx | VectorPath
CALL full pointer | 9Ah | | | VectorPath
CALL near imm16/32 | E8h | | | VectorPath
CALL mem16:16/32 | FFh | | 11-011-xxx | VectorPath
CALL near mreg32 (indirect) | FFh | | 11-010-xxx | VectorPath
CALL near mem32 (indirect) | FFh | | mm-010-xxx | VectorPath
CBW/CWDE | 98h | | | DirectPath
CLC | F8h | | | DirectPath
CLD | FCh | | | VectorPath
CLI | FAh | | | VectorPath
CLTS | 0Fh | 06h | | VectorPath
CMC | F5h | | | DirectPath
CMOVA/CMOVNBE reg16/32, reg16/32 | 0Fh | 47h | 11-xxx-xxx | DirectPath
CMOVA/CMOVNBE reg16/32, mem16/32 | 0Fh | 47h | mm-xxx-xxx | DirectPath
CMOVAE/CMOVNB/CMOVNC reg16/32, reg16/32 | 0Fh | 43h | 11-xxx-xxx | DirectPath
CMOVAE/CMOVNB/CMOVNC reg16/32, mem16/32 | 0Fh | 43h | mm-xxx-xxx | DirectPath
CMOVB/CMOVC/CMOVNAE reg16/32, reg16/32 | 0Fh | 42h | 11-xxx-xxx | DirectPath
CMOVB/CMOVC/CMOVNAE reg16/32, mem16/32 | 0Fh | 42h | mm-xxx-xxx | DirectPath
CMOVBE/CMOVNA reg16/32, reg16/32 | 0Fh | 46h | 11-xxx-xxx | DirectPath
CMOVBE/CMOVNA reg16/32, mem16/32 | 0Fh | 46h | mm-xxx-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
CMOVE/CMOVZ reg16/32, reg16/32 | 0Fh | 44h | 11-xxx-xxx | DirectPath
CMOVE/CMOVZ reg16/32, mem16/32 | 0Fh | 44h | mm-xxx-xxx | DirectPath
CMOVG/CMOVNLE reg16/32, reg16/32 | 0Fh | 4Fh | 11-xxx-xxx | DirectPath
CMOVG/CMOVNLE reg16/32, mem16/32 | 0Fh | 4Fh | mm-xxx-xxx | DirectPath
CMOVGE/CMOVNL reg16/32, reg16/32 | 0Fh | 4Dh | 11-xxx-xxx | DirectPath
CMOVGE/CMOVNL reg16/32, mem16/32 | 0Fh | 4Dh | mm-xxx-xxx | DirectPath
CMOVL/CMOVNGE reg16/32, reg16/32 | 0Fh | 4Ch | 11-xxx-xxx | DirectPath
CMOVL/CMOVNGE reg16/32, mem16/32 | 0Fh | 4Ch | mm-xxx-xxx | DirectPath
CMOVLE/CMOVNG reg16/32, reg16/32 | 0Fh | 4Eh | 11-xxx-xxx | DirectPath
CMOVLE/CMOVNG reg16/32, mem16/32 | 0Fh | 4Eh | mm-xxx-xxx | DirectPath
CMOVNE/CMOVNZ reg16/32, reg16/32 | 0Fh | 45h | 11-xxx-xxx | DirectPath
CMOVNE/CMOVNZ reg16/32, mem16/32 | 0Fh | 45h | mm-xxx-xxx | DirectPath
CMOVNO reg16/32, reg16/32 | 0Fh | 41h | 11-xxx-xxx | DirectPath
CMOVNO reg16/32, mem16/32 | 0Fh | 41h | mm-xxx-xxx | DirectPath
CMOVNP/CMOVPO reg16/32, reg16/32 | 0Fh | 4Bh | 11-xxx-xxx | DirectPath
CMOVNP/CMOVPO reg16/32, mem16/32 | 0Fh | 4Bh | mm-xxx-xxx | DirectPath
CMOVNS reg16/32, reg16/32 | 0Fh | 49h | 11-xxx-xxx | DirectPath
CMOVNS reg16/32, mem16/32 | 0Fh | 49h | mm-xxx-xxx | DirectPath
CMOVO reg16/32, reg16/32 | 0Fh | 40h | 11-xxx-xxx | DirectPath
CMOVO reg16/32, mem16/32 | 0Fh | 40h | mm-xxx-xxx | DirectPath
CMOVP/CMOVPE reg16/32, reg16/32 | 0Fh | 4Ah | 11-xxx-xxx | DirectPath
CMOVP/CMOVPE reg16/32, mem16/32 | 0Fh | 4Ah | mm-xxx-xxx | DirectPath
CMOVS reg16/32, reg16/32 | 0Fh | 48h | 11-xxx-xxx | DirectPath
CMOVS reg16/32, mem16/32 | 0Fh | 48h | mm-xxx-xxx | DirectPath
CMP mreg8, reg8 | 38h | | 11-xxx-xxx | DirectPath
CMP mem8, reg8 | 38h | | mm-xxx-xxx | DirectPath
CMP mreg16/32, reg16/32 | 39h | | 11-xxx-xxx | DirectPath
CMP mem16/32, reg16/32 | 39h | | mm-xxx-xxx | DirectPath
CMP reg8, mreg8 | 3Ah | | 11-xxx-xxx | DirectPath
CMP reg8, mem8 | 3Ah | | mm-xxx-xxx | DirectPath
CMP reg16/32, mreg16/32 | 3Bh | | 11-xxx-xxx | DirectPath
CMP reg16/32, mem16/32 | 3Bh | | mm-xxx-xxx | DirectPath
CMP AL, imm8 | 3Ch | | | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
CMP EAX, imm16/32 | 3Dh | | | DirectPath
CMP mreg8, imm8 | 80h | | 11-111-xxx | DirectPath
CMP mem8, imm8 | 80h | | mm-111-xxx | DirectPath
CMP mreg16/32, imm16/32 | 81h | | 11-111-xxx | DirectPath
CMP mem16/32, imm16/32 | 81h | | mm-111-xxx | DirectPath
CMP mreg16/32, imm8 (sign extended) | 83h | | 11-111-xxx | DirectPath
CMP mem16/32, imm8 (sign extended) | 83h | | mm-111-xxx | DirectPath
CMPSB mem8, mem8 | A6h | | | VectorPath
CMPSW mem16, mem16 | A7h | | | VectorPath
CMPSD mem32, mem32 | A7h | | | VectorPath
CMPXCHG mreg8, reg8 | 0Fh | B0h | 11-xxx-xxx | VectorPath
CMPXCHG mem8, reg8 | 0Fh | B0h | mm-xxx-xxx | VectorPath
CMPXCHG mreg16/32, reg16/32 | 0Fh | B1h | 11-xxx-xxx | VectorPath
CMPXCHG mem16/32, reg16/32 | 0Fh | B1h | mm-xxx-xxx | VectorPath
CMPXCHG8B mem64 | 0Fh | C7h | mm-xxx-xxx | VectorPath
CPUID | 0Fh | A2h | | VectorPath
CWD/CDQ | 99h | | | DirectPath
DAA | 27h | | | VectorPath
DAS | 2Fh | | | VectorPath
DEC EAX | 48h | | | DirectPath
DEC ECX | 49h | | | DirectPath
DEC EDX | 4Ah | | | DirectPath
DEC EBX | 4Bh | | | DirectPath
DEC ESP | 4Ch | | | DirectPath
DEC EBP | 4Dh | | | DirectPath
DEC ESI | 4Eh | | | DirectPath
DEC EDI | 4Fh | | | DirectPath
DEC mreg8 | FEh | | 11-001-xxx | DirectPath
DEC mem8 | FEh | | mm-001-xxx | DirectPath
DEC mreg16/32 | FFh | | 11-001-xxx | DirectPath
DEC mem16/32 | FFh | | mm-001-xxx | DirectPath
DIV AL, mreg8 | F6h | | 11-110-xxx | VectorPath
DIV AL, mem8 | F6h | | mm-110-xxx | VectorPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
DIV EAX, mreg16/32 | F7h | | 11-110-xxx | VectorPath
DIV EAX, mem16/32 | F7h | | mm-110-xxx | VectorPath
ENTER | C8h | | | VectorPath
IDIV mreg8 | F6h | | 11-111-xxx | VectorPath
IDIV mem8 | F6h | | mm-111-xxx | VectorPath
IDIV EAX, mreg16/32 | F7h | | 11-111-xxx | VectorPath
IDIV EAX, mem16/32 | F7h | | mm-111-xxx | VectorPath
IMUL reg16/32, imm16/32 | 69h | | 11-xxx-xxx | VectorPath
IMUL reg16/32, mreg16/32, imm16/32 | 69h | | 11-xxx-xxx | VectorPath
IMUL reg16/32, mem16/32, imm16/32 | 69h | | mm-xxx-xxx | VectorPath
IMUL reg16/32, imm8 (sign extended) | 6Bh | | 11-xxx-xxx | VectorPath
IMUL reg16/32, mreg16/32, imm8 (signed) | 6Bh | | 11-xxx-xxx | VectorPath
IMUL reg16/32, mem16/32, imm8 (signed) | 6Bh | | mm-xxx-xxx | VectorPath
IMUL AX, AL, mreg8 | F6h | | 11-101-xxx | VectorPath
IMUL AX, AL, mem8 | F6h | | mm-101-xxx | VectorPath
IMUL EDX:EAX, EAX, mreg16/32 | F7h | | 11-101-xxx | VectorPath
IMUL EDX:EAX, EAX, mem16/32 | F7h | | mm-101-xxx | VectorPath
IMUL reg16/32, mreg16/32 | 0Fh | AFh | 11-xxx-xxx | VectorPath
IMUL reg16/32, mem16/32 | 0Fh | AFh | mm-xxx-xxx | VectorPath
IN AL, imm8 | E4h | | | VectorPath
IN AX, imm8 | E5h | | | VectorPath
IN EAX, imm8 | E5h | | | VectorPath
IN AL, DX | ECh | | | VectorPath
IN AX, DX | EDh | | | VectorPath
IN EAX, DX | EDh | | | VectorPath
INC EAX | 40h | | | DirectPath
INC ECX | 41h | | | DirectPath
INC EDX | 42h | | | DirectPath
INC EBX | 43h | | | DirectPath
INC ESP | 44h | | | DirectPath
INC EBP | 45h | | | DirectPath
INC ESI | 46h | | | DirectPath
INC EDI | 47h | | | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
INC mreg8 | FEh | | 11-000-xxx | DirectPath
INC mem8 | FEh | | mm-000-xxx | DirectPath
INC mreg16/32 | FFh | | 11-000-xxx | DirectPath
INC mem16/32 | FFh | | mm-000-xxx | DirectPath
INVD | 0Fh | 08h | | VectorPath
INVLPG | 0Fh | 01h | mm-111-xxx | VectorPath
JO short disp8 | 70h | | | DirectPath
JNO short disp8 | 71h | | | DirectPath
JB/JNAE/JC short disp8 | 72h | | | DirectPath
JNB/JAE/JNC short disp8 | 73h | | | DirectPath
JZ/JE short disp8 | 74h | | | DirectPath
JNZ/JNE short disp8 | 75h | | | DirectPath
JBE/JNA short disp8 | 76h | | | DirectPath
JNBE/JA short disp8 | 77h | | | DirectPath
JS short disp8 | 78h | | | DirectPath
JNS short disp8 | 79h | | | DirectPath
JP/JPE short disp8 | 7Ah | | | DirectPath
JNP/JPO short disp8 | 7Bh | | | DirectPath
JL/JNGE short disp8 | 7Ch | | | DirectPath
JNL/JGE short disp8 | 7Dh | | | DirectPath
JLE/JNG short disp8 | 7Eh | | | DirectPath
JNLE/JG short disp8 | 7Fh | | | DirectPath
JCXZ/JEC short disp8 | E3h | | | VectorPath
JO near disp16/32 | 0Fh | 80h | | DirectPath
JNO near disp16/32 | 0Fh | 81h | | DirectPath
JB/JNAE near disp16/32 | 0Fh | 82h | | DirectPath
JNB/JAE near disp16/32 | 0Fh | 83h | | DirectPath
JZ/JE near disp16/32 | 0Fh | 84h | | DirectPath
JNZ/JNE near disp16/32 | 0Fh | 85h | | DirectPath
JBE/JNA near disp16/32 | 0Fh | 86h | | DirectPath
JNBE/JA near disp16/32 | 0Fh | 87h | | DirectPath
JS near disp16/32 | 0Fh | 88h | | DirectPath
JNS near disp16/32 | 0Fh | 89h | | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
JP/JPE near disp16/32 | 0Fh | 8Ah | | DirectPath
JNP/JPO near disp16/32 | 0Fh | 8Bh | | DirectPath
JL/JNGE near disp16/32 | 0Fh | 8Ch | | DirectPath
JNL/JGE near disp16/32 | 0Fh | 8Dh | | DirectPath
JLE/JNG near disp16/32 | 0Fh | 8Eh | | DirectPath
JNLE/JG near disp16/32 | 0Fh | 8Fh | | DirectPath
JMP near disp16/32 (direct) | E9h | | | DirectPath
JMP far disp32/48 (direct) | EAh | | | VectorPath
JMP disp8 (short) | EBh | | | DirectPath
JMP far mem32 (indirect) | FFh | | mm-101-xxx | VectorPath
JMP far mreg32 (indirect) | FFh | | mm-101-xxx | VectorPath
JMP near mreg16/32 (indirect) | FFh | | 11-100-xxx | DirectPath
JMP near mem16/32 (indirect) | FFh | | mm-100-xxx | DirectPath
LAHF | 9Fh | | | VectorPath
LAR reg16/32, mreg16/32 | 0Fh | 02h | 11-xxx-xxx | VectorPath
LAR reg16/32, mem16/32 | 0Fh | 02h | mm-xxx-xxx | VectorPath
LDS reg16/32, mem32/48 | C5h | | mm-xxx-xxx | VectorPath
LEA reg16, mem16/32 | 8Dh | | mm-xxx-xxx | VectorPath
LEA reg32, mem16/32 | 8Dh | | mm-xxx-xxx | DirectPath
LEAVE | C9h | | | VectorPath
LES reg16/32, mem32/48 | C4h | | mm-xxx-xxx | VectorPath
LFS reg16/32, mem32/48 | 0Fh | B4h | | VectorPath
LGDT mem48 | 0Fh | 01h | mm-010-xxx | VectorPath
LGS reg16/32, mem32/48 | 0Fh | B5h | | VectorPath
LIDT mem48 | 0Fh | 01h | mm-011-xxx | VectorPath
LLDT mreg16 | 0Fh | 00h | 11-010-xxx | VectorPath
LLDT mem16 | 0Fh | 00h | mm-010-xxx | VectorPath
LMSW mreg16 | 0Fh | 01h | 11-100-xxx | VectorPath
LMSW mem16 | 0Fh | 01h | mm-100-xxx | VectorPath
LODSB AL, mem8 | ACh | | | VectorPath
LODSW AX, mem16 | ADh | | | VectorPath
LODSD EAX, mem32 | ADh | | | VectorPath
LOOP disp8 | E2h | | | VectorPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
LOOPE/LOOPZ disp8 | E1h | | | VectorPath
LOOPNE/LOOPNZ disp8 | E0h | | | VectorPath
LSL reg16/32, mreg16/32 | 0Fh | 03h | 11-xxx-xxx | VectorPath
LSL reg16/32, mem16/32 | 0Fh | 03h | mm-xxx-xxx | VectorPath
LSS reg16/32, mem32/48 | 0Fh | B2h | mm-xxx-xxx | VectorPath
LTR mreg16 | 0Fh | 00h | 11-011-xxx | VectorPath
LTR mem16 | 0Fh | 00h | mm-011-xxx | VectorPath
MOV mreg8, reg8 | 88h | | 11-xxx-xxx | DirectPath
MOV mem8, reg8 | 88h | | mm-xxx-xxx | DirectPath
MOV mreg16/32, reg16/32 | 89h | | 11-xxx-xxx | DirectPath
MOV mem16/32, reg16/32 | 89h | | mm-xxx-xxx | DirectPath
MOV reg8, mreg8 | 8Ah | | 11-xxx-xxx | DirectPath
MOV reg8, mem8 | 8Ah | | mm-xxx-xxx | DirectPath
MOV reg16/32, mreg16/32 | 8Bh | | 11-xxx-xxx | DirectPath
MOV reg16/32, mem16/32 | 8Bh | | mm-xxx-xxx | DirectPath
MOV mreg16, segment reg | 8Ch | | 11-xxx-xxx | VectorPath
MOV mem16, segment reg | 8Ch | | mm-xxx-xxx | VectorPath
MOV segment reg, mreg16 | 8Eh | | 11-xxx-xxx | VectorPath
MOV segment reg, mem16 | 8Eh | | mm-xxx-xxx | VectorPath
MOV AL, mem8 | A0h | | | DirectPath
MOV EAX, mem16/32 | A1h | | | DirectPath
MOV mem8, AL | A2h | | | DirectPath
MOV mem16/32, EAX | A3h | | | DirectPath
MOV AL, imm8 | B0h | | | DirectPath
MOV CL, imm8 | B1h | | | DirectPath
MOV DL, imm8 | B2h | | | DirectPath
MOV BL, imm8 | B3h | | | DirectPath
MOV AH, imm8 | B4h | | | DirectPath
MOV CH, imm8 | B5h | | | DirectPath
MOV DH, imm8 | B6h | | | DirectPath
MOV BH, imm8 | B7h | | | DirectPath
MOV EAX, imm16/32 | B8h | | | DirectPath
MOV ECX, imm16/32 | B9h | | | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
MOV EDX, imm16/32 | BAh | | | DirectPath
MOV EBX, imm16/32 | BBh | | | DirectPath
MOV ESP, imm16/32 | BCh | | | DirectPath
MOV EBP, imm16/32 | BDh | | | DirectPath
MOV ESI, imm16/32 | BEh | | | DirectPath
MOV EDI, imm16/32 | BFh | | | DirectPath
MOV mreg8, imm8 | C6h | | 11-000-xxx | DirectPath
MOV mem8, imm8 | C6h | | mm-000-xxx | DirectPath
MOV mreg16/32, imm16/32 | C7h | | 11-000-xxx | DirectPath
MOV mem16/32, imm16/32 | C7h | | mm-000-xxx | DirectPath
MOVSB mem8, mem8 | A4h | | | VectorPath
MOVSW mem16, mem16 | A5h | | | VectorPath
MOVSD mem32, mem32 | A5h | | | VectorPath
MOVSX reg16/32, mreg8 | 0Fh | BEh | 11-xxx-xxx | DirectPath
MOVSX reg16/32, mem8 | 0Fh | BEh | mm-xxx-xxx | DirectPath
MOVSX reg32, mreg16 | 0Fh | BFh | 11-xxx-xxx | DirectPath
MOVSX reg32, mem16 | 0Fh | BFh | mm-xxx-xxx | DirectPath
MOVZX reg16/32, mreg8 | 0Fh | B6h | 11-xxx-xxx | DirectPath
MOVZX reg16/32, mem8 | 0Fh | B6h | mm-xxx-xxx | DirectPath
MOVZX reg32, mreg16 | 0Fh | B7h | 11-xxx-xxx | DirectPath
MOVZX reg32, mem16 | 0Fh | B7h | mm-xxx-xxx | DirectPath
MUL AL, mreg8 | F6h | | 11-100-xxx | VectorPath
MUL AL, mem8 | F6h | | mm-100-xxx | VectorPath
MUL AX, mreg16 | F7h | | 11-100-xxx | VectorPath
MUL AX, mem16 | F7h | | mm-100-xxx | VectorPath
MUL EAX, mreg32 | F7h | | 11-100-xxx | VectorPath
MUL EAX, mem32 | F7h | | mm-100-xxx | VectorPath
NEG mreg8 | F6h | | 11-011-xxx | DirectPath
NEG mem8 | F6h | | mm-011-xxx | DirectPath
NEG mreg16/32 | F7h | | 11-011-xxx | DirectPath
NEG mem16/32 | F7h | | mm-011-xxx | DirectPath
NOP (XCHG EAX, EAX) | 90h | | | DirectPath
NOT mreg8 | F6h | | 11-010-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
NOT mem8 | F6h | | mm-010-xxx | DirectPath
NOT mreg16/32 | F7h | | 11-010-xxx | DirectPath
NOT mem16/32 | F7h | | mm-010-xxx | DirectPath
OR mreg8, reg8 | 08h | | 11-xxx-xxx | DirectPath
OR mem8, reg8 | 08h | | mm-xxx-xxx | DirectPath
OR mreg16/32, reg16/32 | 09h | | 11-xxx-xxx | DirectPath
OR mem16/32, reg16/32 | 09h | | mm-xxx-xxx | DirectPath
OR reg8, mreg8 | 0Ah | | 11-xxx-xxx | DirectPath
OR reg8, mem8 | 0Ah | | mm-xxx-xxx | DirectPath
OR reg16/32, mreg16/32 | 0Bh | | 11-xxx-xxx | DirectPath
OR reg16/32, mem16/32 | 0Bh | | mm-xxx-xxx | DirectPath
OR AL, imm8 | 0Ch | | | DirectPath
OR EAX, imm16/32 | 0Dh | | | DirectPath
OR mreg8, imm8 | 80h | | 11-001-xxx | DirectPath
OR mem8, imm8 | 80h | | mm-001-xxx | DirectPath
OR mreg16/32, imm16/32 | 81h | | 11-001-xxx | DirectPath
OR mem16/32, imm16/32 | 81h | | mm-001-xxx | DirectPath
OR mreg16/32, imm8 (sign extended) | 83h | | 11-001-xxx | DirectPath
OR mem16/32, imm8 (sign extended) | 83h | | mm-001-xxx | DirectPath
OUT imm8, AL | E6h | | | VectorPath
OUT imm8, AX | E7h | | | VectorPath
OUT imm8, EAX | E7h | | | VectorPath
OUT DX, AL | EEh | | | VectorPath
OUT DX, AX | EFh | | | VectorPath
OUT DX, EAX | EFh | | | VectorPath
POP ES | 07h | | | VectorPath
POP SS | 17h | | | VectorPath
POP DS | 1Fh | | | VectorPath
POP FS | 0Fh | A1h | | VectorPath
POP GS | 0Fh | A9h | | VectorPath
POP EAX | 58h | | | VectorPath
POP ECX | 59h | | | VectorPath
POP EDX | 5Ah | | | VectorPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
POP EBX | 5Bh | | | VectorPath
POP ESP | 5Ch | | | VectorPath
POP EBP | 5Dh | | | VectorPath
POP ESI | 5Eh | | | VectorPath
POP EDI | 5Fh | | | VectorPath
POP mreg16/32 | 8Fh | | 11-000-xxx | VectorPath
POP mem16/32 | 8Fh | | mm-000-xxx | VectorPath
POPA/POPAD | 61h | | | VectorPath
POPF/POPFD | 9Dh | | | VectorPath
PUSH ES | 06h | | | VectorPath
PUSH CS | 0Eh | | | VectorPath
PUSH FS | 0Fh | A0h | | VectorPath
PUSH GS | 0Fh | A8h | | VectorPath
PUSH SS | 16h | | | VectorPath
PUSH DS | 1Eh | | | VectorPath
PUSH EAX | 50h | | | DirectPath
PUSH ECX | 51h | | | DirectPath
PUSH EDX | 52h | | | DirectPath
PUSH EBX | 53h | | | DirectPath
PUSH ESP | 54h | | | DirectPath
PUSH EBP | 55h | | | DirectPath
PUSH ESI | 56h | | | DirectPath
PUSH EDI | 57h | | | DirectPath
PUSH imm8 | 6Ah | | | DirectPath
PUSH imm16/32 | 68h | | | DirectPath
PUSH mreg16/32 | FFh | | 11-110-xxx | VectorPath
PUSH mem16/32 | FFh | | mm-110-xxx | VectorPath
PUSHA/PUSHAD | 60h | | | VectorPath
PUSHF/PUSHFD | 9Ch | | | VectorPath
RCL mreg8, imm8 | C0h | | 11-010-xxx | DirectPath
RCL mem8, imm8 | C0h | | mm-010-xxx | VectorPath
RCL mreg16/32, imm8 | C1h | | 11-010-xxx | DirectPath
RCL mem16/32, imm8 | C1h | | mm-010-xxx | VectorPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
RCL mreg8, 1 | D0h | | 11-010-xxx | DirectPath
RCL mem8, 1 | D0h | | mm-010-xxx | DirectPath
RCL mreg16/32, 1 | D1h | | 11-010-xxx | DirectPath
RCL mem16/32, 1 | D1h | | mm-010-xxx | DirectPath
RCL mreg8, CL | D2h | | 11-010-xxx | DirectPath
RCL mem8, CL | D2h | | mm-010-xxx | VectorPath
RCL mreg16/32, CL | D3h | | 11-010-xxx | DirectPath
RCL mem16/32, CL | D3h | | mm-010-xxx | VectorPath
RCR mreg8, imm8 | C0h | | 11-011-xxx | DirectPath
RCR mem8, imm8 | C0h | | mm-011-xxx | VectorPath
RCR mreg16/32, imm8 | C1h | | 11-011-xxx | DirectPath
RCR mem16/32, imm8 | C1h | | mm-011-xxx | VectorPath
RCR mreg8, 1 | D0h | | 11-011-xxx | DirectPath
RCR mem8, 1 | D0h | | mm-011-xxx | DirectPath
RCR mreg16/32, 1 | D1h | | 11-011-xxx | DirectPath
RCR mem16/32, 1 | D1h | | mm-011-xxx | DirectPath
RCR mreg8, CL | D2h | | 11-011-xxx | DirectPath
RCR mem8, CL | D2h | | mm-011-xxx | VectorPath
RCR mreg16/32, CL | D3h | | 11-011-xxx | DirectPath
RCR mem16/32, CL | D3h | | mm-011-xxx | VectorPath
RDMSR | 0Fh | 32h | | VectorPath
RDPMC | 0Fh | 33h | | VectorPath
RDTSC | 0Fh | 31h | | VectorPath
RET near imm16 | C2h | | | VectorPath
RET near | C3h | | | VectorPath
RET far imm16 | CAh | | | VectorPath
RET far | CBh | | | VectorPath
ROL mreg8, imm8 | C0h | | 11-000-xxx | DirectPath
ROL mem8, imm8 | C0h | | mm-000-xxx | DirectPath
ROL mreg16/32, imm8 | C1h | | 11-000-xxx | DirectPath
ROL mem16/32, imm8 | C1h | | mm-000-xxx | DirectPath
ROL mreg8, 1 | D0h | | 11-000-xxx | DirectPath
ROL mem8, 1 | D0h | | mm-000-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
ROL mreg16/32, 1 | D1h | | 11-000-xxx | DirectPath
ROL mem16/32, 1 | D1h | | mm-000-xxx | DirectPath
ROL mreg8, CL | D2h | | 11-000-xxx | DirectPath
ROL mem8, CL | D2h | | mm-000-xxx | DirectPath
ROL mreg16/32, CL | D3h | | 11-000-xxx | DirectPath
ROL mem16/32, CL | D3h | | mm-000-xxx | DirectPath
ROR mreg8, imm8 | C0h | | 11-001-xxx | DirectPath
ROR mem8, imm8 | C0h | | mm-001-xxx | DirectPath
ROR mreg16/32, imm8 | C1h | | 11-001-xxx | DirectPath
ROR mem16/32, imm8 | C1h | | mm-001-xxx | DirectPath
ROR mreg8, 1 | D0h | | 11-001-xxx | DirectPath
ROR mem8, 1 | D0h | | mm-001-xxx | DirectPath
ROR mreg16/32, 1 | D1h | | 11-001-xxx | DirectPath
ROR mem16/32, 1 | D1h | | mm-001-xxx | DirectPath
ROR mreg8, CL | D2h | | 11-001-xxx | DirectPath
ROR mem8, CL | D2h | | mm-001-xxx | DirectPath
ROR mreg16/32, CL | D3h | | 11-001-xxx | DirectPath
ROR mem16/32, CL | D3h | | mm-001-xxx | DirectPath
SAHF | 9Eh | | | VectorPath
SAR mreg8, imm8 | C0h | | 11-111-xxx | DirectPath
SAR mem8, imm8 | C0h | | mm-111-xxx | DirectPath
SAR mreg16/32, imm8 | C1h | | 11-111-xxx | DirectPath
SAR mem16/32, imm8 | C1h | | mm-111-xxx | DirectPath
SAR mreg8, 1 | D0h | | 11-111-xxx | DirectPath
SAR mem8, 1 | D0h | | mm-111-xxx | DirectPath
SAR mreg16/32, 1 | D1h | | 11-111-xxx | DirectPath
SAR mem16/32, 1 | D1h | | mm-111-xxx | DirectPath
SAR mreg8, CL | D2h | | 11-111-xxx | DirectPath
SAR mem8, CL | D2h | | mm-111-xxx | DirectPath
SAR mreg16/32, CL | D3h | | 11-111-xxx | DirectPath
SAR mem16/32, CL | D3h | | mm-111-xxx | DirectPath
SBB mreg8, reg8 | 18h | | 11-xxx-xxx | DirectPath
SBB mem8, reg8 | 18h | | mm-xxx-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
SBB mreg16/32, reg16/32 | 19h | | 11-xxx-xxx | DirectPath
SBB mem16/32, reg16/32 | 19h | | mm-xxx-xxx | DirectPath
SBB reg8, mreg8 | 1Ah | | 11-xxx-xxx | DirectPath
SBB reg8, mem8 | 1Ah | | mm-xxx-xxx | DirectPath
SBB reg16/32, mreg16/32 | 1Bh | | 11-xxx-xxx | DirectPath
SBB reg16/32, mem16/32 | 1Bh | | mm-xxx-xxx | DirectPath
SBB AL, imm8 | 1Ch | | | DirectPath
SBB EAX, imm16/32 | 1Dh | | | DirectPath
SBB mreg8, imm8 | 80h | | 11-011-xxx | DirectPath
SBB mem8, imm8 | 80h | | mm-011-xxx | DirectPath
SBB mreg16/32, imm16/32 | 81h | | 11-011-xxx | DirectPath
SBB mem16/32, imm16/32 | 81h | | mm-011-xxx | DirectPath
SBB mreg16/32, imm8 (sign extended) | 83h | | 11-011-xxx | DirectPath
SBB mem16/32, imm8 (sign extended) | 83h | | mm-011-xxx | DirectPath
SCASB AL, mem8 | AEh | | | VectorPath
SCASW AX, mem16 | AFh | | | VectorPath
SCASD EAX, mem32 | AFh | | | VectorPath
SETO mreg8 | 0Fh | 90h | 11-xxx-xxx | DirectPath
SETO mem8 | 0Fh | 90h | mm-xxx-xxx | DirectPath
SETNO mreg8 | 0Fh | 91h | 11-xxx-xxx | DirectPath
SETNO mem8 | 0Fh | 91h | mm-xxx-xxx | DirectPath
SETB/SETC/SETNAE mreg8 | 0Fh | 92h | 11-xxx-xxx | DirectPath
SETB/SETC/SETNAE mem8 | 0Fh | 92h | mm-xxx-xxx | DirectPath
SETAE/SETNB/SETNC mreg8 | 0Fh | 93h | 11-xxx-xxx | DirectPath
SETAE/SETNB/SETNC mem8 | 0Fh | 93h | mm-xxx-xxx | DirectPath
SETE/SETZ mreg8 | 0Fh | 94h | 11-xxx-xxx | DirectPath
SETE/SETZ mem8 | 0Fh | 94h | mm-xxx-xxx | DirectPath
SETNE/SETNZ mreg8 | 0Fh | 95h | 11-xxx-xxx | DirectPath
SETNE/SETNZ mem8 | 0Fh | 95h | mm-xxx-xxx | DirectPath
SETBE/SETNA mreg8 | 0Fh | 96h | 11-xxx-xxx | DirectPath
SETBE/SETNA mem8 | 0Fh | 96h | mm-xxx-xxx | DirectPath
SETA/SETNBE mreg8 | 0Fh | 97h | 11-xxx-xxx | DirectPath
SETA/SETNBE mem8 | 0Fh | 97h | mm-xxx-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
SETS mreg8 | 0Fh | 98h | 11-xxx-xxx | DirectPath
SETS mem8 | 0Fh | 98h | mm-xxx-xxx | DirectPath
SETNS mreg8 | 0Fh | 99h | 11-xxx-xxx | DirectPath
SETNS mem8 | 0Fh | 99h | mm-xxx-xxx | DirectPath
SETP/SETPE mreg8 | 0Fh | 9Ah | 11-xxx-xxx | DirectPath
SETP/SETPE mem8 | 0Fh | 9Ah | mm-xxx-xxx | DirectPath
SETNP/SETPO mreg8 | 0Fh | 9Bh | 11-xxx-xxx | DirectPath
SETNP/SETPO mem8 | 0Fh | 9Bh | mm-xxx-xxx | DirectPath
SETL/SETNGE mreg8 | 0Fh | 9Ch | 11-xxx-xxx | DirectPath
SETL/SETNGE mem8 | 0Fh | 9Ch | mm-xxx-xxx | DirectPath
SETGE/SETNL mreg8 | 0Fh | 9Dh | 11-xxx-xxx | DirectPath
SETGE/SETNL mem8 | 0Fh | 9Dh | mm-xxx-xxx | DirectPath
SETLE/SETNG mreg8 | 0Fh | 9Eh | 11-xxx-xxx | DirectPath
SETLE/SETNG mem8 | 0Fh | 9Eh | mm-xxx-xxx | DirectPath
SETG/SETNLE mreg8 | 0Fh | 9Fh | 11-xxx-xxx | DirectPath
SETG/SETNLE mem8 | 0Fh | 9Fh | mm-xxx-xxx | DirectPath
SGDT mem48 | 0Fh | 01h | mm-000-xxx | VectorPath
SIDT mem48 | 0Fh | 01h | mm-001-xxx | VectorPath
SHL/SAL mreg8, imm8 | C0h | | 11-100-xxx | DirectPath
SHL/SAL mem8, imm8 | C0h | | mm-100-xxx | DirectPath
SHL/SAL mreg16/32, imm8 | C1h | | 11-100-xxx | DirectPath
SHL/SAL mem16/32, imm8 | C1h | | mm-100-xxx | DirectPath
SHL/SAL mreg8, 1 | D0h | | 11-100-xxx | DirectPath
SHL/SAL mem8, 1 | D0h | | mm-100-xxx | DirectPath
SHL/SAL mreg16/32, 1 | D1h | | 11-100-xxx | DirectPath
SHL/SAL mem16/32, 1 | D1h | | mm-100-xxx | DirectPath
SHL/SAL mreg8, CL | D2h | | 11-100-xxx | DirectPath
SHL/SAL mem8, CL | D2h | | mm-100-xxx | DirectPath
SHL/SAL mreg16/32, CL | D3h | | 11-100-xxx | DirectPath
SHL/SAL mem16/32, CL | D3h | | mm-100-xxx | DirectPath
SHR mreg8, imm8 | C0h | | 11-101-xxx | DirectPath
SHR mem8, imm8 | C0h | | mm-101-xxx | DirectPath
SHR mreg16/32, imm8 | C1h | | 11-101-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
SHR mem16/32, imm8 | C1h | | mm-101-xxx | DirectPath
SHR mreg8, 1 | D0h | | 11-101-xxx | DirectPath
SHR mem8, 1 | D0h | | mm-101-xxx | DirectPath
SHR mreg16/32, 1 | D1h | | 11-101-xxx | DirectPath
SHR mem16/32, 1 | D1h | | mm-101-xxx | DirectPath
SHR mreg8, CL | D2h | | 11-101-xxx | DirectPath
SHR mem8, CL | D2h | | mm-101-xxx | DirectPath
SHR mreg16/32, CL | D3h | | 11-101-xxx | DirectPath
SHR mem16/32, CL | D3h | | mm-101-xxx | DirectPath
SHLD mreg16/32, reg16/32, imm8 | 0Fh | A4h | 11-xxx-xxx | VectorPath
SHLD mem16/32, reg16/32, imm8 | 0Fh | A4h | mm-xxx-xxx | VectorPath
SHLD mreg16/32, reg16/32, CL | 0Fh | A5h | 11-xxx-xxx | VectorPath
SHLD mem16/32, reg16/32, CL | 0Fh | A5h | mm-xxx-xxx | VectorPath
SHRD mreg16/32, reg16/32, imm8 | 0Fh | ACh | 11-xxx-xxx | VectorPath
SHRD mem16/32, reg16/32, imm8 | 0Fh | ACh | mm-xxx-xxx | VectorPath
SHRD mreg16/32, reg16/32, CL | 0Fh | ADh | 11-xxx-xxx | VectorPath
SHRD mem16/32, reg16/32, CL | 0Fh | ADh | mm-xxx-xxx | VectorPath
SLDT mreg16 | 0Fh | 00h | 11-000-xxx | VectorPath
SLDT mem16 | 0Fh | 00h | mm-000-xxx | VectorPath
SMSW mreg16 | 0Fh | 01h | 11-100-xxx | VectorPath
SMSW mem16 | 0Fh | 01h | mm-100-xxx | VectorPath
STC | F9h | | | DirectPath
STD | FDh | | | VectorPath
STI | FBh | | | VectorPath
STOSB mem8, AL | AAh | | | VectorPath
STOSW mem16, AX | ABh | | | VectorPath
STOSD mem32, EAX | ABh | | | VectorPath
STR mreg16 | 0Fh | 00h | 11-001-xxx | VectorPath
STR mem16 | 0Fh | 00h | mm-001-xxx | VectorPath
SUB mreg8, reg8 | 28h | | 11-xxx-xxx | DirectPath
SUB mem8, reg8 | 28h | | mm-xxx-xxx | DirectPath
SUB mreg16/32, reg16/32 | 29h | | 11-xxx-xxx | DirectPath
SUB mem16/32, reg16/32 | 29h | | mm-xxx-xxx | DirectPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
SUB reg8, mreg8 | 2Ah | | 11-xxx-xxx | DirectPath
SUB reg8, mem8 | 2Ah | | mm-xxx-xxx | DirectPath
SUB reg16/32, mreg16/32 | 2Bh | | 11-xxx-xxx | DirectPath
SUB reg16/32, mem16/32 | 2Bh | | mm-xxx-xxx | DirectPath
SUB AL, imm8 | 2Ch | | | DirectPath
SUB EAX, imm16/32 | 2Dh | | | DirectPath
SUB mreg8, imm8 | 80h | | 11-101-xxx | DirectPath
SUB mem8, imm8 | 80h | | mm-101-xxx | DirectPath
SUB mreg16/32, imm16/32 | 81h | | 11-101-xxx | DirectPath
SUB mem16/32, imm16/32 | 81h | | mm-101-xxx | DirectPath
SUB mreg16/32, imm8 (sign extended) | 83h | | 11-101-xxx | DirectPath
SUB mem16/32, imm8 (sign extended) | 83h | | mm-101-xxx | DirectPath
SYSCALL | 0Fh | 05h | | VectorPath
SYSENTER | 0Fh | 34h | | VectorPath
SYSEXIT | 0Fh | 35h | | VectorPath
SYSRET | 0Fh | 07h | | VectorPath
TEST mreg8, reg8 | 84h | | 11-xxx-xxx | DirectPath
TEST mem8, reg8 | 84h | | mm-xxx-xxx | DirectPath
TEST mreg16/32, reg16/32 | 85h | | 11-xxx-xxx | DirectPath
TEST mem16/32, reg16/32 | 85h | | mm-xxx-xxx | DirectPath
TEST AL, imm8 | A8h | | | DirectPath
TEST EAX, imm16/32 | A9h | | | DirectPath
TEST mreg8, imm8 | F6h | | 11-000-xxx | DirectPath
TEST mem8, imm8 | F6h | | mm-000-xxx | DirectPath
TEST mreg16/32, imm16/32 | F7h | | 11-000-xxx | DirectPath
TEST mem16/32, imm16/32 | F7h | | mm-000-xxx | DirectPath
VERR mreg16 | 0Fh | 00h | 11-100-xxx | VectorPath
VERR mem16 | 0Fh | 00h | mm-100-xxx | VectorPath
VERW mreg16 | 0Fh | 00h | 11-101-xxx | VectorPath
VERW mem16 | 0Fh | 00h | mm-101-xxx | VectorPath
WAIT | 9Bh | | | DirectPath
WBINVD | 0Fh | 09h | | VectorPath
WRMSR | 0Fh | 30h | | VectorPath
Table 19. Integer Instructions (Continued)
Instruction Mnemonic | First Byte | Second Byte | ModR/M Byte | Decode Type
XADD mreg8, reg8 | 0Fh | C0h | 11-100-xxx | VectorPath
XADD mem8, reg8 | 0Fh | C0h | mm-100-xxx | VectorPath
XADD mreg16/32, reg16/32 | 0Fh | C1h | 11-101-xxx | VectorPath
XADD mem16/32, reg16/32 | 0Fh | C1h | mm-101-xxx | VectorPath
XCHG reg8, mreg8 | 86h | | 11-xxx-xxx | VectorPath
XCHG reg8, mem8 | 86h | | mm-xxx-xxx | VectorPath
XCHG reg16/32, mreg16/32 | 87h | | 11-xxx-xxx | VectorPath
XCHG reg16/32, mem16/32 | 87h | | mm-xxx-xxx | VectorPath
XCHG EAX, EAX | 90h | | | DirectPath
XCHG EAX, ECX | 91h | | | VectorPath
XCHG EAX, EDX | 92h | | | VectorPath
XCHG EAX, EBX | 93h | | | VectorPath
XCHG EAX, ESP | 94h | | | VectorPath
XCHG EAX, EBP | 95h | | | VectorPath
XCHG EAX, ESI | 96h | | | VectorPath
XCHG EAX, EDI | 97h | | | VectorPath
XLAT | D7h | | | VectorPath
XOR mreg8, reg8 | 30h | | 11-xxx-xxx | DirectPath
XOR mem8, reg8 | 30h | | mm-xxx-xxx | DirectPath
XOR mreg16/32, reg16/32 | 31h | | 11-xxx-xxx | DirectPath
XOR mem16/32, reg16/32 | 31h | | mm-xxx-xxx | DirectPath
XOR reg8, mreg8 | 32h | | 11-xxx-xxx | DirectPath
XOR reg8, mem8 | 32h | | mm-xxx-xxx | DirectPath
XOR reg16/32, mreg16/32 | 33h | | 11-xxx-xxx | DirectPath
XOR reg16/32, mem16/32 | 33h | | mm-xxx-xxx | DirectPath
XOR AL, imm8 | 34h | | | DirectPath
XOR EAX, imm16/32 | 35h | | | DirectPath
XOR mreg8, imm8 | 80h | | 11-110-xxx | DirectPath
XOR mem8, imm8 | 80h | | mm-110-xxx | DirectPath
XOR mreg16/32, imm16/32 | 81h | | 11-110-xxx | DirectPath
XOR mem16/32, imm16/32 | 81h | | mm-110-xxx | DirectPath
XOR mreg16/32, imm8 (sign extended) | 83h | | 11-110-xxx | DirectPath
XOR mem16/32, imm8 (sign extended) | 83h | | mm-110-xxx | DirectPath
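In Table 19, the immediate forms that share first bytes 80h, 81h, and 83h are told apart only by bits 5, 4, and 3 of the modR/M byte. The Python sketch below collects those values from the ModR/M Byte column into a lookup; the names `GROUP1` and `group1_mnemonic` are hypothetical and chosen only for this illustration.

```python
# Bits 5-3 of the modR/M byte -> mnemonic for first bytes 80h, 81h, and 83h,
# as listed in the ModR/M Byte column of Table 19.
GROUP1 = {
    0b000: "ADD", 0b001: "OR", 0b010: "ADC", 0b011: "SBB",
    0b100: "AND", 0b101: "SUB", 0b110: "XOR", 0b111: "CMP",
}


def group1_mnemonic(opcode, modrm):
    """Return the mnemonic selected by bits 5-3 of `modrm` for 80h/81h/83h."""
    if opcode not in (0x80, 0x81, 0x83):
        raise ValueError("only first bytes 80h, 81h, and 83h are covered")
    if not 0 <= modrm <= 0xFF:
        raise ValueError("modR/M must fit in one byte")
    return GROUP1[(modrm >> 3) & 0b111]
```

For example, `group1_mnemonic(0x83, 0xF8)` returns `"CMP"`, because F8h is 11-111-000 in the table's notation.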
AMD Athlon™ Processor x86 Code Optimization
22007E/0—November 1999
Table 20. MMX™ Instructions
Instruction Mnemonic
Prefix First
Byte(s) Byte
ModR/M
Byte
Decode
Type
FPU Pipe(s)
DirectPath
FADD/FMUL/FSTORE
EMMS
0Fh
77h
MOVD mmreg, reg32
0Fh
6Eh
MOVD mmreg, mem32
0Fh
6Eh
MOVD reg32, mmreg
0Fh
7Eh
11-xxx-xxx
MOVD mem32, mmreg
0Fh
7Eh
mm-xxx-xxx DirectPath
MOVQ mmreg1, mmreg2
0Fh
6Fh
11-xxx-xxx
MOVQ mmreg, mem64
0Fh
6Fh
MOVQ mmreg2, mmreg1
0Fh
7Fh
MOVQ mem64, mmreg
0Fh
7Fh
mm-xxx-xxx DirectPath
PACKSSDW mmreg1, mmreg2
0Fh
6Bh
11-xxx-xxx
PACKSSDW mmreg, mem64
0Fh
6Bh
PACKSSWB mmreg1, mmreg2
0Fh
63h
11-xxx-xxx
DirectPath
FADD/FMUL
PACKSSWB mmreg, mem64
0Fh
63h
mm-xxx-xxx DirectPath
FADD/FMUL
PACKUSWB mmreg1, mmreg2
0Fh
67h
11-xxx-xxx
DirectPath
FADD/FMUL
PACKUSWB mmreg, mem64
0Fh
67h
mm-xxx-xxx DirectPath
FADD/FMUL
PADDB mmreg1, mmreg2
0Fh
FCh
11-xxx-xxx
DirectPath
FADD/FMUL
PADDB mmreg, mem64
0Fh
FCh
mm-xxx-xxx DirectPath
FADD/FMUL
PADDD mmreg1, mmreg2
0Fh
FEh
11-xxx-xxx
DirectPath
FADD/FMUL
PADDD mmreg, mem64
0Fh
FEh
mm-xxx-xxx DirectPath
FADD/FMUL
PADDSB mmreg1, mmreg2
0Fh
ECh
PADDSB mmreg, mem64
0Fh
ECh
PADDSW mmreg1, mmreg2
0Fh
EDh
PADDSW mmreg, mem64
0Fh
EDh
PADDUSB mmreg1, mmreg2
0Fh
DCh
PADDUSB mmreg, mem64
0Fh
DCh
PADDUSW mmreg1, mmreg2
0Fh
DDh
PADDUSW mmreg, mem64
0Fh
DDh
PADDW mmreg1, mmreg2
0Fh
FDh
PADDW mmreg, mem64
0Fh
FDh
PAND mmreg1, mmreg2
0Fh
DBh
PAND mmreg, mem64
0Fh
DBh
11-xxx-xxx
VectorPath
mm-xxx-xxx DirectPath
mm-xxx-xxx DirectPath
11-xxx-xxx
1
FADD/FMUL/FSTORE
VectorPath
DirectPath
DirectPath
1
FSTORE
FADD/FMUL
FADD/FMUL/FSTORE
FADD/FMUL
FSTORE
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
Notes
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
Notes:
1. Bits 2, 1, and 0 of the modR/M byte select the integer register.
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
PANDN mmreg1, mmreg2
0Fh
DFh
PANDN mmreg, mem64
0Fh
DFh
PCMPEQB mmreg1, mmreg2
0Fh
74h
PCMPEQB mmreg, mem64
0Fh
74h
PCMPEQD mmreg1, mmreg2
0Fh
76h
PCMPEQD mmreg, mem64
0Fh
76h
PCMPEQW mmreg1, mmreg2
0Fh
75h
PCMPEQW mmreg, mem64
0Fh
75h
PCMPGTB mmreg1, mmreg2
0Fh
64h
11-xxx-xxx
DirectPath
FADD/FMUL
PCMPGTB mmreg, mem64
0Fh
64h
mm-xxx-xxx DirectPath
FADD/FMUL
PCMPGTD mmreg1, mmreg2
0Fh
66h
11-xxx-xxx
DirectPath
FADD/FMUL
PCMPGTD mmreg, mem64
0Fh
66h
mm-xxx-xxx DirectPath
FADD/FMUL
PCMPGTW mmreg1, mmreg2
0Fh
65h
11-xxx-xxx
DirectPath
FADD/FMUL
PCMPGTW mmreg, mem64
0Fh
65h
mm-xxx-xxx DirectPath
FADD/FMUL
PMADDWD mmreg1, mmreg2
0Fh
F5h
11-xxx-xxx
PMADDWD mmreg, mem64
0Fh
F5h
PMULHW mmreg1, mmreg2
0Fh
E5h
PMULHW mmreg, mem64
0Fh
E5h
PMULLW mmreg1, mmreg2
0Fh
D5h
PMULLW mmreg, mem64
0Fh
D5h
POR mmreg1, mmreg2
0Fh
EBh
POR mmreg, mem64
0Fh
EBh
PSLLD mmreg1, mmreg2
0Fh
F2h
PSLLD mmreg, mem64
0Fh
PSLLD mmreg, imm8
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
DirectPath
FMUL
mm-xxx-xxx DirectPath
FMUL
11-xxx-xxx
DirectPath
FMUL
mm-xxx-xxx DirectPath
FMUL
11-xxx-xxx
DirectPath
FMUL
mm-xxx-xxx DirectPath
FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
F2h
mm-xxx-xxx DirectPath
FADD/FMUL
0Fh
72h
11-110-xxx
DirectPath
FADD/FMUL
PSLLQ mmreg1, mmreg2
0Fh
F3h
11-xxx-xxx
DirectPath
FADD/FMUL
PSLLQ mmreg, mem64
0Fh
F3h
mm-xxx-xxx DirectPath
FADD/FMUL
PSLLQ mmreg, imm8
0Fh
73h
11-110-xxx
DirectPath
FADD/FMUL
PSLLW mmreg1, mmreg2
0Fh
F1h
11-xxx-xxx
DirectPath
FADD/FMUL
PSLLW mmreg, mem64
0Fh
F1h
mm-xxx-xxx DirectPath
FADD/FMUL
PSLLW mmreg, imm8
0Fh
71h
11-110-xxx
FADD/FMUL
DirectPath
11-xxx-xxx
DirectPath
FADD/FMUL
PSRAW mmreg1, mmreg2
0Fh
E1h
PSRAW mmreg, mem64
0Fh
E1h
mm-xxx-xxx DirectPath
FADD/FMUL
PSRAW mmreg, imm8
0Fh
71h
11-100-xxx
DirectPath
FADD/FMUL
PSRAD mmreg1, mmreg2
0Fh
E2h
11-xxx-xxx
DirectPath
FADD/FMUL
PSRAD mmreg, mem64
0Fh
E2h
mm-xxx-xxx DirectPath
FADD/FMUL
PSRAD mmreg, imm8
0Fh
72h
11-100-xxx
DirectPath
FADD/FMUL
PSRLD mmreg1, mmreg2
0Fh
D2h
11-xxx-xxx
DirectPath
FADD/FMUL
PSRLD mmreg, mem64
0Fh
D2h
mm-xxx-xxx DirectPath
FADD/FMUL
PSRLD mmreg, imm8
0Fh
72h
11-010-xxx
DirectPath
FADD/FMUL
PSRLQ mmreg1, mmreg2
0Fh
D3h
11-xxx-xxx
DirectPath
FADD/FMUL
PSRLQ mmreg, mem64
0Fh
D3h
mm-xxx-xxx DirectPath
FADD/FMUL
PSRLQ mmreg, imm8
0Fh
73h
11-010-xxx
DirectPath
FADD/FMUL
PSRLW mmreg1, mmreg2
0Fh
D1h
11-xxx-xxx
DirectPath
FADD/FMUL
PSRLW mmreg, mem64
0Fh
D1h
mm-xxx-xxx DirectPath
FADD/FMUL
PSRLW mmreg, imm8
0Fh
71h
11-010-xxx
DirectPath
FADD/FMUL
PSUBB mmreg1, mmreg2
0Fh
F8h
11-xxx-xxx
DirectPath
FADD/FMUL
PSUBB mmreg, mem64
0Fh
F8h
mm-xxx-xxx DirectPath
FADD/FMUL
PSUBD mmreg1, mmreg2
0Fh
FAh
PSUBD mmreg, mem64
0Fh
FAh
PSUBSB mmreg1, mmreg2
0Fh
E8h
PSUBSB mmreg, mem64
0Fh
E8h
PSUBSW mmreg1, mmreg2
0Fh
E9h
PSUBSW mmreg, mem64
0Fh
E9h
PSUBUSB mmreg1, mmreg2
0Fh
D8h
PSUBUSB mmreg, mem64
0Fh
D8h
PSUBUSW mmreg1, mmreg2
0Fh
D9h
PSUBUSW mmreg, mem64
0Fh
D9h
PSUBW mmreg1, mmreg2
0Fh
F9h
PSUBW mmreg, mem64
0Fh
F9h
PUNPCKHBW mmreg1, mmreg2
0Fh
68h
11-xxx-xxx
DirectPath
FADD/FMUL
PUNPCKHBW mmreg, mem64
0Fh
68h
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
PUNPCKHDQ mmreg1, mmreg2
0Fh
6Ah
PUNPCKHDQ mmreg, mem64
0Fh
6Ah
PUNPCKHWD mmreg1, mmreg2
0Fh
69h
11-xxx-xxx
DirectPath
FADD/FMUL
PUNPCKHWD mmreg, mem64
0Fh
69h
mm-xxx-xxx DirectPath
FADD/FMUL
PUNPCKLBW mmreg1, mmreg2
0Fh
60h
11-xxx-xxx
DirectPath
FADD/FMUL
PUNPCKLBW mmreg, mem64
0Fh
60h
mm-xxx-xxx DirectPath
FADD/FMUL
PUNPCKLDQ mmreg1, mmreg2
0Fh
62h
11-xxx-xxx
DirectPath
FADD/FMUL
PUNPCKLDQ mmreg, mem64
0Fh
62h
mm-xxx-xxx DirectPath
FADD/FMUL
PUNPCKLWD mmreg1, mmreg2
0Fh
61h
11-xxx-xxx
DirectPath
FADD/FMUL
PUNPCKLWD mmreg, mem64
0Fh
61h
mm-xxx-xxx DirectPath
FADD/FMUL
PXOR mmreg1, mmreg2
0Fh
EFh
PXOR mmreg, mem64
0Fh
EFh
11-xxx-xxx
DirectPath
FADD/FMUL
mm-xxx-xxx DirectPath
FADD/FMUL
Notes
Notes:
1. Bits 2, 1, and 0 of the modR/M byte select the integer register.
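The ModR/M patterns used throughout these tables (for example 11-xxx-xxx and mm-xxx-xxx) can be read as three bit fields. The following is a minimal sketch, not part of the original guide, that splits a ModR/M byte into those fields. The function names are hypothetical, and it assumes that "mm" in the tables stands for any mod value other than 11, that is, a memory operand.

```python
def split_modrm(modrm: int) -> tuple:
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    if not 0 <= modrm <= 0xFF:
        raise ValueError("ModR/M must fit in one byte")
    mod = (modrm >> 6) & 0b11    # bits 7-6: 0b11 selects a register operand
    reg = (modrm >> 3) & 0b111   # bits 5-3: register or opcode extension
    rm = modrm & 0b111           # bits 2-0: register number when mod == 0b11
    return mod, reg, rm


def is_register_form(modrm: int) -> bool:
    """True for 11-xxx-xxx rows; False for mm-xxx-xxx rows (assumed memory forms)."""
    return split_modrm(modrm)[0] == 0b11


if __name__ == "__main__":
    # MOVD mm1, ecx assembles to 0F 6E C9; C9h is 11-001-001.
    mod, reg, rm = split_modrm(0xC9)
    print(f"mod={mod:02b} reg={reg:03b} rm={rm:03b}")
```

For MOVD mmreg, reg32, the rm field (bits 2, 1, and 0) is the integer register that Note 1 refers to, and the reg field names the MMX register.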
Table 21. MMX™ Extensions
Instruction Mnemonic
Prefix First
Byte(s) Byte
ModR/M
Byte
Decode
Type
FPU
Pipe(s)
MASKMOVQ mmreg1, mmreg2
0Fh
F7h
VectorPath
FADD/FMUL/FSTORE
MOVNTQ mem64, mmreg
0Fh
E7h
DirectPath
FSTORE
PAVGB mmreg1, mmreg2
0Fh
E0h
DirectPath
FADD/FMUL
PAVGB mmreg, mem64
0Fh
E0h mm-xxx-xxx DirectPath
FADD/FMUL
PAVGW mmreg1, mmreg2
0Fh
E3h
DirectPath
FADD/FMUL
PAVGW mmreg, mem64
0Fh
E3h mm-xxx-xxx DirectPath
FADD/FMUL
PEXTRW reg32, mmreg, imm8
0Fh
C5h
VectorPath
PINSRW mmreg, reg32, imm8
0Fh
C4h
VectorPath
PINSRW mmreg, mem16, imm8
0Fh
C4h
VectorPath
PMAXSW mmreg1, mmreg2
0Fh
EEh
DirectPath
FADD/FMUL
PMAXSW mmreg, mem64
0Fh
EEh mm-xxx-xxx DirectPath
FADD/FMUL
PMAXUB mmreg1, mmreg2
0Fh
DEh
DirectPath
FADD/FMUL
PMAXUB mmreg, mem64
0Fh
DEh mm-xxx-xxx DirectPath
FADD/FMUL
PMINSW mmreg1, mmreg2
0Fh
EAh
FADD/FMUL
11-xxx-xxx
11-xxx-xxx
11-xxx-xxx
11-xxx-xxx
11-xxx-xxx
DirectPath
Notes
Notes:
1. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
PMINSW mmreg, mem64
0Fh
EAh mm-xxx-xxx DirectPath
FADD/FMUL
PMINUB mmreg1, mmreg2
0Fh
DAh
DirectPath
FADD/FMUL
PMINUB mmreg, mem64
0Fh
DAh mm-xxx-xxx DirectPath
FADD/FMUL
PMOVMSKB reg32, mmreg
0Fh
D7h
PMULHUW mmreg1, mmreg2
0Fh
E4h
DirectPath
FMUL
PMULHUW mmreg, mem64
0Fh
E4h mm-xxx-xxx DirectPath
FMUL
PSADBW mmreg1, mmreg2
0Fh
F6h
DirectPath
FADD
PSADBW mmreg, mem64
0Fh
F6h mm-xxx-xxx DirectPath
FADD
PSHUFW mmreg1, mmreg2, imm8
0Fh
70h
DirectPath
FADD/FMUL
PSHUFW mmreg, mem64, imm8
0Fh
70h
DirectPath
FADD/FMUL
PREFETCHNTA mem8
0Fh
18h
DirectPath
-
1
PREFETCHT0 mem8
0Fh
18h
DirectPath
-
1
PREFETCHT1 mem8
0Fh
18h
DirectPath
-
1
PREFETCHT2 mem8
0Fh
18h
DirectPath
-
1
SFENCE
0Fh
AEh
VectorPath
-
11-xxx-xxx
VectorPath
11-xxx-xxx
11-xxx-xxx
Notes:
1. For the PREFETCHNTA/T0/T1/T2 instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
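Note 1 says that the mem8 operand only identifies the 64-byte line to be prefetched. A minimal sketch of that address arithmetic follows. The helper names are hypothetical, and the line size is taken from the note rather than queried from hardware.

```python
LINE_SIZE = 64  # bytes per prefetched line, as stated in Note 1


def prefetch_line_base(address: int) -> int:
    """Return the first address of the 64-byte line that contains `address`."""
    if address < 0:
        raise ValueError("address must be non-negative")
    return address & ~(LINE_SIZE - 1)


def same_prefetch_line(a: int, b: int) -> bool:
    """True when two addresses fall in the same 64-byte line."""
    return prefetch_line_base(a) == prefetch_line_base(b)


if __name__ == "__main__":
    print(hex(prefetch_line_base(0x1234)))
    print(same_prefetch_line(0x1200, 0x123F))
    print(same_prefetch_line(0x123F, 0x1240))
```

Two addresses in the same line name the same prefetch target, so a second prefetch to that line brings in no additional data.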
Table 22. Floating-Point Instructions
Instruction Mnemonic
First Second
Byte Byte
ModR/M
Byte
Decode
Type
FPU
Pipe(s)
F2XM1
D9h
F0h
VectorPath
FABS
D9h
E1h
DirectPath
FMUL
FADD ST, ST(i)
D8h
DirectPath
FADD
FADD [mem32real]
D8h
mm-000-xxx DirectPath
FADD
FADD ST(i), ST
DCh
FADD [mem64real]
DCh
FADDP ST(i), ST
DEh
11-000-xxx
FBLD [mem80]
DFh
mm-100-xxx VectorPath
FBSTP [mem80]
DFh
mm-110-xxx VectorPath
FCHS
D9h
E0h
DirectPath
FCLEX
DBh
E2h
VectorPath
11-000-xxx
11-000-xxx
DirectPath
FADD
mm-000-xxx DirectPath
FADD
DirectPath
FADD
Note
1
1
1
FMUL
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
FCMOVB ST(0), ST(i)
DAh C0-C7h
VectorPath
FCMOVE ST(0), ST(i)
DAh C8-CFh
VectorPath
FCMOVBE ST(0), ST(i)
DAh D0-D7h
VectorPath
FCMOVU ST(0), ST(i)
DAh D8-DFh
VectorPath
FCMOVNB ST(0), ST(i)
DBh C0-C7h
VectorPath
FCMOVNE ST(0), ST(i)
DBh C8-CFh
VectorPath
FCMOVNBE ST(0), ST(i)
DBh D0-D7h
VectorPath
FCMOVNU ST(0), ST(i)
DBh D8-DFh
VectorPath
FCOM ST(i)
D8h
11-010-xxx
DirectPath
FADD
1
FCOMP ST(i)
D8h
11-011-xxx
DirectPath
FADD
1
FCOM [mem32real]
D8h
mm-010-xxx DirectPath
FADD
FCOM [mem64real]
DCh
mm-010-xxx DirectPath
FADD
FCOMI ST, ST(i)
DBh
F0-F7h
VectorPath
FADD
FCOMIP ST, ST(i)
DFh
F0-F7h
VectorPath
FADD
FCOMP [mem32real]
D8h
mm-011-xxx DirectPath
FADD
FCOMP [mem64real]
DCh
mm-011-xxx DirectPath
FADD
FCOMPP
DEh
D9h
11-011-001
FADD
FCOS
D9h
FFh
VectorPath
FDECSTP
D9h
F6h
DirectPath
FADD/FMUL/FSTORE
FDIV ST, ST(i)
D8h
11-110-xxx
DirectPath
FMUL
1
FDIV ST(i), ST
DCh
11-111-xxx
DirectPath
FMUL
1
FDIV [mem32real]
D8h
mm-110-xxx DirectPath
FMUL
FDIV [mem64real]
DCh
mm-110-xxx DirectPath
FMUL
FDIVP ST, ST(i)
DEh
11-111-xxx
DirectPath
FMUL
1
FDIVR ST, ST(i)
D8h
11-110-xxx
DirectPath
FMUL
1
FDIVR ST(i), ST
DCh
11-111-xxx
DirectPath
FMUL
1
FDIVR [mem32real]
D8h
mm-111-xxx DirectPath
FMUL
FDIVR [mem64real]
DCh
mm-111-xxx DirectPath
FMUL
FDIVRP ST(i), ST
DEh
11-110-xxx
DirectPath
FMUL
1
FFREE ST(i)
DDh
11-000-xxx
DirectPath
FADD/FMUL/FSTORE
1
FFREEP ST(i)
DFh C0-C7h
DirectPath
FADD/FMUL/FSTORE
1
DirectPath
FIADD [mem32int]
DAh
mm-000-xxx VectorPath
FIADD [mem16int]
DEh
mm-000-xxx VectorPath
FICOM [mem32int]
DAh
mm-010-xxx VectorPath
FICOM [mem16int]
DEh
mm-010-xxx VectorPath
FICOMP [mem32int]
DAh
mm-011-xxx VectorPath
FICOMP [mem16int]
DEh
mm-011-xxx VectorPath
FIDIV [mem32int]
DAh
mm-110-xxx VectorPath
FIDIV [mem16int]
DEh
mm-110-xxx VectorPath
FIDIVR [mem32int]
DAh
mm-111-xxx VectorPath
FIDIVR [mem16int]
DEh
mm-111-xxx VectorPath
FILD [mem16int]
DFh
mm-000-xxx DirectPath
FSTORE
FILD [mem32int]
DBh
mm-000-xxx DirectPath
FSTORE
FILD [mem64int]
DFh
mm-101-xxx DirectPath
FSTORE
FIMUL [mem32int]
DAh
mm-001-xxx VectorPath
FIMUL [mem16int]
DEh
mm-001-xxx VectorPath
FINCSTP
D9h
F7h
DirectPath
FINIT
DBh
E3h
VectorPath
FIST [mem16int]
DFh
mm-010-xxx DirectPath
FSTORE
FIST [mem32int]
DBh
mm-010-xxx DirectPath
FSTORE
FISTP [mem16int]
DFh
mm-011-xxx DirectPath
FSTORE
FISTP [mem32int]
DBh
mm-011-xxx DirectPath
FSTORE
FISTP [mem64int]
DFh
mm-111-xxx DirectPath
FSTORE
FISUB [mem32int]
DAh
mm-100-xxx VectorPath
FISUB [mem16int]
DEh
mm-100-xxx VectorPath
FISUBR [mem32int]
DAh
mm-101-xxx VectorPath
FISUBR [mem16int]
DEh
mm-101-xxx VectorPath
FLD ST(i)
D9h
11-000-xxx
FLD [mem32real]
D9h
mm-000-xxx DirectPath
FADD/FMUL/FSTORE
FLD [mem64real]
DDh
mm-000-xxx DirectPath
FADD/FMUL/FSTORE
FLD [mem80real]
DBh
mm-101-xxx VectorPath
FLD1
D9h
E8h
DirectPath
DirectPath
Note
FADD/FMUL/FSTORE
FADD/FMUL
1
FSTORE
FLDCW [mem16]
D9h
mm-101-xxx VectorPath
FLDENV [mem14byte]
D9h
mm-100-xxx VectorPath
FLDENV [mem28byte]
D9h
mm-100-xxx VectorPath
FLDL2E
D9h
EAh
DirectPath
FSTORE
FLDL2T
D9h
E9h
DirectPath
FSTORE
FLDLG2
D9h
ECh
DirectPath
FSTORE
FLDLN2
D9h
EDh
DirectPath
FSTORE
FLDPI
D9h
EBh
DirectPath
FSTORE
FLDZ
D9h
EEh
DirectPath
FSTORE
FMUL ST, ST(i)
D8h
11-001-xxx
DirectPath
FMUL
1
FMUL ST(i), ST
DCh
11-001-xxx
DirectPath
FMUL
1
FMUL [mem32real]
D8h
mm-001-xxx DirectPath
FMUL
FMUL [mem64real]
DCh
mm-001-xxx DirectPath
FMUL
FMULP ST, ST(i)
DEh
FNOP
D9h
FPTAN
11-001-xxx
DirectPath
FMUL
D0h
DirectPath
FADD/FMUL/FSTORE
D9h
F2h
VectorPath
FPATAN
D9h
F3h
VectorPath
FPREM
D9h
F8h
DirectPath
FMUL
FPREM1
D9h
F5h
DirectPath
FMUL
FRNDINT
D9h
FCh
VectorPath
FRSTOR [mem94byte]
DDh
mm-100-xxx VectorPath
FRSTOR [mem108byte]
DDh
mm-100-xxx VectorPath
FSAVE [mem94byte]
DDh
mm-110-xxx VectorPath
FSAVE [mem108byte]
DDh
mm-110-xxx VectorPath
FSCALE
D9h
FDh
VectorPath
FSIN
D9h
FEh
VectorPath
FSINCOS
D9h
FBh
VectorPath
FSQRT
D9h
FAh
DirectPath
FST [mem32real]
D9h
mm-010-xxx DirectPath
FSTORE
FST [mem64real]
DDh
mm-010-xxx DirectPath
FSTORE
FST ST(i)
DDh
11-010xxx
DirectPath
1
FMUL
FADD/FMUL
FSTCW [mem16]
D9h
mm-111-xxx VectorPath
FSTENV [mem14byte]
D9h
mm-110-xxx VectorPath
FSTENV [mem28byte]
D9h
mm-110-xxx VectorPath
FSTP [mem32real]
D9h
mm-011-xxx DirectPath
FADD/FMUL
FSTP [mem64real]
DDh
mm-011-xxx DirectPath
FADD/FMUL
FSTP [mem80real]
D9h
mm-111-xxx VectorPath
FSTP ST(i)
DDh
FSTSW AX
DFh
FSTSW [mem16]
DDh
mm-111-xxx VectorPath
FSTORE
FSUB [mem32real]
D8h
mm-100-xxx DirectPath
FADD
FSUB [mem64real]
DCh
mm-100-xxx DirectPath
FADD
FSUB ST, ST(i)
D8h
11-100-xxx
DirectPath
FADD
1
FSUB ST(i), ST
DCh
11-101-xxx
DirectPath
FADD
1
FSUBP ST, ST(i)
DEh
11-101-xxx
DirectPath
FADD
1
FSUBR [mem32real]
D8h
mm-101-xxx DirectPath
FADD
FSUBR [mem64real]
DCh
mm-101-xxx DirectPath
FADD
FSUBR ST, ST(i)
D8h
11-100-xxx
DirectPath
FADD
1
FSUBR ST(i), ST
DCh
11-101-xxx
DirectPath
FADD
1
FSUBRP ST(i), ST
DEh
11-100-xxx
DirectPath
FADD
1
FTST
D9h
DirectPath
FADD
FUCOM
DDh
DirectPath
FADD
FUCOMI ST, ST(i)
DB
E8-EFh
VectorPath
FADD
FUCOMIP ST, ST(i)
DF
E8-EFh
VectorPath
FADD
DirectPath
FADD
DirectPath
FADD
11-011-xxx
E0h
DirectPath
FADD/FMUL
VectorPath
E4h
11-100-xxx
FUCOMP
DDh
11-101-xxx
FUCOMPP
DAh
FWAIT
9Bh
FXAM
D9h
FXCH
D9h
FXTRACT
D9h
F4h
VectorPath
FYL2X
D9h
F1h
VectorPath
FYL2XP1
D9h
F9h
VectorPath
E9h
DirectPath
E5h
VectorPath
11-001-xxx
DirectPath
FADD/FMUL/FSTORE
Notes:
1. The last three bits of the modR/M byte select the stack entry ST(i).
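Note 1 can be illustrated with the register forms listed in Table 22, such as FADD ST, ST(i) (D8h, 11-000-xxx) and FMUL ST, ST(i) (D8h, 11-001-xxx). The sketch below builds the two instruction bytes from those table entries. The function name and parameter names are hypothetical.

```python
def encode_st_form(first_byte: int, reg_field: int, i: int) -> bytes:
    """Build a two-byte x87 register-form instruction.

    first_byte -- the opcode byte from the "First Byte" column (e.g. 0xD8)
    reg_field  -- the middle three ModR/M bits from the table (e.g. 0b000)
    i          -- the stack entry ST(i), placed in the last three bits
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must fit in one byte")
    if not 0 <= reg_field <= 7:
        raise ValueError("reg_field must be a 3-bit value")
    if not 0 <= i <= 7:
        raise ValueError("ST(i) index must be in 0..7")
    modrm = (0b11 << 6) | (reg_field << 3) | i
    return bytes([first_byte, modrm])


if __name__ == "__main__":
    print(encode_st_form(0xD8, 0b000, 3).hex())  # FADD ST, ST(3)
    print(encode_st_form(0xD8, 0b001, 1).hex())  # FMUL ST, ST(1)
```

Only the last three bits change between ST(0) and ST(7). The decode type and FPU pipe listed for the row apply to all eight encodings.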
Table 23. 3DNow!™ Instructions
Prefix
Byte(s)
imm8
0Fh
0Eh
PAVGUSB mmreg1, mmreg2
0Fh, 0Fh
BFh
11-xxx-xxx
DirectPath
FADD/FMUL
PAVGUSB mmreg, mem64
0Fh, 0Fh
BFh
mm-xxx-xxx DirectPath
FADD/FMUL
PF2ID mmreg1, mmreg2
0Fh, 0Fh
1Dh
11-xxx-xxx
DirectPath
FADD
PF2ID mmreg, mem64
0Fh, 0Fh
1Dh
mm-xxx-xxx DirectPath
FADD
PFACC mmreg1, mmreg2
0Fh, 0Fh
AEh
11-xxx-xxx
DirectPath
FADD
PFACC mmreg, mem64
0Fh, 0Fh
AEh
mm-xxx-xxx DirectPath
FADD
PFADD mmreg1, mmreg2
0Fh, 0Fh
9Eh
11-xxx-xxx
DirectPath
FADD
PFADD mmreg, mem64
0Fh, 0Fh
9Eh
mm-xxx-xxx DirectPath
FADD
PFCMPEQ mmreg1, mmreg2
0Fh, 0Fh
B0h
11-xxx-xxx
DirectPath
FADD
PFCMPEQ mmreg, mem64
0Fh, 0Fh
B0h
mm-xxx-xxx DirectPath
FADD
PFCMPGE mmreg1, mmreg2
0Fh, 0Fh
90h
PFCMPGE mmreg, mem64
0Fh, 0Fh
90h
PFCMPGT mmreg1, mmreg2
0Fh, 0Fh
A0h
11-xxx-xxx
DirectPath
FADD
PFCMPGT mmreg, mem64
0Fh, 0Fh
A0h
mm-xxx-xxx DirectPath
FADD
PFMAX mmreg1, mmreg2
0Fh, 0Fh
A4h
PFMAX mmreg, mem64
0Fh, 0Fh
A4h
PFMIN mmreg1, mmreg2
0Fh, 0Fh
94h
11-xxx-xxx
DirectPath
FADD
PFMIN mmreg, mem64
0Fh, 0Fh
94h
mm-xxx-xxx DirectPath
FADD
PFMUL mmreg1, mmreg2
0Fh, 0Fh
B4h
11-xxx-xxx
DirectPath
FMUL
PFMUL mmreg, mem64
0Fh, 0Fh
B4h
mm-xxx-xxx DirectPath
FMUL
PFRCP mmreg1, mmreg2
0Fh, 0Fh
96h
11-xxx-xxx
DirectPath
FMUL
PFRCP mmreg, mem64
0Fh, 0Fh
96h
mm-xxx-xxx DirectPath
FMUL
PFRCPIT1 mmreg1, mmreg2
0Fh, 0Fh
A6h
11-xxx-xxx
DirectPath
FMUL
PFRCPIT1 mmreg, mem64
0Fh, 0Fh
A6h
mm-xxx-xxx DirectPath
FMUL
PFRCPIT2 mmreg1, mmreg2
0Fh, 0Fh
B6h
PFRCPIT2 mmreg, mem64
0Fh, 0Fh
B6h
PFRSQIT1 mmreg1, mmreg2
0Fh, 0Fh
A7h
11-xxx-xxx
DirectPath
FMUL
PFRSQIT1 mmreg, mem64
0Fh, 0Fh
A7h
mm-xxx-xxx DirectPath
FMUL
PFRSQRT mmreg1, mmreg2
0Fh, 0Fh
97h
11-xxx-xxx
FMUL
Instruction Mnemonic
FEMMS
ModR/M
Byte
Decode
Type
FPU
Pipe(s)
DirectPath FADD/FMUL/FSTORE
11-xxx-xxx
DirectPath
FADD
mm-xxx-xxx DirectPath
FADD
11-xxx-xxx
DirectPath
FADD
mm-xxx-xxx DirectPath
FADD
11-xxx-xxx
DirectPath
FMUL
mm-xxx-xxx DirectPath
FMUL
DirectPath
Note
2
Notes:
1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
2. The byte listed in the column titled ‘imm8’ is actually the opcode byte.
PFRSQRT mmreg, mem64
0Fh, 0Fh
97h
mm-xxx-xxx DirectPath
FMUL
PFSUB mmreg1, mmreg2
0Fh, 0Fh
9Ah
11-xxx-xxx
DirectPath
FADD
PFSUB mmreg, mem64
0Fh, 0Fh
9Ah
mm-xxx-xxx DirectPath
FADD
PFSUBR mmreg1, mmreg2
0Fh, 0Fh
AAh
11-xxx-xxx
DirectPath
FADD
PFSUBR mmreg, mem64
0Fh, 0Fh
AAh
mm-xxx-xxx DirectPath
FADD
PI2FD mmreg1, mmreg2
0Fh, 0Fh
0Dh
11-xxx-xxx
DirectPath
FADD
PI2FD mmreg, mem64
0Fh, 0Fh
0Dh
mm-xxx-xxx DirectPath
FADD
PMULHRW mmreg1, mmreg2 0Fh, 0Fh
B7h
PMULHRW mmreg1, mem64
0Fh, 0Fh
PREFETCH mem8
PREFETCHW mem8
ModR/M
Byte
11-xxx-xxx
Decode
Type
FPU
Pipe(s)
Note
DirectPath
FMUL
B7h
mm-xxx-xxx DirectPath
FMUL
0Fh
0Dh
mm-000-xxx DirectPath
-
1, 2
0Fh
0Dh
mm-001-xxx DirectPath
-
1, 2
Notes:
1. For the PREFETCH and PREFETCHW instructions, the mem8 value refers to an address in the 64-byte line that will be prefetched.
2. The byte listed in the column titled ‘imm8’ is actually the opcode byte.
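Note 2 reflects how 3DNow! instructions are encoded: the 0Fh, 0Fh prefix bytes come first, then the ModR/M byte, and the byte from the 'imm8' column comes last, in the position an immediate would normally occupy. The sketch below assembles the register-to-register form only. The function name is hypothetical, and the suffix values are taken from Table 23.

```python
def encode_3dnow_rr(suffix: int, dst: int, src: int) -> bytes:
    """Encode a register-to-register 3DNow! instruction.

    suffix -- the opcode byte from the 'imm8' column (e.g. 0x9E for PFADD)
    dst    -- destination MMX register number (mmreg1), 0..7
    src    -- source MMX register number (mmreg2), 0..7
    """
    if not 0 <= suffix <= 0xFF:
        raise ValueError("suffix must fit in one byte")
    if not (0 <= dst <= 7 and 0 <= src <= 7):
        raise ValueError("MMX register numbers must be in 0..7")
    modrm = (0b11 << 6) | (dst << 3) | src   # 11-xxx-xxx register form
    return bytes([0x0F, 0x0F, modrm, suffix])


if __name__ == "__main__":
    print(encode_3dnow_rr(0x9E, 0, 1).hex(" "))  # PFADD mm0, mm1
    print(encode_3dnow_rr(0xB4, 2, 3).hex(" "))  # PFMUL mm2, mm3
```

Memory forms (mm-xxx-xxx) are not covered here. In those forms the suffix byte still comes last, after any SIB and displacement bytes.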
Table 24. 3DNow!™ Extensions
Prefix
Byte(s)
imm8
ModR/M
Byte
Decode
Type
FPU
Pipe(s)
PF2IW mmreg1, mmreg2
0Fh, 0Fh
1Ch
11-xxx-xxx
DirectPath
FADD
PF2IW mmreg, mem64
0Fh, 0Fh
1Ch
mm-xxx-xxx DirectPath
FADD
PFNACC mmreg1, mmreg2
0Fh, 0Fh
8Ah
PFNACC mmreg, mem64
0Fh, 0Fh
8Ah
PFPNACC mmreg1, mmreg2
0Fh, 0Fh
8Eh
PFPNACC mmreg, mem64
0Fh, 0Fh
8Eh
PI2FW mmreg1, mmreg2
0Fh, 0Fh
0Ch
PI2FW mmreg, mem64
0Fh, 0Fh
0Ch
PSWAPD mmreg1, mmreg2
0Fh, 0Fh
BBh
11-xxx-xxx
DirectPath
FADD/FMUL
PSWAPD mmreg, mem64
0Fh, 0Fh
BBh
mm-xxx-xxx DirectPath
FADD/FMUL
Instruction Mnemonic
11-xxx-xxx
DirectPath
FADD
mm-xxx-xxx DirectPath
FADD
11-xxx-xxx
DirectPath
FADD
mm-xxx-xxx DirectPath
FADD
11-xxx-xxx
DirectPath
FADD
mm-xxx-xxx DirectPath
FADD
Note
Appendix G
DirectPath versus VectorPath Instructions
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions.
DirectPath instructions are optimized to decode and execute
efficiently by minimizing the number of operations per x86
instruction. They include the ‘register ← register op memory’
forms as well as the ‘register ← register op register’ forms of
instructions.
DirectPath Instructions
The following tables contain DirectPath instructions, which
should be used on the AMD Athlon processor wherever possible:
■ Table 25, “DirectPath Integer Instructions”
■ Table 26, “DirectPath MMX™ Instructions,” and Table 27, “DirectPath MMX™ Extensions”
■ Table 28, “DirectPath Floating-Point Instructions”
■ All 3DNow! instructions, including the 3DNow! Extensions, are DirectPath and are listed in Table 23, “3DNow!™ Instructions,” and Table 24, “3DNow!™ Extensions.”
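One way to apply this recommendation is to check instruction forms against the decode types in the tables. The sketch below is illustrative only: the dictionary is a small hand-copied excerpt from Tables 21, 22, and 28, and the helper names are hypothetical. A real tool would need the complete tables.

```python
# Excerpt of decode types copied from the tables in this chapter.
DECODE_TYPE = {
    "FIADD [mem32int]": "VectorPath",
    "FILD [mem32int]": "DirectPath",
    "FADDP ST(i), ST": "DirectPath",
    "PINSRW mmreg, reg32, imm8": "VectorPath",
    "PAVGB mmreg1, mmreg2": "DirectPath",
}


def classify(form: str) -> str:
    """Return the decode type for an instruction form, or 'unknown'."""
    return DECODE_TYPE.get(form, "unknown")


def vectorpath_forms(forms):
    """Return the forms in `forms` that the excerpt lists as VectorPath."""
    return [f for f in forms if classify(f) == "VectorPath"]


if __name__ == "__main__":
    original = ["FIADD [mem32int]"]
    rewritten = ["FILD [mem32int]", "FADDP ST(i), ST"]
    print(vectorpath_forms(original))
    print(vectorpath_forms(rewritten))
```

In this excerpt, the single VectorPath form FIADD [mem32int] can be replaced by a FILD [mem32int] followed by FADDP ST(i), ST, both of which are DirectPath. Whether the two-instruction sequence is actually faster in a given loop should be confirmed by measurement.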
Table 25. DirectPath Integer Instructions
Instruction Mnemonic
ADC mreg8, reg8
ADC mem8, reg8
ADC mreg16/32, reg16/32
ADC mem16/32, reg16/32
ADC reg8, mreg8
ADC reg8, mem8
ADC reg16/32, mreg16/32
ADC reg16/32, mem16/32
ADC AL, imm8
ADC EAX, imm16/32
ADC mreg8, imm8
ADC mem8, imm8
ADC mreg16/32, imm16/32
ADC mem16/32, imm16/32
ADC mreg16/32, imm8 (sign extended)
ADC mem16/32, imm8 (sign extended)
ADD mreg8, reg8
ADD mem8, reg8
ADD mreg16/32, reg16/32
ADD mem16/32, reg16/32
ADD reg8, mreg8
ADD reg8, mem8
ADD reg16/32, mreg16/32
ADD reg16/32, mem16/32
ADD AL, imm8
ADD EAX, imm16/32
ADD mreg8, imm8
ADD mem8, imm8
ADD mreg16/32, imm16/32
ADD mem16/32, imm16/32
ADD mreg16/32, imm8 (sign extended)
ADD mem16/32, imm8 (sign extended)
AND mreg8, reg8
AND mem8, reg8
AND mreg16/32, reg16/32
AND mem16/32, reg16/32
AND reg8, mreg8
AND reg8, mem8
AND reg16/32, mreg16/32
AND reg16/32, mem16/32
AND AL, imm8
AND EAX, imm16/32
AND mreg8, imm8
AND mem8, imm8
AND mreg16/32, imm16/32
AND mem16/32, imm16/32
AND mreg16/32, imm8 (sign extended)
AND mem16/32, imm8 (sign extended)
BSWAP EAX
BSWAP ECX
BSWAP EDX
BSWAP EBX
BSWAP ESP
BSWAP EBP
BSWAP ESI
BSWAP EDI
BT mreg16/32, reg16/32
BT mreg16/32, imm8
BT mem16/32, imm8
CBW/CWDE
CLC
CMC
CMOVA/CMOVNBE reg16/32, reg16/32
CMOVA/CMOVNBE reg16/32, mem16/32
CMOVAE/CMOVNB/CMOVNC reg16/32, reg16/32
CMOVAE/CMOVNB/CMOVNC reg16/32, mem16/32
CMOVB/CMOVC/CMOVNAE reg16/32, reg16/32
CMOVB/CMOVC/CMOVNAE reg16/32, mem16/32
CMOVBE/CMOVNA reg16/32, reg16/32
CMOVBE/CMOVNA reg16/32, mem16/32
CMOVE/CMOVZ reg16/32, reg16/32
CMOVE/CMOVZ reg16/32, mem16/32
CMOVG/CMOVNLE reg16/32, reg16/32
CMOVG/CMOVNLE reg16/32, mem16/32
CMOVGE/CMOVNL reg16/32, reg16/32
CMOVGE/CMOVNL reg16/32, mem16/32
CMOVL/CMOVNGE reg16/32, reg16/32
CMOVL/CMOVNGE reg16/32, mem16/32
CMOVLE/CMOVNG reg16/32, reg16/32
CMOVLE/CMOVNG reg16/32, mem16/32
CMOVNE/CMOVNZ reg16/32, reg16/32
CMOVNE/CMOVNZ reg16/32, mem16/32
CMOVNO reg16/32, reg16/32
CMOVNO reg16/32, mem16/32
CMOVNP/CMOVPO reg16/32, reg16/32
CMOVNP/CMOVPO reg16/32, mem16/32
CMOVNS reg16/32, reg16/32
CMOVNS reg16/32, mem16/32
CMOVO reg16/32, reg16/32
CMOVO reg16/32, mem16/32
CMOVP/CMOVPE reg16/32, reg16/32
CMOVP/CMOVPE reg16/32, mem16/32
CMOVS reg16/32, reg16/32
CMOVS reg16/32, mem16/32
CMP mreg8, reg8
CMP mem8, reg8
CMP mreg16/32, reg16/32
CMP mem16/32, reg16/32
CMP reg8, mreg8
CMP reg8, mem8
CMP reg16/32, mreg16/32
CMP reg16/32, mem16/32
CMP AL, imm8
CMP EAX, imm16/32
CMP mreg8, imm8
CMP mem8, imm8
CMP mreg16/32, imm16/32
CMP mem16/32, imm16/32
CMP mreg16/32, imm8 (sign extended)
CMP mem16/32, imm8 (sign extended)
CWD/CDQ
DEC EAX
DEC ECX
DEC EDX
DEC EBX
DEC ESP
DEC EBP
DEC ESI
DEC EDI
DEC mreg8
DEC mem8
DEC mreg16/32
DEC mem16/32
INC EAX
INC ECX
INC EDX
INC EBX
INC ESP
INC EBP
INC ESI
INC EDI
INC mreg8
INC mem8
INC mreg16/32
INC mem16/32
JO short disp8
JNO short disp8
JB/JNAE short disp8
JNB/JAE short disp8
JZ/JE short disp8
JNZ/JNE short disp8
JBE/JNA short disp8
JNBE/JA short disp8
JS short disp8
JNS short disp8
JP/JPE short disp8
JNP/JPO short disp8
JL/JNGE short disp8
JNL/JGE short disp8
JLE/JNG short disp8
JNLE/JG short disp8
JO near disp16/32
JNO near disp16/32
JB/JNAE near disp16/32
JNB/JAE near disp16/32
JZ/JE near disp16/32
JNZ/JNE near disp16/32
JBE/JNA near disp16/32
JNBE/JA near disp16/32
JS near disp16/32
JNS near disp16/32
JP/JPE near disp16/32
JNP/JPO near disp16/32
JL/JNGE near disp16/32
JNL/JGE near disp16/32
JLE/JNG near disp16/32
JNLE/JG near disp16/32
JMP near disp16/32 (direct)
JMP far disp32/48 (direct)
JMP disp8 (short)
JMP near mreg16/32 (indirect)
JMP near mem16/32 (indirect)
LEA reg32, mem16/32
MOV mreg8, reg8
MOV mem8, reg8
MOV mreg16/32, reg16/32
MOV mem16/32, reg16/32
MOV reg8, mreg8
MOV reg8, mem8
MOV reg16/32, mreg16/32
MOV reg16/32, mem16/32
MOV AL, mem8
MOV EAX, mem16/32
MOV mem8, AL
MOV mem16/32, EAX
MOV AL, imm8
MOV CL, imm8
MOV DL, imm8
MOV BL, imm8
MOV AH, imm8
MOV CH, imm8
MOV DH, imm8
MOV BH, imm8
MOV EAX, imm16/32
MOV ECX, imm16/32
MOV EDX, imm16/32
MOV EBX, imm16/32
MOV ESP, imm16/32
MOV EBP, imm16/32
MOV ESI, imm16/32
MOV EDI, imm16/32
MOV mreg8, imm8
MOV mem8, imm8
MOV mreg16/32, imm16/32
MOV mem16/32, imm16/32
MOVSX reg16/32, mreg8
MOVSX reg16/32, mem8
MOVSX reg32, mreg16
MOVSX reg32, mem16
MOVZX reg16/32, mreg8
MOVZX reg16/32, mem8
MOVZX reg32, mreg16
MOVZX reg32, mem16
NEG mreg8
NEG mem8
NEG mreg16/32
NEG mem16/32
NOP (XCHG EAX, EAX)
NOT mreg8
NOT mem8
NOT mreg16/32
NOT mem16/32
OR mreg8, reg8
OR mem8, reg8
OR mreg16/32, reg16/32
OR mem16/32, reg16/32
OR reg8, mreg8
OR reg8, mem8
OR reg16/32, mreg16/32
OR reg16/32, mem16/32
OR AL, imm8
OR EAX, imm16/32
OR mreg8, imm8
OR mem8, imm8
OR mreg16/32, imm16/32
OR mem16/32, imm16/32
OR mreg16/32, imm8 (sign extended)
OR mem16/32, imm8 (sign extended)
PUSH EAX
PUSH ECX
PUSH EDX
PUSH EBX
PUSH ESP
PUSH EBP
PUSH ESI
PUSH EDI
PUSH imm8
PUSH imm16/32
RCL mreg8, imm8
RCL mreg16/32, imm8
RCL mreg8, 1
RCL mem8, 1
RCL mreg16/32, 1
RCL mem16/32, 1
RCL mreg8, CL
RCL mreg16/32, CL
RCR mreg8, imm8
RCR mreg16/32, imm8
RCR mreg8, 1
RCR mem8, 1
RCR mreg16/32, 1
RCR mem16/32, 1
RCR mreg8, CL
RCR mreg16/32, CL
ROL mreg8, imm8
ROL mem8, imm8
ROL mreg16/32, imm8
ROL mem16/32, imm8
ROL mreg8, 1
ROL mem8, 1
ROL mreg16/32, 1
ROL mem16/32, 1
ROL mreg8, CL
ROL mem8, CL
ROL mreg16/32, CL
ROL mem16/32, CL
ROR mreg8, imm8
ROR mem8, imm8
ROR mreg16/32, imm8
ROR mem16/32, imm8
ROR mreg8, 1
ROR mem8, 1
ROR mreg16/32, 1
ROR mem16/32, 1
ROR mreg8, CL
ROR mem8, CL
ROR mreg16/32, CL
ROR mem16/32, CL
SAR mreg8, imm8
SAR mem8, imm8
SAR mreg16/32, imm8
SAR mem16/32, imm8
SAR mreg8, 1
SAR mem8, 1
SAR mreg16/32, 1
SAR mem16/32, 1
SAR mreg8, CL
SAR mem8, CL
SAR mreg16/32, CL
SAR mem16/32, CL
SBB mreg8, reg8
SBB mem8, reg8
SBB mreg16/32, reg16/32
SBB mem16/32, reg16/32
SBB reg8, mreg8
SBB reg8, mem8
SBB reg16/32, mreg16/32
SBB reg16/32, mem16/32
SBB AL, imm8
SBB EAX, imm16/32
SBB mreg8, imm8
SBB mem8, imm8
SBB mreg16/32, imm16/32
SBB mem16/32, imm16/32
SBB mreg16/32, imm8 (sign extended)
SBB mem16/32, imm8 (sign extended)
SETO mreg8
SETO mem8
SETNO mreg8
SETNO mem8
SETB/SETC/SETNAE mreg8
SETB/SETC/SETNAE mem8
SETAE/SETNB/SETNC mreg8
SETAE/SETNB/SETNC mem8
SETE/SETZ mreg8
SETE/SETZ mem8
SETNE/SETNZ mreg8
SETNE/SETNZ mem8
SETBE/SETNA mreg8
SETBE/SETNA mem8
SETA/SETNBE mreg8
SETA/SETNBE mem8
SETS mreg8
SETS mem8
SETNS mreg8
SETNS mem8
SETP/SETPE mreg8
SETP/SETPE mem8
SETNP/SETPO mreg8
SETNP/SETPO mem8
SETL/SETNGE mreg8
SETL/SETNGE mem8
SETGE/SETNL mreg8
SETGE/SETNL mem8
SETLE/SETNG mreg8
SETLE/SETNG mem8
SETG/SETNLE mreg8
SETG/SETNLE mem8
SHL/SAL mreg8, imm8
SHL/SAL mem8, imm8
SHL/SAL mreg16/32, imm8
SHL/SAL mem16/32, imm8
SHL/SAL mreg8, 1
SHL/SAL mem8, 1
SHL/SAL mreg16/32, 1
SHL/SAL mem16/32, 1
SHL/SAL mreg8, CL
SHL/SAL mem8, CL
SHL/SAL mreg16/32, CL
SHL/SAL mem16/32, CL
SHR mreg8, imm8
SHR mem8, imm8
SHR mreg16/32, imm8
SHR mem16/32, imm8
SHR mreg8, 1
SHR mem8, 1
SHR mreg16/32, 1
SHR mem16/32, 1
SHR mreg8, CL
SHR mem8, CL
SHR mreg16/32, CL
SHR mem16/32, CL
STC
SUB mreg8, reg8
SUB mem8, reg8
SUB mreg16/32, reg16/32
SUB mem16/32, reg16/32
SUB reg8, mreg8
SUB reg8, mem8
SUB reg16/32, mreg16/32
SUB reg16/32, mem16/32
SUB AL, imm8
SUB EAX, imm16/32
SUB mreg8, imm8
SUB mem8, imm8
SUB mreg16/32, imm16/32
SUB mem16/32, imm16/32
SUB mreg16/32, imm8 (sign extended)
SUB mem16/32, imm8 (sign extended)
TEST mreg8, reg8
TEST mem8, reg8
TEST mreg16/32, reg16/32
TEST mem16/32, reg16/32
TEST AL, imm8
TEST EAX, imm16/32
TEST mreg8, imm8
TEST mem8, imm8
TEST mreg16/32, imm16/32
TEST mem16/32, imm16/32
WAIT
XCHG EAX, EAX
XOR mreg8, reg8
XOR mem8, reg8
XOR mreg16/32, reg16/32
XOR mem16/32, reg16/32
XOR reg8, mreg8
XOR reg8, mem8
XOR reg16/32, mreg16/32
XOR reg16/32, mem16/32
XOR AL, imm8
XOR EAX, imm16/32
XOR mreg8, imm8
XOR mem8, imm8
XOR mreg16/32, imm16/32
XOR mem16/32, imm16/32
XOR mreg16/32, imm8 (sign extended)
XOR mem16/32, imm8 (sign extended)
Table 26. DirectPath MMX™ Instructions
Instruction Mnemonic
EMMS
MOVD mmreg, mem32
MOVD mem32, mmreg
MOVQ mmreg1, mmreg2
MOVQ mmreg, mem64
MOVQ mmreg2, mmreg1
MOVQ mem64, mmreg
PACKSSDW mmreg1, mmreg2
PACKSSDW mmreg, mem64
PACKSSWB mmreg1, mmreg2
PACKSSWB mmreg, mem64
PACKUSWB mmreg1, mmreg2
PACKUSWB mmreg, mem64
PADDB mmreg1, mmreg2
PADDB mmreg, mem64
PADDD mmreg1, mmreg2
PADDD mmreg, mem64
PADDSB mmreg1, mmreg2
PADDSB mmreg, mem64
PADDSW mmreg1, mmreg2
PADDSW mmreg, mem64
PADDUSB mmreg1, mmreg2
PADDUSB mmreg, mem64
PADDUSW mmreg1, mmreg2
PADDUSW mmreg, mem64
PADDW mmreg1, mmreg2
PADDW mmreg, mem64
PAND mmreg1, mmreg2
PAND mmreg, mem64
PANDN mmreg1, mmreg2
PANDN mmreg, mem64
PCMPEQB mmreg1, mmreg2
PCMPEQB mmreg, mem64
PCMPEQD mmreg1, mmreg2
PCMPEQD mmreg, mem64
PCMPEQW mmreg1, mmreg2
PCMPEQW mmreg, mem64
PCMPGTB mmreg1, mmreg2
PCMPGTB mmreg, mem64
PCMPGTD mmreg1, mmreg2
PCMPGTD mmreg, mem64
PCMPGTW mmreg1, mmreg2
PCMPGTW mmreg, mem64
PMADDWD mmreg1, mmreg2
PMADDWD mmreg, mem64
PMULHW mmreg1, mmreg2
PMULHW mmreg, mem64
PMULLW mmreg1, mmreg2
PMULLW mmreg, mem64
POR mmreg1, mmreg2
POR mmreg, mem64
PSLLD mmreg1, mmreg2
PSLLD mmreg, mem64
PSLLD mmreg, imm8
PSLLQ mmreg1, mmreg2
PSLLQ mmreg, mem64
PSLLQ mmreg, imm8
PSLLW mmreg1, mmreg2
PSLLW mmreg, mem64
PSLLW mmreg, imm8
PSRAW mmreg1, mmreg2
PSRAW mmreg, mem64
PSRAW mmreg, imm8
PSRAD mmreg1, mmreg2
PSRAD mmreg, mem64
PSRAD mmreg, imm8
PSRLD mmreg1, mmreg2
PSRLD mmreg, mem64
DirectPath Instructions
227
AMD Athlon™ Processor x86 Code Optimization
22007E/0—November 1999
Table 26. DirectPath MMX™ Instructions (Continued) Table 26. DirectPath MMX™ Instructions (Continued)
Instruction Mnemonic
PSRLD mmreg, imm8
Instruction Mnemonic
PXOR mmreg, mem64
PSRLQ mmreg1, mmreg2
PSRLQ mmreg, mem64
PSRLQ mmreg, imm8
PSRLW mmreg1, mmreg2
Table 27. DirectPath MMX™ Extensions
Instruction Mnemonic
PSRLW mmreg, mem64
MOVNTQ mem64, mmreg
PSRLW mmreg, imm8
PAVGB mmreg1, mmreg2
PSUBB mmreg1, mmreg2
PAVGB mmreg, mem64
PSUBB mmreg, mem64
PAVGW mmreg1, mmreg2
PSUBD mmreg1, mmreg2
PAVGW mmreg, mem64
PSUBD mmreg, mem64
PMAXSW mmreg1, mmreg2
PSUBSB mmreg1, mmreg2
PMAXSW mmreg, mem64
PSUBSB mmreg, mem64
PMAXUB mmreg1, mmreg2
PSUBSW mmreg1, mmreg2
PMAXUB mmreg, mem64
PSUBSW mmreg, mem64
PMINSW mmreg1, mmreg2
PSUBUSB mmreg1, mmreg2
PMINSW mmreg, mem64
PSUBUSB mmreg, mem64
PMINUB mmreg1, mmreg2
PSUBUSW mmreg1, mmreg2
PMINUB mmreg, mem64
PSUBUSW mmreg, mem64
PMULHUW mmreg1, mmreg2
PSUBW mmreg1, mmreg2
PMULHUW mmreg, mem64
PSUBW mmreg, mem64
PSADBW mmreg1, mmreg2
PUNPCKHBW mmreg1, mmreg2
PSADBW mmreg, mem64
PUNPCKHBW mmreg, mem64
PSHUFW mmreg1, mmreg2, imm8
PUNPCKHDQ mmreg1, mmreg2
PSHUFW mmreg, mem64, imm8
PUNPCKHDQ mmreg, mem64
PREFETCHNTA mem8
PUNPCKHWD mmreg1, mmreg2
PREFETCHT0 mem8
PUNPCKHWD mmreg, mem64
PREFETCHT1 mem8
PUNPCKLBW mmreg1, mmreg2
PREFETCHT2 mem8
PUNPCKLBW mmreg, mem64
PUNPCKLDQ mmreg1, mmreg2
PUNPCKLDQ mmreg, mem64
PUNPCKLWD mmreg1, mmreg2
PUNPCKLWD mmreg, mem64
PXOR mmreg1, mmreg2
228
DirectPath Instructions
22007E/0—November 1999

Table 28. DirectPath Floating-Point Instructions
Instruction Mnemonic
FABS
FADD ST, ST(i)
FADD [mem32real]
FADD ST(i), ST
FADD [mem64real]
FADDP ST(i), ST
FCHS
FCOM ST(i)
FCOMP ST(i)
FCOM [mem32real]
FCOM [mem64real]
FCOMP [mem32real]
FCOMP [mem64real]
FCOMPP
FDECSTP
FDIV ST, ST(i)
FDIV ST(i), ST
FDIV [mem32real]
FDIV [mem64real]
FDIVP ST, ST(i)
FDIVR ST, ST(i)
FDIVR ST(i), ST
FDIVR [mem32real]
FDIVR [mem64real]
FDIVRP ST(i), ST
FFREE ST(i)
FFREEP ST(i)
FILD [mem16int]
FILD [mem32int]
FILD [mem64int]
FIMUL [mem32int]
FIMUL [mem16int]
FINCSTP
FIST [mem16int]
FIST [mem32int]
FISTP [mem16int]
FISTP [mem32int]
FISTP [mem64int]
FLD ST(i)
FLD [mem32real]
FLD [mem64real]
FLD [mem80real]
FLD1
FLDL2E
FLDL2T
FLDLG2
FLDLN2
FLDPI
FLDZ
FMUL ST, ST(i)
FMUL ST(i), ST
FMUL [mem32real]
FMUL [mem64real]
FMULP ST, ST(i)
FNOP
FPREM
FPREM1
FSQRT
FST [mem32real]
FST [mem64real]
FST ST(i)
FSTP [mem32real]
FSTP [mem64real]
FSTP [mem80real]
FSTP ST(i)
FSUB [mem32real]
FSUB [mem64real]
FSUB ST, ST(i)
FSUB ST(i), ST
FSUBP ST, ST(i)
FSUBR [mem32real]
FSUBR [mem64real]
FSUBR ST, ST(i)
FSUBR ST(i), ST
FSUBRP ST(i), ST
FTST
FUCOM
FUCOMP
FUCOMPP
FWAIT
FXCH

VectorPath Instructions
The following tables contain VectorPath instructions, which
should be avoided in the AMD Athlon processor:
■ Table 29, “VectorPath Integer Instructions,” on page 231
■ Table 30, “VectorPath MMX™ Instructions,” on page 234
and Table 31, “VectorPath MMX™ Extensions,” on page 234
■ Table 32, “VectorPath Floating-Point Instructions,” on page 235
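
As a small illustration of steering away from one Table 29 entry: LOOP disp8 is listed there as VectorPath, and a common hand-coded substitute is a DEC ECX / JNZ pair. The sketch below expresses the same count-down shape in C. The function name sum_countdown is hypothetical (it does not come from this guide), and whether a given compiler actually emits DEC/JNZ for this loop is an assumption, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the guide): sums n 32-bit values.
 *
 * Hand-written x86 loop control could take either form:
 *     avoid:   LOOP  label      ; VectorPath (Table 29)
 *     prefer:  DEC   ECX        ; two simpler instructions
 *              JNZ   label
 * The count-down do/while below has the same shape as the DEC/JNZ
 * form. What a compiler emits for it is not guaranteed. */
uint32_t sum_countdown(const uint32_t *data, size_t n)
{
    uint32_t total = 0;

    assert(data != NULL);
    if (n == 0) {
        return 0;            /* a do/while body would otherwise run once */
    }
    do {
        total += *data++;    /* loop body */
    } while (--n != 0);      /* decrement, branch while non-zero */

    return total;
}
```

Inspecting the generated assembly (for example with a compiler's -S option) is the only way to confirm which loop-control instructions were actually chosen.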

Table 29. VectorPath Integer Instructions
Instruction Mnemonic
AAA
AAD
AAM
AAS
ARPL mreg16, reg16
ARPL mem16, reg16
BOUND
BSF reg16/32, mreg16/32
BSF reg16/32, mem16/32
BSR reg16/32, mreg16/32
BSR reg16/32, mem16/32
BT mem16/32, reg16/32
BTC mreg16/32, reg16/32
BTC mem16/32, reg16/32
BTC mreg16/32, imm8
BTC mem16/32, imm8
BTR mreg16/32, reg16/32
BTR mem16/32, reg16/32
BTR mreg16/32, imm8
BTR mem16/32, imm8
BTS mreg16/32, reg16/32
BTS mem16/32, reg16/32
BTS mreg16/32, imm8
BTS mem16/32, imm8
CALL full pointer
CALL near imm16/32
CALL mem16:16/32
CALL near mreg32 (indirect)
CALL near mem32 (indirect)
CLD
CLI
CLTS
CMPSB mem8, mem8
CMPSW mem16, mem16
CMPSD mem32, mem32
CMPXCHG mreg8, reg8
CMPXCHG mem8, reg8
CMPXCHG mreg16/32, reg16/32
CMPXCHG mem16/32, reg16/32
CMPXCHG8B mem64
CPUID
DAA
DAS
DIV AL, mreg8
DIV AL, mem8
DIV EAX, mreg16/32
DIV EAX, mem16/32
ENTER
IDIV mreg8
IDIV mem8
IDIV EAX, mreg16/32
IDIV EAX, mem16/32
IMUL reg16/32, imm16/32
IMUL reg16/32, mreg16/32, imm16/32
IMUL reg16/32, mem16/32, imm16/32
IMUL reg16/32, imm8 (sign extended)
IMUL reg16/32, mreg16/32, imm8 (signed)
IMUL reg16/32, mem16/32, imm8 (signed)
IMUL AX, AL, mreg8
IMUL AX, AL, mem8
IMUL EDX:EAX, EAX, mreg16/32
IMUL EDX:EAX, EAX, mem16/32
IMUL reg16/32, mreg16/32
IMUL reg16/32, mem16/32
IN AL, imm8
IN AX, imm8
IN EAX, imm8
IN AL, DX
IN AX, DX
IN EAX, DX
INVD
INVLPG
JCXZ/JECXZ short disp8
JMP far disp32/48 (direct)
JMP far mem32 (indirect)
JMP far mreg32 (indirect)
LAHF
LAR reg16/32, mreg16/32
LAR reg16/32, mem16/32
LDS reg16/32, mem32/48
LEA reg16, mem16/32
LEAVE
LES reg16/32, mem32/48
LFS reg16/32, mem32/48
LGDT mem48
LGS reg16/32, mem32/48
LIDT mem48
LLDT mreg16
LLDT mem16
LMSW mreg16
LMSW mem16
LODSB AL, mem8
LODSW AX, mem16
LODSD EAX, mem32
LOOP disp8
LOOPE/LOOPZ disp8
LOOPNE/LOOPNZ disp8
LSL reg16/32, mreg16/32
LSL reg16/32, mem16/32
LSS reg16/32, mem32/48
LTR mreg16
LTR mem16
MOV mreg16, segment reg
MOV mem16, segment reg
MOV segment reg, mreg16
MOV segment reg, mem16
MOVSB mem8, mem8
MOVSD mem32, mem32
MOVSW mem16, mem16
MUL AL, mreg8
MUL AL, mem8
MUL AX, mreg16
MUL AX, mem16
MUL EAX, mreg32
MUL EAX, mem32
OUT imm8, AL
OUT imm8, AX
OUT imm8, EAX
OUT DX, AL
OUT DX, AX
OUT DX, EAX
POP ES
POP SS
POP DS
POP FS
POP GS
POP EAX
POP ECX
POP EDX
POP EBX
POP ESP
POP EBP
POP ESI
POP EDI
POP mreg16/32
POP mem16/32
POPA/POPAD
POPF/POPFD
PUSH ES
PUSH CS
PUSH FS
PUSH GS
PUSH SS
PUSH DS
PUSH mreg16/32
PUSH mem16/32
PUSHA/PUSHAD
PUSHF/PUSHFD
RCL mem8, imm8
RCL mem16/32, imm8
RCL mem8, CL
RCL mem16/32, CL
RCR mem8, imm8
RCR mem16/32, imm8
RCR mem8, CL
RCR mem16/32, CL
RDMSR
RDPMC
RDTSC
RET near imm16
RET near
RET far imm16
RET far
SAHF
SCASB AL, mem8
SCASW AX, mem16
SCASD EAX, mem32
SGDT mem48
SIDT mem48
SHLD mreg16/32, reg16/32, imm8
SHLD mem16/32, reg16/32, imm8
SHLD mreg16/32, reg16/32, CL
SHLD mem16/32, reg16/32, CL
SHRD mreg16/32, reg16/32, imm8
SHRD mem16/32, reg16/32, imm8
SHRD mreg16/32, reg16/32, CL
SHRD mem16/32, reg16/32, CL
SLDT mreg16
SLDT mem16
SMSW mreg16
SMSW mem16
STD
STI
STOSB mem8, AL
STOSW mem16, AX
STOSD mem32, EAX
STR mreg16
STR mem16
SYSCALL
SYSENTER
SYSEXIT
SYSRET
VERR mreg16
VERR mem16
VERW mreg16
VERW mem16
WBINVD
WRMSR
XADD mreg8, reg8
XADD mem8, reg8
XADD mreg16/32, reg16/32
XADD mem16/32, reg16/32
XCHG reg8, mreg8
XCHG reg8, mem8
XCHG reg16/32, mreg16/32
XCHG reg16/32, mem16/32
XCHG EAX, ECX
XCHG EAX, EDX
XCHG EAX, EBX
XCHG EAX, ESP
XCHG EAX, EBP
XCHG EAX, ESI
XCHG EAX, EDI
XLAT

Table 30. VectorPath MMX™ Instructions
Instruction Mnemonic
MOVD mmreg, mreg32
MOVD mreg32, mmreg

Table 31. VectorPath MMX™ Extensions
Instruction Mnemonic
MASKMOVQ mmreg1, mmreg2
PEXTRW reg32, mmreg, imm8
PINSRW mmreg, reg32, imm8
PINSRW mmreg, mem16, imm8
PMOVMSKB reg32, mmreg
SFENCE

Table 32. VectorPath Floating-Point Instructions
Instruction Mnemonic
F2XM1
FBLD [mem80]
FBSTP [mem80]
FCLEX
FCMOVB ST(0), ST(i)
FCMOVE ST(0), ST(i)
FCMOVBE ST(0), ST(i)
FCMOVU ST(0), ST(i)
FCMOVNB ST(0), ST(i)
FCMOVNE ST(0), ST(i)
FCMOVNBE ST(0), ST(i)
FCMOVNU ST(0), ST(i)
FCOMI ST, ST(i)
FCOMIP ST, ST(i)
FCOS
FIADD [mem32int]
FIADD [mem16int]
FICOM [mem32int]
FICOM [mem16int]
FICOMP [mem32int]
FICOMP [mem16int]
FIDIV [mem32int]
FIDIV [mem16int]
FIDIVR [mem32int]
FIDIVR [mem16int]
FIMUL [mem32int]
FIMUL [mem16int]
FINIT
FISUB [mem32int]
FISUB [mem16int]
FISUBR [mem32int]
FISUBR [mem16int]
FLD [mem80real]
FLDCW [mem16]
FLDENV [mem14byte]
FLDENV [mem28byte]
FPTAN
FPATAN
FRNDINT
FRSTOR [mem94byte]
FRSTOR [mem108byte]
FSAVE [mem94byte]
FSAVE [mem108byte]
FSCALE
FSIN
FSINCOS
FSTCW [mem16]
FSTENV [mem14byte]
FSTENV [mem28byte]
FSTP [mem80real]
FSTSW AX
FSTSW [mem16]
FUCOMI ST, ST(i)
FUCOMIP ST, ST(i)
FXAM
FXTRACT
FYL2X
FYL2XP1

Index

Numerics
3DNow!™ Instructions . . . 10, 107
    3DNow! and MMX™ Intra-Operand Swapping . . . 112
    Clipping . . . 122
    Fast Division . . . 108
    Fast Square Root and Reciprocal Square Root . . . 110
    FEMMS Instruction . . . 107
    PAVGUSB for MPEG-2 Motion Compensation . . . 123
    PFCMP Instruction . . . 114
    PFMUL Instruction . . . 113–114
    PI2FW Instruction . . . 113
    PREFETCH and PREFETCHW Instructions . . . 8, 46–47, 49
    PSWAPD Instruction . . . 112, 126
    Scalar Code Translated into 3DNow! Code . . . 61–64

A
Address Generation Interlocks . . . 72
AMD Athlon™ Processor
    Branch-Free Code . . . 58
    Code Padding . . . 40
    Family . . . 3
    Microarchitecture . . . 4, 129–130
AMD Athlon™ System Bus . . . 139

B
Blended Code, AMD-K6 and AMD Athlon Processors
    3DNow! and MMX Intra-Operand Swapping . . . 112
    Block Copies and Block Fills . . . 115
    Branch Examples . . . 58
    Code Padding . . . 41
    Signed Words to Floating-Point Example . . . 113
Branches
    Align Branch Targets . . . 36
    Compound Branch Conditions . . . 20
    Dependent on Random Data . . . 10, 57
    Prediction . . . 132
    Replace with Computation in 3DNow! Code . . . 60

C
C Language . . . 13
    Array-Style Over Pointer-Style Code . . . 15
    C Code to 3DNow! Code Examples . . . 61–64
    Structure Component Considerations . . . 27, 55
Cache . . . 4
    64-Byte Cache Line . . . 11, 50
    Cache and Memory Optimizations . . . 45
CALL and RETURN Instructions . . . 59
Code Padding Using Neutral Code Fillers . . . 39
Code Sample Analysis . . . 152
Complex Number Arithmetic . . . 126
Const Type Qualifier . . . 22
Constant Control Code, Multiple . . . 23

D
Data Cache . . . 134
Decoding . . . 33, 133
Dependencies . . . 128
DirectPath
    Decoder . . . 133
    DirectPath Over VectorPath Instructions . . . 9, 34, 219
    Instructions . . . 219
Displacements, 8-Bit Sign-Extended . . . 39
Division . . . 77–80, 93, 95
    Replace Divides with Multiplies, Integer . . . 31, 77
    Using 3DNow! Instructions . . . 108–109
Dynamic Memory Allocation Consideration . . . 25

E
Event and Time-Stamp Monitoring Software . . . 168
Execution Unit Resources . . . 148
Extended-Precision Data . . . 99

F
Far Control Transfer Instructions . . . 65
Fetch and Decode Pipeline Stages . . . 141
FFREEP Macro . . . 98
Floating-Point
    Compare Instructions . . . 98
    Divides and Square Roots . . . 29
    Execution Unit . . . 137
    Optimizations . . . 97
    Pipeline Operations . . . 150
    Pipeline Stages . . . 146
    Scheduler . . . 136
    Subexpression Elimination . . . 103
    To Integer Conversions . . . 100
    Variables and Expressions are Type Float . . . 13
FRNDINT Instruction . . . 100
FSINCOS Instruction . . . 105
FXCH Instruction . . . 99, 103

G
Group I — Essential Optimizations . . . 7–8
Group II — Secondary Optimizations . . . 7, 9

I
If Statement . . . 24
Immediates, 8-Bit Sign-Extended . . . 38
Inline Functions . . . 71, 72, 86
Inline REP String with Low Counts . . . 85
Instruction
    Cache . . . 131
    Control Unit . . . 134
    Decoding . . . 33
    Dispatch and Execution Resources . . . 187
    Short Forms . . . 127
    Short Lengths . . . 36
Integer . . . 77
    Arithmetic, 64-Bit . . . 86
    Division . . . 31
    Execution Unit . . . 135
    Operand, Consider Sign . . . 14
    Pipeline Operations . . . 149
    Pipeline Stages . . . 144
    Scheduler . . . 135
    Use 32-Bit Data Types for Integer Code . . . 13

L
L2 Cache Controller . . . 139
LEA Instruction . . . 38
Load/Store Pipeline Operations . . . 151
Load-Execute Instructions . . . 9, 34
    Floating-Point Instructions . . . 10, 35
    Integer Instructions . . . 34
Load-Store Unit (LSU) . . . 138
Local Functions . . . 24
Local Variables . . . 28, 31, 56
Loop Instruction . . . 65
Loops
    Deriving Loop Control For Partially Unrolled . . . 70
    Generic Loop Hoisting . . . 22
    Minimize Pointer Arithmetic . . . 73
    Partial Loop Unrolling . . . 68
    REP String with Low Variable Counts . . . 85
    Unroll Small Loops . . . 18
    Unrolling Loops . . . 67

M
Memory
    Pushing Memory Data . . . 75
    Size and Alignment Issues . . . 8, 45
    Types . . . 174
Memory Type Range Register (MTRR) . . . 171
    Capability Register Format . . . 174
    Default Type Register Format . . . 175
    Extensions . . . 177
    Fixed-Range Register Format . . . 182
    MSR Format . . . 185
    MTRRs and PAT . . . 178
    Overlapping . . . 176
    Variable-Range MTRR Register Format . . . 183
MMX™ Instructions . . . 107
    Block Copies and Block Fills . . . 115
    Integer-Only Work . . . 83
    MOVQ Instruction . . . 85
    PAND to Find Absolute Value in 3DNow! Code . . . 119
    PCMP Instead of 3DNow! PFCMP . . . 114
    PCMPEQD to Set an MMX Register . . . 119
    PMADDWD Instruction . . . 111
    PREFETCHNTA/T0/T1/T2 Instruction . . . 47
    PXOR Instruction . . . 113, 118–119
MOVZX and MOVSX Instructions . . . 73
MSR Access . . . 177
Multiplication
    Alternative Code When Multiplying by a Constant . . . 81
    Matrix . . . 119
    Multiplies over Divides, Floating Point . . . 97
Muxing Constructs . . . 60

N
Newton-Raphson Reciprocal . . . 109
Newton-Raphson Reciprocal Square Root . . . 111

O
Operands . . . 148
    Largest Possible Operand Size, Repeated String . . . 84
Optimization Star . . . 8

P
Page Attribute Table (PAT) . . . 171, 177–178
Parallelism . . . 25
PerfCtr MSR . . . 167
PerfEvtSel MSR . . . 162
Performance-Monitoring Counters . . . 161, 168–169
Pipeline and Execution Unit Resources Overview . . . 141
Pointers
    De-referenced Arguments . . . 31
    Use Array-Style Code Instead . . . 15
Population Count Function . . . 91
Predecode . . . 132
Prefetch
    Determining Distance . . . 49
    Multiple . . . 47
Prototypes . . . 21

R
Recursive Functions . . . 66
Register Operands . . . 128
Register Reads and Writes, Partial . . . 37
REP Prefix . . . 40, 84–85

S
Scheduling . . . 67
SHLD Instruction . . . 38
SHR Instruction . . . 38
Signed Words to Floating-Point Conversion . . . 113
Square Root . . . 110
Stack
    Alignment Considerations . . . 54
    Allocation . . . 128
Store-to-Load Forwarding . . . 18, 51, 53–54
Stream of Packed Unsigned Bytes . . . 125
String Instructions . . . 84
Structure (Struct) . . . 27–28, 56
Subexpressions, Explicitly Extract Common . . . 26
Superscalar Processor . . . 130
Switch Statement . . . 21, 24

T
TBYTE Variables . . . 55
Trigonometric Instructions . . . 103

V
VectorPath Decoder . . . 133
VectorPath Instructions . . . 231

W
Write Combining . . . 10, 50, 139, 155–157, 159

X
x86 Optimization Guidelines . . . 127
XOR Instruction . . . 86