This manual describes in detail how to write optimized assembly language code, with particular focus on the Pentium® family of microprocessors.
Most of the information herein is based on my own research. Many people have sent me useful information and corrections for this manual, and I keep updating it whenever I have new important information. This manual is therefore more accurate, detailed, comprehensive and exact than any other source of information, and it contains many details not found anywhere else. This information will enable you in many cases to calculate exactly how many clock cycles a piece of code will take. I do not claim, though, that all information in this manual is exact: some timings and other details can be difficult or impossible to measure exactly, and I do not have access to the inside information on technical implementations that the writers of Intel manuals have.
The following versions of Pentium processors are discussed in this manual:
|Abbreviation|Processor|
|PPlain|plain old Pentium (without MMX)|
|PMMX|Pentium with MMX|
|PPro|Pentium Pro|
|PII|Pentium II (including Celeron and Xeon)|
|PIII|Pentium III (including variants)|
The assembly language syntax used in this manual is MASM 5.10 syntax. There is no official standard for x86 assembly language, but this is the closest you can get to a de facto standard, since most assemblers have a MASM 5.10 compatible mode. (I do not recommend using MASM version 5.10 itself, though, because it has a serious bug in 32-bit mode. Use TASM or a later version of MASM.)
Some of the remarks in this manual may seem like a criticism of Intel. This should not be taken to mean that other brands are better. The Pentium family of microprocessors may be faster than any compatible competing brand, and it may also be better documented and have better testability features. For these reasons, no competing brand has been subjected to the same level of independent research by me or by anybody else.
Programming in assembly language is much more difficult than programming in a high-level language. Making bugs is very easy, and finding them is very difficult. Now you have been warned! It is assumed that the reader is already experienced in assembly programming. If not, then please read some books on the subject and get some programming experience before you begin to do complicated optimizations.
The hardware design of the PPlain and PMMX chips has many features which are optimized specifically for some commonly used instructions or instruction combinations, rather than using general optimization methods. Consequently, the rules for optimizing software for this design are complicated and have many exceptions, but the possible gain in performance may be substantial. The PPro, PII and PIII processors have a very different design where the processor takes care of much of the optimization work by executing instructions out of order. The more complicated design of these processors generates many potential bottlenecks, however, so there may still be a lot to gain by optimizing manually for them.
Before you start to convert your code to assembly, make sure that your algorithm is optimal. Often you can improve a piece of code much more by improving the algorithm than by converting it to assembly code.
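As a small illustration of this point, the following C sketch compares two ways of finding a value in a sorted array. The function names `find_linear` and `find_binary` are hypothetical and chosen only for this example. No amount of hand tuning of the linear scan changes the fact that it needs on the order of n comparisons, whereas the binary search needs on the order of log2(n).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: linear scan, about n comparisons in the worst case. */
long find_linear(const int *a, size_t n, int key) {
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (a[i] == key)
            return (long)i;          /* index of the first match */
    }
    return -1;                       /* not found */
}

/* Hypothetical example: binary search, about log2(n) comparisons.
   Assumes the array is sorted in ascending order. */
long find_binary(const int *a, size_t n, int key) {
    assert(a != NULL || n == 0);
    size_t lo = 0, hi = n;           /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;            /* key, if present, lies to the right */
        else
            hi = mid;                /* key, if present, is at mid or to the left */
    }
    return (lo < n && a[lo] == key) ? (long)lo : -1;
}
```

For an array of a million elements, the second version does roughly twenty comparisons per lookup instead of up to a million, a gain that is unlikely to be matched by rewriting the first version in assembly.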
Next, you have to identify the critical parts of your program. Often more than 99% of the CPU time is spent in the innermost loop of a program. In this case you should optimize only this loop and leave everything else in a high-level language. Some assembly programmers waste a lot of energy optimizing the wrong parts of their programs, the only significant effect of their effort being that the programs become more difficult to debug and maintain!
If it is not obvious where the critical parts of your program are, then you may use a profiler to find them. If it turns out that the bottleneck is disk access, then you may modify your program to make disk access sequential in order to improve disk caching, rather than turning to assembly programming. If the bottleneck is graphics output, then you may look for a way of reducing the number of calls to graphics procedures.
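If no profiler is at hand, a crude alternative is to time suspected hot spots yourself. The following C sketch is one possible way of doing so with the standard `clock` function. The names `time_call` and `busy_work` are hypothetical, and `busy_work` merely stands in for whatever routine you suspect. The resolution of `clock` is coarse, so this is only suited to regions that run for a noticeable time.

```c
#include <assert.h>
#include <time.h>

/* Receives the result so that the computation is observably used. */
static volatile unsigned long sink;

/* Hypothetical workload standing in for a suspected hot spot. */
void busy_work(void) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sum += i * i;                /* unsigned arithmetic wraps, which is well defined */
    sink = sum;
}

/* Returns the CPU time in seconds consumed by one call to fn,
   or -1.0 if the processor time is not available. */
double time_call(void (*fn)(void)) {
    assert(fn != NULL);
    clock_t start = clock();
    if (start == (clock_t)-1)
        return -1.0;
    fn();
    clock_t end = clock();
    if (end == (clock_t)-1)
        return -1.0;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_call(busy_work)` for each candidate routine and comparing the results gives a rough idea of where the time goes. A real profiler will usually give a more reliable picture.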
Some high-level language compilers offer relatively good optimization for specific processors, but further optimization by hand can usually make the code much faster.
Please don't send your programming questions to me. I am not gonna do your homework for you!
Good luck with your hunt for nanoseconds!