» 首页 > 程序资料 > MMX 汇编优化 > MMX 优化: How to optimize for the Pentium family of microprocessors

7. Cache

日期: 2000-04-01 14:00 | 联系我 | 关注我: Telegram, Twitter

7. Cache

The PPlain and PPro have 8 kb of on-chip cache (level one cache) for code, and 8 kb for data. The PMMX, PII and PIII have 16 kb for code and 16 kb for data. Data in the level 1 cache can be read or written to in just one clock cycle, whereas a cache miss may cost many clock cycles. It is therefore important that you understand how the cache works in order to use it most efficiently.

The data cache consists of 256 or 512 lines of 32 bytes each. Each time you read a data item which is not cached, the processor will read an entire cache line from memory. The cache lines are always aligned to a physical address divisible by 32. When you have read a byte at an address divisible by 32, then the next 31 bytes can be read or written to at almost no extra cost. You can take advantage of this by arranging data items which are used near each other together into aligned blocks of 32 bytes of memory. If, for example, you have a loop which accesses two arrays, then you may interleave the two arrays into one array of structures, so that data which are used together are also stored together.

If the size of an array or other data structure is a multiple of 32 bytes, then you should preferably align it by 32.

The cache is set-associative. This means that a cache line can not be assigned to an arbitrary memory address. Each cache line has a 7-bit set-value which must match bits 5 through 11 of the physical RAM address (bit 0-4 define the 32 bytes within a cache line). The PPlain and PPro have two cache lines for each of the 128 set-values, so there are two possible cache lines to assign to any RAM address. The PMMX, PII and PIII have four.

The consequence of this is that the cache can hold no more than two or four different data blocks which have the same value in bits 5-11 of the address. You can determine if two addresses have the same set-value by the following method: Strip off the lower 5 bits of each address to get a value divisible by 32. If the difference between the two truncated addresses is a multiple of 4096 (=1000H), then the addresses have the same set-value.

Let me illustrate this by the following piece of code, where ESI holds an address divisible by 32:

AGAIN: MOV EAX, [ESI] MOV EBX, [ESI + 13*4096 + 4] MOV ECX, [ESI + 20*4096 + 28] DEC EDX JNZ AGAIN

The three addresses used here all have the same set-value because the differences between the truncated addresses are multipla of 4096. This loop will perform very poorly on the PPlain and PPro. At the time you read ECX there is no free cache line with the proper set-value so the processor takes the least recently used of the two possible cache lines, that is the one which was used for EAX, and fills it with the data from [ESI+20*4096] to [ESI+20*4096+31] and reads ECX. Next, when reading EAX, you find that the cache line that held the value for EAX has now been discarded, so you take the least recently used line, which is the one holding the EBX value, and so on.. You have nothing but cache misses and the loop takes something like 60 clock cycles. If the third line is changed to:

MOV ECX, [ESI + 20*4096 + 32]

then we have crossed a 32 byte boundary, so that we do not have the same set-value as in the first two lines, and there will be no problem assigning a cache line to each of the three addresses. The loop now takes only 3 clock cycles (except for the first time) - a very considerable improvement! As already mentioned, the PMMX, PII and PIII have 4-way caches so that you have four cache lines with the same set-value. (Some Intel documents erroneously say that the PII cache is 2-way).

It may be very difficult to determine if your data addresses have the same set-values, especially if they are scattered around in different segments. The best thing you can do to avoid problems of this kind is to keep all data used in the critical part or your program within one contiguous block not bigger than the cache, or two contiguous blocks no bigger than half that size (for example one block for static data and another block for data on the stack). This will make sure that your cache lines are used optimally.

If the critical part of your code accesses big data structures or random data addresses, then you may want to keep all frequently used variables (counters, pointers, control variables, etc.) within a single contiguous block of max 4 kbytes so that you have a complete set of cache lines free for accessing random data. Since you probably need stack space anyway for subroutine parameters and return addresses, the best thing is to copy all frequently used static data to dynamic variables on the stack, and copy them back again outside the critical loop if they have been changed.

Reading a data item which is not in the level one cache causes an entire cache line to be filled from the level two cache, which takes approximately 200 ns (that is 20 clocks on a 100 MHz system or 40 clocks on a 200 MHz system), but the bytes you ask for first are available already after 50-100 ns. If the data item is not in the level two cache either, then you will get a delay of something like 200-300 ns. This delay will be somewhat longer if you cross a DRAM page boundary. (The size of a DRAM page is 1 kb for 4 and 8 MB 72 pin RAM modules, and 2 kb for 16 and 32 MB modules).

When reading big blocks of data from memory, the speed is limited by the time it takes to fill cache lines. You can sometimes improve speed by reading data in a non-sequential order: before you finish reading data from one cache line start reading the first item from the next cache line. This method can increase reading speed by 20 - 40% when reading from main memory or level 2 cache on PPlain and PMMX, and from level 2 cache on PPro, PII and PIII. A disadvantage of this method is of course that the program code becomes extremely clumsy and complicated. For further information on this trick see http://www.intelligentfirm.com/.

When you write to an address which is not in the level 1 cache, then the value will go right through to the level 2 cache or to the RAM (depending on how the level 2 cache is set up) on the PPlain and PMMX. This takes approximately 100 ns. If you write eight or more times to the same 32 byte block of memory without also reading from it, and the block is not in the level one cache, then it may be advantageous to make a dummy read from the block first to load it into a cache line. All subsequent writes to the same block will then go to the cache instead, which takes only one clock cycle. On PPlain and PMMX, there is sometimes a small penalty for writing repeatedly to the same address without reading in between.

On PPro, PII and PIII, a write miss will normally load a cache line, but it is possible to setup an area of memory to perform differently, for example video RAM (See Pentium Pro Family Developer's Manual, vol. 3: Operating System Writer's Guide").

Other ways of speeding up memory reads and writes are discussed in chapter 27.8 below.

The PPlain and PPro have two write buffers, PMMX, PII and PIII have four. On the PMMX, PII and PIII you may have up to four unfinished writes to uncached memory without delaying the subsequent instructions. Each write buffer can handle operands up to 64 bits wide.

Temporary data may conveniently be stored on the stack because the stack area is very likely to be in the cache. However, you should be aware of the alignment problems if your data elements are bigger than the stack word size.

If the life ranges of two data structures do not overlap, then they may share the same RAM area to increase cache efficiency. This is consistent with the common practice of allocating space for temporary variables on the stack.

Storing temporary data in registers is of course even more efficient. Since registers is a scarce ressource you may want to use [ESP] rather than [EBP] for addressing data on the stack, in order to free EBP for other purposes. Just don't forget that the value of ESP changes every time you do a PUSH or POP. (You cannot use ESP under 16-bit Windows because the timer interrupt will modify the high word of ESP at unpredictable places in your code.)

There is a separate cache for code, which is similar to the data cache. The size of the code cache is 8 kb on PPlain and PPro and 16 kb on the PMMX, PII and PIII. It is important that the critical part of your code (the innermost loops) fit in the code cache. Frequently used pieces of code or routines which are used together should preferable be stored near each other. Seldom used branches or procedures should be put away in the bottom of your code or somewhere else.

前一篇：12. Prefixes (PPlain and PMMX)
下一篇：11. Splitting complex instructions into simpler ones (PPlain and PMMX)

标签: MMX 优化 | Cache

发表你的评论如果你想针对此文发表评论, 请填写下列表单:
姓名:	* 必填 (Twitter 用户可输入以 @ 开头的用户名, Steemit 用户可输入 @@ 开头的用户名)
E-mail:	可选 (不会被公开。如果我回复了你的评论，你将会收到邮件通知)
反垃圾广告:	为了防止广告机器人自动发贴, 请计算下列表达式的值: 3 x 1 + 1 = * 必填
评论内容:	* 必填你可以使用下列标签修饰文字: [b] 文字 [/b]: 加粗文字 [quote] 文字 [/quote]: 引用文字