How to optimize for the Pentium family of microprocessors

Date: 2000-04-03 14:00

29. List of instruction timings and micro-op breakdown for PPro, PII and PIII

Explanations:

Operands:

r = register, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc.

Micro-ops:

The number of micro-ops that the instruction generates for each execution port.

p0: port 0: ALU, etc.

p1: port 1: ALU, jumps

p01: instructions that can go to either port 0 or 1, whichever is vacant first.

p2: port 2: load data, etc.

p3: port 3: address generation for store

p4: port 4: store data
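
As a rough illustration of how the per-port counts can be used, the sketch below adds up the micro-ops of a short instruction sequence and derives a lower bound on its execution time. It assumes that each port accepts at most one micro-op per clock cycle, and it ignores decoding, retirement and dependency limits. The function name `port_bound`, the data layout and the example loop body are hypothetical; the counts are copied from the table in 29.1.

```python
from math import ceil

PORTS = ("p0", "p1", "p01", "p2", "p3", "p4")

# Micro-op counts per port, copied from the table in 29.1.
UOPS = {
    "MOV r,m":   {"p2": 1},
    "ADD r,r":   {"p01": 1},
    "MOV m,r":   {"p3": 1, "p4": 1},
    "IMUL r,r":  {"p0": 1},
    "JNZ short": {"p1": 1},   # conditional jump, short/near
}

def port_bound(instructions):
    """Return (per-port totals, lower bound in clocks) for a sequence."""
    total = {p: 0 for p in PORTS}
    for name in instructions:
        for port, n in UOPS[name].items():
            total[port] += n
    # p01 micro-ops may go to port 0 or port 1, so the two ports together
    # need at least ceil((p0 + p1 + p01) / 2) clocks (assumed model).
    alu = max(total["p0"], total["p1"],
              ceil((total["p0"] + total["p1"] + total["p01"]) / 2))
    bound = max(alu, total["p2"], total["p3"], total["p4"])
    return total, bound

# Hypothetical loop body: load, multiply, two additions, store, branch.
loop_body = ["MOV r,m", "IMUL r,r", "ADD r,r", "ADD r,r", "MOV m,r", "JNZ short"]
totals, clocks = port_bound(loop_body)
print(totals)
print(clocks)   # 2: four micro-ops compete for ports 0 and 1
```

The bound only says that the sequence cannot run faster than this; the real time may be higher for the reasons listed under Delay and Throughput.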

Delay:

This is the delay that the instruction generates in a dependency chain. It is not the same as the time spent in the execution unit, and values may be inaccurate where they cannot be measured exactly, especially with memory operands. The numbers are minimum values: cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay.
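
To make the notion of a dependency chain concrete, the sketch below sums the delays of instructions that each need the result of the previous one. It is a simplified model, not a measurement: the helper name `chain_delay` is hypothetical, and only instructions with an explicit delay in the table in 29.1 are included.

```python
# Delay values in clock cycles, copied from the table in 29.1.
DELAY = {
    "IMUL r,r": 4,
    "DIV r32": 39,
    "AAD": 4,
    "LEA r,m": 1,
}

def chain_delay(chain):
    """Minimum clocks for a chain in which every instruction depends on
    the result of the one before it."""
    return sum(DELAY[name] for name in chain)

print(chain_delay(["IMUL r,r", "IMUL r,r", "LEA r,m"]))  # 4 + 4 + 1 = 9
print(chain_delay(["DIV r32", "IMUL r,r"]))              # 39 + 4 = 43
```

Because the table values are minimums, the result is a lower bound on the length of the chain.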

Throughput:

The maximum throughput for several instructions of the same kind. For example, a throughput of 1/2 for FMUL means that a new FMUL instruction can start executing every 2 clock cycles.
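
The relation between throughput and clock cycles can be sketched as follows. The model is deliberately simple and is an assumption rather than part of the measurements: independent instructions of one kind start at the throughput rate, and the last one needs its full delay to deliver a result. The function names are hypothetical; the DIV and IMUL figures come from the table in 29.1.

```python
from fractions import Fraction

def clocks_between_starts(throughput):
    """Clock cycles between the starts of two instructions of the same
    kind, given a throughput written as in the table, e.g. "1/2"."""
    return 1 / Fraction(throughput)

def clocks_for_independent(n, throughput, delay):
    """Simplified estimate for n independent instructions of one kind:
    (n - 1) start intervals plus the delay of the last instruction."""
    return (n - 1) * clocks_between_starts(throughput) + delay

print(clocks_between_starts("1/2"))             # FMUL example: 2
print(clocks_for_independent(4, "1/12", 19))    # four DIV r8: 3*12 + 19 = 55
print(clocks_for_independent(10, "1/1", 4))     # ten IMUL: 9*1 + 4 = 13
```

If the same instructions formed a dependency chain instead, the delay rather than the throughput would set the pace.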

29.1 Integer instructions
Instruction	Operands	p0	p1	p01	p2	p3	p4	delay	throughput
NOP				1
MOV	r,r/i			1
MOV	r,m				1
MOV	m,r/i					1	1
MOV	r,sr			1
MOV	m,sr			1		1	1
MOV	sr,r	8						5
MOV	sr,m	7			1			8
MOVSX MOVZX	r,r			1
MOVSX MOVZX	r,m				1
CMOVcc	r,r	1		1
CMOVcc	r,m	1		1	1
XCHG	r,r			3
XCHG	r,m			4	1	1	1	high b)
XLAT				1	1
PUSH	r/i			1		1	1
POP	r			1	1
POP	(E)SP			2	1
PUSH	m			1	1	1	1
POP	m			5	1	1	1
PUSH	sr			2		1	1
POP	sr			8	1
PUSHF(D)		3		11		1	1
POPF(D)		10		6	1
PUSHA(D)				2		8	8
POPA(D)				2	8
LAHF SAHF				1
LEA	r,m	1						1 c)
LDS LES LFS LGS LSS	m			8	3
ADD SUB AND OR XOR	r,r/i			1
ADD SUB AND OR XOR	r,m			1	1
ADD SUB AND OR XOR	m,r/i			1	1	1	1
ADC SBB	r,r/i			2
ADC SBB	r,m			2	1
ADC SBB	m,r/i			3	1	1	1
CMP TEST	r,r/i			1
CMP TEST	m,r/i			1	1
INC DEC NEG NOT	r			1
INC DEC NEG NOT	m			1	1	1	1
AAS DAA DAS			1
AAD		1		2				4
AAM		1	1	2				15
MUL IMUL	r,(r),(i)	1						4	1/1
MUL IMUL	(r),m	1			1			4	1/1
DIV IDIV	r8	2		1				19	1/12
DIV IDIV	r16	3		1				23	1/21
DIV IDIV	r32	3		1				39	1/37
DIV IDIV	m8	2		1	1			19	1/12
DIV IDIV	m16	2		1	1			23	1/21
DIV IDIV	m32	2		1	1			39	1/37
CBW CWDE				1
CWD CDQ		1
SHR SHL SAR ROR ROL	r,i/CL	1
SHR SHL SAR ROR ROL	m,i/CL	1			1	1	1
RCR RCL	r,1	1		1
RCR RCL	r8,i/CL	4		4
RCR RCL	r16/32,i/CL	3		3
RCR RCL	m,1	1		2	1	1	1
RCR RCL	m8,i/CL	4		3	1	1	1
RCR RCL	m16/32,i/CL	4		2	1	1	1
SHLD SHRD	r,r,i/CL	2
SHLD SHRD	m,r,i/CL	2		1	1	1	1
BT	r,r/i			1
BT	m,r/i	1		6	1
BTR BTS BTC	r,r/i			1
BTR BTS BTC	m,r/i	1		6	1	1	1
BSF BSR	r,r		1	1
BSF BSR	r,m		1	1	1
SETcc	r			1
SETcc	m			1		1	1
JMP	short/near		1						1/2
JMP	far	21			1
JMP	r		1						1/2
JMP	m(near)		1		1				1/2
JMP	m(far)	21			2
conditional jump	short/near		1						1/2
CALL	near		1	1		1	1		1/2
CALL	far	28			1	2	2
CALL	r		1	2		1	1		1/2
CALL	m(near)		1	4	1	1	1		1/2
CALL	m(far)	28			2	2	2
RETN			1	2	1				1/2
RETN	i		1	3	1				1/2
RETF		23			3
RETF	i	23			3
J(E)CXZ	short		1	1
LOOP	short	2	1	8
LOOP(N)E	short	2	1	8
ENTER	i,0			12		1	1
ENTER	a,b	ca. 18+4b				b-1	2b
LEAVE				2	1
BOUND	r,m	7		6	2
CLC STC CMC				1
CLD STD				4
CLI		9
STI		17
INTO				5
LODS					2
REP LODS				10+6n
STOS					1	1	1
REP STOS				ca. 5n a)
MOVS				1	3	1	1
REP MOVS				ca. 6n a)
SCAS				1	2
REP(N)E SCAS				12+7n
CMPS				4	2
REP(N)E CMPS				12+9n
BSWAP		1		1
CPUID		23-48
RDTSC		31
IN		18						>300
OUT		18						>300
PREFETCHNTA d)	m				1
PREFETCHT0 d)	m				1
PREFETCHT1 d)	m				1
PREFETCHT2 d)	m				1
SFENCE d)						1	1		1/6

Notes:

a) faster under certain conditions: see chapter 26.3.

b) see chapter 26.1

c) 3 if constant without base or index register

d) PIII only.
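
For readers who want to process the table mechanically, the sketch below splits one row into named columns. It assumes that the rows are tab-separated, that an empty cell means no micro-ops on that port (or no listed value), and that trailing empty cells are omitted. The column names and the helper `parse_row` are hypothetical.

```python
COLUMNS = ("instruction", "operands", "p0", "p1", "p01", "p2", "p3", "p4",
           "delay", "throughput")

def parse_row(line):
    """Split one tab-separated table row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Rows omit trailing empty cells, so pad up to the full column count.
    fields += [""] * (len(COLUMNS) - len(fields))
    return dict(zip(COLUMNS, (f.strip() for f in fields)))

div = parse_row("DIV IDIV\tr32\t3\t\t1\t\t\t\t39\t1/37")
store = parse_row("MOV\tm,r/i\t\t\t\t\t1\t1")

print(div["p0"], div["p01"], div["delay"], div["throughput"])
print(store["p3"], store["p4"])
```

Footnote markers such as a) to d) remain attached to the cell in which they appear and would have to be stripped separately.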
