Assembly instruction timings
27 Nov 2011

Instruction tables
The following tables show the number of cycles certain operations take to execute. Timings are given as latency / throughput, where latency means the actual time the instruction takes from start to finish. Due to pipelining, another instruction can be started before the first one is finished; this of course requires that the second operation does not depend on the result of the first. The throughput gives the time between two results if the instructions are perfectly pipelined, i.e. the best performance possible. The tables contain only a few (but important) instructions and only values for register-to-register operations (using the full register width, e.g. 64-bit on x86-64 and 80-bit for x87 operations). For a much more comprehensive overview see the references below. A small sketch illustrating the latency/throughput difference follows the tables.
cpu | year | instruction set | clock | peak flops | add/mul flops | lin algebra naive/eigen/atlas/goto | mov | add | imul | idiv | fxch | fadd | fmul | fdiv | fsqrt | fsin | f2xm1 | fyl2x | addsd | mulsd | addpd | mulpd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Intel 8086+87 | 1978 | x86-16, x87 | 5MHz | 45kflops | | | 2 | 3 | 128-154 | 165-184 | 10-15 | 70-100 | 130-145 | 193-203 | 180-186 | | 310-630 | 900-1100 | | | | |
Intel 80286+87 | 1982 | x86-16, x87 | 16MHz | 145kflops | | | 2 | 2 | 21 | 25 | 10-15 | 70-100 | 130-145 | 193-203 | 180-186 | | 310-630 | 900-1100 | | | | |
Intel 80386+87 | 1985 | x86-32, x87 | 25MHz | 725kflops | 977kflops | 278/-/-/- kflops | 2 | 2 | 9-38 | 43 | 18 | 23-34 | 46-57 | 88-91 | 122-129 | 122-771 | 211-476 | 120-538 | | | | |
Intel 80486 | 1989 | x86-32, x87 | 66MHz | 5.5Mflops | 4.9Mflops | 2.8/3.6/-/- Mflops | 1 | 1 | 13-42 | 43 | 4 | 8-20 | 16 | 73 | 83-87 | 257-354 | 140-279 | 196-329 | | | | |
Pentium | 1993 | x86-32, x87 | 100MHz | 100Mflops | 81Mflops | 10/22/-/- Mflops | 1 | 1 | 10 | 46 | 0 | 3/1 | 3/2 | 39 | 70 | 16-126 | 13-57 | 22-111 | | | | |
Pentium MMX | 1996 | x86-32, x87 | 200MHz | 200Mflops | 163Mflops | 13/31/80/- Mflops | 1 | 1 | 9 | 46 | 0 | 3/1 | 3/2 | 39/37 | 70/68 | 65-100 | 53-59 | 103 | | | | |
Pentium II | 1997 | x86-32, x87 | 300MHz | 300Mflops | 295Mflops | 30/-/-/172 Mflops | 1 | 1 | 4/1 | 39/37 | 0 | 3/1 | 5/2 | 38/37 | 69 | 27-103 | 66 | 103 | | | | |
Pentium III | 1999 | x86-32, x87, sse | 500MHz | 500Mflops | | | 1 | 1 | 4/1 | 39/37 | 0 | 3/1 | 5/2 | 38/37 | 69 | 27-103 | 66 | 103 | | | | |
Pentium 4 Northwood | 2002 | x86-32, x87, sse2 | 2.4GHz | 4.8Gflops | 4.8Gflops | 0.2/1.0/3.1/3.7 Gflops | 0.5-1.5 / 0.25 | 0.5-1.5 / 0.25 | 16/8 | 50/23 | 0 | 5/1 | 7/2 | 43 | 43 | 180/170 | 165/63 | 200/90 | 4/2 | 6/2 | 4/2 | 6/2 |
Atom Diamondville | 2008 | x86-32, x87, ssse3 | 1.67GHz | 1.7Gflops | 1.6Gflops | 0.2/0.4/0.5/1.0 Gflops | 1 | 1 | 5/2 | 61 | 1 | 5/1 | 5/2 | 71 | 71 | 260 | 100 | 220 | 5/1 | 5/2 | 6/6 | 9/9 |
cpu | year | instruction set | cores | clock | peak flops | add/mul flops | lin algebra naive/eigen/atlas/goto | mov | add | mul | idiv | fcpyd | faddd | fmuld | fdivd | fsqrtd | [exp] | [log] | [pow] | [sin] | [erf] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ARM11 (raspi) | 2012 | armv6, vfp | 1 | 700MHz | 362Mflops | | 28/70/-/- Mflops | | | | | - | 8/2 | 9/2 | 34 | 48 | 305 | 540 | 830 | 260 | 185 |
ARM Cortex-A5 (htc desire c) | 2012 | armv7-a, vfpv4, neon | 1 | 600MHz | 573Mflops | | 55/152/-/- Mflops | | | | | - | 4/1 | 7/4 | 32 | 35 | 265 | 385 | 645 | 215 | 170 |
cpu | year | instruction set | cores | clock | peak flops | add/mul flops | lin algebra naive/eigen/atlas/goto | mov | add | imul | idiv | movapd | addpd | mulpd | divpd | sqrtpd | [exp] | [log] | [pow] | [sin] | [erf] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Core 2 Wolfdale | 2007 | x86-64, x87, sse4.1 | 2 | 2.33GHz | 19Gflops | 19Gflops | 1.6/11/15/16 Gflops | 1/0.33 | 1/0.33 | 5/2 | 34-88 | 1/0.33 | 3/1 | 5/1 | 6-21 | 6-20 | 118 | 125 | 422 | 88 | 81 |
Core i5 / i7 Nehalem (Bloomfield) | 2008 | x86-64, x87, sse4 | 4 | 2.67GHz | 43Gflops | 42Gflops | 1.7/21/33/40 Gflops | 1/0.33 | 1/0.33 | 3/2 | 37-100/26-86 | 1 | 3/1 | 5/1 | 7-22 | 7-32 | 100 | 114 | 404 | 90 | 68 |
Core i5 / i7 Sandy Bridge | 2011 | x86-64, x87, sse4, avx | 4 | 3.0GHz | 96Gflops | 95Gflops | 3.0/27/70/45 Gflops | 1/0.33 | 1/0.33 | 3/1 | 40-103/25-84 | 1 | 3/1 | 5/1 | 10-22 | 10-21 | 94 | 108 | 418 | 88 | 62 |
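
A minimal sketch (my own, not code from the article) of the latency/throughput difference described above: both loops perform the same number of additions, but the first is a single dependent chain limited by the add latency, while the second keeps four independent accumulators and is limited by the add throughput.

```c
/* Minimal sketch: the same number of double additions, once as a single
 * dependent chain (each add has to wait for the previous result, so the
 * add latency dominates) and once split over four independent
 * accumulators (the adds overlap in the pipeline, so the throughput
 * dominates). Compile with e.g. `gcc -O2 chain.c`; the effect depends on
 * the compiler neither vectorising nor reassociating the loops. */
#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void) {
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a += 1.0;                         /* dependent: ~latency cycles per add */
    clock_t t1 = clock();

    for (long i = 0; i < N; i += 4) {     /* independent: ~1/throughput per add */
        a += 1.0; b += 1.0; c += 1.0; d += 1.0;
    }
    clock_t t2 = clock();

    printf("dependent chain:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("independent chains: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return (int)(a + b + c + d) & 0xff;   /* keep the results alive */
}
```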
- Unknown source (possibly John Allen) x86 instruction timings and x87 instruction timings, 2002
- Agner Fog's optimisation manuals, in particular instruction_tables.pdf, 2008
- Wikipedia page on x87 performance
- google answer on the performance of the 8087
- Stanford University cpu db
- own benchmark: used to get all timing values for the ARM CPUs and the timings for the soft implementations of exp, log, etc.; these values might not be as reliable
Calculation of peak flops
To calculate peak flops (double precision) we assume an algorithm which uses the same number of additions as multiplications and no other operations. Furthermore we assume that the operations are independent of each other and can be fully pipelined, and that all variables are kept in registers so no memory access is necessary. The formula then is:

(double prec) peak flops = 2 / combined_pipelined_double_prec(add, mul) * clock * vector_multiplier * cores
Here combined_pipelined_double_prec(add, mul) seems to be max(add, mul) for the Pentium and onwards, and add + mul for anything up to the 80486.
The vector multiplier is:

instruction set | vector multiplier |
---|---|
x87 | 1 |
sse2 | 2 |
avx | 4 |
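
As a worked example (my own sketch, not code from the article), plugging a few rows of the tables into this formula reproduces the peak flops column; the struct layout and the chosen CPUs are only illustrative.

```c
/* Sketch: reproduce the "peak flops" column for a few rows of the tables.
 * `add` and `mul` are the reciprocal throughputs (the second number of a
 * latency/throughput pair) for the pipelined CPUs, and plain latencies
 * for the older, non-pipelined ones. */
#include <stdio.h>

struct cpu {
    const char *name;
    double add, mul;   /* cycles per double add / mul */
    double clock;      /* Hz */
    double vector;     /* vector multiplier: 1 = x87/scalar, 2 = sse2, 4 = avx */
    int cores;
    int pipelined;     /* 0 for anything up to the 80486 */
};

static double peak_flops(const struct cpu *c) {
    /* combined_pipelined_double_prec(add, mul) */
    double combined = c->pipelined
        ? (c->add > c->mul ? c->add : c->mul)   /* max(add, mul) */
        : c->add + c->mul;                      /* add + mul     */
    return 2.0 / combined * c->clock * c->vector * c->cores;
}

int main(void) {
    const struct cpu cpus[] = {
        { "80486",        8, 16, 66e6,   1, 1, 0 },  /* fadd ~8, fmul 16     */
        { "Pentium",      1,  2, 100e6,  1, 1, 1 },  /* fadd 3/1, fmul 3/2   */
        { "Nehalem",      1,  1, 2.67e9, 2, 4, 1 },  /* addpd 3/1, mulpd 5/1 */
        { "Sandy Bridge", 1,  1, 3.0e9,  4, 4, 1 },  /* avx, 4 cores         */
    };
    /* prints roughly 5.5, 100, 42720 and 96000 Mflops */
    for (unsigned i = 0; i < sizeof cpus / sizeof cpus[0]; i++)
        printf("%-12s %10.1f Mflops\n", cpus[i].name, peak_flops(&cpus[i]) / 1e6);
    return 0;
}
```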
Realistic values for flops
This very much depends on the algorithm, how much memory is used and how well memory can be cached. However, even if we assume all values are stored in registers, the result can still vary enormously. For example, if we need to execute additions and multiplications but the results depend on each other, then pipelining can't be used, sse2/avx packed operations can't be used, and everything has to run on one core. So we get

flops = 1/average(addsd, mulsd) * clock * 1 * 1 = 1/4 * clock = 667 Mflops

on a 2.67GHz Nehalem cpu for example, which is way off the 43Gflops peak. Even worse, if we need to use lots of divisions, then assuming a worst case latency of 22 cycles we'd get a mere 121 Mflops.
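
The same back-of-the-envelope calculation as a small sketch, with the Nehalem latencies taken from the table above (the variable names are only illustrative):

```c
/* Sketch: dependent (non-pipelined) chains on a 2.67GHz Nehalem, using
 * the latencies from the table above. Only one scalar operation is in
 * flight at a time, so the average latency per operation is the limit. */
#include <stdio.h>

int main(void) {
    double clock = 2.67e9;
    double addsd_latency = 3.0, mulsd_latency = 5.0;
    double div_latency = 22.0;                       /* worst case divpd */

    double add_mul  = 1.0 / ((addsd_latency + mulsd_latency) / 2.0) * clock;
    double div_only = 1.0 / div_latency * clock;

    printf("dependent add/mul chain:  %.0f Mflops\n", add_mul / 1e6);   /* ~667 */
    printf("dependent division chain: %.0f Mflops\n", div_only / 1e6);  /* ~121 */
    return 0;
}
```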
Matrix-matrix operations can easily be sse2/avx vectorised, pipelined and parallelised onto all cores, and often use alternating additions and multiplications, so if efficient memory management that maximises caching is implemented, close to peak performance can be achieved. That's why netlib benchmarks normally get fairly close to peak performance. For highly efficient implementations of linear algebra and matrix operations see below.
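
For contrast, here is a minimal sketch of the kind of naive triple-loop matrix multiplication that the "naive" numbers in the lin algebra column presumably correspond to (an assumption; the article's own benchmark code is not shown). Its inner loop is exactly the bad case described in the previous section.

```c
/* A naive triple-loop matrix multiplication: one dependent accumulation
 * chain in the inner loop and a stride-n access to B, so it is latency-
 * and memory-bound rather than throughput-bound. Optimised libraries
 * (Eigen, ATLAS, GotoBLAS) block for the caches, vectorise and keep
 * several independent accumulators to get close to peak. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```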
References: