Assembly instruction timings

27 Nov 2011

Instruction tables

The following tables show the number of cycles certain operations take to execute. Values are given in the format latency/throughput, where latency is the time the instruction takes from start to finish. Due to pipelining, another instruction can be started before the first one has finished; this of course requires that the second operation does not depend on the result of the first. The throughput is the time between two results when instructions are perfectly pipelined, i.e. the best performance possible. The tables contain only a few (but important) instructions and only values for register-to-register operations (using the full register width, e.g. 64-bit on x86-64 and 80-bit for x87 operations). For a much more comprehensive overview see the references below.
cpu | year | instruction set | clock | peak flops | add/mul flops | lin algebra (naive/eigen/atlas/goto) | mov | add | imul | idiv | fxch | fadd | fmul | fdiv | fsqrt | fsin | f2xm1 | fyl2x | addsd | mulsd | addpd | mulpd
Intel 8086+87 | 1978 | x86-16, x87 | 5MHz | 45kflops | - | - | 2 | 3 | 128-154 | 165-184 | 10-15 | 70-100 | 130-145 | 193-203 | 180-186 | - | 310-630 | 900-1100 | - | - | - | -
Intel 80286+87 | 1982 | x86-16, x87 | 16MHz | 145kflops | - | - | 2 | 2 | 21 | 25 | 10-15 | 70-100 | 130-145 | 193-203 | 180-186 | - | 310-630 | 900-1100 | - | - | - | -
Intel 80386+87 | 1985 | x86-32, x87 | 25MHz | 725kflops | 977kflops | 278/-/-/- kflops | 2 | 2 | 9-38 | 43 | 18 | 23-34 | 46-57 | 88-91 | 122-129 | 122-771 | 211-476 | 120-538 | - | - | - | -
Intel 80486 | 1989 | x86-32, x87 | 66MHz | 5.5Mflops | 4.9Mflops | 2.8/3.6/-/- Mflops | 1 | 1 | 13-42 | 43 | 4 | 8-20 | 16 | 73 | 83-87 | 257-354 | 140-279 | 196-329 | - | - | - | -
Pentium | 1993 | x86-32, x87 | 100MHz | 100Mflops | 81Mflops | 10/22/-/- Mflops | 1 | 1 | 10 | 46 | 0 | 3/1 | 3/2 | 39 | 70 | 16-126 | 13-57 | 22-111 | - | - | - | -
Pentium MMX | 1996 | x86-32, x87 | 200MHz | 200Mflops | 163Mflops | 13/31/80/- Mflops | 1 | 1 | 9 | 46 | 0 | 3/1 | 3/2 | 39/37 | 70/68 | 65-100 | 53-59 | 103 | - | - | - | -
Pentium II | 1997 | x86-32, x87 | 300MHz | 300Mflops | 295Mflops | 30/-/-/172 Mflops | 1 | 1 | 4/1 | 39/37 | 0 | 3/1 | 5/2 | 38/37 | 69 | 27-103 | 66 | 103 | - | - | - | -
Pentium III | 1999 | x86-32, x87, sse | 500MHz | 500Mflops | - | - | 1 | 1 | 4/1 | 39/37 | 0 | 3/1 | 5/2 | 38/37 | 69 | 27-103 | 66 | 103 | - | - | - | -
Pentium 4 Northwood | 2002 | x86-32, x87, sse2 | 2.4GHz | 4.8Gflops | 4.8Gflops | 0.2/1.0/3.1/3.7 Gflops | 0.5-1.5/0.25 | 0.5-1.5/0.25 | 16/8 | 50/23 | 0 | 5/1 | 7/2 | 43 | 43 | 180/170 | 165/63 | 200/90 | 4/2 | 6/2 | 4/2 | 6/2
Atom Diamondville | 2008 | x86-32, x87, ssse3 | 1.67GHz | 1.7Gflops | 1.6Gflops | 0.2/0.4/0.5/1.0 Gflops | 1 | 1 | 5/2 | 61 | 1 | 5/1 | 5/2 | 71 | 71 | 260 | 100 | 220 | 5/1 | 5/2 | 6/6 | 9/9
cpu | year | instruction set | cores | clock | peak flops | add/mul flops | lin algebra (naive/eigen/atlas/goto) | mov | add | mul | idiv | fcpyd | faddd | fmuld | fdivd | fsqrtd | [exp] | [log] | [pow] | [sin] | [erf]
ARM11 (raspi) | 2012 | armv6, vfp | 1 | 700MHz | 362Mflops | - | 28/70/-/- Mflops | - | - | - | - | - | 8/2 | 9/2 | 34 | 48 | 305 | 540 | 830 | 260 | 185
ARM Cortex-A5 (htc desire c) | 2012 | armv7-a, vfpv4, neon | 1 | 600MHz | 573Mflops | - | 55/152/-/- Mflops | - | - | - | - | - | 4/1 | 7/4 | 32 | 35 | 265 | 385 | 645 | 215 | 170
cpu | year | instruction set | cores | clock | peak flops | add/mul flops | lin algebra (naive/eigen/atlas/goto) | mov | add | imul | idiv | movapd | addpd | mulpd | divpd | sqrtpd | [exp] | [log] | [pow] | [sin] | [erf]
Core 2 Wolfdale | 2007 | x86-64, x87, sse4.1 | 2 | 2.33GHz | 19Gflops | 19Gflops | 1.6/11/15/16 Gflops | 1/0.33 | 1/0.33 | 5/2 | 34-88 | 1/0.33 | 3/1 | 5/1 | 6-21 | 6-20 | 118 | 125 | 422 | 88 | 81
Core i5 / i7 Nehalem (Bloomfield) | 2008 | x86-64, x87, sse4 | 4 | 2.67GHz | 43Gflops | 42Gflops | 1.7/21/33/40 Gflops | 1/0.33 | 1/0.33 | 3/2 | 37-100/26-86 | 1 | 3/1 | 5/1 | 7-22 | 7-32 | 100 | 114 | 404 | 90 | 68
Core i5 / i7 Sandy Bridge | 2011 | x86-64, x87, sse4, avx | 4 | 3.0GHz | 96Gflops | 95Gflops | 3.0/27/70/45 Gflops | 1/0.33 | 1/0.33 | 3/1 | 40-103/25-84 | 1 | 3/1 | 5/1 | 10-22 | 10-21 | 94 | 108 | 418 | 88 | 62
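To make the latency/throughput distinction concrete, here is a toy cycle-count model (not a simulator of any real pipeline; the numbers 3 and 1 used below are the fadd latency and throughput from the Pentium row above):

```python
def cycles(n, latency, throughput, dependent):
    """Estimate cycles for n instructions on a single pipelined unit.

    A dependent chain must wait for each result, so every instruction
    costs its full latency. Independent instructions can be issued
    every `throughput` cycles; only the last one has to drain the pipe.
    """
    if dependent:
        return n * latency
    return (n - 1) * throughput + latency

# fadd on a Pentium: latency 3, throughput 1
print(cycles(100, 3, 1, dependent=True))   # 300 cycles
print(cycles(100, 3, 1, dependent=False))  # 102 cycles
```

So 100 dependent additions take three times as long as 100 independent ones, which is exactly the gap between the latency and throughput columns.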
References:

Calculation of peak flops

To calculate peak flops (double precision) we assume an algorithm which uses the same number of additions as multiplications and no other operations. Furthermore we assume that the operations are independent of each other and can be fully pipelined, and that all variables are kept in registers so no memory access is necessary. The formula then is

flops = 2 / t(add/mul) * clock * cores * vector multiplier

where t(add/mul) is the number of cycles for a combined and independent add/mul pair. This seems to be max(add, mul) (of the throughputs) for the Pentium and onwards, and add + mul (of the latencies) for anything up to the 80486, which is not pipelined. The vector multiplier is:
instruction set | vector multiplier
x87 | 1
sse2 | 2
avx | 4
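As a sanity check, the formula can be evaluated against some rows of the tables above (clock rates, core counts and throughputs as given there):

```python
def peak_flops(clock_hz, cores, vector_multiplier, t_addmul):
    """Peak flops: 2 flops (one add + one mul) every t_addmul cycles,
    per core, times the vector width."""
    return 2.0 / t_addmul * clock_hz * cores * vector_multiplier

# Pentium: fadd throughput 1, fmul throughput 2 -> t = max(1, 2) = 2
print(peak_flops(100e6, 1, 1, 2))    # -> 100 Mflops
# Nehalem: addpd/mulpd throughput 1, 4 cores, sse2 vector width 2
print(peak_flops(2.67e9, 4, 2, 1))   # -> ~43 Gflops
# Sandy Bridge: 4 cores, avx vector width 4
print(peak_flops(3.0e9, 4, 4, 1))    # -> 96 Gflops
# 80486: not pipelined, t = fadd + fmul = 8 + 16 = 24 cycles (best case)
print(peak_flops(66e6, 1, 1, 24))    # -> ~5.5 Mflops
```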

Realistic values for flops

This depends very much on the algorithm, how much memory is used, and how well memory accesses can be cached. However, even if we assume all values are kept in registers, the result can still vary enormously. For example, if we need to execute additions and multiplications whose results depend on each other, then pipelining can't be used, sse2/avx packed operations can't be used, and everything has to run on one core. So we get
flops = 1/average(addsd, mulsd) * clock * 1 * 1 = 1/4 * clock = 667 Mflops
on a 2.67GHz Nehalem CPU, for example, which is way off the 43Gflops peak. Even worse, if we need to use lots of divisions, then assuming a worst-case latency of 22 cycles we'd get a mere 121 Mflops.
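Both estimates follow directly from the Nehalem latencies in the table above (addsd 3, mulsd 5, divpd up to 22):

```python
clock = 2.67e9  # Nehalem

# dependent add/mul chain: each flop waits out the full latency,
# so on average (3 + 5) / 2 = 4 cycles per flop
dependent_flops = clock / ((3 + 5) / 2)
print(dependent_flops / 1e6)  # -> ~667 Mflops

# division-bound chain: one flop per divide, worst-case latency 22
division_flops = clock / 22
print(division_flops / 1e6)   # -> ~121 Mflops
```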
Matrix-matrix operations can easily be sse2/avx-vectorised, pipelined and parallelised onto all cores, and they often use alternating additions and multiplications, so that if efficient memory management maximises caching, close to peak performance can be achieved. That's why netlib benchmarks normally get fairly close to peak performance. For highly efficient implementations of linear algebra and matrix operations see below.
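One back-of-the-envelope way to see why matrix-matrix multiplication caches so well (a rough model, not data from the tables above): it performs 2n^3 flops on only 3n^2 doubles, so the flops-per-byte ratio grows linearly with n, and for large matrices the arithmetic dominates the memory traffic.

```python
def arithmetic_intensity(n):
    """Flops per byte of an n x n double-precision matmul:
    2*n^3 flops (one mul + one add per inner step) over three
    n x n matrices of 8-byte doubles, each counted once."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * 8
    return flops / bytes_moved  # simplifies to n / 12

for n in (12, 120, 1200):
    print(n, arithmetic_intensity(n))  # -> 1.0, 10.0, 100.0
```

This counts each matrix as touched only once, which a naive implementation does not achieve; blocked implementations (atlas, goto) get close to it, which is why their lin-algebra numbers approach peak flops.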
References: