Assembly instruction timings
27 Nov 2011

Instruction tables
The following tables show the number of cycles certain operations take to execute. Timings are given as latency / throughput, where latency means the actual time the instruction takes from start to finish. Due to pipelining, another instruction can be started before the first one is finished; this of course requires that the second operation does not depend on the result of the first. The throughput gives the time between two results if the instructions are perfectly pipelined, i.e. the best performance possible. The tables contain only a few (but important) instructions and only values for register-to-register operations (using the full register width, e.g. 64-bit on x86-64 and 80-bit for x87 operations). For a much more comprehensive overview see the references below. A small sketch illustrating the latency/throughput difference follows the tables.
cpu | year | instruction set | clock | peak flops | add/mul flops | lin algebra naive/eigen/atlas/goto | mov | add | imul | idiv | fxch | fadd | fmul | fdiv | fsqrt | fsin | f2xm1 | fyl2x | addsd | mulsd | addpd | mulpd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Intel 8086+87 | 1978 | x86-16, x87 | 5MHz | 45kflops | | | 2 | 3 | 128-154 | 165-184 | 10-15 | 70-100 | 130-145 | 193-203 | 180-186 | | 310-630 | 900-1100 | | | | |
Intel 80286+87 | 1982 | x86-16, x87 | 16MHz | 145kflops | | | 2 | 2 | 21 | 25 | 10-15 | 70-100 | 130-145 | 193-203 | 180-186 | | 310-630 | 900-1100 | | | | |
Intel 80386+87 | 1985 | x86-32, x87 | 25MHz | 725kflops | 977kflops | 278/-/-/- kflops | 2 | 2 | 9-38 | 43 | 18 | 23-34 | 46-57 | 88-91 | 122-129 | 122-771 | 211-476 | 120-538 | | | | |
Intel 80486 | 1989 | x86-32, x87 | 66MHz | 5.5Mflops | 4.9Mflops | 2.8/3.6/-/- Mflops | 1 | 1 | 13-42 | 43 | 4 | 8-20 | 16 | 73 | 83-87 | 257-354 | 140-279 | 196-329 | | | | |
Pentium | 1993 | x86-32, x87 | 100MHz | 100Mflops | 81Mflops | 10/22/-/- Mflops | 1 | 1 | 10 | 46 | 0 | 3/1 | 3/2 | 39 | 70 | 16-126 | 13-57 | 22-111 | | | | |
Pentium MMX | 1996 | x86-32, x87 | 200MHz | 200Mflops | 163Mflops | 13/31/80/- Mflops | 1 | 1 | 9 | 46 | 0 | 3/1 | 3/2 | 39/37 | 70/68 | 65-100 | 53-59 | 103 | | | | |
Pentium II | 1997 | x86-32, x87 | 300MHz | 300Mflops | 295Mflops | 30/-/-/172 Mflops | 1 | 1 | 4/1 | 39/37 | 0 | 3/1 | 5/2 | 38/37 | 69 | 27-103 | 66 | 103 | | | | |
Pentium III | 1999 | x86-32, x87, sse | 500MHz | 500Mflops | | | 1 | 1 | 4/1 | 39/37 | 0 | 3/1 | 5/2 | 38/37 | 69 | 27-103 | 66 | 103 | | | | |
Pentium 4 Northwood | 2002 | x86-32, x87, sse2 | 2.4GHz | 4.8Gflops | 4.8Gflops | 0.2/1.0/3.1/3.7 Gflops | 0.5-1.5 / 0.25 | 0.5-1.5 / 0.25 | 16/8 | 50/23 | 0 | 5/1 | 7/2 | 43 | 43 | 180/170 | 165/63 | 200/90 | 4/2 | 6/2 | 4/2 | 6/2 |
Atom Diamondville | 2008 | x86-32, x87, ssse3 | 1.67GHz | 1.7Gflops | 1.6Gflops | 0.2/0.4/0.5/1.0 Gflops | 1 | 1 | 5/2 | 61 | 1 | 5/1 | 5/2 | 71 | 71 | 260 | 100 | 220 | 5/1 | 5/2 | 6/6 | 9/9 |
cpu | year | instruction set | cores | clock | peak flops | add/mul flops | lin algebra naive/eigen/atlas/goto | mov | add | mul | idiv | fcpyd | faddd | fmuld | fdivd | fsqrtd | [exp] | [log] | [pow] | [sin] | [erf] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ARM11 (raspi) | 2012 | armv6, vfp | 1 | 700MHz | 362Mflops | | 28/70/-/- Mflops | | | | | - | 8/2 | 9/2 | 34 | 48 | 305 | 540 | 830 | 260 | 185 |
ARM Cortex-A5 (htc desire c) | 2012 | armv7-a, vfpv4, neon | 1 | 600MHz | 573Mflops | | 55/152/-/- Mflops | | | | | - | 4/1 | 7/4 | 32 | 35 | 265 | 385 | 645 | 215 | 170 |
cpu | year | instruction set | cores | clock | peak flops | add/mul flops | lin algebra naive/eigen/atlas/goto | mov | add | imul | idiv | movapd | addpd | mulpd | divpd | sqrtpd | [exp] | [log] | [pow] | [sin] | [erf] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Core 2 Wolfdale | 2007 | x86-64, x87, sse4.1 | 2 | 2.33GHz | 19Gflops | 19Gflops | 1.6/11/15/16 Gflops | 1/0.33 | 1/0.33 | 5/2 | 34-88 | 1/0.33 | 3/1 | 5/1 | 6-21 | 6-20 | 118 | 125 | 422 | 88 | 81 |
Core i5 / i7 Nehalem (Bloomfield) | 2008 | x86-64, x87, sse4 | 4 | 2.67GHz | 43Gflops | 42Gflops | 1.7/21/33/40 Gflops | 1/0.33 | 1/0.33 | 3/2 | 37-100/26-86 | 1 | 3/1 | 5/1 | 7-22 | 7-32 | 100 | 114 | 404 | 90 | 68 |
Core i5 / i7 Sandy Bridge | 2011 | x86-64, x87, sse4, avx | 4 | 3.0GHz | 96Gflops | 95Gflops | 3.0/27/70/45 Gflops | 1/0.33 | 1/0.33 | 3/1 | 40-103/25-84 | 1 | 3/1 | 5/1 | 10-22 | 10-21 | 94 | 108 | 418 | 88 | 62 |
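
A minimal sketch (my own, not code from the article) of the latency/throughput difference described above: both loops perform the same number of additions, but the first is a single dependent chain limited by the add latency, while the second keeps four independent accumulators and is limited by the add throughput.

```c
/* Minimal sketch: the same number of double additions, once as a single
 * dependent chain (each add has to wait for the previous result, so the
 * add latency dominates) and once split over four independent
 * accumulators (the adds overlap in the pipeline, so the throughput
 * dominates). Compile with e.g. `gcc -O2 chain.c`; the effect depends on
 * the compiler neither vectorising nor reassociating the loops. */
#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void) {
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a += 1.0;                         /* dependent: ~latency cycles per add */
    clock_t t1 = clock();

    for (long i = 0; i < N; i += 4) {     /* independent: ~1/throughput per add */
        a += 1.0; b += 1.0; c += 1.0; d += 1.0;
    }
    clock_t t2 = clock();

    printf("dependent chain:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("independent chains: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return (int)(a + b + c + d) & 0xff;   /* keep the results alive */
}
```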
- Unknown source (possibly John Allen) x86 instruction timings and x87 instruction timings, 2002
- Agner Fog's optimisation manuals, in particular instruction_tables.pdf, 2008
- Wikipedia page on x87 performance
- google answer on the performance of the 8087
- Stanford University cpu db
- own benchmark: used to get all timing values for the ARM CPUs and the timings for the soft implementations of exp, log, etc.; these values might not be as reliable
Calculation of peak flops
To calculate peak flops (double precision) we assume an algorithm which uses the same number of additions as multiplications and no other operations. Furthermore we assume that the operations are independent of each other and can be fully pipelined, and that all variables are kept in registers so no memory access is necessary. The formula then is:

(double prec) peak flops = 2 / combined_pipelined_double_prec(add, mul) * clock * vector_multiplier * cores
Here combined_pipelined_double_prec(add, mul) seems to be max(add, mul) for the Pentium and onwards, and add + mul for anything up to the 80486.
The vector multiplier is:

instruction set | vector multiplier |
---|---|
x87 | 1 |
sse2 | 2 |
avx | 4 |
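
As a worked example (my own sketch, not code from the article), plugging a few rows of the tables into this formula reproduces the peak flops column; the struct layout and the chosen CPUs are only illustrative.

```c
/* Sketch: reproduce the "peak flops" column for a few rows of the tables.
 * `add` and `mul` are the reciprocal throughputs (the second number of a
 * latency/throughput pair) for the pipelined CPUs, and plain latencies
 * for the older, non-pipelined ones. */
#include <stdio.h>

struct cpu {
    const char *name;
    double add, mul;   /* cycles per double add / mul */
    double clock;      /* Hz */
    double vector;     /* vector multiplier: 1 = x87/scalar, 2 = sse2, 4 = avx */
    int cores;
    int pipelined;     /* 0 for anything up to the 80486 */
};

static double peak_flops(const struct cpu *c) {
    /* combined_pipelined_double_prec(add, mul) */
    double combined = c->pipelined
        ? (c->add > c->mul ? c->add : c->mul)   /* max(add, mul) */
        : c->add + c->mul;                      /* add + mul     */
    return 2.0 / combined * c->clock * c->vector * c->cores;
}

int main(void) {
    const struct cpu cpus[] = {
        { "80486",        8, 16, 66e6,   1, 1, 0 },  /* fadd ~8, fmul 16     */
        { "Pentium",      1,  2, 100e6,  1, 1, 1 },  /* fadd 3/1, fmul 3/2   */
        { "Nehalem",      1,  1, 2.67e9, 2, 4, 1 },  /* addpd 3/1, mulpd 5/1 */
        { "Sandy Bridge", 1,  1, 3.0e9,  4, 4, 1 },  /* avx, 4 cores         */
    };
    /* prints roughly 5.5, 100, 42720 and 96000 Mflops */
    for (unsigned i = 0; i < sizeof cpus / sizeof cpus[0]; i++)
        printf("%-12s %10.1f Mflops\n", cpus[i].name, peak_flops(&cpus[i]) / 1e6);
    return 0;
}
```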
Realistic values for flops
This very much depends on the algorithm, how much memory is used and how well memory can be cached. However, even if we assume all values are stored in registers, the result can still vary enormously. For example, if we need to execute additions and multiplications but the results depend on each other, then pipelining can't be used, sse2/avx packed operations can't be used, and everything has to run on one core. So we get

flops = 1/average(addsd, mulsd) * clock * 1 * 1 = 1/4 * clock = 667 Mflops

on a 2.67GHz Nehalem cpu for example, which is way off the 43Gflops peak. Even worse, if we need to use lots of divisions, then assuming a worst case latency of 22 cycles we'd get a mere 121 Mflops.
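
The same back-of-the-envelope calculation as a small sketch, with the Nehalem latencies taken from the table above (the variable names are only illustrative):

```c
/* Sketch: dependent (non-pipelined) chains on a 2.67GHz Nehalem, using
 * the latencies from the table above. Only one scalar operation is in
 * flight at a time, so the average latency per operation is the limit. */
#include <stdio.h>

int main(void) {
    double clock = 2.67e9;
    double addsd_latency = 3.0, mulsd_latency = 5.0;
    double div_latency = 22.0;                       /* worst case divpd */

    double add_mul  = 1.0 / ((addsd_latency + mulsd_latency) / 2.0) * clock;
    double div_only = 1.0 / div_latency * clock;

    printf("dependent add/mul chain:  %.0f Mflops\n", add_mul / 1e6);   /* ~667 */
    printf("dependent division chain: %.0f Mflops\n", div_only / 1e6);  /* ~121 */
    return 0;
}
```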
Matrix-matrix operations can easily be sse2/avx vectorised, pipelined and parallelised onto all cores, and often use alternating additions and multiplications, so if efficient memory management that maximises caching is implemented, close to peak performance can be achieved. That's why netlib benchmarks normally get fairly close to peak performance. For highly efficient implementations of linear algebra and matrix operations see below.
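
For contrast, here is a minimal sketch of the kind of naive triple-loop matrix multiplication that the "naive" numbers in the lin algebra column presumably correspond to (an assumption; the article's own benchmark code is not shown). Its inner loop is exactly the bad case described in the previous section.

```c
/* A naive triple-loop matrix multiplication: one dependent accumulation
 * chain in the inner loop and a stride-n access to B, so it is latency-
 * and memory-bound rather than throughput-bound. Optimised libraries
 * (Eigen, ATLAS, GotoBLAS) block for the caches, vectorise and keep
 * several independent accumulators to get close to peak. */
void matmul_naive(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```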
References: