fpu benchmarks
15 May 2012
[fpu benchmark | linear algebra benchmark]
fpu benchmark: fpu.tar.bz2
Here, the main objective is to design a benchmark which purely measures the speed of floating point operations without being hampered by memory access; hence all operations should be performed on registers only. A very simple test is a for loop in which the same value, say 'a', is added again and again:
for(int k=0; k<num; k++) { x0+=a; }

The gcc compiler will not simplify this to a multiplication (unless the options -ffast-math or -funsafe-math-optimizations are given), as this would change the result due to roundoff errors.
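As a quick illustration of the roundoff argument (a minimal standalone sketch, not part of fpu.cpp), repeated addition and a single multiplication give results that differ in the last digits:

#include <cstdio>

// the two printed values differ slightly, which is why gcc may not fold
// the loop into a single multiply unless -ffast-math permits it
int main()
{
    double a = 0.1, s = 0.0;               // 0.1 is not exactly representable in binary
    for(int k=0; k<1000000; k++) s += a;   // roundoff accumulates with every addition
    std::printf("sum = %.17g\n", s);
    std::printf("n*a = %.17g\n", 1000000*a);
    return 0;
}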
Depending on whether we use the sse2 or the x87 instruction set, the simple for loop generates the following assembly:
$ g++-4.4.1 -S -fverbose-asm -masm=intel -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
.L1263:
    addsd  xmm1, xmm0    # x0+=a (x0 is stored in xmm1, a is in xmm0)
    add    rax, 8        # k+=8
    addsd  xmm1, xmm0    # x0+=a
    cmp    rbx, rax      # compare num, k
    addsd  xmm1, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x0+=a
    ja     .L1263        # jump conditional to .L1263
...
$ g++-4.4.1 -S -fverbose-asm -masm=intel -m32 -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
.L1181:
    fadd   st, st(1)     # x0+=a (x0 is stored in st, a is in st(1))
    add    eax, 8        # k+=8
    cmp    ebx, eax      # compare num, k
    fadd   st, st(1)     # x0+=a
    fadd   st, st(1)     # x0+=a
    fadd   st, st(1)     # x0+=a
    fadd   st, st(1)     # x0+=a
    fadd   st, st(1)     # x0+=a
    fadd   st, st(1)     # x0+=a
    fadd   st, st(1)     # x0+=a
    ja     .L1181        # jump conditional to .L1181
...

Note that on most architectures the integer and floating point operations are executed in parallel, so the integer counter (k=0; k<num; k++) does not add any overhead; and even on older architectures where it may not execute in parallel, the unrolling of the loop reduces the overhead.
The simple loop will still not achieve maximum performance since it makes no use of pipelining: each addition depends on the result of the previous one, so a new addition can only be started once the previous one has completed. On modern Intel CPUs one addition requires 3 and a multiplication 5 cpu cycles, so a single dependency chain reaches at most 1/3 of an addition per cycle. With the following loop we can make use of a two-stage pipeline:
// make sure x0, x1 are initialised with different values or the compiler
// might simplify
for(int k=0; k<num; k++) { x0+=a; x1+=a; }

This translates to the following assembly; again, depending on whether we use the sse2 or the x87 instruction set we get:
$ g++-4.4.1 -S -fverbose-asm -masm=intel -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
.L1379:
    addsd  xmm2, xmm0    # x0+=a
    add    rax, 8        # k+=8
    addsd  xmm1, xmm0    # x1+=a
    cmp    rbx, rax      # compare num, k
    addsd  xmm2, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x1+=a
    addsd  xmm2, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x1+=a
    addsd  xmm2, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x1+=a
    addsd  xmm2, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x1+=a
    addsd  xmm2, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x1+=a
    addsd  xmm2, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x1+=a
    addsd  xmm2, xmm0    # x0+=a
    addsd  xmm1, xmm0    # x1+=a
    ja     .L1379        # conditional jump
...
$ g++-4.4.1 -S -fverbose-asm -masm=intel -m32 -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
.L1143:
    fadd   st(1), st     # x0+=a
    add    eax, 8        # k+=8
    cmp    ebx, eax      # compare num, k
    fadd   st(3), st     # x1+=a
    fadd   st(1), st     # x0+=a
    fadd   st(3), st     # x1+=a
    fadd   st(1), st     # x0+=a
    fadd   st(3), st     # x1+=a
    fadd   st(1), st     # x0+=a
    fadd   st(3), st     # x1+=a
    fadd   st(1), st     # x0+=a
    fadd   st(3), st     # x1+=a
    fadd   st(1), st     # x0+=a
    fadd   st(3), st     # x1+=a
    fadd   st(1), st     # x0+=a
    fadd   st(3), st     # x1+=a
    ja     .L1143        # conditional jump
...

Using templates we can write a general add-routine with Num independent additions:
template <size_t Num>
double add(double a, size_t ops)
{
    double X[Num];                          // one independent accumulator per chain
    for(size_t i=0; i<Num; i++) X[i] = i;   // initialise with different values
    const size_t loops = ops/Num;
    for(size_t k=0; k<loops; k++) {
        for(size_t i=0; i<Num; i++) { X[i] += a; }
    }
    double sum = 0.0;                       // return something so the additions are not optimised away
    for(size_t i=0; i<Num; i++) sum += X[i];
    return sum;
}

A moderately recent version of the gcc compiler will unroll the inner loop and keep all values of X[] in registers if at all possible; old compiler versions like gcc 2.95, however, will not! Below is the assembly output of add<2>().
$ g++-4.4.1 -S -fverbose-asm -masm=intel -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
# identical to the simple for loop with two variables as above
.L1034:
    addsd  xmm2, xmm0
    add    rax, 8
    addsd  xmm1, xmm0
    cmp    rbx, rax
    addsd  xmm2, xmm0
    addsd  xmm1, xmm0
    addsd  xmm2, xmm0
    addsd  xmm1, xmm0
    addsd  xmm2, xmm0
    addsd  xmm1, xmm0
    addsd  xmm2, xmm0
    addsd  xmm1, xmm0
    addsd  xmm2, xmm0
    addsd  xmm1, xmm0
    addsd  xmm2, xmm0
    addsd  xmm1, xmm0
    addsd  xmm2, xmm0
    addsd  xmm1, xmm0
    ja     .L1034
.L1033:
...
$ g++-4.4.1 -S -fverbose-asm -masm=intel -m32 -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
# not identical to the simple for loop with two variables:
# here the compiler makes use of fxch, which costs no additional run-time
# on most pentium architectures; however, it degrades performance
# considerably on the atom architecture, where fxch costs 1 cpu cycle
.L1003:
    fadd   st, st(1)
    fxch   st(3)
    add    eax, 8
    cmp    ebx, eax
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    fxch   st(3)
    fadd   st, st(1)
    ja     .L1046
...

Sample output of the programme generated from fpu.cpp; execute with $ ./fpu <cpu MHz> <mill ops>.
See source code for description of each individual test.
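For illustration, the timing of each test could be done along the following lines (a minimal sketch only, not the actual fpu.cpp; it assumes the add<Num>() template from above is in the same file, and the cpu clock is passed on the command line rather than measured):

#include <cstdio>
#include <cstdlib>
#include <ctime>

// hypothetical harness: time add<4>() and report Mflops and flops/cycle
// for a given cpu clock in MHz
int main(int argc, char** argv)
{
    if(argc < 3) {
        std::fprintf(stderr, "usage: %s <cpu MHz> <mill ops>\n", argv[0]);
        return 1;
    }
    const double mhz = std::atof(argv[1]);
    const std::size_t ops = static_cast<std::size_t>(std::atof(argv[2])*1e6);

    std::clock_t t0 = std::clock();
    double res = add<4>(1.5217, ops);       // print res so the work cannot be discarded
    double sec = double(std::clock()-t0)/CLOCKS_PER_SEC;

    double mflops = ops/sec/1e6;             // Mflops/MHz = flops per cpu cycle
    std::printf("add<4>: %.3fs, res=%g, %.3f Mflops, %.3f flops/cycle\n",
                sec, res, mflops, mflops/mhz);
    return 0;
}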
user@pentium_mmx:~ $ ./fpu 200 100   # cpu clock is 200MHz, do 100 mill ops
...
calculating: x<k> + a + a + ... + a - (x<k>+n*a), with a=1.521700
addx87:  1.507s, res=-1.206e-04, 66.349 Mflops, 0.332 flops/cycle
add1:    1.632s, res=-1.207e-04, 61.268 Mflops, 0.306 flops/cycle
add2:    0.848s, res= 7.831e-06,  0.118 Gflops, 0.590 flops/cycle
add<1>:  1.570s, res=-1.207e-04, 63.709 Mflops, 0.319 flops/cycle
add<2>:  0.847s, res= 7.816e-06,  0.118 Gflops, 0.590 flops/cycle
add<3>:  0.586s, res= 1.878e-05,  0.171 Gflops, 0.853 flops/cycle
add<4>:  0.596s, res=-1.319e-05,  0.168 Gflops, 0.839 flops/cycle
add<5>:  0.552s, res=-2.651e-05,  0.181 Gflops, 0.905 flops/cycle
...

user@nehalem:~ $ ./fpu 2667 100   # cpu clock is 2.67GHz, do 100 mill ops
...
calculating: x<k> + a + a + ... + a - (x<k>+n*a), with a=1.521700
addx87:  0.118s, res=-1.206e-04, 0.845 Gflops, 0.317 flops/cycle
add1:    0.115s, res=-2.540e-01, 0.871 Gflops, 0.327 flops/cycle
add2:    0.057s, res= 1.175e-02, 1.743 Gflops, 0.654 flops/cycle
add<1>:  0.114s, res=-2.540e-01, 0.874 Gflops, 0.328 flops/cycle
add<2>:  0.057s, res= 1.175e-02, 1.748 Gflops, 0.656 flops/cycle
add<3>:  0.038s, res= 3.682e-02, 2.622 Gflops, 0.983 flops/cycle
add<4>:  0.038s, res=-2.601e-02, 2.623 Gflops, 0.984 flops/cycle
add<5>:  0.038s, res=-5.064e-02, 2.627 Gflops, 0.985 flops/cycle
...
- download fpu.tar.bz2
linear algebra benchmark: linalg.tar.bz2
For a more applied test which also makes extensive use of main memory we solve a linear system of equations. This is the same kind of benchmark as the Linpack benchmark used for the Top 500 supercomputer list. These tests require optimised versions of the BLAS routines, such as ATLAS or GotoBLAS2. A more elegant interface to the linear algebra routines is provided by Eigen, a header-only C++ implementation. All three libraries are required to compile the benchmark; a sketch of the core of such a test follows the download link.

- download linalg.tar.bz2
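For illustration, the core of such a test with Eigen might look like this (a minimal sketch, not the actual benchmark from linalg.tar.bz2; the matrix size n is arbitrary, the flop count 2/3*n^3 is the usual Linpack convention for an LU factorisation, and Eigen's built-in kernels are used here instead of an external BLAS):

#include <cstdio>
#include <ctime>
#include <Eigen/Dense>

// minimal sketch: time the solution of a random n-by-n system A x = b
// with Eigen's LU decomposition (partial pivoting, as in Linpack)
int main()
{
    const int n = 2000;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::VectorXd b = Eigen::VectorXd::Random(n);

    std::clock_t t0 = std::clock();
    Eigen::VectorXd x = A.partialPivLu().solve(b);
    double sec = double(std::clock()-t0)/CLOCKS_PER_SEC;

    double gflops = (2.0/3.0)*double(n)*n*n/sec/1e9;   // ~2/3 n^3 flops for the factorisation
    std::printf("n=%d: %.3fs, %.3f Gflops, residual=%g\n",
                n, sec, gflops, (A*x - b).norm());     // residual as a sanity check
    return 0;
}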