fpu benchmarks

15 May 2012
[fpu benchmark | linear algebra benchmark]

fpu benchmark: fpu.tar.bz2

Here, the main objective is to design a benchmark that measures purely the speed of floating-point operations without being hampered by memory access; hence all operations should be performed on registers only. A very simple test is a for loop in which the same value, say 'a', is added over and over again:
for(int k=0; k<num; k++) {
   x0+=a;
}
The gcc compiler will not simplify this to a single multiplication (unless -ffast-math or -funsafe-math-optimizations is given), as that would change the result due to roundoff errors. Depending on whether we use the SSE2 or the x87 instruction set, the simple for loop generates the following assembly:
$ g++-4.4.1 -S -fverbose-asm -masm=intel -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
.L1263:
        addsd   xmm1, xmm0      # x0+=a (x0 is stored in xmm1, a is in xmm0)
        add     rax, 8          # k+=8
        addsd   xmm1, xmm0      # x0+=a
        cmp     rbx, rax        # compare num, k
        addsd   xmm1, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x0+=a
        ja      .L1263          # conditional jump to .L1263
...
$ g++-4.4.1 -S -fverbose-asm -masm=intel -m32 -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
.L1181:
        fadd    st, st(1)       # x0+=a (x0 is stored in st, a is in st(1))
        add     eax, 8          # k+=8
        cmp     ebx, eax        # compare num, k
        fadd    st, st(1)       # x0+=a
        fadd    st, st(1)       # x0+=a
        fadd    st, st(1)       # x0+=a
        fadd    st, st(1)       # x0+=a
        fadd    st, st(1)       # x0+=a
        fadd    st, st(1)       # x0+=a
        fadd    st, st(1)       # x0+=a
        ja      .L1181          # conditional jump to .L1181
...
Note that on most architectures integer and floating-point operations are executed in parallel, so the integer counter (k=0; k<num; k++) adds no overhead; even on older architectures where it may not execute in parallel, the unrolling of the loop reduces the overhead.
The simple loop will still not achieve maximum performance, as it makes no use of pipelining: the result of one addition is needed before the next addition can be started. On modern Intel CPUs an addition takes 3 cpu cycles and a multiplication 5. With the following loop we can make use of a two-stage pipeline:
// make sure x0, x1 are initialised with different values or the compiler
// might simplify
for(int k=0; k<num; k++) {
   x0+=a;
   x1+=a;
}
This translates to the following assembly; again, depending on whether we use the SSE2 or the x87 instruction set, we get:
$ g++-4.4.1 -S -fverbose-asm -masm=intel -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
 .L1379:
        addsd   xmm2, xmm0      # x0+=a
        add     rax, 8          # k+=8
        addsd   xmm1, xmm0      # x1+=a
        cmp     rbx, rax        # compare num, k
        addsd   xmm2, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x1+=a
        addsd   xmm2, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x1+=a
        addsd   xmm2, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x1+=a
        addsd   xmm2, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x1+=a
        addsd   xmm2, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x1+=a
        addsd   xmm2, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x1+=a
        addsd   xmm2, xmm0      # x0+=a
        addsd   xmm1, xmm0      # x1+=a
        ja      .L1379          # conditional jump to .L1379
...
$ g++-4.4.1 -S -fverbose-asm -masm=intel -m32 -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
.L1143:
        fadd    st(1), st       # x0+=a
        add     eax, 8          # k+=8
        cmp     ebx, eax        # compare num, k
        fadd    st(3), st       # x1+=a
        fadd    st(1), st       # x0+=a
        fadd    st(3), st       # x1+=a
        fadd    st(1), st       # x0+=a
        fadd    st(3), st       # x1+=a
        fadd    st(1), st       # x0+=a
        fadd    st(3), st       # x1+=a
        fadd    st(1), st       # x0+=a
        fadd    st(3), st       # x1+=a
        fadd    st(1), st       # x0+=a
        fadd    st(3), st       # x1+=a
        fadd    st(1), st       # x0+=a
        fadd    st(3), st       # x1+=a
        fadd    st(1), st       # x0+=a
        fadd    st(3), st       # x1+=a
        ja      .L1143          # conditional jump to .L1143
...
Using templates we can write a general add routine with Num independent additions:
template <size_t Num>
double add(double a, size_t ops) {
   double X[Num];
   for(size_t i=0; i<Num; i++){ X[i] = drand48(); } // initialise randomly
   const size_t loops = ops/Num;
   for(size_t k=0; k<loops; k++) {
      for(size_t i=0; i<Num; i++){ X[i]+=a; }
   }
   double res = 0;                                  // return a value that depends on X[]
   for(size_t i=0; i<Num; i++){ res += X[i]; }      // so the additions cannot be optimised away
   return res;
}
A moderately recent version of the gcc compiler will unroll the inner loop and keep all values of X[] in registers whenever possible; old compiler versions such as gcc 2.95 will not! Below is the assembly output of add<2>().
$ g++-4.4.1 -S -fverbose-asm -masm=intel -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
# identical to the simple for loop with two variables above
 .L1034:
        addsd   xmm2, xmm0
        add     rax, 8
        addsd   xmm1, xmm0
        cmp     rbx, rax
        addsd   xmm2, xmm0
        addsd   xmm1, xmm0
        addsd   xmm2, xmm0
        addsd   xmm1, xmm0
        addsd   xmm2, xmm0
        addsd   xmm1, xmm0
        addsd   xmm2, xmm0
        addsd   xmm1, xmm0
        addsd   xmm2, xmm0
        addsd   xmm1, xmm0
        addsd   xmm2, xmm0
        addsd   xmm1, xmm0
        addsd   xmm2, xmm0
        addsd   xmm1, xmm0
        ja      .L1034
.L1033:
...
$ g++-4.4.1 -S -fverbose-asm -masm=intel -m32 -Wall -O3 -funroll-loops fpu.cpp
$ cat fpu.s
...
# not identical to the simple for loop with two variables:
# here the compiler makes use of fxch, which costs no additional run time
# on most Pentium architectures, but degrades performance considerably
# on the Atom architecture, where fxch costs 1 cpu cycle
.L1003:
        fadd    st, st(1)
        fxch    st(3)
        add     eax, 8
        cmp     ebx, eax
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        fxch    st(3)
        fadd    st, st(1)
        ja      .L1046          # conditional jump
...
Sample output of the programme generated from fpu.cpp; run it with $ ./fpu <cpu MHz> <mill ops>. See the source code for a description of each individual test.
 user@pentium_mmx:~ $ ./fpu 200 100	# cpu clock is 200MHz, do 100 mill ops
 ...
 calculating: x<k> + a + a + ... + a - (x<k>+n*a), with a=1.521700
 addx87: 1.507s, res=-1.206e-04,  66.349 Mflops, 0.332 flops/cycle
 add1:   1.632s, res=-1.207e-04,  61.268 Mflops, 0.306 flops/cycle
 add2:   0.848s, res= 7.831e-06,   0.118 Gflops, 0.590 flops/cycle
 add<1>: 1.570s, res=-1.207e-04,  63.709 Mflops, 0.319 flops/cycle
 add<2>: 0.847s, res= 7.816e-06,   0.118 Gflops, 0.590 flops/cycle
 add<3>: 0.586s, res= 1.878e-05,   0.171 Gflops, 0.853 flops/cycle
 add<4>: 0.596s, res=-1.319e-05,   0.168 Gflops, 0.839 flops/cycle
 add<5>: 0.552s, res=-2.651e-05,   0.181 Gflops, 0.905 flops/cycle
 ...

 user@nehalem:~ $ ./fpu 2667 100	# cpu clock is 2.67GHz, do 100 mill ops
 ...
 calculating: x<k> + a + a + ... + a - (x<k>+n*a), with a=1.521700
 addx87: 0.118s, res=-1.206e-04,   0.845 Gflops, 0.317 flops/cycle
 add1:   0.115s, res=-2.540e-01,   0.871 Gflops, 0.327 flops/cycle
 add2:   0.057s, res= 1.175e-02,   1.743 Gflops, 0.654 flops/cycle
 add<1>: 0.114s, res=-2.540e-01,   0.874 Gflops, 0.328 flops/cycle
 add<2>: 0.057s, res= 1.175e-02,   1.748 Gflops, 0.656 flops/cycle
 add<3>: 0.038s, res= 3.682e-02,   2.622 Gflops, 0.983 flops/cycle
 add<4>: 0.038s, res=-2.601e-02,   2.623 Gflops, 0.984 flops/cycle
 add<5>: 0.038s, res=-5.064e-02,   2.627 Gflops, 0.985 flops/cycle
 ...
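The reported numbers follow directly from the measured wall-clock time: with <mill ops> million additions done in t seconds, Mflops = <mill ops>/t and flops/cycle = Mflops/<cpu MHz> (e.g. 100/1.507 = 66.3 Mflops and 66.3/200 = 0.332 flops/cycle for addx87 on the Pentium MMX). A minimal driver along these lines, assuming the add<Num> template from above is in scope, could look as follows; this is only a sketch, the actual fpu.cpp measures and reports many more tests:
#include <cstdio>
#include <cstdlib>
#include <cstddef>
#include <sys/time.h>

// wall-clock time in seconds (helper used only for this sketch)
static double wtime() {
   timeval tv;
   gettimeofday(&tv, 0);
   return tv.tv_sec + 1e-6*tv.tv_usec;
}

int main(int argc, char** argv) {
   if(argc < 3) { fprintf(stderr, "usage: %s <cpu MHz> <mill ops>\n", argv[0]); return 1; }
   const double mhz = atof(argv[1]);
   const size_t ops = size_t(atof(argv[2])*1e6);   // total number of additions
   const double a   = 1.5217;

   const double t0  = wtime();
   const double res = add<2>(a, ops);              // add<Num> as defined above
   const double t   = wtime() - t0;

   const double mflops = ops/t/1e6;                // million additions per second
   // fpu.cpp itself reports res as the accumulated roundoff x + a + ... + a - (x + n*a)
   printf("add<2>: %.3fs, %7.3f Mflops, %.3f flops/cycle (res=%g)\n",
          t, mflops, mflops/mhz, res);
   return 0;
}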

linear algebra benchmark: linalg.tar.bz2

For a more applied test, which also makes extensive use of main memory, we solve a system of linear equations. This is the same kind of benchmark as the Linpack benchmark used for the TOP500 supercomputer list. These tests require optimised versions of the BLAS routines, such as ATLAS or GotoBLAS2. A more elegant interface to linear algebra routines is provided by Eigen, a header-only C++ implementation. All three libraries are required to compile the benchmark below.
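The core operation of such a test, sketched here with Eigen only (the problem size n and the use of random matrices are illustrative choices, not necessarily what the benchmark in linalg.tar.bz2 does):
#include <cstdio>
#include <Eigen/Dense>

int main() {
   const int n = 2000;                              // illustrative problem size
   Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
   Eigen::VectorXd b = Eigen::VectorXd::Random(n);

   // solve A x = b by LU decomposition with partial pivoting;
   // this is the ~2/3 n^3 flop operation the Linpack benchmark times
   Eigen::VectorXd x = A.partialPivLu().solve(b);

   // relative residual as a sanity check on the solution
   const double err = (A*x - b).norm() / b.norm();
   printf("n=%d, relative residual = %g\n", n, err);
   return 0;
}
The Gflops figure then follows from dividing the roughly 2/3 n^3 + 2 n^2 floating point operations of the solve by the measured time.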