Saturday, January 2, 2010

sse benchmarks

At the end of 2008 I published a small project on sourceforge.
The project is older than that; two years ago I needed to optimize some operations on vectors of floats, so I wrote a simple program to measure the performance of plain C versus SSE assembler. I wrote my own program because, to my surprise, I could not easily find measurements of the kind of performance one can get from SSE in simple scenarios.
The results were interesting, so I decided to publish the project. Unfortunately I am quite slow sometimes, so this took longer than I thought.
I chose SourceForge for the project, but this may have been a mistake; apparently I can't add a documentation page to the project - it needs to be approved first.
Anyway, I will write here about my results.
The project can be found at http://sourceforge.net/projects/vector-ops-sse/

vector_ops_sse

These are some benchmarks for basic vector operations: sum, multiplication, division, dot product. The purpose is to compare the performance of C, SSE assembler, and SSE intrinsics.
The conclusion is that in most cases it is a good idea to use SSE intrinsics to rewrite a math-intensive kernel. You can get a reasonable speedup (let's say 2x) with reasonable effort. Most of the time, there is no need to use SSE assembler directly.
The code is quite simple, and there are also some additional “experiments” (some matrix operations) that are a little more complicated.
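To give an idea of what is being compared, here is a minimal sketch of a multiplication kernel in plain C and with SSE intrinsics. The function names are mine, for illustration; they are not the project's actual code.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Plain C version: one float at a time; the compiler may or may
   not vectorize this on its own. */
static void mul_c(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* SSE intrinsics version: 4 floats per iteration.
   Requires 16-byte aligned pointers and n a multiple of 4. */
static void mul_sse(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_load_ps(b + i);
        __m128 vc = _mm_load_ps(c + i);
        _mm_store_ps(a + i, _mm_mul_ps(vb, vc));
    }
}
```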
Each vector operation is repeated 10 times and the best running time is displayed.
We are interested in the number of cycles needed for each operation, so the result does not depend on the processor frequency (though the running time does, of course, since a faster processor has shorter cycles). This makes it easy to compare different processor architectures. I have included the results for a Pentium M processor, a Core 2 Duo processor, and an AMD Turion X2 processor. The test runs on one core only.
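As a sketch of that methodology (my illustration, not the project's exact harness): measure each run with the x86 time-stamp counter and keep the best of N.

```c
#include <x86intrin.h>  /* __rdtsc on GCC/Clang; MSVC has it in <intrin.h> */

static volatile float sink;  /* keeps the compiler from deleting the kernel */

/* A stand-in for one of the vector operations being measured. */
static void dummy_kernel(void)
{
    float s = 0.0f;
    for (int i = 0; i < 1000; i++)
        s += (float)i;
    sink = s;
}

/* Run the kernel several times and keep the smallest cycle count;
   the minimum filters out interrupts and cache warm-up effects. */
static unsigned long long best_cycles(void (*kernel)(void), int repeats)
{
    unsigned long long best = ~0ULL;
    for (int r = 0; r < repeats; r++) {
        unsigned long long t0 = __rdtsc();
        kernel();
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```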
The project was compiled with maximum optimization enabled so the C version is as fast as possible.
The command line looks like this:
/Ox /Ob2 /Oi /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_ATL_STATIC_REGISTRY" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /Gy /arch:SSE /fp:fast /Yu"stdafx.h" /Fo"Release\\" /Fd"Release\vc90.pdb" /W3 /nologo /c /Zi /TP /errorReport:prompt
The results:
Pentium M, 1.5 GHz
Repeat each operation 10 times.
Vector size is 1024.

vector - vector multiplication. a = b * c
C: 3533 cycles (3.5 cycles/multiplication)
intrinsics: 1340 cycles (1.3 cycles/multiplication)
asm: 1606 cycles (1.6 cycles/multiplication)

vector - vector multiplication in place. a = a * b
C: 3533 cycles (3.5 cycles/multiplication)
intrinsics: 1340 cycles (1.3 cycles/multiplication)
asm: 1606 cycles (1.6 cycles/multiplication)

vector - vector sum. a = b + c
intrinsics: 1340 cycles (1.3 cycles/sum)

vector - vector div. a = b / c
C: 17475 cycles (17 cycles/div)
intrinsics: 8772 cycles (8.6 cycles/div)
asm: 8796 cycles (8.6 cycles/div)

vector dot product. r = dot(a, b)
C: 4160 cycles (4.1 cycles/madd)
intrinsics: 1658 cycles (1.6 cycles/madd)
asm: 1251 cycles (1.2 cycles/madd)
 
Intel Core 2 Duo, model E4400, 2.0 GHz
Repeat each operation 10 times.
Vector size is 1024.

vector - vector multiplication. a = b * c
C: 2800 cycles (2.7 cycles/multiplication)
intrinsics: 600 cycles (0.59 cycles/multiplication)
asm: 860 cycles (0.84 cycles/multiplication)

vector - vector multiplication in place. a = a * b
C: 2800 cycles (2.7 cycles/multiplication)
intrinsics: 600 cycles (0.59 cycles/multiplication)
asm: 880 cycles (0.86 cycles/multiplication)

vector - vector sum. a = b + c
intrinsics: 600 cycles (0.59 cycles/sum)

vector - vector div. a = b / c
C: 17600 cycles (17 cycles/div)
intrinsics: 4470 cycles (4.4 cycles/div)
asm: 4500 cycles (4.4 cycles/div)

vector dot product. r = dot(a, b)
C: 2130 cycles (2.1 cycles/madd)
intrinsics: 1660 cycles (1.6 cycles/madd)
asm: 610 cycles (0.6 cycles/madd)

AMD Turion X2, 2 GHz
Repeat each operation 10 times.
Vector size is 1024.

vector - vector multiplication. a = b * c
C: 4015 cycles (3.9 cycles/multiplication)
intrinsics: 1675 cycles (1.6 cycles/multiplication)
asm: 1690 cycles (1.7 cycles/multiplication)

vector - vector multiplication in place. a = a * b
C: 4020 cycles (3.9 cycles/multiplication)
intrinsics: 1675 cycles (1.6 cycles/multiplication)
asm: 1690 cycles (1.7 cycles/multiplication)

vector - vector sum. a = b + c
intrinsics: 1675 cycles (1.6 cycles/sum)

vector - vector div. a = b / c
C: 13745 cycles (13 cycles/div)
intrinsics: 7725 cycles (7.5 cycles/div)
asm: 7745 cycles (7.6 cycles/div)

vector dot product. r = dot(a, b)
C: 2660 cycles (2.6 cycles/madd)
intrinsics: 1925 cycles (1.9 cycles/madd)
asm: 1210 cycles (1.2 cycles/madd)

Some basic observations:
The AMD Turion results are very similar to the Intel Pentium M results.
At the same frequency, the Core 2 is almost twice as fast as the others at SSE operations.
The SSE intrinsics/asm versions of the operations are 2-4 times faster than the C version.
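For reference, the dot product kernel (the source of the madd numbers above) can be written with intrinsics roughly like this. This is an illustrative sketch, not the project's exact code.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* SSE dot product: 4 multiply-adds per iteration into a packed
   accumulator, then a horizontal sum at the end.
   Requires 16-byte aligned pointers and n a multiple of 4. */
static float dot_sse(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                         _mm_load_ps(b + i)));
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```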
These functions can be used as the basis for a vectorized math library. The Intel MKL is very fast but is not free. There is a free library hidden somewhere on the Intel site - the "Intel Approximate Math Library" (or something like that) - that can provide other examples of SSE operations on vectors: transcendental functions, square roots, etc.
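As an illustration of the kind of vector primitive such a library provides, the SSE square-root instruction is exposed directly as an intrinsic. A minimal sketch (the function name here is mine):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Square root of 4 packed floats per iteration using the sqrtps
   instruction. Requires 16-byte aligned pointers, n a multiple of 4. */
static void sqrt_sse(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 4)
        _mm_store_ps(dst + i, _mm_sqrt_ps(_mm_load_ps(src + i)));
}
```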

Limitation: the intrinsic and assembler versions of the vector operations work only when the vectors are aligned to 16 bytes (4 floats) and the size is a multiple of 4. This is needed so that the aligned versions of the SSE instructions can be used.
An example of a general-purpose function that works with any input data is provided - vector_multiply - which uses the intrinsic version when possible and falls back to the C version otherwise.
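The idea behind such a function can be sketched like this (my own names and layout, not necessarily the project's exact code): check alignment and size at runtime, then pick the path.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* True when p is aligned to a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* General-purpose multiply: takes the SSE path only when all three
   pointers are 16-byte aligned and n is a multiple of 4; otherwise
   falls back to the plain C loop. */
static void vector_multiply(float *a, const float *b, const float *c, int n)
{
    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(c)
        && (n & 3) == 0) {
        for (int i = 0; i < n; i += 4)
            _mm_store_ps(a + i, _mm_mul_ps(_mm_load_ps(b + i),
                                           _mm_load_ps(c + i)));
    } else {
        for (int i = 0; i < n; i++)
            a[i] = b[i] * c[i];
    }
}
```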
