Sunday, January 31, 2010

convolutional neural network adventures

One of my past projects was a handwritten math equation editor.
For it I needed a handwriting recognition library that could recognize a wide range of mathematical symbols. I could not find an existing recognizer like this, so the solution was to write my own. I quickly wrote a DTW algorithm (a Viterbi-style matcher) using a couple of samples of my own writing as templates. For the immediate purpose of a demonstration it was enough, but my interest was piqued and I spent a long time looking at various algorithms and their strengths and weaknesses.
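
For the curious, the heart of such a matcher is a short dynamic program. Below is a minimal DTW sketch in C - an illustration, not the code from that project, and the point representation is assumed. The input symbol is compared against every template, and the template with the smallest DTW distance wins.

#include <math.h>
#include <float.h>

/* A point on a pen stroke. */
typedef struct { float x, y; } point;

static float dist(point a, point b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    return sqrtf(dx * dx + dy * dy);
}

/* Classic O(n*m) dynamic-programming DTW distance between two strokes.
   'cost' must provide (n + 1) * (m + 1) floats of scratch space. */
float dtw_distance(const point *a, int n, const point *b, int m, float *cost)
{
    int w = m + 1;
    cost[0] = 0.0f;
    for (int j = 1; j <= m; j++) cost[j] = FLT_MAX;
    for (int i = 1; i <= n; i++) {
        cost[i * w] = FLT_MAX;
        for (int j = 1; j <= m; j++) {
            float best = cost[(i - 1) * w + j];               /* insertion */
            if (cost[i * w + j - 1] < best)
                best = cost[i * w + j - 1];                   /* deletion  */
            if (cost[(i - 1) * w + j - 1] < best)
                best = cost[(i - 1) * w + j - 1];             /* match     */
            cost[i * w + j] = dist(a[i - 1], b[j - 1]) + best;
        }
    }
    return cost[n * w + m];
}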

After a time I decided to investigate Convolutional Neural Networks (CNNs). They are very roughly inspired by the way animal vision works. The authority on the subject is Yann LeCun, who introduced them in the '90s. Some years ago, Patrice Simard from Microsoft worked on some newer implementations. I could not find the actual code for any of their articles (a situation encountered much too often in computer science). As luck would have it, there is a nice article on CodeProject by Mike O'Neill about his implementation of a CNN, and the best part is that in this case the code is available. The problem with O'Neill's code is that it is quite slow; it is not written for speed, and it makes heavy use of pointers to describe the CNN structure.
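
For reference, the forward pass of a convolutional layer boils down to a plain 2D convolution. A minimal sketch in C over flat arrays (my illustration, not O'Neill's code) shows why a contiguous layout is much friendlier to the cache and to SIMD than a pointer-per-connection structure:

/* Minimal "valid" 2D convolution, the core operation of a CNN layer.
   out has size (h - k + 1) x (w - k + 1). A real layer also adds a
   bias, applies a nonlinearity such as tanh, and sums over several
   input feature maps. */
void convolve2d(const float *in, int h, int w,
                const float *kernel, int k, float *out)
{
    int oh = h - k + 1, ow = w - k + 1;
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            float acc = 0.0f;
            for (int ky = 0; ky < k; ky++)
                for (int kx = 0; kx < k; kx++)
                    acc += in[(y + ky) * w + (x + kx)]
                         * kernel[ky * k + kx];
            out[y * ow + x] = acc;
        }
}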
I quickly wrote a prototype in C# that was quite simple and fast. Alas, it was not fast enough for my impatience; to properly train a CNN, at least 10-20 epochs are needed, and my prototype was taking 5 minutes per epoch - a lot, given that I needed to run many experiments to find and correct bugs.
That was when I discovered how difficult it is to debug machine learning programs. If a normal program has a bug, its output is completely fucked up and you know that something is wrong. If a machine learning program has an error, the algorithm will learn around it and still spit out a reasonable answer. The difference is that the capability of the program is reduced and the error rate is higher - the algorithm cannot learn as well.
Fortunately, there are a couple of techniques that can be applied to test the correctness of an NN implementation. Unfortunately, I was too lazy to implement them.
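
The best known of these techniques is gradient checking: compare each gradient produced by backpropagation against a central finite difference of the loss. A sketch of the idea, with a hypothetical interface (this is not part of fastCNN):

#include <math.h>

/* Perturb one weight, re-evaluate the loss, and compare the numerical
   gradient with the analytic one from backpropagation. A relative
   error much above ~1e-4 (in double precision) usually means a bug. */
double gradient_check(double (*loss)(void *net), void *net,
                      double *w, double analytic)
{
    const double eps = 1e-5;
    double saved = *w, lp, lm, numeric;
    *w = saved + eps; lp = loss(net);
    *w = saved - eps; lm = loss(net);
    *w = saved;                               /* restore the weight */
    numeric = (lp - lm) / (2.0 * eps);
    return fabs(numeric - analytic)
         / (fabs(numeric) + fabs(analytic) + 1e-12);
}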

After thinking a little about how to optimize the initial C# version, I proceeded to rewrite the library in C, with the intention of adding some SSE fragments later (which I eventually did).
After I played with it a little (quite a pleasant activity), fixed some small bugs, made some measurements, and optimized here and there (cache behavior was a big problem), I was satisfied with the results. By then I no longer had enough free time because of a new job, so I stashed the project for later.

Time passed and I started thinking about putting some of my projects on the net, and the CNN project looked like a good candidate.
Once again I corrected some bugs, then cleaned up the code and the trainer interface, wrote some documentation, and added the project to CodePlex. I called the project fastCNN.
For the interested parties, the whole project (source + binaries + documentation) is at http://fastcnn.codeplex.com/.
During development, I used the MNIST dataset (handwritten digits) to evaluate the network's performance and to guess at the presence or absence of bugs.
For my other projects I started gathering data for handwritten mathematical symbols.
Another application that I thought about is OCR. I was planning to work on an OCR application for pictures of text taken with digital cameras, but never found the time and energy. CNNs are a great match for this kind of application.

Yann LeCun publications

Saturday, January 2, 2010

sse benchmarks

At the end of 2008 I published a small project on sourceforge.
The project is older than that; two years ago I needed to optimize some operations on vectors of floats, so I wrote a simple program to measure the performance of plain C and SSE assembler. I wrote my own program because, to my surprise, I could not easily find measurements of the kind of performance one can get from SSE in simple scenarios.
The results were interesting, so I decided to publish the project. Unfortunately, I am quite slow sometimes, so this took longer than I thought.
I chose SourceForge for the project, but this may have been a mistake; apparently I can't add a documentation page to the project - it needs to be approved first.
Anyway, I will write here about my results.
The project can be found at http://sourceforge.net/projects/vector-ops-sse/

vector_ops_sse

These are some benchmarks for basic vector operations: sum, multiplication, division, dot product. The purpose is to compare the performance of plain C, SSE assembler, and SSE intrinsics.
The conclusion is that in most cases it is a good idea to use SSE intrinsics to rewrite a math-intensive kernel. You can get a reasonable speedup (let's say 2x) with reasonable effort. Most of the time there is no need to use SSE assembler directly.
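
To give a flavor of what such a rewrite looks like, here is a sketch of the multiplication kernel using SSE intrinsics (in the spirit of the project code, not a verbatim copy):

#include <xmmintrin.h>  /* SSE intrinsics */

/* a = b * c for vectors aligned to 16 bytes, with n a multiple of 4.
   Each iteration processes four floats at once. */
void vec_mul_sse(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_load_ps(b + i);
        __m128 vc = _mm_load_ps(c + i);
        _mm_store_ps(a + i, _mm_mul_ps(vb, vc));
    }
}

The compiler handles register allocation and instruction scheduling, which is why intrinsics usually get close to hand-written assembler with a fraction of the effort.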
The code is quite simple, and there are also some additional “experiments” (some matrix operations) that are a little more complicated.
Each vector operation is repeated 10 times and the best running time is reported.
We are interested in the number of cycles needed for each operation, so the result does not depend on the processor frequency (the running time of course does, since a faster processor has shorter cycles). This makes it easy to compare different processor architectures. I have included the results for a Pentium M, a Core 2 Duo, and an AMD Turion X2. The test runs on a single core.
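
The cycle counts come from the processor's time-stamp counter. A sketch of the usual technique (the project's timing code may differ in the details):

#include <stdint.h>

#ifdef _MSC_VER
#include <intrin.h>
static uint64_t cycles(void) { return (uint64_t)__rdtsc(); }
#else
static uint64_t cycles(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Run an operation several times and keep the minimum cycle count to
   filter out interrupts and other noise - the same "best of 10"
   policy the benchmark uses. */
uint64_t best_of(void (*op)(void), int repeats)
{
    uint64_t best = (uint64_t)-1;
    for (int r = 0; r < repeats; r++) {
        uint64_t t0 = cycles();
        op();
        uint64_t t1 = cycles();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}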
The project was compiled with maximum optimization enabled, so the C version is as fast as possible.
The command line looks like this:
/Ox /Ob2 /Oi /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_ATL_STATIC_REGISTRY" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /Gy /arch:SSE /fp:fast /Yu"stdafx.h" /Fo"Release\\" /Fd"Release\vc90.pdb" /W3 /nologo /c /Zi /TP /errorReport:prompt
The results:
Pentium M, 1.5 GHz
Repeat each operation 10 times.
Vector size is 1024.

vector - vector multiplication. a = b * c
C: 3533 cycles (3.5 cycles/multiplication)
intrinsics: 1340 cycles (1.3 cycles/multiplication)
asm: 1606 cycles (1.6 cycles/multiplication)

vector - vector multiplication in place. a = a * b
C: 3533 cycles (3.5 cycles/multiplication)
intrinsics: 1340 cycles (1.3 cycles/multiplication)
asm: 1606 cycles (1.6 cycles/multiplication)

vector - vector sum. a = b + c
intrinsics: 1340 cycles (1.3 cycles/sum)

vector - vector div. a = b / c
C: 17475 cycles (17 cycles/div)
intrinsics: 8772 cycles (8.6 cycles/div)
asm: 8796 cycles (8.6 cycles/div)

vector dot product. r = dot(a, b)
C: 4160 cycles (4.1 cycles/madd)
intrinsics: 1658 cycles (1.6 cycles/madd)
asm: 1251 cycles (1.2 cycles/madd)

Intel Core 2 Duo E4400, 2.0 GHz
Repeat each operation 10 times.
Vector size is 1024.

vector - vector multiplication. a = b * c
C: 2800 cycles (2.7 cycles/multiplication)
intrinsics: 600 cycles (0.59 cycles/multiplication)
asm: 860 cycles (0.84 cycles/multiplication)

vector - vector multiplication in place. a = a * b
C: 2800 cycles (2.7 cycles/multiplication)
intrinsics: 600 cycles (0.59 cycles/multiplication)
asm: 880 cycles (0.86 cycles/multiplication)

vector - vector sum. a = b + c
intrinsics: 600 cycles (0.59 cycles/sum)

vector - vector div. a = b / c
C: 17600 cycles (17 cycles/div)
intrinsics: 4470 cycles (4.4 cycles/div)
asm: 4500 cycles (4.4 cycles/div)

vector dot product. r = dot(a, b)
C: 2130 cycles (2.1 cycles/madd)
intrinsics: 1660 cycles (1.6 cycles/madd)
asm: 610 cycles (0.6 cycles/madd)

AMD Turion X2, 2.0 GHz
Repeat each operation 10 times.
Vector size is 1024.

vector - vector multiplication. a = b * c
C: 4015 cycles (3.9 cycles/multiplication)
intrinsics: 1675 cycles (1.6 cycles/multiplication)
asm: 1690 cycles (1.7 cycles/multiplication)

vector - vector multiplication in place. a = a * b
C: 4020 cycles (3.9 cycles/multiplication)
intrinsics: 1675 cycles (1.6 cycles/multiplication)
asm: 1690 cycles (1.7 cycles/multiplication)

vector - vector sum. a = b + c
intrinsics: 1675 cycles (1.6 cycles/sum)

vector - vector div. a = b / c
C: 13745 cycles (13 cycles/div)
intrinsics: 7725 cycles (7.5 cycles/div)
asm: 7745 cycles (7.6 cycles/div)

vector dot product. r = dot(a, b)
C: 2660 cycles (2.6 cycles/madd)
intrinsics: 1925 cycles (1.9 cycles/madd)
asm: 1210 cycles (1.2 cycles/madd)

Some basic observations:
The AMD Turion results are very similar to the Intel Pentium M results.
On SSE operations, the Core 2 is almost twice as fast as the others at the same frequency.
The SSE intrinsics/asm versions of the operations are 2-4 times faster than the C version.
These functions can be used as the basis for a vectorized math library. The Intel MKL is very fast but not free. There is also a free library hidden somewhere on the Intel site - the "Intel Approximate Math Library" (or something like that) - that provides further examples of SSE operations on vectors: transcendental functions, square roots, etc.

Limitation: the intrinsics and assembler versions of the vector operations work only when the vectors are aligned to 16 bytes and the size is a multiple of 4 floats. This is needed to ensure that the aligned versions of the SSE instructions can be used.
An example of a general-purpose function that works with any input data is provided - vector_multiply - which uses the intrinsics version when possible and falls back to the C version otherwise.
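
As an illustration of that pattern, such a wrapper might check the preconditions and dispatch accordingly - here reusing the vec_mul_sse sketch from above; the names are illustrative and the actual vector_multiply may differ:

#include <stdint.h>

/* Use the aligned SSE kernel when the preconditions hold, otherwise
   fall back to plain C. */
void vector_multiply_any(float *a, const float *b, const float *c, int n)
{
    int aligned = (((uintptr_t)a | (uintptr_t)b | (uintptr_t)c) & 15) == 0;
    if (aligned && (n & 3) == 0) {
        vec_mul_sse(a, b, c, n);          /* fast path: 4 floats/step */
    } else {
        for (int i = 0; i < n; i++)       /* portable fallback */
            a[i] = b[i] * c[i];
    }
}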