Seth Price
2007-03-04 22:06:45 UTC
I haven't had time to work on it in the last week, but I still
intend to. I thought I'd update people with what I've been
thinking. First, I would like to create a test suite I can use for
regression testing. I've added a method for setting the seed, so in
the future people can run regression tests without having to define
FANN_NO_SEED. That way you can test the same library you've compiled
for use, which makes sense to me. I'm still working on this part.
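
To illustrate why a settable seed makes regression tests repeatable, here is a minimal sketch. The function names are hypothetical (this message does not give the name or signature of the new seed method), and it simply wraps the standard srand()/rand() rather than FANN's own initialisation code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the seed-setting method mentioned above;
 * the real name and signature are not given in this message. */
void example_set_seed(unsigned int seed)
{
    srand(seed);
}

/* Fill `weights` with pseudo-random values in [-0.1, 0.1], roughly the
 * way a network might randomise its initial weights. */
void example_randomize_weights(float *weights, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        weights[i] = ((float)rand() / (float)RAND_MAX) * 0.2f - 0.1f;
    }
}

/* Return 1 if two runs started from the same seed produce identical
 * weights, 0 otherwise (including on allocation failure). */
int example_runs_match(unsigned int seed, size_t count)
{
    float *a;
    float *b;
    int same = 1;
    size_t i;

    assert(count > 0);
    a = malloc(count * sizeof *a);
    b = malloc(count * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return 0;
    }

    example_set_seed(seed);
    example_randomize_weights(a, count);
    example_set_seed(seed);
    example_randomize_weights(b, count);

    for (i = 0; i < count; i++) {
        if (a[i] != b[i]) {
            same = 0;
        }
    }
    free(a);
    free(b);
    return same;
}
```

A regression test could call example_runs_match() once per seed and then compare the resulting network output against a stored reference.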
Re: vectorizing while maintaining code readability (can I interest
you in teraflops?):
The most difficult part of writing vector code is creating the proper
memory structures to operate on. In the case of AltiVec, you can (and
I have) automatically vectorize code. But if you look at the result,
the processor will often spend much of its time translating to and
from the correct memory structures.
A quick example: I vectorized some FORTRAN code and got a 1.5x speed
improvement on AltiVec. Not terrible, but it should be better. So I
looked at the code that was generated by the auto-vectorizer. The
vector processor spent roughly *half* its time realigning data from
memory that wasn't 16-byte aligned. The problem gets worse when you
attempt to operate on a vector of values (like neuron sums,
activations, or weights) that are stored in structures with unused
data surrounding each value.
The conclusion here is that the most important part of creating
vector code is creating the correct memory structure.
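
As a rough sketch of what "the correct memory structure" could mean here, the following contrasts an array-of-structs neuron with a struct-of-arrays layer whose sums are contiguous and 16-byte aligned. All names and fields are illustrative assumptions, not FANN's actual structs, and it assumes a C11 library that provides aligned_alloc() and a 4-byte float.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Array-of-structs layout: each sum is surrounded by other fields, so
 * four consecutive sums are not contiguous in memory. */
struct example_neuron_aos {
    unsigned int first_con;
    unsigned int last_con;
    float sum;
    float value;
};

/* Struct-of-arrays layout: all sums sit next to each other, so a
 * vector unit can load four of them with one aligned load. */
struct example_layer_soa {
    size_t count;   /* neurons actually in use */
    size_t padded;  /* count rounded up to a multiple of 4 */
    float *sums;
    float *values;
};

/* Round n up to a multiple of 4 floats (one 16-byte vector). */
size_t example_pad4(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/* Allocate zeroed, 16-byte aligned arrays. Returns 0 on success. */
int example_layer_soa_init(struct example_layer_soa *layer, size_t count)
{
    size_t bytes;

    assert(layer != NULL && count > 0);
    layer->count = count;
    layer->padded = example_pad4(count);
    bytes = layer->padded * sizeof(float);

    /* aligned_alloc() needs a size that is a multiple of the alignment,
     * which the padding above guarantees for 4-byte floats. */
    layer->sums = aligned_alloc(16, bytes);
    layer->values = aligned_alloc(16, bytes);
    if (layer->sums == NULL || layer->values == NULL) {
        free(layer->sums);
        free(layer->values);
        layer->sums = NULL;
        layer->values = NULL;
        return -1;
    }
    memset(layer->sums, 0, bytes);
    memset(layer->values, 0, bytes);
    return 0;
}

void example_layer_soa_free(struct example_layer_soa *layer)
{
    free(layer->sums);
    free(layer->values);
    layer->sums = NULL;
    layer->values = NULL;
}
```

Padding to a multiple of four means the vector loop never needs a scalar tail, at the cost of a few unused floats per layer.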
I've decided the best way to do this is to create macros for loading
and storing weights and other neuron values. The macros used can be
chosen at compile time along with the processing method used.
Something like these:
fann_neuron_weight_store()
fann_neuron_weight_load()
fann_neuron_sum_store()
fann_neuron_sum_load()
...
They will be customized to use data structures that are optimized for
fann_run(), so the utility functions that operate on the nets will be
a bit slower, but fann_run() will be more adjustable. You've probably
heard the 90%/10% rule, where most programs spend 90% of their time
in 10% of the code. These macros would be used in the code where the
processor spends the least amount of time (the other 90% of the code),
which keeps that code maintainable. This will allow people to create
custom, fast solutions for fann_run().
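
A minimal sketch of how compile-time selection of such accessor macros might look. The EXAMPLE_ names, the EXAMPLE_LAYOUT_AOS switch, and the argument lists are all assumptions made for illustration, since the macros above are only listed by name.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_MAX_NEURONS 8

#ifdef EXAMPLE_LAYOUT_AOS
/* Array of structs: one record per neuron. */
struct example_net {
    struct { float sum; float value; } neurons[EXAMPLE_MAX_NEURONS];
};
#define EXAMPLE_NEURON_SUM_LOAD(net, i)     ((net)->neurons[(i)].sum)
#define EXAMPLE_NEURON_SUM_STORE(net, i, v) ((net)->neurons[(i)].sum = (v))
#else
/* Struct of arrays (the default in this sketch): sums are contiguous. */
struct example_net {
    float sums[EXAMPLE_MAX_NEURONS];
    float values[EXAMPLE_MAX_NEURONS];
};
#define EXAMPLE_NEURON_SUM_LOAD(net, i)     ((net)->sums[(i)])
#define EXAMPLE_NEURON_SUM_STORE(net, i, v) ((net)->sums[(i)] = (v))
#endif

/* Utility code written only against the macros, so it compiles
 * unchanged with either layout. */
float example_total_sum(const struct example_net *net, size_t count)
{
    float total = 0.0f;
    size_t i;

    assert(count <= EXAMPLE_MAX_NEURONS);
    for (i = 0; i < count; i++) {
        total += EXAMPLE_NEURON_SUM_LOAD(net, i);
    }
    return total;
}

void example_clear_sums(struct example_net *net, size_t count)
{
    size_t i;

    assert(count <= EXAMPLE_MAX_NEURONS);
    for (i = 0; i < count; i++) {
        EXAMPLE_NEURON_SUM_STORE(net, i, 0.0f);
    }
}
```

Building with -DEXAMPLE_LAYOUT_AOS switches the layout without touching the utility functions, which is the property the macros are meant to provide.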
Therefore, someone can create data structures that could be used with
scalar, AltiVec, SSE, and even GPU processors. The possibility of
using a GPU is particularly interesting, because it opens FANN up to
operating on the scale of *teraflops* with consumer hardware. See
also this recent Slashdot story and associated links:
http://hardware.slashdot.org/article.pl?sid=07/03/01/1519254
I've been doing a bunch of reading, and I believe that this is quite
doable once these macros are in place (see also:
http://www.gpgpu.org/ ). I would think that the ability to easily use
massively parallel systems will open up new areas of ANN usage. (All
this would probably also work with the Cell processor, but I'm not
sure if that would be useful to users of libfann.)
To give people an idea of the speedups I'm talking about:
- Scalar
I expect this to run about the same speed as the original code.
Packing neurons into a struct of arrays instead of an array of
structs might speed the code up because of better cache usage. But
using macros for neuron access in 90% of the code will slow things
down a tad.
- AltiVec
At the least, it can operate on four floats at the same time, which
yields up to a ~4x speed improvement. Lucky for PowerPC users, AltiVec
also has hardware inverse, inverse exp(), inverse sqrt(), and some
other useful functions. A patch for FANN v1.2 is optimized for
AltiVec, and it was "between 5 and 20 (36 in one case!) times as
fast". That's a 5x-20x speedup in real-world tests.
- SSE
SSE is similar to AltiVec, but is missing some hardware features
(like the inverse exp()). I would guess that SSE would end up closer
to a ~4x speedup.
- GPU
This is where things really get interesting, and more so in the near
future. Most modern GPUs have a texture processor that can do
calculations using floating point numbers. They operate on vectors of
4 floating point values, and they can operate on many vectors
simultaneously (8-32 pipelines?). As with all vector processors,
their greatest speedup will occur with many interconnects between
layers.
Imagine you want to use a 32x32 pixel display as an input to your
net. That's 1024 input neurons. Two fully connected 1024-neuron
layers will have about 1.05 million interconnects (1024 x 1024 =
1,048,576 weights). As currently programmed, it would take about 1.05
million passes to calculate the resulting sums. A GPU would treat the
weights as a 1024x256 pixel texture (four floats per pixel) and the
neuron sums as a 256x1 pixel texture. It could operate on 128 weights
at a time in floating point (32 pipelines x 4 floats), resulting in a
128x speedup (theoretically).
The same number of interconnects is used if the input is 64x64
pixels and your hidden layer is 256 neurons. Or the input is ~74x74
RGB pixels and the hidden layer is 64 neurons. By my calculations
(see Appendix A) that's enough processing power to simulate a fruit
fly brain in realtime. ;)
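
The layer and texture arithmetic above can be checked with a few helper functions. This is only a sketch: the function names are made up for illustration, bias neurons are ignored, and it assumes four floats per RGBA texel and 128 weights processed per pass.

```c
#include <assert.h>

/* Weights between two fully connected layers (bias neurons ignored). */
unsigned long example_connections(unsigned long inputs, unsigned long outputs)
{
    return inputs * outputs;
}

/* RGBA texels needed to hold `floats` values at 4 floats per texel. */
unsigned long example_texels(unsigned long floats)
{
    return (floats + 3ul) / 4ul;
}

/* Passes needed when `lanes` weights are processed per pass. */
unsigned long example_passes(unsigned long weights, unsigned long lanes)
{
    assert(lanes > 0);
    return (weights + lanes - 1ul) / lanes;
}
```

With these, example_connections(32 * 32, 1024) gives 1,048,576 weights, which fit in 262,144 texels (a 1024x256 texture) and take 8,192 passes at 128 weights per pass. The 64x64 input with 256 hidden neurons gives the same 1,048,576, and the 74x74 RGB input with 64 hidden neurons gives 1,051,392.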
Thoughts? Misunderstandings? Miscalculations? If I worked on getting
the scalar macros working, is there a graphics programmer out there
who would get it running on GPUs (OpenGL)? Papers on working with
massive ANNs?
Apologies if I'm confused on the GPU details. IANAGP
(I Am Not A GPU Programmer)
~Seth
Appendix A:
Fruit fly brain calculations
100,000 neurons total
* 1024 connections per neuron
* 100 operations per connection calculation
* 100 neuron firing events per second
~= 1 teraflops
http://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons
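
For anyone who wants to rerun the Appendix A estimate with different guesses, here is the same multiplication as a small function. The function and parameter names are hypothetical; the four factors are the ones listed above.

```c
#include <assert.h>

/* Operations per second = neurons * connections per neuron
 * * operations per connection * firing events per second. */
unsigned long long example_ops_per_second(unsigned long long neurons,
                                          unsigned long long connections,
                                          unsigned long long ops_per_connection,
                                          unsigned long long firings_per_second)
{
    /* A zero factor would mean a mistyped estimate, not a real brain. */
    assert(neurons > 0 && connections > 0);
    assert(ops_per_connection > 0 && firings_per_second > 0);
    return neurons * connections * ops_per_connection * firings_per_second;
}
```

Calling example_ops_per_second(100000, 1024, 100, 100) returns 1,024,000,000,000, or about 1.02 teraflops.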