Vectorizing FANN

Seth Price

2007-03-04 22:06:45 UTC

I haven't had the time to work on it in the last week, but I still
intend on doing so. I thought I'd update people with what I've been
thinking. First, I would like to create a test suite I can use for
regression testing. I've added a method for setting the seed, so in
the future people can run regression tests without having to define
FANN_NO_SEED. You can test the same library you've compiled for use.
Makes sense to me. I'm still working on this part.
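
A minimal sketch of why an explicit seed makes regression tests repeatable. The names demo_init_weights() and demo_seed_is_reproducible() are hypothetical and not part of FANN; the sketch assumes weights are drawn from the C library generator, which may differ from what the library actually does.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not part of FANN: seed the C library generator
 * explicitly, then fill n weights with values in [-0.1, 0.1]. */
void demo_init_weights(float *w, size_t n, unsigned int seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        w[i] = ((float)rand() / (float)RAND_MAX) * 0.2f - 0.1f;
}

/* Returns 1 when two initializations with the same seed produce
 * identical weights, which is what a regression test relies on. */
int demo_seed_is_reproducible(unsigned int seed)
{
    float a[16], b[16];

    demo_init_weights(a, 16, seed);
    demo_init_weights(b, 16, seed);
    for (size_t i = 0; i < 16; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

With the seed fixed, a test can train or run a net and compare the outputs against stored reference values using the same compiled library that ships to users.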

Re: vectorizing while maintaining code readability (can I interest
you in teraflops?):

The most difficult part of writing vector code is creating the proper
memory structures to operate on. In the case of AltiVec, you can (and
I have) automatically vectorize code. But if you look at the result,
the processor will often spend much of its time translating to and
from the correct memory structures.

A quick example: I vectorized some FORTRAN code and got a 1.5x speed
improvement on AltiVec. Not terrible, but it should be better. So I
looked at the code that was generated by the auto-vectorizer. The
vector processor spent roughly *half* its time translating from
memory that wasn't 16 byte aligned. The problem gets worse when you
attempt to operate on a vector of values (like neuron sums,
activations, or weights) that are stored in structures with unused
data surrounding each value.

The conclusion here is that the most important part of creating
vector code is creating the correct memory structure.
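
To illustrate the layout problem, here is a rough sketch contrasting an array of structs with a struct of arrays whose buffers start on a 16-byte boundary. All names (demo_neuron_aos, demo_layer_soa, demo_alloc_floats16, and so on) are hypothetical and do not reflect FANN's actual structures; the allocation assumes a C11 library that provides aligned_alloc().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Array-of-structs layout (hypothetical): consecutive sums are a whole
 * struct apart, so they cannot be loaded as one contiguous vector. */
struct demo_neuron_aos {
    float sum;
    float value;
    unsigned int first_con;
    unsigned int last_con;
};

/* Struct-of-arrays layout (hypothetical): all sums are contiguous. */
struct demo_layer_soa {
    size_t n;
    float *sums;
    float *values;
};

/* Allocate n zeroed floats on a 16-byte boundary (C11 aligned_alloc).
 * The byte count is rounded up to a multiple of 16, as C11 requires. */
float *demo_alloc_floats16(size_t n)
{
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    float *p;

    if (bytes == 0)
        bytes = 16;
    p = aligned_alloc(16, bytes);
    if (p != NULL)
        memset(p, 0, bytes);
    return p;
}

/* Returns nonzero when p sits on a 16-byte boundary. */
int demo_is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Returns 0 on success, -1 if either buffer could not be allocated. */
int demo_layer_soa_init(struct demo_layer_soa *l, size_t n)
{
    l->n = n;
    l->sums = demo_alloc_floats16(n);
    l->values = demo_alloc_floats16(n);
    if (l->sums == NULL || l->values == NULL) {
        free(l->sums);
        free(l->values);
        l->sums = l->values = NULL;
        return -1;
    }
    return 0;
}

void demo_layer_soa_free(struct demo_layer_soa *l)
{
    free(l->sums);
    free(l->values);
    l->sums = l->values = NULL;
}
```

In the struct-of-arrays form, four neighbouring sums occupy one aligned 16-byte block, so a vector unit can load them directly instead of gathering them from between unrelated fields.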

I've decided the best way to do this is to create macros for loading/
setting weights and other neuron values. The macros used can be
chosen at compile time along with the processing method used.
Something like these:
fann_neuron_weight_store()
fann_neuron_weight_load()
fann_neuron_sum_store()
fann_neuron_sum_load()
...

They will be customized to use data structures that are optimized for
fann_run(), so the utility functions that operate on the nets will be
a bit slower, but fann_run() will be more adjustable. You've probably
heard the 90%/10% rule, where most programs spend 90% of their time
in 10% of the code. These macros would be used in the other 90% of the
code, where the processor spends the least amount of time, so that
code stays maintainable. This will allow people to create custom,
fast solutions for fann_run().
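
A sketch of how compile-time accessor macros could hide the memory layout. The demo_* macros and struct demo_net below are hypothetical stand-ins for the accessors proposed above; their argument lists are assumptions, since the original names are given without signatures, and only a scalar layout is shown.

```c
#include <stddef.h>

/* Hypothetical scalar layout: a dense, row-major weight matrix. */
struct demo_net {
    size_t num_inputs;
    size_t num_outputs;
    const float *weights;   /* num_outputs rows of num_inputs weights */
    float *sums;            /* one sum per output neuron */
};

/* A vector or GPU build would define DEMO_CUSTOM_LAYOUT and supply the
 * same three macros over its own data structures. */
#if !defined(DEMO_CUSTOM_LAYOUT)
#define demo_neuron_weight_load(net, n, i) \
    ((net)->weights[(n) * (net)->num_inputs + (i)])
#define demo_neuron_sum_load(net, n)     ((net)->sums[(n)])
#define demo_neuron_sum_store(net, n, v) ((net)->sums[(n)] = (v))
#endif

/* Utility-style code touches neurons only through the macros, so it
 * keeps compiling when the layout behind them changes. */
void demo_compute_sums(struct demo_net *net, const float *inputs)
{
    for (size_t n = 0; n < net->num_outputs; n++) {
        float sum = 0.0f;

        for (size_t i = 0; i < net->num_inputs; i++)
            sum += demo_neuron_weight_load(net, n, i) * inputs[i];
        demo_neuron_sum_store(net, n, sum);
    }
}
```

The trade-off is the one described above: code written against the macros pays a small indirection cost, while the hot loop can be replaced wholesale with a version tuned for a specific layout.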

Therefore, someone can create data structures that could be used with
scalar, AltiVec, SSE, and even GPU processors. The possibility of
using a GPU is particularly interesting, because it opens FANN up to
operating on the scale of *teraflops* with consumer hardware. See
also this recent slashdot story and associated links:
http://hardware.slashdot.org/article.pl?sid=07/03/01/1519254

I've been doing a bunch of reading, and I believe that this is quite
doable once these macros are in place (see also:
http://www.gpgpu.org/ ). I would think that the ability to easily use
massively parallel systems will open up new areas of ANN usage. (All
this would probably also work with the Cell processor, but I'm not
sure if that would be useful to users of libfann.)

To give people an idea of the speedups I'm talking about:
- Scalar
I expect this to run about the same speed as the original code.
Packing neurons into a struct of arrays instead of an array of
structs might speed the code up because of better cache usage. But
using macros for neuron access in 90% of the code will slow things
down a tad.

- AltiVec
At the least, it can operate on four floats at the same time, which
yields up to a ~4x speedup. Lucky for PowerPC users, AltiVec
also has hardware inverse, inverse exp(), inverse sqrt(), and some
other useful functions. A patch for FANN v1.2 is optimized for
AltiVec, and it was "between 5 and 20 (36 in one case!) times as
fast". That's a 5x-20x speedup in real world tests.

- SSE
SSE is similar to AltiVec, but is missing some hardware features
(like the inverse exp()). I would guess that SSE would end up closer
to a ~4x speedup.

- GPU
This is where things really get interesting, and more so in the near
future. Most modern GPUs have a texture processor that can do
calculations using floating point numbers. They operate on vectors of
4 floating point values, and they can operate on many vectors
simultaneously (8-32 pipelines?). As with all vector processors,
their greatest speedup will occur with many interconnects between
layers.

Imagine you want to use a 32x32 pixel display as an input to your
net. That's 1024 input neurons. Two fully connected 1024 neuron
layers will have about 1.05 million interconnects (weights). As
currently programmed, it would take about 1.05 million passes to
calculate the resulting sums. A GPU would treat the weights as a
1024x256 pixel texture and the neuron sums as a 256x1 pixel texture,
with four floats stored per pixel. It could operate on 128 weights at
a time in floating point (e.g. 32 pipelines of 4 floats each),
resulting in a 128x speedup (theoretically).
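
The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical; it assumes four floats per texture pixel (RGBA) and takes the 128 lanes as 32 pipelines of 4 floats, the upper end of the pipeline range guessed at above.

```c
#include <stddef.h>

/* Assumption: one RGBA texture pixel stores four floats. */
enum { DEMO_FLOATS_PER_PIXEL = 4 };

/* Number of weights between two fully connected layers. */
size_t demo_num_weights(size_t inputs, size_t outputs)
{
    return inputs * outputs;
}

/* Height in pixels of a texture of the given width that holds
 * `count` floats, rounded up to a whole row. */
size_t demo_texture_height(size_t count, size_t width_px)
{
    size_t floats_per_row = width_px * DEMO_FLOATS_PER_PIXEL;

    return (count + floats_per_row - 1) / floats_per_row;
}

/* Number of passes needed when `lanes` values are handled per pass. */
size_t demo_passes(size_t count, size_t lanes)
{
    return (count + lanes - 1) / lanes;
}
```

For two 1024-neuron layers this gives 1,048,576 weights, a 1024x256 pixel weight texture, a 256x1 pixel texture for the 1024 sums, and 8,192 passes at 128 lanes instead of 1,048,576 scalar ones. This is an upper bound that ignores memory bandwidth and the cost of moving data to and from the card.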

The same number of interconnects is used if the input is 64x64
pixels and your hidden layer is 256 neurons. Or the input is ~74x74
RGB pixels and the hidden layer is 64 neurons. By my calculations
that's enough processing power to simulate a fruit fly brain in
realtime. ;)

Thoughts? Misunderstandings? Miscalculations? If I worked on getting
the scalar macros working, is there a graphics programmer out there
who would get it running on GPUs (OpenGL)? Papers on working with
massive ANNs?

Apologies if I'm confused on the GPU details. IANAGP
(I Am Not A GPU Programmer)
~Seth

Appendix A:
Fruit fly brain calculations
100,000 neurons total
* 1024 connections per neuron
* 100 operations per connection calculation
* 100 neuron firing events per second
~= 1 teraflops

http://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons
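
The estimate in Appendix A multiplies out as follows. This sketch only reproduces the arithmetic; the function name is hypothetical and the four factors are the figures stated in the appendix.

```c
/* Operations per second implied by the Appendix A figures. */
unsigned long long demo_fruit_fly_ops_per_second(void)
{
    const unsigned long long neurons = 100000ULL;            /* total neurons */
    const unsigned long long connections_per_neuron = 1024ULL;
    const unsigned long long ops_per_connection = 100ULL;
    const unsigned long long firings_per_second = 100ULL;

    return neurons * connections_per_neuron
         * ops_per_connection * firings_per_second;
}
```

The product is 1.024 x 10^12 operations per second, which is where the "~= 1 teraflops" figure comes from.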

Marcel Serour

2007-03-04 22:52:11 UTC

The GPU idea is very good; I was recently reading some material about this
too!
If I can be of any help I would be very glad, but I'm not sure I have
proper FANN knowledge yet.


Vincenzo Di Massa

2007-03-05 11:54:59 UTC

What about participating in Google Summer of Code?

I could apply as a student and work on this the whole summer (paid by Google).

It is really something I would like to do... If someone else is willing to
apply for SoC 2007 I could also be a mentor (but I'd really prefer being the
student).

The deadline is 12th March, so we do not have much time.

We just need a list of proposals and mentors.
Look here:

http://code.google.com/support/bin/answer.py?answer=60266&topic=10726

http://code.google.com/support/bin/topic.py?topic=10442

Regards
Vincenzo


Seth Price

2007-03-05 14:54:04 UTC

That's a good idea. I'd encourage you to apply. I don't think I would
be the best applicant because I have other projects going on. And I'm
not currently a student.
~Seth


Vincenzo Di Massa

2007-03-05 15:20:35 UTC

To apply to Summer of Code we first need to apply as an organization/group.
Steffen, can you do the paperwork to apply? Are you willing to apply? I can
help if you need support. I have already been a SoC student twice...

The stipend for students is $4500; for mentors it is $500... it sounds
worth a bit of paperwork :-)

Regards
Vincenzo


Steffen Nissen

2007-03-05 20:58:01 UTC

I think it sounds like a great idea.

I would be happy to apply as a mentor. However, most of my summer will be
spent writing my thesis, but as always I will help where I can. It also
seems that Google would very much like a backup mentor to be in place
in case I am on vacation etc., so that would be nice too.

Another suggestion is that I apply as an organization and that another
person applies as mentor, then I will be able to be the backup mentor.

Please see http://code.google.com/soc/tos.html for more information.

Please let me know if anyone is interested in applying as mentors or
students.

Best Regards,
Steffen


--
Steffen Nissen - http://MySpace.com/SteffenNissen
Project Administrator - Fast Artificial Neural Network Library (fann)
http://fann.sf.net

Vincenzo Di Massa

2007-03-05 23:58:25 UTC

I'd prefer to be a student, as I said.

Could Seth be the backup (or a second) mentor?
Or freegoldbar?


Seth Price

2007-03-06 02:28:08 UTC

I don't think I'd be comfortable being classified as a primary
mentor. While I would like to watch and help with the development
process, I don't have much experience working with GPU programming or
medium-sized C packages. Rather, I'm also learning as I go.

I may also be out-of-contact for a few weeks this summer. There is a
small chance I will be teaching English in China for a few weeks, and
I will be moving across the country.

I *can*, however, help specifically with vector instructions,
processor architecture from a programmer's perspective, and memory
layout. I have a BS degree in Computer Science from a top ten
university (http://www.cs.wisc.edu). I'm also well versed with some
low-level optimization tools on my desktop machine, so I can help there.

If you (Vincenzo & Steffen) feel I'm qualified, I would be willing to
be the backup (or secondary) mentor.
~Seth
