Seth Price
2007-02-21 19:46:36 UTC
Hey all, I'm interested in vectorizing FANN for AltiVec and (maybe)
SSE. Has anyone looked into this before? I would guess that I can get
4x the speed for the AltiVec version.
The innermost loop of fann_run (fann.c:630 - fann.c:663) is a good
candidate. I can see that it has already been tweaked fairly well.
For example: I would like to change this:
i = num_connections & 3; /* same as modulo 4 */
switch (i)
{
case 3:
neuron_sum += fann_mult(weights[2], neurons[2].value);
case 2:
neuron_sum += fann_mult(weights[1], neurons[1].value);
case 1:
neuron_sum += fann_mult(weights[0], neurons[0].value);
case 0:
break;
}
for(; i != num_connections; i += 4)
{
neuron_sum +=
fann_mult(weights[i], neurons[i].value) +
fann_mult(weights[i + 1], neurons[i + 1].value) +
fann_mult(weights[i + 2], neurons[i + 2].value) +
fann_mult(weights[i + 3], neurons[i + 3].value);
}
Into something more like this:
/* vecUnion here is assumed to be: union { vector float v; float f[4]; } */
vecUnion sum;
vector float wgts;
vector float conn;
const vector float * wgtsPtr;
const vector float * connPtr;
wgtsPtr = (const vector float *) &(weights[0]);
connPtr = (const vector float *) &(neurons.value[0]);
sum.v = (vector float) (0.0f, 0.0f, 0.0f, 0.0f);
for(i = 0; i < num_connections; i += 4){
    wgts = vec_ld(0, wgtsPtr++);
    conn = vec_ld(0, connPtr++);
    sum.v = vec_madd(wgts, conn, sum.v);
}
neuron_sum = sum.f[0] + sum.f[1] + sum.f[2] + sum.f[3];
Far fewer operations for the same amount of work done. But, as
always, the more a piece of code is optimized, the less recognizable it gets.
There are plenty of registers left for unrolling. ;) I'm not as
familiar with SSE, but from what I've seen, it should be almost the
same (with the same problems). The multiply/add becomes two
instructions instead of one.
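For reference, here is a rough sketch of what the same loop might look like
with SSE (a hypothetical helper, not FANN code; it assumes num_connections is
a multiple of 4, and uses unaligned loads to sidestep the alignment question
for the moment):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: dot product of two float arrays.
 * Assumes n is a multiple of 4. */
static float sse_dot(const float *weights, const float *values, int n)
{
    __m128 sum = _mm_setzero_ps();
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 w = _mm_loadu_ps(&weights[i]);
        __m128 v = _mm_loadu_ps(&values[i]);
        /* SSE has no fused multiply-add: multiply, then add */
        sum = _mm_add_ps(sum, _mm_mul_ps(w, v));
    }
    /* horizontal sum of the four lanes */
    float f[4];
    _mm_storeu_ps(f, sum);
    return f[0] + f[1] + f[2] + f[3];
}
```

Note the vec_madd from the AltiVec version becomes the _mm_mul_ps/_mm_add_ps
pair here.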
My biggest issue with this lies in the structure of the neurons. To
get the best speed out of them, I would need them to be packed
together, so I can treat them like vectors the entire way through.
Thus "neurons.value[0]" instead of the original "neurons[0].value".
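To illustrate what I mean (names hypothetical, not the actual FANN structs),
the change is from an array-of-structs to a struct-of-arrays layout:

```c
#include <stdlib.h>

/* Array-of-structs: each value is strided apart by the
 * size of the struct, so vector loads can't grab 4 at once. */
struct neuron_aos {
    float value;
    float sum;
};

/* Struct-of-arrays: all values packed back to back,
 * so weights[i..i+3] and value[i..i+3] line up for vec_ld. */
struct neuron_layer {
    float *value;  /* one contiguous array of all neuron values */
    float *sum;
    int    count;
};
```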
The other issue is the byte alignment of the data. To work correctly,
the addresses of "weights[0]" and "neurons.value[0]" need to be
aligned on a 16 (or 32) byte boundary, and padding the arrays out with
zeros would be good. Both of these changes may make the scalar version
faster, also.
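One possible way to get the alignment and padding (a sketch, assuming a
POSIX system; the allocator name is mine, not FANN's):

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a float array on a 16-byte boundary, rounded up to a
 * multiple of 4 elements and zero-filled, so a 4-wide vector loop
 * can safely read past the logical end. */
static float *alloc_aligned_floats(size_t n)
{
    size_t padded = (n + 3) & ~(size_t)3;  /* round up to multiple of 4 */
    void *p = NULL;
    if (posix_memalign(&p, 16, padded * sizeof(float)) != 0)
        return NULL;
    float *f = (float *)p;
    for (size_t i = 0; i < padded; i++)
        f[i] = 0.0f;
    return f;
}
```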
I was thinking that someone more familiar with libfann could make
these changes. Thoughts?
Disclaimer: it has been two years since I've worked with this sort of
programming, so I may be a little rusty.
~Seth