Seth Price
2007-02-21 19:46:36 UTC
Hey all, I'm interested in vectorizing FANN for AltiVec and (maybe)
SSE. Has anyone looked into this before? I would guess that I can get
4x the speed for the AltiVec version.
The innermost loop of fann_run (fann.c:630 - fann.c:663) is a good
candidate. I can see that it has already been tweaked fairly well.
For example: I would like to change this:
i = num_connections & 3; /* same as modulo 4 */
switch (i)
{
case 3:
neuron_sum += fann_mult(weights[2], neurons[2].value);
case 2:
neuron_sum += fann_mult(weights[1], neurons[1].value);
case 1:
neuron_sum += fann_mult(weights[0], neurons[0].value);
case 0:
break;
}
for(; i != num_connections; i += 4)
{
neuron_sum +=
fann_mult(weights[i], neurons[i].value) +
fann_mult(weights[i + 1], neurons[i + 1].value) +
fann_mult(weights[i + 2], neurons[i + 2].value) +
fann_mult(weights[i + 3], neurons[i + 3].value);
}
Into something more like this:
/* vecUnion here is assumed to be: union { vector float v; float f[4]; } */
vecUnion sum;
vector float wgts;
vector float conn;
const vector float * wgtsPtr;
const vector float * connPtr;
wgtsPtr = (const vector float *) &(weights[0]);
connPtr = (const vector float *) &(neurons.value[0]);
sum.v = (vector float) (0.0f, 0.0f, 0.0f, 0.0f);
for(i = 0; i < num_connections; i += 4){
    wgts = vec_ld(0, wgtsPtr++);
    conn = vec_ld(0, connPtr++);
    sum.v = vec_madd(wgts, conn, sum.v);
}
neuron_sum = sum.f[0] + sum.f[1] + sum.f[2] + sum.f[3];
Far fewer operations for the same amount of work done. But, as
always, the more a piece of code is optimized, the less recognizable it gets.
There are plenty of registers left for unrolling. ;) I'm not as
familiar with SSE, but from what I've seen, it should be almost the
same (with the same problems). The multiply/add becomes two
instructions instead of one.
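For reference, here is a rough sketch of what the same loop might look like
with SSE (a hypothetical helper, not FANN code; it assumes num_connections is
a multiple of 4, and uses unaligned loads to sidestep the alignment question
for the moment):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: dot product of two float arrays.
 * Assumes n is a multiple of 4. */
static float sse_dot(const float *weights, const float *values, int n)
{
    __m128 sum = _mm_setzero_ps();
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 w = _mm_loadu_ps(&weights[i]);
        __m128 v = _mm_loadu_ps(&values[i]);
        /* SSE has no fused multiply-add: multiply, then add */
        sum = _mm_add_ps(sum, _mm_mul_ps(w, v));
    }
    /* horizontal sum of the four lanes */
    float f[4];
    _mm_storeu_ps(f, sum);
    return f[0] + f[1] + f[2] + f[3];
}
```

Note the vec_madd from the AltiVec version becomes the _mm_mul_ps/_mm_add_ps
pair here.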
My biggest issue with this lies in the structure of the neurons. To
get the best speed out of them, I would need them to be packed
together, so I can treat them like vectors the entire way through.
Thus "neurons.value[0]" instead of the original "neurons[0].value".
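To illustrate what I mean (names hypothetical, not the actual FANN structs),
the change is from an array-of-structs to a struct-of-arrays layout:

```c
#include <stdlib.h>

/* Array-of-structs: each value is strided apart by the
 * size of the struct, so vector loads can't grab 4 at once. */
struct neuron_aos {
    float value;
    float sum;
};

/* Struct-of-arrays: all values packed back to back,
 * so weights[i..i+3] and value[i..i+3] line up for vec_ld. */
struct neuron_layer {
    float *value;  /* one contiguous array of all neuron values */
    float *sum;
    int    count;
};
```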
The other issue is the byte alignment of the data. To work correctly,
the addresses of "weights[0]" and "neurons.value[0]" need to be
aligned on a 16 (or 32) byte boundary, and padding the arrays out with
zeros would be good. Both of these changes may make the scalar version
faster, also.
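One possible way to get the alignment and padding (a sketch, assuming a
POSIX system; the allocator name is mine, not FANN's):

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a float array on a 16-byte boundary, rounded up to a
 * multiple of 4 elements and zero-filled, so a 4-wide vector loop
 * can safely read past the logical end. */
static float *alloc_aligned_floats(size_t n)
{
    size_t padded = (n + 3) & ~(size_t)3;  /* round up to multiple of 4 */
    void *p = NULL;
    if (posix_memalign(&p, 16, padded * sizeof(float)) != 0)
        return NULL;
    float *f = (float *)p;
    for (size_t i = 0; i < padded; i++)
        f[i] = 0.0f;
    return f;
}
```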
I was thinking that someone more familiar with libfann could make
these changes. Thoughts?
Disclaimer: it has been two years since I've worked with this sort of
programming, so I may be a little rusty.
~Seth