I discussed this originally in my Cortex-A7 FFTW benchmarks, but I am repeating it in its own blog post for clarity, as I believe it is an important thing to understand.
I noticed that when enabling the FMA capabilities of FFTW, the performance actually decreased. I thought to myself "but the ARM VFPv4 supports FMA, so this should be faster than doing separate multiply and add operations..." so I did a little bit of research into why this is the case.
In the computation of an FFT, two of the common operations are:
t0 = a + b * c
t1 = a - b * c
The way that the NEON FMA instruction works, however, is not conducive to solving this. This is what happens when you use the NEON FMA instruction:
t0 = a
t0 += b * c
t1 = a
t1 -= b * c
Since ARM is a RISC architecture, its instructions are less flexible and generally take a fixed number of operands. The NEON FMA instruction accumulates into its destination register (effectively d += b * c), so it cannot write the result to a separate register while leaving the accumulator untouched. Because of this limitation, it has to be used as shown above: notice that we must spend two move instructions initially setting t0 and t1 to a. It turns out that in this specific case it is faster to just use separate multiplies and adds:
t = b * c
t0 = a + t
t1 = a - t
All in all, the FMA version does two moves and two FMAs, while the optimal version does one multiply and two adds. It is a small difference, one which the compiler may or may not notice and optimise away, but performed a significant number of times it adds up, which is what we see in the FFTW benchmarks, for example. There will be cases where this instruction does indeed help, but it is important to bear in mind what is going on behind the scenes.