Thursday 19 December 2013

FFTW Benchmarks on Cortex-A7

The FFT algorithm has many scientific uses. The most obvious uses are in radio astronomy, for the frequency analysis of signals and is vital to Software Defined Radio (SDR) which is used extensively in the Square Kilometer Array (SKA). In line with the goals of the MAC Project, I am curious about how well an ARM processor (specifically the Cortex-A7) can do FFT - which leads to these benchmarks.

I discovered some existing benchmarks of FFTW done on the Cortex-A8 and A9 by Vesperix here. I used their modified FFTW 3.2.2 for ARM NEON and also ran benchmarks using the latest official version of FFTW: 3.3.3. Both sets of results are presented below with a short discussion afterwards.

I was unable to get the FFTW 3.3.3 NEON version working. I was repeatedly hit by a segmentation fault which I think is due to different memory alignment in the newer NEON and VPPv4 FPU's. I will post these specific benchmarks when the error is resolved.

System Specifications

The tests were run on a Cubieboard2 with the following specifications:
  • Allwinner A20 Dual-Core Cortex-A7 SoC @ ~1GHz
  • VFPv4 and NEONv2 FPU
  • 256 kB L2 Cache
  • 1 GB DDRIII RAM
  • 8GB Class 10 MicroSD Card
  • sunxi kernel 3.4.67+
  • Linaro 13.04 (with GCC 4.7.3)

Benchmark Methodology

I am only presenting the results for a complex 1D FFT with powers of two and non-powers of two. These are the types of FFTs that are most useful to radio astronomy since signal phase and amplitude are represented as a complex number. I ran several sets of benchmarks with various optimisations for comparison, each of which I will describe below.

I first tested the Vesperix FFTW 3.2.2 and then the FFTW 3.3.3. In all cases I used the following configure flags for single precision and the only available timer on the ARM processor:

--enable-single --with-slow-timer

I ran non-SIMD (no NEON) test without any extra flags, and NEON SIMD tests with the flag below:

--enable-neon

I also tried out the fused multiply-add flag since the Cortex-A7 has this instruction in the VFPv4 FPU but I found that this flag actually caused performance to decrease! A short description of why this is can be found in the FFTW 3.3.3 tests section.

--enable-fma

In all cases I modified the configure script to optimise for the CPU with the '-mcpu=cortex-a7' flag. I also modified the configure script to try out different GCC FPU options where appropriate, but in general I am only presenting the fastest results in this post. The options I tried are listed below for reference:

-mfpu=neon
-mfpu=neon-vfpv4
-mfpu=vfpv4-d16
-mfpu=vfpv3-d16

I repeated the tests with NEON on a threaded version of FFTW to see at which point multiple threads (on multiple cores) makes a difference and by how much. To enable the threaded version, FFTW must be recompiled with the flag below. Note that FFTW can be compiled once with this flag and used in both a threaded or unthreaded way.

--enable-threads

I plan on running an MPI version with more threads (4 to 16) on our Cubieboard and Wandboard clusters at a later stage.

I used the script provided by Vesperix to automate the benchmarks. For the threaded tests I modified the script to contain the '-onthreads=2'. The number can be adjusted to suit the number of cores available on the system. The modified script is shown below.

#!/bin/sh
for TYPE in 'c'; do
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (2 threads)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench -onthreads=2 $OPTS ${PLACE}${TYPE}${SIZE}
    done
  done
  for PLACE in 'i' 'o'; do
    echo "$TYPE $PLACE 1-D powers of two (1 thread)"
    for SIZE in '2' '4' '8' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096' \
                '8192' '16384' '32768' '65536' '131072' '262144' '524288' '1048576' '2097152'; do
      ./bench $OPTS ${PLACE}${TYPE}${SIZE}
    done
  done
done

MFLOPS Result Interpretation:

The result provided by FFTW is 'MFLOPS': this is not true MFLOPS. It is estimated by FFTW based on an assumption of algorithmic complexity for the standard Cooley-Tukey FFT algorithm:

Although not necessarily totally accurate in the classic FLOPS sense, it is calculated the same way in all cases, so it works as a way to compare between runs. For comparison to other algorithms, I would rather use the actual time the algorithm takes to run on a specific FFT size (N).

FFTW 3.2.2 (Vesperix):

Please examine the various graphs below. Clearly, NEON makes quite a large difference and is a 'no-brainer' for any application. I am showing one set of non-power of two benchmarks to illustrate why they should not be used.



I ran a test of the threaded version of FFTW 3.2.2 and the results are promising for a scaled-up system. The Cubieboard2 is only a dual-core system but I plan on running MPI tests with more cores at a future date.



FFTW 3.3.3 (Official):

I was unable to get the NEON version of FFTW 3.3.3 working. I was able to run benchmarks of the scalar version of the code which shows a performance improvement over the 3.2.2 scalar results. I compiled one graph comparing all the different scalar versions, with FMA instructions and without.



Note how the FMA versions have slightly lower performance. In the Benchmark Methodology section I mentioned that the --enable-fma flag actually causes performance to decrease. The reason for this is not intuitive as one would think that a Fused Multiply Add (FMA) instruction would save cycles as it replaces separate Multiply and Add instructions. In the computation of an FFT, two of the common operations are:

t0 = a + b * c
t1 = a - b * c

The way that the NEON FMA instruction works, however, is not conducive solving this. This is what happens when you use the NEON FMA:

t0 = a
t0 += b * c
t1 = a
t1 -= b * c

Notice that we have to use up two move instructions for initially setting t0 and t1. It turns out that in this specific case it's faster to just use Multiplies and Adds:

t = b * c
t0 = a + t
t1 = a - t

All in all, the FMA version does 2 Moves, 2 FMA's. The optimal version does 1 Multiply and 2 Adds. It's a small difference, one which the compiler may or may not take note of and optimise, but when done a significant number of times it makes a difference.

Conclusion

The results from this set of benchmarks are very similar to those attained by Vesperix on Cortex-A9 boards. The multi-threaded version is also significantly better for larger FFT sizes. The results at different FFT sizes are very dependant on the processor implementation details such as cache sizes and memory access times. With smaller FFT's the overhead associated with calculating the FFT is a large factor and this is clearly visible up to sizes of 128.

The scalar results for FFTW 3.3.3 are better than those from 3.2.2 so it is logical to assume that the newer version's NEON performance will be better as well.

Since FFTW works by creating a 'plan' before actually calculating the FFT, it chooses to not use more than one thread in the multi-threaded version before a certain FFT size. This is clearly visible as it chooses to use multi-threading at greater than size 128. The overhead associated with doing this causes the result to be poor at size 256 and based on the results, multi threading should only be enabled for sizes over 1024.

The power usage of the Cortex-A7 processor is lower than that of the Cortex-A9 and so if a large cluster of these devices is used for a computational task such as radio astronomy, one could speculate that it may be worthwhile to use more Cortex-A7's over fewer Cortex-A9's since the performance is similar. 


No comments:

Post a Comment