HEP Group Blog: March 2014

Friday 28 March 2014

Wandboard PCI-Express Connector PCB Photos

Today I took delivery of the first (and hopefully last if it's bug-free) version of the dual Wandboard PCI-Express adapter. I plan on having it soldered and ready for testing by early next week. Check out the photos!

Wednesday 12 March 2014

Wandboard PCI-Express Adapter: Update

It's been a while since I last posted about the Wandboard PCI-Express adapter I have been working on... I decided to redesign the PCB to be more compact. This saves manufacturing costs and it looks better, in my opinion.

The PCB has been sent for manufacture so hopefully in a week or two I can post some photos! Shortly after that - assuming everything goes according to plan - I'll post some results for the PCI-Express performance of the Freescale i.MX6 SoC. I don't think the PCI-Express interface to the Wandboard has been tested by anyone, so hopefully it works...

Tuesday 11 March 2014

NAS Benchmarks on ARM

The NAS Parallel Benchmarks (link) are a comprehensive suite of benchmarks for testing supercomputers, maintained by NASA. They were originally based on computational fluid dynamics (in 1994) and have expanded over time to cover many different problem types and problem sizes, from very small problems that run in a few seconds for testing purposes to large problems that can take hours on a supercomputer!

Since these benchmarks cover a range of problems, most interestingly a specific Embarrassingly Parallel benchmark, it is important to test their performance on ARM. Luckily the task of building the benchmark suite on ARM is straightforward. I will document it here for those who are interested. I will write about performance tweaks and compiler flags in a later post once I have had more time to experiment.

Installation (Single Processor Test)

Download a copy of the source code from the web site linked above. Unzip the source into a directory on your ARM system.
You should already have a full suite of compilers (gcc, with Fortran support for mpif77) installed on your system, as well as MPICH or another MPI library.
Navigate into the NPB3.3-MPI directory. Please read the README.install text document for some details. There is a short document in each benchmark directory with some details about that specific benchmark.
Navigate into the 'config' directory.
Run this command to use the template for the build: cp make.def.template make.def
Then run this command to use the template for the suites: cp suite.def.template suite.def
You now need to customize the make.def file to your system. Your modifications should be the same as mine if you are running Linux (Linaro) on ARM. Scroll through the file and adjust the lines as below:

MPIF77 = mpif77

FFLAGS = -O3

MPICC = mpicc

Uncomment the line: include ../config/make.dummy

Note that we uncommented the make.dummy include. This means that true MPI will not be used, and each benchmark will run on a single processor only, as a simple test.
The template suite.def file is fine for this proof-of-concept.
Return to the NPB3.3-MPI root directory with cd ..
Type make suite and wait for the build to complete. If something goes wrong there may be an issue with a dependency.

Installation (Multi-Processor MPI)

To install a true MPI version, follow the steps above, except leave the make.dummy include commented out. You should also modify the suite.def file to suit the number of processors (processes) you would like to run.

To run a multi-processor version, type:
mpirun -np 4 ./bin/ep.S.4
This runs the 4-process build of EP at problem size S. The benchmark must have been compiled for the same number of processes, so adjust the command to match your build.

You can selectively compile a single test at a time. Please see the README.install file - it's really quite simple.

Thursday 6 March 2014

Does the Fused Multiply-Add (FMA) instruction make a difference?

I discussed this originally in my Cortex-A7 FFTW benchmarks, but I am repeating it in its own blog post for clarity as I believe it's an important thing to understand.

I noticed that when enabling the FMA capabilities of FFTW, the performance actually decreased. I thought to myself "but the ARM VFPv4 supports FMA so this should be faster than doing separate multiply and add operations..." so I did a little bit of research as to why this is the case.

In the computation of an FFT, two of the common operations are:

t0 = a + b * c
t1 = a - b * c

The way that the NEON FMA instruction works, however, is not conducive to computing these. This is what happens when you use the NEON FMA instruction:

t0 = a
t0 += b * c
t1 = a
t1 -= b * c

Since ARM is a RISC architecture, the instructions are less flexible and generally take a fixed number of operands. The FMA instruction multiplies two source registers and accumulates the product into its destination register, overwriting whatever was there, so it cannot write a + b * c to a fresh register while leaving a intact. That is why it is used as shown above. Notice that we have to spend two move instructions on initially setting t0 and t1. It turns out that in this specific case it's faster to just use multiplies and adds:

t = b * c
t0 = a + t
t1 = a - t

All in all, the FMA version does 2 moves and 2 FMAs, while the optimal version does 1 multiply and 2 adds. It's a small difference, and one which the compiler may or may not notice and optimise, but repeated a significant number of times it adds up, which is what we see in the FFTW benchmarks, for example. There will be cases where the FMA instruction does indeed help, but it's important to bear in mind what's going on behind the scenes.

High Speed PCB Routing

I have been quite busy designing high-speed PCBs for my PCI-Express research. I found this video on YouTube from Texas Instruments which provides an excellent overview of high speed routing and I recommend it to anyone interested.

To reiterate the issues one usually faces when designing high speed circuit boards (from the video above):

Timing: the lengths of the tracks must be similar enough for the electrical signals to arrive at the receiver at the same time. Signals travel along the tracks at only ~0.6 times the speed of light, which is slow enough for small length differences to matter!
Signal Integrity: The shape of the signal needs to be right when it arrives at the receiver.
Noise: There can be a lot of crosstalk and noise on a high speed PCB and this noise can adversely affect signals.

To 'solve' these concerns:

Maintain the correct impedance from the transmitter to the receiver. This is not always trivial and so this is usually the biggest problem!
Match the track lengths to minimise signal skew.
Leave space around the traces to minimise noise. Crosstalk falls off rapidly as the spacing increases.
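To put a rough number on the timing concern, here is a small sketch in C that converts a track-length mismatch into skew, using the ~0.6 times the speed of light figure from the list above. The function names and the bit-rate parameter are my own assumptions for illustration. The real propagation speed depends on the board material and stack-up, so this is a back-of-the-envelope estimate only.

```c
/* Propagation speed on the board, assuming the ~0.6c figure quoted above. */
#define SPEED_OF_LIGHT_M_PER_S 299792458.0
#define VELOCITY_FACTOR 0.6

/* Skew in picoseconds caused by a track-length mismatch in millimetres. */
double skew_ps(double mismatch_mm)
{
    /* Convert m/s to mm/ps: multiply by 1e3 (mm per m), divide by 1e12 (ps per s). */
    double speed_mm_per_ps = SPEED_OF_LIGHT_M_PER_S * VELOCITY_FACTOR * 1e3 / 1e12;
    return mismatch_mm / speed_mm_per_ps;
}

/* The same skew expressed as a fraction of one bit period,
   for a link running at bit_rate_gbps gigabits per second. */
double skew_fraction_of_bit(double mismatch_mm, double bit_rate_gbps)
{
    double bit_period_ps = 1000.0 / bit_rate_gbps;
    return skew_ps(mismatch_mm) / bit_period_ps;
}
```

With these assumptions, a 1 mm mismatch works out to roughly 5.6 ps of skew. On a link running at 5 Gbit/s (a rate picked purely for illustration, giving a 200 ps bit period), a 10 mm mismatch would eat more than a quarter of each bit.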
