Friday 31 October 2014

Adventures with ARM GCC Auto-Vectorization

I have been experimenting with how well gcc (4.8.2) auto-vectorizes a simple dot product of two 8-element vectors. I have discovered some interesting things (whether they are bugs or quirks, I don't know) and wanted to share my findings...

The first thing to know is how to generate assembly with gcc. Simply add -S to the gcc or g++ command. I am using Eclipse CDT and I have created several build configurations (some of which generate assembly for inspection).
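
For example, something like this dumps the optimized assembly to dot.s (the file names here are just placeholders):

g++ -O3 -S dot.cpp -o dot.s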

I have been experimenting with the Eigen library, since it apparently supports ARM NEON. What I found is that it works extremely well with x86 SIMD instructions but not so well with ARM NEON. It seems that well-written plain C++ is better for NEON than Eigen...
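
The Eigen version was essentially just a fixed-size dot product, something along these lines (the exact types and the function name are illustrative, not necessarily my actual benchmark code):

#include <Eigen/Dense>

float dot_eigen(const Eigen::Matrix<float, 8, 1> &a,
                const Eigen::Matrix<float, 8, 1> &b)
{
    return a.dot(b);
}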

x86 SSE2 with Eigen

movaps      32(%rdi), %xmm0
movaps      16(%rdi), %xmm1
mulps       32(%rsi), %xmm0
mulps       16(%rsi), %xmm1
addps       %xmm0, %xmm1
movaps      %xmm1, %xmm0
movhlps     %xmm1, %xmm0
addps       %xmm1, %xmm0
movaps      %xmm0, %xmm1
shufps      $1, %xmm0, %xmm1
addss       %xmm1, %xmm0

x86 SSE2 without Eigen

movq      (%rdi), %rdx
movq      (%rsi), %rax
movss     (%rdx), %xmm1
mulss     (%rax), %xmm1
movss     4(%rdx), %xmm0
mulss     4(%rax), %xmm0
addss     .LC0(%rip), %xmm1
addss     %xmm0, %xmm1
movss     8(%rdx), %xmm0
mulss     8(%rax), %xmm0
addss     %xmm0, %xmm1
movss     12(%rdx), %xmm0
mulss     12(%rax), %xmm0
addss     %xmm0, %xmm1
movss     16(%rdx), %xmm0
mulss     16(%rax), %xmm0
addss     %xmm0, %xmm1
movss     20(%rdx), %xmm0
mulss     20(%rax), %xmm0
addss     %xmm0, %xmm1

ARM NEON-VFPv4 with Eigen

flds        s0, .L2
ldr         r3, [r1]
ldr         r2, [r0]
flds        s15, [r3]
flds        s14, [r2]
vfma.f32    s0, s14, s15
flds        s6, [r2, #4]
flds        s7, [r3, #4]
flds        s8, [r2, #8]
flds        s9, [r3, #8]
flds        s10, [r2, #12]
flds        s11, [r3, #12]
flds        s12, [r2, #16]
flds        s13, [r3, #16]
flds        s14, [r2, #20]
flds        s15, [r3, #20]
vfma.f32    s0, s6, s7
vfma.f32    s0, s8, s9
vfma.f32    s0, s10, s11
vfma.f32    s0, s12, s13
vfma.f32    s0, s14, s15

ARM NEON-VFPv4 without Eigen

ldr     r3, [r0]
vmov.f32    q8, #0.0  @ v4sf
ldr     r2, [r1]
vld1.32     {q9}, [r3]!
vld1.32     {q10}, [r2]!
vmul.f32    q10, q10, q9
vst1.64     {d20-d21}, [sp:64]
vld1.32     {q9}, [r3]
vld1.32     {q11}, [r2]
vmov     q12, q10  @ v4sf
vfma.f32    q12, q11, q9
vadd.f32    d18, d24, d25
vpadd.f32   d16, d18, d18
vmov.32     r3, d16[0]

It took quite some effort to get the non-Eigen ARM code to be better than the Eigen code. The "naive" version with a simple dot-product for-loop (shown below) produced assembly similar to what Eigen did. The a and b variables have been __restrict__ed and are pointers to aligned memory.

for (int i = 0; i < 8; i++)
{
    out += a[i] * b[i];
}
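
For completeness, one way to express those hints to gcc is sketched below; the 16-byte alignment value and the use of __builtin_assume_aligned are my choices here and may need adjusting for your setup:

float dot_naive(const float *__restrict__ a, const float *__restrict__ b)
{
    // Promise gcc that both pointers refer to 16-byte aligned memory.
    const float *pa = (const float *) __builtin_assume_aligned(a, 16);
    const float *pb = (const float *) __builtin_assume_aligned(b, 16);

    float out = 0.0f;
    for (int i = 0; i < 8; i++)
    {
        out += pa[i] * pb[i];
    }
    return out;
}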

Even with those hints, the result was not what I expected. I decided to split the multiply and the accumulate into two separate loops (presumably giving the vectorizer a simple element-wise loop plus a separate reduction), and I got the NEON version shown above!

float prods[8];

for (int i = 0; i < 8; i++)
{
    prods[i] = a[i] * b[i];
}
for (int i = 0; i < 8; i++)
{
    out += prods[i];
}

I should also add that without -funsafe-math-optimizations, the auto-vectorization doesn't happen at all, since gcc won't reorder the floating-point additions in the reduction without it. I'm going to keep working on it to see if I can shed a few more instructions, but so far so good!
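
Putting it all together, a command along these lines should reproduce the kind of NEON output shown above (the cross-compiler name is a placeholder for whatever your toolchain provides):

arm-linux-gnueabihf-g++ -O3 -mfpu=neon-vfpv4 -mfloat-abi=hard -funsafe-math-optimizations -S dot.cpp -o dot.s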