Repeating Rust Floating-Point Benchmark on AWS Graviton2 (AARCH64) Instance

January 08, 2022

I was curious about Graviton2 instances and how well Rust's optimization passes perform on them. Graviton2 is an ARM AArch64 CPU, so good performance on x86 systems (Intel, AMD) does not guarantee the same on Graviton2. Since Intel and AMD hold a huge combined market share and are both x86 based, it is not uncommon for optimization passes to be aggressively tested only on those systems, leaving out others. I wanted to know what this means for Rust on Graviton2.

The Graviton2 System

I used an AWS c6g.xlarge instance. The output of lscpu is given below:

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4   
On-line CPU(s) list: 0-3 
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               1
BogoMIPS:            243.75
L1d cache:           64K 
L1i cache:           64K 
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-3 
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

The Benchmark

Here I repeat a benchmark I previously did comparing equivalent Rust and C++ floating-point intensive code. For in-depth detail of the benchmark I refer you to that blog post, but I will summarize each of the benchmarked steps below:

  1. C++ reference compiled with G++10 (g++ -Ofast -ftree-vectorize main.cpp -o main)
  2. Rust reference implementation (direct translation of (1)) (rustc -C target-cpu=native -C opt-level=3 -O main.rs)
  3. Rust implementation with loops translated to iterators
  4. Rust implementation with loops translated to iterators and improved reduction

I ran the resulting executables for problem sizes n=256 to n=16777216. I show the results in the figure below.

Rust Underperforming on AARCH64

In my previous iteration of this benchmark on x86 systems, I found that translating loops to iterators in Rust and manually improving the accumulation ordering in the reduction resulted in code of performance equivalent to C++ compiled with aggressive optimizations. On Graviton2, however, I find that the C++ reference implementation compiled with aggressive optimizations outperforms even my best Rust implementation by a factor of about 2X.

Both versions appeared to vectorize. Quickly searching the binaries for uses of AArch64 NEON vector registers matching the regexp v*\.2d (the .2d suffix indicates two double-precision floating-point numbers per register) found the following:

Vectorization in the C++ Binary

  400d5c:	6f00e402 	movi	v2.2d, #0x0
  400d78:	4ee0cc81 	fmls	v1.2d, v4.2d, v0.2d
  400d7c:	6e60dc20 	fmul	v0.2d, v1.2d, v0.2d
  400d80:	4e61cc02 	fmla	v2.2d, v0.2d, v1.2d
  400d84:	4e60cc03 	fmla	v3.2d, v0.2d, v0.2d
  400d90:	7e70d863 	faddp	d3, v3.2d
  400d98:	7e70d840 	faddp	d0, v2.2d
  400ed4:	4e080464 	dup	v4.2d, v3.d[0]
  400eec:	4ee1cc02 	fmls	v2.2d, v0.2d, v1.2d
  400ef0:	4e62cc81 	fmla	v1.2d, v4.2d, v2.2d
  400fb0:	6f00e408 	movi	v8.2d, #0x0
  400fbc:	4e60d508 	fadd	v8.2d, v8.2d, v0.2d
  400fc8:	7e70d908 	faddp	d8, v8.2d
  40148c:	4e080401 	dup	v1.2d, v0.d[0]

Vectorization in the Rust Binary

objdump -d main |grep "v*\.2d"
    7e58:	4e080d01 	dup	v1.2d, x8
    7e64:	6e61d800 	ucvtf	v0.2d, v0.2d
    7ea4:	4ee0f800 	fabs	v0.2d, v0.2d
    7ea8:	4e61d400 	fadd	v0.2d, v0.2d, v1.2d
    7ec4:	4e080ee0 	dup	v0.2d, x23
    7ec8:	4ee08421 	add	v1.2d, v1.2d, v0.2d
    7f34:	6f00e400 	movi	v0.2d, #0x0
    80c4:	6f00e400 	movi	v0.2d, #0x0
    8218:	6e61dc01 	fmul	v1.2d, v0.2d, v1.2d
    8220:	4ee1d441 	fsub	v1.2d, v2.2d, v1.2d
    8228:	6e61dc00 	fmul	v0.2d, v0.2d, v1.2d
    822c:	6e60dc21 	fmul	v1.2d, v1.2d, v0.2d
    8230:	6e60dc00 	fmul	v0.2d, v0.2d, v0.2d
    8234:	4e60d440 	fadd	v0.2d, v2.2d, v0.2d
    8238:	4e61d461 	fadd	v1.2d, v3.2d, v1.2d
    8254:	6e61dc01 	fmul	v1.2d, v0.2d, v1.2d
    8258:	4ee1d441 	fsub	v1.2d, v2.2d, v1.2d
    8260:	6e61dc00 	fmul	v0.2d, v0.2d, v1.2d
    8264:	6e60dc21 	fmul	v1.2d, v1.2d, v0.2d
    8268:	6e60dc00 	fmul	v0.2d, v0.2d, v0.2d
    826c:	4e60d440 	fadd	v0.2d, v2.2d, v0.2d
    8274:	4e61d460 	fadd	v0.2d, v3.2d, v1.2d
    8608:	4e080401 	dup	v1.2d, v0.d[0]
    862c:	6e65dc65 	fmul	v5.2d, v3.2d, v5.2d
    8630:	6e64dc44 	fmul	v4.2d, v2.2d, v4.2d
    8634:	4ee4d4c4 	fsub	v4.2d, v6.2d, v4.2d
    863c:	4ee5d4c5 	fsub	v5.2d, v6.2d, v5.2d
    8640:	6e64dc24 	fmul	v4.2d, v1.2d, v4.2d
    8644:	6e65dc25 	fmul	v5.2d, v1.2d, v5.2d
    8648:	4e64d442 	fadd	v2.2d, v2.2d, v4.2d
    864c:	4e65d463 	fadd	v3.2d, v3.2d, v5.2d
    91e8:	6f00e400 	movi	v0.2d, #0x0
    fae8:	6f00e400 	movi	v0.2d, #0x0
    fb40:	4e080d61 	dup	v1.2d, x11
    fc30:	4e080d20 	dup	v0.2d, x9
   128dc:	6f00e402 	movi	v2.2d, #0x0
   12fd4:	6f00e400 	movi	v0.2d, #0x0
   13e70:	6f00e400 	movi	v0.2d, #0x0
   140c8:	6f00e400 	movi	v0.2d, #0x0
   168e8:	4e080d00 	dup	v0.2d, x8
   17434:	6f00e400 	movi	v0.2d, #0x0
   176dc:	6f00e400 	movi	v0.2d, #0x0
   17c7c:	6f00e400 	movi	v0.2d, #0x0
   17c84:	6f00e401 	movi	v1.2d, #0x0
   17c8c:	4c408d22 	ld2	{v2.2d, v3.2d}, [x9]
   17c90:	4c408d64 	ld2	{v4.2d, v5.2d}, [x11]
   17c9c:	4ee18441 	add	v1.2d, v2.2d, v1.2d
   17ca0:	4ee08480 	add	v0.2d, v4.2d, v0.2d
   17ca8:	4ee08420 	add	v0.2d, v1.2d, v0.2d
   17cac:	5ef1b800 	addp	d0, v0.2d
   18954:	6f00e400 	movi	v0.2d, #0x0
   1895c:	6f00e401 	movi	v1.2d, #0x0
   18964:	4c408d42 	ld2	{v2.2d, v3.2d}, [x10]
   18968:	4c408d84 	ld2	{v4.2d, v5.2d}, [x12]
   18974:	4ee18441 	add	v1.2d, v2.2d, v1.2d
   18978:	4ee08480 	add	v0.2d, v4.2d, v0.2d
   18980:	4ee08420 	add	v0.2d, v1.2d, v0.2d
   18984:	5ef1b800 	addp	d0, v0.2d
   19108:	6f00e400 	movi	v0.2d, #0x0
   19614:	6f00e400 	movi	v0.2d, #0x0
   197e0:	6f00e400 	movi	v0.2d, #0x0
   199e8:	6f00e400 	movi	v0.2d, #0x0
   19c1c:	6f00e400 	movi	v0.2d, #0x0
   1a160:	6f00e400 	movi	v0.2d, #0x0
   1a1a8:	6f00e400 	movi	v0.2d, #0x0
   1a324:	6f00e400 	movi	v0.2d, #0x0
   1a384:	6f00e400 	movi	v0.2d, #0x0
   1a540:	6f00e400 	movi	v0.2d, #0x0
   1a568:	6f00e400 	movi	v0.2d, #0x0
   1a58c:	6f00e400 	movi	v0.2d, #0x0
   1a5cc:	6f00e400 	movi	v0.2d, #0x0
   1a714:	6f00e400 	movi	v0.2d, #0x0
   1aa94:	6f00e400 	movi	v0.2d, #0x0
   1abd8:	6f00e400 	movi	v0.2d, #0x0
   1ae20:	6f00e400 	movi	v0.2d, #0x0
   1ae90:	6f00e400 	movi	v0.2d, #0x0
   1b160:	6f00e400 	movi	v0.2d, #0x0
   1b17c:	6f00e400 	movi	v0.2d, #0x0
   1b8a0:	6f00e400 	movi	v0.2d, #0x0
   1bb20:	4e080d20 	dup	v0.2d, x9
   1c008:	6f00e400 	movi	v0.2d, #0x0
   1d374:	6f00e400 	movi	v0.2d, #0x0
   1d388:	6f00e400 	movi	v0.2d, #0x0
   1d3a0:	6f00e400 	movi	v0.2d, #0x0
   1e384:	6f00e400 	movi	v0.2d, #0x0
   1ef74:	6f00e400 	movi	v0.2d, #0x0
   1f764:	6f00e400 	movi	v0.2d, #0x0
   1f838:	6f00e400 	movi	v0.2d, #0x0
   1f8cc:	6f00e400 	movi	v0.2d, #0x0
   203c8:	4e080d00 	dup	v0.2d, x8
   2046c:	4e080d00 	dup	v0.2d, x8
   204f4:	6f00e400 	movi	v0.2d, #0x0
   2058c:	6f00e400 	movi	v0.2d, #0x0
   20828:	6f00e400 	movi	v0.2d, #0x0
   210d8:	6f00e400 	movi	v0.2d, #0x0
   2112c:	6f00e400 	movi	v0.2d, #0x0
   211d8:	6f00e400 	movi	v0.2d, #0x0
   21cfc:	6f00e400 	movi	v0.2d, #0x0
   23594:	6f00e400 	movi	v0.2d, #0x0
   23d08:	6f00e400 	movi	v0.2d, #0x0
   23d9c:	6f00e400 	movi	v0.2d, #0x0
   23ef4:	6f00e400 	movi	v0.2d, #0x0
   24994:	6f00e400 	movi	v0.2d, #0x0
   249e8:	6f00e400 	movi	v0.2d, #0x0
   24a18:	6f00e400 	movi	v0.2d, #0x0
   24b38:	6f00e400 	movi	v0.2d, #0x0
   24cf8:	6f00e400 	movi	v0.2d, #0x0
   251ec:	6f00e400 	movi	v0.2d, #0x0
   25588:	4e080ee0 	dup	v0.2d, x23
   26258:	4e080da0 	dup	v0.2d, x13
   26284:	6f00e401 	movi	v1.2d, #0x0
   26294:	6f00e402 	movi	v2.2d, #0x0
   262c0:	4ee08421 	add	v1.2d, v1.2d, v0.2d
   262c4:	4ee08442 	add	v2.2d, v2.2d, v0.2d
   262cc:	4c008d31 	st2	{v17.2d, v18.2d}, [x9]
   262dc:	4c008d85 	st2	{v5.2d, v6.2d}, [x12]
   262e4:	4ee18440 	add	v0.2d, v2.2d, v1.2d
   262e8:	5ef1b800 	addp	d0, v0.2d
   26a5c:	6f00e400 	movi	v0.2d, #0x0
   2829c:	4e080d80 	dup	v0.2d, x12
   28418:	6f00e400 	movi	v0.2d, #0x0
   29478:	6f00e400 	movi	v0.2d, #0x0
   2a578:	6f00e400 	movi	v0.2d, #0x0
   2a5b0:	6f00e400 	movi	v0.2d, #0x0
   2abc8:	6f00e400 	movi	v0.2d, #0x0
   2abd4:	4e080da2 	dup	v2.2d, x13
   2abdc:	6f00e403 	movi	v3.2d, #0x0
   2ac08:	2f20a484 	uxtl	v4.2d, v4.2s
   2ac0c:	2f20a4a5 	uxtl	v5.2d, v5.2s
   2ac1c:	4ee48400 	add	v0.2d, v0.2d, v4.2d
   2ac20:	4ee58463 	add	v3.2d, v3.2d, v5.2d
   2ac2c:	4ee08460 	add	v0.2d, v3.2d, v0.2d
   2ac30:	5ef1b800 	addp	d0, v0.2d
   2c36c:	6f00e400 	movi	v0.2d, #0x0
   31418:	6f00e400 	movi	v0.2d, #0x0
   31bc0:	6f00e400 	movi	v0.2d, #0x0
   32aa8:	6f00e400 	movi	v0.2d, #0x0
   33d30:	6f00e400 	movi	v0.2d, #0x0
   35f74:	6f00e400 	movi	v0.2d, #0x0
   35f80:	4e080da2 	dup	v2.2d, x13
   35f88:	6f00e403 	movi	v3.2d, #0x0
   35fb4:	2f20a484 	uxtl	v4.2d, v4.2s
   35fb8:	2f20a4a5 	uxtl	v5.2d, v5.2s
   35fc8:	4ee48400 	add	v0.2d, v0.2d, v4.2d
   35fcc:	4ee58463 	add	v3.2d, v3.2d, v5.2d
   35fd8:	4ee08460 	add	v0.2d, v3.2d, v0.2d
   35fdc:	5ef1b800 	addp	d0, v0.2d
   36534:	6f00e400 	movi	v0.2d, #0x0
   36540:	4e080d82 	dup	v2.2d, x12
   36548:	6f00e403 	movi	v3.2d, #0x0
   36574:	2f20a484 	uxtl	v4.2d, v4.2s
   36578:	2f20a4a5 	uxtl	v5.2d, v5.2s
   36588:	4ee48400 	add	v0.2d, v0.2d, v4.2d
   3658c:	4ee58463 	add	v3.2d, v3.2d, v5.2d
   36598:	4ee08460 	add	v0.2d, v3.2d, v0.2d
   3659c:	5ef1b800 	addp	d0, v0.2d
   37bb0:	6f00e400 	movi	v0.2d, #0x0
   37bbc:	4e080d82 	dup	v2.2d, x12
   37bc4:	6f00e403 	movi	v3.2d, #0x0
   37bf0:	2f20a484 	uxtl	v4.2d, v4.2s
   37bf4:	2f20a4a5 	uxtl	v5.2d, v5.2s
   37bf8:	6ee44444 	ushl	v4.2d, v2.2d, v4.2d
   37bfc:	6ee54445 	ushl	v5.2d, v2.2d, v5.2d
   37c74:	6f00e400 	movi	v0.2d, #0x0
   37c80:	4e080d82 	dup	v2.2d, x12
   37c88:	6f00e403 	movi	v3.2d, #0x0
   37cb4:	2f20a484 	uxtl	v4.2d, v4.2s
   37cb8:	2f20a4a5 	uxtl	v5.2d, v5.2s
   37cbc:	6ee44444 	ushl	v4.2d, v2.2d, v4.2d
   37cc0:	6ee54445 	ushl	v5.2d, v2.2d, v5.2d
   37f8c:	6f00e400 	movi	v0.2d, #0x0
   38970:	6f00e400 	movi	v0.2d, #0x0

I think this may come down to floating-point optimizations that change results, such as fusing a multiply and an add into a single fused multiply-add (FMA) instruction. In the C++ binary I can see FMA used liberally.

Use of FMA in the C++ binary:

objdump -d main |grep fmadd |wc -l
16

Use of FMA in the Rust binary:

objdump -d main |grep fmadd |wc -l
0

Overall the Rust binary was considerably larger (3.7MB) than the corresponding C++ binary (72KB), and it makes no use of fused multiply-add instructions.

Perhaps there is a tradeoff here: each 128-bit NEON register holds two doubles, so vectorization offers a theoretical 2X speedup, while a scalar fused multiply-add performs two floating-point operations per instruction, also a theoretical 2X. This would explain why the C++ binary contains far fewer vector instructions but many more FMADD instructions.

Potential Fix

I think a potential way to investigate this further would be to build Rust against a custom LLVM and inspect the generated IR, which rustc can output with the --emit=llvm-ir option (or --emit=llvm-bc for bitcode). I may experiment with this at a later date.