I was curious about Graviton2 instances and how Rust's optimization passes perform on them. Graviton2 is an ARM AArch64 CPU, so good performance on x86 systems (Intel, AMD) does not guarantee the same on Graviton2. Since Intel and AMD have a huge combined market share and both are x86-based, it is not uncommon for optimization passes to be aggressively tested only on these systems, leaving out others. I wanted to know what this means for Rust on Graviton2.
I used an AWS c6g.xlarge system. The output of lscpu is given below:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Model: 1
BogoMIPS: 243.75
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-3
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Here I repeat a benchmark I previously did comparing equivalent Rust and C++ floating-point intensive code. For in-depth detail of the benchmark I refer you to that blog post, but I will summarize each of the benchmarked steps below:
C++ compile command: g++ -Ofast -ftree-vectorize main.cpp -o main
Rust compile command: rustc -C target-cpu=native -C opt-level=3 -O main.rs
I ran the resulting executable for problem sizes n=256 to n=16777216. I show the results in the figure below.
In my previous iteration of this benchmark on x86 systems, I found that translating loops to iterators in Rust and manually improving the accumulation ordering in the reduction resulted in code with performance equivalent to C++ with aggressive compiler optimizations. On Graviton2, however, the C++ reference implementation compiled with aggressive optimizations outperforms even my best Rust implementation by a factor of about 2X.
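The two Rust-side changes mentioned above can be sketched as follows. This is not the benchmark code from the earlier post. It is a minimal illustration on a dot product, and the function names and the LANES constant are hypothetical. The first function is a plain indexed loop, the second is its iterator translation, and the third reorders the accumulation into several independent partial sums. Because floating-point addition is not associative, the third version can differ from the first two in the last bits for general inputs, which is why the compiler does not make this change on its own.

```rust
/// Index-based loop: one accumulator, summed strictly left to right.
fn dot_loop(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len().min(b.len());
    let mut sum = 0.0;
    for i in 0..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Iterator translation: same summation order as dot_loop,
/// but without explicit indexing.
fn dot_iter(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Manually reordered reduction: LANES independent partial sums
/// that are only combined at the end.
fn dot_lanes(a: &[f64], b: &[f64]) -> f64 {
    const LANES: usize = 4; // hypothetical width, chosen for illustration
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    let (ca, cb) = (a.chunks_exact(LANES), b.chunks_exact(LANES));
    // Elements left over after the last full chunk.
    let (ra, rb) = (ca.remainder(), cb.remainder());
    let mut acc = [0.0f64; LANES];
    for (xa, xb) in ca.zip(cb) {
        for k in 0..LANES {
            acc[k] += xa[k] * xb[k];
        }
    }
    let tail: f64 = ra.iter().zip(rb).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f64>() + tail
}
```

Calling all three on the same pair of slices and comparing both the results and the timings is a quick way to see how much each step contributes on a given CPU.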
Both versions appeared to vectorize. A quick search of the binaries for uses of AArch64 NEON vector registers, matching the regexp v*\.2d (the 2d suffix indicates two double-precision floating-point numbers), found the following:
Vectorization in the C++ Binary
400d5c: 6f00e402 movi v2.2d, #0x0
400d78: 4ee0cc81 fmls v1.2d, v4.2d, v0.2d
400d7c: 6e60dc20 fmul v0.2d, v1.2d, v0.2d
400d80: 4e61cc02 fmla v2.2d, v0.2d, v1.2d
400d84: 4e60cc03 fmla v3.2d, v0.2d, v0.2d
400d90: 7e70d863 faddp d3, v3.2d
400d98: 7e70d840 faddp d0, v2.2d
400ed4: 4e080464 dup v4.2d, v3.d[0]
400eec: 4ee1cc02 fmls v2.2d, v0.2d, v1.2d
400ef0: 4e62cc81 fmla v1.2d, v4.2d, v2.2d
400fb0: 6f00e408 movi v8.2d, #0x0
400fbc: 4e60d508 fadd v8.2d, v8.2d, v0.2d
400fc8: 7e70d908 faddp d8, v8.2d
40148c: 4e080401 dup v1.2d, v0.d[0]
Vectorization in the Rust Binary
objdump -d main |grep "v*\.2d"
7e58: 4e080d01 dup v1.2d, x8
7e64: 6e61d800 ucvtf v0.2d, v0.2d
7ea4: 4ee0f800 fabs v0.2d, v0.2d
7ea8: 4e61d400 fadd v0.2d, v0.2d, v1.2d
7ec4: 4e080ee0 dup v0.2d, x23
7ec8: 4ee08421 add v1.2d, v1.2d, v0.2d
7f34: 6f00e400 movi v0.2d, #0x0
80c4: 6f00e400 movi v0.2d, #0x0
8218: 6e61dc01 fmul v1.2d, v0.2d, v1.2d
8220: 4ee1d441 fsub v1.2d, v2.2d, v1.2d
8228: 6e61dc00 fmul v0.2d, v0.2d, v1.2d
822c: 6e60dc21 fmul v1.2d, v1.2d, v0.2d
8230: 6e60dc00 fmul v0.2d, v0.2d, v0.2d
8234: 4e60d440 fadd v0.2d, v2.2d, v0.2d
8238: 4e61d461 fadd v1.2d, v3.2d, v1.2d
8254: 6e61dc01 fmul v1.2d, v0.2d, v1.2d
8258: 4ee1d441 fsub v1.2d, v2.2d, v1.2d
8260: 6e61dc00 fmul v0.2d, v0.2d, v1.2d
8264: 6e60dc21 fmul v1.2d, v1.2d, v0.2d
8268: 6e60dc00 fmul v0.2d, v0.2d, v0.2d
826c: 4e60d440 fadd v0.2d, v2.2d, v0.2d
8274: 4e61d460 fadd v0.2d, v3.2d, v1.2d
8608: 4e080401 dup v1.2d, v0.d[0]
862c: 6e65dc65 fmul v5.2d, v3.2d, v5.2d
8630: 6e64dc44 fmul v4.2d, v2.2d, v4.2d
8634: 4ee4d4c4 fsub v4.2d, v6.2d, v4.2d
863c: 4ee5d4c5 fsub v5.2d, v6.2d, v5.2d
8640: 6e64dc24 fmul v4.2d, v1.2d, v4.2d
8644: 6e65dc25 fmul v5.2d, v1.2d, v5.2d
8648: 4e64d442 fadd v2.2d, v2.2d, v4.2d
864c: 4e65d463 fadd v3.2d, v3.2d, v5.2d
91e8: 6f00e400 movi v0.2d, #0x0
fae8: 6f00e400 movi v0.2d, #0x0
fb40: 4e080d61 dup v1.2d, x11
fc30: 4e080d20 dup v0.2d, x9
128dc: 6f00e402 movi v2.2d, #0x0
12fd4: 6f00e400 movi v0.2d, #0x0
13e70: 6f00e400 movi v0.2d, #0x0
140c8: 6f00e400 movi v0.2d, #0x0
168e8: 4e080d00 dup v0.2d, x8
17434: 6f00e400 movi v0.2d, #0x0
176dc: 6f00e400 movi v0.2d, #0x0
17c7c: 6f00e400 movi v0.2d, #0x0
17c84: 6f00e401 movi v1.2d, #0x0
17c8c: 4c408d22 ld2 {v2.2d, v3.2d}, [x9]
17c90: 4c408d64 ld2 {v4.2d, v5.2d}, [x11]
17c9c: 4ee18441 add v1.2d, v2.2d, v1.2d
17ca0: 4ee08480 add v0.2d, v4.2d, v0.2d
17ca8: 4ee08420 add v0.2d, v1.2d, v0.2d
17cac: 5ef1b800 addp d0, v0.2d
18954: 6f00e400 movi v0.2d, #0x0
1895c: 6f00e401 movi v1.2d, #0x0
18964: 4c408d42 ld2 {v2.2d, v3.2d}, [x10]
18968: 4c408d84 ld2 {v4.2d, v5.2d}, [x12]
18974: 4ee18441 add v1.2d, v2.2d, v1.2d
18978: 4ee08480 add v0.2d, v4.2d, v0.2d
18980: 4ee08420 add v0.2d, v1.2d, v0.2d
18984: 5ef1b800 addp d0, v0.2d
19108: 6f00e400 movi v0.2d, #0x0
19614: 6f00e400 movi v0.2d, #0x0
197e0: 6f00e400 movi v0.2d, #0x0
199e8: 6f00e400 movi v0.2d, #0x0
19c1c: 6f00e400 movi v0.2d, #0x0
1a160: 6f00e400 movi v0.2d, #0x0
1a1a8: 6f00e400 movi v0.2d, #0x0
1a324: 6f00e400 movi v0.2d, #0x0
1a384: 6f00e400 movi v0.2d, #0x0
1a540: 6f00e400 movi v0.2d, #0x0
1a568: 6f00e400 movi v0.2d, #0x0
1a58c: 6f00e400 movi v0.2d, #0x0
1a5cc: 6f00e400 movi v0.2d, #0x0
1a714: 6f00e400 movi v0.2d, #0x0
1aa94: 6f00e400 movi v0.2d, #0x0
1abd8: 6f00e400 movi v0.2d, #0x0
1ae20: 6f00e400 movi v0.2d, #0x0
1ae90: 6f00e400 movi v0.2d, #0x0
1b160: 6f00e400 movi v0.2d, #0x0
1b17c: 6f00e400 movi v0.2d, #0x0
1b8a0: 6f00e400 movi v0.2d, #0x0
1bb20: 4e080d20 dup v0.2d, x9
1c008: 6f00e400 movi v0.2d, #0x0
1d374: 6f00e400 movi v0.2d, #0x0
1d388: 6f00e400 movi v0.2d, #0x0
1d3a0: 6f00e400 movi v0.2d, #0x0
1e384: 6f00e400 movi v0.2d, #0x0
1ef74: 6f00e400 movi v0.2d, #0x0
1f764: 6f00e400 movi v0.2d, #0x0
1f838: 6f00e400 movi v0.2d, #0x0
1f8cc: 6f00e400 movi v0.2d, #0x0
203c8: 4e080d00 dup v0.2d, x8
2046c: 4e080d00 dup v0.2d, x8
204f4: 6f00e400 movi v0.2d, #0x0
2058c: 6f00e400 movi v0.2d, #0x0
20828: 6f00e400 movi v0.2d, #0x0
210d8: 6f00e400 movi v0.2d, #0x0
2112c: 6f00e400 movi v0.2d, #0x0
211d8: 6f00e400 movi v0.2d, #0x0
21cfc: 6f00e400 movi v0.2d, #0x0
23594: 6f00e400 movi v0.2d, #0x0
23d08: 6f00e400 movi v0.2d, #0x0
23d9c: 6f00e400 movi v0.2d, #0x0
23ef4: 6f00e400 movi v0.2d, #0x0
24994: 6f00e400 movi v0.2d, #0x0
249e8: 6f00e400 movi v0.2d, #0x0
24a18: 6f00e400 movi v0.2d, #0x0
24b38: 6f00e400 movi v0.2d, #0x0
24cf8: 6f00e400 movi v0.2d, #0x0
251ec: 6f00e400 movi v0.2d, #0x0
25588: 4e080ee0 dup v0.2d, x23
26258: 4e080da0 dup v0.2d, x13
26284: 6f00e401 movi v1.2d, #0x0
26294: 6f00e402 movi v2.2d, #0x0
262c0: 4ee08421 add v1.2d, v1.2d, v0.2d
262c4: 4ee08442 add v2.2d, v2.2d, v0.2d
262cc: 4c008d31 st2 {v17.2d, v18.2d}, [x9]
262dc: 4c008d85 st2 {v5.2d, v6.2d}, [x12]
262e4: 4ee18440 add v0.2d, v2.2d, v1.2d
262e8: 5ef1b800 addp d0, v0.2d
26a5c: 6f00e400 movi v0.2d, #0x0
2829c: 4e080d80 dup v0.2d, x12
28418: 6f00e400 movi v0.2d, #0x0
29478: 6f00e400 movi v0.2d, #0x0
2a578: 6f00e400 movi v0.2d, #0x0
2a5b0: 6f00e400 movi v0.2d, #0x0
2abc8: 6f00e400 movi v0.2d, #0x0
2abd4: 4e080da2 dup v2.2d, x13
2abdc: 6f00e403 movi v3.2d, #0x0
2ac08: 2f20a484 uxtl v4.2d, v4.2s
2ac0c: 2f20a4a5 uxtl v5.2d, v5.2s
2ac1c: 4ee48400 add v0.2d, v0.2d, v4.2d
2ac20: 4ee58463 add v3.2d, v3.2d, v5.2d
2ac2c: 4ee08460 add v0.2d, v3.2d, v0.2d
2ac30: 5ef1b800 addp d0, v0.2d
2c36c: 6f00e400 movi v0.2d, #0x0
31418: 6f00e400 movi v0.2d, #0x0
31bc0: 6f00e400 movi v0.2d, #0x0
32aa8: 6f00e400 movi v0.2d, #0x0
33d30: 6f00e400 movi v0.2d, #0x0
35f74: 6f00e400 movi v0.2d, #0x0
35f80: 4e080da2 dup v2.2d, x13
35f88: 6f00e403 movi v3.2d, #0x0
35fb4: 2f20a484 uxtl v4.2d, v4.2s
35fb8: 2f20a4a5 uxtl v5.2d, v5.2s
35fc8: 4ee48400 add v0.2d, v0.2d, v4.2d
35fcc: 4ee58463 add v3.2d, v3.2d, v5.2d
35fd8: 4ee08460 add v0.2d, v3.2d, v0.2d
35fdc: 5ef1b800 addp d0, v0.2d
36534: 6f00e400 movi v0.2d, #0x0
36540: 4e080d82 dup v2.2d, x12
36548: 6f00e403 movi v3.2d, #0x0
36574: 2f20a484 uxtl v4.2d, v4.2s
36578: 2f20a4a5 uxtl v5.2d, v5.2s
36588: 4ee48400 add v0.2d, v0.2d, v4.2d
3658c: 4ee58463 add v3.2d, v3.2d, v5.2d
36598: 4ee08460 add v0.2d, v3.2d, v0.2d
3659c: 5ef1b800 addp d0, v0.2d
37bb0: 6f00e400 movi v0.2d, #0x0
37bbc: 4e080d82 dup v2.2d, x12
37bc4: 6f00e403 movi v3.2d, #0x0
37bf0: 2f20a484 uxtl v4.2d, v4.2s
37bf4: 2f20a4a5 uxtl v5.2d, v5.2s
37bf8: 6ee44444 ushl v4.2d, v2.2d, v4.2d
37bfc: 6ee54445 ushl v5.2d, v2.2d, v5.2d
37c74: 6f00e400 movi v0.2d, #0x0
37c80: 4e080d82 dup v2.2d, x12
37c88: 6f00e403 movi v3.2d, #0x0
37cb4: 2f20a484 uxtl v4.2d, v4.2s
37cb8: 2f20a4a5 uxtl v5.2d, v5.2s
37cbc: 6ee44444 ushl v4.2d, v2.2d, v4.2d
37cc0: 6ee54445 ushl v5.2d, v2.2d, v5.2d
37f8c: 6f00e400 movi v0.2d, #0x0
38970: 6f00e400 movi v0.2d, #0x0
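I think this may come down to floating-point optimizations that change results, such as the ability to use fused multiply-add (FMA). The C++ listing above already contains the vector fused forms fmla and fmls, while the Rust listing contains only separate fmul, fadd and fsub instructions. In the C++ binary I can also see scalar FMA used liberally.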
Use of FMA in the C++ binary:
objdump -d main |grep fmadd |wc -l
16
Use of FMA in the Rust binary:
objdump -d main |grep fmadd |wc -l
0
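To illustrate why fusing counts as an optimization that changes results, here is a small sketch. The function name and the input values are my own and are not taken from the benchmark. It computes a*b + c once as a separate multiply and add, which rounds twice, and once with f64::mul_add from the Rust standard library, which rounds once.

```rust
/// Returns (unfused, fused) for a*b + c.
fn fused_vs_unfused(a: f64, b: f64, c: f64) -> (f64, f64) {
    // The product is rounded to f64 first, then the sum is rounded again.
    let unfused = a * b + c;
    // The exact value of a*b + c is rounded a single time.
    let fused = a.mul_add(b, c);
    (unfused, fused)
}

/// Hypothetical inputs chosen so the two roundings visibly disagree:
/// (1 + e) * (1 - e) = 1 - e^2, and with e = 2^-30 the product
/// rounds to exactly 1.0 in double precision.
fn demo_inputs() -> (f64, f64, f64) {
    let e = 1.0 / (1u64 << 30) as f64; // 2^-30, exactly representable
    (1.0 + e, 1.0 - e, -1.0)
}
```

With these inputs the unfused result is 0.0, while the fused result is -2^-60 (about -8.7e-19). Since the two forms are not bit-identical, a compiler that preserves strict floating-point semantics cannot silently replace one with the other.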
Overall, the Rust binary was much larger (3.7MB) than the corresponding C++ binary (72KB), and it does not make use of fused multiply-add instructions.
Perhaps there is a tradeoff here, where the 2X theoretical speedup of vectorization is overcome by the 2X theoretical speedup of a (scalar) fused multiply-add. This would explain why the C++ binary has far fewer vector instructions but many more FMADD instructions.
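One way to probe this hypothesis from the Rust side, without touching the compiler, would be to request the fused operation explicitly in the reduction. The sketch below is an assumption about how such an experiment could look and is not code from the benchmark. The function name is hypothetical, and f64::mul_add is the standard library method that computes self * a + b with a single rounding. Whether the compiler then emits scalar fmadd or vector fmla for it is something to check in the disassembly rather than assume.

```rust
/// Dot product in which every multiply-add step is an explicit fused operation.
/// The accumulation order is strictly left to right.
fn dot_fma(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        // x.mul_add(y, acc) computes x * y + acc with one rounding.
        .fold(0.0, |acc, (&x, &y)| x.mul_add(y, acc))
}
```

For example, dot_fma(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]) returns 32.0. On AArch64 a fused multiply-add is part of the base floating-point instruction set. On targets without hardware FMA support, mul_add may fall back to a much slower software routine, so this variant is not a portable speedup.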
I think a potential way to fix this would be to build Rust with a custom LLVM and to have rustc output LLVM bitcode with the --emit=llvm-bc option. I may experiment with this at a later date.