Here I document my (partially successful) attempts to get Rust to perform at parity with C++ on a floating-point benchmark of my own devising, run on an AWS Graviton2 instance. This follows some observations about Graviton2 I made in my previous post, where I suggested that the ARM fused multiply-add (FMA) instruction offered a bigger performance opportunity than vectorization did. Exploiting it automatically, however, requires a fast-math style compiler flag, which Rust currently does not have because such optimizations can change the bitwise result of floating-point math relative to the unoptimized reference. To work around this, I used Rust's ability to emit LLVM bitcode, ran my own optimization passes over it using the LLVM infrastructure directly, and then linked everything back together into an executable. I will present those results after setting the stage with the benchmark and the system I ran it on. The appendices at the end of this post cover details of the benchmark and of running custom LLVM passes on the Graviton2 system.
I used an AWS c6g.xlarge instance. The output of lscpu is below:
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Model: 1
BogoMIPS: 243.75
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-3
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Here I repeat a benchmark I previously ran comparing equivalent floating-point-intensive Rust and C++ code. For in-depth detail of the benchmark I refer you to that blog post, but the benchmarked steps were:

1. The C++ reference, compiled with g++ -Ofast -ftree-vectorize main.cpp -o main
2. A naive Rust translation, compiled with rustc -C target-cpu=native -C opt-level=3 -O main.rs
3. A Rust translation with loops converted to iterators (same rustc flags)
4. A Rust translation with iterators and a partially unrolled reduction (same rustc flags)
5. The Rust code compiled with rustc --emit=llvm-ir to get LLVM IR, which I then ran through --ffast-math-like optimization passes

For step (5), the flags I passed to LLVM's opt were: --enable-no-signed-zeros-fp-math --fp-contract=fast --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math -O3. See the appendices for more details on the mechanics of actually producing a fully linked executable on the Graviton2 system (it was not very straightforward).
I ran the resulting executables for problem sizes n=256 to n=16777216. I show results in the figure below.
For easier comparison to the C++ reference I divided each timing by the C++ reference timing, resulting in a slowdown plot: the C++ reference sits at exactly 1.0, and anything above 1.0 is slower than C++ by that factor.
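As a small illustration of that normalization (a hypothetical helper, not part of the benchmark code itself):

```rust
// Hypothetical helper illustrating the slowdown normalization used
// in the plots: each timing is divided by the C++ reference timing,
// so the reference plots at exactly 1.0 and larger values mean
// "slower than C++ by that factor".
fn slowdowns(cpp_reference: f64, timings: &[f64]) -> Vec<f64> {
    timings.iter().map(|t| t / cpp_reference).collect()
}
```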
Since Rust and Clang are both frontends to LLVM, it is also useful to run these experiments with clang++ instead of g++. I show these results below.
Absolute timing plot
Slowdowns plot
The story is a little mixed comparing against clang++. Rust outperforms for some problem sizes, and underperforms for others.
In my previous post on Graviton2 I stated the following:
I _think_ this may come down to floating point
optimizations that change results, such as the ability
to use fused multiply add. On the C++ binary I can see
FMA used liberally
[...]
[...]
Perhaps there is a tradeoff made here where the `2X`
theoretical speedup of vectorization is overcome by the
`2X` theoretical speedup of a (scalar) fused-multiply-add.
This would explain why the C++ binary has much fewer vector
instructions but much more FMADD instructions.
I think the results here do lend some evidence in favor of my FMA theory. For this particular application on Graviton2, FMA appears to be far more important for optimal performance than vectorization is (in contrast to Intel systems with AVX-512, where FMA did not help as much). Unfortunately there is not yet a way for the Rust compiler to automatically emit FMA, even though the underlying LLVM tools can.
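There is a manual escape hatch, though: f64::mul_add computes a*b + c with a single rounding and lowers to an FMADD instruction on AArch64. Below is a minimal sketch of how the benchmark's inner reduction could be written that way by hand. This is a hypothetical variant for illustration, not the code I benchmarked, and like fast-math it changes the rounding of the results:

```rust
// Hypothetical manual-FMA variant of the benchmark's inner reduction.
// f64::mul_add(a, b) computes self*a + b with one rounding step and
// maps to AArch64 FMADD, which is what the fast-math passes emit
// automatically.
fn reduce(a: &[f64], b: &[f64], c: &[f64]) -> (f64, f64) {
    let mut beta = 0.0;
    let mut r = 0.0;
    for ((ai, bi), ci) in a.iter().zip(b).zip(c) {
        // res = ci - ai*bi, expressed as a single fused (-ai)*bi + ci
        let res = (-ai).mul_add(*bi, *ci);
        let ares = ai * res;
        // beta += ares*ares and r += res*ares, each as one FMA
        beta = ares.mul_add(ares, beta);
        r = res.mul_add(ares, r);
    }
    (beta, r)
}
```

The drawback is that this litters the numerical kernel with intrinsics-style calls, whereas a fast-math flag would let the straightforward arithmetic stay readable.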
I was able to improve performance over manually tuned Rust code by directly emitting LLVM bitcode and then using the LLVM infrastructure to optimize and link the result. This let me apply more aggressive floating-point optimizations, in particular ones that emit FMA operations, something Rust conservatively does not allow right now. Unfortunately, even after doing this, the resulting code was still as much as 1.2X slower than the C++ reference, depending on the input data size.
To make this post more self-contained I include some details here of how I did everything including the sources for the benchmarks.
C++ reference
#include <vector>
#include <chrono>
#include <cmath>
#include <cstdlib>
#include <iostream>

using vec = std::vector<double>;
using namespace std::chrono;

int main(int argc, char** argv){
    if(argc < 2){
        std::cout << "Usage: ./cplusplus n" << std::endl;
        return 1;
    }
    /* Size of the arrays to iterate with. */
    size_t n = std::atoi(argv[1]);
    /* The minimum number of runs to do. */
    size_t nruns = 100;
    /* Allocate and initialize data to
     * be used in iterations. */
    vec a(n, 0.0);
    vec b(n, 0.0);
    vec c(n, 0.0);
    for(size_t i = 0; i < n; i++){
        a[i] = std::abs(std::sin(double(i))) + 0.00001;
        b[i] = std::cos(double(i));
        c[i] = std::cos(double(i));
    }
    /* How many iterations we did. */
    size_t count = 0;
    /* Initialize timers. */
    steady_clock::time_point t1 = steady_clock::now();
    steady_clock::time_point t2 = steady_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    /* Iterate until approximately 1 second has elapsed. */
    while(time_span.count() <= 1.0){
        count += 1;
        for(size_t r_ = 0; r_ < nruns; r_++){
            double beta = 0.0;
            double r = 0.0;
            for(size_t i = 0; i < n; i++){
                r += (c[i] - a[i]*b[i]) * a[i] * (c[i] - a[i]*b[i]);
                beta += a[i]*(c[i] - a[i]*b[i]) * a[i]*(c[i] - a[i]*b[i]);
            }
            for(size_t i = 0; i < n; i++){
                b[i] = b[i] + (r/beta)*(c[i] - a[i]*b[i]);
            }
        }
        t2 = steady_clock::now();
        time_span = duration_cast<duration<double>>(t2 - t1);
    }
    /* Normalized time = elapsed_time/(problem_size*count*nruns). */
    std::cout << "Normalized Average time = " << (time_span.count()/(n*count*nruns)) << std::endl;
    /* Some compilers may recognize we don't use
     * a, b, or c for anything and elide the entire
     * above computation. Thus I take a summary
     * value and print it out. */
    double sumb = 0.0;
    for(size_t i = 0; i < n; i++){
        sumb += b[i];
    }
    std::cout << "sumb=" << sumb << std::endl;
}
Naive Rust translation
use std::env;
use std::time::Instant;

fn main(){
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();
    let nruns = 100;
    let mut a: Vec<f64> = vec![0.0; n];
    let mut b: Vec<f64> = vec![0.0; n];
    let mut c: Vec<f64> = vec![0.0; n];
    for i in 0..n {
        a[i] = (i as f64).sin().abs() + 0.00001;
        b[i] = (i as f64).cos();
        c[i] = (i as f64).cos();
    }
    let mut count: usize = 0;
    let now = Instant::now();
    while now.elapsed().as_secs_f64() <= 1.0 {
        count += 1;
        for _ in 0..nruns {
            let mut beta = 0.0;
            let mut r = 0.0;
            for i in 0..n {
                r += (c[i] - a[i]*b[i]) * a[i] * (c[i] - a[i]*b[i]);
                beta += a[i]*(c[i] - a[i]*b[i]) * a[i]*(c[i] - a[i]*b[i]);
            }
            for i in 0..n {
                b[i] = b[i] + (r/beta)*(c[i] - a[i]*b[i]);
            }
        }
    }
    println!("Normalized Average time = {}", now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));
    let mut sumb = 0.0;
    for i in 0..n {
        sumb += b[i];
    }
    println!("sumb={}", sumb);
}
Rust translation with loops converted to iterators
use std::env;
use std::time::Instant;

fn main(){
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();
    let nruns = 100;
    let mut a: Vec<f64> = vec![0.0; n];
    let mut b: Vec<f64> = vec![0.0; n];
    let mut c: Vec<f64> = vec![0.0; n];
    for i in 0..n {
        a[i] = (i as f64).sin().abs() + 0.00001;
        b[i] = (i as f64).cos();
        c[i] = (i as f64).cos();
    }
    let mut count: usize = 0;
    let now = Instant::now();
    while now.elapsed().as_secs_f64() <= 1.0 {
        count += 1;
        for _ in 0..nruns {
            let (beta, r) = a.iter().zip(&b).zip(&c).fold((0.0, 0.0),
                |acc, x| {
                    let ((ai, bi), ci) = x;
                    let (beta_tmp, r_tmp) = acc;
                    let res = ci - ai*bi;
                    let ares = ai*res;
                    (beta_tmp + ares*ares, r_tmp + res*ares)
                });
            let rinvbeta = r/beta;
            for ((ai, bi), ci) in a.iter().zip(b.iter_mut()).zip(&c) {
                *bi = *bi + rinvbeta*(ci - ai*(*bi));
            }
        }
    }
    println!("Normalized Average time = {}", now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));
    let mut sumb = 0.0;
    for i in 0..n {
        sumb += b[i];
    }
    println!("sumb={}", sumb);
}
Rust translation with loops converted to iterators and partially unrolled reduction
use std::env;
use std::time::Instant;

fn main(){
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();
    let nruns = 100;
    const CHUNKSIZE: usize = 32;
    let mut a: Vec<f64> = vec![0.0; n];
    let mut b: Vec<f64> = vec![0.0; n];
    let mut c: Vec<f64> = vec![0.0; n];
    for i in 0..n {
        a[i] = (i as f64).sin().abs() + 0.00001;
        b[i] = (i as f64).cos();
        c[i] = (i as f64).cos();
    }
    let mut count: usize = 0;
    let now = Instant::now();
    let mut beta_vec: [f64; CHUNKSIZE] = [0.0; CHUNKSIZE];
    let mut r_vec: [f64; CHUNKSIZE] = [0.0; CHUNKSIZE];
    while now.elapsed().as_secs_f64() <= 1.0 {
        count += 1;
        for _ in 0..nruns {
            //Initialize partial reduction arrays
            for bv in beta_vec.iter_mut(){ *bv = 0.0; }
            for rv in r_vec.iter_mut(){ *rv = 0.0; }
            //Form iterator over chunks of
            //input arrays
            let outer_iter =
                a.chunks_exact(CHUNKSIZE)
                 .zip(b.chunks_exact(CHUNKSIZE))
                 .zip(c.chunks_exact(CHUNKSIZE));
            //Get remainder iterator
            let outer_iter_remainder =
                a.chunks_exact(CHUNKSIZE).remainder().iter()
                 .zip(b.chunks_exact(CHUNKSIZE).remainder().iter())
                 .zip(c.chunks_exact(CHUNKSIZE).remainder().iter());
            //Loop over all chunks and form partial reductions
            for ((avec, bvec), cvec) in outer_iter {
                let inner_iter = avec.iter()
                    .zip(bvec.iter())
                    .zip(cvec.iter())
                    .zip(beta_vec.iter_mut())
                    .zip(r_vec.iter_mut());
                for ((((ai, bi), ci), betai), ri) in inner_iter {
                    let res = ci - ai*bi;
                    let ares = ai*res;
                    *betai += ares*ares;
                    *ri += res*ares;
                }
            }
            //Form remainder reduction
            let mut beta = 0.0;
            let mut r = 0.0;
            for ((ai, bi), ci) in outer_iter_remainder {
                let res = ci - ai*bi;
                let ares = ai*res;
                beta += ares*ares;
                r += res*ares;
            }
            //Loop over partial reductions to form final reduction
            beta += beta_vec.iter().fold(0.0, |acc, x| acc + x);
            r += r_vec.iter().fold(0.0, |acc, x| acc + x);
            let rinvbeta = r/beta;
            for ((ai, bi), ci) in a.iter().zip(b.iter_mut()).zip(&c) {
                *bi = *bi + rinvbeta*(ci - ai*(*bi));
            }
        }
    }
    println!("Normalized Average time = {}", now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));
    let mut sumb = 0.0;
    for i in 0..n {
        sumb += b[i];
    }
    println!("sumb={}", sumb);
}
Getting custom LLVM passes working on an AWS Graviton2 instance proved more challenging than I initially expected: the Amazon AArch64 AMI ships very old system libraries and GNU utilities, and I ran into many dependency issues. Here is a rough outline of what I did:

- I built ncurses, glibc, and make from source, because the versions the AMI came with were too old.
- For ncurses you need to configure with ./configure --with-termlib --with-shared --with-versioned-syms --with-abi-version=5
- With ncurses and glibc built, I added libm.so and libtinfo.so.5 to LD_PRELOAD so that they come before the system equivalents.
- After that, clang, opt, llvm-as, and llc should all execute without dependency complaints.
- I built Rust from source against this LLVM via ./configure --llvm-root=/path/to/llvm; you must then set codegen-tests=false in config.toml before running x.py, to avoid a dependency on FileCheck, which Rust uses in its build.
- The resulting rustc binary will be under build/stage1, and the libraries you have to link against under build/stage1-std (see the Makefile below).

I copy/paste a sample Makefile which takes a main.rs all the way through its LLVM transformations and final linking:
RUST=/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1/bin/rustc
RUSTFLAGS=-C target-cpu=native -C opt-level=3 --emit=llvm-ir
LLC=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/llc
LLAS=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/llvm-as
CLANG=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/clang
OPT=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/opt
RUSTLIB=/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1-std/aarch64-unknown-linux-gnu/release
main : main.o
	$(CLANG) -L$(RUSTLIB) main.o -lm -lstd -o main

main.o : main.bc opt
	$(LLC) -filetype=obj main.bc

opt : main.bc
	$(OPT) --enable-no-signed-zeros-fp-math --fp-contract=fast --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math -O3 main.bc -o main.bc

main.bc : main.ll
	$(LLAS) main.ll

main.ll : main.rs
	$(RUST) $(RUSTFLAGS) -O main.rs

.PHONY : clean
clean :
	rm -rf ./main
	rm -rf ./main.ll
	rm -rf ./main.s
	rm -rf ./main.o
	rm -rf ./main.bc
When all was said and done, these were the resulting versions of rust and llvm:
>/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1/bin/rustc --version
rustc 1.60.0-dev
Rust came from commit 02fe61b381c2dedc0071e1aacfbe91e0bad1f656 of its repository.
>/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/clang --version
clang version 13.0.0 (http://git.linaro.org/toolchain/jenkins-scripts.git d04b0fafc2d906fd9b2e8e55efb35c9cf7114e68)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/ec2-user/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin