Rust Underperforms g++ On Graviton2 for Floating-Point Intensive Code

January 12, 2022

Here I document my (partially successful) attempts to get Rust to perform at parity with C++ on a floating-point benchmark of my own devising on an AWS Graviton2 instance. This follows some observations I made about Graviton2 in my previous post, where I suggested that the ARM fused multiply-add (FMA) operation provided a better performance opportunity than vectorization did. Exploiting it automatically, however, requires a fast-math style compiler flag, which Rust currently does not have because such optimizations can change the bitwise result of floating-point math compared to the unoptimized reference. To work around this, I used Rust's ability to emit LLVM IR so that I could run my own optimization passes using the LLVM infrastructure directly, then link everything back together into an executable. I will demonstrate results for that after setting the stage with the benchmark and the system I ran it on. Note that the end of this blog post contains appendices on some details of the benchmark and of how I ran custom LLVM passes on the Graviton2 system.
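
Throughout this post "FMA" means the fused multiply-add a*b + c evaluated with a single rounding. Rust does expose this explicitly and stably via f64::mul_add; the problem is only that rustc will not substitute it automatically, since doing so changes bitwise results. A minimal sketch of the distinction:

fn main() {
    let (a, b, c) = (1.1_f64, 2.2, -3.3);
    // Two roundings: a*b is rounded to f64, then the addition is rounded again.
    let separate = a * b + c;
    // One rounding: on hardware with FMA (Graviton2 included) this can
    // lower to a single fmadd instruction.
    let fused = a.mul_add(b, c);
    println!("separate = {:e}, fused = {:e}", separate, fused);
}

Hand-writing mul_add everywhere is an option, but it is exactly the kind of result-changing rewrite I want the compiler to perform for me.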

The Graviton2 System

I used an AWS c6g.xlarge instance. The output of lscpu is given below:

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4   
On-line CPU(s) list: 0-3 
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               1
BogoMIPS:            243.75
L1d cache:           64K 
L1i cache:           64K 
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-3 
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

The Benchmark

Here I repeat a benchmark I previously ran comparing equivalent Rust and C++ floating-point intensive code. For in-depth detail on the benchmark I refer you to that blog post, but I will summarize each of the benchmarked steps below:

  1. C++ reference compiled with G++10 (g++ -Ofast -ftree-vectorize main.cpp -o main)
  2. Rust reference implementation (direct translation of (1)) (rustc -C target-cpu=native -C opt-level=3 -O main.rs)
  3. Same as (2) but with loops translated to iterators
  4. Same as (3) but partially unrolled the reduction to allow compiler to vectorize
  5. Same as (4) but used rustc --emit=llvm-ir to get LLVM IR, which I then ran through fast-math-style optimization passes with LLVM's opt

For step (5) above, the flags I passed to LLVM's opt were: --enable-no-signed-zeros-fp-math --fp-contract=fast --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math -O3. See the appendices for more details on the mechanics of actually producing a fully linked executable on the Graviton2 system (it was not very straightforward).

I ran the resulting executables for problem sizes n=256 to n=16777216. I show results in the figure below.

For easier comparison to the C++ reference I divided each implementation's timing by the C++ reference timing, giving a slowdown plot (slowdown = t_implementation / t_C++). The C++ reference sits at 1.0, and anything above 1.0 is slower than C++ by that factor.
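
Concretely, each point in the slowdown plot is just that implementation's normalized time divided by the C++ normalized time. A trivial sketch of the computation (the numbers here are made up for illustration):

fn main() {
    let t_cpp = 1.00e-9; // normalized C++ reference time (illustrative)
    let t_rust = [1.25e-9, 1.12e-9, 1.05e-9]; // the Rust variants (illustrative)
    // Slowdown relative to the C++ reference; 1.0 means parity.
    let slowdowns: Vec<f64> = t_rust.iter().map(|t| t / t_cpp).collect();
    println!("{:?}", slowdowns);
}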

Comparing against clang++

Since Rust and Clang are both frontends to LLVM, it's also useful to run these experiments with clang++ instead of g++. I show these results below:

Absolute timing plot

Slowdowns plot

The story is a little mixed when comparing against clang++: Rust outperforms for some problem sizes and underperforms for others.

Rust Underperforms on AArch64 (For Now)

In my previous post on Graviton2 I stated the following:

I _think_ this may come down to floating point 
optimizations that change results, such as the ability 
to use fused multiply add. On the C++ binary I can see 
FMA used liberally
[...]
[...]
Perhaps there is a tradeoff made here where the  `2X` 
theoretical speedup of vectorization is overcome by the 
`2X` theoretical speedup of a (scalar) fused-multiply-add. 
This would explain why the C++ binary has much fewer vector 
instructions but much more FMADD instructions.

I think the results here do lend some evidence in favor of my FMA theory. For this particular application on Graviton2, FMA appears to be far more important for optimal performance than vectorization is (in contrast to Intel systems with AVX-512, on which FMA did not help as much). Unfortunately there is not yet a way to get the Rust compiler to issue FMA automatically, even though the underlying LLVM tools can.
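
To make the FMA opportunity concrete: the benchmark's residual ci - ai*bi is a multiply-add in disguise, and so are both accumulations. Below is a sketch of what a hand-contracted inner reduction could look like using the stable f64::mul_add; this is my own illustration of the contraction, not output from opt, and I did not benchmark this variant:

// Compute the (beta, r) reduction with explicit fused multiply-adds.
fn reduce(a: &[f64], b: &[f64], c: &[f64]) -> (f64, f64) {
    let (mut beta, mut r) = (0.0_f64, 0.0_f64);
    for ((ai, bi), ci) in a.iter().zip(b).zip(c) {
        // ci - ai*bi == (-ai)*bi + ci: one fused multiply-add.
        let res = (-ai).mul_add(*bi, *ci);
        let ares = ai * res;
        // Each accumulation is itself a multiply-add.
        beta = ares.mul_add(ares, beta);
        r = res.mul_add(ares, r);
    }
    (beta, r)
}

With --fp-contract=fast, the LLVM passes are free to perform this kind of contraction on the ordinary a*b + c forms automatically; without it, rustc must preserve the separately rounded operations.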

Conclusions

I was able to improve performance over manually tuned Rust code by emitting LLVM IR from rustc and then using the LLVM infrastructure to optimize, assemble, and link the result. This let me apply more aggressive floating-point optimizations, in particular ones that allow emitting FMA operations, something Rust conservatively does not do automatically right now. Unfortunately, even after doing this, the resulting code was still as much as 1.2X slower than the C++ reference, depending on the input data size.

Appendices

To make this post more self-contained, I include some details here of how I did everything, including the sources for the benchmarks.

Benchmark Sources

C++ reference

#include <vector>
#include <chrono>
#include <cmath>
#include <iostream>


using vec = std::vector<double>;
using namespace std::chrono;
int main(int argc,char** argv){
  if(argc<2){
    std::cout<<"Usage: ./cplusplus n"<<std::endl;
    return 1;
  }

  /*Size of the arrays to iterate with.*/
  size_t n=std::atoi(argv[1]);
  /*The minimum number of runs to do.*/
  size_t nruns=100;

  /*Allocate and initialize data to 
   * be used in iterations.*/
  vec a(n,0.0);
  vec b(n,0.0);
  vec c(n,0.0);
  for(size_t i=0;i<n;i++){
    a[i]=std::abs( std::sin( double(i) ) )+0.00001;
    b[i]=std::cos( double(i) );
    c[i]=std::cos( double(i) );
  }

  /*How many iterations we did.*/
  size_t count=0;
  /*Initialize timers.*/
  steady_clock::time_point t1 = steady_clock::now();
  steady_clock::time_point t2 = steady_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  /*Iterate until approximately 1 second has elapsed.*/
  while(time_span.count()<=1.0){
    count+=1;
    for(size_t r_=0;r_<nruns;r_++){
      double beta=0.0;
      double r=0.0;
      for(size_t i=0;i<n;i++){
        r+=(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
        beta+=a[i]*(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
      }
      for(size_t i=0;i<n;i++){
        b[i]=b[i]+(r/beta)*(c[i]-a[i]*b[i]);
      }
    }
    t2 = steady_clock::now();
    time_span = duration_cast<duration<double>>(t2 - t1);
  }
  /*Normalized time = elapsed_time/(problem_size*count*nruns).*/
  std::cout<<"Normalized Average time = "<<(time_span.count()/(n*count*nruns))<<std::endl;

  /*Some compilers may recognize we don't use
   * a,b, or c for anything and elide the entire
   * above computation. Thus I take a summary
   * value and print it out.*/
  double sumb=0.0;
  for(size_t i=0;i<n;i++){
    sumb+=b[i];
  }
  std::cout<<"sumb="<<sumb<<std::endl;
}

Naive Rust translation

use std::env;
fn main(){
    use std::time::{Instant};

    let args: Vec<String> = env::args().collect();
    let n=args[1].parse::<usize>().unwrap();
    let nruns=100;

    let mut a : Vec<f64> = vec![0.0;n];
    let mut b : Vec<f64> = vec![0.0;n];
    let mut c : Vec<f64> = vec![0.0;n];
    for i in 0..n{
        a[i]=(i as f64).sin().abs()+0.00001;
        b[i]=(i as f64).cos();
        c[i]=(i as f64).cos();
    }

    let mut count : usize =0;
    let now = Instant::now();
    while now.elapsed().as_secs_f64()<=1.0 {
        count+=1;
        for _ in 0..nruns{
            let mut beta=0.0;
            let mut r=0.0;
            for i in 0..n{
                r+=(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
                beta+=a[i]*(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
            }
            for i in 0..n{
                b[i]=b[i]+(r/beta)*(c[i]-a[i]*b[i])
            }
        }
    }
    println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

    let mut sumb=0.0;
    for i in 0..n{
        sumb+=b[i];
    }

    println!("sumb={}",sumb);
}

Rust translation with loops converted to iterators

use std::env;
fn main(){
    use std::time::{Instant};

    let args: Vec<String> = env::args().collect();
    let n=args[1].parse::<usize>().unwrap();
    let nruns=100;

    let mut a : Vec<f64> = vec![0.0;n];
    let mut b : Vec<f64> = vec![0.0;n];
    let mut c : Vec<f64> = vec![0.0;n];
    for i in 0..n{
        a[i]=(i as f64).sin().abs()+0.00001;
        b[i]=(i as f64).cos();
        c[i]=(i as f64).cos();
    }

    let mut count : usize =0;
    let now = Instant::now();
    while now.elapsed().as_secs_f64()<=1.0 {
        count+=1;
        for _ in 0..nruns{

            let (beta,r) = (&a).iter().zip(&b).zip(&c).fold( (0.0,0.0),
            |acc,x|{
                let ((ai,bi),ci)=x;
                let (beta_tmp,r_tmp) = acc;
                let res = ci - ai*bi;
                let ares = ai*res;
                (beta_tmp+ares*ares,r_tmp+res*ares)
            });

            let rinvbeta = r/beta;

            for ((ai,bi),ci) in (&a).iter().zip(b.iter_mut()).zip(&c) {
                *bi = *bi + rinvbeta*(ci-ai*(*bi));
            }
        }
    }
    println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

    let mut sumb=0.0;
    for i in 0..n{
        sumb+=b[i];
    }

    println!("sumb={}",sumb);
}

Rust translation with loops converted to iterators and partially unrolled reduction

use std::env;
fn main(){
    use std::time::{Instant};

    let args: Vec<String> = env::args().collect();
    let n=args[1].parse::<usize>().unwrap();
    let nruns=100;
    const CHUNKSIZE : usize  = 32;

    let mut a : Vec<f64> = vec![0.0;n];
    let mut b : Vec<f64> = vec![0.0;n];
    let mut c : Vec<f64> = vec![0.0;n];
    for i in 0..n{
        a[i]=(i as f64).sin().abs()+0.00001;
        b[i]=(i as f64).cos();
        c[i]=(i as f64).cos();
    }

    let mut count : usize =0;
    let now = Instant::now();

    let mut beta_vec : [f64;CHUNKSIZE] = [0.0;CHUNKSIZE];
    let mut r_vec : [f64;CHUNKSIZE] = [0.0;CHUNKSIZE];
    while now.elapsed().as_secs_f64()<=1.0 {
        count+=1;
        for _ in 0..nruns{

            //Initialize partial reduction arrays
            for bv in beta_vec.iter_mut(){ *bv=0.0; }
            for rv in (r_vec).iter_mut(){ *rv=0.0; }

            //Form iterator over chunks of 
            //input arrays
            let outer_iter = 
                (&a).chunks_exact(CHUNKSIZE)
                .zip( (&b).chunks_exact(CHUNKSIZE))
                .zip( (&c).chunks_exact(CHUNKSIZE));
            //Get remainder iterator
            let outer_iter_remainder = 
                (&a).chunks_exact(CHUNKSIZE).remainder().iter()
                .zip( (&b).chunks_exact(CHUNKSIZE).remainder().iter())
                .zip( (&c).chunks_exact(CHUNKSIZE).remainder().iter());



            //Loop over all chunks and form partial reductions
            for ((avec,bvec),cvec) in outer_iter{
                let inner_iter = avec.iter()
                    .zip(bvec.iter())
                    .zip(cvec.iter())
                    .zip(beta_vec.iter_mut())
                    .zip(r_vec.iter_mut());

                for ((((ai,bi),ci),betai),ri) in inner_iter{
                    let res = ci - ai*bi;
                    let ares = ai*res;
                    *betai += ares*ares;
                    *ri += res*ares;
                }
            }
            //Form remainder reduction
            let mut beta = 0.0;
            let mut r = 0.0;
            for ((ai,bi),ci) in outer_iter_remainder {
                let res = ci - ai*bi;
                let ares = ai*res;
                beta+=ares*ares;
                r+=res*ares;
            }
            //Loop over partial reductions to form final reduction
            beta += beta_vec.iter().fold(0.0,|acc,x|{acc+x});
            r += r_vec.iter().fold(0.0,|acc,x|{acc+x});

            let rinvbeta = r/beta;

            for ((ai,bi),ci) in (&a).iter().zip(b.iter_mut()).zip(&c) {
                *bi = *bi + rinvbeta*(ci-ai*(*bi));
            }
        }
    }
    println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

    let mut sumb=0.0;
    for i in 0..n{
        sumb+=b[i];
    }

    println!("sumb={}",sumb);



}

Custom LLVM passes on Rust-emitted code on Graviton2

Getting custom LLVM passes running on an AWS Graviton2 instance proved more challenging than I initially expected. The Amazon aarch64 AMI ships with very old system libraries and GNU utilities, and I ran into many dependency issues. Here is a rough outline of what I did:

  1. Downloaded Rust Source (to build against custom LLVM)
  2. Downloaded LLVM AArch64 binaries from the GitHub releases page
  3. Downloaded and built from source ncurses, glibc, and make, because the versions the AMI came with were too old
    1. For ncurses you need to configure with ./configure --with-termlib --with-shared --with-versioned-syms --with-abi-version=5
    2. With ncurses and glibc built, I then added libm.so and libtinfo.so.5 to LD_PRELOAD so that they come before the system equivalents
  4. After doing (3.1-3.2), the LLVM binaries clang, opt, llvm-as, and llc should all execute without dependency complaints
  5. In the Rust build: ran ./configure --llvm-root=/path/to/llvm, set codegen-tests=false in config.toml (see the sketch after this list), and then ran x.py
    1. This is because LLVM no longer ships with FileCheck, which Rust uses in its build
    2. The resulting rustc binary will be under build/stage1 and the libraries you have to link against under build/stage1-std (see Makefile below)
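
For reference, the config.toml change in step 5 is a single key. A sketch, assuming the config.example.toml layout at the time (the key lives under the [rust] section):

# config.toml in the Rust source tree (sketch; layout may differ by version)
[rust]
# Skip codegen tests, since the prebuilt LLVM release does not ship FileCheck.
codegen-tests = false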

I copy/paste a sample Makefile below which takes main.rs all the way through its LLVM transformations and final linking:

RUST=/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1/bin/rustc
RUSTFLAGS=-C target-cpu=native -C opt-level=3 --emit=llvm-ir
LLC=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/llc
LLAS=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/llvm-as
CLANG=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/clang
OPT=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/opt

RUSTLIB=/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1-std/aarch64-unknown-linux-gnu/release



main : main.o
        $(CLANG) -L$(RUSTLIB) main.o -lm -lstd -o main

main.o : main.bc opt
        $(LLC) -filetype=obj main.bc

opt : main.bc
        $(OPT) --enable-no-signed-zeros-fp-math --fp-contract=fast --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math -O3 main.bc -o main.bc

main.bc : main.ll
        $(LLAS) main.ll

main.ll : main.rs
        $(RUST) $(RUSTFLAGS) -O main.rs



.PHONY : clean opt

clean :
        rm -rf ./main
        rm -rf ./main.ll
        rm -rf ./main.s
        rm -rf ./main.o
        rm -rf ./main.bc

When all was said and done, these were the resulting versions of Rust and LLVM:

>/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1/bin/rustc --version
rustc 1.60.0-dev

Rust was built from commit 02fe61b381c2dedc0071e1aacfbe91e0bad1f656 of its repository.

>/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/clang --version
clang version 13.0.0 (http://git.linaro.org/toolchain/jenkins-scripts.git d04b0fafc2d906fd9b2e8e55efb35c9cf7114e68)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/ec2-user/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin