Rust Underperforms g++ On Graviton2 for Floating-Point Intensive Code

January 12, 2022

Here I document my (partially successful) attempts to get Rust to perform at parity with C++ on a floating-point benchmark of my own devising on an AWS Graviton2 instance. This follows some observations I made about Graviton2 in my previous post, where I suggested that the ARM fused multiply-add (FMA) operation provided a better performance opportunity than vectorization did. Exploiting it automatically, however, requires a fast-math style compiler flag, which Rust currently does not have because such optimizations can change the bitwise result of floating-point math compared to the unoptimized reference. To work around this, I used Rust's ability to emit LLVM IR so that I could run my own optimization passes using the LLVM infrastructure directly, then link everything back together into an executable. I will demonstrate results for that after setting the stage with the benchmark and the system I ran it on. Note that the end of this blog post contains appendices on some details of the benchmark and of how I ran custom LLVM passes on the Graviton2 system.
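
Throughout this post "FMA" means the fused multiply-add a*b + c evaluated with a single rounding. Rust does expose this explicitly and stably via f64::mul_add; the problem is only that rustc will not substitute it automatically, since doing so changes bitwise results. A minimal sketch of the distinction:

fn main() {
    let (a, b, c) = (1.1_f64, 2.2, -3.3);
    // Two roundings: a*b is rounded to f64, then the addition is rounded again.
    let separate = a * b + c;
    // One rounding: on hardware with FMA (Graviton2 included) this can
    // lower to a single fmadd instruction.
    let fused = a.mul_add(b, c);
    println!("separate = {:e}, fused = {:e}", separate, fused);
}

Hand-writing mul_add everywhere is an option, but it is exactly the kind of result-changing rewrite I want the compiler to perform for me.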

The Graviton2 System

I used an AWS c6g.xlarge instance. The output of lscpu is given below:

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4   
On-line CPU(s) list: 0-3 
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               1
BogoMIPS:            243.75
L1d cache:           64K 
L1i cache:           64K 
L2 cache:            1024K
L3 cache:            32768K
NUMA node0 CPU(s):   0-3 
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

The Benchmark

Here I repeat a benchmark I previously ran comparing equivalent Rust and C++ floating-point intensive code. For in-depth detail on the benchmark I refer you to that blog post, but I will summarize each of the benchmarked steps below:

  1. C++ reference compiled with G++10 (g++ -Ofast -ftree-vectorize main.cpp -o main)
  2. Rust reference implementation (direct translation of (1)) (rustc -C target-cpu=native -C opt-level=3 -O main.rs)
  3. Same as (2) but with loops translated to iterators
  4. Same as (3) but partially unrolled the reduction to allow compiler to vectorize
  5. Same as (4) but used rustc --emit=llvm-ir to get LLVM IR, which I then ran through fast-math-style optimization passes with LLVM's opt

For step (5) above, the flags I passed to LLVM's opt were: --enable-no-signed-zeros-fp-math --fp-contract=fast --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math -O3. See the appendices for more details on the mechanics of actually producing a fully linked executable on the Graviton2 system (it was not very straightforward).

I ran the resulting executables for problem sizes n=256 to n=16777216. I show results in the figure below.

For easier comparison to the C++ reference I divided each implementation's timing by the C++ reference timing, giving a slowdown plot (slowdown = t_implementation / t_C++). The C++ reference sits at 1.0, and anything above 1.0 is slower than C++ by that factor.
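
Concretely, each point in the slowdown plot is just that implementation's normalized time divided by the C++ normalized time. A trivial sketch of the computation (the numbers here are made up for illustration):

fn main() {
    let t_cpp = 1.00e-9; // normalized C++ reference time (illustrative)
    let t_rust = [1.25e-9, 1.12e-9, 1.05e-9]; // the Rust variants (illustrative)
    // Slowdown relative to the C++ reference; 1.0 means parity.
    let slowdowns: Vec<f64> = t_rust.iter().map(|t| t / t_cpp).collect();
    println!("{:?}", slowdowns);
}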

Comparing against clang++

Since Rust and Clang are both frontends to LLVM, it's also useful to run these experiments with clang++ instead of g++. I show these results below:

Absolute timing plot

Slowdowns plot

The story is a little mixed when comparing against clang++: Rust outperforms for some problem sizes and underperforms for others.

Rust Underperforms on AArch64 (For Now)

In my previous post on Graviton2 I stated the following:

I _think_ this may come down to floating point 
optimizations that change results, such as the ability 
to use fused multiply add. On the C++ binary I can see 
FMA used liberally
[...]
[...]
Perhaps there is a tradeoff made here where the  `2X` 
theoretical speedup of vectorization is overcome by the 
`2X` theoretical speedup of a (scalar) fused-multiply-add. 
This would explain why the C++ binary has much fewer vector 
instructions but much more FMADD instructions.

I think the results here do lend some evidence in favor of my FMA theory. For this particular application on Graviton2, FMA appears to be far more important for optimal performance than vectorization is (in contrast to Intel systems with AVX-512, on which FMA did not help as much). Unfortunately there is not yet a way to get the Rust compiler to issue FMA automatically, even though the underlying LLVM tools can.
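
To make the FMA opportunity concrete: the benchmark's residual ci - ai*bi is a multiply-add in disguise, and so are both accumulations. Below is a sketch of what a hand-contracted inner reduction could look like using the stable f64::mul_add; this is my own illustration of the contraction, not output from opt, and I did not benchmark this variant:

// Compute the (beta, r) reduction with explicit fused multiply-adds.
fn reduce(a: &[f64], b: &[f64], c: &[f64]) -> (f64, f64) {
    let (mut beta, mut r) = (0.0_f64, 0.0_f64);
    for ((ai, bi), ci) in a.iter().zip(b).zip(c) {
        // ci - ai*bi == (-ai)*bi + ci: one fused multiply-add.
        let res = (-ai).mul_add(*bi, *ci);
        let ares = ai * res;
        // Each accumulation is itself a multiply-add.
        beta = ares.mul_add(ares, beta);
        r = res.mul_add(ares, r);
    }
    (beta, r)
}

With --fp-contract=fast, the LLVM passes are free to perform this kind of contraction on the ordinary a*b + c forms automatically; without it, rustc must preserve the separately rounded operations.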

Conclusions

I was able to improve performance over manually tuned Rust code by emitting LLVM IR from rustc and then using the LLVM infrastructure to optimize, assemble, and link the result. This let me apply more aggressive floating-point optimizations, in particular ones that allow emitting FMA operations, something Rust conservatively does not do automatically right now. Unfortunately, even after doing this, the resulting code was still as much as 1.2X slower than the C++ reference, depending on the input data size.

Appendices

To make this post more self-contained, I include some details here of how I did everything, including the sources for the benchmarks.

Benchmark Sources

C++ reference

#include <vector>
#include <chrono>
#include <cmath>
#include <iostream>


using vec = std::vector<double>;
using namespace std::chrono;
int main(int argc,char** argv){
  if(argc<2){
    std::cout<<"Usage: ./cplusplus n"<<std::endl;
    return 1;
  }

  /*Size of the arrays to iterate with.*/
  size_t n=std::atoi(argv[1]);
  /*The minimum number of runs to do.*/
  size_t nruns=100;

  /*Allocate and initialize data to 
   * be used in iterations.*/
  vec a(n,0.0);
  vec b(n,0.0);
  vec c(n,0.0);
  for(size_t i=0;i<n;i++){
    a[i]=std::abs( std::sin( double(i) ) )+0.00001;
    b[i]=std::cos( double(i) );
    c[i]=std::cos( double(i) );
  }

  /*How many iterations we did.*/
  size_t count=0;
  /*Initialize timers.*/
  steady_clock::time_point t1 = steady_clock::now();
  steady_clock::time_point t2 = steady_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
  /*Iterate until approximately 1 second has elapsed.*/
  while(time_span.count()<=1.0){
    count+=1;
    for(size_t r_=0;r_<nruns;r_++){
      double beta=0.0;
      double r=0.0;
      for(size_t i=0;i<n;i++){
        r+=(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
        beta+=a[i]*(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
      }
      for(size_t i=0;i<n;i++){
        b[i]=b[i]+(r/beta)*(c[i]-a[i]*b[i]);
      }
    }
    t2 = steady_clock::now();
    time_span = duration_cast<duration<double>>(t2 - t1);
  }
  /*Normalized time = elapsed_time/(problem_size*count*nruns).*/
  std::cout<<"Normalized Average time = "<<(time_span.count()/(n*count*nruns))<<std::endl;

  /*Some compilers may recognize we don't use
   * a,b, or c for anything and elide the entire
   * above computation. Thus I take a summary
   * value and print it out.*/
  double sumb=0.0;
  for(size_t i=0;i<n;i++){
    sumb+=b[i];
  }
  std::cout<<"sumb="<<sumb<<std::endl;
}

Naive Rust translation

use std::env;
fn main(){
    use std::time::{Instant};

    let args: Vec<String> = env::args().collect();
    let n=args[1].parse::<usize>().unwrap();
    let nruns=100;

    let mut a : Vec<f64> = vec![0.0;n];
    let mut b : Vec<f64> = vec![0.0;n];
    let mut c : Vec<f64> = vec![0.0;n];
    for i in 0..n{
        a[i]=(i as f64).sin().abs()+0.00001;
        b[i]=(i as f64).cos();
        c[i]=(i as f64).cos();
    }

    let mut count : usize =0;
    let now = Instant::now();
    while now.elapsed().as_secs_f64()<=1.0 {
        count+=1;
        for _ in 0..nruns{
            let mut beta=0.0;
            let mut r=0.0;
            for i in 0..n{
                r+=(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
                beta+=a[i]*(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
            }
            for i in 0..n{
                b[i]=b[i]+(r/beta)*(c[i]-a[i]*b[i])
            }
        }
    }
    println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

    let mut sumb=0.0;
    for i in 0..n{
        sumb+=b[i];
    }

    println!("sumb={}",sumb);
}

Rust translation with loops converted to iterators

use std::env;
fn main(){
    use std::time::{Instant};

    let args: Vec<String> = env::args().collect();
    let n=args[1].parse::<usize>().unwrap();
    let nruns=100;

    let mut a : Vec<f64> = vec![0.0;n];
    let mut b : Vec<f64> = vec![0.0;n];
    let mut c : Vec<f64> = vec![0.0;n];
    for i in 0..n{
        a[i]=(i as f64).sin().abs()+0.00001;
        b[i]=(i as f64).cos();
        c[i]=(i as f64).cos();
    }

    let mut count : usize =0;
    let now = Instant::now();
    while now.elapsed().as_secs_f64()<=1.0 {
        count+=1;
        for _ in 0..nruns{

            let (beta,r) = (&a).iter().zip(&b).zip(&c).fold( (0.0,0.0),
            |acc,x|{
                let ((ai,bi),ci)=x;
                let (beta_tmp,r_tmp) = acc;
                let res = ci - ai*bi;
                let ares = ai*res;
                (beta_tmp+ares*ares,r_tmp+res*ares)
            });

            let rinvbeta = r/beta;

            for ((ai,bi),ci) in (&a).iter().zip(b.iter_mut()).zip(&c) {
                *bi = *bi + rinvbeta*(ci-ai*(*bi));
            }
        }
    }
    println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

    let mut sumb=0.0;
    for i in 0..n{
        sumb+=b[i];
    }

    println!("sumb={}",sumb);
}

Rust translation with loops converted to iterators and partially unrolled reduction

use std::env;
fn main(){
    use std::time::{Instant};

    let args: Vec<String> = env::args().collect();
    let n=args[1].parse::<usize>().unwrap();
    let nruns=100;
    const CHUNKSIZE : usize  = 32;

    let mut a : Vec<f64> = vec![0.0;n];
    let mut b : Vec<f64> = vec![0.0;n];
    let mut c : Vec<f64> = vec![0.0;n];
    for i in 0..n{
        a[i]=(i as f64).sin().abs()+0.00001;
        b[i]=(i as f64).cos();
        c[i]=(i as f64).cos();
    }

    let mut count : usize =0;
    let now = Instant::now();

    let mut beta_vec : [f64;CHUNKSIZE] = [0.0;CHUNKSIZE];
    let mut r_vec : [f64;CHUNKSIZE] = [0.0;CHUNKSIZE];
    while now.elapsed().as_secs_f64()<=1.0 {
        count+=1;
        for _ in 0..nruns{

            //Initialize partial reduction arrays
            for bv in beta_vec.iter_mut(){ *bv=0.0; }
            for rv in (r_vec).iter_mut(){ *rv=0.0; }

            //Form iterator over chunks of 
            //input arrays
            let outer_iter = 
                (&a).chunks_exact(CHUNKSIZE)
                .zip( (&b).chunks_exact(CHUNKSIZE))
                .zip( (&c).chunks_exact(CHUNKSIZE));
            //Get remainder iterator
            let outer_iter_remainder = 
                (&a).chunks_exact(CHUNKSIZE).remainder().iter()
                .zip( (&b).chunks_exact(CHUNKSIZE).remainder().iter())
                .zip( (&c).chunks_exact(CHUNKSIZE).remainder().iter());



            //Loop over all chunks and form partial reductions
            for ((avec,bvec),cvec) in outer_iter{
                let inner_iter = avec.iter()
                    .zip(bvec.iter())
                    .zip(cvec.iter())
                    .zip(beta_vec.iter_mut())
                    .zip(r_vec.iter_mut());

                for ((((ai,bi),ci),betai),ri) in inner_iter{
                    let res = ci - ai*bi;
                    let ares = ai*res;
                    *betai += ares*ares;
                    *ri += res*ares;
                }
            }
            //Form remainder reduction
            let mut beta = 0.0;
            let mut r = 0.0;
            for ((ai,bi),ci) in outer_iter_remainder {
                let res = ci - ai*bi;
                let ares = ai*res;
                beta+=ares*ares;
                r+=res*ares;
            }
            //Loop over partial reductions to form final reduction
            beta += beta_vec.iter().fold(0.0,|acc,x|{acc+x});
            r += r_vec.iter().fold(0.0,|acc,x|{acc+x});

            let rinvbeta = r/beta;

            for ((ai,bi),ci) in (&a).iter().zip(b.iter_mut()).zip(&c) {
                *bi = *bi + rinvbeta*(ci-ai*(*bi));
            }
        }
    }
    println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

    let mut sumb=0.0;
    for i in 0..n{
        sumb+=b[i];
    }

    println!("sumb={}",sumb);



}

Custom LLVM passes on Rust-emitted code on Graviton2

Getting custom LLVM passes running on an AWS Graviton2 instance proved more challenging than I initially expected. The Amazon aarch64 AMI ships with very old system libraries and GNU utilities, and I ran into many dependency issues. Here is a rough outline of what I did:

  1. Downloaded Rust Source (to build against custom LLVM)
  2. Downloaded LLVM AArch64 binaries from the GitHub releases page
  3. Downloaded and built from source ncurses, glibc, and make, because the versions the AMI came with were too old
    1. For ncurses you need to configure with ./configure --with-termlib --with-shared --with-versioned-syms --with-abi-version=5
    2. With ncurses and glibc built, I then added libm.so and libtinfo.so.5 to LD_PRELOAD so that they come before the system equivalents
  4. After doing (3.1-3.2), the LLVM binaries clang, opt, llvm-as, and llc should all execute without dependency complaints
  5. In the Rust build: ran ./configure --llvm-root=/path/to/llvm, set codegen-tests=false in config.toml (see the sketch after this list), and then ran x.py
    1. This is because LLVM no longer ships with FileCheck, which Rust uses in its build
    2. The resulting rustc binary will be under build/stage1 and the libraries you have to link against under build/stage1-std (see Makefile below)
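
For reference, the config.toml change in step 5 is a single key. A sketch, assuming the config.example.toml layout at the time (the key lives under the [rust] section):

# config.toml in the Rust source tree (sketch; layout may differ by version)
[rust]
# Skip codegen tests, since the prebuilt LLVM release does not ship FileCheck.
codegen-tests = false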

I copy/paste a sample Makefile below which takes main.rs all the way through its LLVM transformations and final linking:

RUST=/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1/bin/rustc
RUSTFLAGS=-C target-cpu=native -C opt-level=3 --emit=llvm-ir
LLC=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/llc
LLAS=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/llvm-as
CLANG=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/clang
OPT=$(HOME)/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/opt

RUSTLIB=/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1-std/aarch64-unknown-linux-gnu/release



main : main.o
        $(CLANG) -L$(RUSTLIB) main.o -lm -lstd -o main

main.o : main.bc opt
        $(LLC) -filetype=obj main.bc

opt : main.bc
        $(OPT) --enable-no-signed-zeros-fp-math --fp-contract=fast --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math -O3 main.bc -o main.bc

main.bc : main.ll
        $(LLAS) main.ll

main.ll : main.rs
        $(RUST) $(RUSTFLAGS) -O main.rs



.PHONY : clean opt

clean :
        rm -rf ./main
        rm -rf ./main.ll
        rm -rf ./main.s
        rm -rf ./main.o
        rm -rf ./main.bc

When all was said and done, these were the resulting versions of Rust and LLVM:

>/home/ec2-user/custom_rust/rust/build/aarch64-unknown-linux-gnu/stage1/bin/rustc --version
rustc 1.60.0-dev

Rust was built from commit 02fe61b381c2dedc0071e1aacfbe91e0bad1f656 of its repository.

>/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin/clang --version
clang version 13.0.0 (http://git.linaro.org/toolchain/jenkins-scripts.git d04b0fafc2d906fd9b2e8e55efb35c9cf7114e68)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/ec2-user/llvm/clang+llvm-13.0.0-aarch64-linux-gnu/bin