# Rust and C++ on Floating-point Intensive Code

October 19, 2019

I have developed some interest in the Rust programming language. Rust is a systems programming language which means it gets out of the way of developers interested in reaching low-level internals of a system. What sets it aside from other systems languages though is:

1. Borrow checking to guarantee memory safety
2. Functional-like idioms (map, fold, etc)
3. Pattern matching with an almost algebraic datatype called “enum”
4. A much simpler memory model than C++

We might call this “C++ done right” and since I do a lot of heavy numeric computation in C++ it was tempting for me to see how these two languages compare in a shootout. I chose a floating point benchmark to implement in both languages in order to see the performance difference. I give commentary on why the performance is that way, and some potential fixes Rust could implement to close the gap.

# The Benchmark

I will use a benchmark I designed myself for this purpose. This benchmark relies on vectorization for performance. The high-level view of the benchmark is as follows:

1. Create three arrays a,b,c a must have all positive entries
2. Using a as a diagonal matrix approximate solution to (a*b)=c using a fixed number gmres iterations with a restart value of 1.
3. Iterate (2) for approximately 1 second.
4. Output the time (3) takes divided by the size of the arrays in (1) and the total number of iterations (3) was able to run in the allowed time

In c++ this code looks like the following:

#include <vector>
#include <chrono>
#include <cmath>
#include <iostream>

using vec = std::vector<double>;
using namespace std::chrono;
int main(int argc,char** argv){
if(argc<2){
std::cout<<"Usage: ./cplusplus n"<<std::endl;
return 1;
}

/*Size of the arrays to iterate with.*/
size_t n=std::atoi(argv[1]);
/*The minimum number of runs to do.*/
size_t nruns=100;

/*Allocate and initialize data to
* be used in iterations.*/
vec a(n,0.0);
vec b(n,0.0);
vec c(n,0.0);
for(size_t i=0;i<n;i++){
a[i]=std::abs( std::sin( double(i) ) )+0.00001;
b[i]=std::cos( double(i) );
c[i]=std::cos( double(i) );
}

/*How many iterations we did.*/
size_t count=0;
/*Initialize timers.*/
duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
/*Iterate until approximately 1 second has elapsed.*/
while(time_span.count()<=1.0){
count+=1;
for(size_t r_=0;r_<nruns;r_++){
double beta=0.0;
double r=0.0;
for(size_t i=0;i<n;i++){
r+=(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
beta+=a[i]*(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
}
for(size_t i=0;i<n;i++){
b[i]=b[i]+(r/beta)*(c[i]-a[i]*b[i]);
}
}
time_span = duration_cast<duration<double>>(t2 - t1);
}
/*Normalized time = elapsed_time/(problem_size*count*nruns.*/
std::cout<<"Normalized Average time = "<<(time_span.count()/(n*count*nruns))<<std::endl;

/*Some compilers may recognize we don't use
* a,b, or c for anything and elide the entire
* above computation. Thus I take a summary
* value and print it out.*/
double sumb=0.0;
for(size_t i=0;i<n;i++){
sumb+=b[i];
}
std::cout<<"sumb="<<sumb<<std::endl;
}


The reason for using the fancy iterations is that there is a pretty good chance that if we just put any ordinary vector code inside the nruns loop that sufficiently aggressive compilers will do an “unroll and jam” optimization, resulting in misleading time measurements. The gmres step requires a dot product which creates an all-to-all data dependency which makes unroll-and-jam illegal.

# Reference Measurements: Intel compiler on a Skylake Intel processor

For this I build with the above c++ code with intel compiler using the following build line:

icpc -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopt-report  cplusplus.cpp -o cplusplus


An explanation for each option is as follows:

1. -O3 — highest level of optimization
2. -xCORE-AVX512 — use avx512 where possible
3. -qopt-zmm-usage=high — Force avx512 even if compiler thinks the resulting downclocking will be slow (it isn’t in this case)
4. -qopt-report — output optimization report, I used this to inspect compiler optimizations and rule out fancy loop optimizations that could lead to misleading timings.

I then ran the code for double precision arrays a,b,c of length 256 until 16777216. The purpose of this is to measure the performance changes as we exceed different levels of cache.

# Naive Rust Implementation

I learned some basics of Rust and came up with the following implementation that does the equivalent calculation and timing of it:

use std::env;
fn main(){
use std::time::{Instant};

let args: Vec<String> = env::args().collect();
let n=args[1].parse::<usize>().unwrap();
let nruns=100;

let mut a : Vec<f64> = vec![0.0;n];
let mut b : Vec<f64> = vec![0.0;n];
let mut c : Vec<f64> = vec![0.0;n];
for i in 0..n{
a[i]=(i as f64).sin().abs()+0.00001;
b[i]=(i as f64).cos();
c[i]=(i as f64).cos();
}

let mut count : usize =0;
let now = Instant::now();
while now.elapsed().as_secs_f64()<=1.0 {
count+=1;
for _ in 0..nruns{
let mut beta=0.0;
let mut r=0.0;
for i in 0..n{
r+=(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
beta+=a[i]*(c[i]-a[i]*b[i])*a[i]*(c[i]-a[i]*b[i]);
}
for i in 0..n{
b[i]=b[i]+(r/beta)*(c[i]-a[i]*b[i])
}
}
}
println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

let mut sumb=0.0;
for i in 0..n{
sumb+=b[i];
}

println!("sumb={}",sumb);
}


I built this with the following build command

	rustc -C target-cpu=native -C opt-level=3 -O rust.rs


an explanation of the flags follows:

1. -C target-cpu=native — Detect host CPU and use best instruction set
2. -C opt-level=3 — highest level of optimization

This resulted in the following:

## Explanation of results

This naive Rust implementation didn’t vectorize unfortunately, we can see this using objdump

 #Look for vaddpd and vmulpd in executables: vectorized add and multiply
objdump -d cplusplus | grep 'vaddpd\|vmulpd' | wc -l
>> 73
objdump -d naiverust | grep 'vaddpd\|vmulpd' | wc -l
>> 0


Note: The count above doesn’t have performance implications itself, but a count of 0 means those instructions weren’t use - which I interpret to mean the code did not vectorize where it should have.

# Fixing the Naive implementation with iterators

I did some reading on the internet and found that I should not have used ordinary loops, but rather iterators. The reason is that by using iterators there isn’t need for runtime bounds checking. My iterator implementation follows:

use std::env;
fn main(){
use std::time::{Instant};

let args: Vec<String> = env::args().collect();
let n=args[1].parse::<usize>().unwrap();
let nruns=100;

let mut a : Vec<f64> = vec![0.0;n];
let mut b : Vec<f64> = vec![0.0;n];
let mut c : Vec<f64> = vec![0.0;n];
for i in 0..n{
a[i]=(i as f64).sin().abs()+0.00001;
b[i]=(i as f64).cos();
c[i]=(i as f64).cos();
}

let mut count : usize =0;
let now = Instant::now();
while now.elapsed().as_secs_f64()<=1.0 {
count+=1;
for _ in 0..nruns{

let (beta,r) = (&a).iter().zip(&b).zip(&c).fold( (0.0,0.0),
|acc,x|{
let ((ai,bi),ci)=x;
let (beta_tmp,r_tmp) = acc;
let res = ci - ai*bi;
let ares = ai*res;
(beta_tmp+ares*ares,r_tmp+res*ares)
});

let rinvbeta = r/beta;

for ((ai,bi),ci) in (&a).iter().zip(b.iter_mut()).zip(&c) {
*bi = *bi + rinvbeta*(ci-ai*(*bi));
}
}
}
println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

let mut sumb=0.0;
for i in 0..n{
sumb+=b[i];
}

println!("sumb={}",sumb);
}


I compiled this the same way:

	rustc -C target-cpu=native -C opt-level=3 -O rust.rs


and ran for the same problem sizes:

This is a significant improvement but I see that it’s still pretty far behind in the lower cache levels, suggesting more instructions executed there where the Intel compiler has managed to do fewer.

## Explanation of iterator results

First of all let’s see if the iterator results vectorized:

 #Look for vaddpd and vmulpd in executables: vectorized add and multiply
objdump -d cplusplus | grep 'vaddpd\|vmulpd' | wc -l
>> 73
objdump -d naiverust | grep 'vaddpd\|vmulpd' | wc -l
>> 0
objdump -d iteratorrust | grep 'vaddpd\|vmulpd' | wc -l
>> 12


Success! We have some vectorization. I think the next performance bump that Intel achieves is through its very aggressive treatment of floating point math. The Intel compiler, unless told otherwise, will change the order of operations if doing so is faster. The most obvious way this unfolds is in the use of fused multiply-add, or FMA, which can potentially double a floating-point-heavy code’s throughput.
Looking for the avx512 version of this in the intel executable finds it is indeed used:

objdump -d cplusplus | grep 'vfma' |wc -l
>>66
objdump -d iteratorrust | grep 'vfma' | wc -l
>>0


So the Intel compiler has taken some liberties with our accuracy and used FMA. This can account for a potential 2X performance difference. This doesn’t tell the whole story, as we can see that in the L2 cache regime the Intel executable achieves more than a 2X speedup, but I haven’t looked further than math associativity optimizations.

Comparing with Clang is interesting because both Clang and Rust heavily use LLVM for backend code generation, even though their frontend languages are different (C++ versus Rust). Comparing against Intel’s compiler on Intel’s hardware also can be insanely unfair, but really the performance of this code is all about being able to exploit vectorization and special accelerated arithmetic operations like FMA. Most open source C and C++ compilers are able to achieve this. To show this I compile the c++ code with clang-7.1 and the following build line

clang++ -Ofast -fvectorize -march=skylake-avx512 cplusplus.cpp -o cplusplus_clang


Explanation of the options are as follows

1. -Ofast — Enable fast optimizations even if it changes bitwise result of floating point
2. -fvectorize — Enable vectorization
3. -march=skylake-avx512 – enable avx512 instructions on Skylake architecture

First we can check that the code vectorized also used FMA instructions:

 objdump -d cplusplus_clang | grep 'vaddpd\|vmulpd' | wc -l
>>>47
objdump -d cplusplus_clang |grep fma | wc -l
>>>15


Here are the results:

## What if we turn off -Ofast in Clang?

This test helps to confirm if the issue really is associativity related floating point optimizations. I got this advice from reading this reddit post, specifically this comment by greppable777

This time I took the same code and compiled with the line:

clang++ -O3 -fvectorize -march=skylake-avx512 cplusplus.cpp  -o cplusplus_clang_noofast


We can confirm that it vectorized but did not emit FMA

objdump -d cplusplus_clang_noofast |grep 'vaddpd\|vmulpd' | wc -l
>>>12
objdump -d cplusplus_clang_noofast |grep 'fma' | wc -l
>>>0



and finally the plot illustrating results. Interestingly, Rust did slightly better than Clang when we turn off associativity optimizations.

# Update (10/24/2019) : Fixing the Reduction

Thanks to discussions on hackernews, specifically this comment by tom mellior I realized that actually my reduction (calculating beta and r) was not vectorizing in my rust code. This makes sense because doing so would require changing the order of math operations, which is also something enabled by the intel compiler and clang++ -Ofast, but not enabled by Rust by default.

The strategy to vectorizing loops is to keep a small vector (say of size 16 or 32) of partial reductions, and use the vaddpd vector instruction to maintain a list of partial reductions. The partial reduction accumulations are vectorized. After we have finished looping over all the sub-ranges of the specified size we finish up with a standard reduction.

In Rust I still used iterators, but since I needed a more sophisticated looping strategy I used a way of dividing loopps into chunks called chunks_exact. I give the code for this below. It’s a little more complicated than the original Rust code, but maybe some Rust experts out there know a good way to simplify it?

use std::env;
fn main(){
use std::time::{Instant};

let args: Vec<String> = env::args().collect();
let n=args[1].parse::<usize>().unwrap();
let nruns=100;
const CHUNKSIZE : usize  = 32;

let mut a : Vec<f64> = vec![0.0;n];
let mut b : Vec<f64> = vec![0.0;n];
let mut c : Vec<f64> = vec![0.0;n];
for i in 0..n{
a[i]=(i as f64).sin().abs()+0.00001;
b[i]=(i as f64).cos();
c[i]=(i as f64).cos();
}

let mut count : usize =0;
let now = Instant::now();

let mut beta_vec : [f64;CHUNKSIZE] = [0.0;CHUNKSIZE];
let mut r_vec : [f64;CHUNKSIZE] = [0.0;CHUNKSIZE];
while now.elapsed().as_secs_f64()<=1.0 {
count+=1;
for _ in 0..nruns{

//Initialize partial reduction arrays
for bv in beta_vec.iter_mut(){ *bv=0.0; }
for rv in (r_vec).iter_mut(){ *rv=0.0; }

//Form iterator over chunks of
//input arrays
let outer_iter =
(&a).chunks_exact(CHUNKSIZE)
.zip( (&b).chunks_exact(CHUNKSIZE))
.zip( (&c).chunks_exact(CHUNKSIZE));
//Get remainder iterator
let outer_iter_remainder =
(&a).chunks_exact(CHUNKSIZE).remainder().iter()
.zip( (&b).chunks_exact(CHUNKSIZE).remainder().iter())
.zip( (&c).chunks_exact(CHUNKSIZE).remainder().iter());

//Loop over all chunks and form partial reductions
for ((avec,bvec),cvec) in outer_iter{
let inner_itter = avec.iter()
.zip(bvec.iter())
.zip(cvec.iter())
.zip(beta_vec.iter_mut())
.zip(r_vec.iter_mut());

for ((((ai,bi),ci),betai),ri) in inner_itter{
let res = ci - ai*bi;
let ares = ai*res;
*betai += ares*ares;
*ri += res*ares;
}
}
//Form remainder reduction
let mut beta = 0.0;
let mut r = 0.0;
for ((ai,bi),ci) in outer_iter_remainder {
let res = ci - ai*bi;
let ares = ai*res;
beta+=ares*ares;
r+=res*ares;
}
//Loop over partial reductions to form final reduction
beta += beta_vec.iter().fold(0.0,|acc,x|{acc+x});
r += r_vec.iter().fold(0.0,|acc,x|{acc+x});

let rinvbeta = r/beta;

for ((ai,bi),ci) in (&a).iter().zip(b.iter_mut()).zip(&c) {
*bi = *bi + rinvbeta*(ci-ai*(*bi));
}
}
}
println!("Normalized Average time = {}",now.elapsed().as_secs_f64()/((count as f64)*(n as f64)*(nruns as f64)));

let mut sumb=0.0;
for i in 0..n{
sumb+=b[i];
}

println!("sumb={}",sumb);

}



If we compare the objdump of this file with the basic iterator version of rust we see it has three times the vector instructions

objdump -d rust_iterators | grep 'vaddpd\|vmulpd' | wc -l
>>>12

objdump -d rust_iterators_reduction | grep 'vaddpd\|vmulpd' | wc -l
>>>36



since nothing else changed in the file I’m only guessing here that these correspond with the new reduction code, which didn’t vectorize originally.

The results seem to back up this idea.

# Conclusions and Caveats

Rust looks promising. What I like about it is that I didn’t see performance penalties using a more functional style programming approach where I used maps and folds, it turns out that style is even preferable for vectorization.

One thing I could not figure out how to do was to turn on associativity optimizations - these are the kind of optimizations which will allow FMA operations. Often these do not hurt accuracy in any significant way, but since they do change the answer from a bitwise reproducibility perspective many compilers will not do it unless specifically told to do so. A flag from Rust to allow this would be helpful, and it seems that LLVM can already do this.

I was able to achieve similar results to associativity optimizations by manually changing the order of the reduction operations and maintaining an explicit array of partial reductions. It would be nice to have the compiler do this as well however.

## Caveats

This code is single threaded. I did this because threading performance often depends a lot not just on compiler capabilities, but the language’s ability to coordinate with an operating system. Things like thread pinning and processor topology combined with scheduling can have a huge impact on a threaded application’s performance. Thus I decided to narrow my focus to a single-threaded application, but it would be interesting to try this on a multi-threaded application also at a later date.

## Help?

I would love to get feedback, especially from Rust users, about how I can improve my Rust code. If I get good feedback I’ll try to make a short blog post about it and cite you (if you want). Here are ways I would like to improve the Rust code:

1. Flags to improve code generation (e.g. FMA)
2. Macros or conditional compilation strategies to achieve (1)
3. Ideally no intrinsics for (1-2), I like to have a single source file across architectures as far as possible. For example I would like to try this same experiment on an AMD machine or ARM machine in the future.
4. Maybe there is a better way to use iterators than what I have done?