Benchmarking Rust's ndarray crate against plain slices

June 11, 2022

Since I do a lot of linear algebra work, I care a lot about good multidimensional containers. The best such containers get out of the developer's way: they make core operations easy to express while introducing as little overhead as possible, so that the compiler can still optimize any custom code the developer builds on top of them. Rust does not have a standard way of manipulating this kind of data, but there are some very good crates that fill the gap. I focus here on the ndarray crate, which boasts a very full feature set and cooperates well with Rust's iterator-based functional idioms. Unfortunately, I will show that this crate introduces some overhead that likely interferes with vectorization and prevents it from achieving its full potential. It remains the case that the container achieves high performance when the developer sticks to built-in, supported functionality or does not stray far from it. I illustrate this point by comparing a matrix-matrix multiply function written with raw Rust slices against one written with ndarray.

The benchmark

The benchmark I use is a simple xGEMM-style matrix-matrix multiply following the BLAS convention. BLAS normally takes raw pointers, but here I use Rust slices instead. The code below matches the performance of a manually indexed equivalent written with unsafe blocks, as I showed in another post.

// C += A * B for row-major matrices stored in flat slices; lda, ldb and ldc
// are the leading dimensions (row strides) of a, b and c respectively.
pub fn gemm(_m : usize, _n : usize, _p : usize, a : &[f64], lda : usize, b : &[f64], ldb : usize, c : &mut [f64], ldc : usize) -> () {
    for (ci,ai) in c.chunks_exact_mut(ldc).zip(a.chunks_exact(lda)){
        for (aik,bk) in ai.iter().zip(b.chunks_exact(ldb)){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}
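
For reference, the manually indexed unsafe equivalent mentioned above looks roughly like this; it is a reconstruction of that style rather than the code from the other post. get_unchecked and get_unchecked_mut skip the bounds checks that ordinary indexing performs.

// Reconstruction (not the original code from the other post): manually indexed
// GEMM over flat row-major slices with bounds checks removed.
pub fn gemm_manual(m: usize, n: usize, p: usize, a: &[f64], lda: usize, b: &[f64], ldb: usize, c: &mut [f64], ldc: usize) {
    for i in 0..m {
        for k in 0..p {
            unsafe {
                let aik = *a.get_unchecked(i * lda + k);
                for j in 0..n {
                    *c.get_unchecked_mut(i * ldc + j) += aik * *b.get_unchecked(k * ldb + j);
                }
            }
        }
    }
}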

I randomly sample 100000 matrices of different sizes, feed them into gemm, and then calculate a histogram and percentiles of the FLOPS achieved. This will be the basis of comparison for the ndarray version.

fn main() {
    use std::time::Instant;
    use rand::distributions::{Distribution,Uniform};
    use rand::SeedableRng;
    use rand_chacha::ChaCha20Rng;

    let maxsize = 1024;
    let nsamples = 100000;
    let seed : u64 = 19140148;
    let mut times0 = Vec::<f64>::new(); // MFLOPS samples, slice version
    let mut times1 = Vec::<f64>::new(); // MFLOPS samples, ndarray version (filled in the full listing)
    {
        // Seed once per benchmark block so every iteration draws a new matrix size,
        // while both benchmark blocks still see the same sequence of matrices.
        let mut rng = ChaCha20Rng::seed_from_u64(seed);
        let size_dist = Uniform::from(5..maxsize);
        let vals_dist = Uniform::new(-2.0,2.0);
        for _ in 0..nsamples{
            let m = size_dist.sample(&mut rng);

            let a : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let b : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let mut c : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let start = Instant::now();
            gemm(m,m,m,&a,m,&b,m,&mut c,m);
            // 2*m^3 floating point operations, reported as MFLOPS.
            times0.push( ((2*m*m*m) as f64 / start.elapsed().as_secs_f64())/(1024.0*1024.0) );
        }
    }
}
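
The listing above only collects the raw MFLOPS samples; the percentiles reported in the tables below can be obtained from times0 and times1 with a small helper along these lines (my addition, not part of the original program):

// Simple percentile: sort a copy of the samples and pick the value at the
// rounded index. Assumes the samples contain no NaNs.
fn percentile(samples: &[f64], p: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|x, y| x.partial_cmp(y).unwrap());
    let idx = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

// Example usage once both benchmark loops have run:
// for p in [1.0, 5.0, 50.0, 95.0, 99.0] {
//     println!("{}%  slice={:.0}  ndarray={:.0}", p, percentile(&times0, p), percentile(&times1, p));
// }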

The full code listing is included in an appendix at the end of this post.

Straightforward Translation

Using ndarray, the most straightforward translation of the slice version of GEMM above uses the rows()/rows_mut() methods together with iterators:

use ndarray::{Array,Array2};
pub fn gemm2(_m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    for (mut ci,ai) in c.rows_mut().into_iter().zip(a.rows().into_iter()){
        for (aik,bk) in ai.iter().zip(b.rows().into_iter()){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}

Note that, since the arrays are now two-dimensional, the lda, ldb and ldc parameters are no longer used.

When I build and run the experiment with RUSTFLAGS="-C target-cpu=native -C opt-level=3" cargo run, I find the following:

[Figure: MFLOPS histogram]

Percentile    Slice (MFLOPS)    ndarray (MFLOPS)
1%            9639              3831
5%            9659              4079
50%           11658             4098
95%           11699             4115
99%           11717             4121

In this case the slice version ran roughly 2x-3x faster than the ndarray version.

Using Zip

The ndarray documentation suggests using its custom Zip type for many iterator-like idioms. I could not produce a full implementation using only Zip, because it requires nested lambdas, and with many nested lambdas the Rust borrow checker complains about repeated mutable borrows of the c array. Still, I used it as much as I could:

use ndarray::{Array2,Zip};
pub fn gemm2(_m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    Zip::from(c.rows_mut()).and(a.rows()).for_each(|mut ci,ai|{
        Zip::from(ai).and(b.rows()).for_each(|aik,bk|{
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        });
    });
}

Percentile    Slice (MFLOPS)    ndarray (MFLOPS)
1%            9545              1954
5%            11020             2954
50%           11574             2958
95%           11834             2961
99%           11875             2962

Unfortunately, the results here were even worse than those of the standard-iterator implementation.

Unsafe methods

Since Rust idioms were not yielding good performance with ndarray, I decided to resort to basic indexing, and furthermore to turn off bounds checking by using ndarray's unsafe indexing methods (these may only be used inside unsafe blocks).

pub fn gemm2(m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    for i in 0..m{
        for k in 0..m{
            unsafe{
                // uget/uget_mut index into the array without bounds checks.
                let aik = a.uget([i,k]);
                for j in 0..m{
                    let bkj = b.uget([k,j]);
                    let cij = c.uget_mut([i,j]);
                    *cij += (*aik) * (*bkj);
                }
            }
        }
    }
}

Percentile    Slice (MFLOPS)    ndarray (MFLOPS)
1%            7236              7687
5%            11238             8821
50%           11442             9014
95%           11493             9322
99%           11516             9347

Unfortunately even with the use of unsafe indexing we are still leaving a fairly significant amount of performance on the table.

This result did surprise me; I assumed the unsafe indexing would at least result in similar code generation. Comparing the assembly of the two inner loops, there does not appear to be a major difference in vectorization. (Note: to get the line info, add -C debuginfo=1 to your RUSTFLAGS prior to invoking cargo.)

Slice GEMM:
.LBB6_17:
  .loc  12 270 12
  vbroadcastsd  (%rdx,%rdi,8), %ymm0
  xorl  %ebx, %ebx 
.Ltmp46:
  .p2align  4, 0x90 
.LBB6_18:
  .loc  10 7 25 
  vmulpd  -224(%rcx,%rbx,8), %ymm0, %ymm1
  vmulpd  -192(%rcx,%rbx,8), %ymm0, %ymm2
  vmulpd  -160(%rcx,%rbx,8), %ymm0, %ymm3
  vmulpd  -128(%rcx,%rbx,8), %ymm0, %ymm4
  .loc  10 7 17 is_stmt 0
  vaddpd  -224(%rsi,%rbx,8), %ymm1, %ymm1
  vaddpd  -192(%rsi,%rbx,8), %ymm2, %ymm2
  vaddpd  -160(%rsi,%rbx,8), %ymm3, %ymm3
  vaddpd  -128(%rsi,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -224(%rsi,%rbx,8)
  vmovupd %ymm2, -192(%rsi,%rbx,8)
  vmovupd %ymm3, -160(%rsi,%rbx,8)
  vmovupd %ymm4, -128(%rsi,%rbx,8)
  .loc  10 7 25 
  vmulpd  -96(%rcx,%rbx,8), %ymm0, %ymm1
  vmulpd  -64(%rcx,%rbx,8), %ymm0, %ymm2
  vmulpd  -32(%rcx,%rbx,8), %ymm0, %ymm3
  vmulpd  (%rcx,%rbx,8), %ymm0, %ymm4
  .loc  10 7 17 
  vaddpd  -96(%rsi,%rbx,8), %ymm1, %ymm1
  vaddpd  -64(%rsi,%rbx,8), %ymm2, %ymm2
  vaddpd  -32(%rsi,%rbx,8), %ymm3, %ymm3
  vaddpd  (%rsi,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -96(%rsi,%rbx,8)
  vmovupd %ymm2, -64(%rsi,%rbx,8)
  vmovupd %ymm3, -32(%rsi,%rbx,8)
  vmovupd %ymm4, (%rsi,%rbx,8)

ndarray+unsafe GEMM:

.Ltmp41:
  .loc  13 621 12
  vbroadcastsd  (%r15,%rbx,8), %ymm0
  xorl  %ebx, %ebx 
.Ltmp42:
  .p2align  4, 0x90 
.LBB6_18:
  .loc  11 12 29
  vmulpd  -224(%rsi,%rbx,8), %ymm0, %ymm1
  vmulpd  -192(%rsi,%rbx,8), %ymm0, %ymm2
  vmulpd  -160(%rsi,%rbx,8), %ymm0, %ymm3
  vmulpd  -128(%rsi,%rbx,8), %ymm0, %ymm4
  .loc  11 12 21 is_stmt 0
  vaddpd  -224(%rcx,%rbx,8), %ymm1, %ymm1
  vaddpd  -192(%rcx,%rbx,8), %ymm2, %ymm2
  vaddpd  -160(%rcx,%rbx,8), %ymm3, %ymm3
  vaddpd  -128(%rcx,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -224(%rcx,%rbx,8)
  vmovupd %ymm2, -192(%rcx,%rbx,8)
  vmovupd %ymm3, -160(%rcx,%rbx,8)
  vmovupd %ymm4, -128(%rcx,%rbx,8)
  .loc  11 12 29
  vmulpd  -96(%rsi,%rbx,8), %ymm0, %ymm1
  vmulpd  -64(%rsi,%rbx,8), %ymm0, %ymm2
  vmulpd  -32(%rsi,%rbx,8), %ymm0, %ymm3
  vmulpd  (%rsi,%rbx,8), %ymm0, %ymm4
  .loc  11 12 21
  vaddpd  -96(%rcx,%rbx,8), %ymm1, %ymm1
  vaddpd  -64(%rcx,%rbx,8), %ymm2, %ymm2
  vaddpd  -32(%rcx,%rbx,8), %ymm3, %ymm3
  vaddpd  (%rcx,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -96(%rcx,%rbx,8)
  vmovupd %ymm2, -64(%rcx,%rbx,8)
  vmovupd %ymm3, -32(%rcx,%rbx,8)
  vmovupd %ymm4, (%rcx,%rbx,8)	

Summary and Conclusions

The ndarray crate has a good design and diligently implements the features users of multidimensional arrays need, but at present it is hard to extend its functionality by building on its primitives without taking a fairly large performance hit. I'm not completely sure what causes the performance discrepancy yet, except that the rows() iterators on ndarray arrays must somehow be more expensive than chunks_exact() on slices; in what way, I do not yet know.
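
One workaround that keeps the slice version's performance, assuming the arrays are contiguous and in standard row-major layout, is to borrow the underlying slices from the ndarray containers and call the slice-based gemm directly. The sketch below illustrates the idea; it is my addition and was not benchmarked in this post.

use ndarray::Array2;

// Sketch (not part of the original code): when the ndarray arrays are contiguous
// and in standard row-major layout, reuse the slice-based gemm on their backing
// storage. as_slice()/as_slice_mut() return None otherwise, hence the expect()s.
pub fn gemm_via_slices(a: &Array2<f64>, b: &Array2<f64>, c: &mut Array2<f64>) {
    let (m, p) = a.dim();
    let (_, n) = b.dim();
    let (lda, ldb, ldc) = (a.ncols(), b.ncols(), c.ncols());
    let a_s = a.as_slice().expect("a must be contiguous, standard layout");
    let b_s = b.as_slice().expect("b must be contiguous, standard layout");
    let c_s = c.as_slice_mut().expect("c must be contiguous, standard layout");
    gemm(m, n, p, a_s, lda, b_s, ldb, c_s, ldc);
}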

Appendix

use ndarray::{Array,Array2};
pub fn gemm(_m : usize, _n : usize, _p : usize, a : &[f64], lda : usize, b : &[f64], ldb : usize, c : &mut [f64], ldc : usize) -> () {
    for (ci,ai) in c.chunks_exact_mut(ldc).zip(a.chunks_exact(lda)){
        for (aik,bk) in ai.iter().zip(b.chunks_exact(ldb)){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}

pub fn gemm2(_m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    for (mut ci,ai) in c.rows_mut().into_iter().zip(a.rows().into_iter()){
        for (aik,bk) in ai.iter().zip(b.rows().into_iter()){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}
fn main() {
    use std::time::{Instant};
    use rand::distributions::{Distribution,Uniform};
    use rand::SeedableRng;
    use rand_chacha::ChaCha20Rng;

    let maxsize = 1024;
    let nsamples = 100000;
    let seed : u64 = 19140148;
    let mut times0 = Vec::<f64>::new();
    let mut times1 = Vec::<f64>::new();
    {
        // Slice benchmark. Seeding once here (rather than every iteration) makes
        // each sample draw a different matrix size, while reseeding in the second
        // block below gives the ndarray version the same sequence of matrices.
        let mut rng = ChaCha20Rng::seed_from_u64(seed);
        let size_dist = Uniform::from(5..maxsize);
        let vals_dist = Uniform::new(-2.0,2.0);
        for _ in 0..nsamples{
            let m = size_dist.sample(&mut rng);

            let a : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let b : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let mut c : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let start = Instant::now();
            gemm(m,m,m,&a,m,&b,m,&mut c,m);
            times0.push( ((2*m*m*m) as f64 / start.elapsed().as_secs_f64())/(1024.0*1024.0) );
        }
    }
    {
        // ndarray benchmark: reseed so it sees the same matrices as the slice run.
        let mut rng = ChaCha20Rng::seed_from_u64(seed);
        let size_dist = Uniform::from(5..maxsize);
        let vals_dist = Uniform::new(-2.0,2.0);
        for _ in 0..nsamples{
            let m = size_dist.sample(&mut rng);


            let a = Array::from_shape_vec((m,m),(0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect()).unwrap();
            let b = Array::from_shape_vec((m,m),(0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect()).unwrap();
            let mut c = Array::from_shape_vec((m,m),(0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect()).unwrap();
            let start = Instant::now();
            gemm2(m,m,m,&a,m,&b,m,&mut c,m);
            times1.push( ((2*m*m*m) as f64 / start.elapsed().as_secs_f64())/(1024.0*1024.0) );
        }
    }

    for (t0,t1) in times0.iter().zip(times1.iter()){
        println!("{},{}",t0,t1);
    }


}