Benchmarking Rust's ndarray crate against plain slices

June 11, 2022

Since I do a lot of linear algebra work, I care a lot about good multidimensional containers. The best such containers get out of the developer's way: they make core operations easy to express while introducing as little overhead as possible, so that the compiler can still optimize any custom code the developer builds on top of them. Rust does not have a standard way of manipulating this kind of data, but there are some very good crates that fill the gap. I focus here on the ndarray crate, which boasts a very full feature set and cooperates well with Rust's iterator-based functional idioms. Unfortunately, I will show that this crate introduces some overhead that likely interferes with vectorization and prevents it from achieving its full potential. It remains the case that the container achieves high performance when the developer sticks to built-in, supported functionality or does not stray far from it. I illustrate this point by comparing a matrix-matrix multiply function written with raw Rust slices against one written with ndarray.

The benchmark

The benchmark I use is a simple xGEMM-style matrix-matrix multiply following the BLAS convention. BLAS normally takes raw pointers, but here I use Rust slices instead. The code below matches the performance of a manually indexed equivalent written with unsafe blocks, as I showed in another post.

// C += A * B for row-major matrices stored in flat slices; lda, ldb and ldc
// are the leading dimensions (row strides) of a, b and c respectively.
pub fn gemm(_m : usize, _n : usize, _p : usize, a : &[f64], lda : usize, b : &[f64], ldb : usize, c : &mut [f64], ldc : usize) -> () {
    for (ci,ai) in c.chunks_exact_mut(ldc).zip(a.chunks_exact(lda)){
        for (aik,bk) in ai.iter().zip(b.chunks_exact(ldb)){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}
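
For reference, the manually indexed unsafe equivalent mentioned above looks roughly like this; it is a reconstruction of that style rather than the code from the other post. get_unchecked and get_unchecked_mut skip the bounds checks that ordinary indexing performs.

// Reconstruction (not the original code from the other post): manually indexed
// GEMM over flat row-major slices with bounds checks removed.
pub fn gemm_manual(m: usize, n: usize, p: usize, a: &[f64], lda: usize, b: &[f64], ldb: usize, c: &mut [f64], ldc: usize) {
    for i in 0..m {
        for k in 0..p {
            unsafe {
                let aik = *a.get_unchecked(i * lda + k);
                for j in 0..n {
                    *c.get_unchecked_mut(i * ldc + j) += aik * *b.get_unchecked(k * ldb + j);
                }
            }
        }
    }
}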

I randomly sample 100000 matrices of different sizes, feed them into gemm, and then calculate a histogram and percentiles of the FLOPS achieved. This will be the basis of comparison for the ndarray version.

fn main() {
    use std::time::Instant;
    use rand::distributions::{Distribution,Uniform};
    use rand::SeedableRng;
    use rand_chacha::ChaCha20Rng;

    let maxsize = 1024;
    let nsamples = 100000;
    let seed : u64 = 19140148;
    let mut times0 = Vec::<f64>::new(); // MFLOPS samples, slice version
    let mut times1 = Vec::<f64>::new(); // MFLOPS samples, ndarray version (filled in the full listing)
    {
        // Seed once per benchmark block so every iteration draws a new matrix size,
        // while both benchmark blocks still see the same sequence of matrices.
        let mut rng = ChaCha20Rng::seed_from_u64(seed);
        let size_dist = Uniform::from(5..maxsize);
        let vals_dist = Uniform::new(-2.0,2.0);
        for _ in 0..nsamples{
            let m = size_dist.sample(&mut rng);

            let a : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let b : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let mut c : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let start = Instant::now();
            gemm(m,m,m,&a,m,&b,m,&mut c,m);
            // 2*m^3 floating point operations, reported as MFLOPS.
            times0.push( ((2*m*m*m) as f64 / start.elapsed().as_secs_f64())/(1024.0*1024.0) );
        }
    }
}
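
The listing above only collects the raw MFLOPS samples; the percentiles reported in the tables below can be obtained from times0 and times1 with a small helper along these lines (my addition, not part of the original program):

// Simple percentile: sort a copy of the samples and pick the value at the
// rounded index. Assumes the samples contain no NaNs.
fn percentile(samples: &[f64], p: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|x, y| x.partial_cmp(y).unwrap());
    let idx = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

// Example usage once both benchmark loops have run:
// for p in [1.0, 5.0, 50.0, 95.0, 99.0] {
//     println!("{}%  slice={:.0}  ndarray={:.0}", p, percentile(&times0, p), percentile(&times1, p));
// }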

The full code listing is included in an appendix at the end of this post.

Straightforward Translation

Using ndarray, the most straightforward translation of the slice version of GEMM above uses the rows()/rows_mut() methods together with iterators:

use ndarray::{Array,Array2};
pub fn gemm2(_m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    for (mut ci,ai) in c.rows_mut().into_iter().zip(a.rows().into_iter()){
        for (aik,bk) in ai.iter().zip(b.rows().into_iter()){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}

Note that, since the arrays are now two-dimensional, the lda, ldb and ldc parameters are no longer used.

When I build and run the experiment with RUSTFLAGS="-C target-cpu=native -C opt-level=3" cargo run, I find the following:

[Figure: MFLOPS histogram]

Percentile    Slice (MFLOPS)    ndarray (MFLOPS)
1%            9639              3831
5%            9659              4079
50%           11658             4098
95%           11699             4115
99%           11717             4121

In this case the slice version ran roughly 2x-3x faster than the ndarray version.

Using Zip

The ndarray documentation suggests using its custom Zip type for many iterator-like idioms. I could not produce a full implementation using only Zip, because it requires nested lambdas, and with many nested lambdas the Rust borrow checker complains about repeated mutable borrows of the c array. Still, I used it as much as I could:

use ndarray::{Array2,Zip};
pub fn gemm2(_m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    Zip::from(c.rows_mut()).and(a.rows()).for_each(|mut ci,ai|{
        Zip::from(ai).and(b.rows()).for_each(|aik,bk|{
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        });
    });
}

Percentile    Slice (MFLOPS)    ndarray (MFLOPS)
1%            9545              1954
5%            11020             2954
50%           11574             2958
95%           11834             2961
99%           11875             2962

Unfortunately, the results here were even worse than those of the standard-iterator implementation.

Unsafe methods

Since Rust idioms were not yielding good performance with ndarray, I decided to resort to basic indexing, and furthermore to turn off bounds checking by using ndarray's unsafe indexing methods (these may only be used inside unsafe blocks).

pub fn gemm2(m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    for i in 0..m{
        for k in 0..m{
            unsafe{
                // uget/uget_mut index into the array without bounds checks.
                let aik = a.uget([i,k]);
                for j in 0..m{
                    let bkj = b.uget([k,j]);
                    let cij = c.uget_mut([i,j]);
                    *cij += (*aik) * (*bkj);
                }
            }
        }
    }
}

Percentile    Slice (MFLOPS)    ndarray (MFLOPS)
1%            7236              7687
5%            11238             8821
50%           11442             9014
95%           11493             9322
99%           11516             9347

Unfortunately even with the use of unsafe indexing we are still leaving a fairly significant amount of performance on the table.

This result did surprise me; I assumed the unsafe indexing would at least result in similar code generation. Comparing the assembly of the two inner loops, there does not appear to be a major difference in vectorization. (Note: to get the line info, add -C debuginfo=1 to your RUSTFLAGS prior to invoking cargo.)

Slice GEMM:
.LBB6_17:
  .loc  12 270 12
  vbroadcastsd  (%rdx,%rdi,8), %ymm0
  xorl  %ebx, %ebx 
.Ltmp46:
  .p2align  4, 0x90 
.LBB6_18:
  .loc  10 7 25 
  vmulpd  -224(%rcx,%rbx,8), %ymm0, %ymm1
  vmulpd  -192(%rcx,%rbx,8), %ymm0, %ymm2
  vmulpd  -160(%rcx,%rbx,8), %ymm0, %ymm3
  vmulpd  -128(%rcx,%rbx,8), %ymm0, %ymm4
  .loc  10 7 17 is_stmt 0
  vaddpd  -224(%rsi,%rbx,8), %ymm1, %ymm1
  vaddpd  -192(%rsi,%rbx,8), %ymm2, %ymm2
  vaddpd  -160(%rsi,%rbx,8), %ymm3, %ymm3
  vaddpd  -128(%rsi,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -224(%rsi,%rbx,8)
  vmovupd %ymm2, -192(%rsi,%rbx,8)
  vmovupd %ymm3, -160(%rsi,%rbx,8)
  vmovupd %ymm4, -128(%rsi,%rbx,8)
  .loc  10 7 25 
  vmulpd  -96(%rcx,%rbx,8), %ymm0, %ymm1
  vmulpd  -64(%rcx,%rbx,8), %ymm0, %ymm2
  vmulpd  -32(%rcx,%rbx,8), %ymm0, %ymm3
  vmulpd  (%rcx,%rbx,8), %ymm0, %ymm4
  .loc  10 7 17 
  vaddpd  -96(%rsi,%rbx,8), %ymm1, %ymm1
  vaddpd  -64(%rsi,%rbx,8), %ymm2, %ymm2
  vaddpd  -32(%rsi,%rbx,8), %ymm3, %ymm3
  vaddpd  (%rsi,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -96(%rsi,%rbx,8)
  vmovupd %ymm2, -64(%rsi,%rbx,8)
  vmovupd %ymm3, -32(%rsi,%rbx,8)
  vmovupd %ymm4, (%rsi,%rbx,8)

ndarray+unsafe GEMM:

.Ltmp41:
  .loc  13 621 12
  vbroadcastsd  (%r15,%rbx,8), %ymm0
  xorl  %ebx, %ebx 
.Ltmp42:
  .p2align  4, 0x90 
.LBB6_18:
  .loc  11 12 29
  vmulpd  -224(%rsi,%rbx,8), %ymm0, %ymm1
  vmulpd  -192(%rsi,%rbx,8), %ymm0, %ymm2
  vmulpd  -160(%rsi,%rbx,8), %ymm0, %ymm3
  vmulpd  -128(%rsi,%rbx,8), %ymm0, %ymm4
  .loc  11 12 21 is_stmt 0
  vaddpd  -224(%rcx,%rbx,8), %ymm1, %ymm1
  vaddpd  -192(%rcx,%rbx,8), %ymm2, %ymm2
  vaddpd  -160(%rcx,%rbx,8), %ymm3, %ymm3
  vaddpd  -128(%rcx,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -224(%rcx,%rbx,8)
  vmovupd %ymm2, -192(%rcx,%rbx,8)
  vmovupd %ymm3, -160(%rcx,%rbx,8)
  vmovupd %ymm4, -128(%rcx,%rbx,8)
  .loc  11 12 29
  vmulpd  -96(%rsi,%rbx,8), %ymm0, %ymm1
  vmulpd  -64(%rsi,%rbx,8), %ymm0, %ymm2
  vmulpd  -32(%rsi,%rbx,8), %ymm0, %ymm3
  vmulpd  (%rsi,%rbx,8), %ymm0, %ymm4
  .loc  11 12 21
  vaddpd  -96(%rcx,%rbx,8), %ymm1, %ymm1
  vaddpd  -64(%rcx,%rbx,8), %ymm2, %ymm2
  vaddpd  -32(%rcx,%rbx,8), %ymm3, %ymm3
  vaddpd  (%rcx,%rbx,8), %ymm4, %ymm4
  vmovupd %ymm1, -96(%rcx,%rbx,8)
  vmovupd %ymm2, -64(%rcx,%rbx,8)
  vmovupd %ymm3, -32(%rcx,%rbx,8)
  vmovupd %ymm4, (%rcx,%rbx,8)	

Summary and Conclusions

The ndarray crate has a good design and diligently implements the features users of multidimensional arrays need, but at present it is hard to extend its functionality by building on its primitives without taking a fairly large performance hit. I'm not completely sure what causes the performance discrepancy yet, except that the rows() iterators on ndarray arrays must somehow be more expensive than chunks_exact() on slices; in what way, I do not yet know.
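
One workaround that keeps the slice version's performance, assuming the arrays are contiguous and in standard row-major layout, is to borrow the underlying slices from the ndarray containers and call the slice-based gemm directly. The sketch below illustrates the idea; it is my addition and was not benchmarked in this post.

use ndarray::Array2;

// Sketch (not part of the original code): when the ndarray arrays are contiguous
// and in standard row-major layout, reuse the slice-based gemm on their backing
// storage. as_slice()/as_slice_mut() return None otherwise, hence the expect()s.
pub fn gemm_via_slices(a: &Array2<f64>, b: &Array2<f64>, c: &mut Array2<f64>) {
    let (m, p) = a.dim();
    let (_, n) = b.dim();
    let (lda, ldb, ldc) = (a.ncols(), b.ncols(), c.ncols());
    let a_s = a.as_slice().expect("a must be contiguous, standard layout");
    let b_s = b.as_slice().expect("b must be contiguous, standard layout");
    let c_s = c.as_slice_mut().expect("c must be contiguous, standard layout");
    gemm(m, n, p, a_s, lda, b_s, ldb, c_s, ldc);
}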

Appendix

use ndarray::{Array,Array2};
pub fn gemm(_m : usize, _n : usize, _p : usize, a : &[f64], lda : usize, b : &[f64], ldb : usize, c : &mut [f64], ldc : usize) -> () {
    for (ci,ai) in c.chunks_exact_mut(ldc).zip(a.chunks_exact(lda)){
        for (aik,bk) in ai.iter().zip(b.chunks_exact(ldb)){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}

pub fn gemm2(_m : usize, _n : usize, _p : usize, a : &Array2<f64>, lda : usize, b : &Array2<f64>, ldb : usize, c : &mut Array2<f64>, ldc : usize) -> () {
    for (mut ci,ai) in c.rows_mut().into_iter().zip(a.rows().into_iter()){
        for (aik,bk) in ai.iter().zip(b.rows().into_iter()){
            for (cij,bkj) in ci.iter_mut().zip(bk.iter()){
                *cij += (*aik) * (*bkj);
            }
        }
    }
}
fn main() {
    use std::time::{Instant};
    use rand::distributions::{Distribution,Uniform};
    use rand::SeedableRng;
    use rand_chacha::ChaCha20Rng;

    let maxsize = 1024;
    let nsamples = 100000;
    let seed : u64 = 19140148;
    let mut times0 = Vec::<f64>::new();
    let mut times1 = Vec::<f64>::new();
    {
        // Slice benchmark. Seeding once here (rather than every iteration) makes
        // each sample draw a different matrix size, while reseeding in the second
        // block below gives the ndarray version the same sequence of matrices.
        let mut rng = ChaCha20Rng::seed_from_u64(seed);
        let size_dist = Uniform::from(5..maxsize);
        let vals_dist = Uniform::new(-2.0,2.0);
        for _ in 0..nsamples{
            let m = size_dist.sample(&mut rng);

            let a : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let b : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let mut c : Vec<f64> = (0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect();
            let start = Instant::now();
            gemm(m,m,m,&a,m,&b,m,&mut c,m);
            times0.push( ((2*m*m*m) as f64 / start.elapsed().as_secs_f64())/(1024.0*1024.0) );
        }
    }
    {
        // ndarray benchmark: reseed so it sees the same matrices as the slice run.
        let mut rng = ChaCha20Rng::seed_from_u64(seed);
        let size_dist = Uniform::from(5..maxsize);
        let vals_dist = Uniform::new(-2.0,2.0);
        for _ in 0..nsamples{
            let m = size_dist.sample(&mut rng);


            let a = Array::from_shape_vec((m,m),(0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect()).unwrap();
            let b = Array::from_shape_vec((m,m),(0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect()).unwrap();
            let mut c = Array::from_shape_vec((m,m),(0..m*m).map(|_x|vals_dist.sample(&mut rng)).collect()).unwrap();
            let start = Instant::now();
            gemm2(m,m,m,&a,m,&b,m,&mut c,m);
            times1.push( ((2*m*m*m) as f64 / start.elapsed().as_secs_f64())/(1024.0*1024.0) );
        }
    }

    for (t0,t1) in times0.iter().zip(times1.iter()){
        println!("{},{}",t0,t1);
    }


}