Vectorization is a critical optimization for numerical computing code. Achieving it in Rust, however, is a little tricky: Rust emphasizes memory safety, so many common indexing operations perform runtime bounds checking, and this bounds checking generally prevents vectorization. One can avoid bounds checking by relying on iterators instead. I show here an implementation of matrix-matrix multiply in Rust using iterators alone, with no raw indexing; avoiding raw indexing usually produces faster Rust code because it allows the compiler to omit bounds checks. I show the reference code with raw indexing, its translation to iterators, and an "unsafe" version using raw pointer dereferencing for performance comparison.

In what follows I compute \[ \mathbf{C} \leftarrow \mathbf{C} + \mathbf{A}\mathbf{B} \] and these symbols roughly correspond to variables in the Rust code below. I map matrices to simple 1D arrays of contiguous floating point numbers, as is common in libraries like BLAS and LAPACK, and store them in row-major order so that rows are contiguous in memory.
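Concretely, with row-major storage and a leading dimension `ld` (the stride between consecutive rows), element (i, j) lives at flat index j + ld*i. A minimal sketch of this mapping (the helper name `at` is mine, not from any library):

```rust
// Row-major mapping: element (i, j) of a matrix stored in a flat
// slice with leading dimension `ld` lives at index j + ld * i.
fn at(data: &[f64], ld: usize, i: usize, j: usize) -> f64 {
    data[j + ld * i]
}

fn main() {
    // A 2x3 matrix:
    // [ 1 2 3 ]
    // [ 4 5 6 ]
    let m = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let ld = 3; // three columns per row
    assert_eq!(at(&m, ld, 0, 2), 3.0); // row 0, column 2
    assert_eq!(at(&m, ld, 1, 0), 4.0); // row 1, column 0
}
```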

The basic Rust implementation is

```
pub fn gemm(m: usize, n: usize, p: usize,
            a: &[f64], lda: usize,
            b: &[f64], ldb: usize,
            c: &mut [f64], ldc: usize) {
    // m: number of rows of A and C
    // n: number of columns of B and C
    // p: number of columns of A, rows of B
    for i in 0..m {
        for k in 0..p {
            for j in 0..n {
                let aik = a[k + lda * i];
                let bkj = b[j + ldb * k];
                let cij = &mut c[j + ldc * i];
                *cij += aik * bkj;
            }
        }
    }
}
```
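As a quick usage check (a small driver of my own, not part of the benchmark, with `gemm` repeated so the snippet stands alone), here is a 2×3 by 3×2 multiply with this signature:

```rust
// gemm as above, repeated so this example is self-contained.
pub fn gemm(m: usize, n: usize, p: usize,
            a: &[f64], lda: usize,
            b: &[f64], ldb: usize,
            c: &mut [f64], ldc: usize) {
    for i in 0..m {
        for k in 0..p {
            for j in 0..n {
                c[j + ldc * i] += a[k + lda * i] * b[j + ldb * k];
            }
        }
    }
}

fn main() {
    // A is 2x3, B is 3x2, C is 2x2; tight layouts (lda = p, ldb = n, ldc = n).
    let a = [1.0, 2.0, 3.0,
             4.0, 5.0, 6.0];
    let b = [ 7.0,  8.0,
              9.0, 10.0,
             11.0, 12.0];
    let mut c = [0.0f64; 4];
    gemm(2, 2, 3, &a, 3, &b, 2, &mut c, 2);
    // C = A*B = [ 58 64 ; 139 154 ] in row-major order.
    assert_eq!(c, [58.0, 64.0, 139.0, 154.0]);
}
```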

We can see from Godbolt.org that this code does not vectorize, even with optimizations turned on (I set the architecture to Skylake+AVX512, so we are looking for usage of the `ymm*` registers and packed vector operations like `vmulpd` and `vaddpd`).

The culprit is bounds checking, which I eliminate by switching to iterators below.
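The difference can be seen in miniature (a toy example of mine, not from the gemm code): indexed access forces the compiler to either prove or check that every index is in range, while an iterator by construction can never step out of bounds.

```rust
// Indexed version: each `v[i]` is a candidate for a runtime bounds check
// unless the optimizer can prove `i < v.len()` for the whole loop.
fn sum_indexed(v: &[f64], n: usize) -> f64 {
    let mut s = 0.0;
    for i in 0..n {
        s += v[i]; // panics if n > v.len()
    }
    s
}

// Iterator version: the iterator cannot step past the end of the slice,
// so there is nothing left to check inside the loop.
fn sum_iter(v: &[f64]) -> f64 {
    v.iter().sum()
}

fn main() {
    let v = [1.0, 2.0, 3.0, 4.0];
    assert_eq!(sum_indexed(&v, v.len()), sum_iter(&v));
}
```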

Rust data slices (the type `&[f64]` in this case) are somewhat similar to pointers in C and C++, but one key difference is that they carry a length. This means many iterator idioms common to Rust's richer containers (e.g. `BTreeSet`, `Vec`) also work with slices. It is not immediately obvious how to translate the indexed code above into iterator code. However, if instead of iterating directly over the floating point numbers in the slices we iterate over rows using chunks, and use zip for co-iteration, the indexed code translates to the following:

```
pub fn gemm(_m: usize, _n: usize, _p: usize,
            a: &[f64], lda: usize,
            b: &[f64], ldb: usize,
            c: &mut [f64], ldc: usize) {
    for (ci, ai) in c.chunks_exact_mut(ldc).zip(a.chunks_exact(lda)) {
        for (aik, bk) in ai.iter().zip(b.chunks_exact(ldb)) {
            for (cij, bkj) in ci.iter_mut().zip(bk.iter()) {
                *cij += *aik * *bkj;
            }
        }
    }
}
```
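To unpack the idiom (a toy illustration of mine, not from the gemm code): `chunks_exact(ld)` walks a flat slice row by row, and `zip` pairs rows up for co-iteration, stopping at the shorter of the two sequences.

```rust
fn main() {
    // Two 2x3 row-major matrices in flat storage.
    let a = [1.0, 2.0, 3.0,
             4.0, 5.0, 6.0];
    let mut c = [0.0f64; 6];
    // Co-iterate over rows of `c` and `a`, adding `a` to `c` elementwise;
    // no index appears anywhere, so no bounds check is needed.
    for (crow, arow) in c.chunks_exact_mut(3).zip(a.chunks_exact(3)) {
        for (cij, aij) in crow.iter_mut().zip(arow.iter()) {
            *cij += *aij;
        }
    }
    assert_eq!(c, a);
}
```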

Since this is done with iterators, Rust can omit the bounds checking. Godbolt shows decent usage of the `ymm*` registers, with `vmulpd` and `vaddpd` where one would expect them.

For comparison I also show an *unsafe* version of this code, so one can compare how the two vectorize and whether the iterators cost anything in codegen quality.

This implementation more closely matches how the code would look in C or C++. The unsafe block lets us dereference raw pointers without Rust inserting bounds checks, so this version should vectorize about as well as the C or C++ equivalent and serves as a useful reference point for the iterator implementation.

```
pub fn gemm(m: usize, n: usize, p: usize,
            a: &[f64], lda: usize,
            b: &[f64], ldb: usize,
            c: &mut [f64], ldc: usize) {
    // Grab pointers to the slices.
    // This is fine to do outside unsafe blocks, but you can't
    // really do anything useful with them except inside unsafe
    // blocks. Putting it here makes the inner loop look cleaner.
    let ap = a.as_ptr();
    let bp = b.as_ptr();
    let cp = c.as_mut_ptr();
    // Pointer offsets have to be done with signed integers.
    // Convert everything to signed for later use.
    let ldai = lda as isize;
    let ldbi = ldb as isize;
    let ldci = ldc as isize;
    for i in 0..(m as isize) {
        for k in 0..(p as isize) {
            for j in 0..(n as isize) {
                let ik = k + ldai * i;
                let kj = j + ldbi * k;
                let ij = j + ldci * i;
                unsafe {
                    let aik = *ap.offset(ik);
                    let bkj = *bp.offset(kj);
                    let cij = cp.offset(ij);
                    *cij += aik * bkj;
                }
            }
        }
    }
}
```

The binary size is a little smaller for this version than the iterator version, but the actual inner loop vectorization is roughly identical.

Now I show how these implementations compare.

Running on my laptop, I benchmarked these matrix-matrix multiplications for various matrix sizes, running each implementation `100` times per size and taking the fastest runtime. I plot all three implementations together below.

Both the iterator version and the unsafe version handily beat the basic indexed Rust version because both were able to vectorize. At this scale, however, it is hard to compare the iterator version against the unsafe version, so I removed the "naive" numbers to make the two easier to see together.

It appears that on these matrix sizes the iterator implementation is effectively equivalent in performance to the unsafe version.

Vectorization is only one piece of the puzzle of performance for matrix-matrix multiply. Other important optimizations are:

- Blocking for parallelism
- Blocking for cache
- Packing data
- Blocking for register usage
- Microkernel (usually a small implementation of outer product in intrinsics or inline assembly)
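To give a flavor of one of these, here is a sketch of cache blocking under the same signature (the block size `BS` is arbitrary and untuned; this is my illustration, not the implementation benchmarked above, and real libraries combine it with packing and a microkernel):

```rust
// Cache-blocked gemm sketch: the i/k/j loops are tiled so the working
// set of each inner kernel is small enough to stay in cache. BS is an
// arbitrary choice here; real implementations pick block sizes per
// cache level and also pack the tiles into contiguous buffers.
const BS: usize = 64;

pub fn gemm_blocked(m: usize, n: usize, p: usize,
                    a: &[f64], lda: usize,
                    b: &[f64], ldb: usize,
                    c: &mut [f64], ldc: usize) {
    for i0 in (0..m).step_by(BS) {
        for k0 in (0..p).step_by(BS) {
            for j0 in (0..n).step_by(BS) {
                // Plain triple loop over one tile.
                for i in i0..(i0 + BS).min(m) {
                    for k in k0..(k0 + BS).min(p) {
                        let aik = a[k + lda * i];
                        for j in j0..(j0 + BS).min(n) {
                            c[j + ldc * i] += aik * b[j + ldb * k];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    // Tiny check against a hand-computed 2x2 product.
    let a = [1.0, 2.0, 3.0, 4.0]; // 2x2
    let b = [5.0, 6.0, 7.0, 8.0]; // 2x2
    let mut c = [0.0f64; 4];
    gemm_blocked(2, 2, 2, &a, 2, &b, 2, &mut c, 2);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
}
```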

Implementing all of these is a tremendous amount of work, so it's usually best to use an optimized BLAS wherever possible. I was simply curious whether one could implement matrix-matrix multiply in Rust using pure iterators and achieve vectorization, and it does appear that this is possible.