# Performance Results of Low-Rank Trick

March 09, 2019

In recent posts I explored matrices whose structure enables fast linear solution algorithms, specifically Toeplitz and circulant matrices. I also showed how to take advantage of the fast solvers even when the input matrices do not exactly satisfy the necessary structure. I didn't give many results, though, in terms of actual performance; I do that here. I will focus on circulant matrices and implement their fast solver in terms of FFTs from scipy. I then perturb circulant systems by low-rank updates and check whether my previously developed ideas actually still result in fast solvers. I start by explaining circulant matrices.

# Circulant Matrices

Without going into too much math, circulant matrices are those matrices $$C$$ that may be generated by a one-dimensional vector $$c$$: the first column of $$C$$ is $$c$$, and each subsequent column is the previous one cyclically shifted down by one position (with wrap-around).

In Python, generating circulant matrices is easy:

```python
import scipy.linalg

# Make a data array c
n = 4
c = [0, 1, 2, 3]
# Form the circulant matrix from c
C = scipy.linalg.circulant(c)
# This forms the array
# array([[0, 3, 2, 1],
#        [1, 0, 3, 2],
#        [2, 1, 0, 3],
#        [3, 2, 1, 0]])
```


Thus if we want to solve a system $$Cx=b$$ for $$x$$ we can simply use scipy.linalg.solve as follows:

```python
import numpy as np

# Make some data for b
n = 4
b = np.random.rand(n)
# Solve for x in Cx=b
x = scipy.linalg.solve(C, b)
```


This, however, requires $$O(n^3)$$ operations because it uses the most generic solver, which falls back to LU factorization. When we use the fact that $$C$$ is circulant we can solve this system in $$O(n\log(n))$$ operations: the DFT diagonalizes every circulant matrix, so $$Cx=b$$ reduces to an elementwise division in the frequency domain, $$x = \mathrm{ifft}(\mathrm{fft}(b)/\mathrm{fft}(c))$$. In code:

```python
import scipy.linalg
import scipy.fftpack
import numpy as np

# Shorthand definitions for forward/inverse FFT operations
fft = scipy.fftpack.fft
ifft = scipy.fftpack.ifft
fftcs = fft(c)

# Solves linear systems involving C.
# Note it does not directly reference the
# matrix C, but rather only uses its
# generator c.
def solveC(b):
    return np.real(ifft(fft(b) / fftcs))

n = 4
b = np.random.rand(n)
# Solve for x in Cx=b
x = solveC(b)
# x is the same answer we get from
# scipy.linalg.solve, but it took far fewer
# operations to get it this time.
```
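To see that the FFT-based solve really does agree with the dense solver, here is a small self-contained check (the size and random seed are illustrative choices of mine, not from the original experiments):

```python
import numpy as np
import scipy.linalg
import scipy.fftpack

# Build a circulant system and solve it both ways.
n = 8
rng = np.random.RandomState(0)
c = rng.rand(n)                      # generator of C
C = scipy.linalg.circulant(c)
b = rng.rand(n)

# Dense O(n^3) solve.
x_dense = scipy.linalg.solve(C, b)

# FFT-based O(n log n) solve, using only the generator c.
x_fft = np.real(scipy.fftpack.ifft(scipy.fftpack.fft(b) / scipy.fftpack.fft(c)))

print(np.allclose(x_dense, x_fft))  # True: the two solutions agree
```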



Performance of this approach speaks for itself. I timed both approaches at matrix sizes $$n=1,2,4,8,16,32,64,128,256,512,1024,2048$$, producing the following plot (see the note after the figure):

(Note: one should always keep in mind that FFTs perform best on power-of-two input sizes. Some FFT libraries handle other sizes gracefully and others do not. I chose to measure only the power-of-two case to avoid this complicating factor.)
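For readers who want to reproduce the comparison, a minimal sketch of the kind of timing harness used here might look as follows (the helper name, sizes, and repetition count are my own illustrative choices):

```python
import time
import numpy as np
import scipy.linalg
import scipy.fftpack

def time_solvers(n, reps=3):
    """Time the dense solver against the FFT-based circulant solver
    for one power-of-two size n. Returns (dense_seconds, fft_seconds)."""
    rng = np.random.RandomState(0)
    c = rng.rand(n)
    C = scipy.linalg.circulant(c)
    b = rng.rand(n)

    # Dense O(n^3) solve.
    t0 = time.perf_counter()
    for _ in range(reps):
        scipy.linalg.solve(C, b)
    dense = (time.perf_counter() - t0) / reps

    # FFT-based O(n log n) solve; fft(c) is precomputed once.
    fftcs = scipy.fftpack.fft(c)
    t0 = time.perf_counter()
    for _ in range(reps):
        np.real(scipy.fftpack.ifft(scipy.fftpack.fft(b) / fftcs))
    fast = (time.perf_counter() - t0) / reps
    return dense, fast

for n in [64, 256, 1024]:
    dense, fast = time_solvers(n)
    print(n, dense, fast)
```

At the larger sizes the FFT-based solve should come out well ahead of the dense solve.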

In my previous posts I suggested we can still obtain performance like this even if our matrices aren't exactly circulant, but circulant plus a low-rank update. Now I show that this is indeed true.

# Low-rank Modification of Circulant Matrices

Now let's consider the case of $$B = C + UV^T$$ where $$C$$ is a circulant matrix and $$U,V$$ are $$n \times k$$ matrices. In my previous post I gave the following algorithm, which requires $$k+1$$ evaluations of the fast solver for $$C$$:

```python
import scipy.sparse.linalg as sp

# B: Input matrix, B = A + U V^T. U, V need not be known explicitly.
# b: Input right-hand-side data
# solveA: Callback function that computes inv(A)*y for any input y.
def lowrank_solver(B, b, solveA):
    (m, n) = B.shape
    restart = n
    maxit = n
    M = sp.LinearOperator(matvec=solveA, shape=(m, n))
    return sp.gmres(B, b, M=M, restart=restart, maxiter=maxit)[0]
```
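To make the pieces concrete, here is a self-contained sketch (sizes and random data are illustrative) that builds $$B = C + UV^T$$, wraps the FFT-based circulant solve as the GMRES preconditioner, and solves the system:

```python
import numpy as np
import scipy.linalg
import scipy.fftpack
import scipy.sparse.linalg as sp

# Build a circulant matrix C plus a rank-k update B = C + U V^T.
n, k = 64, 2
rng = np.random.RandomState(0)
c = rng.rand(n)
C = scipy.linalg.circulant(c)
U = rng.rand(n, k)
V = rng.rand(n, k)
B = C + U @ V.T
b = rng.rand(n)

# Fast solver for the circulant part, used as the preconditioner.
fftcs = scipy.fftpack.fft(c)
def solveC(y):
    return np.real(scipy.fftpack.ifft(scipy.fftpack.fft(y) / fftcs))

M = sp.LinearOperator(matvec=solveC, shape=(n, n))
x, info = sp.gmres(B, b, M=M, restart=n, maxiter=n)

print(info == 0)                          # converged
print(np.allclose(B @ x, b, atol=1e-4))   # x solves Bx=b
```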


Let’s see how this compares to simple scipy.linalg.solve for different values of $$k$$. We should expect that the larger $$k$$ is, the more expensive the modified solver is (relative to the pure circulant solver).

| Rank of update | Timing measurements |
| --- | --- |
| Circulant + rank-1 update | (timing plot) |
| Circulant + rank-10 update | (timing plot) |
| Circulant + rank-20 update | (timing plot) |

# Conclusions

Here I compared the performance of generic linear solution algorithms, which do not exploit structure, against a solver that does. The structure I chose for illustration was the circulant property. I used the tricks developed in previous posts to recover the fast structured-solve performance even when the input matrix does not precisely have the desired structure.

What we see is that beyond a certain matrix size there is a crossover point at which a rank-$$k$$-updated fast circulant solver begins to outperform the standard direct solution techniques used in scipy.linalg.solve. As expected, this crossover point grows with the rank $$k$$, since this number determines how many iterations GMRES must take to fully solve the system. Generally, if the rank of the update is $$k$$, then the circulant-plus-low-rank solver will take $$k+1$$ GMRES iterations to converge.
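This iteration-count claim is easy to check empirically. Below is a sketch (the helper name, sizes, seed, diagonal shift, and callback-based counting are my own illustrative choices, not from the original experiments) that counts preconditioned GMRES inner iterations for a few ranks:

```python
import numpy as np
import scipy.linalg
import scipy.fftpack
import scipy.sparse.linalg as sp

def gmres_iteration_count(n, k, seed=0):
    """Build B = C + U V^T, precondition GMRES with the exact
    circulant solve for C, and count inner iterations."""
    rng = np.random.RandomState(seed)
    c = rng.rand(n)
    c[0] += n            # shift to keep C well conditioned
    C = scipy.linalg.circulant(c)
    U = rng.rand(n, k)
    V = rng.rand(n, k)
    B = C + U @ V.T
    b = rng.rand(n)

    fftcs = scipy.fftpack.fft(c)
    M = sp.LinearOperator(
        matvec=lambda y: np.real(scipy.fftpack.ifft(scipy.fftpack.fft(y) / fftcs)),
        shape=(n, n))

    residuals = []
    sp.gmres(B, b, M=M, restart=n, maxiter=n,
             callback=residuals.append, callback_type='pr_norm')
    return len(residuals)

for k in [1, 5, 10]:
    # The count should be about k+1 iterations.
    print(k, gmres_iteration_count(256, k))
```

Because the preconditioned operator is the identity plus a rank-$$k$$ term, its minimal polynomial has degree at most $$k+1$$, which is what bounds the iteration count.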

One unexpected effect was the dramatic performance drop in the $$k=1$$ case compared to the $$k=0$$ case (pure circulant, no update). I believe the reason is that the pure circulant solver is very cache efficient: it requires only $$O(n)$$ storage (the one-dimensional array $$c$$ that generates $$C$$) but performs $$O(n \log(n))$$ operations, so for large $$n$$ the operation-per-byte ratio is favorable for caching.

The low-rank-updated matrix $$B$$, however, must be formed explicitly and stored in memory to be used in GMRES. This dramatically increases the memory requirement to $$O(n^2)$$, and the overall operation count becomes $$O(n^2 + n\log(n))$$. The quadratic term comes from the matrix-vector product $$Bx$$ needed to drive GMRES.