Here I show a sparse linear system which I was unable to solve with Julia’s `lufact`

because of its excessive memory requirements. Rather than resorting to an iterative algorithm I show how to use nested dissection to get the same answer that `lufact`

would if it had enough memory. I’ll start by discussing the issues and why one may not want to use iterative solvers even though they solve the memory issue.

Every year researchers publish hundreds of articles on efficient solvers for sparse systems of linear equations. The reason for this is that the most robust solver, Gaussian Elimination, results in explosive memory requirements for even modestly sized problems. This explosion in memory cost results from a phenomenon known as fill-in. Unfortunately there is no known way to completely avoid fill-in, so the vast majority of research on linear solvers focuses on how to solve very specific sparse linear systems with iterative algorithms such as those found in the Julia package IterativeSolvers.jl. These solvers are already “black-box” in the sense that they theoretically work without requiring any extra information other than the linear system itself, but without any extra information they converge at a glacial pace. Thus there is a huge amount of research available on how to incorporate extra information into these solvers to make them practical, see below for a small sampling of my personal numerical linear algebra library which relate only to these types of solutions

Unfortunately much of these solutions remain within the realm of the very limited problem they originally targeted, resulting in the inability to use those techniques in a more black-box fashion. See for example this paper by Ernst and Gander which shows how a very modest modification to the Laplace equation results in failure of most approaches to iterative solvers - this includes domain decomposition, incomplete factorizations, and even multigrid.

Thus it remains desirable to have the direct solver option, but how can we solve larger problems without scaling up to more compute nodes? I’ll show one approach here which initially targets sparse matrices arising from finite-difference discretization of partial differential equations, but I’ll also mention how the approach generalizes to generic sparse matrices as well.

The sparse matrix in question here results from discretizing the Laplacian differential operator with second order finite difference stencils, that is we want to solve for \( u \) where

\[ 6u _ {i,j,k} - u _ {i-1,j,k} - u _ {i+1,j,k} - u _ {i,j-1,k} - u _ {i,j+1,k} - u _ {i,j,k-1} - u _ {i,j,k+1} = f _ {i,j,k} \]

for \(i=1,\ldots,N _ x,j = 1,\ldots,N _ y, k = 1, \ldots,N _ z \).

We may quickly assemble the matrix for this in Julia using a 1D laplacian and Kroneckor products as follows:

```
function laplace1d(nx)
return spdiagm( (-ones(nx-1),2.0*ones(nx),-ones(nx-1)), (-1,1,-1) );
end
function laplace3d(nx,ny,nz)
Ix=speye(nx,nx)
Iy=speye(ny,ny)
Iz=speye(nz,nz)
Ax=laplace1d(nx)
Ay=laplace1d(ny)
Az=laplace1d(nz)
return kron(Ix,kron(Iy,Az)) + kron(Ix,kron(Ay,Iz)) + kron(Ax,kron(Iy,Iz));
end
```

The memory requirements for using a direct solver (i.e. `lufact`

) grow rapidly as we increase the problem size through the parameters \( Nx,Ny,Nz \). On my machine the following code

```
nx=128
ny=128
nz=128
A=laplace3d(nx,ny,nz)
b=rand(nx*ny*nz)
F=lufact(A)
x=F\b;
```

results in

```
ERROR: LoadError: OutOfMemoryError()
Stacktrace:
[1] umfpack_numeric!(::Base.SparseArrays.UMFPACK.UmfpackLU{Float64,Int64}) at ./sparse/umfpack.jl:224
...
```

Later on I will show how to still use Julia’s `lufact`

and solve this exact system without running out of memory.

Iterative solutions for this particular equation are extremely well understood now, with multigrid algorithms likely having the most well understood convergence behavior, so we could just use one of the many fast Poisson solvers that exist and solve many orders of magnitude larger systems than \( (Nx,Ny,Nz) = (128,128,128) \) What I want to do though is show how we can still keep the robust direct solver accuracy without needing more memory. I will do this first by explicitly using the stencil and grid information, but then I will show how it may be generalized to a more black-box style solver which does not rely on any information outside of the sparse matrix itself.

An out-of-core algorithm is one which will make use of an abundant memory resource that may be several orders of magnitude slower than traditional memory (in terms of bandwidth, latency, or both). In my case I will make use of disk, but my main intention in exploring out-of-core approaches is actually due to new I/O accelerator technology becoming more widespread, such as burst buffers, and so I hope to explore out-of-core approaches in that context soon as well.

Linear algebra out-of-core algorithms often begin with a special partitioning of unknowns. In this case I will seek a partitioning of unknowns which results in the following nonzero structure for the system \( Ax=b \)

$$ \begin{pmatrix} A _ 1 & 0 & B _ 1 \\ 0 & A _ 2 & B _ 2 \\ C _ 1 & C _ 2 & D \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} a \\ b \\ c \end{pmatrix} $$

With this partitioning we can easily work from the bottom up, starting with the schur complement system and then quickly recovering the remaining unknowns. That algorithm will proceed with the following steps

- Precompute LU factorization \( A _ 1 = L _ 1 U _ 1 \) and save to disk
- Overwrite \( D \) with \( D - C _ 1 U _ 1 ^{-1} L _ 1 ^ {-1} B _ 1 \)
- Overwrite \( c \) with \( c - C _ 1 U _ 1 ^{-1} L _ 1 ^ {-1} a \)
- Overwrite \( a \) with \( U _ 1 ^{-1} L _ 1 ^ {-1} a \)
- Free memory memory for \( L _ 1, U _ 1 \)
- Precompute LU factorization \( A _ 2 = L _ 2 U _ 2 \) and save to disk
- Overwrite \( D \) with \( D - C _ 2 U _ 2 ^{-1} L _ 2 ^ {-1} B _ 2 \)
- Overwrite \( c \) with \( c - C _ 2 U _ 2 ^{-1} L _ 2 ^ {-1} b \)
- Overwrite \( b \) with \( U _ 2 ^{-1} L _ 2 ^ {-1} b \)
- Free memory memory for \( L _ 2, U _ 2 \)
- Overwrite \( c \) with \( c = D ^ {-1} c \)
- Load LU factorization \( A _ 2 = L _ 2 U _ 2 \) from disk
- Overwrite \( b \) with \( b = b - U _ 2 ^ {-1} L _ 2 ^ {-1} C _ 2 b \)
- Free memory memory for \( L _ 2, U _ 2 \)
- Load LU factorization \( A _ 1 = L _ 1 U _ 1 \) from disk
- Overwrite \( a \) with \( a = a - U _ 1 ^ {-1} L _ 1 ^ {-1} C _ 1 a \)

with this procedure the solution is stored inside the right-hand-side vector. Note that we can use `lufact`

for the sub-blocks \( A _ 1, A _ 2 \), so the goal is to design a partition which achieves the desired nonzero structure and to make the diagonal blocks small enough that their full LU factorizations fit in main memory. The partitioning is also usually designed in such a way that the schur complement system is small and dense, perfect for highly optimized LAPACK.

This procedure may be implemented in Julia as follows:

```
function ooc_solver!(A,b,ps)
# Out of core solver for a sparse matrix A with input array of partitions ps
# The RHS b gets overwritten by the solution inv(A)*b
npartitions=length(ps);
#First assemble Schur complement system.
#With this approach we can accumulate diagonal and off-diagonal components
#and then "forget" precomputed LU factors (save to disk).
#We can then read them later
#to solve for remaining blocks.
#after the Schur complement solve is complete
D=full(A[ps[end],ps[end]]);
for i = 1 : npartitions-1
B=A[ps[i],ps[end]];
C=A[ps[end],ps[i]];
L=lufact(A[ps[i],ps[i]]);
#Here we need to compute D=D-C*(L\full(B))
#But full(B) requires a lot of memory
#So we proceed column-wise by a fixed width to
#avoid needing to compute "full(B)"
temp=zeros(D);
width=16;
(m,n)=size(B);
for j = 1 : width : n
fwidth=width
#In case width does not divide n
if j+width>n
fwidth=n-j;
end
if j!= n
temp[:,j:j+fwidth]=C*(L\full(B[:,j:j+fwidth]))
end
end
D=D-temp
b[ps[end]]=b[ps[end]]-C*(L\b[ps[i]])
b[ps[i]]=L\b[ps[i]];
#Write LU factorization of diagonal block to disk
f=open("L$(i)","w");
serialize(f,L);
close(f);
#Clear out components and force GC to
#free associated memory
B=0;
C=0;
L=0;
gc();
end
#Now solve Schur complement system
b[ps[end]] = D\b[ps[end]]
#Finish with back-substitution
for i = (npartitions-1) : -1 : 1
#Load off-diagonal block
B=A[ps[i],ps[end]];
#load precomputed LU factorization
f=open("L$(i)","r");
L=deserialize(f);
b[ps[i]]=b[ps[i]]-L\(B*b[ps[end]]);
close(f);
#Free LU factorization memory
L=0;
gc();
end
#Solution stored in vector "b"
end
```

Note that for simplicity of math I kept the partition size to three, but in the above Julia code you can have as many partitions as you like. There is a tradeoff in terms of how big the sparse LU factors are and the resulting size of the dense Schur Complement solve.

Now I will show how to compute the partitions for the specific case where we know the stencil size and grid nature of the underlying equations.

Assuming a grid-like structure of our unknowns the following Julia script computes one level of a nested dissection partition.

```
function nested_dissection_partition(nx,ny,nz)
N=nx*ny*nz
#Linearize 3D array
ids=reshape(collect(1:N),(nx,ny,nz));
#Split on z-axis
nz2=div(nz,2)
#First partition: top half of 3D grid
p1=ids[:,:,1:nz2-1][:]
#Second partition: bottom half of 3D grid
p2=ids[:,:,nz2+1:nz][:]
#Third partition: the 2D slice of gridpoints that separates p1 and p2
p3=ids[:,:,nz2][:]
#Put partitions into array
ps=[];
push!(ps,p1);
push!(ps,p2);
push!(ps,p3);
return ps;
end
```

The nested dissection ordering tries to find a small grid that separates two similar-sized \( n \) dimensional grids, and returns the three partitions resulting from the separated grids and the separator. In our case this does indeed result in a nonzero structure as shown above because the stencil only ever looks at its immediate neighbors and no further. This partitioning strategy is called *nested* dissection because it is often applied recursively, but in this case I just do one level for illustration purposes.

Recall that the following Julia code crashes because of not enough memory:

```
nx=128
ny=128
nz=128
A=laplace3d(nx,ny,nz)
b=rand(nx*ny*nz)
F=lufact(A) # Crash here, out of memory!
x=F\b;
```

but now the following code works

```
nx=128
ny=128
nz=128
A=laplace3d(nx,ny,nz)
b=rand(nx*ny*nz)
ps=nested_dissection_partition(nx,ny,nz);
ooc_solver!(A,b,ps);
```

While I used the fact that I was on a simple grid-and-stencil based discretization of the Poisson equation to produce the partition, this is not strictly necessary. Every sparse matrix has an underlying graph structure, and here instead of using stencil and grid data to form separators for partitions we may use graph theory algorithms like min cut. Thus the ideas I showed here for one specific case can be applied in general as well, we just need good graph theory algorithms to achieve it. The first appearance of graph theory for solving sparse linear systems appears to have been in the 1979 paper Generalized Nested Dissection.

Here I hit a limitation on my compute node of being unable to solve the Poisson equation with `lufact`

after exceeding the problem size \( (Nx,Ny,Nz) = (128,128,128) \), because there was not enough memory to store the resulting fill-in. I showed how to use one level of nested dissection partitioning to uncover special nonzero structure which I then exploited for an out-of-core direct solver.

I am curious how out-of-core algorithms can increase the value of existing compute nodes without scaling them out. It is often accepted that the best way to solve larger problems is to scale out with more compute nodes with a distributed memory parallel implementation through MPI or something similar. But with the out-of-core solver approach we may instead employ fast temporary storage such as available with burst buffers and dramatically increase the problem size solvable by a single node. That is a positive take on this approach, but like many linear algebra algorithms there are tradeoffs which may reduce its value even in the presence of fast external storage.

I was able to do a direct solve for a larger problem than `lufact`

could do on its own, but the result was quite slow and has some tradeoffs that are worth mentioning. First of all I expected a good part of time would be spent in the disk I/O but this was not the case. Actually the vast majority of time is spent simply forming the Schur complement system. This is partially because Julia has not yet optimized sparse LU solves involving multiple RHS vectors yet (e.g. by parallelizing over columns), this appears to be a known issue that has not been resolved yet (see this GitHub issue ).

Another limitation is that this will work up to a point, but once the problems get large enough it will become impossible to store the Schur complement system at all, because it is dense. Thus to push this procedure further into larger problem sizes we may need to look into compression techniques. One beautiful observation concerning this Schur complement system is that its off-diagonal blocks tend to be very low rank and this can be exploited to dramatically reduce the memory cost of storing it, see more details in this paper by Per-Gunnar Martinsson. I may explore this idea in a later blog post.