Nightmare Matrices: When Sparse Solvers Fail on GPUs

August 22, 2025

I introduce a new family of symmetric-positive-definite sparse matrices (“Nightmare matrices”) \(A\) that defeats all known preconditioners, sparse direct solvers, and Krylov iterative methods for solving \(Ax = b\). I then analyze the performance of Krylov methods for these matrices on GPUs and show how the expander structure leads to uncoalesced memory access in cusparse::csrmv, causing warp stalls and poor GPU utilization per iteration. Next I demonstrate how deflation can improve GPU utilization by adding work the hardware handles efficiently while still accelerating convergence. Finally, I outline how to compute a deflation subspace that uses the GPU effectively and support these claims with performance counter and profiler results.

Nightmare Sparse Matrices

These matrices are created in three phases. First, nonzeros are distributed randomly across the matrix (each with a value of 1) to match a prescribed number of nonzeros per row; the identity is then added to ensure a nonzero diagonal, and the result is symmetrized. With high probability this produces an expander graph, which breaks many fill-reducing orderings used for sparse direct solvers. In the second phase the nonzero values (currently all 1s) are replaced with numbers drawn uniformly from [-1,1], centering the eigenvalue distribution at zero. Finally, I set \(A \leftarrow A^T A\), which increases the number of nonzeros, shrinks the smallest eigenvalue in absolute value, and concentrates the spectrum toward its endpoints. Because the nonzero values are sampled randomly, the bulk of the eigenvalue distribution sits near 0 and tapers off, which is a very poor (though not the absolute worst) distribution for unpreconditioned Krylov solvers. The worst case would be Chebyshev points, but it is not trivial to match an arbitrary spectrum to a fixed sparsity pattern.
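
To make the construction concrete, here is a minimal SciPy sketch of the three phases as I read them (the function name nightmare_matrix and the specific random-number calls are my own choices, not code from the post or a library):

import numpy as np
import scipy.sparse as sp

def nightmare_matrix(m, nnz_per_row, seed=0):
    rng = np.random.default_rng(seed)

    # Phase 1: random pattern with ~nnz_per_row entries per row (all 1s),
    # plus the identity, then symmetrized -- an expander-like graph w.h.p.
    rows = np.repeat(np.arange(m), nnz_per_row)
    cols = rng.integers(0, m, size=m * nnz_per_row)
    A = sp.coo_matrix((np.ones(m * nnz_per_row), (rows, cols)), shape=(m, m))
    A = (A + sp.identity(m) + A.T).tocsr()

    # Phase 2: overwrite the nonzero values with Uniform[-1, 1] samples,
    # which roughly centers the spectrum at zero.
    A.data = rng.uniform(-1.0, 1.0, size=A.nnz)

    # Phase 3: A <- A^T A is symmetric positive (semi)definite, has more
    # nonzeros, and piles eigenvalues up near zero.
    return (A.T @ A).tocsr()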

I will investigate unpreconditioned Krylov methods in more detail later, but to illustrate the poor performance I ran Conjugate Gradient on a Nightmare matrix with m=100000 rows on an NVIDIA T4 GPU, and it never really converged (note that I report relative error here rather than relative residual):

it: 175593 relative error = 0.005438385018617882

real    3m52.127s
user    3m44.893s
sys     0m8.932s
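
For reference, a run like this takes only a few lines with CuPy's built-in CG. Below is a hedged sketch that reuses the hypothetical nightmare_matrix helper from above and a manufactured right-hand side so the relative error is measurable; the parameters are illustrative and the post's actual script may differ.

import cupy as cp
import cupyx.scipy.sparse as cpsp
from cupyx.scipy.sparse.linalg import cg

A = cpsp.csr_matrix(nightmare_matrix(100_000, 4))  # move the SciPy matrix to the GPU
x_true = cp.random.standard_normal(A.shape[0])
b = A @ x_true                                     # manufactured RHS so the error is known

x, info = cg(A, b, tol=1e-12, maxiter=200_000)
rel_err = float(cp.linalg.norm(x - x_true) / cp.linalg.norm(x_true))
print(f"info={info} relative error = {rel_err}")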

Since unpreconditioned Krylov methods fail, we might consider preconditioners or sparse direct solvers, but these approaches also fall short.

Failure of sparse direct solvers on Nightmare matrices

Expander graphs ensure that any two vertices in the underlying graph are connected by a short path. The matrix can be quite sparse, yet in many important ways it behaves like a dense matrix. In particular, sparse direct methods produce significant fill-in regardless of the fill-reducing ordering, as illustrated below with a banded matrix as a reference.

These figures show that sparse direct methods effectively require storing dense factors of \(A\) even though \(A\) itself is extremely sparse. For matrices with 100000 rows this is impossible on a GPU with limited memory. We therefore turn to iterative methods and preconditioning, but those options mostly fail as well.
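
One way to quantify this (not the post's code, just a hedged sketch using SciPy's SuperLU wrapper, which applies a COLAMD fill-reducing ordering by default) is to compare the nonzero count of the factors against that of the matrix itself:

import scipy.sparse.linalg as spla

def fill_ratio(A):
    # Factor A with SuperLU's default fill-reducing ordering and report how many
    # nonzeros the factors need relative to A itself. Use a modest m here: on a
    # Nightmare matrix the factors quickly exhaust memory.
    lu = spla.splu(A.tocsc())
    return (lu.L.nnz + lu.U.nnz) / A.nnz

For a banded matrix this ratio stays modest; for a Nightmare matrix it heads toward the dense limit, which is the behavior the figures illustrate.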

Failure of Sparsifying Preconditioners

“Sparsifying” preconditioners here means approaches that do not use the full matrix—for example incomplete LU, algebraic multigrid, or additive Schwarz/Block Jacobi. For Nightmare matrices these approaches fail to approximate the underlying system in important ways. One could design expander matrices where they succeed; for instance, Block Jacobi could work if off-diagonal blocks have low rank, even when the underlying graph is an expander. Nightmare matrices are specifically crafted to avoid such properties. I omit numerical results here to keep the post manageable, though I may explore them later.

But in a nutshell, here is what ends up happening in each case:

  1. Incomplete LU: capturing an expander accurately requires substantial fill, so an incomplete factorization either drops most of what it needs (and barely helps) or keeps it (and stops being sparse).
  2. Algebraic multigrid: coarsening relies on the graph having local structure to aggregate, which an expander by construction lacks.
  3. Additive Schwarz/Block Jacobi: the off-diagonal coupling these methods discard carries most of the action, so the block-diagonal approximation is poor.

These techniques may offer some benefit, but not enough to outweigh the additional memory and computational burden they impose.
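
As one example of how such an experiment can be set up, here is a sketch of the incomplete LU case on the CPU with SciPy. The drop tolerance and fill factor are illustrative values, not tuned settings from the post, and spilu stands in for an incomplete Cholesky (which SciPy does not ship), so strictly speaking the preconditioner is not symmetric.

import scipy.sparse.linalg as spla

def ilu_preconditioned_cg(A, b, drop_tol=1e-3, fill_factor=10, maxiter=5000):
    # Incomplete LU with a drop tolerance: on a Nightmare matrix it either drops
    # most of the fill it needs (and barely helps) or keeps it (and stops being sparse).
    ilu = spla.spilu(A.tocsc(), drop_tol=drop_tol, fill_factor=fill_factor)
    M = spla.LinearOperator(A.shape, matvec=ilu.solve)
    return spla.cg(A, b, M=M, maxiter=maxiter)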

Solving Nightmare Systems on GPU

Given the failure of sparse direct solvers, we turn to iterative methods. Most preconditioners offer little help, leaving unpreconditioned Krylov methods such as Conjugate Gradients. Here we encounter another awful property of Nightmare matrices: evaluating \(Ax\) for any \(x\) resembles pure random access, the worst possible memory-access pattern on the GPU.
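
To see why, it helps to write out what a CSR matrix-vector product actually does. Here is a plain NumPy reference version (not what cusparse runs, but the access pattern is the same):

import numpy as np

def csr_spmv_reference(data, indices, indptr, x):
    # Reference CSR matvec. The gather x[indices[lo:hi]] is the problem: for an
    # expander the column indices in each row are scattered across all of x, so
    # the corresponding GPU loads are uncoalesced.
    y = np.empty(len(indptr) - 1, dtype=x.dtype)
    for i in range(len(y)):
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]
    return y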

Plain Conjugate Gradient

Each step of Conjugate Gradient involves a single evaluation of \(Ax\), and this dominates the runtime. For a Nightmare matrix that work makes poor use of the GPU. I profiled a run with m=100000 and ~60 nonzeros per row using cupy, which in this case mostly dispatches to cusparse. The table below, from an nsys profile, confirms this: roughly 70% of the time is spent in csrmv, the cusparse sparse matrix-vector product for CSR matrices.

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
|---|---|---|---|---|---|---|---|---|
| 70.5 | 525524967 | 1003 | 523953.1 | 482678.0 | 117693 | 757360 | 101194.5 | cusparse::csrmv_v3_kernel |
| 5.7 | 42286307 | 6999 | 6041.8 | 6431.0 | 3519 | 9440 | 1453.2 | cupy_multiply__float_float64_float64 |
| 4.7 | 35381985 | 5001 | 7075.0 | 6400.0 | 5375 | 11904 | 1741.4 | cub::DeviceReduceKernel |
| 4.4 | 32736008 | 6001 | 5455.1 | 5184.0 | 3809 | 87518 | 1639.6 | cupy_subtract__float64_float64_float64 |
| 3.4 | 25017723 | 5001 | 5002.5 | 4544.0 | 3711 | 8800 | 1205.1 | cupy_multiply__float64_float64_float64 |
| 2.2 | 16511519 | 5001 | 3301.6 | 3008.0 | 2527 | 5600 | 799.0 | cub::DeviceReduceSingleTileKernel |
| 1.6 | 11606399 | 1003 | 11571.7 | 11073.0 | 8544 | 16160 | 1566.6 | cusparse::csrmv_v3_partition_kernel |
| 1.1 | 8207993 | 1000 | 8208.0 | 8000.0 | 6880 | 10336 | 689.7 | cupy_add__float64_float64_float64 |
| 1.0 | 7301007 | 3001 | 2432.9 | 2176.0 | 1919 | 3680 | 544.3 | cupy_sqrt__float64_float64 |

The problem with csrmv is that the matrix comes from an expander graph, so its gathers from the input vector are nearly random. This leads to significant uncoalesced memory traffic and warp stalls on the GPU, as shown by the performance counters below.

| Metric | Throughput % | Avg | Avg Warps per Cycle |
|---|---|---|---|
| Compute Warps in Flight | 21.0 | 43,237,130 | 14 |
| Unallocated Warps in Active SMs | 20.0 | 40,177,614 | 13 |
| Vertex/Tess/Geometry Warps in Flight | 0.0 | 0 | 0 |
| Pixel Warps in Flight | 0.0 | 0 | 0 |
| DRAM Read Bandwidth | 21.0 | | |
| DRAM Write Bandwidth | 2.0 | | |
| GR Active | 47.0 | | |
| Async Compute in Flight | 45.0 | | |
| Sync Compute in Flight | 0.0 | | |

The profile shows poor DRAM bandwidth (21% utilization) and few warps in flight because uncoalesced reads cause stalls. In the next section I show how to recover GPU utilization by doing more work per iteration, but choosing work the GPU executes efficiently.

Deflated Conjugate Gradient

Deflated methods seek a matrix \(V\) with orthonormal columns that projects out the components slowing Krylov convergence. The challenge is computing such a \(V\) and integrating it into a Krylov method; different deflation schemes handle \(V\) slightly differently. The scheme below is due to Kahl and Rittich.

For \( A x = b \), let \( V \in \mathbb{R}^{n \times k} \) have orthonormal columns and define

\[ M = V^\top A V. \]

The deflated system is

\[ (I - A V M^{-1} V^\top)A x_h = (I - A V M^{-1} V^\top)b. \]

After solving for \( x_h \), the reconstruction is

\[ x = V M^{-1} V^\top b + \bigl( x_h - V M^{-1} V^\top A x_h \bigr). \]
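
In code, the projected operator and the reconstruction look roughly like the following CuPy sketch. It assumes \(V\) is a dense \(n \times k\) array already on the GPU, and it is my reading of the formulas above rather than the post's exact implementation:

import cupy as cp

def make_deflated_system(A, V, b):
    # Pieces of the deflation scheme: M = V^T A V, the projected operator
    # (I - A V M^{-1} V^T) A, and the projected right-hand side.
    AV = A @ V                               # n x k, one SpMM
    M = V.T @ AV                             # k x k
    M_inv = cp.linalg.inv(M)                 # k is small, so a dense inverse is fine for a sketch

    def matvec(x):
        y = A @ x                            # the usual sparse matvec (csrmv)
        return y - AV @ (M_inv @ (V.T @ y))  # dense level-2 work: V^T y, M^{-1}(.), AV @ (.)

    b_defl = b - AV @ (M_inv @ (V.T @ b))
    return matvec, b_defl, AV, M_inv

def reconstruct(x_h, A, V, AV, M_inv, b):
    # x = V M^{-1} V^T b + (x_h - V M^{-1} V^T A x_h)
    return V @ (M_inv @ (V.T @ b)) + x_h - V @ (M_inv @ (V.T @ (A @ x_h)))

The matvec can be handed to any CG implementation (for example wrapped in a LinearOperator); the dense products against \(V\) are where the gemv kernels in the trace below come from.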

I choose \(V\) to be the matrix of eigenvectors associated with the smallest-magnitude eigenvalues of \(A\) and postpone discussion of how to compute it until the next section. This adds work per iteration on top of evaluating \(Av\), but the extra work is productive for the following reasons:

  1. It involves level-2 BLAS (\(Vx\), \(V^T x\)) with dense \(V\), which coalesces well and achieves higher GPU utilization
  2. \(V\) projects out the components belonging to the extremely small eigenvalues, which caused the CG run above to stagnate

and indeed we see point (1) plainly in the trace, with aggregates below. Almost 80% of the time goes to gemv-style level-2 BLAS kernels, and the share spent on csrmv drops from about 70% to roughly 9%. The csrmv call itself is unchanged; each iteration simply adds work that uses the GPU efficiently, and far fewer iterations are needed to reach the same error (quantified later, after the GPU efficiency discussion).

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
|---|---|---|---|---|---|---|---|---|
| 47.0 | 9334022559 | 5800 | 1609314.2 | 1609599.0 | 1601119 | 1617375 | 2385.3 | gemv2N_kernel |
| 30.9 | 6138250408 | 3867 | 1587341.7 | 1587103.0 | 1580799 | 1599136 | 2410.3 | gemv2T_kernel_val |
| 8.5 | 1683961456 | 3868 | 435357.1 | 435542.5 | 116638 | 754257 | 11635.0 | cusparse::csrmv_v3_kernel |
| 5.7 | 1134962046 | 1 | 1134962046.0 | 1134962046.0 | 1134962046 | 1134962046 | 0.0 | cusparse::load_balancing_kernel |
| 1.8 | 357572186 | 3867 | 92467.6 | 92510.0 | 74046 | 117502 | 3270.0 | trsv_ln_exec_up |
| 1.8 | 357424521 | 3867 | 92429.4 | 92350.0 | 74655 | 117278 | 3253.3 | trsv_ln_exec |
| 1.5 | 288257525 | 1 | 288257525.0 | 288257525.0 | 288257525 | 288257525 | 0.0 | volta_dgemm_64x64_tn |
| 0.7 | 144908215 | 17405 | 8325.7 | 2240.0 | 1248 | 11954831 | 91079.1 | cupy_copy |
| 0.4 | 80594963 | 15460 | 5213.1 | 4912.0 | 2976 | 87775 | 1761.2 | cupy_subtract |

Our utilization metrics improve markedly. DRAM bandwidth nearly triples, and many more warps are active because memory access patterns become largely coalesced.

| Metric | Throughput % | Avg | Avg Warps per Cycle |
|---|---|---|---|
| Compute Warps in Flight | 71.0 | 142,502,227 | 46 |
| Unallocated Warps in Active SMs | 18.0 | 36,210,158 | 12 |
| Vertex/Tess/Geometry Warps in Flight | 0.0 | 0 | 0 |
| Pixel Warps in Flight | 0.0 | 0 | 0 |
| DRAM Read Bandwidth | 58.0 | | |
| DRAM Write Bandwidth | 1.0 | | |
| GR Active | 91.0 | | |
| Async Compute in Flight | 91.0 | | |
| Sync Compute in Flight | 0.0 | | |

And just for ease of comparison, I include both runs, with and without deflation, below:

| Metric | Run 1 (Deflated CG, large deflation space) | Run 2 (CG on expander matrix, no deflation) |
|---|---|---|
| Compute Warps in Flight | 71% (142M, 46 warps/cycle) | 21% (43M, 14 warps/cycle) |
| Unallocated Warps in Active SMs | 18% (36M, 12 warps/cycle) | 20% (40M, 13 warps/cycle) |
| Vertex/Tess/Geometry Warps in Flight | 0% | 0% |
| Pixel Warps in Flight | 0% | 0% |
| DRAM Read Bandwidth | 58% | 21% |
| DRAM Write Bandwidth | 1% | 2% |
| GR Active | 91% | 47% |
| Async Compute in Flight | 91% | 45% |
| Sync Compute in Flight | 0% | 0% |

And finally, the payoff we actually care about: much lower error in significantly less wall time:

| Run | Iterations | Final Error | Runtime (s) | Iterations / s | log10(Error) / Iteration |
|---|---|---|---|---|---|
| k = 0 (no deflation) | 175,593 | 5.44e-03 | 232.1 | ~757 | -1.28e-04 |
| k = 1024 (deflation) | 1,931 | 3.59e-06 | 28.2 | ~68.5 | -1.70e-03 |

Computing the deflation subspace

The results from deflated conjugate gradient are impressive, but of course they require computing a suitable \( V \). In a sense this takes us back to where we started: we now need an eigensolver that can produce the smallest-magnitude eigenpairs of \( A \), which is itself a Nightmare matrix, so most standard techniques are again off the table. The gold-standard Krylov eigensolvers are Arnoldi and Lanczos iterations, but to resolve small-magnitude eigenvalues adequately they proceed sequentially, one vector at a time, because they maintain orthogonality against previous iterates. I have already established that evaluating \(Ax\) for single vectors uses the GPU poorly. The solution is a block method that advances multiple eigenvector estimates simultaneously. This brings us back to coalesced accesses, because evaluating \(AX\) for a tall matrix \(X\) (especially if \(X\) is stored row-major) lets each nonzero of \(A\) load a contiguous row of \(X\) rather than a single scalar.
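
The difference is easy to see directly in CuPy, where multiplying the sparse matrix by a dense block dispatches to a single SpMM instead of one SpMV per column (the sizes and the random stand-in matrix here are illustrative):

import cupy as cp
import cupyx.scipy.sparse as cpsp

n, k = 100_000, 256
A = cpsp.random(n, n, density=60 / n, format='csr', dtype=cp.float64)  # stand-in sparse matrix
X = cp.random.standard_normal((n, k))                                  # dense block of k vectors

Y_cols = cp.stack([A @ X[:, j] for j in range(k)], axis=1)  # k separate csrmv calls
Y_blk = A @ X                                               # one spmm call over the whole block
cp.testing.assert_allclose(Y_cols, Y_blk, rtol=1e-10)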

To build such a block method I make use of the fact that symmetric matrices can be diagonalized with orthonormal eigenvectors \(W\):

\[ A = W \Lambda W^T \]

and this decomposition shows that any polynomial \( P \) satisfies

\[ P(A) = W P(\Lambda) W^T \]

This means that if we choose \( P(x) \approx \frac{1}{x} \), then applying \(P(A)\) magnifies the components associated with the small eigenvalues. This is basically a restatement of the principle underlying Krylov solvers, but the polynomial can also be fixed in advance rather than built adaptively. This is the idea behind Chebyshev iteration.

For an SPD \(A\) with \(\operatorname{spec}(A)\subset[\lambda_{\min},\lambda_{\max}]\), set \[ \mu = \tfrac{\lambda_{\max}+\lambda_{\min}}{2}, \qquad \nu = \tfrac{\lambda_{\max}-\lambda_{\min}}{2}. \]

Initialize \(x^{(0)}\) and \(r^{(0)} = b - A x^{(0)}\). Let \(\alpha_0 = \tfrac{1}{\mu},\; \beta_0 = 0\) and form \[ x^{(1)} = x^{(0)} + \alpha_0 r^{(0)}. \]

For \(k \ge 1\), \[ \beta_k = \left(\frac{\nu \,\alpha_{k-1}}{2}\right)^2, \qquad \alpha_k = \frac{1}{\mu - \dfrac{\beta_k}{\alpha_{k-1}}}, \] \[ x^{(k+1)} = x^{(k)} + \alpha_k r^{(k)} + \beta_k \big(x^{(k)} - x^{(k-1)}\big), \qquad r^{(k)} = b - A x^{(k)}. \]

The advantage of Chebyshev iteration in this context, compared to using another Krylov method, is that it is simple to advance multiple vectors simultaneously: every vector uses exactly the same polynomial coefficients (as opposed to, say, Conjugate Gradient, which computes its coefficients from dot products of previous iterates and would therefore produce different coefficients for each vector).

Here is the pseudocode for computing \( V \) with all of the above observations in mind:

# Chebyshev–Rayleigh–Ritz for k smallest eigenpairs
# Assume a routine:
#   Y = ChebyshevSolve(A, RHS, inner, λ_min, λ_max)
# which does 'inner' Chebyshev iterations to approximately solve A·Y = RHS.

Inputs:
  A            # symmetric SPD n×n operator (matvec/matmat supported)
  k            # number of eigenpairs
  outer        # outer RR iterations
  inner        # inner Chebyshev iterations
  λ_min, λ_max # spectral bounds

Procedure:
  V ← RandomOrthonormal(n, k)      # e.g., QR of random matrix
  Θ ← undefined

  for t = 1..outer:
      # 1) Block Chebyshev solve: approximate A^{-1}·V
      W ← ChebyshevSolve(A, V, inner, λ_min, λ_max)   # n×k

      # 2) Orthonormalize subspace
      Q, _ ← qr(W)                                    # n×k with QᵀQ = I

      # 3) Rayleigh–Ritz on span(Q)
      T ← Qᵀ A Q                                      # k×k
      S, diag(Θ) ← eig(T)                             # Θ ascending
      V ← Q S                                         # n×k (orthonormal)

      # (optional) stop if max_j ||A v_j − Θ_j v_j|| ≤ tol
  end for

Outputs:
  V  # Ritz vectors (≈ k smallest eigenvectors)
  Θ  # Ritz values (≈ k smallest eigenvalues)
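
The pseudocode assumes a ChebyshevSolve routine; a minimal CuPy sketch of a block version, following the recurrence from the previous section (the zero initial guess and the names are my choices), might look like this:

import cupy as cp

def chebyshev_solve(A, RHS, inner, lam_min, lam_max):
    # Block Chebyshev iteration for A Y ~= RHS. Every column shares the same
    # alpha/beta, so each step is one SpMM on the whole block plus dense updates.
    mu = 0.5 * (lam_max + lam_min)
    nu = 0.5 * (lam_max - lam_min)

    X_prev = cp.zeros_like(RHS)
    R = RHS - A @ X_prev                 # r^(0) with x^(0) = 0
    alpha = 1.0 / mu
    X = X_prev + alpha * R               # x^(1)

    for _ in range(1, inner):
        R = RHS - A @ X                  # block residual, one SpMM
        beta = (0.5 * nu * alpha) ** 2
        alpha = 1.0 / (mu - beta / alpha)
        X, X_prev = X + alpha * R + beta * (X - X_prev), X
    return X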

I also provide code in the appendices.

From the trace we can see significant time in cusparse::spmm (note the asterisk: I renamed this from the actual cusparse symbol, which has a lot of template parameters in it) and subsequent calls to DGEMM, which is level-3 BLAS, absolutely the best place to be on a GPU.

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
|---|---|---|---|---|---|---|---|---|
| 76.0 | 18422615475 | 8 | 2302826934.4 | 2287765279.5 | 2268521089 | 2368998982 | 37822090.5 | cusparse::spmm( * ) |
| 6.2 | 1508488502 | 102 | 14789103.0 | 3460175.0 | 48991 | 213567593 | 31727229.9 | volta_dgemm_64x64_tn |
| 4.7 | 1137174777 | 1 | 1137174777.0 | 1137174777.0 | 1137174777 | 1137174777 | 0.0 | volta_dgemm_128x64_tn |
| 4.0 | 966627174 | 44 | 21968799.4 | 6511025.5 | 1578352 | 164595260 | 34956297.7 | volta_dgemm_128x64_nn |
| 3.6 | 882372404 | 1 | 882372404.0 | 882372404.0 | 882372404 | 882372404 | 0.0 | volta_dgemm_128x64_tt |
| 2.1 | 516596067 | 28 | 18449859.5 | 14831409.5 | 14518870 | 39973379 | 8741168.7 | geqr2_gmem_domino |
| 0.5 | 110681488 | 1 | 110681488.0 | 110681488.0 | 110681488 | 110681488 | 0.0 | cupy_power__float64_float_float64 |
| 0.4 | 108940450 | 1 | 108940450.0 | 108940450.0 | 108940450 | 108940450 | 0.0 | gen_sequenced |
| 0.4 | 104990105 | 13 | 8076161.9 | 12932875.0 | 2720 | 16496130 | 7829506.4 | cupy_subtract__float64_float64_float64 |

Here we see extremely high utilization of the GPU: a significant number of warps active per cycle, fully loaded GR and SM, and high memory bandwidth.

| Metric | Description | Value | Avg | Avg Warps per Cycle |
|---|---|---|---|---|
| GR Active | Throughput % | 100.0 | | |
| Compute in Flight | Async Compute in Flight | 100.0 | | |
| Compute in Flight | Sync Compute in Flight | 0.0 | | |
| SM Active | Throughput % | 100.0 | | |
| SM Warp Occupancy | Compute Warps in Flight | 95.0 | 169,023,581 | 61 |
| SM Warp Occupancy | Unallocated Warps in Active SMs | 5.0 | 9,502,742 | 3 |
| DRAM Bandwidth | DRAM Read Bandwidth | 67.0 | | |
| DRAM Bandwidth | DRAM Write Bandwidth | 1.0 | | |

For the m=100000 case I was able to compute 512 highly accurate eigenvectors (relative residuals ~1e-12); the vectors beyond these decayed in accuracy, as is typical with block methods.

Appendices

Machine

I used a GCP instance for these measurements:

| Resource | Specification |
|---|---|
| Machine type | n1-standard-8 (8 vCPUs, 30 GB RAM) |
| CPU platform | Intel Haswell |
| GPUs | 1 × NVIDIA T4 |

Code

I have started collecting related functions into a GitHub repository, which also contains examples and other useful utilities not showcased here.

References

  1. Friedman, Joel. “A Proof of Alon’s Second Eigenvalue Conjecture and Related Problems.” arXiv:cs/0405020. Preprint, arXiv, May 5, 2004. https://doi.org/10.48550/arXiv.cs/0405020.
  2. Kahl, K., and H. Rittich. “The Deflated Conjugate Gradient Method: Convergence, Perturbation and Accuracy.” Linear Algebra and Its Applications 515 (February 2017): 111–29. https://doi.org/10.1016/j.laa.2016.10.027.
  3. Bai, Zhaojun, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst, eds. Templates for the Solution of Algebraic Eigenvalue Problems. Software, Environments and Tools. Society for Industrial and Applied Mathematics, 2000. https://doi.org/10.1137/1.9780898719581.