I introduce a new family of symmetric positive definite sparse matrices ("Nightmare matrices") \(A\) that defeats all known preconditioners, sparse direct solvers, and Krylov iterative methods for solving \(Ax = b\). I then analyze the performance of Krylov methods for these matrices on GPUs and show how the expander structure leads to uncoalesced memory access in cusparse::csrmv, causing warp stalls and poor GPU utilization per iteration. Next I demonstrate how deflation can improve GPU utilization by adding work the hardware handles efficiently while still accelerating convergence. Finally, I outline how to compute a deflation subspace that uses the GPU effectively and support these claims with performance-counter and profiler results.
These matrices are created in three phases. First, nonzeros are distributed randomly across the matrix (each with a value of 1) to match a prescribed number of nonzeros per row. The identity matrix is then added to ensure a nonzero diagonal, and the result is symmetrized. With high probability this creates an expander graph, which breaks many of the fill-reducing orderings used by sparse direct solvers. In the second phase the nonzero values (currently all 1s) are replaced with numbers uniformly distributed in [-1, 1], centering the eigenvalue distribution. Finally, I set \(A \leftarrow A^T A\). This increases the number of nonzeros, shrinks the smallest absolute eigenvalue, and concentrates the spectrum at the endpoints. Because the nonzero values are sampled randomly, the resulting eigenvalue distribution is centered near 0 and tapers off. This is a very poor (but not the absolute worst) distribution for unpreconditioned Krylov solvers; the worst case would be Chebyshev points, but it is not trivial to match an arbitrary spectrum to a fixed sparsity pattern.
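For concreteness, here is a minimal sketch of this construction in SciPy. It is my reading of the three phases above, not necessarily the exact generator used for the experiments:

```python
import numpy as np
import scipy.sparse as sp

def nightmare_matrix(n, nnz_per_row=8, seed=0):
    """Sketch of the three-phase Nightmare-matrix construction."""
    rng = np.random.default_rng(seed)

    # Phase 1: random pattern with ~nnz_per_row nonzeros per row (values 1),
    # plus the identity for a nonzero diagonal, then symmetrize the pattern.
    rows = np.repeat(np.arange(n), nnz_per_row)
    cols = rng.integers(0, n, size=n * nnz_per_row)
    B = sp.coo_matrix((np.ones(n * nnz_per_row), (rows, cols)), shape=(n, n)).tocsr()
    B = B + sp.identity(n, format="csr")
    B = (B + B.T).tocsr()

    # Phase 2: replace the nonzero values with uniform samples from [-1, 1],
    # roughly centering the eigenvalue distribution.
    B.data = rng.uniform(-1.0, 1.0, size=B.nnz)

    # Phase 3: A <- B^T B. This is symmetric positive (semi)definite, increases
    # the nonzero count, and pushes the smallest eigenvalues toward 0.
    return (B.T @ B).tocsr()
```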
I will investigate unpreconditioned Krylov methods in more detail later, but to illustrate the poor performance I ran Conjugate Gradients on a Nightmare matrix with m = 100000 rows on an NVIDIA T4 GPU, and it never really converged (note that I report relative error here rather than relative residual):
it: 175593 relative error = 0.005438385018617882
real 3m52.127s
user 3m44.893s
sys 0m8.932s
Since unpreconditioned Krylov methods fail, we might consider preconditioners or sparse direct solvers, but these approaches also fall short.
Expander graphs ensure that any two points in the underlying graph are connected by a short path. The graph can be quite sparse, yet in many important ways the corresponding matrix behaves like a dense one. In particular, sparse direct methods produce significant fill-in regardless of the fill-reducing ordering, as illustrated below with a banded matrix as a reference.
These figures show that sparse direct methods effectively require storing a fully dense version of \(A\) even though \(A\) is extremely sparse. For matrices with 100000 rows this is impossible on a GPU with limited memory. We therefore turn to iterative methods and preconditioning, but those options mostly fail as well.
“Sparsifying” preconditioners here means approaches that do not use the full matrix—for example incomplete LU, algebraic multigrid, or additive Schwarz/Block Jacobi. For Nightmare matrices these approaches fail to approximate the underlying system in important ways. One could design expander matrices where they succeed; for instance, Block Jacobi could work if off-diagonal blocks have low rank, even when the underlying graph is an expander. Nightmare matrices are specifically crafted to avoid such properties. I omit numerical results here to keep the post manageable, though I may explore them later.
In a nutshell, each of these techniques may offer some benefit, but not enough to outweigh the additional memory and computational burden it imposes.
Given the failure of sparse direct solvers, we turn to iterative methods. Most preconditioners offer little help, leaving unpreconditioned Krylov methods such as Conjugate Gradients. Here we encounter another awful property of Nightmare matrices: evaluating \(Ax\) for any \(x\) resembles pure random access, the worst possible memory-access pattern on the GPU.
Each step of Conjugate Gradient involves a single evaluation of \(Ax\), and this dominates the runtime. For a Nightmare matrix that work makes poor use of the GPU. I profiled a run with m = 100000 and roughly 60 nonzeros per row using cupy, which in this case mostly calls into cusparse. The table below, from an nsys profile, confirms this: about 70% of the time is spent in csrmv, the cusparse sparse matrix-vector product for CSR matrices.
Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
---|---|---|---|---|---|---|---|---|
70.5 | 525524967 | 1003 | 523953.1 | 482678.0 | 117693 | 757360 | 101194.5 | cusparse::csrmv_v3_kernel |
5.7 | 42286307 | 6999 | 6041.8 | 6431.0 | 3519 | 9440 | 1453.2 | cupy_multiply__float_float64_float64 |
4.7 | 35381985 | 5001 | 7075.0 | 6400.0 | 5375 | 11904 | 1741.4 | cub::DeviceReduceKernel |
4.4 | 32736008 | 6001 | 5455.1 | 5184.0 | 3809 | 87518 | 1639.6 | cupy_subtract__float64_float64_float64 |
3.4 | 25017723 | 5001 | 5002.5 | 4544.0 | 3711 | 8800 | 1205.1 | cupy_multiply__float64_float64_float64 |
2.2 | 16511519 | 5001 | 3301.6 | 3008.0 | 2527 | 5600 | 799.0 | cub::DeviceReduceSingleTileKernel |
1.6 | 11606399 | 1003 | 11571.7 | 11073.0 | 8544 | 16160 | 1566.6 | cusparse::csrmv_v3_partition_kernel |
1.1 | 8207993 | 1000 | 8208.0 | 8000.0 | 6880 | 10336 | 689.7 | cupy_add__float64_float64_float64 |
1.0 | 7301007 | 3001 | 2432.9 | 2176.0 | 1919 | 3680 | 544.3 | cupy_sqrt__float64_float64 |
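For reference, the loop being profiled is essentially textbook unpreconditioned CG, along these lines (a minimal CuPy sketch, assuming `A` is already a `cupyx.scipy.sparse` CSR matrix on the device; the actual run reports relative error against a known solution rather than the residual monitored here):

```python
import cupy as cp

def cg(A, b, tol=1e-8, maxiter=200000):
    """Unpreconditioned Conjugate Gradients. Each iteration performs exactly
    one sparse matrix-vector product, i.e. the csrmv calls that dominate the
    table above."""
    x = cp.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    bnorm = float(cp.linalg.norm(b))
    for it in range(maxiter):
        Ap = A @ p                       # the lone csrmv per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if float(rs_new) ** 0.5 <= tol * bnorm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, it
```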
The problem with csrmv is that the matrix comes from an expander graph, so access is nearly random. This leads to significant uncoalesced memory traffic and warp stalls on the GPU, as shown by the performance counters below.
Metric | Throughput % | Avg | Avg Warps per Cycle |
---|---|---|---|
Compute Warps in Flight | 21.0 | 43,237,130 | 14 |
Unallocated Warps in Active SMs | 20.0 | 40,177,614 | 13 |
Vertex/Tess/Geometry Warps in Flight | 0.0 | 0 | 0 |
Pixel Warps in Flight | 0.0 | 0 | 0 |
DRAM Read Bandwidth | 21.0 | — | — |
DRAM Write Bandwidth | 2.0 | — | — |
GR Active | 47.0 | — | — |
Async Compute in Flight | 45.0 | — | — |
Sync Compute in Flight | 0.0 | — | — |
The profile shows poor DRAM bandwidth (21% utilization) and few warps in flight because uncoalesced reads cause stalls. In the next section I show how to recover GPU utilization by doing more work per iteration, but choosing work the GPU executes efficiently.
Deflated methods seek an orthonormal matrix \(V\) that projects out the components slowing Krylov convergence. The challenge is computing such a \(V\) and integrating it into a Krylov method; different methods incorporate \(V\) in slightly different ways. The scheme below is due to Kahl and Rittich.
For \( A x = b \), let \( V \in \mathbb{R}^{n \times k} \) have orthonormal columns and define
\[ M = V^\top A V. \]
The deflated system is
\[ (I - A V M^{-1} V^\top)A x_h = (I - A V M^{-1} V^\top)b. \]
After solving for \( x_h \), the reconstruction is
\[ x = V M^{-1} V^\top b + \bigl( x_h - V M^{-1} V^\top A x_h \bigr). \]
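To make the scheme concrete, here is a minimal CuPy sketch of how the deflated operator, right-hand side, and reconstruction can be formed (illustrative only; \(V\) is assumed to have orthonormal columns and \(k\) to be small enough that the explicit inverse of \(M\) is harmless):

```python
import cupy as cp

def deflate(A, V, b):
    """Return (apply, b_defl, recover) for the deflated system
    (I - A V M^{-1} V^T) A x_h = (I - A V M^{-1} V^T) b with M = V^T A V."""
    AV = A @ V                         # n x k, one sparse matmat
    M_inv = cp.linalg.inv(V.T @ AV)    # k x k, cheap for modest k

    def project(y):                    # y -> (I - A V M^{-1} V^T) y
        return y - AV @ (M_inv @ (V.T @ y))

    def apply(x):                      # x -> (I - A V M^{-1} V^T) A x
        return project(A @ x)

    def recover(x_h):                  # x = V M^{-1} V^T b + (x_h - V M^{-1} V^T A x_h)
        return V @ (M_inv @ (V.T @ b)) + x_h - V @ (M_inv @ (V.T @ (A @ x_h)))

    return apply, project(b), recover
```

Deflated CG is then just ordinary CG run with `apply` and the projected right-hand side, followed by a single call to `recover`.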
I choose \(V\) to be the matrix of eigenvectors associated with the smallest-magnitude eigenvalues of \(A\) and postpone discussion of how to compute it until the next section. This adds work per iteration on top of evaluating \(Av\), but the extra work is productive for two reasons:

1. The additional operations are dense matrix-vector products, which the GPU executes efficiently, unlike the uncoalesced csrmv.
2. Deflation removes the slowly converging part of the spectrum, so far fewer iterations are needed to reach a given error.
And indeed we see point (1) plainly in the trace, with aggregates below. Almost 80% of the time goes to gemv-style level-2 BLAS kernels, and the share spent in csrmv drops from about 70% to roughly 9%. The csrmv call itself is unchanged; each iteration now includes additional work that uses the GPU efficiently, and far fewer iterations are needed to reach the same error (quantified later, after the GPU efficiency discussion).
Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
---|---|---|---|---|---|---|---|---|
47.0 | 9334022559 | 5800 | 1609314.2 | 1609599.0 | 1601119 | 1617375 | 2385.3 | gemv2N_kernel |
30.9 | 6138250408 | 3867 | 1587341.7 | 1587103.0 | 1580799 | 1599136 | 2410.3 | gemv2T_kernel_val |
8.5 | 1683961456 | 3868 | 435357.1 | 435542.5 | 116638 | 754257 | 11635.0 | cusparse::csrmv_v3_kernel |
5.7 | 1134962046 | 1 | 1134962046 | 1134962046 | 1134962046 | 1134962046 | 0.0 | cusparse::load_balancing_kernel |
1.8 | 357572186 | 3867 | 92467.6 | 92510.0 | 74046 | 117502 | 3270.0 | trsv_ln_exec_up |
1.8 | 357424521 | 3867 | 92429.4 | 92350.0 | 74655 | 117278 | 3253.3 | trsv_ln_exec |
1.5 | 288257525 | 1 | 288257525 | 288257525 | 288257525 | 288257525 | 0.0 | volta_dgemm_64x64_tn |
0.7 | 144908215 | 17405 | 8325.7 | 2240.0 | 1248 | 11954831 | 91079.1 | cupy_copy |
0.4 | 80594963 | 15460 | 5213.1 | 4912.0 | 2976 | 87775 | 1761.2 | cupy_subtract |
Our utilization metrics improve markedly. DRAM bandwidth nearly triples, and many more warps are active because memory access patterns become largely coalesced.
Metric | Throughput % | Avg | Avg Warps per Cycle |
---|---|---|---|
Compute Warps in Flight | 71.0 | 142,502,227 | 46 |
Unallocated Warps in Active SMs | 18.0 | 36,210,158 | 12 |
Vertex/Tess/Geometry Warps in Flight | 0.0 | 0 | 0 |
Pixel Warps in Flight | 0.0 | 0 | 0 |
DRAM Read Bandwidth | 58.0 | — | — |
DRAM Write Bandwidth | 1.0 | — | — |
GR Active | 91.0 | — | — |
Async Compute in Flight | 91.0 | — | — |
Sync Compute in Flight | 0.0 | — | — |
And just for ease of comparison, I include both runs, with and without deflation, below:
Metric | Run 1 (Deflated CG, large deflation space) | Run 2 (CG on expander matrix, no deflation) |
---|---|---|
Compute Warps in Flight | 71% (142M, 46 warps/cycle) | 21% (43M, 14 warps/cycle) |
Unallocated Warps in Active SMs | 18% (36M, 12 warps/cycle) | 20% (40M, 13 warps/cycle) |
Vertex/Tess/Geometry Warps in Flight | 0% | 0% |
Pixel Warps in Flight | 0% | 0% |
DRAM Read Bandwidth | 58% | 21% |
DRAM Write Bandwidth | 1% | 2% |
GR Active | 91% | 47% |
Async Compute in Flight | 91% | 45% |
Sync Compute in Flight | 0% | 0% |
And finally, the payoff we actually care about: much lower error in significantly less wall time:
Run | Iterations | Final Error | Runtime (s) | Iterations / s | log10(Error) / Iteration |
---|---|---|---|---|---|
k = 0 (no deflation) | 175,593 | 5.44e-03 | 232.1 | ~757 | -1.28e-04 |
k = 1024 (deflation) | 1,931 | 3.59e-06 | 28.2 | ~68.5 | -1.70e-03 |
The results from deflated Conjugate Gradient are impressive, but of course they require computing a suitable \( V \). In a way this takes us back to the beginning, because now we need an eigensolver capable of producing the smallest-magnitude eigenpairs of \( A \), which is a Nightmare matrix for which most solution techniques are a no-go. The gold-standard Krylov eigensolvers are Arnoldi and Lanczos iterations, but for these to resolve the small-magnitude eigenvalues adequately we must proceed sequentially, one vector at a time, because these methods maintain orthogonality against previous iterates. I have already established that evaluating \(Ax\) for single vectors uses the GPU poorly. The solution is a block method that advances multiple eigenvector estimates simultaneously. This brings us back to coalesced access: evaluating \(AX\) for a matrix \(X\), especially if \( X \) is stored row-major, lets each nonzero of \( A \) pull in a contiguous block of data.
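A quick way to see the effect on your own hardware is to time \(Av\) against \(AX\) directly (a rough CuPy sketch using CUDA events; CuPy dispatches the block product to cusparse SpMM, the kernel that shows up in the trace later):

```python
import cupy as cp

def matvec_vs_matmat(A, k=64, reps=10):
    """Rough per-vector comparison of A @ x (csrmv) versus A @ X (spmm).
    Purely illustrative; the numbers depend on the GPU and the matrix."""
    n = A.shape[0]
    x = cp.random.rand(n)
    X = cp.random.rand(n, k)

    def time_ms(f):
        start, end = cp.cuda.Event(), cp.cuda.Event()
        f()                                   # warm-up
        start.record()
        for _ in range(reps):
            f()
        end.record()
        end.synchronize()
        return cp.cuda.get_elapsed_time(start, end) / reps

    t_vec = time_ms(lambda: A @ x)
    t_blk = time_ms(lambda: A @ X)
    print(f"single matvec: {t_vec:.2f} ms, block of {k}: {t_blk:.2f} ms "
          f"({t_blk / (k * t_vec):.1%} of {k} separate matvecs)")
```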
To build such a block eigensolver, I make use of the fact that a symmetric matrix can be diagonalized with orthonormal eigenvectors \(W\):
\[ A = W \Lambda W^T \]
and this decomposition shows us that all polynomials \( P \) satisfy
\[ P(A) = W P(\Lambda) W^T \]
This means that if we choose \( P \) so that \( P(x) \approx \frac{1}{x} \), then we can magnify the small eigenvalues. This is basically a restatement of the underlying principle of a Krylov solver, but here the polynomial can be computed statically, in advance. That is the idea behind Chebyshev iteration.
For an SPD \(A\) with \(\operatorname{spec}(A)\subset[\lambda_{\min},\lambda_{\max}]\), set \[ \mu = \tfrac{\lambda_{\max}+\lambda_{\min}}{2}, \qquad \nu = \tfrac{\lambda_{\max}-\lambda_{\min}}{2}. \]
Initialize \(x^{(0)}\) and \(r^{(0)} = b - A x^{(0)}\). Let \(\alpha_0 = \tfrac{1}{\mu},\; \beta_0 = 0\) and form \[ x^{(1)} = x^{(0)} + \alpha_0 r^{(0)}. \]
For \(k \ge 1\), with \(r^{(k)} = b - A x^{(k)}\), \[ \beta_k = \left(\frac{\nu \,\alpha_{k-1}}{2}\right)^2, \qquad \alpha_k = \frac{1}{\mu - \dfrac{\beta_k}{\alpha_{k-1}}}, \qquad x^{(k+1)} = x^{(k)} + \alpha_k r^{(k)} + \beta_k \big(x^{(k)} - x^{(k-1)}\big). \]
The advantage of Chebyshev iteration in this context, compared to just using another Krylov method, is that it is simple to advance multiple vectors simultaneously: the polynomial coefficients are exactly the same for every vector (as opposed to, say, Conjugate Gradient, which computes its coefficients from dot products of previous iterates and therefore produces different coefficients for each vector).
Here is the pseudocode for computing \( V \) with all of the above observations in mind:
# Chebyshev–Rayleigh–Ritz for k smallest eigenpairs
# Assume a routine:
# Y = ChebyshevSolve(A, RHS, inner, λ_min, λ_max)
# which does 'inner' Chebyshev iterations to approximately solve A·Y = RHS.
Inputs:
A # symmetric SPD n×n operator (matvec/matmat supported)
k # number of eigenpairs
outer # outer RR iterations
inner # inner Chebyshev iterations
λ_min, λ_max # spectral bounds
Procedure:
V ← RandomOrthonormal(n, k) # e.g., QR of random matrix
Θ ← undefined
for t = 1..outer:
# 1) Block Chebyshev solve: approximate A^{-1}·V
W ← ChebyshevSolve(A, V, inner, λ_min, λ_max) # n×k
# 2) Orthonormalize subspace
Q, _ ← qr(W) # n×k with QᵀQ = I
# 3) Rayleigh–Ritz on span(Q)
T ← Qᵀ A Q # k×k
S, diag(Θ) ← eig(T) # Θ ascending
V ← Q S # n×k (orthonormal)
# (optional) stop if max_j ||A v_j − Θ_j v_j|| ≤ tol
end for
Outputs:
V # Ritz vectors (≈ k smallest eigenvectors)
Θ # Ritz values (≈ k smallest eigenvalues)
I also provide code in the appendices.
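The ChebyshevSolve routine assumed above is a direct block transcription of the recurrence from the previous section; here is a minimal CuPy sketch (the appendix code may differ in details):

```python
import cupy as cp

def chebyshev_solve(A, RHS, inner, lam_min, lam_max):
    """Run 'inner' Chebyshev iterations to approximately solve A @ Y = RHS,
    where RHS is an n-by-k block. Every column sees exactly the same
    coefficients, so the whole block advances with one SpMM per iteration."""
    mu = 0.5 * (lam_max + lam_min)
    nu = 0.5 * (lam_max - lam_min)

    X_prev = cp.zeros_like(RHS)          # x^(0) = 0
    R = RHS.copy()                       # r^(0) = b - A x^(0) = b
    alpha = 1.0 / mu
    X = X_prev + alpha * R               # x^(1)

    for _ in range(1, inner):
        R = RHS - A @ X                  # r^(k): the one sparse matmat (spmm)
        beta = (nu * alpha / 2.0) ** 2
        alpha = 1.0 / (mu - beta / alpha)
        X, X_prev = X + alpha * R + beta * (X - X_prev), X
    return X
```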
From the trace we can see significant time in cusparse::spmm (note the asterisk in the table: I renamed this from the cusparse symbol, which had a lot of template magic in it) and subsequent calls to DGEMM, which is level-3 BLAS, absolutely the best place to be on a GPU.
Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
---|---|---|---|---|---|---|---|---|
76.0 | 18422615475 | 8 | 2302826934.4 | 2287765279.5 | 2268521089 | 2368998982 | 37822090.5 | cusparse::spmm( * ) |
6.2 | 1508488502 | 102 | 14789103.0 | 3460175.0 | 48991 | 213567593 | 31727229.9 | volta_dgemm_64x64_tn |
4.7 | 1137174777 | 1 | 1137174777.0 | 1137174777.0 | 1137174777 | 1137174777 | 0.0 | volta_dgemm_128x64_tn |
4.0 | 966627174 | 44 | 21968799.4 | 6511025.5 | 1578352 | 164595260 | 34956297.7 | volta_dgemm_128x64_nn |
3.6 | 882372404 | 1 | 882372404.0 | 882372404.0 | 882372404 | 882372404 | 0.0 | volta_dgemm_128x64_tt |
2.1 | 516596067 | 28 | 18449859.5 | 14831409.5 | 14518870 | 39973379 | 8741168.7 | geqr2_gmem_domino |
0.5 | 110681488 | 1 | 110681488.0 | 110681488.0 | 110681488 | 110681488 | 0.0 | cupy_power__float64_float_float64 |
0.4 | 108940450 | 1 | 108940450.0 | 108940450.0 | 108940450 | 108940450 | 0.0 | gen_sequenced |
0.4 | 104990105 | 13 | 8076161.9 | 12932875.0 | 2720 | 16496130 | 7829506.4 | cupy_subtract__float64_float64_float64 |
Here we see extremely high utilization of the GPU, with a large number of warps active per cycle, fully loaded GR and SM, and high memory bandwidth.
Metric | Throughput % | Avg | Avg Warps per Cycle |
---|---|---|---|
GR Active | 100.0 | — | — |
Async Compute in Flight | 100.0 | — | — |
Sync Compute in Flight | 0.0 | — | — |
SM Active | 100.0 | — | — |
Compute Warps in Flight | 95.0 | 169,023,581 | 61 |
Unallocated Warps in Active SMs | 5.0 | 9,502,742 | 3 |
DRAM Read Bandwidth | 67.0 | — | — |
DRAM Write Bandwidth | 1.0 | — | — |
For the deflated case with m = 100000, I was able to compute 512 highly accurate eigenvectors (relative residuals around 1e-12); the accuracy of the remaining vectors decayed after that, as is typical of block methods.
I used a GCP instance for these measurements:
Resource | Specification |
---|---|
Machine type | n1-standard-8 (8 vCPUs, 30 GB RAM) |
CPU platform | Intel Haswell |
GPUs | 1 × NVIDIA T4 |
I have started collecting many of these related functions into a GitHub repository, which also contains examples and other useful functions not showcased here.