Int8-Quantized Sparse QR Factorization

Here I preview RatchesQR, a low-memory sparse QR factorization-based preconditioner that achieves rapid preconditioned GMRES convergence on highly nonsymmetric and indefinite sparse matrices with a static execution graph and no numerical pivoting. I have been exploring ways to use bytes “most efficiently” when solving challenging sparse linear systems, particularly systems that have historically resisted size reduction techniques, e.g. nonsymmetric indefinite systems. For such systems the normal equations are not viable: they lead to pathological spectral distributions and inadequate convergence. Restarted GMRES fails except for restart parameters bordering on requiring the storage of a full dense matrix. Incomplete LU (ILU), multigrid, and domain decomposition fail except when parameters are chosen that ultimately result in storage bordering on a complete (sparse) LU factorization. In terms of “bytes efficiency” it has proven incredibly challenging to “beat” a plain sparse LU factorization with partial pivoting such as SuperLU, and doing so has required considerable brute-force search over parameter spaces of such preconditioners, with the vast majority of them simply failing due to small pivots, solver breakdown, or other numerical issues. Before describing the preconditioner in too much detail, I want to set the stage for why it should exist.

Problem Statement: Nonsymmetric and Indefinite Sparse Matrices

In this section I illustrate the problem of solving nonsymmetric and indefinite systems iteratively. Systems possessing this difficulty may be generated quite simply in Python, for example with a stencil problem such as:

import numpy as np
import scipy.sparse as sp
rng = np.random.default_rng(0)
mx=64
my=64
m=mx*my
bands = [-mx,-1,0,1,mx] # A 2D stencil pattern
A = sp.diag([rng.uniform(-1,1,size=m) for _ in bands],bands,shape=(m,m))

This system is nonsymmetric, indefinite, and will prove very challenging to known iterative methods in spite of a mild condition number.

Symmetric indefinite sparse linear systems are known to cause difficulty for iterative methods; a classic survey illustrating the problem for the Helmholtz equation by Ernst and Gander [1] can be found in the references below. In that review they illustrate a wide variety of iterative and preconditioning techniques with nonconvergence and pathological behavior such as increasing a preconditioner’s apparent fidelity (e.g. increasing drop tolerance in ILU) leading to worse convergence. If, in addition to indefiniteness, we have a highly nonsymmetric system, it significantly worsens our prospects at solving it because short-term recurrence Krylov methods such as Paige and Saunders’ minres [3] will not work. This leads instead to algorithms like restarted GMRES [4], or Conjugate Gradients applied to the normal equations (CGN). Here we must contend with non-normality (in the GMRES case) or a squared condition number with spectral clustering at endpoints (CGN). The non-normality point leads to a very rich theory which is outside the scope of this post, but Trefethen and Embree [1] have an excellent book on the subject.

I will illustrate below the difficulty of solving such a system iteratively by applying a variety of approaches. Note that it is usually possible if one is aggressive enough with preconditioning to solve such systems iteratively, but typically what happens is that the total bytes consumed for the preconditioned iterative method exceeds that of a standard sparse factorization such as sparse LU factorization, in which case it becomes harder to justify using the iterative method instead of simply factorizing the system.

Difficulty with Unpreconditioned Iterative Methods

If we decide to store only the matrix \(A \) as well as some intermediate data for the iterative methods, we find that nonsymmetric and indefinite systems simply do not converge in a reasonable amount of time (CGN) or result in explosive memory requirements to achieve convergence at all (restarted GMRES).

Thus some preconditioner is necessary but most available black-box preconditioners will not work here.

Difficulty with Incomplete Factorizations

The incomplete LU factorization in this case will suffer from singular subsystems and near-zero pivots. Eventually it works, but only once it has started to consume roughly as much memory as a standard sparse LU factorization. This is especially true once you start accounting for the Krylov subspace you also have to store for restarted GMRES.

I also show a few ILU parameter settings that “fix” the memory cost at 100% of a sparse LU factorization equivalent.

Difficulty with Multigrid

Multigrid is another method that compresses the problem by passing through to “coarse spaces”. Multigrid works best when an invariant subspace has some known compressibility with a “code” that is known in advance. For example the smallest eigenmodes might be very “graph smooth” like in the graph Laplacian case, meaning interpolatory coarsening operators approximate this subspace very well.

For nonsymmetric and indefinite systems (and in particular the kind that I test against here) this is just not the case.

Difficulty with Domain Decomposition

As an alternative to incomplete factorizations or global coarsening, we can partition the problem and solve the subproblems as a preconditioner, discarding off-diagonal interactions as an approximation. However, this faces much the same problem as pure ILU preconditioning: we often get singular factors which cause the preconditioner to break down, or once we have found parameters which work, it proves little better than simply LU factorizing.

And I do a similar experiment as with ILU: choosing parameters so that our effective bytes utilization is equivalent to a full sparse LU factorization:

A QR-based Preconditioner with Quantization and Thresholding

The preconditioner I propose here is a quantized left-looking Householder-based QR factorization. Left-looking in this context means that rather than applying the current Householder factor to all future columns, at which point we are “done” with that reflector and don’t need it later in the factorization algorithm, we instead “look left” and apply all reflectors to the current column. The left-looking approach means we need to persist Householder reflectors until the end of the algorithm, but it also means that once we have completed a column of R it is never consumed again. For the purpose of a quantized and approximate factorization left-looking factorizations appear to do better because R is much more sensitive to errors than the reflectors. This is because triangular solves are highly path dependent and small errors get magnified. It is important that R is consistent with the applied reflectors, so the reflectors can be highly approximate and roughly encode their directions so long as R accurately captures that approximation. In the left-looking approach we apply approximate reflectors to exact R and quantize a panel of R only once, without needing to dequantize it except at the solve phase.

Just for reference the dense left-looking Householder algorithm looks as follows in pythonish pseudocode:

def dense_left_looking_householder_qr(A):
    """
    Dense left-looking Householder QR.

    Previous reflectors are applied only when a column is reached. No trailing
    matrix update is pushed eagerly to the right.
    """
    m, n = A.shape
    R = zeros((n, n))
    V = []
    tau = []

    for j in range(n):
        col = A[:, j].copy()

        # Look left: bring column j up to date using earlier reflectors.
        for k in range(j):
            col[k:] = apply_householder_left(V[k], tau[k], col[k:])

        R[:j, j] = col[:j]

        # Factor the active tail. The leading entry of v is implicit 1.
        v_j, tau_j, r_jj = householder(col[j:])
        V.append(v_j)
        tau.append(tau_j)
        R[j, j] = r_jj

    return V, tau, R

This is effectively the dense natural-order factorization. However, for the sparse case we can use a fill-reducing order and an elimination tree, which does two things:

If we further augment this with supernodes, then we can recover dense level-3 BLAS performance characteristics.

I give an overview next of how I did this, but I omit some details and focus on the high-level idea.

Supernodal Sparse QR Factorization

The steps for minimizing fill-in and exposing a task graph for a sparse QR factorization are as follows:

Most operations are done in terms of supernodes, which allows us to turn otherwise sparse operations into dense BLAS or LAPACK, but supernodes also significantly compress index information. Otherwise we might have to store a full sparsity pattern for the factors, but supernodes allow us to store structural information only once per supernode and numerical data can be stored contiguously. For “exact” supernodes this results in no additional numerical zeros, but one can relax this to allow some introduction of zeros in exchange for less index storage and better utilization of BLAS and LAPACK. I do not enforce exact supernodal structure here.

After computing this information, the left-looking factorization looks in pseudocode like below:

def supernodal_left_looking_qr(A, ordering, etree, supernodes, panel_width):
    """
    Sparse left-looking QR after reordering and supernode detection.

    The details of row/column maps are hidden behind gather/scatter helpers.
    """
    A = symmetrically_permute(A, ordering)
    symbolic = build_symbolic_patterns(A, etree, supernodes)
    factor = allocate_factor(symbolic, panel_width)

    for sn in children_before_parent_order(etree):
        rows = symbolic.v_rows(sn)
        cols = supernodes.columns(sn)

        # Gather the dense front for this supernode.
        front = gather_original_entries(A, rows, cols)

        # Look left: apply completed contributors touching this front.
        for prev in symbolic.left_looking_contributors(sn):
            block = gather_factor_block(factor, prev, rows, cols)
            apply_stored_qt(block, front)

        # Factor the supernode in panels to limit dense work size.
        for panel in split_columns(cols, panel_width):
            active_rows = symbolic.active_rows(sn, panel)
            panel_view = front[active_rows, panel]

            V_panel, tau_panel, R_panel = dense_geqrf(panel_view)
            store_householder_panel(factor, sn, panel, active_rows, V_panel, tau_panel)
            store_r_panel(factor, sn, panel, R_panel)

            rest = columns_after(panel, cols)
            apply_panel_qt(V_panel, tau_panel, front[active_rows, rest])

        scatter_front_to_factor(factor, sn, front)

    return factor

Since this is a QR factorization, the need for numerical pivoting is fairly light and the resulting task graph is fully static with very few surprises. I take advantage of this for RatchesQR to achieve good CPU utilization even with unbalanced trees and supernode sizes.

Tree and DAG Parallelism

In addition to minimizing fill, the reordering done in the symbolic phase also promotes parallelism. Supernodal columns of R can be processed once all of their predecessor columns have been processed (meaning their Householder reflectors are also available). Since each level of the supernodal tree has many independent supernodes, these can be processed in parallel. This parallelism collapses towards the top of the tree, but supernodes become larger, which means parallelism can move into dense linear algebra such as BLAS and LAPACK. To make this transition seamless I use oneTBB in conjunction with TBB-enabled MKL.

Note that doing triangular solves with R and applications of Q require some different considerations for parallelism, but I omit those details for now; they are readily available in the code.

Dividing Supernodes into Panels

Some supernodes can be quite large (e.g. separators computed from 3D grids) and use a large amount of memory. I also include a planning phase after symbolic factorization which further subdivides supernodes into panels based on a memory budget provided by the user. This panelization works both for the quantized and unquantized code paths. This essentially sets a limit on the high watermark of memory needed to produce the factorization at the cost of suboptimal dense linear algebra.

Quantized Sparse QR Factorization Preconditioner

The above factorization describes a sparse supernodal QR factorization. We can turn this into a quantized factorization by inserting a dequantize step when gathering a panel and a quantize step when scattering a panel. The trick, however, is dealing with high dynamic range in both the R and the V (householder) factors. This turns out to be achievable.

The R Factor

In the left-looking formulation our R factor is less sensitive to accumulation effects because all of its accumulations occur in working precision. However, it can still have a high dynamic range. To balance the need for global connectivity in R (correct accounting for supernodal elimination tree dependencies) while still allowing for some numerical dropping, I segment each column of R by supernode. Each column gets a list of exponents and is quantized separately. This guarantees that the dominant values in every supernode are preserved independently, and the smallest relative values are discarded. The exception here is the diagonal of R, which gets stored in full working precision. Because of this thresholding, we see compression rates well above what the quantization would directly imply. For example, quantizing from float64 to int8 has resulted in compression rates of 90x (compared to a QR factorization on the same column ordering) because of dropped entries outside of our representable dynamic range.

I illustrate a before/after quantization for a single column of R below, using the elimination tree to help visualize the dependence of this column on its predecessor supernodes:

While R proves very tolerant of accumulation effects in the left-looking formulation, those effects get moved instead to the Householder factor, which requires a bit more care. I discuss this next.

The Householder Factor: Exponent Bins

In the quantized QR factorization, the Householder reflectors are stored in the usual compact form:

The leading entry of \(v _ j\) is implicit and equal to \(1\), and we only store the sparse tail of \(v_j\).

The hard part is that a single int8 scale for the whole tail can be too crude. The largest entry determines the exponent, and smaller but directionally important entries can round to zero. The supernodal fix for R does not work here because supernodes are calculated based on a column ordering of the input matrix \( A \). If \( A \) is not structurally symmetric these supernodes have no meaning in the row space, which is where the Householder reflectors live.

The naive fix is to store more than one exponent per reflector and attach a flag to every value saying which exponent it uses. That works, but it adds per-entry metadata.

For each reflector tail, we choose a small number of exponents. In the current useful mode, that number is two:

Each tail entry is assigned to the exponent that gives the lower quantization error. Then we store the entries for each exponent in a separate contiguous bin.

row_ids:    [ primary rows ... ][ secondary rows ... ]
mantissas:  [ primary vals ... ][ secondary vals ... ]

begin = col_offsets[j]
split = split_offsets[j]
end   = col_offsets[j + 1]

primary bin:   [begin, split)
secondary bin: [split, end)

No per-entry exponent flags are needed, and we can merge-iterate the bins. This gives us the main benefit of multiple scales without paying for per-value tags. For two-scale V, the result is still basically an int8 reflector representation, but small reflector entries no longer have to compete directly with the largest entry for the same exponent. That is why it helps preserve the reflector direction, and therefore the action of Q, while keeping the storage compact.

I’m still on the fence about this merge-iterable approach to segmenting \( V \) versus a bitset-of-flags approach. Neither option maps to an ideal computational pattern. Merge iteration can become expensive as the number of exponents grows beyond 2 here, but with bitset flags we add additional storage for every element and have to contend with alignment issues if we pack it to save space. It also entails a branch for every value, but this is possibly the lesser consideration because we generally are just streaming values out to be dequantized.

A Brief Note on Normality and Left/Right Preconditioning

It is usually not common to left+right precondition a nonsymmetric linear system. Left and right preconditioning is often employed to retain symmetry; for example, one could do \( L^{-T} A L^{-1} \) for a symmetric matrix \( A \) and incomplete Cholesky factor \(L\).

In the specific case of a QR-based preconditioner, though, it appears advantageous to precondition as \(Q^T A R^{-1} \) as opposed to \(A R^{-1}Q^T \) or \( R^{-1}Q^T \), especially when using this for GMRES. The simple explanation for this is that if

Note that \( Q \) is not fully orthogonal because of its approximate representation, so the RHS of this (approximate) equation is not the identity in general.

Thus this “almost symmetric” property makes the system “almost normal”, which helps GMRES-like Krylov methods because they do not invoke the transpose action of \(A \).

Results

Here I compare an extensive hyperparameter sweep for incomplete LU (ILU) preconditioned GMRES versus int8-quantized QR factorization on a 3D nonsymmetric indefinite system generated like so:

import numpy as np
import scipy.sparse as sp
rng = np.random.default_rng(0)
mx=32
my=32
mz=32
m=mx*my*mz
bands = [-mx*my,-mx,-1,0,1,mx,mx*my] # A 3D stencil pattern
A = sp.diag([rng.uniform(-1,1,size=m) for _ in bands],bands,shape=(m,m))

Topline Table

I share a quick summary table before diving deep into the full parameter exploration.

Full Table (Including ILU failures)

The table shows a sharp robustness/memory tradeoff between the methods. SuperLU full LU succeeds as expected, reaching a residual of 5.57e-12, but at a large factor storage of 283.35 MiB in values and 425.52 MiB in concrete factor storage. By contrast, the two-scale Q8/R8 QR preconditioner reaches a residual of 9.91e-09 using only 37.77 MiB of factor values, or 54.02 MiB including Krylov storage. Relative to full LU, this is about 7.5× less factor-value memory, 10.9× less concrete factor memory, and 5.25× less values+Krylov memory, while still converging to the requested tolerance.

The ILU results are more brittle. Small and moderate ILU fills fail across the tested drop tolerances, often with exploding residuals, NaN, or immediate stagnation at residual 1.00e+00. Even increasing fill to 20 or 40 with MMD ordering does not make ILU reliable. The first clearly successful ILU cases appear only at high fill, specifically MMD fill=80/160 with drop=1e-4 or drop=1e-6. Those runs converge in fewer Krylov iterations than QR, but they require about 120 MiB of factor values and 136 MiB including Krylov storage. So the convergent ILU configurations are much stronger per iteration, but they are also much denser: about 3.18× larger than QR in factor values and 2.52× larger in values plus Krylov storage.

The important point is not just that QR converges. It is that QR converges in the memory regime where ILU mostly fails. For example, ILU MMD fill=20 has factor values around 32.65 MiB, close to QR’s 37.77 MiB, and values+Krylov around 48.90 MiB, close to QR’s 54.02 MiB, but all those ILU runs fail. At fill=40, ILU is already above QR’s memory footprint, with 64.98 MiB factor values and 81.23 MiB values+Krylov, yet it still fails in the tested cases. ILU only converges once the factor grows to about 120 MiB, at which point it is no longer in the same memory class as the QR preconditioner.

Discussion

The above is a preview of a preconditioner concept, and the code remains untuned. While it does parallelize reasonably well, I have observed some loss of scalability on higher core count systems which I’m still debugging. Furthermore, this is based on a sparse QR factorization, which means the symbolic analysis happens on the graph of \( A^T A \) rather than \( A + A^T \). This means much larger supernodes. LU typically runs on the graph of \(A + A^T \), which results in smaller supernodes, but in practice for indefinite and nonsymmetric systems the need for numerical pivoting can bring it back into the territory of a QR factorization.

All this is to say that while the preconditioner I propose here does seem to achieve my aim of “bytes efficiency”, it has moved the problem from high-watermark memory storage into dense linear algebra computation. For some compute environments (e.g. GPU) this could be a very favorable tradeoff, but I haven’t yet explored this deeply.

Elimination Tree	Spy Plot	Explanation
		Original matrix order; etree is just one big chain
		Nested dissection reordered and postordered
		Supernode detection groups compatible scalar columns
		Scalar etree contracted into the supernodal etree

comparison	QR memory advantage
vs SuperLU, factor values	7.50× smaller
vs SuperLU, concrete factor storage	10.87× smaller
vs SuperLU, values + Krylov	5.25× smaller
vs converged ILU MMD fill=80/160 drop=1e-4, factor values	3.17× smaller
vs converged ILU MMD fill=80/160 drop=1e-4, concrete factor storage	4.61× smaller
vs converged ILU MMD fill=80/160 drop=1e-4, values + Krylov	2.52× smaller
vs converged ILU MMD fill=80/160 drop=1e-6, values + Krylov	2.52× smaller

method/config	status	residual	iters	factor values MiB	factor concrete MiB	Krylov MiB	values+Krylov MiB
QR two-scale Q8/R8	converged	`9.91e-09`	319	37.77	39.15	16.25	54.02
SuperLU full LU	direct	`5.57e-12`	1	283.35	425.52	0.00	283.35
ILU fill=2 drop=1e-5	failed	`2.06e+12`	15	3.15	5.22	16.25	19.40
ILU fill=2 drop=1e-6	failed	`9.93e+17`	10	3.15	5.22	16.25	19.40
ILU fill=2 drop=1e-8	failed	`1.58e+09`	28	3.15	5.23	16.25	19.40
ILU fill=2 drop=1e-3	failed	`1.56e+20`	13	3.15	5.23	16.25	19.40
ILU fill=2 drop=1e-4	failed	`2.10e+19`	17	3.15	5.23	16.25	19.40
ILU fill=2 drop=1e-2	failed	`2.62e+15`	11	3.15	5.23	16.25	19.40
ILU fill=5 drop=1e-6	failed	`1.46e+16`	25	7.59	11.89	16.25	23.84
ILU fill=5 drop=1e-3	failed	`7.44e+16`	9	7.59	11.89	16.25	23.84
ILU fill=5 drop=1e-8	failed	`3.26e+23`	5	7.59	11.89	16.25	23.84
ILU fill=5 drop=1e-4	failed	`1.65e+09`	12	7.59	11.89	16.25	23.84
ILU fill=5 drop=1e-5	failed	`1.00e+00`	1	7.60	11.89	16.25	23.85
ILU fill=5 drop=1e-2	failed	`1.00e+00`	1	7.60	11.89	16.25	23.85
ILU fill=10 drop=1e-5	failed	`nan`	128	15.37	23.55	16.25	31.62
ILU fill=10 drop=1e-4	failed	`1.00e+00`	1	15.37	23.56	16.25	31.62
ILU fill=10 drop=1e-3	failed	`1.00e+00`	1	15.38	23.57	16.25	31.63
ILU fill=10 drop=1e-2	failed	`1.00e+00`	1	15.38	23.57	16.25	31.63
ILU MMD fill=20 drop=1e-6	failed	`1.00e+00`	1	32.65	49.47	16.25	48.90
ILU MMD fill=20 drop=1e-4	failed	`1.00e+00`	1	32.65	49.47	16.25	48.90
ILU MMD fill=20 drop=1e-2	failed	`1.00e+00`	1	32.65	49.48	16.25	48.90
ILU MMD default-pivot fill=20 drop=1e-2	failed	`1.00e+00`	1	32.66	49.49	16.25	48.91
ILU MMD fill=40 drop=1e-6	failed	`1.00e+00`	1	64.98	97.97	16.25	81.23
ILU MMD fill=40 drop=1e-2	failed	`1.23e+07`	65	64.98	97.97	16.25	81.23
ILU MMD fill=40 drop=1e-4	failed	`1.00e+00`	1	64.98	97.97	16.25	81.23
ILU MMD fill=80 drop=1e-2	failed	`6.55e+01`	128	117.11	176.17	16.25	133.36
ILU MMD fill=160 drop=1e-2	failed	`6.55e+01`	128	117.11	176.17	16.25	133.36
ILU MMD fill=80 drop=1e-4	converged	`6.42e-09`	17	119.89	180.33	16.25	136.14
ILU MMD fill=160 drop=1e-4	converged	`6.42e-09`	17	119.89	180.33	16.25	136.14
ILU MMD fill=80 drop=1e-6	converged	`2.39e-10`	4	120.03	180.54	16.25	136.28
ILU MMD fill=160 drop=1e-6	converged	`2.39e-10`	4	120.03	180.54	16.25	136.28
ILU COLAMD pivot=0.1 fill=20/40/80	factor failed	singular	n/a	n/a	n/a	n/a	n/a