Note on Quantizing Deflation Subspaces

March 28, 2026

Here I achieve a roughly 4X reduction in the size of a deflation basis without degrading the convergence behavior that basis provides. In principle this enables faster deflated solves with very little downside. Deflation methods let you precompute and store a deflation basis for a matrix A and use that stored basis to accelerate convergence on subsequent iterative solves involving A. It is an explicit tradeoff: we accept an upfront computational cost (the subspace computation, usually an eigensolver) as well as additional memory footprint and memory traffic on subsequent solves, since every iteration of a deflation method must read the deflation vectors at least once.

I demonstrate my methodology and results below.

Methodology and Results

Using my Krylov-bits library I formed indefinite sparse matrices of the following form (varying mx and my):

import numpy as np
import scipy.sparse as spla

rng = np.random.default_rng()

mx = 32
my = 32
m = mx * my
bands = [0, 1, mx]
A = spla.diags([rng.uniform(-1, 1, size=m) for _ in bands], bands, shape=(m, m))
# Symmetrize
A = A + A.T

In each case I precomputed a deflation subspace whose dimension k is 5% of the problem size n:

mx=my    n=mx*my    Deflation subspace size k
32       1024       51
64       4096       205
128      16384      819
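The post does not show how the subspace itself is computed inside Krylov-bits. As a hedged sketch (an assumption on my part, not necessarily what Krylov-bits does), a typical choice for a symmetric indefinite A is the eigenvectors associated with the k smallest-magnitude eigenvalues, found via shift-invert Lanczos around zero; the function name `deflation_basis` here is mine:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as splin

def deflation_basis(A, k):
    """Return an (m, k) orthonormal basis of eigenvectors of the
    symmetric matrix A whose eigenvalues are closest to zero.

    Shift-invert around sigma=0 targets the near-null eigenvalues
    that slow MINRES down on indefinite problems.
    """
    vals, vecs = splin.eigsh(A, k=k, sigma=0.0, which="LM")
    return vecs
```

Shift-invert requires a sparse factorization of A, which is the upfront cost the tradeoff in the introduction refers to.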

Next I solved a linear system using MINRES three ways: without deflation, with deflation, and with deflation after unsetting mantissa bits of the basis. I record iteration counts and relative residuals. All arithmetic is fp64, so with 4 kept mantissa bits we achieve an effective 4X compression relative to the originally computed deflation basis because

fp64 = 1 sign bit + 11 exponent bits + 52 mantissa bits
packed_deflation_format = 1 sign bit + 11 exponent bits + 4 mantissa bits
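Unsetting the low mantissa bits is a few lines with NumPy's bit-level view of fp64. This is a sketch of the operation described above (the function name `truncate_mantissa` is mine); it clears the low 52 - keep_bits mantissa bits while leaving the sign and exponent fields untouched:

```python
import numpy as np

def truncate_mantissa(x, keep_bits):
    """Zero out all but the top `keep_bits` mantissa bits of fp64 values.

    The sign (1 bit) and exponent (11 bits) are preserved; the low
    52 - keep_bits mantissa bits are cleared, i.e. truncation toward
    zero in magnitude rather than round-to-nearest.
    """
    bits = np.asarray(x, dtype=np.float64).view(np.uint64)
    mask = np.uint64(~((1 << (52 - keep_bits)) - 1) & 0xFFFFFFFFFFFFFFFF)
    return (bits & mask).view(np.float64)
```

With keep_bits=4 the relative perturbation of each entry is below 2**-4, which is evidently small enough not to disturb the span of the deflation subspace in these experiments.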

I do not actually pack the above bits into a 16-bit integer to realize the memory savings; I will experiment with that in a later post. These results should therefore be viewed as a simulated benefit.
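For what it's worth, the packing deferred to that later post is mechanically simple: the 16 surviving bits are exactly the top 16 bits of each fp64 word. A hedged sketch (not code that was run for these results; the names `pack16`/`unpack16` are mine):

```python
import numpy as np

def pack16(x):
    """Keep the top 16 bits of each fp64 value (sign, 11 exponent bits,
    4 mantissa bits) and store them as uint16 -- a 4X size reduction."""
    bits = np.asarray(x, dtype=np.float64).view(np.uint64)
    return (bits >> np.uint64(48)).astype(np.uint16)

def unpack16(p):
    """Expand packed uint16 values back to fp64 with zeroed low mantissa bits."""
    bits = np.asarray(p, dtype=np.uint64) << np.uint64(48)
    return bits.view(np.float64)
```

In a real solver the unpacking would happen on the fly as the basis is streamed through each iteration, trading a shift per entry for a 4X cut in memory traffic.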

Small example (m=1024)

Run                                  Kept mantissa bits  MINRES iterations  Relative residual  Relative error
Plain MINRES                         52                  1490               6.088e-07          3.113e-03
Deflated MINRES, unquantized basis   52                  254                3.592e-07          2.775e-06
Deflated MINRES, quantized basis     4                   260                3.671e-07          3.387e-06
Deflated MINRES, quantized basis     1                   372                9.465e-07          2.120e-04

Larger Example (m=4096)

Run                                  Kept mantissa bits  MINRES iterations  Relative residual  Relative error
Plain MINRES                         52                  6285               1.296e-06          7.321e-04
Deflated MINRES, unquantized basis   52                  300                6.520e-07          6.050e-06
Deflated MINRES, quantized basis     4                   300                6.622e-07          6.158e-06

Largest Example (m=16384)

Run                                  Kept mantissa bits  MINRES iterations  Relative residual  Relative error
Plain MINRES                         52                  9658               1.939e-06          8.364e-03
Deflated MINRES, unquantized basis   52                  326                1.525e-06          1.374e-05
Deflated MINRES, quantized basis     4                   327                1.410e-06          1.243e-05

Conclusion

I precomputed a deflation basis in FP64, then post hoc quantized that basis by truncating mantissa bits while preserving the FP64 sign and exponent fields. Keeping only 4 mantissa bits corresponds to an effective 16-bit representation per basis entry, or about a 4x reduction in storage relative to FP64. Across these experiments, this caused little to no degradation in the convergence of the subsequent deflated iterative solve. In principle, those 16 bits could be packed into an integer representation to reduce memory traffic and basis storage. Additional savings may be possible by compressing exponent information as well, for example with shared exponents over spatial blocks if the basis exhibits local correlation.
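To make the shared-exponent idea concrete, here is a hedged sketch of one possible block scheme (entirely hypothetical, not something tested in this post): each contiguous block of entries stores small integer mantissas against a single power-of-two scale derived from the block's largest magnitude. The function name `block_shared_exponent` is mine.

```python
import numpy as np

def block_shared_exponent(x, block=32, mant_bits=8):
    """Simulate quantization with one shared exponent per contiguous block.

    Each block stores signed integer mantissas scaled by a power of two
    covering the block's largest magnitude; this returns the dequantized
    values so the induced error can be inspected.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad)).reshape(-1, block)
    # One exponent per block: frexp gives max|v| = f * 2**e with f in [0.5, 1)
    exps = np.frexp(np.max(np.abs(xp), axis=1, keepdims=True))[1]
    scale = np.ldexp(1.0, exps - (mant_bits - 1))
    q = np.round(xp / scale).astype(np.int32)   # small-integer mantissas
    return (q * scale).reshape(-1)[: len(x)]    # dequantized values
```

If the basis vectors are smooth over the grid, entries within a spatial block share magnitude, so the per-block exponent costs little accuracy while shrinking the 11 exponent bits per entry to a handful per block.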