Recovering Sparsity from Noisy Data

May 18, 2019

This post will again concern the solution of a system of linear equations

\[ Ax = b \]

but will show how we can recover sparsity in the variable \( x \). If \( b \) is observed perfectly (no noise, no errors), then we can use a standard least squares approach and \( x \) will contain the correct sparsity information (if it was sparse to begin with). However, if we observe \( b \) with noise, then the recovered \( x \) will likely be dense. For regression applications this can be problematic when trying to fit high-order polynomials, because a dense solution vector produces a highly oscillatory function that is not representative of the ground truth. Fortunately we can address this with regularization, specifically regularization with the L1 norm.

In my last post I showed that the L1 norm can be very robust to outliers when used as the loss function in a linear regression problem. Here I show that, when used as a regularization term in the regression formulation, it promotes sparsity in the answer and so can help recover sparse solutions even when \( b \) is not observed exactly. The regularized formulation is equivalent to a quadratic program with linear constraints, which I solve with the NAG Library routine lsq_lincon_solve.

Polynomial Regression

While the application of this post is data fitting, the central issue will be finding robust ways to approximately solve a system of linear equations

\[ Vx = d \]

where \( d \) is some kind of measured data that we are trying to fit, and \( V \) is a matrix whose form depends on the type of data fitting algorithm we use.

I use a classic approach to data fitting for this post: polynomial regression. I do it slightly differently from the Wikipedia article: instead of the monomial basis (i.e. \( 1, x, x^2, \ldots \)), I use the Legendre basis. This has two benefits: it improves numerical stability, and NumPy has a nice module for computing the matrix \( V \) and manipulating Legendre polynomials.

Thus we want to compute \( x \) in \( Vx = d \) where \( V \) is computed by the NumPy routine legvander, and \( d \) is a vector containing measured data that we would like to fit. The resulting fit can then be evaluated with the NumPy routine legval.
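As a quick illustration (a minimal sketch; the degree and sample points below are arbitrary), building \( V \) with legvander and evaluating a coefficient vector with legval looks like this:

import numpy as np
import numpy.polynomial.legendre as leg

#Degree-3 Legendre basis sampled at 5 points in [-1,1]
xs=np.linspace(-1.0,1.0,5)
V=leg.legvander(xs,3) #shape (5,4): column j holds P_j evaluated at xs
#A coefficient vector that selects just the P_2 polynomial
x=np.array([0.0,0.0,1.0,0.0])
print(V@x)              #matrix-vector product...
print(leg.legval(xs,x)) #...matches direct evaluation of the Legendre series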

The first approach to solving \( Vx = d \) will be with linear least squares.

Linear Least Squares

The first attempt here will be to compute \( x \) as the minimizer of the Euclidean norm

\[ \| d - Vx \| _ 2 \]

I won’t derive the algorithm for this here, but I will mention that there are very numerically robust methods for this minimization problem; both NumPy and SciPy expose them through their lstsq functions. Below is an example Python script solving a data fitting problem with normally distributed noise.

import numpy as np
import numpy.polynomial.legendre as leg
import scipy.linalg as la
import matplotlib.pyplot as plt

#Seed RNG for reproducible results
np.random.seed(4484)
#Maximum polynomial degree
deg=20
#Sample our function at 100 equally spaced points
nsamples=100
xs=np.linspace(-1,1,nsamples)
#Form Vandermonde matrix for least squares problem
V=leg.legvander(xs,deg)
#Fabricate a sparse solution "ys"
ys=np.zeros(deg+1)
ys[2]=1.0
#Sample this solution and add normally distributed noise
errs=np.random.normal(0.0,0.1,(nsamples))
b=V@ys+errs
#Solve by traditional least squares
(soln,res,rank,svd)=la.lstsq(V,b)
f=leg.legval(xs,soln)
exact=leg.legval(xs,ys)
#Plot results
plt.plot(xs,f)
plt.plot(xs,exact)
plt.legend(["Least-Squares Solution","Solution without noise"])
plt.ylim(-0.7,1.1)
plt.xlabel("Independent Variable")
plt.ylabel("Measured data")
plt.savefig("noise3.svg")

[Figures (per test case: measured data and least-squares fit): no noise; normally distributed noise with \( \sigma = 0.1 \), \( \sigma = 0.2 \), and \( \sigma = 0.3 \).]

L1-Regularized Linear Least Squares

To try to avoid overfitting the noise, I add a regularization term. The new formulation minimizes the following expression:

\[ \| d - Vx \| _ 2 ^ 2 + \lambda \| x \| _ 1 \]

As with my post on L1 regression, the term \( \lambda \| x \| _ 1 \) presents some challenges. In that post I reformulated the regression problem as a linear program, but here the least squares term is quadratic rather than linear, so that approach is unlikely to work. Fortunately we can reformulate the above as a constrained linear least squares problem in much the same way as we did the linear program, by introducing a new variable \( t \) to bound the absolute values of \( x \):

$$ \begin{aligned} \min_{x,\,t} \quad & \lambda \sum_{i=1}^n t_i + \| d - Vx \| _ 2 ^ 2 \\ \text{subject to} \quad & x - t \leq 0 \\ & -x - t \leq 0 \\ & t \geq 0 \end{aligned} $$
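The reason this works: the two inequality constraints together force \( -t_i \leq x_i \leq t_i \), i.e. \( t_i \geq |x_i| \), and since \( \lambda > 0 \) the objective pushes each \( t_i \) down to that bound at the optimum,

\[ t_i = |x_i| \quad\Rightarrow\quad \lambda \sum_{i=1}^n t_i = \lambda \| x \| _ 1 , \]

so the constrained problem and the regularized formulation above share the same minimizing \( x \).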

With this formulation we can use the NAG Library routine lsq_lincon_solve (e04ncc). I call it through the NAG Python interface as follows:

import numpy as np
import scipy.linalg as la
from naginterfaces.library.opt import nlp1_init
from naginterfaces.library.opt import lsq_lincon_solve
from naginterfaces.library.opt import lsq_lincon_option_string


def lstsq_l1reg(A,b,L=1.0):
    """Minimize ||b - A x||_2^2 + L*||x||_1 as a linearly constrained least squares problem."""
    (m,n)=A.shape
    nvars=2*n
    nclin=2*n
    comm = nlp1_init('lsq_lincon_solve')
    lsq_lincon_option_string("Cold Start",comm)
    lsq_lincon_option_string("Problem Type = LS2",comm)
    lsq_lincon_option_string("Iteration Limit = {}".format(1000*(nvars+nclin)),comm)
    #uncomment below for verbose output
    #lsq_lincon_option_string("Print Level = 10",comm)
    #Constraint matrix: rows encode x - t <= 0 and -x - t <= 0
    #(variable ordering is z = [t, x])
    C=np.zeros((nclin,nvars))
    C[0:n,0:n]=-np.identity(n)
    C[0:n,n:2*n]=np.identity(n)
    C[n:2*n,0:n]=-np.identity(n)
    C[n:2*n,n:2*n]=-np.identity(n)
    #Coefficients of the linear term: lambda applied to the t variables
    cs=np.zeros(nvars)
    cs[0:n]=L
    #Upper and lower bounds, first on the variables [t, x],
    #then on the linear constraints C z <= 0
    bl=np.zeros(nvars+nclin)
    bu=np.zeros(nvars+nclin)
    bl[0:n]=0.0
    bl[n:2*n]=-1.00E+20
    bl[2*n:]=-1.00E+20
    bu[0:n]=1.00E+20
    bu[n:2*n]=1.00E+20
    bu[2*n:]=0.0
    istate=np.zeros(nvars+nclin).astype(int)
    kx=np.zeros(nvars).astype(int)
    #initialize with unregularized least square solution
    xhat,res,rank,sv=la.lstsq(A,b)
    xs=np.zeros(nvars)
    xs[0:n]=np.abs(xhat) #the initial "t" values
    xs[n:2*n]=xhat #initial least squares solution
    #The matrix in the linear least squares term: only the x block enters
    B=np.zeros((m,nvars))
    B[0:m,n:2*n]=A
    (istate,kx,x,tmpa,tmpb,itera,obj,clamba)=lsq_lincon_solve(C,bl,bu,cs,istate,kx,xs,B,b,comm)
    return (x,obj)    
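For readers without access to the NAG Library, a rough cross-check of the same L1-regularized problem can be done with scikit-learn's Lasso solver. This is only a sketch, not the method used in this post: scikit-learn scales the squared loss by \( 1/(2m) \), so the alpha below is a rescaling of \( \lambda \) that only approximately matches the formulation above and may need tuning.

from sklearn.linear_model import Lasso

def lstsq_l1reg_sklearn(A,b,L=1.0):
    #Approximate equivalent of lstsq_l1reg above; the penalty is rescaled
    #because sklearn minimizes (1/(2m))*||b-Ax||^2 + alpha*||x||_1
    m=A.shape[0]
    model=Lasso(alpha=L/(2.0*m),fit_intercept=False,max_iter=100000)
    model.fit(A,b)
    return model.coef_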

I now repeat the exercise from ordinary least squares, but with the regularization term added:

np.random.seed(4484)
#Maximum polynomial degree
deg=20
#Sample our function at 100 equally spaced points
nsamples=100
xs=np.linspace(-1,1,nsamples)
#Form Vandermonde matrix for least squares problem
V=leg.legvander(xs,deg)
#Fabricate a sparse solution "ys"
ys=np.zeros(deg+1)
ys[2]=1.0
#Sample this solution and add normally distributed noise
errs=np.random.normal(0.0,0.1,(nsamples))
b=V@ys+errs
#Solve by L1-regularized least squares
soln_l1reg,obj=lstsq_l1reg(V,b,L=0.5*np.sqrt(nsamples))
#The solver returns z=[t,x]; keep only the x block
soln_l1reg=soln_l1reg[deg+1:]
g=leg.legval(xs,soln_l1reg)
exact=leg.legval(xs,ys)
#Plot results
plt.plot(xs,g)
plt.plot(xs,exact)
plt.legend(["L1-Regularized Least-Squares Solution","Solution without noise"])
plt.ylim(-0.7,1.1)
plt.xlabel("Independent Variable")
plt.ylabel("Measured data")
plt.savefig("noise1_l1reg.svg")

This results in the following plots:

[Figures (per test case: measured data and L1-regularized least-squares fit): no noise; normally distributed noise with \( \sigma = 0.1 \), \( \sigma = 0.2 \), and \( \sigma = 0.3 \).]

Sensitivity Analysis of Regularized and Classic Linear Least Squares

The above results only show point estimates, that is, a single polynomial fit to a single noisy measurement of the right-hand-side vector \( d \). A useful analysis is to look at the distribution of fits under an assumption on the input noise.

For this study I assume the noise is normally distributed and perform \( 100{,}000 \) fits, each on a different noisy measurement of \( d \). I then plot the range of values the fit can take at three levels: the \( 100\% \) range, the \( 99\% \) range, and the \( 95\% \) range.
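A minimal sketch of how such a study can be set up, assuming V, ys, xs, and nsamples from the scripts above are still in scope (the trial count is reduced here so the example runs quickly; the bands are taken pointwise across trials with np.percentile):

ntrials=1000 #reduced from the 100,000 fits used for the figures below
sigma=0.1
fits=np.zeros((ntrials,nsamples))
for k in range(ntrials):
    noisy=V@ys+np.random.normal(0.0,sigma,nsamples)
    #for the regularized bands, call lstsq_l1reg here and keep only the x block
    coeffs,res,rank,sv=la.lstsq(V,noisy)
    fits[k,:]=leg.legval(xs,coeffs)
#Pointwise ranges across the trials
lo95,hi95=np.percentile(fits,[2.5,97.5],axis=0)
lo99,hi99=np.percentile(fits,[0.5,99.5],axis=0)
lo100,hi100=fits.min(axis=0),fits.max(axis=0)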

[Figures (per test case: classic least squares versus L1-regularized least squares): normally distributed noise with \( \sigma = 0.1 \), \( \sigma = 0.2 \), and \( \sigma = 0.3 \).]

The above shows that the regularized least squares is far less sensitive to noise than classic least squares.

Explanation of Results

The results here illustrate how enforcing sparsity in the solution can help mitigate noise. The problem is that I attempt to fit a degree-20 polynomial to data that was generated by a degree-2 polynomial. Under normal circumstances this is fine, because the least squares solver will automatically detect during the solution process that the data contains no higher-order coefficients. When there is noise, however, an unconstrained solver will confuse the noise with data and try to fit higher-order polynomials to it; this presents itself as a solution that is dense when it should be sparse.

Below I give a table of the true solution, the traditional least squares solution, and the L1-regularized least squares solution. The true solution is very sparse (it has only a single nonzero component), but the least squares solution is fully dense. The L1-regularized solution correctly picks out the sparsity pattern of the solution even though the right-hand-side vector was corrupted with noise.

Legendre Coefficient | True solution | Least squares on noisy data | Regularized least squares on noisy data
 0 | 0.0 | -0.04156364593042846  | 0.0
 1 | 0.0 | -0.02206845258156348  | 0.0
 2 | 1.0 |  0.9907741612495132   | 0.035381591658564804
 3 | 0.0 | -0.01421388643597582  | 0.0
 4 | 0.0 | -0.047802895000508494 | 0.0
 5 | 0.0 |  0.06152951655018175  | 0.0
 6 | 0.0 | -0.09311765823971466  | 0.0
 7 | 0.0 |  0.18111081527148173  | 0.0
 8 | 0.0 | -0.07905704906717936  | 0.0
 9 | 0.0 |  0.15896783612454685  | 0.0
10 | 0.0 | -0.024977172465135686 | 0.0
11 | 0.0 | -0.21396381342428766  | 0.0
12 | 0.0 |  0.30963516621172493  | 0.0
13 | 0.0 |  0.03896397670861126  | 0.0
14 | 0.0 |  0.08023046946933941  | 0.0
15 | 0.0 |  0.05788822249060211  | 0.0
16 | 0.0 |  0.03973426353775713  | -1.734723475976807e-18
17 | 0.0 | -0.4117588982389817   | 0.0
18 | 0.0 | -0.32351241851819834  | 0.0
19 | 0.0 |  0.49314180803110996  | 0.0
20 | 0.0 |  0.2181379968223748   | 0.0
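A comparison like the table above can be printed directly from the solution vectors; a minimal sketch, assuming ys, soln (from the least squares script), and soln_l1reg (from the regularized script) are all in scope:

print("Coefficient  True  Least squares  L1-regularized")
for i,(t,l,r) in enumerate(zip(ys,soln,soln_l1reg)):
    print("{:11d}  {:4.1f}  {: .6f}      {: .6f}".format(i,t,l,r))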

Conclusion

Here I showed that the L1 norm can be useful for enforcing sparsity, and that this is a potentially beneficial property in some regression problems.

The usual warnings about regularization apply here. L1 regularization should only be used if there is a strongly justified reason to believe that the underlying data source is actually sparse or approximately sparse. Regularization can be used to promote almost any kind of solution you like, particularly when there are few data samples to work with and they are noisy.