This post will again concern the solution of a system of linear equations
\[ Ax = b \]
but will show how we can recover sparsity in the variable \( x \). If \( b \) is observed perfectly (no noise, no errors), then we can use standard least squares approach and \( x \) will contain the correct sparsity information (if it was sparse to begin with). However if we observe \(b \) with noise, then the recovered \( x \) will likely be dense. For regression applications this can be problematic when trying to fit high order polynomials, because the density of the solution vector will produce a highly oscillatory function that is not representative of the ground truth. Fortunately we can solve this with regularization, specifically regularizing with the L1 norm.
In my last post I showed that the L1 norm can be very robust to outliers when used as the loss function in a linear regression problem. Here I show that when used as a regularization term in the regression formulation then it promotes sparsity in the answer, and so can help recover sparse solutions even when \( b \) is not observed exactly. The regularized formulation is equivalent to a quadratic program with linear constraints which I solve with the NAG library routine lsq_lincon_solve
While the application of this post is data fitting, the central issue will be finding robust ways to approximately solve a system of linear equations
\[ Vx = d \]
where \( d \) is some kind of measured data that we are trying to fit, and \( V \) is a matrix - the value of \( V \) depends on the type of data fitting algorithm we use.
I use a classic approach to data fitting for this post: polynomial regression. I do it slightly differently from the wikipedia article. Instead of the monomial basis (i.e. \(1,x ^ 1, x ^ 2, \ldots \) ), I use the Legendre basis. This has two benefits: the first is numerical stability improvements, and also NumPy has a nice module for calculating the matrix \( V \) and manipulating Legendre polynomials.
Thus we want to compute \( x \) in \( Vx = d \) where \( V \) is computed by the NumPy routine legvander
, and \( d \) is a vector containing measured data that we would like to fit. The resulting fit can then be evaluated with the NumPy routine legval
.
The first approach to solving \( Vx = d \) will be with linear least squares.
The first attempt here will be to compute \( x \) as the minimizer of the Euclidean norm
\[ | d - Ax | _ 2 \]
I won’t derive the algorithm for this here, but I will mention that there are very numerically robust solutions to this minimization problem. NumPy uses these robust techniques for its function lstsq. I show an example python script solving a data fitting problem with normally distributed noise
#Seed RNG for reproducible results
np.random.seed(4484)
#Maximum polynomial degree
deg=20
#Sample our function at 100 equally spaced points
nsamples=100
xs=np.linspace(-1,1,nsamples)
#Form Vandermonde matrix for least squares problem
V=leg.legvander(xs,deg)
#Fabricate a sparse solution "ys"
ys=np.zeros(deg+1)
ys[2]=1.0
#Sample this solution with noisy "ys"
errs=np.random.normal(0.0,0.1,(nsamples))
b=V@ys+errs
#Solve by traditional least squares
(soln,res,rank,svd)=la.lstsq(V,b)
f=leg.legval(xs,soln)
exact=leg.legval(xs,ys)
#Plot results
plt.plot(xs,f)
plt.plot(xs,exact)
plt.legend(["Least-Squares Solution","Solution without noise"])
plt.ylim(-0.7,1.1)
plt.xlabel("Independent Variable")
plt.ylabel("Measured data")
plt.savefig("noise3.svg")
Test Case | Data | Fit |
---|---|---|
No noise | ||
Normally distributed noise \( \sigma = 0.1 \) | ||
Normally distributed noise \( \sigma = 0.2 \) | ||
Normally distributed noise \( \sigma = 0.3 \) |
To try and avoid over-fitting noise I add a regularization term. The new formulation then tries to minimize the following expression:
\[ | d - Ax | _ 2 ^ 2 + \lambda | x | _ 1 \]
Like with my post on L1 regression the term \( \lambda | x | _ 1 \) presents some challenges. In that post I reformulated the regression problem as a linear program, but the presence of the traditional least squares term makes this approach unikely to work because it is not a simple linear term. Fortunately we can reformulate the above as a constrained linear least squares in a very similar way as we did the linear program, by introducing a new variable \( t \) to represent the absolute values of \( x \):
With this formulation we can use the NAG library routine e04ncc. I do this through the NAG Python interface as follows:
import numpy as np
import scipy.linalg as la
from naginterfaces.library.opt import nlp1_init
from naginterfaces.library.opt import lsq_lincon_solve
from naginterfaces.library.opt import lsq_lincon_option_string
def lstsq_l1reg(A,b,L=1.0):
(m,n)=A.shape
nvars=2*n
nclin=2*n
comm = nlp1_init('lsq_lincon_solve')
lsq_lincon_option_string("Cold Start",comm)
lsq_lincon_option_string("Problem Type = LS2",comm)
lsq_lincon_option_string("Iteration Limit = {}".format(1000*(nvars+nclin)),comm)
#uncomment below for verbose output
#lsq_lincon_option_string("Print Level = 10",comm)
#Constraint matrix. This holds constraints between "t" and "x" variables
C=np.zeros((nclin,nvars))
C[0:n,0:n]=-np.identity(n)
C[0:n,n:2*n]=np.identity(n)
C[n:2*n,0:n]=-np.identity(n)
C[n:2*n,n:2*n]=-np.identity(n)
#Coefficients for the linear term in the least squares problem
cs=np.zeros(nvars)
cs[0:n]=L
#Upper and lower bounds for the constraints
bl=np.zeros(nvars+nclin)
bu=np.zeros(nvars+nclin)
bl[0:n]=0.0
bl[n:2*n]=-1.00E+20
bl[2*n:]=-1.00E+20
bu[0:n]=1.00E+20
bu[n:2*n]=1.00E+20
bu[2*n:]=0.0
istate=np.zeros(nvars+nclin).astype(np.int)
kx=np.zeros(nvars).astype(np.int)
#initialize with unregularized least square solution
xhat,res,rank,sv=la.lstsq(A,b)
xs=np.zeros(nvars)
xs[0:n]=np.abs(xhat) #the initial "t" values
xs[n:2*n]=xhat #initial leasst squares solution
#The matrix in the linear least squares formulation
B=np.zeros((m,nvars))
B[0:m,n:2*n]=A
(istate,kx,x,tmpa,tmpb,itera,obj,clamba)=lsq_lincon_solve(C,bl,bu,cs,istate,kx,xs,B,b,comm)
return (x,obj)
I now repeat the exercise for normal least squares but with the regularization term added
np.random.seed(4484)
#Maximum polynomial degree
deg=20
#Sample our function at 100 equally spaced points
nsamples=100
xs=np.linspace(-1,1,nsamples)
#Form Vandermonde matrix for least squares problem
V=leg.legvander(xs,deg)
#Fabricate a sparse solution "ys"
ys=np.zeros(deg+1)
ys[2]=1.0
#Sample this function with noisy "ys"
errs=np.random.normal(0.0,0.1,(nsamples))
b=V@ys+errs
#Solve by L1-regularized least squares
soln_l1reg,obj=lstsq_l1reg(V,b,L=0.5*np.sqrt(nsamples))
soln_l1reg=soln_l1reg[deg+1:]
g=leg.legval(xs,soln_l1reg)
exact=leg.legval(xs,ys)
#Plot results
plt.plot(xs,g)
plt.plot(xs,exact)
plt.legend(["L1-Regularized Least-Squares Solution","Solution without noise"])
plt.ylim(-0.7,1.1)
plt.xlabel("Independent Variable")
plt.ylabel("Measured data")
plt.savefig("noise1_l1reg.svg")
This results in the following plots
Test Case | Data | Fit |
---|---|---|
No noise | ||
Normally distributed noise \( \sigma = 0.1 \) | ||
Normally distributed noise \( \sigma = 0.2 \) | ||
Normally distributed noise \( \sigma = 0.3 \) |
The above results only show point estimates, that is a single polynomial fit to a single noisy measurement of the right-hand-side vector \( d \). A useful analysis is to see what the distribution of fits is given an assumption on the input noise.
For this study I assume the noise is normally distributed and perform \(100,000\) fits, each on a different noisy measurement of \( d \). I then plot the range of possible values the fit could take. I plot three such values, the \( 100% \) range, the \( 99% \) range, and the \( 95% \) range.
Test Case | Classic Least Squares | L1 Regularized Least Squares |
---|---|---|
Normally distributed noise \( \sigma = 0.1 \) | ||
Normally distributed noise \( \sigma = 0.2 \) | ||
Normally distributed noise \( \sigma = 0.3 \) |
The above shows that the regularized least squares is far less sensitive to noise than classic least squares.
The results here illustrate how attempting to enforce sparsity in the solution can help mitigate noise issues. The problem here is that I attempt to fit a 20th
order polynomial to data that was generated by a 2nd
order polynomial. Under normal circumstances this is fine because the least squares solver will be able to automatically detect that there are no coefficients of higher order in the data during the solution process. When there is noise however an unconstrained solver will confuse the noise with data and try to fit higher order polynomials into the solution, this presents itself as a solution that is dense when it should be sparse.
I give a table below for the solution, the traditional least squares solution, and the L1-regularized least squares solution. The solution is very sparse (only has a single component), but the least squares solution is fully dense. The L1 regularized solution correctly picks out the sparsity of the solution even though the right-hand-side vector was corrupted with noise
Legendre Coefficient | True solution | Least squares on noisy data | Regularized least squares on noisy data |
---|---|---|---|
Legendre Coefficient 0 | 0.0 | -0.04156364593042846 | 0.0 |
Legendre Coefficient 1 | 0.0 | -0.02206845258156348 | 0.0 |
Legendre Coefficient 2 | 1.0 | 0.9907741612495132 | 0.035381591658564804 |
Legendre Coefficient 3 | 0.0 | -0.01421388643597582 | 0.0 |
Legendre Coefficient 4 | 0.0 | -0.047802895000508494 | 0.0 |
Legendre Coefficient 5 | 0.0 | 0.06152951655018175 | 0.0 |
Legendre Coefficient 6 | 0.0 | -0.09311765823971466 | 0.0 |
Legendre Coefficient 7 | 0.0 | 0.18111081527148173 | 0.0 |
Legendre Coefficient 8 | 0.0 | -0.07905704906717936 | 0.0 |
Legendre Coefficient 9 | 0.0 | 0.15896783612454685 | 0.0 |
Legendre Coefficient 10 | 0.0 | -0.024977172465135686 | 0.0 |
Legendre Coefficient 11 | 0.0 | -0.21396381342428766 | 0.0 |
Legendre Coefficient 12 | 0.0 | 0.30963516621172493 | 0.0 |
Legendre Coefficient 13 | 0.0 | 0.03896397670861126 | 0.0 |
Legendre Coefficient 14 | 0.0 | 0.08023046946933941 | 0.0 |
Legendre Coefficient 15 | 0.0 | 0.05788822249060211 | 0.0 |
Legendre Coefficient 16 | 0.0 | 0.03973426353775713 | -1.734723475976807e-18 |
Legendre Coefficient 17 | 0.0 | -0.4117588982389817 | 0.0 |
Legendre Coefficient 18 | 0.0 | -0.32351241851819834 | 0.0 |
Legendre Coefficient 19 | 0.0 | 0.49314180803110996 | 0.0 |
Legendre Coefficient 20 | 0.0 | 0.2181379968223748 | 0.0 |
Here I showed that the L1 norm can be useful for enforcing sparsity, and that this is a potentially beneficial property in some regression problems.
The usual warnings about regularization should apply here. L1 regularization should only be used if there is strongly justified reason to believe that the data source is actually sparse or approximately sparse. Regularization can be used to promote any kind of solution that you like, particularly when there are not that many data samples to work with and they are noisy.