Fitting Noisy Data with Outliers

April 21, 2019

Data fitting algorithms give us a way to summarize complicated data by simple formulas. Most real-world data is corrupted by noise, which can ruin the usefulness of data fitting algorithms. Fortunately, as long as we have a good model for that noise, such as its distribution, we can account for it and still achieve high-quality fitting. If we do not have a good model for the noise, for example if there are rare spikes in the data which do not fit a known distribution, then we start running into trouble.

Here I show how to make polynomial fitting work even if the data we want to fit has occasional outliers which pollute standard fitting algorithms. I start by showing a “standard” approach by linear least squares, and how it starts to break down in the presence of outliers in the data. I then formulate the fit using a different norm and show how to solve the minimization problem with a linear program. The new regression algorithm proves very robust to outliers.

Polynomial Regression

While the application of this post is data fitting, the central issue will be finding robust ways to approximately solve a system of linear equations

\[ Vx = d \]

where \( d \) is a vector of measured data that we are trying to fit, \( x \) is the vector of unknown fit coefficients, and \( V \) is a matrix whose entries depend on the type of data fitting algorithm we use.

I use a classic approach to data fitting for this post: polynomial regression. I do it slightly differently from the Wikipedia article. Instead of the monomial basis (i.e. \(1, x^1, x^2, \ldots\)), I use the Legendre basis. This has two benefits: it improves numerical stability, and NumPy has a nice module (numpy.polynomial.legendre) for calculating the matrix \( V \) and manipulating Legendre polynomials.
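
As a rough illustration of the stability claim, the sketch below compares the condition numbers of the monomial and Legendre Vandermonde matrices, using NumPy's vander and legvander routines. The sample count (200), the interval [-1, 1], and the degree (20) are assumptions chosen for illustration, not values taken from this post.

```python
import numpy as np
from numpy.polynomial import legendre

# Assumed setup: 200 equally spaced points on [-1, 1], degree 20
ys = np.linspace(-1.0, 1.0, 200)
order = 20

# Monomial basis: columns are 1, y, y^2, ..., y^order
V_mono = np.vander(ys, order + 1, increasing=True)
# Legendre basis: columns are P_0(y), ..., P_order(y)
V_leg = legendre.legvander(ys, order)

# A smaller condition number means the least squares problem is
# less sensitive to rounding errors and noise
cond_mono = np.linalg.cond(V_mono)
cond_leg = np.linalg.cond(V_leg)
print("monomial cond:", cond_mono)
print("Legendre cond:", cond_leg)
```

Under these assumptions the Legendre matrix should come out far better conditioned than the monomial one.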

Thus we want to compute \( x \) in \( Vx = d \) where \( V \) is computed by the NumPy routine legvander, and \( d \) is a vector containing measured data that we would like to fit. The resulting fit can then be evaluated with the NumPy routine legval.
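
To make the roles of the two routines concrete, here is a minimal sketch. The sample points and the coefficient vector are made up for illustration. It checks that multiplying \( V \) by a coefficient vector gives the same values as evaluating the Legendre series with legval.

```python
import numpy as np
from numpy.polynomial import legendre

# Assumed sample points: 5 equally spaced values on [-1, 1]
ys = np.linspace(-1.0, 1.0, 5)

# Columns of V are P_0, P_1, P_2 evaluated at ys
V = legendre.legvander(ys, 2)

# Hypothetical coefficient vector x for the series 1*P_0 + 2*P_1 + 3*P_2
x = np.array([1.0, 2.0, 3.0])

# Multiplying by V and calling legval evaluate the same Legendre series
d_from_V = V @ x
d_from_legval = legendre.legval(ys, x)
print(np.allclose(d_from_V, d_from_legval))
```

So once a solver has produced \( x \), legval is all that is needed to plot the fit at any set of points.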

The first approach to solving \( Vx = d \) will be with linear least squares.

Linear Least Squares

The first attempt here will be to compute \( x \) as the minimizer of the Euclidean norm

\[ \| d - Vx \|_2 \]

I won’t derive the algorithm for this here, but I will mention that there are very numerically robust solutions to this minimization problem. NumPy uses these robust techniques in its function numpy.linalg.lstsq. Here is an example Python script that solves a data fitting problem:

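import numpy as np
from numpy.polynomial import legendre
#Use 20th order polynomial fit
order=20
#Number of samples of our data d (1000 is an assumed value)
nsamples=1000
#The independent variable sampled at regular intervals (the interval [-1,1] is assumed)
ys=np.linspace(-1.0,1.0,nsamples)
#The Legendre Vandermonde matrix
V=legendre.legvander(ys,order)
#Generate some data to fit
f=np.cos(ys) + np.sin(ys)*np.sin(ys)*np.sin(ys)+np.cos(ys)*np.cos(ys)*np.cos(ys*ys)
#Do the fit
coeffs=np.linalg.lstsq(V,f,rcond=None)[0]
#Evaluate the fit for plotting purposes
g=legendre.legval(ys,coeffs)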


To generate the results below I use essentially the same script as above, but I modify it to add noise to f and then to add outliers to f, so that we can see the impact of each on the resulting fit. The new script is:

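f=np.cos(ys) + np.sin(ys)*np.sin(ys)*np.sin(ys)+np.cos(ys)*np.cos(ys)*np.cos(ys*ys)
#Add noise (Gaussian noise of amplitude 0.1 is an assumption)
f=f+0.1*np.random.randn(f.size)
#Add outliers at random positions (an outlier size of 1.0 is also an assumption)
if noutliers>0:
    f[np.random.choice(f.size,noutliers,replace=False)]+=1.0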
Test Case | Data | Fit
No noise, no outliers | result1 | result2
Noise, no outliers | result3 | result4
Noise, 5 outliers | result5 | result6
Noise, 50 outliers | result7 | result8

Observe that the fit to the “true” data (without noise) is very good, the fit to noisy data still faithfully represents the noiseless data, but outliers start to pollute the fit.

Using a Different Norm

The problem above is that the outliers are an order of magnitude larger than the inherent noise in the measured data. The Euclidean norm squares every residual, so the outliers swamp the error that we minimize and dominate the whole optimization problem. We need to look for an alternative to the Euclidean norm which does not have this effect.

A common alternative to the Euclidean norm used above is the 1-norm. The 1-norm has the advantage of not squaring every term, so outliers impact the solution less than in the Euclidean case.
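
A small example makes the difference concrete. When the model is a single constant, the Euclidean-norm minimizer is the mean of the data and the 1-norm minimizer is the median. The data values below are made up for illustration.

```python
import numpy as np

# Made-up measurements near 1.0, with a single outlier at 100.0
d = np.array([0.9, 1.0, 1.1, 1.0, 100.0])

# Best constant in the Euclidean norm is the mean
c_l2 = np.mean(d)
# Best constant in the 1-norm is the median
c_l1 = np.median(d)

print("2-norm fit:", c_l2)
print("1-norm fit:", c_l1)
```

The single outlier drags the mean far away from the bulk of the data, while the median stays with the other four measurements.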

Thus we now need to find \( x \) which minimizes

\[ \| d - Ax \|_1 \]

where \( A \) denotes the same matrix as \( V \) above.

The problem with this formulation is that, unlike the Euclidean case, there is no simple formula for the minimizer: the 1-norm is not differentiable wherever a residual is zero, so the usual least squares machinery does not apply. Recall that before we simply made a call to NumPy’s lstsq.

We have to do a mild translation here, so that solving this hard problem becomes a simple library function call. It turns out that the unconstrained optimization problem

\[ \min_x \| d - Ax \|_1 \]

is equivalent to the constrained optimization problem

$$ \begin{aligned} \min_{x,\,t} \quad & \sum_{i=1}^n t_i \\ \text{subject to} \quad & Ax - t \leq d \\ & -Ax - t \leq -d \\ & t \geq 0 \end{aligned} $$

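where we optimize over the variables \( x \) and the auxiliary variables \( t \), one per entry of \( d \). The first two constraints together say \( |(Ax)_i - d_i| \leq t_i \) for each \( i \), and because we minimize the sum of the \( t_i \), each \( t_i \) equals the absolute residual at the optimum. This is nothing more than a linear program. The NAG library has a great solver for dense linear programs like this called e04mfa. If that documentation intimidates, don’t worry: there is a Python interface, which I show below.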

#Find least-1-norm solution to Ax=b using linear programming
import numpy as np
from naginterfaces.library.opt import lp_solve
from naginterfaces.library.opt import nlp1_init
from naginterfaces.library.opt import lp_option_string

def lst1norm_nag(A,b):
    tcons=[k for k in range(m+n,2*m+n)]

    for i in range(0,m):
    for i in range(m,m+n):
    for i in range(m+n,m+n+m):
    for i in range(m+n+m,m+n+m+m):
    comm = nlp1_init('lp_solve')
    lp_option_string("Cold Start",comm)
    return x[m:2*m]
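
For readers without access to the NAG library, the same linear program can be handed to any LP solver. The sketch below uses SciPy's linprog in place of the NAG routine, so it is a substitute for the function above rather than a copy of it. The name lst1norm_scipy, the variable ordering [x, t], and the test data are my own choices for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def lst1norm_scipy(A, b):
    """Minimize ||b - A x||_1 with an LP solver (hypothetical helper)."""
    m, n = A.shape
    # Unknowns are stacked as [x (n entries), t (m entries)]
    c = np.concatenate([np.zeros(n), np.ones(m)])  # objective: sum of t
    I = np.eye(m)
    # Rows encode  A x - t <= b  and  -A x - t <= -b
    A_ub = np.block([[A, -I], [-A, -I]])
    b_ub = np.concatenate([b, -b])
    # x is free, t is nonnegative
    bounds = [(None, None)] * n + [(0, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n]

# Made-up example: points on the line 1 + 2*y, with one large outlier
ys = np.linspace(-1.0, 1.0, 11)
A = np.column_stack([np.ones_like(ys), ys])
b = 1.0 + 2.0 * ys
b[3] += 50.0

x_l1 = lst1norm_scipy(A, b)
x_l2 = np.linalg.lstsq(A, b, rcond=None)[0]
print("1-norm fit:", x_l1)
print("2-norm fit:", x_l2)
```

In this toy problem the 1-norm fit should recover the underlying line despite the outlier, while the least squares intercept is pulled noticeably toward it.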


Test Case | Data | Fit
No noise, no outliers | result9 | result10
Noise, no outliers | result11 | result12
Noise, 5 outliers | result13 | result6
Noise, 50 outliers | result14 | result15

By visual inspection we can see the L1 fits match the trend of the data better than the L2 fits, even when there are some outliers.

Conclusion and a warning

By changing norm and switching to a linear programming formulation we were able to compute a polynomial fit to fairly noisy data with outliers. We should always be careful though about what constitutes an “outlier.” An outlier may actually be real data that we must account for when constructing models. In this blog post I had an implicit assumption that outliers were any data which severely violated the model for noise, and under that assumption it is therefore desirable to dampen out their impact on the polynomial fit. In other scenarios an outlier may actually be legitimate and informative data and simply damping it out could potentially result in drawing incorrect conclusions from that data.