Smoothing Covid19 Cases Data with Linear Programming

April 28, 2020

I live in Dallas, Texas. I have been following COVID-19 cases as they increase in my local area to understand whether aggressive social distancing measures are flattening the curve. Unfortunately, the data is extremely noisy and very difficult to draw any conclusions from. I have found several sources of potential noise in this data:

  1. Weekend effect: Some testing labs close on weekends, resulting in reduced reporting
  2. Holidays: same effect as weekends but more pronounced
  3. Backlogs: results may not be processed in a timely manner, resulting in delayed reporting
  4. Large clusters that cause concentration of testing capacity on population with higher than usual prevalence

These issues don’t impact the accuracy of the cumulative number of confirmed cases, but they make it hard to tell whether social distancing measures are flattening the curve. Instead of a nice, flat cases-per-day graph, we get a bunch of spikes with no obvious trend. See the graph for Dallas below, for example.

[Figure: raw daily confirmed cases for Dallas]

Smoothing the Cases Per Day Graph

Fixing this while recovering a trend can be hard. What I see a lot of people doing is using a moving average with a window between 3 and 5 days. This doesn’t really help when the data is very noisy: it reduces the peak of the spikes, but the spikes remain. See below for a comparison between the raw Dallas data and a 3-day-window moving average smoothing of that same data.
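For reference, a minimal sketch of that kind of moving average using NumPy (the 3-day window is the one mentioned above):

import numpy as np

def moving_average(xs, window=3):
    # Average each full window of consecutive values; the output is
    # len(xs) - window + 1 points long.
    return np.convolve(xs, np.ones(window) / window, mode="valid")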

[Figure: raw Dallas daily cases (left) vs. 3-day moving average (right)]

I started to think about alternative ways to smooth this data and maybe pick out trends. I came up with the following high-level idea (a small worked example follows the list):

  1. Let nearby days “borrow” cases. So Tuesday could “borrow” cases from monday
  2. Borrow in a way that makes the “cases per day” graph as flat as possible
  3. Do not change cumulative cases. The total number of confirmed cases has to remain the same
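
For instance (purely illustrative numbers): if Monday reports 100 new cases and Tuesday reports only 20 because a lab was closed, Tuesday could borrow 40 cases from Monday so that both days show 60. The two-day total of 120 is unchanged, but the cases-per-day curve is flatter.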

These three conditions can be roughly translated into a linear program.

Translating to Linear Program

Suppose that \( c_i \) represents the number of new cases on day \( i \) and \( d_i \) is the smoothed version of \( c_i \) that we wish to compute. I formulate conditions (1) through (3) above as follows:

$$ \begin{aligned} \min \; & \max_i \, | d_i - d_{i+1} | \\ \text{subject to} \quad & \sum_{i=0}^{n} c_i = \sum_{i=0}^{n} d_i \\ & d_i = \alpha_{i,0} c_{i-1} + \alpha_{i,1} c_i + \alpha_{i,2} c_{i+1} \qquad (i = 2, \ldots, n-1) \\ & \alpha_{i,0} + \alpha_{i,1} + \alpha_{i,2} = 1 \\ & \alpha_{i,j} \geq 0 \end{aligned} $$

The min-max objective can be reduced to a linear objective with standard techniques, and the absolute values pose no problem either, so the whole thing can be stated as a linear program.
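
Concretely, the usual trick is to introduce an auxiliary variable \( t \) that bounds every absolute day-to-day difference from above and then minimize \( t \); each absolute value becomes a pair of linear inequalities:

$$ \begin{aligned} \min \; & t \\ \text{subject to} \quad & d_i - d_{i+1} \le t \qquad (i = 0, \ldots, n-1) \\ & d_{i+1} - d_i \le t \qquad (i = 0, \ldots, n-1) \end{aligned} $$

together with the equality constraints above. This auxiliary variable is the dx variable in the code below.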

Solving the Linear Program with SciPy

SciPy has the function linprog, which can solve moderately sized linear programs. I used it to solve the linear program posed above.
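
For orientation, here is a minimal sketch of the linprog interface as it is used below; the toy problem itself is made up:

import numpy as np
import scipy.optimize as opt

# Toy LP: minimize x + 2y subject to x + y >= 1 and 0 <= x, y <= 10.
# linprog expects "<=" inequalities, so x + y >= 1 is written as -x - y <= -1.
c = np.array([1.0, 2.0])
Aub = np.array([[-1.0, -1.0]])
bub = np.array([-1.0])
bounds = [(0.0, 10.0), (0.0, 10.0)]

result = opt.linprog(c, Aub, bub, None, None, bounds)
print(result.x)  # approximately [1.0, 0.0]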

import numpy as np
import scipy.optimize as opt

def epi_smooth_dx(xs):
    n=len(xs)
    
    # Column layout of the decision variables: the three alpha weights for each
    # of the n-2 interior days, then the n smoothed values y, then dx, which
    # bounds the largest day-to-day change
    a0vars=range(0,n-2)
    a1vars=range(n-2,2*(n-2))
    a2vars=range(2*(n-2),3*(n-2))
    yvars=range(3*(n-2),3*(n-2)+n)
    dx=3*(n-2)+n
    
    nvars=len(a0vars)+len(a1vars)+len(a2vars)+len(yvars)+1
    lb=np.zeros(nvars)
    ub=np.zeros(nvars)
    for i in a0vars:
        lb[i]=0.0
        ub[i]=1.0
    for i in a1vars:
        lb[i]=0.0
        ub[i]=1.0
    for i in a2vars:
        lb[i]=0.0
        ub[i]=1.0
    for i in yvars:
        lb[i]=min(xs)
        ub[i]=max(xs)
    # dx is nonnegative; the large upper bound is effectively "unbounded"
    lb[dx]=0.0
    ub[dx]=100000.0
        
    # Pin the first smoothed value to the minimum of the data; the last value
    # is only constrained to lie within the range of the data
    lb[yvars[0]]=min(xs)
    ub[yvars[0]]=min(xs)
    lb[yvars[-1]]=min(xs)
    ub[yvars[-1]]=max(xs)
        
    # Inequality rows encoding |y[i+1] - y[i]| <= dx as two "<=" constraints
    # per adjacent pair of smoothed values
    Aub=np.zeros((2*(n-1),nvars))
    bub=np.zeros(2*(n-1))
    for j,i in enumerate(yvars[:-1]):
        Aub[j,i]=-1.0
        Aub[j,i+1]=1.0
        Aub[j,dx]=-1.0
        Aub[n-1+j,i]=1.0
        Aub[n-1+j,i+1]=-1.0
        Aub[n-1+j,dx]=-1.0
        

    # Equality rows: one per interior day forcing its alpha weights to sum to 1,
    # one per interior day defining y as the weighted combination of neighbours,
    # and one final row preserving the cumulative total
    aeq=range(0,n-2)
    yeq=range(n-2,2*(n-2))
    neq=len(aeq)+len(yeq)+1
    Aeq=np.zeros((neq,nvars))
    beq=np.zeros(neq)
    

    for k,i in enumerate(aeq):
        a0=min(a0vars)+k
        a1=min(a1vars)+k
        a2=min(a2vars)+k
        Aeq[i,a0]=1.0
        Aeq[i,a1]=1.0
        Aeq[i,a2]=1.0
        beq[i]=1.0
    for k,i in enumerate(yeq):
        x=k+1
        y=min(yvars)+k+1
        a0=min(a0vars)+k
        a1=min(a1vars)+k
        a2=min(a2vars)+k
        Aeq[i,y]=1.0
        Aeq[i,a0]=-xs[x-1]
        Aeq[i,a1]=-xs[x]
        Aeq[i,a2]=-xs[x+1]
        beq[i]=0.0
    for i in yvars:
        Aeq[neq-1,i]=1.0
    beq[neq-1]=sum(xs)
    
    
    # Objective: minimize dx, the bound on the largest day-to-day change
    c=np.zeros(nvars)
    c[dx]=1.0
    result=opt.linprog(c,Aub,bub,Aeq,beq,list(zip(lb,ub)))
    
    return np.array([result.x[y] for y in yvars])

The code may also be found at this repository.
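
As a usage sketch (the cumulative case counts below are made-up numbers; in practice they would come from a reported case series), assuming the function above is defined:

import numpy as np

# Hypothetical cumulative confirmed-case counts for a region (illustrative only)
cumulative = np.array([10, 25, 60, 62, 130, 180, 181, 260, 330, 340, 420, 500])

# Daily new cases are the day-to-day differences of the cumulative series
daily = np.diff(cumulative).astype(float)

smoothed = epi_smooth_dx(daily)

# The smoother preserves the total number of cases while flattening the spikes
print(daily.sum(), smoothed.sum())
print(smoothed)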

Results

I ran this smoothing algorithm on datasets for different areas in the world:

  1. Dallas, Texas
  2. Harris County (includes Houston), Texas
  3. Italy
  4. Spain
  5. United States
[Figure grid: for each region (Dallas, Harris County, Italy, Spain, United States), three panels compare the raw daily cases, the moving-average smoothing, and the linear-program smoothing]

Conclusion

The algorithm does appear to smooth the curves much more than the moving average does, while still picking out potentially useful trends. This was for my own curiosity, though, and I’m not an epidemiologist, so I don’t know whether there is real utility to this beyond satisfying that curiosity.