I live in Dallas, Texas. I have been following COVID-19 case counts as they rise in my local area to understand whether aggressive social distancing measures are flattening the curve. Unfortunately, the data is extremely noisy and very difficult to draw any conclusions from. I have found several sources of potential noise in this data:
These issues don’t impact the accuracy of the cumulative number of confirmed cases, but they make it hard to tell whether social distancing measures are flattening the curve. Instead of a smooth cases-per-day graph, we get a bunch of spikes with no obvious trend. See the graph for Dallas below, for example.
Fixing the noise while still recovering a trend can be hard. What I see a lot of people doing is using a moving average with a window between 3 and 5 days. This doesn’t really help when the data is very noisy: it reduces the height of the spikes, but the spikes remain. See below for a comparison between the raw Dallas data and a 3-day-window moving average of that same data.
| Raw Dallas Data | Moving Average Dallas Data |
| --- | --- |
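For reference, the moving average itself is nothing exotic. This isn’t necessarily the exact computation behind the plot above, but a centered 3-day moving average is a couple of lines of NumPy:

```python
import numpy as np

def moving_average(xs, window=3):
    """Centered moving average; output is shorter than the input by window - 1."""
    kernel = np.ones(window) / window
    return np.convolve(xs, kernel, mode="valid")
```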
I started to think about alternative ways to smooth this data and maybe pick out trends. The high-level idea I came up with is as follows:

1. Each smoothed value should be a weighted average (a convex combination) of the raw values on that day and the two neighboring days.
2. The smoothed series should preserve the total number of cases.
3. The largest day-to-day change in the smoothed series should be as small as possible.

These three conditions can be roughly translated into a linear program.
Supposing that \( c_i \) represents the number of new cases on day \( i \), and \( d_i \) is the smoothed version of \( c_i \) which we wish to compute, I formulate conditions (1) through (3) above as follows.
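Reading the constraints off the implementation below (the weight names \( a_i, b_i, e_i \) are mine; the code calls them a0, a1, and a2), the program is roughly:

\[
\begin{aligned}
\min \quad & \max_i \; \lvert d_{i+1} - d_i \rvert \\
\text{subject to} \quad & d_i = a_i c_{i-1} + b_i c_i + e_i c_{i+1} \quad \text{for each interior day } i, \\
& a_i + b_i + e_i = 1, \qquad a_i, b_i, e_i \in [0, 1], \\
& \textstyle\sum_i d_i = \sum_i c_i.
\end{aligned}
\]

The code additionally pins the first smoothed value to the series minimum and keeps every \( d_i \) within the observed range of the data.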
There are standard techniques for reducing the min-max objective to a linear one, and the absolute values likewise pose no problem for expressing this as a linear program.
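Concretely, introducing an auxiliary variable \( t \) replaces the min-max objective with:

\[
\min \; t \quad \text{subject to} \quad -t \le d_{i+1} - d_i \le t \quad \text{for all } i,
\]

which is exactly how the implementation below handles it: the variable dx plays the role of \( t \).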
SciPy has the function linprog (in scipy.optimize), which can solve moderately sized linear programs. I used it to solve the linear program posed above.
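As a minimal illustration of the linprog interface (a toy problem, not part of the smoother): minimize \( x_0 + 2 x_1 \) subject to \( x_0 + x_1 \ge 1 \) with both variables in \( [0, 1] \). Since linprog expects upper-bound constraints, the \( \ge \) constraint is negated:

```python
import numpy as np
import scipy.optimize as opt

c = np.array([1.0, 2.0])         # objective: minimize x0 + 2*x1
A_ub = np.array([[-1.0, -1.0]])  # -x0 - x1 <= -1, i.e. x0 + x1 >= 1
b_ub = np.array([-1.0])
res = opt.linprog(c, A_ub, b_ub, bounds=[(0, 1), (0, 1)])
print(res.x)  # -> [1. 0.]
```

With that interface in mind, the full smoother looks like this: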
```python
import numpy as np
import scipy.optimize as opt


def epi_smooth_dx(xs):
    n = len(xs)
    # Variable layout in the LP vector: three convex-combination weights
    # (a0, a1, a2) per interior day, then the n smoothed values y, then a
    # single scalar dx bounding the largest day-to-day change.
    a0vars = range(0, n - 2)
    a1vars = range(n - 2, 2 * (n - 2))
    a2vars = range(2 * (n - 2), 3 * (n - 2))
    yvars = range(3 * (n - 2), 3 * (n - 2) + n)
    dx = 3 * (n - 2) + n
    nvars = len(a0vars) + len(a1vars) + len(a2vars) + len(yvars) + 1
    # Bounds: weights lie in [0, 1]; smoothed values stay within the range
    # of the raw data; dx is nonnegative with a generous upper bound.
    lb = np.zeros(nvars)
    ub = np.zeros(nvars)
    for i in a0vars:
        lb[i] = 0.0
        ub[i] = 1.0
    for i in a1vars:
        lb[i] = 0.0
        ub[i] = 1.0
    for i in a2vars:
        lb[i] = 0.0
        ub[i] = 1.0
    for i in yvars:
        lb[i] = min(xs)
        ub[i] = max(xs)
    lb[dx] = 0.0
    ub[dx] = 100000.0
    # Pin the first smoothed value to the series minimum; the last value is
    # left free within the data range.
    lb[yvars[0]] = min(xs)
    ub[yvars[0]] = min(xs)
    lb[yvars[-1]] = min(xs)
    ub[yvars[-1]] = max(xs)
    # Inequality constraints: |y[i+1] - y[i]| <= dx, written as two
    # one-sided constraints for each consecutive pair of days.
    Aub = np.zeros((2 * (n - 1), nvars))
    bub = np.zeros(2 * (n - 1))
    for j, i in enumerate(yvars[:-1]):
        Aub[j, i] = -1.0
        Aub[j, i + 1] = 1.0
        Aub[j, dx] = -1.0
        Aub[n - 1 + j, i] = 1.0
        Aub[n - 1 + j, i + 1] = -1.0
        Aub[n - 1 + j, dx] = -1.0
    # Equality constraints: weights sum to one, each interior y equals the
    # convex combination of its three neighboring raw values, and the
    # smoothed series preserves the total case count.
    aeq = range(0, n - 2)
    yeq = range(n - 2, 2 * (n - 2))
    neq = len(aeq) + len(yeq) + 1
    Aeq = np.zeros((neq, nvars))
    beq = np.zeros(neq)
    for k, i in enumerate(aeq):
        a0 = min(a0vars) + k
        a1 = min(a1vars) + k
        a2 = min(a2vars) + k
        Aeq[i, a0] = 1.0
        Aeq[i, a1] = 1.0
        Aeq[i, a2] = 1.0
        beq[i] = 1.0
    for k, i in enumerate(yeq):
        x = k + 1
        y = min(yvars) + k + 1
        a0 = min(a0vars) + k
        a1 = min(a1vars) + k
        a2 = min(a2vars) + k
        Aeq[i, y] = 1.0
        Aeq[i, a0] = -xs[x - 1]
        Aeq[i, a1] = -xs[x]
        Aeq[i, a2] = -xs[x + 1]
        beq[i] = 0.0
    for i in yvars:
        Aeq[neq - 1, i] = 1.0
    beq[neq - 1] = sum(xs)
    # Objective: minimize dx, the largest day-to-day change.
    c = np.zeros(nvars)
    c[dx] = 1.0
    result = opt.linprog(c, Aub, bub, Aeq, beq, list(zip(lb, ub)))
    return np.array([result.x[y] for y in yvars])
```
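A quick way to sanity-check the smoother is to run it on a short synthetic series (the numbers below are made up for illustration) and confirm that the total is preserved:

```python
raw = [1, 0, 5, 2, 9, 3, 14, 4, 20, 6, 25, 8]  # hypothetical spiky "new cases per day"
smooth = epi_smooth_dx(raw)
print(np.round(smooth, 1))
print(sum(raw), smooth.sum())  # totals agree, per the last equality constraint
```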
The code may also be found at this repository.
I ran this smoothing algorithm on datasets for different areas around the world:
| | Raw Data | Moving Average | Linear Program Smoother |
| --- | --- | --- | --- |
| Dallas | | | |
| Harris County | | | |
| Italy | | | |
| Spain | | | |
| United States | | | |
The algorithm does appear to smooth the curves much more than the moving average while still picking out potentially useful trends. This was for my own curiosity, though; I’m not an epidemiologist, so I don’t know whether there is real utility to this beyond satisfying that curiosity.