I live in Dallas, Texas, and I have been following COVID-19 cases as they increase in my local area to understand whether aggressive social distancing measures are flattening the curve. Unfortunately, the data is extremely noisy and very difficult to draw conclusions from. I have found several sources of potential noise in this data:

- Weekend effect: some testing labs close on weekends, resulting in reduced reporting
- Holidays: the same effect as weekends, but more pronounced
- Backlogs: results may not be processed in a timely manner, resulting in delayed reporting
- Large clusters: outbreaks that concentrate testing capacity on a population with higher-than-usual prevalence

These issues don’t impact the accuracy of the *cumulative* number of confirmed cases, but they make it hard to tell whether social distancing measures are flattening the curve. Instead of a nice flat cases-per-day graph, we get a bunch of spikes with no obvious trend. See the graph for Dallas below, for example.

Fixing this while recovering a trend can be hard. What I see a lot of people doing is using a moving average with a window between `3` and `5` days. This doesn’t really help when the data is very noisy: it reduces the peak of the spikes, but the spikes remain. See below for a comparison between the raw Dallas data and a 3-day-window moving average smoothing of that same data.

| Raw Dallas Data | Moving Average Dallas Data |
|---|---|
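For reference, the moving-average smoothing compared above can be sketched as a simple convolution. The series below is hypothetical spiky daily-case data, not the actual Dallas numbers:

```python
import numpy as np

def moving_average(xs, window=3):
    # Centered moving average via convolution; "valid" mode drops the edge
    # days where the window would run off the ends of the series.
    kernel = np.ones(window) / window
    return np.convolve(xs, kernel, mode="valid")

# Hypothetical zig-zag daily-case series (weekend dips, Monday spikes)
cases = np.array([10.0, 80.0, 15.0, 90.0, 20.0, 85.0, 25.0])
smoothed = moving_average(cases)
print(smoothed)  # the peaks shrink, but the zig-zag pattern survives
```

Note how the smoothed series still alternates up and down: averaging shrinks the spikes but does not remove them.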

I started to think about alternative ways to smooth this data and maybe pick out trends. I came up with the following high-level idea:

- Let nearby days “borrow” cases, so Tuesday could “borrow” cases from Monday
- Borrow in a way that makes the “cases per day” graph as flat as possible
- Do not change cumulative cases: the total number of confirmed cases has to remain the same
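A toy two-day example makes the borrowing idea concrete. The numbers here are hypothetical, chosen only to illustrate the invariant:

```python
# Suppose Monday's report is inflated and Tuesday's is depressed
monday, tuesday = 100.0, 20.0
total = monday + tuesday

# Tuesday "borrows" enough cases from Monday to make the pair flat.
# Each smoothed day is a convex combination of the raw days.
smoothed_monday = smoothed_tuesday = 0.5 * monday + 0.5 * tuesday

# The cases-per-day graph is now flat, and the cumulative total is unchanged.
print(smoothed_monday, smoothed_tuesday, smoothed_monday + smoothed_tuesday)
```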

These three conditions can be roughly translated into a linear program.

Supposing that \( c _ i \) represents the number of new cases on day \( i \), and \( d _ i \) is the *smoothed* version of \( c _ i \) that we wish to compute, I formulate the three conditions above as follows:

$$ \begin{aligned} \min \; & \max _ i | d _ i - d _ {i+1} | \\ \text{subject to} \quad & \sum _ {i=0} ^ n c _ i = \sum _ {i=0} ^ n d _ i \\ & d _ i = \alpha _ {i,0} c _ {i-1} + \alpha _ {i,1} c _ {i} + \alpha _ {i,2} c _ {i+1} \qquad (i = 2, \ldots, n-1) \\ & \alpha _ {i,0} + \alpha _ {i,1} + \alpha _ {i,2} = 1 \\ & \alpha _ {i,j} \geq 0 \end{aligned} $$

The min-max objective can be reduced to a linear objective by standard techniques, and the absolute values likewise pose no obstacle to expressing this as a linear program.
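Concretely, the standard epigraph trick introduces an auxiliary variable \( t \) that bounds every absolute difference from above, splitting each \( | d _ i - d _ {i+1} | \leq t \) into two linear inequalities:

$$ \begin{aligned} \min \; & t \\ \text{subject to} \quad & d _ i - d _ {i+1} \leq t \\ & d _ {i+1} - d _ i \leq t \end{aligned} $$

holding for each consecutive pair of days, alongside the original equality constraints. Minimizing \( t \) then minimizes the largest day-to-day jump in the smoothed series.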

SciPy has the function `scipy.optimize.linprog`, which can solve moderately sized linear programs. I used this to solve the linear program posed above.
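As a minimal sketch of the `linprog` interface (a toy problem, not the smoothing LP itself): it minimizes \( c^T x \) subject to `A_ub @ x <= b_ub`, optional equality constraints, and per-variable bounds.

```python
import numpy as np
from scipy import optimize as opt

# Toy LP: minimize x + y subject to x + y >= 1, with 0 <= x, y <= 1.
# linprog expects A_ub @ x <= b_ub, so the >= constraint is negated.
c = np.array([1.0, 1.0])
A_ub = np.array([[-1.0, -1.0]])
b_ub = np.array([-1.0])
res = opt.linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (0, 1)])
print(res.fun)  # optimal objective value: 1.0
```

The smoothing code below builds its constraint matrices in exactly this form, just with many more variables.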

```python
import numpy as np
from scipy import optimize as opt

def epi_smooth_dx(xs):
    n = len(xs)
    # Variable layout: three blocks of borrowing weights (one per interior
    # day), then the n smoothed values y, then a single variable dx that
    # bounds the largest day-to-day change.
    a0vars = range(0, n - 2)
    a1vars = range(n - 2, 2 * (n - 2))
    a2vars = range(2 * (n - 2), 3 * (n - 2))
    yvars = range(3 * (n - 2), 3 * (n - 2) + n)
    dx = 3 * (n - 2) + n
    nvars = len(a0vars) + len(a1vars) + len(a2vars) + len(yvars) + 1
    lb = np.zeros(nvars)
    ub = np.zeros(nvars)
    # Borrowing weights are convex-combination coefficients in [0, 1]
    for i in a0vars:
        lb[i] = 0.0
        ub[i] = 1.0
    for i in a1vars:
        lb[i] = 0.0
        ub[i] = 1.0
    for i in a2vars:
        lb[i] = 0.0
        ub[i] = 1.0
    # Smoothed values stay within the range of the raw data
    for i in yvars:
        lb[i] = min(xs)
        ub[i] = max(xs)
    lb[dx] = 0.0
    ub[dx] = 100000.0
    # Fix the starting and ending values
    lb[yvars[0]] = min(xs)
    ub[yvars[0]] = min(xs)
    lb[yvars[-1]] = min(xs)
    ub[yvars[-1]] = max(xs)
    # Inequalities |y[i+1] - y[i]| <= dx, written as two rows per pair
    Aub = np.zeros((2 * (n - 1), nvars))
    bub = np.zeros(2 * (n - 1))
    for j, i in enumerate(yvars[:-1]):
        Aub[j, i] = -1.0
        Aub[j, i + 1] = 1.0
        Aub[j, dx] = -1.0
        Aub[n - 1 + j, i] = 1.0
        Aub[n - 1 + j, i + 1] = -1.0
        Aub[n - 1 + j, dx] = -1.0
    # Equality rows: weights sum to 1, each y is the borrowed combination
    # of its neighbors, and the cumulative total is conserved
    aeq = range(0, n - 2)
    yeq = range(n - 2, 2 * (n - 2))
    neq = len(aeq) + len(yeq) + 1
    Aeq = np.zeros((neq, nvars))
    beq = np.zeros(neq)
    for k, i in enumerate(aeq):
        a0 = min(a0vars) + k
        a1 = min(a1vars) + k
        a2 = min(a2vars) + k
        Aeq[i, a0] = 1.0
        Aeq[i, a1] = 1.0
        Aeq[i, a2] = 1.0
        beq[i] = 1.0
    for k, i in enumerate(yeq):
        x = k + 1
        y = min(yvars) + k + 1
        a0 = min(a0vars) + k
        a1 = min(a1vars) + k
        a2 = min(a2vars) + k
        Aeq[i, y] = 1.0
        Aeq[i, a0] = -xs[x - 1]
        Aeq[i, a1] = -xs[x]
        Aeq[i, a2] = -xs[x + 1]
        beq[i] = 0.0
    for i in yvars:
        Aeq[neq - 1, i] = 1.0
    beq[neq - 1] = sum(xs)
    # Objective: minimize dx, the largest day-to-day change
    c = np.zeros(nvars)
    c[dx] = 1.0
    result = opt.linprog(c, Aub, bub, Aeq, beq, list(zip(lb, ub)))
    return np.array([result.x[y] for y in yvars])
```

The code may also be found at this repository.

I ran this smoothing algorithm on datasets for different areas in the world:

- Dallas, Texas
- Harris County (includes Houston), Texas
- Italy
- Spain
- United States

| | Raw Data | Moving Average | Linear Program Smoother |
|---|---|---|---|
| Dallas | | | |
| Harris County | | | |
| Italy | | | |
| Spain | | | |
| United States | | | |

The algorithm does appear to smooth the curves much more than the moving average does, while still picking out potentially useful trends. This was for my own curiosity, though, and I’m not an epidemiologist, so I don’t know whether there is real utility to this beyond satisfying that curiosity.