
The Data Daily

Machine Learning Finds Just How Contagious (R-Naught) the Coronavirus Is

Last updated: 03-30-2020


The answer is probably not what you think
The R-Naught of a disease, or its ‘contagiousness’, represents how transmissible the disease is. An R-Naught of 2 means that for every one person with the disease, two more people are infected — so one case becomes two, then four, then eight across successive generations. An R-Naught below 1 means that the epidemic is dying down. Hepatitis C and Ebola have an R-Naught of 2, HIV and SARS 4, and Measles 18, to give a few examples.
The R-Naught (R0) of a disease is usually publicly declared by the World Health Organization after careful and lengthy analysis of various factors such as the infectious period, contact rate, mode of transmission, etc.
In this article, we’ll handwrite a Python program that fits an exponential model to the data to find the R0 of the coronavirus. While this is by no means a substitute for the findings of the WHO and other health agencies, it is a good way to gauge just how contagious the coronavirus is given the current lack of information.
*If you are not interested in the process, feel free to skip down to the findings.
Data
The data is from Kaggle’s Novel Coronavirus Dataset. This dataset is routinely updated and is cleanly separated by country and province/state.
Because global data is spread out over several regions that have different geography, medical care levels, and general development, it would be irresponsible to perform this analysis on global coronavirus cases.
Instead, we’ll use American coronavirus cases.
The Model
Our model will have a very simple equation:
y = a^(x − b)
…where y is the forecasted number of cases and x is the number of days since the first confirmed case. a and b are the only two parameters the model is allowed to change.
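As a minimal sketch, the equation translates directly to Python (the helper name model is my own):
# Illustrative helper: a is the daily growth factor (our R0 estimate),
# b shifts the curve left or right along the day axis.
def model(x, a, b):
    return a ** (x - b)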
a controls how steep the curve will be. A smaller a value represents a less steep curve, and a higher a value represents a steeper curve.
[Figure: curves for several values of a, graphed in Desmos.]
a is also the R0 value. For each day x after the epidemic begins, the number of cases multiplies by a factor of a: for every one person infected on day x, a people will be infected on day x + 1.
This also provides another representation of what different R0 values mean. The red exponential has a base (R0) of 5, so it grows rapidly. The black exponential has a base of 1, meaning each infected person passes the disease to exactly one new person; the case count stays flat, so the epidemic gets neither worse nor better. The green exponential has a base of 1/2, meaning that each day only half as many people are infected as the day before, so the epidemic dies out.
[Figure: exponentials with bases 5, 1, and 1/2, graphed in Desmos.]
The b value controls how far left or right the exponential shifts.
[Figure: the effect of varying b, graphed in Desmos.]
This will provide an additional degree of freedom for the model to shift left and right to further adapt to the data.
Exponential functions usually have one more degree of freedom: a coefficient that multiplies the entire expression. However, as the diagram suggests, much of this effect is already captured and can be learned by the b parameter. On top of that, if this coefficient were a learnable parameter, a would no longer be the same thing as the R-Naught.
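To see why b can absorb such a coefficient, note that vertically scaling an exponential is just a horizontal shift in disguise; in LaTeX notation:

c \cdot a^{x-b} = a^{(x-b) + \log_a c} = a^{x - (b - \log_a c)}

so any multiplier c folds into a new value of b, and a keeps its interpretation as the R-Naught.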
Fitting the Model
To fit the model, we will implement a very simple yet effective descent algorithm. For those unaware, a descent algorithm adjusts parameters in whichever direction of the error space leads toward a minimum. (Strictly speaking, what follows is a fixed-step coordinate search rather than true gradient descent, since it never computes a gradient, but it moves downhill in the same spirit.)
The process is:
1. Initialize a and b to 1 and 30, respectively.
2. Initialize lr1 and lr2 to 0.00005 each. (lr1 and lr2 are the learning rates for a and b, respectively. More on this soon.)
3. Take the current value of b and create two candidates for consideration, b + lr2 and b - lr2. These two will be named b1 and b2.
4. Evaluate the mean absolute error between the model using b1 and the real data.
5. Evaluate the mean absolute error between the model using b2 and the real data.
6. Whichever candidate has the lower mean absolute error becomes the new b.
7. Repeat steps 3 through 6 for a, using lr1. Since a is a more important parameter than b, a goes second (its value is updated last, meaning it has the ‘final say’).
8. Repeat steps 3 through 7 ten thousand times.
Summarizing the steps: this approach nudges a and b by a fixed amount in whichever direction lowers the error. It has no momentum (imagine a ball rolling down the error surface, speeding up as it goes), meaning that as soon as it reaches a local minimum, it stays there. While this would be a problem for neural networks that have hundreds of thousands of parameters, it works fine for only two variables: using advanced optimizers is definitely overkill in this scenario.
Let’s get started implementing the algorithm.
The real coronavirus data from the United States will be stored in y, while x is simply a counter, starting from 0 and the same length as y.
import pandas as pd

data = pd.read_csv('time_series_covid_19_confirmed.csv')  # file name assumed from the Kaggle dataset
y = data[data['Country/Region']=='US'].sum().drop(['Province/State','Country/Region','Lat','Long']).tolist()
x = range(len(y))
First, let’s define a function get_error that takes parameters a and b and returns how far off, on average, an exponential model with those parameters is from the real data.
def get_error(a,b):
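    # A minimal sketch of the body, following the description in the text:
    # mean absolute error between the model a**(x - b) and the real data,
    # with each error recorded in the running history list global_mem.
    preds = [a ** (i - b) for i in x]
    error = sum(abs(p - t) for p, t in zip(preds, y)) / len(y)
    global_mem.append(error)
    return error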
The last lines append the error for this a and b parameter combination to global_mem (a running history we’ll plot shortly) and return it.
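Putting steps 1 through 8 together, here is a minimal sketch of the training loop, assuming get_error as defined above (the candidate names b1, b2, a1, and a2 are my own):

# Steps 1 and 2: starting values and learning rates
a, b = 1, 30
lr1 = lr2 = 0.00005
global_mem = []  # error history, plotted later

for iteration in range(10_000):  # step 8
    # Steps 3-6: try b + lr2 and b - lr2, keep whichever fits better
    b1, b2 = b + lr2, b - lr2
    b = b1 if get_error(a, b1) < get_error(a, b2) else b2
    # Step 7: the same update for a, applied second so it has the 'final say'
    a1, a2 = a + lr1, a - lr1
    a = a1 if get_error(a1, b) < get_error(a2, b) else a2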
That’s it! If you wanted to, you could tack on this progress print inside the loop:
if iteration % 1_000 == 0:
    print("{'A':"+str(a)+", 'B':"+str(b)+"}")
    print('Error:', global_mem[-1], '\n')
This prints out the values for a and b, as well as the error, every one thousand iterations.
Plotting out global_mem:
import matplotlib.pyplot as plt

plt.figure(figsize=(20,6))
plt.plot(global_mem)
plt.show()
The model slowly makes progress and then begins plummeting exponentially until it converges to an error of about 200 at around the 8,000th iteration. It’s incredible that a completely linear update rule (simply choosing whether to go up or down by a fixed amount) can trace such a curved path towards convergence!
Results
The last iteration, iteration 9,999, yields the final fitted parameters.
The fit is excellent: the real-world data almost entirely matches the model!
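As a quick sketch, assuming a and b now hold the fitted values, the model can be plotted against the real data:

plt.figure(figsize=(20,6))
plt.plot(x, y, label='Confirmed US cases')
plt.plot(x, [a ** (i - b) for i in x], label='Fitted exponential model')
plt.legend()
plt.show()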
This puts the R0 value at around 1.4 in the United States. Under this model, for every person infected today, 1.4 people will be infected tomorrow. Compare that number with other R-naught values:
[Image: comparison chart of R-naught values for common diseases. Source. Image free to distribute with credit.]
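One more way to read a growth factor of 1.4: cases multiply by 1.4 each day, so under this model the case count doubles roughly every log(2) / log(1.4) ≈ 2.1 days.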
*This does not mean that COVID-19’s R-Naught is definitively 1.4. I did not analyze the global R0 because it is not a one-size-fits-all statistic; different countries have different geographical properties, populations, healthcare quality, etc. Given the United States’ medical system and population, this model estimates the R-naught for COVID-19 on American soil at 1.4.
If America was able to get through HIV/AIDS, Polio, SARS, Pertussis, Measles, and other diseases with R-naught values much higher than the coronavirus’, rest assured the US will be able to get through COVID-19.
This is by no means a call to ignore health advisories; the coronavirus is dangerous in many ways besides its relatively low R-naught, and it is never worth the risk. The purpose of this article is simply to abate fears that COVID-19 is far more contagious than other diseases.
Better safe than sorry!
Thanks for reading!
The complete code, in a forkable notebook form, can be found here on Kaggle.
If you enjoyed this, feel free to check out some of my other work on the coronavirus.

