The Importance of Statistical Modeling for the COVID-19 Pandemic

Coronavirus illustration

Andreas Voyages


The primary objective of this report is to provide a general overview of the role of statistical modeling throughout the progression of the COVID-19 pandemic. The overview analyzes statistical forecasting models used by academia, government, and medical companies, compares and contrasts their characteristics, identifies which models are most effective and why, and offers ideas for enhancing the efficacy of statistical modeling.



On December 31, 2019, the World Health Organization’s Country Office in the People’s Republic of China picked up a media statement from the Wuhan Municipal Health Commission’s website reporting a cluster of cases of ‘viral pneumonia’ in Wuhan. These were the first cases to be traced to the virus that causes COVID-19, a pathogen pictured to the right, which would soon ravage the world and force countries around the globe to implement some form of lockdown almost overnight, changing life as people knew it.

Since then, this cluster of cases in China has grown to over 25,000,000 cases and approximately 800,000 deaths worldwide, with the numbers still rising rapidly, and the outbreak has been classified as a global pandemic. The epicenter has shifted many times, starting in Asia and then moving to Europe, North America, South America, and more recently back to North America.

Countries around the world wasted no time in developing tests to determine whether people presenting with symptoms of COVID-19 are actually infected. As testing became more widely available, the numbers of people infected with and dying from the virus were recorded in various systems, which made it possible to build statistical models.

Statistical models are representations of numerical data pertaining to specific samples or groups. These models often appear as line graphs and scatterplots, which are used to analyze trends in the data depicted. While statistical models present data in many contexts, those relating to COVID-19 are currently extremely popular because they capture numerical data about the pandemic, including the number of cases and deaths caused by COVID-19. For this reason, people refer to these models to get a sense of the trajectory of the virus.

These models have also been extremely useful in tracing cases worldwide to individual countries, regions, cities, and specific areas within cities, allowing the leaders of these areas to take appropriate action in combating the virus. Additionally, models have zeroed in on a large number of important characteristics among people with COVID-19, including age, race, gender, and pre-existing conditions. This allows scientists to see which groups of people are put most at risk by the pandemic.

This image portrays the predictions for weekly and total COVID-19 cases and deaths within the United States from various sources.


Forecasting teams predict numbers of cases, deaths, and hospitalizations by combining various types of data (e.g., COVID-19 data, demographic data, mobility data), methods, and estimates of the impacts of interventions (e.g., social distancing, use of face coverings). These forecasts are developed and then shared publicly by various sources working independently. Comparing and contrasting these forecasts is key not only to identifying which models have been most effective over the course of the pandemic, but also to combining them, for example by taking the median of their data points to serve as an overall or common forecast, as compiled by the Centers for Disease Control and Prevention (CDC) in the image pictured above.
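The idea of using median data points as a common forecast can be sketched in a few lines of Python. This is only an illustration, not the CDC's actual procedure: the model names and the numbers are invented, and the helper `median_ensemble` is hypothetical.

```python
from statistics import median

# Hypothetical 1- to 4-week-ahead forecasts of weekly deaths from three
# made-up models; the names and numbers are illustrative only.
forecasts = {
    "model_a": [5200, 5000, 4900, 4700],
    "model_b": [6100, 6300, 6400, 6600],
    "model_c": [5600, 5500, 5300, 5400],
}


def median_ensemble(forecasts_by_model):
    """Return the per-week median across all models."""
    # zip(*...) regroups the values so that each tuple holds every
    # model's forecast for the same week.
    per_week = zip(*forecasts_by_model.values())
    return [median(values) for values in per_week]


ensemble = median_ensemble(forecasts)
print(ensemble)
```

A median is less sensitive than a mean to a single model that forecasts far higher or lower than the rest, which is one plausible reason to prefer it when combining independent forecasts.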

Popular Statistical Models

The United States has observed the most variety in statistical modeling pertaining to the COVID-19 virus. These models have counted a total number of over 6,000,000 cases and approximately 180,000 deaths in the United States, the most out of any country globally.

A variety of data is portrayed by 37 reliable statistical models listed on the CDC web page titled ‘Cases, Data & Surveillance’, in addition to numerous other models displayed throughout the internet, including the Worldometer website, displayed to the right.

This image showcases a screenshot of the Worldometer website, which is one of the many models found on the internet responsible for displaying the number of cases and deaths attributed to COVID-19 both on a global scale and according to individual countries.

This CDC webpage groups these models into two categories: those making assumptions about how levels of social distancing will change in the future, and those assuming that current social distancing measures will remain intact throughout the projected four-week time period.

Nine out of the 37 statistical models portrayed on this webpage fall into the category focusing on how levels of social distancing will change in the future. These models include:

  • Columbia University (Model: Columbia)
  • Google and Harvard School of Public Health (Model: Google-HSPH)
  • Georgia Institute of Technology, Center for Health and Humanitarian Systems (Model: GT-CHHS)
  • John Burant (Model: JCB)
  • Johns Hopkins University, Infectious Disease Dynamics Lab (Model: JHU)
  • Notre Dame University (Model: NotreDame-FRED)
  • Predictive Science Inc. (Model: PSI)
  • University of California, Los Angeles (Model: UCLA)
  • Youyang Gu (Model: YYG)

The remaining 28 statistical models listed on this webpage are grouped under the category assuming that existing social distancing measures will continue through the projected four-week time period. These models include:

  • Auquan Data Science (Model: Auquan)
  • Carnegie Mellon University (Model: CMU)
  • Columbia University and University of North Carolina (Model: Columbia-UNC)
  • Covid-19 Simulator Consortium (Model: Covid19Sim)
  • Discrete Dynamical Systems (Model: DDS)
  • Georgia Institute of Technology, College of Computing (Model: GT-DeepCOVID)
  • Iowa State University (Model: ISU)
  • Karlen Working Group (Model: Karlen)
  • LockNQuay (Model: LNQ)
  • Los Alamos National Laboratory (Model: LANL)
  • Massachusetts Institute of Technology, COVID-19 Policy Alliance (Model: MIT-CovAlliance)
  • Massachusetts Institute of Technology, Operations Research Center (Model: MIT-ORC)
  • Northeastern University, Laboratory for the Modeling of Biological and Sociotechnical Systems (Model: MOBS)
  • Notre Dame University (Model: NotreDame-Mobility)
  • Oliver Wyman (Model: Oliver Wyman)
  • Qi-Jun Hong (Model: QJHong)
  • Rensselaer Polytechnic Institute and University of Washington (Model: RPI-UW)
  • Robert Walraven (Model: ESG)
  • Steve Horstman (Model: STH)
  • US Army Engineer Research and Development Center (Model: ERDC)
  • University of Arizona (Model: UA)
  • University of California, Merced (Model: UCM)
  • University of Geneva/Swiss Data Science Center (Model: Geneva)
  • University of Georgia, Center for the Ecology of Infectious Disease (Model: UGA-CEID)
  • University of Massachusetts, Amherst (Models: UMass-MB and Ensemble)
  • University of Michigan (Model: UM)
  • University of Southern California (Model: USC)
  • University of Texas, Austin (Model: UT)

These statistical models are all deemed reliable by the CDC and are therefore credible when referencing data.

Characteristics of an Ideal Statistical Model

Statistical models can be helpful as tools to make informed guesses about the disease, its future spread, and the effects of different actions and interventions. These models are particularly helpful in situations where certain data cannot be collected due to the contagious nature of COVID-19. Statistical models can help address these information gaps by describing the effects of the virus. This includes forecasting the number of cases, deaths, hospitalizations, and recoveries that are possible or likely to occur within a given location over a period of time, as well as assessing the possible effects of interventions and policies enforced by political leaders.

While numerous statistical models report statistics on the COVID-19 pandemic, many of them draw on additional information in order to produce more precise and accurate forecasts.

Not all 37 statistical models listed by the CDC have to be analyzed to understand the statistics of COVID-19, but certain models might be preferred over others due to differences in how they present data. For example, some statistical models present more specific data that applies to individual counties and cities in the United States rather than an entire state. This information is more valuable because data for a designated area is more relevant to that area than data for a more expansive one (e.g., Fairfax County as opposed to the entire state of Virginia).

As clarified by the CDC, a defining feature of a statistical model is whether or not it considers changing levels of social distancing in a given area as a factor in its forecasts for the following weeks. This is key because government regulations on the movement of the public can either slow or fuel the spread of a pandemic such as COVID-19, making it necessary to take all such factors into consideration when predicting the future spread of the virus.
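To illustrate why assumptions about social distancing matter so much to a forecast, the sketch below runs a simple discrete-time SIR (susceptible, infectious, recovered) simulation under three distancing scenarios. It is a teaching example only and does not reproduce any model listed by the CDC; the transmission rate, recovery rate, population size, and reduction levels are all assumed values chosen for illustration.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Discrete-time SIR model; returns the daily number of infectious people."""
    s = float(population - initial_infected)  # susceptible
    i = float(initial_infected)               # infectious
    infectious = [i]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        infectious.append(i)
    return infectious


BASE_BETA = 0.3  # assumed daily transmission rate without distancing
GAMMA = 0.1      # assumed recovery rate (about a 10-day infectious period)

# Each scenario scales transmission down by an assumed fraction.
scenarios = {"no distancing": 0.0, "moderate": 0.3, "strict": 0.6}
peaks = {}
for name, reduction in scenarios.items():
    curve = simulate_sir(1_000_000, 10, BASE_BETA * (1 - reduction), GAMMA, 365)
    peaks[name] = max(curve)
    print(f"{name}: peak infectious is about {peaks[name]:,.0f}")
```

Under these assumptions, the only thing that differs between scenarios is the contact-driven transmission rate, yet the peak number of infectious people changes substantially. This is why a forecast that fixes distancing at current levels and one that lets it change can diverge.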

Basic United States statistical models portray the number of global confirmed cases, global deaths, national confirmed cases, and national deaths, with some of the most complex models such as the JHU model, pictured to the left, containing additional confirmed cases and deaths pertaining to individual states, counties, cities, and even specific areas within those cities.

This image portrays the JHU model, which displays the number of cases and deaths resulting from COVID-19 within different areas, zeroing in from a global scale to a local scale.

In addition to this data, complex statistical models such as the JHU model display color-coded maps indicating where the virus is most contagious, as well as line graphs displaying the information gathered over a period of time and since the beginning of the virus.

How to Enhance Statistical Model Efficacy

With COVID-19 being an unprecedented pandemic infecting tens of millions of people around the globe, epidemiologic statistical models are critical planning tools for policymakers, clinicians, and public health practitioners. Many statistical models displayed throughout the internet are notorious for how widely their forecasts of the future effects of this virus diverge, depending on the factors that are taken into account, for example, the extent of social distancing being enforced in a given population.

Statistical models are at their most useful when they identify something that is not obvious. For example, one valuable function was to flag that temperature screening at airports will miss most coronavirus-infected people.

While statistical models are extremely useful to scientists around the world, there are a lot of factors that they do not capture. For example, they cannot anticipate the creation of an antiviral drug or vaccine, nor account for anguish and agitation among people practicing social distancing. Additionally, economic hardship is not a factor that epidemic models are able to account for.

While most statistical models are mathematically correct in calculating their forecasts, forecasts would be more useful if every model applied the same basic calculation across all scenarios, so that when human-controlled factors such as social distancing change in the future, the model or prediction that best fits the actual scenario can be highlighted and used by the public.

Inga Holmdahl, S.M., and Caroline Buckee, D.Phil., authors of “Wrong but Useful—What COVID-19 Epidemiologic Models Can and Cannot Tell Us”, have composed five questions to further analyze model efficacy, which are listed below.

  1. “What is the purpose and time frame of this model? For example, is it a purely statistical model intended to provide short-term forecasts or a mechanistic model investigating future scenarios? These two types of models have different limitations.”
  2. “What are the basic model assumptions? What is being assumed about immunity and asymptomatic transmission, for example? How are contact parameters included?”
  3. “How is uncertainty being displayed? For statistical models, how are confidence intervals calculated and displayed? Uncertainty should increase as we move into the future. For mechanistic models, what parameters are being varied? Reliable modeling descriptions will usually include a table of parameter ranges — check to see whether those ranges make sense.”
  4. “If the model is fitted to data, which data are used? Models fitted to confirmed Covid-19 cases are unlikely to be reliable. Models fitted to hospitalization or death data may be more reliable, but their reliability will depend on the setting.”
  5. “Is the model general, or does it reflect a particular context? If the latter, is the spatial scale — national, regional, or local — appropriate for the modeling questions being asked and are the assumptions relevant for the setting? Population density will play an important role in determining model appropriateness, for example, and contact-rate parameters are likely to be context-specific.”
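Question 3 notes that uncertainty should increase as a forecast moves further into the future. A minimal way to see this is a random-walk forecast, in which the point forecast stays at the last observed value while the prediction interval widens with the square root of the horizon. This technique is used here purely as an illustration and is not taken from Holmdahl and Buckee or from any CDC-listed model; the function name and the input numbers are hypothetical.

```python
import math


def random_walk_intervals(last_value, step_std, horizons, z=1.96):
    """Approximate 95% prediction intervals for a random-walk forecast.

    The point forecast stays at the last observed value, while the
    standard deviation of the forecast error grows with sqrt(horizon).
    """
    intervals = []
    for h in range(1, horizons + 1):
        half_width = z * step_std * math.sqrt(h)
        intervals.append((h, last_value - half_width, last_value + half_width))
    return intervals


# Hypothetical inputs: 5,000 weekly deaths last week, with week-to-week
# changes that have a standard deviation of 400.
intervals = random_walk_intervals(5000, 400, 4)
for h, low, high in intervals:
    print(f"week +{h}: {low:,.0f} to {high:,.0f}")
```

A forecast whose interval stays the same width, or narrows, at longer horizons would be a reason to look more closely at how its uncertainty was calculated.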

The majority of the questions displayed above hint at the importance of taking multiple factors into consideration, emphasizing that all levels of social distancing in the future should be accounted for.

Building on the factor of social distancing, Question 5 highlights the importance of population density within certain regions, for example rural areas versus suburban or urban areas. In March, the initial North American epicenter of COVID-19 was New York City, the most populous city in the United States; the epicenter eventually moved to the more rural southern part of the United States once policymakers in that area relaxed social distancing restrictions. These factors suggest that human actions and movement drive the spread of this virus and should be accounted for in all statistical models. Such factors are taken into consideration by models such as the University of Texas COVID-19 model, portrayed in the accompanying image, which forecasts a wide cone for daily COVID-19 deaths in Minnesota, depending on the level of human mobility and social distancing practiced among the state’s population.

The University of Texas' COVID-19 model predictions



This image conveys the University of Texas model, which predicts the number of daily COVID-19 deaths under a multitude of human-based scenarios, such as social distancing.



In conclusion, statistical modeling differs from other scientific endeavors in that it makes several assumptions in a highly nonlinear process, increasing the possibility of error. Like all science, statistical modeling is an attempt to approach the truth using real-time observations; but when done with many assumptions (due to lack of data and the need for speedy utilization) these models have a higher chance of erring. Thus, models are constrained by what is known and what is assumed, but if utilized appropriately and with an understanding of these limitations, they can and should help guide everyone through this pandemic.


Centers for Disease Control and Prevention. 2020. United States Forecast For New Weekly COVID-19 Deaths And Total COVID-19 Deaths Through September 15. Image.

“Coronavirus Disease 2019 (COVID-19)”. 2020. Centers For Disease Control And Prevention.

“COVID-19 Update For Aug. 3, 2020: Global, National And State Perspective – Lynnwood Today”. 2020. Lynnwood Today.

Eckert, A. 2020. This CDC Illustration Reveals Ultrastructural Morphology Exhibited By Coronaviruses. Image.

Holmdahl, I., and C. Buckee. 2020. “Wrong But Useful — What Covid-19 Epidemiologic Models Can And Cannot Tell Us”. New England Journal Of Medicine.

“Listings Of WHO’s Response To COVID-19”. 2020. Who.Int.

Michaud, J., J. Kates, and L. Levitt. 2020. “COVID-19 Models: Can They Tell Us What We Want To Know?”. KFF.

Montgomery, D.H. 2020. Texas Model: Predicted COVID-19 Deaths. Image.

Worldometer. 2020. Screenshot Of Worldometer Website. Image.


About the Author

Andreas Voyages is a senior at Langley High School. He is a member of the National Society of High School Scholars (NSHSS) and has conducted research on the Zika virus in the past. He has also served as a volunteer at two hospitals in prior years and shadowed doctors for a significant period of time.
