Three key concepts for interpreting data in the age of coronavirus

Data is key to understanding coronavirus, but also wildly confusing.
Image: Reuters/Philippe Wojazer

Coronavirus has caused a data deluge. Everywhere we look, statistics abound, among them counts of confirmed cases, number of jobs lost, the declining price of oil, and the vast sums governments are spending to preserve their economies.

I have dedicated my life to the analysis and communication of data, and still I find myself constantly confused. Are the numbers we have on the virus’s trajectory meaningful if so many people have asymptomatic cases? Can we possibly measure the impact of coronavirus on the job market if fewer people are answering surveys at the moment? Can I actually get oil for free now?

Of course, it’s not just because of coronavirus that interpreting data is challenging. Numbers reported by policymakers, academics and the media have always been estimates of what we really want to know—estimates that invariably need caveats. But understanding the strengths and weaknesses of data feels particularly important right now. More than ever, how we interpret statistics could cost people’s lives or livelihoods.

So how should you approach data in this crucial time? Whenever you come across a statistic, I suggest you examine it through the prism of three concepts: bias, variability, and lag.


Bias

Bias is the most important data concept to think about right now. Statistical bias is the idea that a statistic may be an overestimate or an underestimate because one group is disproportionately likely to be included in a study.

The key example of bias right now concerns the share of people who have had coronavirus. On April 22, there were about 2.5 million confirmed cases of coronavirus worldwide, according to the New York Times, meaning about 33 out of every 100,000 people in the world have had a confirmed case.

What most people really want to know is the actual share of people who have had the virus, but the number of confirmed cases is almost certainly biased to be an underestimate. With a lack of testing in many countries, and many asymptomatic cases, we know there are probably many more cases than have been reported, perhaps millions more.

That doesn’t mean confirmed-case statistics are totally useless. It just means they should be used with caution. If the number of cases appears to be falling, might it be because there was less testing? Or because the kinds of people getting tested were less likely to have the virus? A study or story built on case statistics is more trustworthy if it discusses why the numbers might be biased up or down.

There are also more subtle types of bias in coronavirus statistics. For example, there are a number of US surveys that test a random group of people for antibodies to SARS-CoV-2. These surveys attempt to assess the share of people in a place who have had coronavirus. A form of bias in these studies, as in many other surveys, comes from who chooses to participate.

One study of Santa Clara county, California residents recruited participants from Facebook ads. These survey estimates might be biased because the type of people who respond to such ads could be particularly likely or unlikely to have been infected. Researchers can try to use statistical methodology to account for that bias, but it’s not foolproof. (Note that the survey results may also be large overestimates because of “false positives”—an issue that is particularly vexing in places with few actual cases.)
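To see how participation bias can distort an estimate, here is a minimal simulation. All the rates in it are hypothetical assumptions for illustration, not figures from the Santa Clara study:

```python
import random

random.seed(42)

TRUE_RATE = 0.03   # hypothetical true infection rate
POPULATION = 100_000

# Simulate who in the population has been infected.
infected = [random.random() < TRUE_RATE for _ in range(POPULATION)]

def responds(is_infected, base_rate=0.01, bias_factor=3.0):
    """Assume infected people are 3x as likely to answer the ad."""
    rate = base_rate * (bias_factor if is_infected else 1.0)
    return random.random() < rate

# The survey only ever sees the people who chose to respond.
sample = [status for status in infected if responds(status)]
estimate = sum(sample) / len(sample)
print(f"true rate: {TRUE_RATE:.1%}, survey estimate: {estimate:.1%}")
```

The survey estimate lands at nearly triple the true rate, and collecting more responses would not fix it: a larger sample shrinks random error, not systematic bias. That is why researchers attempt statistical corrections for who responded.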

Almost no estimate can avoid some type of bias. As a data consumer, your goal should be to think about the effect that bias may have.


Variability

Another major issue with data is sampling variability. Many statistics are calculated from a randomly selected share of the population. To calculate unemployment figures, governments don’t ask every single person whether they are employed; they extrapolate from a small subset. For example, to calculate its unemployment rate, the UK surveys just over 80,000 people every month, less than 1.5% of the population. Still, 80,000 is enough to get a very accurate estimate.

It’s not always the case that a survey’s sample size is so large, and the estimate so precise.

For example, many polling organizations are conducting surveys on how Americans view Donald Trump’s handling of coronavirus. A poll run from April 19-20 by The Hill and HarrisX found that 51% of registered voters approved of his actions. This is among the higher approval ratings Trump has received, and a person reading about this poll might think the majority of Americans approve of his response. But this poll is based on just 958 people. Sampling variability tells us that if a different 958 people were surveyed, that number could have easily been 47% or 55%.
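A rough margin of error for a poll that size can be computed with the textbook formula for a proportion from a simple random sample. Real polls use weighting and only approximate random sampling, so treat this as a back-of-the-envelope sketch:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

p, n = 0.51, 958   # The Hill/HarrisX poll figures
moe = margin_of_error(p, n)
low, high = p - moe, p + moe
print(f"51% plus or minus {moe:.1%}, roughly {low:.0%} to {high:.0%}")
```

For n = 958, the 95% margin comes out to about ±3.2 percentage points, which is why a different sample of the same size could plausibly have produced a number several points higher or lower.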

When looking at poll results, it’s typically best to look at multiple polls rather than one. The website FiveThirtyEight averages poll results on Trump’s handling of coronavirus. It’s a much better source for tracking those numbers than any one poll. (As of April 23, they found a 46% approval rating of Trump’s coronavirus response for all Americans.)

The sampling-variability issue is exacerbated by the fact that polls or studies with extreme results are more likely to get reported on (in academia this is referred to as “publication bias”). If Trump gets an unusually low approval-rating number in one poll, or a SARS-CoV-2 antibody study finds a surprisingly large share of positives, they attract media looking for interesting new numbers.

The lesson is that for any statistic, it’s important to check the size of the sample it’s based on, and whether multiple people have tried to measure it. For any result far outside what others have found, it’s good to be suspicious.


Lag

Lag is perhaps the simplest of the three concepts presented here, but it is particularly important during coronavirus. Data is sometimes hard to collect and slow to be released, meaning it can be weeks or months before we know the answer to an essential question.

One key metric that tends to lag is the number of coronavirus-caused deaths. Although many countries and cities try to produce daily death counts, those numbers are typically undercounts. In New York City, an epicenter for the virus, overburdened hospitals may take days or weeks to report deaths to the city’s health department. On April 1, the city had reported just 1,374 deaths from Covid-19 occurring on or before that date. But by April 9, updated data showed that 2,253 people had died by then. This lag in deaths data can lead people to underestimate the deadliness of the virus.
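The scale of that revision is easy to quantify from the two NYC figures above:

```python
# NYC deaths attributed to on-or-before April 1, at two reporting dates.
reported_april_1 = 1374   # count available on April 1
revised_april_9 = 2253    # same period, after late reports arrived

undercount = 1 - reported_april_1 / revised_april_9
print(f"the initial count missed about {undercount:.0%} of the deaths")
```

Nearly four in ten deaths from that period were not yet in the data when the count was first published, which is the kind of gap lag can open between the first number you see and the eventual one.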

Lag is also an issue when assessing the state of the economy. In the US, for instance, job-market data collected by the Bureau of Labor Statistics is released on a three-week lag. The most recent data on unemployment and jobs added to the economy was released on April 3. This data was collected from March 8-14, before most states ordered citizens to shelter at home.

This lag on unemployment data typically isn’t a big problem. Most economic downturns don’t unfold in just a few weeks, and the wait for data doesn’t impact policy. In this case, though, job losses were precipitous.

To get a sense of the scale of the problem in the job market, people have been turning to unemployment insurance claims, which only have a five-day lag. Researchers have also looked to privately run surveys, which have more up-to-date data, but lack the same methodological rigor and history as the government data. Other economic statistics that have long lags include GDP growth and international trade data. It’s very hard to estimate the effect of the virus on these areas.

The answer to dealing with lag as a data consumer is pretty simple. Make sure to check the date of the estimate, and whether it is likely to be updated.

Please, don’t ignore data

A large part of my life involves telling people about the issues with a data point. Often, after I go into detail about those problems, people will tell me that the data point is “total garbage” or “useless.” But very rarely is that the case. Even with bias, variability, and lag issues, a statistic can be very meaningful.

The reported number of daily confirmed coronavirus cases has many issues, but it’s almost certainly better than nothing. If we understand that it is likely an underestimate that is reported on a lag, that makes it even more valuable. The age of coronavirus is not the time to ignore statistics, but the time to examine them even more closely.