Statistics came well before computers. It would be very different if it were the other way around.
The stats most people learn in high school or college come from the time when computations were done with pen and paper. “Statistics were constrained by the computational technology available at the time,” says Stanford statistics professor Robert Tibshirani. “People use certain methods because that is how it all started and that’s what they are used to. It’s hard to change it.”
People who have taken intro statistics courses might recognize terms like “normal distribution,” “t-distribution,” and “least squares regression.” We learn about them, in large part, because these were convenient things to calculate with the tools available in the early 20th century. We shouldn’t be learning this stuff anymore—or, at least, it shouldn’t be the first thing we learn. There are better options.
As a former data scientist, I get asked no question more often than “What is the best way to learn statistics?” I always give the same answer: Read An Introduction to Statistical Learning. Then, if you finish that and want more, read The Elements of Statistical Learning. These two books, written by statistics professors at Stanford University, the University of Washington, and the University of Southern California, are the most intuitive and relevant books I’ve found on how to do statistics with modern technology. Tibshirani is a coauthor of both. You can download them for free.
The books are based on the concept of “statistical learning,” a mashup of stats and machine learning. The field of machine learning is all about feeding huge amounts of data into algorithms to make accurate predictions. Statistics is concerned with predictions as well, says Tibshirani, but also with determining how confident we can be about the importance of certain inputs.
This is important in areas like medicine, where a researcher doesn’t just want to know whether a medicine worked, but also why it worked. Statistical learning is meant to take the best ideas from machine learning and computer science, and explain how they can be used and interpreted through a statistician’s lens.
The beauty of these books is that they make seemingly impenetrable concepts—“cross-validation,” “logistic regression,” “support vector machines”—easily understandable. This is because the authors focus on intuition rather than mathematics. Unlike many statisticians, Tibshirani and his coauthors don’t come from a math background. He believes this helps them think conceptually. “We try to explain [concepts] intuitively by explaining the underlying idea first,” he says. “Then we give examples of a situation you would expect it [to] work. And also, a situation where it might not work. I think people really appreciate that.” I certainly did.
For example, a section of An Introduction to Statistical Learning is dedicated to explaining the use of “bootstrapping”—a statistical technique only available in the age of computers. Bootstrapping is a way to assess the accuracy of an estimate by generating multiple datasets from the same data.
Let’s say you collected the weights of 1,000 randomly selected adult women in the US, and found that the average was 130 pounds. How confident can you be in this number? In conventional statistics, to answer this question you would use a formula developed more than a century ago, which relies on many assumptions. Today, rather than make those assumptions, you can use a computer to draw thousands of new samples from your original 1,000, sampling with replacement so that each new sample is the same size as the original (this is the bootstrapping), and see how many of the resulting averages are close to 130. If most of them are, you can be more confident in the estimate.
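The resampling idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the books' own code (which is written in R). The weights are simulated rather than real survey data, the function and variable names are hypothetical, and the sketch uses the standard form of the bootstrap, in which each resample is drawn with replacement and has the same size as the original sample.

```python
import math
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n values *with replacement* from the original data,
    # so some people appear several times and others not at all.
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, level=0.95):
    """Return the central `level` share of `values` as a (low, high) pair."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    low = ordered[int(tail * len(ordered))]
    high = ordered[int((1 - tail) * len(ordered)) - 1]
    return low, high


# Assumption: simulated weights in pounds, standing in for the 1,000 survey
# responses. The mean of 130 and spread of 25 are made up for illustration.
data_rng = random.Random(42)
weights = [data_rng.gauss(130, 25) for _ in range(1000)]

sample_mean = statistics.fmean(weights)
means = bootstrap_means(weights)

# Spread of the resampled averages: the bootstrap estimate of the standard error.
bootstrap_se = statistics.stdev(means)
# The conventional textbook formula, shown only for comparison.
formula_se = statistics.stdev(weights) / math.sqrt(len(weights))
low, high = percentile_interval(means)

print(f"sample mean:            {sample_mean:.1f}")
print(f"bootstrap std. error:   {bootstrap_se:.2f}")
print(f"formula std. error:     {formula_se:.2f}")
print(f"95% bootstrap interval: {low:.1f} to {high:.1f}")
```

If the resampled averages cluster tightly around the original average, the interval is narrow and the estimate is more trustworthy. For a simple average like this one, the bootstrap and the century-old formula should roughly agree. The bootstrap earns its keep on statistics that have no convenient formula.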
Theory and application
These books, mercifully, don’t require high-level math, like multivariate calculus or linear algebra. (If you’re into that sort of thing, there is a wealth of worthy but dry academic literature out there for you.) “While knowledge of those topics is very valuable, we believe that they are not required in order to develop a solid conceptual understanding of how statistical learning methods work, and how they should be applied,” says Daniela Witten, a coauthor of An Introduction to Statistical Learning.
Helpfully, the books also provide code you can use to apply the tools with the statistical programming language R. I recommend putting their examples to work on a dataset you are excited about. If you are into novels, use the code to analyze Goodreads ratings. If you like basketball, apply their examples to numbers at Basketball Reference. The statistical learning tools are wonderful in themselves, but I’ve found they work best for people who are motivated by a personal or professional project.
Data and statistics are an increasingly important part of modern life, and nearly everyone would be better off with a deeper understanding of the tools that help explain our world. Even if you don’t want to become a data analyst—which happens to be one of the fastest-growing jobs out there, just so you know—these books are invaluable guides to help explain what’s going on.