How big data tells us when the “United States” literally became one

“The United States is increasingly divided” or “The United States are increasingly divided”? To most, the former seems more grammatically correct than the latter. Referring to the United States as a singular noun is now common usage—the plural sounds strange to modern ears.

It wasn’t always this way. In his recently published book Everybody Lies, data scientist and New York Times opinion writer Seth Stephens-Davidowitz explains that “the United States are” was standard usage through the first half of the 19th century. Many historians speculated that the shift took place after the Civil War, which established that the nation was truly indivisible. But it was impossible to answer the question with much accuracy, because comprehensive data on language use going back this far wasn’t readily available.

The mystery was finally solved, Davidowitz writes, by biologists Erez Aiden and Jean-Baptiste Michel. They used Google Ngrams, Google’s service that allows users to search for the prevalence of words and phrases in books going back to 1800, to show that the United States didn’t become singular in literature until the 1880s. “Military victories happen quicker than changes in mindsets,” Davidowitz writes.
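The kind of comparison described above, finding when one phrasing overtakes another for good, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `first_lasting_crossover` is hypothetical, and the frequency numbers are invented placeholders, not actual Ngram data or the researchers' method.

```python
def first_lasting_crossover(years, singular, plural):
    """Return the first year from which `singular` stays above `plural`.

    Returns None if the singular form is not ahead at the end of the series.
    """
    crossover = None
    for year, s, p in zip(years, singular, plural):
        if s > p:
            # Remember only the start of the current run of singular leads.
            if crossover is None:
                crossover = year
        else:
            # The plural form is level or ahead again, so reset.
            crossover = None
    return crossover


# Invented placeholder frequencies (arbitrary units), NOT real Ngram data.
years = [1850, 1860, 1870, 1880, 1890, 1900]
singular = [1, 2, 3, 6, 8, 9]   # hypothetical counts of "the United States is"
plural = [9, 8, 7, 5, 4, 3]     # hypothetical counts of "the United States are"

crossover_year = first_lasting_crossover(years, singular, plural)
print(crossover_year)
```

Requiring the lead to last, rather than taking the first year the singular form is ahead, keeps a single noisy year from being mistaken for the shift.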

Davidowitz’s new book is a celebration of the power of data. Though the book is called Everybody Lies, it focuses less on people’s prevarications than on the ways in which access to new forms of data helps us understand the truth.

In a conversational tone reminiscent of Stephen Dubner and Steven Levitt’s Freakonomics, Davidowitz walks us through how new forms of data (Google searches, Facebook likes, satellite imagery, and digitized text), combined with creative analysis methods, now allow us to solve mysteries that once seemed impossible. The book combines Davidowitz’s original analysis with the work of others he admires.

These are three of our other favorite examples of data sleuthing highlighted in Everybody Lies.

1️⃣ Using a combination of Google search data and Facebook information, Davidowitz estimates that about 5% of men in the US are gay. This contrasts with the 2% to 3% of men who self-identify as gay in representative surveys.

His method relies on Facebook self-identification data and Google searches for porn. Davidowitz finds that although twice as many men identify as gay in Rhode Island, a relatively liberal state, as in Mississippi, the percentage of searches for gay-male porn is only slightly higher in Rhode Island (5.2%) than in Mississippi (4.8%). The slightly higher rate of searches in Rhode Island is likely a result of gay men choosing to move to a place where their sexuality is more publicly accepted.

Davidowitz also examined the relationship between the acceptance of gay marriage in a state and the percentage of men self-identifying as gay. He estimates that if acceptance of gay marriage were at 100%, self-identification as a gay man would reach 5%.

Given the consistency of the search data with this statistical relationship, Davidowitz is confident his 5% estimate is more accurate than the surveys. The larger point of this analysis, he says, is to show that people’s online behavior reveals more about them than their statements.
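The extrapolation described above, from state-level acceptance of gay marriage to a projected self-identification rate at 100% acceptance, can be sketched as a straight-line fit. This is a minimal illustration, assuming a simple linear relationship; the article does not say which model Davidowitz used. The helper names `fit_line` and `extrapolate` are hypothetical, and the data points are invented placeholders chosen to lie on a line, not figures from the book.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(xs, ys, x_new):
    """Fit a line to (xs, ys) and evaluate it at x_new."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_new + intercept


# Invented placeholder values, NOT data from the book:
# percent of a state's residents accepting gay marriage ...
acceptance_pct = [30.0, 40.0, 50.0, 60.0]
# ... and percent of men in that state self-identifying as gay.
self_identified_pct = [1.5, 2.0, 2.5, 3.0]

predicted = extrapolate(acceptance_pct, self_identified_pct, 100.0)
print(f"Projected self-identification at 100% acceptance: {predicted:.1f}%")
```

Projecting to 100% acceptance means reading the fitted line well outside the range of the observed points, which is one reason an independent check such as the search data matters.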

2️⃣ Davidowitz relays the story of how the racehorse analyst Jeff Seder used modern data collection methods and statistical correlation to predict which horses would be great, most notably 2015 Triple Crown winner American Pharoah.

Historically, horses were judged by their pedigree. If a horse’s parents or siblings were great performers, it was thought that the horse also had a shot at success on the racetrack, and it would sell for a lot of money at auction. Agents might also examine a young horse’s size or gait.

But Seder, who has an MBA and law degree from Harvard University, found that none of these factors were all that predictive of greatness. Armed with a variety of tools, including a portable ultrasound machine, he went on the hunt for more telling factors.

After years of collecting data on the attributes of young horses and comparing them to their earnings on the track, Seder found one unusual physical attribute that was highly predictive: the size of the horse’s left ventricle. American Pharoah had an enormous left ventricle. Along with the fact that all of his other attributes were in the range of what’s expected of a good thoroughbred, this made American Pharoah a good bet to be a great horse.

[Chart: Attributes of American Pharoah as a 1-year-old]

Davidowitz uses this story to make the point that new data sources are most powerful in domains where deep data analysis has been rare. There is not as much to learn from new data in analytics-saturated fields like finance or baseball, but plenty of fields, like horse racing, still rely heavily on tradition and gut feel.

3️⃣ Another of Davidowitz’s analyses shows that privilege is everywhere, even where you might least expect it, like the National Basketball Association (NBA).

He analyzed a variety of datasets—including the US Census, the stats site Basketball Reference, and Ancestry.com—to show that professional basketball players, the vast majority of whom are black Americans, are more likely to come from middle-income families than poor ones. This defies the narrative that the NBA is made up of black men who grew up in poverty and saw basketball as the only way out. Family financial resources make success in nearly any field more likely.

Davidowitz builds his case in three parts. First, he uses census data to show that young men from wealthier counties are more likely to make it to the NBA than those from poor ones. Second, he collects data on the family background of the 100 players born in the 1980s who have scored the most points; these players were 30% less likely than the average black American of a similar age to have been born to a single or teenage mother. Third, he finds that the most common names of NBA players, like Kevin and Chris, are names typically given by parents with higher incomes.

Davidowitz points out that this analysis did not rely on any “big” data, per se. It is not just massive, comprehensive datasets that allow data scientists to answer heretofore unanswerable questions, but also the clever use of the many smaller collections of information now available.

In the latter portion of the book, Davidowitz warns of the risks inherent in a world overflowing with data—privacy issues, spurious correlations, and an overemphasis on what is measurable. But mostly, he is excited by all the things we will learn. For the empirically minded problem-solver, there has never been a better time to be alive.
