William Cukierski, a PhD candidate in biomechanical engineering at Rutgers University in New Jersey, is describing what sounds like a garden-variety online-gaming compulsion. “It’s pretty addicting,” he says. “I’m up until 2 a.m., clacking away on my keyboard while the rest of the world is sleeping.”
But this game he’s so keyed up about is what most people would call work. Kaggle, the website that is consuming all his time, hosts dozens of “big data” contests that pit thousands of data scientists against each other, crunching numbers in real time on behalf of various companies. These businesses are glutted with data, everything from users’ musical preferences to US census data. They want skilled number crunchers who can act as the 21st-century equivalent of oracles, using past data to scry the future—like predicting which product a buyer will want based solely on its features. Using Kaggle to round up these oracles turns out to be a pretty good deal for the companies: So far, Cukierski has put in thousands of hours on these projects but has won only $250 for his efforts.
Increasingly, though, Kaggle competitions that don’t pay are the exception rather than the rule. In one ongoing contest, the Heritage Provider Network (HPN), a physicians’ group in California, is offering a $3 million pot to the team that comes up with the algorithm that best predicts which patients will be hospitalized in the next year. But with 1431 teams competing, the odds of winning the prize are minuscule. Which is to say, most Kaggle contestants aren’t in the competition for the money any more than Cukierski is. They’re in it for the sport.
A typical Kaggle contest works something like this. A company gives Kaggle a huge pool of historical data—in the case of the Heritage prize, that means patient health records covering pretty much everything Heritage knows about a person except his name and address—but withholds the most recent slice of data. To the physicists, mathematicians, programmers, philosophers, and moonlighting Wall Street quants who compete on Kaggle, these data could be about anything. It doesn’t matter whether the data come from a NASA data set, or from the Chinese equivalent of Twitter, or from a bond-trading firm. Whatever their source or subject, to a person well-versed in the art of data science, the data are just a big pile of potential correlations waiting to be untangled via algorithms.
Contestants then write software that parses the data into predictive “signals.” These signals are blended into a single algorithm that attempts to predict the likelihood of a future event: in the Heritage case, the chances of hospitalization. (One of the most important signals that a patient will be hospitalized is that she is pregnant.) Kaggle tests each contestant’s algorithm on recent hospitalization data, calculates its predictive power, and publishes the results on the site’s leaderboard. Once each contestant sees where he stands, he can refine his algorithm further and resubmit it. By the time the Heritage contest closes, next year, most contestants will have submitted and resubmitted algorithms dozens of times.
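For readers who think in code, the contest loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kaggle's actual scoring system: the patient outcomes, the two signals, the team names, and the choice of mean squared error as the measure of predictive power are all made up for the example.

```python
# Illustrative sketch of a contest round: the host keeps the most recent
# outcomes hidden, contestants submit predictions, and a leaderboard ranks them.
# All names, data, and the scoring metric are assumptions, not Kaggle's own.

def brier_score(predicted, actual):
    """Mean squared error between predicted probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def blend(signals, weights):
    """Combine several per-patient signal scores into one probability via a weighted average."""
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, signals)) / total
        for i in range(len(signals[0]))
    ]

def leaderboard(submissions, hidden_outcomes):
    """Score every submission against the withheld outcomes and rank best-first."""
    scored = [
        (name, brier_score(preds, hidden_outcomes))
        for name, preds in submissions.items()
    ]
    return sorted(scored, key=lambda row: row[1])

# Withheld slice of data: 1 means the patient was hospitalized (made-up values).
hidden = [1, 0, 0, 1]

# Two hypothetical signals, each a probability-like score per patient.
signal_a = [0.9, 0.2, 0.1, 0.6]
signal_b = [0.7, 0.4, 0.3, 0.8]

submissions = {
    "team_blend": blend([signal_a, signal_b], weights=[1, 1]),
    "team_guess": [0.5, 0.5, 0.5, 0.5],  # baseline that always predicts 50 percent
}

board = leaderboard(submissions, hidden)
for rank, (name, score) in enumerate(board, start=1):
    print(rank, name, round(score, 4))
```

A contestant who sees his rank can adjust the signals or their weights and submit again, which is the refine-and-resubmit cycle in miniature.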
Kaggle was inspired by the Netflix Prize, which in 2009 awarded $1 million to BellKor’s Pragmatic Chaos, an international team of researchers, for improving the website’s movie recommendation engine. Kaggle co-founder Anthony Goldbloom was then a finance journalist, and after he had reported on the prize, it occurred to him that such contests could help solve other companies’ data overloads. Kaggle isn’t bringing these techniques to companies for the first time, of course: Businesses have been using machine learning and related statistical tools to solve their “big data” problems for years, even decades. The automatic fraud detection algorithms employed by credit card companies starting in the 1980s are the classic example. Instead, Kaggle’s innovation has been in constructing an X Prize-style contest that allows companies to get more and better work done on their problem than they could otherwise afford. The X Prize Foundation runs high-stakes contests that encourage inventors to “bring about radical breakthroughs for the benefit of humanity.” In the same way, Kaggle’s founders figure that the thrill of competition, as much as the lure of a cash prize, will motivate its contestants.
So far, Kaggle’s assembled data scientists have successfully predicted the outcome of the World Cup, the progression of HIV in patients, the likelihood that US auto insurance company Allstate will have to pay for bodily injury in the event of a crash, when shoppers will visit a grocery store and how much they will spend, and the location of dark matter in the universe. The contests that could have the biggest commercial impact are invite-only, and the company does not disclose who sponsors them or what questions they’re intended to answer.
Goldbloom says his ultimate goal is for data scientists, whose nascent field has yet to crystallize into a university course or even a textbook, to make Wall Street-level money by competing at a mental sport. “We’re building one of the first meritocratic labor markets that really matters,” he says. However immodest that claim may be, employers have begun taking notice of Kaggle rankings. Cukierski, the Rutgers doctoral student, says he recently received a mysterious solicitation from a Wall Street proprietary trading firm that had come across him on the site; other regular competitors say that certain companies have begun requesting Kaggle rankings as part of their hiring process.
But if this continues, if Goldbloom achieves his goal and all this becomes work rather than a guilty distraction from work, will Kaggle retain its addictive appeal? Jonathan Gluck, who works for the Heritage Provider Network and is helping to coordinate the Heritage contest, says his company plans to offer more contests in the future. As he talks about Kaggle’s approach, he keeps coming back to the X Prize: HPN’s founder, Richard Merkin, sits on the X Prize Foundation’s board, and the prize was the direct inspiration for the Heritage Health Prize. “You can almost think about it as fantasy football for data mining,” he says.