If you want to upgrade your data analysis skills, which programming language should you learn?

Crunching numbers.
Image: Reuters/Gabrielle Lurie

For a growing number of people, data analysis is a central part of their job. Increased data availability, more powerful computing, and an emphasis on analytics-driven decisions in business have made this a heyday for data science. According to a report from IBM, in 2015 there were 2.35 million openings for data analytics jobs in the US. It estimates that number will rise to 2.72 million by 2020.

A significant share of people who crunch numbers for a living use Microsoft Excel or other spreadsheet programs like Google Sheets. Others use proprietary statistical software like SAS, Stata, or SPSS that they often first learned in school.

While Excel and SAS are powerful tools, they have serious limitations. Excel cannot handle datasets above a certain size, and does not easily allow for reproducing previously conducted analyses on new datasets. The main weakness of programs like SAS is that they were developed for very specific uses, and do not have a large community of contributors constantly adding new tools.
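To make the reproducibility point concrete, here is a minimal sketch in Python of what rerunning an analysis on a new dataset can look like. The function name, column names, and sales figures are all made up for illustration; the idea is only that an analysis written once as code can be pointed at any dataset with the same layout.

```python
import csv
import io
from statistics import mean

def average_by_group(csv_file, group_col, value_col):
    """Return the mean of value_col for each distinct value of group_col."""
    groups = {}
    for row in csv.DictReader(csv_file):
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return {group: mean(values) for group, values in groups.items()}

# Hypothetical datasets, standing in for two CSV files with the same columns.
sales_2015 = io.StringIO("region,revenue\nEast,100\nWest,80\nEast,120\n")
sales_2016 = io.StringIO("region,revenue\nEast,90\nWest,110\n")

# The same analysis runs unchanged on each dataset.
for year, data in [("2015", sales_2015), ("2016", sales_2016)]:
    print(year, average_by_group(data, "region", "revenue"))
```

In a spreadsheet, the equivalent steps would typically have to be repeated by hand for each new file.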

For those who have reached the frontiers of these programs, there is a next step: learn R or Python. R and Python are the two most popular programming languages used by data analysts and data scientists. Both are free and open source, and were developed in the early 1990s: R for statistical analysis and Python as a general-purpose programming language. For anyone interested in machine learning, working with large datasets, or creating complex data visualizations, they are godsends.

But which of these languages is better to learn? As a former data analyst, I have been asked that question more than any other in my professional life. Though you could just try to learn both R and Python, each requires a significant time investment, particularly if you have never coded before.

Personally, I’m biased. Learning R radically changed my life for the better (I’m not exaggerating), but I know only a smidgeon of Python. Luckily, Quartz’s former data editor, Chris Groskopf, is a user of both languages. When I asked him which program he would recommend, he offered the nuanced answer all those non-coders asking me were seeking. (For a more technical discussion of the debate and others’ opinions on the matter, see here.)

In a nutshell, he says, Python is better for data manipulation and repeated tasks, while R is good for ad hoc analysis and exploring datasets.

He went with Python when working on elections coverage, since it was a relatively routine, predictable process. From pulling the data, to running automated analyses over and over, to producing visualizations like maps and charts from the results, Python was the better choice. “If I had done the analysis in R, then I would have had to switch to a different tool to create the website and automate the process, but Python also works well for those things,” he says.

R, by contrast, is good for statistics-heavy projects and one-time dives into a dataset. Take text analysis, where you want to deconstruct paragraphs into words or phrases and then identify patterns. “I often don’t know where I’ll end up when I start a process like that, and R makes it easy to try a lot of different ideas quickly,” Groskopf says. “In Python, I would inevitably end up writing a bunch of generic code to solve this pretty narrow problem.”
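For readers curious what that kind of text analysis involves, here is a rough sketch in Python using only the standard library. It is not Groskopf's code; the helper names and the sample text are invented for illustration. It breaks a passage into words, counts them, and then counts two-word phrases.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def top_phrases(text, n=2, k=3):
    """Return the k most common n-word phrases in the text.

    Note: this simple version ignores punctuation, so phrases can span
    sentence boundaries.
    """
    words = tokenize(text)
    phrases = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(p) for p in phrases).most_common(k)

# Made-up sample text.
sample = "The data set is big. The data set is messy. The analysis is new."

word_counts = Counter(tokenize(sample))
print(word_counts.most_common(3))
print(top_phrases(sample))
```

Even this small example shows the general-purpose scaffolding Groskopf describes: the tokenizing and phrase-building helpers have to be written before any actual exploring begins.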

Which is easier to learn? R has a steep learning curve, and people without programming experience may find it overwhelming. Python is generally considered easier to pick up.

Another advantage of Python is that it is a more general programming language: For those interested in doing more than statistics, this comes in handy for building a website or making sense of command-line tools. The way Python works reflects the way computer programmers think. R, on the other hand, reflects its origins in statistics. Many programmers find the design of R irritating, because it’s so different to what they’re used to, Groskopf says. For someone interested in becoming a general-purpose programmer, Python is a better choice. 

But for data analysis, the differences between R and Python are starting to break down, he says. Most of the common tasks once associated with one program or the other are now doable in both. They are similar enough, in fact, that if most of your colleagues are already using R or Python, you should probably just pick up that language.

So the great R-versus-Python debate is settled. If all you’re doing is data analysis, it doesn’t really matter which one you use.