Just like spoken languages, programming languages evolve. Popular languages like C++, Java, and Python have all changed dramatically over time. As the needs of programmers change, languages adopt new features or modify existing ones. These alterations are usually made by committees that manage the language or users who create libraries (plugins that make it simpler to complete tasks) that are shared and become integral parts of the language.
In the case of the wildly popular statistical programming language R, it’s been more about revolution than evolution. The changes keep coming.
Last month, R users from across the world gathered in Toulouse, France to discuss new developments at the useR! conference, the language’s premier international gathering. At nearly every talk I attended, the name Hadley Wickham was mentioned. Wickham is the language’s most important developer. Over the past decade, along with his collaborators, Wickham built a set of popular data analysis and visualization libraries (also known as packages) called the “tidyverse,” which has almost become its own language. Wickham’s libraries are among the most popular in R, and have become the standard for new learners. (R is free to use.)
People who stopped using R years ago would barely recognize how people typically use the language today. Some R users are displeased by the dominance of the tidyverse, in part because it is now backed by the company RStudio, which employs Wickham and most of his collaborators. RStudio offers a free user interface for the language, but charges companies for enterprise support.
In Toulouse, I spoke with Wickham about the current state of R and what he sees for the future of the language. The conversation has been edited and condensed.
Quartz: How would you describe the current state of the R language and community compared with a decade ago?
Wickham: Ten years ago, R was just this like little weird language that was popular among statisticians. For me, the story of the last 10 years is this incredible growth.
Along with that, there has also been a pretty profound change in the community. Ten years ago, the main place to ask an R-related question was the R Help mailing list. If you asked a dumb question, someone would flame you. It was a pretty intimidating environment. Now, people tell me they love R because the community is so welcoming. I think a large part of that is because it has become much more diverse. It was mostly just statisticians before, and now the people are from much wider academic backgrounds.
What is a recent development that you are really excited about?
I think R Markdown is an amazing contribution to R. It has made it a lot easier to make reproducible documents with R regardless of whether you are an academic [writing a paper] or making PowerPoint decks for a company.
I am also excited about the idea of “tidy evaluation.” This is an idea that allows you to program in R very naturally. R has this tension between being a programming language and being an interactive environment for data analysis. When you are doing data analysis, typing speed is actually a bottleneck. As developers, tidy evaluation helps us make sure the user does as little typing as possible and can express really rich ideas [for analysis]. This is what underlies ggplot2 and some of our other libraries. (Editor’s note: ggplot2 is a popular data visualization library.) The idea is to get things out of your head and onto the computer as quickly as possible.
Also, in the community, R-Ladies has been fantastic in improving the gender diversity. (Editor’s note: R-Ladies are Meetup groups across the world for women and other gender minorities to discuss R.) I am tremendously excited about that. It is cool to see them moving beyond data analysis, and making packages and becoming contributors and giving back.
What are some of the issues you see R facing?
Generally, there are a lot of people who talk about R versus Python like it’s a war that either R or Python is going to win. I think that is not helpful because it is not actually a battle. These things exist independently and are both awesome in different ways.
A pattern that I see is that the data science team in a company uses R and the data engineering team uses Python. The Python people tend to have a background in software engineering and are very confident about their programming skills. They see R and it looks very weird, and say with a lot of certainty these facts about R [not being as good].
The R users are generally not as confident in their programming skills. They really like R, but can’t argue with the engineering team, because they don’t have the language to make that argument. People using R tend to have these backgrounds in biology or marketing and they don’t have the vocabulary. R is a weird language but it is weird for good reasons, and it’s a really good fit for data science. It’s not a general purpose programming language, but there are good reasons for a lot of the things it does.
For a lot of people using Python, it’s the only programming language they have used and they think it’s great. They are right. It is great. But there are multiple ways of attacking the same problem, and sometimes the reason R is different is good.
So I sort of worry about that. Of course, this does not characterize all Python or R users.
I don’t really believe it’s a division. Use whatever makes you happy. It doesn’t hurt my feelings if you don’t use the tidyverse.
In the next year or so, what is a development you are looking forward to?
I have just been working on this thing called “dtplyr,” which will allow people to write dplyr code and then it will translate it into data.table code. (Editor’s note: dplyr and data.table are popular libraries for data analysis in R.)
My vision for R in the future is that you can separate the description of what you want to do with the data from the actual computation. This is part of a trend in the R community. You can write tools to translate to a high-performance computing environment. You can write the same code, but the [backend can be different].
In this case, you can use dplyr to express what you want to do with the data, but dtplyr will allow you to get the speed of data.table. Also, if you want to learn data.table, this can help. You can use it to see how the code translates.
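(Editor’s note: The idea of separating the description of an analysis from its computation can be illustrated with a toy sketch. The example below is written in Python rather than R, and it is not how dtplyr works internally. The names `Pipeline`, `run_with_loops`, and `render_as_text` are hypothetical, invented only for this illustration. The point is that one recorded description can be handed to different backends.)

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Pipeline:
    """A recorded description of steps; nothing runs until a backend is chosen."""
    steps: Tuple[Tuple[str, object], ...] = ()

    def filter(self, column: str, minimum: float) -> "Pipeline":
        # Record "keep rows where column > minimum" without touching any data.
        return Pipeline(self.steps + (("filter", (column, minimum)),))

    def select(self, *columns: str) -> "Pipeline":
        # Record "keep only these columns" without touching any data.
        return Pipeline(self.steps + (("select", columns),))


def run_with_loops(pipeline: Pipeline, rows: List[Dict]) -> List[Dict]:
    """Backend 1 (hypothetical): execute the description with plain Python loops."""
    out = [dict(r) for r in rows]
    for name, arg in pipeline.steps:
        if name == "filter":
            column, minimum = arg
            out = [r for r in out if r[column] > minimum]
        elif name == "select":
            out = [{c: r[c] for c in arg} for r in out]
    return out


def render_as_text(pipeline: Pipeline, table_name: str) -> str:
    """Backend 2 (hypothetical): translate the description instead of running it."""
    parts = [table_name]
    for name, arg in pipeline.steps:
        if name == "filter":
            column, minimum = arg
            parts.append(f"keep rows where {column} > {minimum}")
        elif name == "select":
            parts.append("keep columns " + ", ".join(arg))
    return " -> ".join(parts)


# The same description is written once...
query = Pipeline().filter("mpg", 25).select("mpg", "cyl")

# ...and then handed to whichever backend is appropriate.
cars = [
    {"mpg": 21.0, "cyl": 6, "hp": 110},
    {"mpg": 30.4, "cyl": 4, "hp": 52},
]
result = run_with_loops(query, cars)
plan = render_as_text(query, "cars")
```

Here `query` is only a description. Swapping `run_with_loops` for a faster engine would not require rewriting it, which is roughly the separation Wickham describes between dplyr code and a data.table backend.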
Let’s talk about the next five years. What are your hopes and dreams over that time?
My hope is that the integrations between R and Python continue to grow and for R to more seamlessly fit in the data science workflow. There are certain things that it will never be as good at as Python, and we want to make sure people can collaborate with Python people. But also we want to provide the basics in R, so there is a choice. If you want cutting edge stuff, you can use whatever tool you want, but R should be good enough at nearly everything that you can stay in R if you want.
In terms of visualization, I would like to return to bringing more interactivity to graphics in R. The vision is that you would be able to create graphics like in ggplot2, [with code] that also describes how you can interact with them. This is so important for big datasets where you can’t display all the data simultaneously, where you need to be able to interactively drill down and think about what’s happening. It’s quite a hard technical problem. With RStudio the company, we now have the resources to pour into these areas. In a year or two we might have enough money to work on this full-time. That is the level of engineering effort I think is needed for this problem.
One thing I have been thinking about a lot is diversity. In terms of gender diversity we have made a massive improvement, and the direction seems positive. When I look in the US, there are big racial inequities. I personally know more African R users than African American R users. I have this worry that there are other communities that we should be reaching out to. Can we take the R-Ladies model and help other groups that are currently underserved?
I am trying to find the leverage points where I can help R reach new communities and help new communities learn about R. I am teaching a two-day workshop at Spelman College, a historically black women’s college. My sense is that you need to build up a nucleus of people who know each other and who can network and support each other. I want to do that as much as I possibly can.
There is also this broader question about how we make sustainable open source software. Companies get this huge economic benefit from it, and they are not required to give back. It’s very hard to rely on philanthropy, so how do we extract some of the economic value open-source is generating and reinvest it back into the community?