What R’s most popular tools say about the state of data science

“Packages” are what make data science work.
Image: Reuters/Gabrielle Lurie

The programming language R is one of the most important tools in data science, used by millions of people across the world. Along with spreadsheet programs, like Microsoft Excel and Google Sheets, and other programming languages, such as Python, SQL and Julia, R is what many analysts use to gain insights from the huge amount of data that’s collected these days.

This week, R users from across the world are gathering in Toulouse, France to discuss new developments at the useR! conference, the language’s premier international gathering. It’s the Super Bowl for data nerds, and nearly everything they—okay, we—will be talking about revolves around “packages.”

Almost every user of R relies heavily on packages (also known as libraries). These are plugins for R that make it simpler to complete tasks. The most famous R package is probably ggplot2, perhaps the world’s best freely available charting tool, and one of the main reasons many users are drawn to R in the first place. R is open source and free to use, and so are packages. Packages are typically made by particularly knowledgeable users.

Examining the most popular packages in R is a simple way to get a feel for what is happening in the world of data science. Fortunately, there is a package for that!

The package cranlogs allows R users to see how many times every package was downloaded on a given day. Users can download packages from many locations, and cranlogs covers just one of the most popular ones, so its counts are not perfect, but they are broadly representative of overall downloads.
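For readers who want to try this themselves, here is a minimal sketch of what such a query might look like. It assumes cranlogs is installed and an internet connection is available. The package "dplyr" and the June 2019 window are picked only for illustration, and the top-10 list will differ depending on the day it is run.

```r
library(cranlogs)

# Ten most-downloaded packages over the last month, as recorded by the
# single download location that cranlogs tracks.
top10 <- cran_top_downloads(when = "last-month", count = 10)
print(top10)

# Daily download counts for one package over a fixed window.
# The result is a data frame with columns: date, count, package.
daily <- cran_downloads(packages = "dplyr",
                        from = "2019-06-01", to = "2019-06-30")

# Total downloads across the window.
monthly_total <- sum(daily$count)
print(monthly_total)
```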

The most downloaded packages are rlang and Rcpp. These are packages made for the development of other packages or to make R work more efficiently, so they don’t tell us much about the world outside of R. The packages magrittr (tied for fifth), pillar (eighth) and R6 (10th) are also tools for developers.

The third-most popular download is dplyr, a package used to simplify data manipulation. It allows users to easily summarize data and make new variables out of existing ones. Its popularity reflects the fact that data scientists spend most of their time on data cleaning and preparation, not analysis. The tool was created by superstar R programmer Hadley Wickham, who is also a developer or co-developer of top-10 packages tibble, ggplot2, glue, and pillar.
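To give a flavor of what that looks like in practice, here is a small sketch using mtcars, a sample dataset of 32 cars that ships with R. The new column name kpl and the choice of grouping are made up for this example; the point is only to show one variable being derived from another and then summarized.

```r
library(dplyr)

by_cyl <- mtcars %>%
  # Make a new variable out of an existing one:
  # US miles per gallon converted to kilometers per liter.
  mutate(kpl = mpg * 0.425144) %>%
  # Summarize: one row per cylinder count.
  group_by(cyl) %>%
  summarise(cars = n(), mean_kpl = mean(kpl))

print(by_cyl)
```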

Glue, the seventh-most popular package, is designed for working with data that is text. Textual analysis and text manipulation are rising in importance, as much of the data produced today, such as on social media, is made up of words rather than numbers.
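As a rough illustration, the package's main function, glue(), builds text by filling in the values of R variables wherever they appear inside curly braces. The variable names below are invented for this example, and the values simply restate a ranking given earlier in this article.

```r
library(glue)

pkg  <- "dplyr"
rank <- 3

# Anything inside {} is evaluated as R code and pasted into the string.
msg <- glue("{pkg} is ranked number {rank} by downloads")
print(msg)
```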

The ninth-most popular package is data.table, which is all about the need for speed. Data.table can make millions of calculations in seconds, which matters when running analyses on datasets with billions of rows.
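Here is a sketch of the kind of operation data.table is built for: a grouped summary over a million rows. The data is randomly generated purely for demonstration, and the column names group and value are placeholders. How long it takes will depend on the machine, so no timing is claimed here.

```r
library(data.table)

set.seed(1)   # make the random data reproducible
n <- 1e6      # one million rows

# Synthetic data: a random letter as the group, a random number as the value.
dt <- data.table(
  group = sample(letters, n, replace = TRUE),
  value = runif(n)
)

# For each group, compute the sum of values and the number of rows (.N).
result <- dt[, .(total = sum(value), rows = .N), by = group]
print(result[order(group)])
```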

In addition to the most popular packages, looking at the fastest-growing packages over the past year is also informative. Four packages had at least 750,000 more downloads in June 2019 than in June 2018. First among these, with an increase of nearly 900,000 downloads, is aws.s3, a package that allows R users to easily work with data held by Amazon Web Services (AWS). The fourth-fastest growing package, aws.ec2metadata, is a related tool for working with AWS. Clearly, data scientists are moving to the cloud.

The third-fastest growing package is rsconnect, a tool for publishing interactive applications, like dashboards, that are built in R. Making dashboards to organize and visualize data for a wide range of users has become a central part of the data scientist’s toolkit.