How Packages Make R Complete
Users of R, the statistical programming language, are a surprisingly passionate bunch. If you're skeptical of the devotion, just search "love #rstats" on Twitter and you will see an outpouring of ecstatic declarations of love for the language.
For the outsider, it can be difficult to understand why someone would feel such fondness for a statistical programming language. It's just a way to work with data and do statistical analysis, so why would people care so much?
The answer likely lies in R's history as a collaborative project. R is an open source language, which means that no one "owns" R, and anyone can contribute to it. R has two million users globally, and many of these users have contributed to the language's development. R's main competitors in the statistical programming space (SPSS, SAS, and Stata) are all privately owned, and more difficult to modify.
For many people, R is not just a piece of statistical software, but a vibrant community. The strong Meetup culture of R users is evidence of this. Even though R and SAS have similar numbers of users, well over 100,000 people have signed up for R Meetups across the world, while only 5,000 people have signed up for SAS Meetups.
The main way R users can contribute to the "R Project" is by creating packages. Packages are programming tools that simplify the code necessary to complete common tasks such as aggregating and plotting data. R users have become accustomed to the idea that if they can't figure out how to do something, it won't be a problem, because, as statistician Roger Peng says, "There's an R Package for That."
This article explores the rise of the package and some examples of the coolest and most unique packages available to R users.
In many ways, the history of R is a story of the effort to make statistical programming easier and easier.
R was created in 1993 by Ross Ihaka and Robert Gentleman, statisticians then working at the University of Auckland in New Zealand. The language is a modification of S, a programming language developed at Bell Labs in the 1970s. S was created to make programming simple enough for statisticians without computer science skills - the vast majority of them.
Prior to the development of S, statisticians would have had to learn to code in Fortran, an early programming language with a high barrier to entry, if they wanted to use computers for statistics. Instead of having to write out complicated code in Fortran to find the average of a set of numbers, a simple one word function could be used in S. R is very similar to S, but with a few key technical differences that make it even easier to use.
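To make the "one word function" point concrete, here is a minimal sketch in R, whose mean() function plays that role. The numbers are made up for illustration and do not come from the article.

```r
# A small set of numbers to average (illustrative values only)
scores <- c(4, 8, 15, 16, 23, 42)

# One built-in function call replaces a hand-written summation loop
average <- mean(scores)
print(average)
```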
Even so, R is not the easiest language to learn. There are lots of tasks for which writing R code would be quite challenging for a non-expert or someone without a computer science background. And this is definitely relevant, since fewer than 15% of students who take Udemy's R courses have an academic background in computer science.
This is where packages come in. Packages are add-ons users can choose to download that simplify the code necessary for a task. Any R user who figures out some nifty code to solve a problem can "package" it up and share the code with other R users. The creators of these packages make no money off them, and likely do it out of their impulse to share and gain prestige within the community.
A simple example of how packages make life easier for the data analyst comes from the gdata package. Using "base R", the language without any add-ons, it can be a nuisance to import an Excel file into R. You have to save your file in another format (like a text file or a csv) and then when you read that new file into R, you have to specify whether the first row is a header.
Using the gdata package, created by R user Gregory Warnes, getting your data from Excel to R is a cinch. All you do is put the name of your file in parentheses and write "read.xls" before it.
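The two routes described above might look roughly like the sketch below. The file contents and the file name "survey.xls" are hypothetical stand-ins, and the gdata call is guarded so that it only runs when the package and the file are actually present (read.xls also relies on Perl being installed on the machine).

```r
# Base R route: the spreadsheet has already been exported to CSV.
# A temporary file with made-up contents stands in for that export.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,85"), csv_path)

# State explicitly that the first row holds the column names
from_csv <- read.csv(csv_path, header = TRUE)
print(from_csv)

# gdata route: read the workbook directly, with no export step.
# "survey.xls" is a hypothetical file name used only for illustration.
if (requireNamespace("gdata", quietly = TRUE) && file.exists("survey.xls")) {
  from_excel <- gdata::read.xls("survey.xls")
  print(head(from_excel))
}
```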
The number of R packages has grown exponentially over the last 10 years. In 2005, there were only a couple hundred packages, but there are now well over 6,000. A list of every package is available here. The chart below displays this package explosion.
Hundreds of R users have "authored" packages. In order to have a package listed officially on the R website, the package must be scrupulously documented so it is clear to R users how to use it. One of the authors of the package must be put forward as the "maintainer", the person who will fix any errors that appear.
The next chart highlights the individuals who maintain the most packages.
Packages these individuals develop improve the coding lives of the community and users are thankful. At the top of the list of maintainers is the influential package developer Hadley Wickham. His contributions to R have led to his being "nerd-famous" in the R community.
The following chart displays the most popular R packages, as defined by the number of downloads in August 2015. Five of the ten most popular packages were authored by Wickham.
The most popular package is Rcpp. Rcpp is what allows many other R packages to run quickly (it calls C++ from R). Without it, R users would be spending a lot more time waiting around. The second most popular package, plyr, makes it easy to summarize data. It's the equivalent of pivot tables in Excel, but much more flexible. Other packages are for making cool charts (ggplot2), simplifying working with text variables (stringr), and providing the perfect color scheme for graphics (RColorBrewer).
Most of the 20 most popular packages are attempts to uncomplicate common tasks, but users also make packages for endeavors that are surprisingly specific. We wanted to highlight a few of these more eccentric packages. They exhibit the breadth of cool stuff R users have created.
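As a rough illustration of the pivot-table comparison, the sketch below averages fuel efficiency by cylinder count in R's built-in mtcars data set. The data set and column choices are assumptions made for this example, not something the article specifies. Base R's aggregate() is shown first for comparison, and the plyr call runs only if plyr is installed.

```r
# Base R: mean miles-per-gallon for each cylinder count in mtcars
base_summary <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(base_summary)

# plyr: split the data by "cyl", summarise each piece, and recombine
if (requireNamespace("plyr", quietly = TRUE)) {
  plyr_summary <- plyr::ddply(mtcars, "cyl", plyr::summarise,
                              mean_mpg = mean(mpg))
  print(plyr_summary)
}
```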
The "gender" Package
Have you ever needed to predict the gender of individuals in your dataset based on their name? You are in luck. The R programmer Lincoln Mullen created the package gender to make this a simple process for R users. With some straightforward code, you can find out the probability of a name belonging to a man or woman based on historical datasets.
For example, to find out the probability that a person named Hillary born in the United States in 1942 was male, you can input the following code.
library(gender)
gender("Hillary", years = 1942)
The answer is 66% male. By 1970, Hillary was an exclusively female name.
This tool has come in handy for data scientists studying historical documents in which gender was not listed.
The "rvest" Package
R is not just for statistical analysis anymore. In a way that the creators of R could have never dreamed of, R has been developed to interact with the web. For example, the rvest package allows R users to crawl websites for interesting data.
Want a dataset of every cast member in a movie listed on the movie database website IMDB.com? Want to get all of the headlines from the New York Times on a given day? Want data on the best time to sell your used books? rvest gives R users the power to do this with easy-to-learn code.
Particularly convenient for R users is that when data from a website is pulled into R, it is in a format ready for statistical analysis.
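A minimal sketch of that workflow follows. To avoid depending on a live website, a tiny inline HTML page with made-up names stands in for a real one; with an actual site you would start from rvest's read_html() and a URL instead. The code is guarded so it only runs when rvest (version 1.0 or later is assumed) is installed.

```r
if (requireNamespace("rvest", quietly = TRUE)) {
  # A tiny inline page stands in for a real website (hypothetical content)
  page <- rvest::minimal_html('
    <table>
      <tr><th>actor</th><th>role</th></tr>
      <tr><td>Ann Example</td><td>Pilot</td></tr>
      <tr><td>Bob Sample</td><td>Engineer</td></tr>
    </table>')

  # Select the table node and convert it into a data frame
  cast <- rvest::html_table(rvest::html_element(page, "table"))
  print(cast)
}
```

The result is an ordinary data frame, so it can be passed straight to R's usual summary and modelling functions.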
The "sqldf" Package
There are some R users who want to do everything they can within the R environment, including using other languages. The sqldf package allows R users to use the popular database language SQL (Structured Query Language) within R. Users can take a SQL query, put it in quotation marks between parentheses, with "sqldf" before the parentheses, and R will execute the query against their data.
This ability to use other languages within R can be convenient because periodically it is faster or easier to complete a task in another language. There are similar packages that allow users to run the languages Python and C++ from R.
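As a sketch of what this looks like in practice, the query below treats R's built-in mtcars data frame as if it were a database table. The choice of data set and columns is an assumption made for illustration, and the code only runs if sqldf is installed.

```r
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)

  # Query a data frame by name, as though it were a database table
  by_cyl <- sqldf(
    "SELECT cyl, COUNT(*) AS n_cars, AVG(mpg) AS mean_mpg
     FROM mtcars
     GROUP BY cyl
     ORDER BY cyl"
  )
  print(by_cyl)
}
```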
The "acs" Package
While most packages are meant to make coding simpler, there are also a number of packages intended to give R users easy access to useful datasets. For example, the acs package gives R users the ability to download and analyze data from the United States Census. Before acs was created, getting census data into R was a laborious process. Now, thanks to package creator Ezra Glenn, it's a snap.
There are numerous other packages that give R users access to datasets. For example, the USAboundaries package contains state and county boundaries for the United States from 1629 to 2000, and the fueleconomy package holds data on the fuel economy of all cars sold in the US from 1984 to 2014.
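Using a data package tends to be as simple as referring to the data set by name. The sketch below assumes the fueleconomy package is installed and that its vehicles table has year and hwy (highway miles per gallon) columns; it is guarded so it does nothing otherwise.

```r
if (requireNamespace("fueleconomy", quietly = TRUE)) {
  # vehicles is a data frame shipped with the package
  vehicles <- fueleconomy::vehicles

  # Average highway mileage for each model year
  hwy_by_year <- aggregate(hwy ~ year, data = vehicles, FUN = mean)
  print(head(hwy_by_year))
}
```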
For many R users, the best aspect of working with R is that, like Isaac Newton, you are "standing upon the shoulders of giants." Whatever your statistical problem, it is likely that some R user before you has encountered that same issue, and made a package to ease the process for you.
It is just this feeling, that thousands of other people are helping you complete your project, which leads many R users to fall in love.