R Bootcamp: Making a Subset
Data Manipulation: Subsetting
Making a subset of a data frame is one of the most basic and necessary data manipulation techniques in R. If you are brand new to data analysis: a data frame is the most common data storage object in R, and a subset is a collection of rows from that data frame selected by certain criteria.
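Original data frame:
     | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
Row1 |    |    |    |    |    |    |    |
Row2 |    |    |    |    |    |    |    |
Row3 |    |    |    |    |    |    |    |
Row4 |    |    |    |    |    |    |    |
Row5 |    |    |    |    |    |    |    |
Row6 |    |    |    |    |    |    |    |
Subset (rows 2, 5 and 6):
     | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
Row2 |    |    |    |    |    |    |    |
Row5 |    |    |    |    |    |    |    |
Row6 |    |    |    |    |    |    |    |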
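To make the picture above concrete, here is a minimal sketch in base R. The data frame `toy` and all of its values are made up for illustration; the condition is chosen so that rows 2, 5 and 6 are the ones kept, as in the tables.

```r
# A made-up six-row data frame (values are illustrative only)
toy <- data.frame(
  V1 = c(10, 25, 7, 3, 31, 40),
  V2 = c("a", "b", "c", "d", "e", "f"),
  row.names = paste0("Row", 1:6)
)

# Keep only the rows where V1 is greater than 20
toy.sub <- toy[toy$V1 > 20, ]

print(rownames(toy.sub))  # "Row2" "Row5" "Row6"
```

The original `toy` data frame is left unchanged, so it can be reused for other subsets.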
The Data
For this example, I’m using data from FanGraphs. You can get the exact data set here, and it’s also provided on my GitHub. The data set has players' names, teams, seasons and stats, and we can create a subset based on any one or more of these variables.
The Code
I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function and using the filter() function from the dplyr package. All of these are different ways to do the same thing. The dplyr package is fast and easy to code, which matters most when you have to loop an operation or run something repeatedly. It is my recommended subsetting method, so let’s start with that.
dplyr
The filter() function requires the dplyr package to be loaded in your R environment. Loading dplyr masks the filter() function from the default stats package, which remains available as stats::filter(). You don’t need to worry about this, but R does print a message about it when you load the package.
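As a small illustration of that masking, the sketch below calls both versions of filter() side by side. The `toy` data frame is hypothetical and is not the FanGraphs file; the moving-average call is only there to show that the stats version can still be reached.

```r
library(dplyr)  # prints a message that filter() from stats is masked

# Hypothetical toy data (made-up values)
toy <- data.frame(Team = c("Marlins", "Mets", "Marlins"), HR = c(12, 35, 41))

# After loading dplyr, a bare filter() call refers to dplyr::filter()
marlins <- filter(toy, Team == "Marlins")

# The stats version is still reachable through its namespace;
# here it computes a centered 3-point moving average of 1:5
moving.avg <- stats::filter(1:5, rep(1 / 3, 3))

print(nrow(marlins))
print(as.numeric(moving.avg))
```

If you ever need both functions in the same script, writing dplyr::filter() and stats::filter() explicitly removes any ambiguity.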
#install.packages('dplyr')
library(dplyr) #load the package

#from http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2015&month=0&season1=2010&ind=1&team=&rost=&age=&filter=&players=&page=2_30
setwd('***PATH***')
data <- read.csv('FanGraphs Leaderboard.csv') #loads in the data
Aside from loading the package, you'll have to load the data in as well.
#finds all players who played for the Marlins
data.sub.1 <- filter(data, Team=='Marlins')

#finds all the NL East players
NL.East <- c('Marlins','Nationals','Mets','Braves','Phillies') #makes the division
data.sub.2 <- filter(data, Team %in% NL.East) #finds all players that are in the NL East

#Both of these find players in the NL East and have more than 30 home runs.
data.sub.3 <- filter(data, Team %in% NL.East, HR > 30) #uses multiple arguments
data.sub.3 <- filter(data, Team %in% NL.East & HR > 30) #uses & sign

#Finds players in the NL East or has more than 30 HR
data.sub.4 <- filter(data, Team %in% NL.East | HR > 30)

#Finds players not in the NL East and who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)
The filter() function is rather simple to use. In the examples above, you specify the data frame you want to use and write true/false expressions, which filter() uses to decide which rows to keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. The later examples combine conditions with AND, OR and NOT to show how they work together.
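Since the FanGraphs file may not be at hand, here is a self-contained sketch of the same filter() patterns on a hypothetical data frame. The names and numbers in `toy` are made up. It also shows one detail worth knowing: when a condition evaluates to NA for a row, filter() drops that row.

```r
library(dplyr)

# Hypothetical stand-in for the FanGraphs data (made-up values)
toy <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Team = c("Marlins", "Mets", "Cubs", "Braves", "Reds"),
  HR   = c(37, 12, 41, NA, 20)
)
NL.East <- c("Marlins", "Nationals", "Mets", "Braves", "Phillies")

# Comma-separated conditions are combined with AND, same as using &
with.comma <- filter(toy, Team %in% NL.East, HR > 30)
with.amp   <- filter(toy, Team %in% NL.East & HR > 30)
# Player D has a missing HR, so HR > 30 is NA and the row is dropped

# OR keeps a row when either side is TRUE (D is kept via the Team test)
either <- filter(toy, Team %in% NL.East | HR > 30)

print(with.comma$Name)
print(either$Name)
```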
Built-in Functions
#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins',]

#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'),]

#method 3 -- subset()
data.sub.3 <- subset(data, subset = (Team == 'Marlins'))

#other comparison functions
data.sub.4 <- data[data$HR > 30,] #greater than
data.sub.5 <- data[data$HR < 30,] #less than
data.sub.6 <- data[data$AVG > .320 & data$PA > 600,] #dual requirements using AND (&)
data.sub.7 <- subset(data, subset = (AVG > .300 & PA > 600)) #using subset()
data.sub.8 <- data[data$HR > 40 | data$SB > 30,] #dual requirements using OR (|)
data.sub.9 <- data[data$Team %in% c('Marlins','Nationals','Mets','Braves','Phillies'),] #finds values in a vector
data.sub.10 <- data[data$Team != '- - -',] #removes players who played for two teams
If you don't want to use the dplyr package, you can accomplish the same thing using the basic functionality of R. Method 1 uses a boolean vector to select rows for the subset. Method 2 uses the which() function, which returns the indices of the TRUE values in a boolean vector. Both techniques index the rows of the original data frame to create the subset.
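The two techniques behave differently when the column contains missing values, which the following sketch illustrates. The `toy` data frame is hypothetical, with one NA planted on purpose.

```r
# Hypothetical toy data with one missing HR value
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  HR   = c(37, 12, NA, 41)
)

keep <- toy$HR > 30              # TRUE FALSE NA TRUE

by.bool  <- toy[keep, ]          # the NA produces an extra row filled with NA
by.which <- toy[which(keep), ]   # which() returns only the TRUE positions

print(which(keep))     # 1 4
print(nrow(by.bool))   # 3
print(nrow(by.which))  # 2
```

If your data may contain missing values, wrapping the condition in which() is a simple way to avoid the all-NA rows.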
The subset() function works much like the filter() function, except the syntax is slightly different and you don't have to install a separate package.
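The syntax difference is easiest to see next to bracket indexing. This is a minimal sketch on a hypothetical `toy` data frame with made-up values.

```r
# Hypothetical toy data (made-up values)
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  AVG  = c(0.310, 0.280, 0.325, 0.305),
  PA   = c(650, 700, 590, 610)
)

# Bracket indexing needs the data frame name in front of every column
by.bracket <- toy[toy$AVG > 0.300 & toy$PA > 600, ]

# subset() looks the column names up inside the data frame for you
by.subset <- subset(toy, AVG > 0.300 & PA > 600)

print(by.subset$Name)  # "A" "D"
```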
Efficiency
While subset() works in a similar fashion to filter(), it doesn't perform as quickly. Some data manipulation might only happen once or a few times in a project, but many projects require constant subsetting, possibly inside a loop. So while the gain might seem insignificant for one run, multiply that difference across many runs and it adds up quickly.
I timed how long it would take to run the same [complex] subset of a 500,000 row data frame using the four different techniques.
Subset Method | Elapsed Time (sec) |
boolean vector | 0.87 |
which() | 0.33 |
subset() | 0.81 |
dplyr filter() | 0.21 |
The dplyr filter() function was the quickest, roughly four times faster than the boolean vector and subset() and noticeably faster than which(), which is why I prefer to use it.
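If you want to run a similar comparison yourself, the sketch below times the four methods with system.time(). It is not the benchmark behind the table above: the data frame `big` is synthetic, the condition is an assumption, and the timings will differ by machine, R version and dplyr version.

```r
library(dplyr)

set.seed(1)  # make the synthetic data reproducible
n <- 500000

# Synthetic data: NOT the FanGraphs file or the data behind the table above
big <- data.frame(
  Team = sample(c("Marlins", "Mets", "Braves", "Cubs", "Reds"), n, replace = TRUE),
  HR   = sample(0:50, n, replace = TRUE)
)

# Time one subset with each method; "elapsed" is wall-clock seconds
timings <- c(
  boolean = system.time(big[big$Team == "Marlins" & big$HR > 30, ])[["elapsed"]],
  which   = system.time(big[which(big$Team == "Marlins" & big$HR > 30), ])[["elapsed"]],
  subset  = system.time(subset(big, Team == "Marlins" & HR > 30))[["elapsed"]],
  filter  = system.time(filter(big, Team == "Marlins", HR > 30))[["elapsed"]]
)

print(timings)
```

A single run is noisy, so repeating each call several times and averaging gives a steadier comparison.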
The full code I used to write up this tutorial is available on my GitHub.
References:
Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html