Data Manipulation: Subsetting
Subsetting a data frame is one of the most basic and necessary data manipulation techniques in R. If you are brand new to data analysis, a data frame is the most common data storage object in R, and a subset is a collection of rows from that data frame that meet certain criteria.
For this example, I’m using data from FanGraphs. You can get the exact data set here, and it’s also provided on my GitHub. The data set has players’ names, teams, seasons and stats, and we can create a subset based on any one or more of these variables.
I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function and using the filter() function from the dplyr package. All of these are different ways to do the same thing. The dplyr package is fast and easy to code, and it is my recommended subsetting method, so let’s start with that. Its speed matters most when the subsetting sits inside a loop or will otherwise be run repeatedly.
filter() requires the dplyr package to be loaded in your R environment. Loading dplyr masks the filter() function from the default stats package, which means a plain call to filter() now runs the dplyr version. You don’t need to worry about this, but R does print a message about it when you load the package.
library(dplyr) #load the package
data <- read.csv('FanGraphs Leaderboard.csv') #loads in the data
Aside from loading the package, you’ll have to load the data in as well.
#finds all players who played for the Marlins
data.sub.1 <- filter(data, Team=='Marlins')
#finds all the NL East players
NL.East <- c('Marlins','Nationals','Mets','Braves','Phillies') #makes the division
data.sub.2 <- filter(data, Team %in% NL.East) #finds all players that are in the NL East
#Both of these find NL East players who have more than 30 home runs.
data.sub.3 <- filter(data, Team %in% NL.East, HR > 30) #uses multiple arguments
data.sub.3 <- filter(data, Team %in% NL.East & HR > 30) #uses & sign
#Finds players who are in the NL East or have more than 30 HR
data.sub.4 <- filter(data, Team %in% NL.East | HR > 30)
#Finds players not in the NL East and who have more than 30 home runs.
data.sub.5 <- filter(data, !(Team %in% NL.East), HR > 30)
The filter() function is rather simple to use. In the examples above, you specify the data frame you want to use and write true/false expressions, which filter() uses to decide which rows to keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the code to demonstrate how it works.
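To make the true/false expressions concrete, here is a minimal sketch on a made-up four-row data frame (the players and numbers are invented; only the column names follow the FanGraphs export). It builds the logical vectors by hand and, if dplyr is installed, checks that filter() keeps the same rows.

```r
# Made-up rows for illustration only.
toy <- data.frame(
  Name = c("Player A", "Player B", "Player C", "Player D"),
  Team = c("Marlins", "Yankees", "Mets", "Dodgers"),
  HR   = c(37, 41, 12, 8),
  stringsAsFactors = FALSE
)
NL.East <- c("Marlins", "Nationals", "Mets", "Braves", "Phillies")

# Each condition evaluates to one TRUE/FALSE value per row.
in.division <- toy$Team %in% NL.East   # TRUE FALSE TRUE FALSE
power       <- toy$HR > 30             # TRUE TRUE FALSE FALSE

# Conditions separated by commas in filter() are combined with AND.
keep <- in.division & power            # TRUE FALSE FALSE FALSE

# The guard skips the comparison if dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  toy.sub <- dplyr::filter(toy, Team %in% NL.East, HR > 30)
  # Compare the rows kept by filter() with the rows where keep is TRUE.
  print(identical(toy.sub$Name, toy$Name[keep]))
}
```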
#method 1 -- using a T/F vector
data.sub.1 <- data[data$Team == 'Marlins',]
#method 2 -- which()
data.sub.2 <- data[which(data$Team == 'Marlins'),]
#method 3 -- subset()
data.sub.3 <- subset(data,subset = (Team=='Marlins'))
#other comparison functions
data.sub.4 <- data[data$HR > 30,] #greater than
data.sub.5 <- data[data$HR < 30,] #less than
data.sub.6 <- data[data$AVG > .320 & data$PA > 600,] #dual requirements using AND (&)
data.sub.7 <- subset(data, subset = (AVG > .300 & PA > 600)) #using subset()
data.sub.8 <- data[data$HR > 40 | data$SB > 30,] #dual requirements using OR (|)
data.sub.9 <- data[data$Team %in% c('Marlins','Nationals','Mets','Braves','Phillies'),] #finds values in a vector
data.sub.10 <- data[data$Team != '- - -',] #removes players who played for two teams
If you don’t want to use the dplyr package, you can accomplish the same thing using the basic functionality of R. In the code above, method 1 uses a boolean vector to select rows for the subset. Method 2 uses the which() function, which returns the indices of the TRUE values in a boolean vector. Both of these techniques index the original data frame by row to create a subset. The subset() function works much like the filter() function, except the syntax is slightly different and you don’t have to install a separate package.
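One practical difference between the three base R approaches shows up when the column being tested contains missing values. The sketch below uses a made-up three-row data frame with an NA in HR to show it; the names and numbers are invented for illustration.

```r
# Made-up rows for illustration; Player B has a missing HR value.
toy <- data.frame(
  Name = c("Player A", "Player B", "Player C"),
  HR   = c(37, NA, 12),
  stringsAsFactors = FALSE
)

# The condition evaluates to TRUE, NA, FALSE.
by.logical <- toy[toy$HR > 30, ]         # the NA produces an extra row of NAs
by.which   <- toy[which(toy$HR > 30), ]  # which() returns only the TRUE positions
by.subset  <- subset(toy, HR > 30)       # subset() treats NA as FALSE

nrow(by.logical)  # 2
nrow(by.which)    # 1
nrow(by.subset)   # 1
```

dplyr’s filter() also drops rows where the condition is NA, so it behaves like which() and subset() in this respect.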
While subset() works in a similar fashion to filter(), it doesn’t run as quickly. Some data manipulation happens only once or a few times in a project, but many projects require constant subsetting, possibly inside a loop. So while the gain might seem insignificant for one run, multiplied over many runs the difference adds up quickly.
I timed how long it would take to run the same (complex) subset of a 500,000-row data frame using the four different techniques.
|Subset Method||Elapsed Time (sec)|
The dplyr filter() function was by far the quickest, which is why I prefer to use it.
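A comparison like this could be set up roughly as follows. This is only a sketch under stated assumptions, not the code behind the timings above: the simulated data frame `sim`, the three-part condition, and the helper `elapsed()` are all hypothetical stand-ins. Results will vary by machine and by R and dplyr version, and a single run is noisy, so repeat it a few times before drawing conclusions.

```r
# Simulated 500,000-row data frame (an assumption, not the FanGraphs data).
set.seed(1)
n <- 500000
teams <- c("Marlins", "Nationals", "Mets", "Braves", "Phillies", "Yankees")
sim <- data.frame(
  Team = sample(teams, n, replace = TRUE),
  HR   = sample(0:50, n, replace = TRUE),
  PA   = sample(100:700, n, replace = TRUE),
  stringsAsFactors = FALSE
)
NL.East <- c("Marlins", "Nationals", "Mets", "Braves", "Phillies")

# Hypothetical helper: elapsed seconds for one evaluation of an expression.
# The argument is evaluated lazily, so the work happens inside system.time().
elapsed <- function(expr) system.time(expr)[["elapsed"]]

timings <- c(
  logical = elapsed(sim[sim$Team %in% NL.East & sim$HR > 30 & sim$PA > 400, ]),
  which   = elapsed(sim[which(sim$Team %in% NL.East & sim$HR > 30 & sim$PA > 400), ]),
  subset  = elapsed(subset(sim, Team %in% NL.East & HR > 30 & PA > 400))
)

# Time filter() only if dplyr is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  timings["filter"] <- elapsed(dplyr::filter(sim, Team %in% NL.East, HR > 30, PA > 400))
}

print(timings)
```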
The full code I used to write up this tutorial is available on my GitHub.
Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html