# Introduction to Correlation with R | Anscombe’s Quartet

Correlation is one the most commonly [over]used statistical tool. In short, it measures how strong the relationship between two variables. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other.

# Basic Calculation and Definition

Covariance also measures the relationship between two variables, but it is not scaled, so it can’t tell you the strength of that relationship. For example, Let’s look at the following vectors a, b and c. Vector c is simply a 10X-scaled transformation of vector b, and vector a has no transformational relationship with vector b or c.

 a b c 1 4 40 2 5 50 3 6 60 4 8 80 5 8 80 5 4 40 6 10 100 7 12 120 10 15 150 4 9 90 8 12 120

Plotted out, a vs. b and a vs. c look identical except the y-axis is scaled differently for each. When the covariance is taken of both a & b and a & c, you get different a large difference in results. The covariance between a & b is much smaller than the covariance between a & c even though the plots are identical except the scale. The y-axis on the c vs. a plot goes to 150 instead of 15.

$latex cov(X, Y) = \frac{\Sigma_i^N{(X_i – \bar X)(Y_i – \bar Y)}}{N-1} &s=2$

$latex cov(a, b) = 8.5 &s=2$

$latex cov(a, c) = 85 &s=2$

To account for this, correlation is takes covariance and scales it by the product of the standard deviations of the two variables.

$latex cor(X, Y) = \frac{cov(X, Y)}{s_X s_Y} &s=2$

$latex cor(a, b) = 0.8954 &s=2$

$latex cor(a, c) = 0.8954 &s=2$

Now, correlation describes how strong the relationship between the two vectors regardless of the scale. Since the standard deviation in vector c is much greater than vector b, this accounts for the larger covariance term and produces identical correlations terms. The correlation coefficient will fall between -1 and 1. Both -1 and 1 indicate a strong relationship, while the sign of the coefficient indicates the direction of the relationship. A correlation of 0 indicates no relationship.

Here’s the R code that will run through the calculations.

# Caution | Anscombe’s Quartet

Correlation is great. It’s a basic tool that is easy to understand, but it has its limitations. The most prominent being the correlation =/= causation caveat. The linked BuzzFeed article does a good job explaining the concept some ridiculous examples, but there are real-life examples being researched or argued in crime and public policy. For example, crime is a problem that has so many variables that it’s hard to isolate one factor. Politicians and pundits still try.

Another famous caution about using correlation is Anscombe’s Quartet. Anscombe’s Quartet uses different sets of data to achieve the same correlation coefficient (0.8164 give or take some rounding). This exercise is typically used to emphasize why it’s important to visualize data.

The graphs demonstrates how different the data sets can be. If this was real-world data, the green and yellow plots would be investigated for outliers, and the blue plot would probably be modeled with non-linear terms. Only the red plot would be consider appropriate for a basic, linear model.

I created this plot in R with ggplot2. The Anscombe data set is included in base R, so you don’t need to install any packages to use it. Ggplot2 is a fantastic and powerful data visualization package which can be download for free using the install.packages('ggplot2') command. Below is the R code I used to make the graphs individually and combine them into a matrix.

The full code I used to write up this tutorial is available on my GitHub .

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.

Because Twitter is fun and so are graphs, I have tweet volume graphs from my Twitter scraper that collects tweets with the team-specific nicknames and Twitter handles. After a trade (or non-trade), the data can be collected and a graphical picture of the reaction can be produced. The graph represents the volume of sampled tweets that contained the specific keywords: Mets, Gomez, Flores and Hamels. “Tears” is a collection of any tweets which mentioned either “tears” or “crying”, since Flores was in tears as he took the field.

Here are the reactions to the Gomez non-trade and Hamels trade last night:

And here’s the timeline of necessary tweets:
[All times are EDT.]

July 29
9:00 PM

9:45 PM

9:54 PM

10:15 PM

10:55 PM

July 30
12:13 AM

Some of the times were rounded if there wasn’t a clear single tweet that caused the peak on Twitter.

# Making a Correlation Matrix in R

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on underneath the hood in R. There are built-in functions within R that make this process much quicker and easier.

The correlation matrix is is rather popular for exploratory data analysis, because it can quickly show you the correlations between variables in your data set. From a practical application standpoint, this entire post is unnecessary, because I’m going to show how to derive this using matrix algebra in R.

First, the starting point will be the covariance matrix that was computed from the last post.

$latex {\bf C } = \begin{bmatrix} V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\ C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\ C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\ C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\ C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e \end{bmatrix}&s=2$

This matrix has all the information that’s needed to get the correlations for all the variables and create a correlation matrix [V — variance, C — Covariance]. Correlation, we are using the Pearson version of correlation, is calculated using the covariance between two vectors and their standard deviations [s, square root of the variance]:

$latex cor(X, Y) = \frac{cov(X,Y)}{s_{X}s_{Y}} &s=2$

The trick will be using matrix algebra to easily carry out these calculations. The variance components are all on the diagonal of the covariance matrix, so in matrix algebra notation we want to use this:

$latex {\bf V} = diag({\bf C}) = \begin{bmatrix} V_a\ & 0\ & 0\ & 0\ & 0 \\ 0 & V_b & 0 & 0 & 0 \\ 0 & 0 & V_c & 0 & 0 \\ 0 & 0 & 0 & V_d & 0 \\ 0 & 0 & 0 & 0 & V_e \end{bmatrix} &s=2$

Since R doesn’t quite work the same way as matrix algebra notation, the diag() function creates a vector from a matrix and a matrix from a vector, so it’s used twice to create the diagonal variance matrix. Once to get a vector of the variances, and a second time to turn that vector into the above diagonal matrix. Since the standard deviations are needed, the square root is taken. Also the variances are inverted to facilitate division.

After getting the diagonal matrix, basic matrix multiplication is used to get the all the terms in the covariance to reflect the basic correlation formula from above.

$latex {\bf R } = {\bf S} \times {\bf C} \times {\bf S}&s=2$

And the correlation matrix is symbolically represented as:

$latex {\bf R } = \begin{bmatrix} r_{a,a}\ & r_{a,b}\ & r_{a,c}\ & r_{a,d}\ & r_{a,e} \\ r_{a,b} & r_{b,b} & r_{b,c} & r_{b,d} & r_{b,e} \\ r_{a,c} & r_{b,c} & r_{c,c} & r_{c,d} & r_{c,e} \\ r_{a,d} & r_{b,d} & r_{c,d} & r_{d,d} & r_{d,e} \\ r_{a,e} & r_{b,e} & r_{c,e} & r_{d,e} & r_{e,e} \end{bmatrix}&s=2$

The diagonal where the variances where in the covariance matrix are now 1, since a variable’s correlation with itself is always 1.

# Making a Covariance Matrix in R

The full R code for this post is available on my GitHub.

Understanding what a covariance matrix is can be helpful in understanding some more advanced statistical concepts. First, let’s define the data matrix, which is the essentially a matrix with n rows and k columns. I’ll define the rows as being the subjects, while the columns are the variables assigned to those subjects. While we use the matrix terminology, this would look much like a normal data table you might already have your data in. For the example in R, I’m going to create a 6×5 matrix, which 6 subjects and 5 different variables (a,b,c,d,e). I’m choosing this particular convention because R and databases use it. A row in a data frame represents represents a subject while the columns are different variables. [The underlying structure of the data frame is a collection of vectors.] This is against normal mathematical convention which has the variables as rows and not columns, so this won’t follow the normal formulas found else where online.

The covariance matrix is a matrix that only concerns the relationships between variables, so it will be a k x k square matrix. [In our case, a 5×5 matrix.] Before constructing the covariance matrix, it’s helpful to think of the data matrix as a collection of 5 vectors, which is how I built our data matrix in R.]

The data matrix (M) written out is shown below.

Each value in the covariance matrix represents the covariance (or variance) between two of the vectors. With five vectors, there are 25 different combinations that can be made and those combinations can be laid out in a 5×5 matrix.

There are a few different ways to formulate covariance matrix. You can use the cov() function on the data matrix instead of two vectors. [This is the easiest way to get a covariance matrix in R.]

But we’ll use the following steps to construct it manually:

1. Create a matrix of means (M_mean).
2. $latex {\bf M\_mean} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ \end{bmatrix} \times \begin{bmatrix} \bar{x_{a}} & \bar{x_{b}} & \bar{x_{c}} & \bar{x_{d}} & \bar{x_{e}}\end{bmatrix}&s=2$

3. Create a difference matrix (D) by subtracting the matrix of means (M_mean) from data matrix (M).
4. $latex {\bf D = M – M\_mean} &s=2$

5. Create the covariance matrix (C) by multiplying the transposed the difference matrix (D) with a normal difference matrix and inverse of the number of subjects (n) [We will use (n-1), since this is necessary for the unbiased, sample covariance estimator. This is covariance R will return by default.
6. $latex {\bf C = } (n-1)^{-1} \times {\bf D^T} \times {\bf D} &s=2$

The final covariance matrix made using the R code looks like this:

It represents the various covariances (C) and variance (V) combinations of the five different variables in our data set. These are all values that you might be familiar with if you’ve used the var() or cov() functions in R or similar functions in Excel, SPSS, etc.

$latex \begin{bmatrix} V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\ C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\ C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\ C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\ C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e \end{bmatrix}&s=2$

This matrix is used in applications like constructing the correlation matrix and generalized least squares regressions.

# One-Sample t-Test [With R Code]

The one sample t-test is very similar to the one sample z-test. A sample mean is being compared to a claimed population mean. The t-test is required when the population standard deviation is unknown. The t-test uses the sample’s standard deviation (not the population’s standard deviation) and the Student t-distribution as the sampling distribution to find a p-value.

The t-Distribution

While the z-test uses the normal distribution, which is only dependent on the mean and standard deviation of the population. Any of the various t-tests [one-sample, independent, dependent] use the t-distribution, which has an extra parameter over the normal distribution: degrees of freedom (df). The theoretical basis for degrees of freedom deserves a lot of attention, but for now for the one-sample t-test, df = n – 1.

The distributions above show how the degrees of freedom affect the shape of t distribution. The gray distribution is the normal distribution. Low df causes the tails of the distribution to be fatter, while a higher df makes the t-distribution become more like the normal distribution.

The practical outcome of this will be that samples with smaller n will need to be further from the population mean to reject the null hypothesis than samples with larger n. And compared to the z-test, the t-test will always need to have the sample mean further from the population mean.

Just like the one-sample z-test, we have to define our null hypothesis and alternate hypothesis. This time I’m going to show a two-tailed test. The null hypothesis will be that there is NO difference between the sample mean and the population mean. The alternate hypothesis will test to see if the sample mean is significantly different from the population mean. The null and alternate hypotheses are written out as:

• $latex H_0: \bar{x} = \mu&s=2$
• $latex H_A: \bar{x} \neq \mu&s=2$

The graphic above shows a t-distribution with a df = 5 with the critical regions highlighted. Since the shape of the distribution changes with degrees of freedom, the critical value for the t-test will change as well.

The t-stat for this test is calculated the same way as the z-stat for the z-test, except for the σ term [population standard deviation] in the z-test is replaced with s [sample standard deviation]:

$latex z = \frac{\bar{x} – \mu}{\sigma/\sqrt{n}} \hspace{1cm} t = \frac{\bar{x} – \mu}{s/\sqrt{n}} &s=2$

Like the z-stat, the higher the t-stat is the more certainty there is that the sample mean and the population mean are different. There are three things make the t-stat larger:

• a bigger difference between sample mean and population mean
• a small sample standard deviation
• a larger sample size

Example in R

Since the one-sample t-test follows the same process as the z-test, I’ll simply show a case where you reject the null hypothesis. This will also be a two-tailed test, so we will use the null and alternate hypotheses found earlier on this page.

Once again using the height and weight data set from UCLA’s site, I’ll create a tall-biased sample of 50 people for us to test.

Now using the population mean, the sample mean, the sample standard deviation, and the number of samples (n = 50) we can calculate the t-stat.

Now you could look up the critical value for the t-test with 49 degrees of freedom [50-1 = 49], but this is R, so we can find the area under the tail of the curve [the blue area from the critical region diagram] and see if it’s under 0.025. This will be our p-value, which is the probability that the value was achieved by random chance.

The answer should be 0.006882297, which is well below 0.025, so the null hypothesis is rejected and the difference between the tall-biased sample and the general population is statistically significant.

You can find the full R code including code to create the t-distribution and normal distribution data sets on my GitHub .

# Twitter Sentiment — Penguins VS. Rangers Gm 4

Game 4 of the Penguins-Rangers series featured a brief overtime period that overshadowed the rest of the game as far as tweet volume goes. Rangers fans were more negative at the beginning of the game after the Penguins scored their first goal. Twitter volume picked up for both teams during the overtime period and Rangers fans’ tweets spiked when they won the game and continued throughout the night.

# Using New, Diverse Emojis for Analysis in Python

I haven’t been updating this site often since I’ve started to perform a similar job over at FanGraphs. All non-baseball stat work that I do will continued to be housed here.

Over the past week, Apple has implemented new emojis with a focus on diversity in their iOS 8.3 and the OS X 10.10.3 update. I’ve written quite a bit about the underpinnings of emojis and how to get Python to run text analytics on them. The new emojis provide another opportunity to gain insights on how people interact, feel, or use them. Like always, I prefer to use Python for any web scraping or data processing, and emoji processing is no exception. I already wrote a basic primer on how to get Python to find emoji in your text. If you combine the tutorials I have for tweet scraping, MongoDB and emoji analysis, you have yourself a really nice suite of data analysis.

Modifier Patch

These new emojis are a product of the Unicode Consortium’s plan for how to incorporate racial diversity into the previously all-white human emoji line up. (And yes, there’s a consortium for emoji planning.) The method used to produce new emojis isn’t quite as simple as just making a new character/emoji. Instead, they decided to include a modifier patch at the end of human emojis to indicate skin color. As a end-user, this won’t affect you if you have all the software updates and your device can render the new emojis. However, if you don’t have the updates, you’ll get something that looks like this:

That box at the end of the emoji is the modifier patch. Essentially what is happening here is that there is a default emoji (in this case it’s the old man) and the modifier patch (the box). For older systems it doesn’t display, because the old system doesn’t know how to interpret this new data. This method actually allows the emojis to be backwards compatible, since it still conveys at least part of the meaning of the emoji. If you have the new updates, you will see the top row of emoji.

Using a little manipulation (copying and pasting) using my newly updated iPhone we can figure out this is what really is going on for emojis. There are five skin color patches available to be added to each emoji, which is demonstrated on the bottom row of emoji. Now you might notice there are a lot of yellow emoji. Yellow emojis (Simpsons) are now the default. This is so that no single real skin tone is the default. The yellow emojis have no modifier patch attached to them, so if you simply upgrade your phone and computer and then go back and look at old texts, all the emojis with people in them are now yellow.

New Families

The new emoji update also includes new families. These are also a little different, since they are essentially combinations of other emoji. The original family emoji is one single emoji, but the new families with multiple children and various combinations of children and partners contain multiple emojis. The graphic below demonstrates this.

The man, woman, girl and boy emoji are combined to form that specific family emoji. I’ve seen criticisms about the families not being multiracial. I’d have to believe the limitation here is a technical one, since I don’t believe the Unicode consortium has an effective method to apply modifier patches and combine multiple emojis at once. That would result in a unmanageable number of glyphs in the font set to represent the characters. (625 different combinations for just one given family of 4, and there are many different families with different gender iterations.)

New Analysis

So now that we have the background on the how the new emojis work, we can update how we’ve searched and analyzed them. I have updated my emoji .csv file, so that anyone can download that and run a basic search within your text corpus. I have also updated my github to have this file as well for the socialmediaparse library I built.

The modifier patches are searchable, so now you can search for certain swatches (or lack there of). Below I have written out the unicode escape output for the default (yellow) man emoji and its light-skinned variation. The emoji with a human skin color has that extra piece of code at the end.

Here are all the modifier patches as unicode escape.

The easiest way to search for these is to use the following snippet of code:

You can throw that snippet into a for loop for a Pandas data frame or a MongoDB cursor. I’m planning on updating my socialmediaparse library with patch searching, and I’ll update this post when I do that.

Spock

Finally, there’s Spock!

The unicode escape for Spock is:

Collecting Twitter data is a great exercise in data science and can provide interesting insights in how people behave on the social media platform. Below is an overview of the steps to build a Twitter analysis from scratch. This tutorial will go through several steps to arrive at being able to analyze Twitter data.

1. Overview of Twitter API does
2. Get R or Python
4. Get Developer API Key from Twitter
5. Write Code to Collect Tweets
6. Parse the Raw Tweet Data [JSON files]
7. Analyze the Tweet Data

Introduction

Before diving into the technical aspects of how to use the Twitter API [Application Program Interface] to collect tweets and other data from their site, I want to give a general overview of what the Twitter API is and isn’t capable of doing. First, data collection on Twitter doesn’t necessarily produce a representative sample to make inferences about the general population. And people tend to be rather emotional and negative on Twitter. That said, Twitter is a treasure trove of data and there are plenty of interesting things you can discover. You can pull various data structures from Twitter: tweets, user profiles, user friends and followers, what’s trending, etc. There are three methods to get this data: the REST API, the Search API, and the Streaming API. The Search API is retrospective and allows you search old tweets [with severe limitations], the REST API allows you to collect user profiles, friends, and followers, and the Streaming API collects tweets in real time as they happen. [This is best for data science.] This means that most Twitter analysis has to be planned beforehand or at least tweets have to be collected prior to the timeframe you want to analyze. There are some ways around this if Twitter grants you permission, but the run-of-the-mill Twitter account will find the Streaming API much more useful.

The Twitter API requires a few steps:

1. Authenticate with OAuth
2. Make API call
4. Interpret JSON file

The authentication requires that you get an API key from the Twitter developers site. This just requires that you have a Twitter account. The four keys the site gives you are used as parameters in the programs. The OAuth authentication gives your program permission to make API calls.

The API call is an http call that has the parameters incorporated into the URL like
This Streaming API call is asking to connect to Twitter and tracks the keyword ‘twitter’. Using prebuilt software packages in R or Python will hide this step from you the programmer, but these calls are happening behind the scenes.

JSON files are the data structure that Twitter returns. These are rather comprehensive with the amount of data, but hard to use without them being parsed first. Some of the software packages have built-in parsers or you can use a NoSQL database like MongoDB to store and query your tweets.

Get R or Python

While there are many different programing languages to interface with the API, I prefer to use either Python or R for any Twitter data scraping. R is easier to use out of the box if you are just getting started with coding, and Python offers more flexibility. If you don’t have either of these, I’d recommend installing then learning to do some basic things before tackling Twitter data.

R Studio: http://www.rstudio.com/ [optional]

The easiest way to access the API is to install a software package that has prebuilt libraries that makes coding projects much simpler. Since this tutorial will primarily be focused on using the Streaming API, I recommend installing the streamR package for R or tweepy for Python. If you have a Mac, Python is already installed and you can run it from the terminal. I recommend getting a program to help you organize your projects like PyCharm, but that is beyond the scope of this tutorial.

R
[in the R environment]

Python
[in the terminal, assuming you have pip installed]

# 2015 State of the Union Address — Text Analytics

I collected tweets about the 2015 State of the Union address [SOTU] in real time from 10am to 2am using the keywords [obama, state of the union, sotu, sotusocial, ernst]. The tweets were analyzed for sentiment, content, emoji, hashtags, and retweets. The graph below shows Twitter activity over the course of the night. The volume of tweets and the sentiment of reactions were the highest during the latter half of the speech when Obama made the remark “I should know; I won both of them” referring to the 2008 & 2012 elections he won.

Throughout the day before the speech, there weren’t many tweets and they tended to be neutral. These tweets typically contained links to news articles previewing the SOTU address or reminders about the speech. Both of these types of tweets are factual but bland when compared to the commentary and emotional reaction that occurred during the SOTU address itself. The huge spike in Twitter traffic didn’t happen until the President walked onto the House floor which was just before 9:10 PM. When the speech started, the sentiment/number of positive words per tweet increased to about 0.3 positive words/tweet suggesting that the SOTU address was well received. [at least to the people who bothered to tweet]

Around 7:45-8:00 PM the largest negative sentiment occurred during the day. I’ve looked back through the tweets from that time and couldn’t find anything definitive that happened to cause that. My conjecture would be that is when news coverage started [and strongly opinionated] people started watching the news and began to tweet.

The highest sentiment/number of positive words came during the 15-minute polling window where the President quipped about winning two elections. Unfortunately, that sound bite didn’t make a great hashtag, so it didn’t show up else where in my analysis. However, there are many news articles and discussion about that off-the-cuff remark, and it will probably be the most memorable moment from the SOTU address.

Emoji

Once again [Emoji Popularity Link], the crying-my-eyes-out emoji proved to be the most used emoji in SOTU tweets, typically being used in tweets which aren’t serious and generally sarcastic. Not surprisingly, the clapping emoji was the second most popular emoji mimicking the copious ovations the SOTU address receives. Other notable popular emoji are the fire, US flag, the zzzz emoji and skull. The US flag reflects the patriotic themes of the entire night. The fire is generally reflecting praise for Obama’s speech. The skull and zzzz are commenting on spectators in the crowd.

Two topic-specific emoji counts were interesting. For the most part in all of my tweet collections, the crying-my-eyes-out emoji is exponentially more popular than any other emoji. Understandably, the set of tweets that contained language associated with terrorism had more handclaps, flags, and angry emoji reflecting the serious nature of the subject.

Then tweets corresponding to the GOP response had a preponderance of pig-related emojis due to Joni Ernst’s campaign ad.

#Hashtags

The following hashtag globe graphic is rather large. Please enlarge to see the most popular hashtags associated with the SOTU address. I removed the #SOTU hashtag, because it was use extensively and overshadowed the rest. For those wondering what #TCOT means, it stands for Top Conservatives on Twitter. The #P2 hashtag is its progressive counterpart. [Source]

RTs

The White House staff won the retweeting war by being the two most retweeted [RT] accounts during the speech last night. This graph represents the total summed RTs over all the tweets they made. Since the White House and the Barack Obama account tweeted constantly during the speech, they accumulated the most retweets. Michael Clifford has the most retweeted single tweet stating he is just about met the President. If you are wondering who Michael Clifford is, you aren’t alone, because I had to look him up. He’s the 19-yo guitarist from 5 Seconds of Summer. The tweet is from August, however, people did retweet it during the day. [I was measuring the max retweet count on the tweets.] Rand Paul was the most retweeted non-President politician, and the Huffington Post had the most for a news outlet.

The Speech

Obama released his speech online before starting the State of the Union address. I used this for a quick word-count analysis, and it doesn’t contain the off-the-cuff remarks just the script, which he did stick to with few exceptions. The first graph uses the count of single words with ‘new’ being by far the most used word.

This graph shows the most used two-word combinations [also known as bi-grams].

Further Notes

I was hoping this would be the perfect opportunity to test out my sentiment analysis process, and the evaluation results were rather moderate achieving about 50% accuracy on three classes [negative, neutral, positive]. In this case 50% is an improvement over a 33% random guess, but not very encouraging overall. For the sentiment portion in the tweet volume graph, I used the bag-of-words approach that I have used many times before.

A more interesting and informative classifier might look try to classify the tweet into sarcastic/trolling, positive, and angry genres. I had problems classifying some tweets as positive and negative, because there were many news links, which are neutral, and sarcastic comments, which look positive but feel negative. For politics, classifying the political position might be more useful, since a liberal could be mocking Boehner one minute, then praising Obama the next. Having two tweets classified as liberal rather than a negative tweet and a positive tweet is much more informative when aggregating.

# MLB — Pace of Play [Working Post]

This post is a work in progress. The data concerning the pace of play is rather messy and this project is rather large compare to what I normally tackle. For that reason I’m going start this post and update it as a ‘working post’. Please feel free to contact me if anyone has any input: @seandolinar on Twitter or
sean.dolinar@gmail.com

Having collected the time between pitches from PITCH/fx, I was able to look at the different factors that affect how long pitchers took between plays. [I’m defining this as the pitch pace.] PITCH/fx has a time stamp associated with each pitch. Using that time stamp, I was able to calculate the time between each pitch. I used the resulting calculation combined with other information available about each at-bat to draw some conclusions about what affects pace of play.

The most obvious influence on the time between pitches is whether or not there was a baserunner. This was rather simple to explore since PITCH/fx provides information on whether or not there is a runner on 1B, 2B, or 3B. Using this I was able to create the following table of median pitch pace. [I’ll explain why I decided to use the median and not the mean/average later.]

The data matches what your experience with baseball suggests. Pitchers will slow down the game when there is a runner on base. This will happen for several reasons: run-game tactics, conferences on the mound, and even time for the ball to get back to the pitcher after the play. Given the fact there is a slight drop off for when there isn’t an open base or there are two outs, I would conclude that the run-game prevention tactics play a rather significant role in the pitch pace.

The distribution of pitch pace data shows how often pitchers take 5-10 seconds, 10-15 seconds, 15-20 seconds, etc. between pitches. Both distributions are highly skewed right, so the average pitch pace isn’t representative of the central tendency of the data set; the median works a lot better in this situation to describe the most likely outcome.

The pitch pace with the highest frequency with the bases empty is the 15-20 second range, while the most frequent pitch pace bumps up to 20-25 seconds when runners are on base. MLB is kicking around the idea of having a 20 second pitch clock. From the distribution, it becomes apparent that keeping the pace to under 20 seconds would have an impact on the pitch pace of play.

I created a box plot to show another perspective of the distributions. The mean of the runners on base pitch pace is significantly higher than the mean of the pitch pace with bases empty.

Data Background

PITCH/fx data isn’t designed to accurately measure the time between pitches; it has some problems. A human operator is needed to enter data on each pitch such as ball/strike, information about the hit or if runs scored. For this reason, the data is very messy. It has problems where subtracting the time of each subsequent pitch from the pitch prior yields negative numbers because of the operator entered the previous pitch after the pitcher threw the next pitch. For these reasons I have to re-examine cleaning and processing the PITCH/fx data.

Further Work

I need to clean the data further. This will include identifying and excluding first pitches from at-bats and aggregating each at-bat. This should alleviate some of the delay problems associated with the human entry component of PITCH/fx.

I want to look at leverage’s impact on the pitch pace. My initial analysis is that leverage doesn’t matter all too much when you consider if there’s a player on base or not since leverage and having a player on base are collinear. With cleaner data the effect of leverage or post season play might be more apparent.

I’m going look at the time between innings. This should change depending on the broadcast; national broadcasts have longer commercial breaks. There also should be artifacts for weather delays.

Pitching changes should also be included. Inning breaks with new pitchers tend to be longer, it would be nice to see how much longer they are on the aggregate.

All of these need to be programmed into a parser that looks at the data sequentially. My plan is to update this page once I have more research available.