MLB — Pace of Play [Working Post]
This post is a work in progress. The data concerning the pace of play is rather messy and this project is rather large compare to what I normally tackle. For that reason I’m going start this post and update it as a ‘working post’. Please feel free to contact me if anyone has any input: @seandolinar on Twitter or
sean.dolinar@gmail.com
Having collected the time between pitches from PITCH/fx, I was able to look at the different factors that affect how long pitchers took between plays. [I’m defining this as the pitch pace.] PITCH/fx has a time stamp associated with each pitch. Using that time stamp, I was able to calculate the time between each pitch. I used the resulting calculation combined with other information available about each at-bat to draw some conclusions about what affects pace of play.
The most obvious influence on the time between pitches is whether or not there was a baserunner. This was rather simple to explore since PITCH/fx provides information on whether or not there is a runner on 1B, 2B, or 3B. Using this I was able to create the following table of median pitch pace. [I’ll explain why I decided to use the median and not the mean/average later.]
The data matches what your experience with baseball suggests. Pitchers will slow down the game when there is a runner on base. This will happen for several reasons: run-game tactics, conferences on the mound, and even time for the ball to get back to the pitcher after the play. Given the fact there is a slight drop off for when there isn’t an open base or there are two outs, I would conclude that the run-game prevention tactics play a rather significant role in the pitch pace.
The distribution of pitch pace data shows how often pitchers take 5-10 seconds, 10-15 seconds, 15-20 seconds, etc. between pitches. Both distributions are highly skewed right, so the average pitch pace isn’t representative of the central tendency of the data set; the median works a lot better in this situation to describe the most likely outcome.
The pitch pace with the highest frequency with the bases empty is the 15-20 second range, while the most frequent pitch pace bumps up to 20-25 seconds when runners are on base. MLB is kicking around the idea of having a 20 second pitch clock. From the distribution, it becomes apparent that keeping the pace to under 20 seconds would have an impact on the pitch pace of play.
I created a box plot to show another perspective of the distributions. The mean of the runners on base pitch pace is significantly higher than the mean of the pitch pace with bases empty.
Data Background
PITCH/fx data isn’t designed to accurately measure the time between pitches; it has some problems. A human operator is needed to enter data on each pitch such as ball/strike, information about the hit or if runs scored. For this reason, the data is very messy. It has problems where subtracting the time of each subsequent pitch from the pitch prior yields negative numbers because of the operator entered the previous pitch after the pitcher threw the next pitch. For these reasons I have to re-examine cleaning and processing the PITCH/fx data.
Further Work
I need to clean the data further. This will include identifying and excluding first pitches from at-bats and aggregating each at-bat. This should alleviate some of the delay problems associated with the human entry component of PITCH/fx.
I want to look at leverage’s impact on the pitch pace. My initial analysis is that leverage doesn’t matter all too much when you consider if there’s a player on base or not since leverage and having a player on base are collinear. With cleaner data the effect of leverage or post season play might be more apparent.
I’m going look at the time between innings. This should change depending on the broadcast; national broadcasts have longer commercial breaks. There also should be artifacts for weather delays.
Pitching changes should also be included. Inning breaks with new pitchers tend to be longer, it would be nice to see how much longer they are on the aggregate.
All of these need to be programmed into a parser that looks at the data sequentially. My plan is to update this page once I have more research available.