I recently wrote a much more mathematically involved post using the negative binomial, along with a discrete probability distribution primer. Together they give a more complete treatment of the topic. However, this post is a good overview of the basics.
My friend sparked my recent interest in Poisson distributions by mentioning how rare it is to meet a romantic interest/significant other with whom you’ll have a long-term relationship, as opposed to going out for just a few dates or not dating at all. I immediately thought about earthquakes. It’s strange, but it makes some sense, since large-impact earthquakes are both very unpredictable and rare, much like dating. I’d love to show this actually happens, but since I can’t download relationship data, I’ve found something almost as good: baseball data!
A Poisson distribution is used for count data and rare events over a specified time/area. This is in contrast to the more familiar bell-curve normal distribution, which uses continuous data. [For math/science people, it’s a decaying exponential.] A few good examples of things a Poisson distribution could model are the number of sick days a person uses throughout a year or the number of traffic accidents per month on a certain stretch of road. Earthquake frequency modeling is probably one of the more famous uses of a Poisson distribution.
Getting back to baseball: runs are not common events, though I wouldn’t go so far as to call them rare. In the context of individual innings, however, runs are rare. Going back to a previous post about the Pirates’ run probability, any given team in MLB has only a 26% chance of scoring in any given inning. This means that roughly 74% of the time you are watching baseball, you are watching a team not score. I am interested in how often a team will score 0, 1, 2, 3 or more runs in an inning. To determine the probability that a certain number of runs is scored in any inning, a Poisson distribution can be used, and it follows the general form:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

where $k$ is the number of runs scored in the inning and $\lambda$ is the average number of runs per inning.
Substituting the Run Expectancy at the beginning of an inning, which was .4615 runs per inning in 2013, for the average gives the red distribution line below. [Run Expectancy/Expected Runs is a fancy way to say the average runs for a given situation.] The blue area represents the actual run frequencies, and the gold line is the distribution which I obtained from regression.
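To make the substitution concrete, here is a minimal sketch of the Poisson calculation using the .4615 figure quoted above. The function name `poisson_pmf` and the constant name are my own choices for illustration; only the rate comes from this post.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# 2013 run expectancy at the start of an inning, as quoted in the post.
RUNS_PER_INNING = 0.4615

# Model probability of scoring exactly 0, 1, 2, 3 or 4 runs in an inning.
for runs in range(5):
    print(f"{runs} runs: {poisson_pmf(runs, RUNS_PER_INNING):.4f}")

# Whatever probability is left over belongs to the "5 or more" tail.
tail = 1 - sum(poisson_pmf(k, RUNS_PER_INNING) for k in range(5))
print(f"5+ runs: {tail:.4f}")
```

Under this model a scoreless inning comes out at roughly 63%, which is noticeably lower than the observed scoreless share quoted earlier.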
The Poisson distribution describes how often runs are scored during innings pretty well, but it’s not perfect. The trend line underestimates the scoreless and big-run innings, while overestimating the one-run innings. The model shown above is suffering from overdispersion, which means the variance [how spread out the data is] is larger than what the model assumes. The short explanation for the lack of fit is that baseball isn’t completely random. You’ll have better teams who score multiple runs in an inning against poor teams, who will in turn fail to score any runs in an inning. The disparity between teams causes a wider variance in run scoring.
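The effect of mixing strong and weak teams can be sketched without any real data. The two scoring rates below are hypothetical numbers picked only for illustration, and the 50/50 split between them is also an assumption. The point is just that a blend of two Poisson distributions has a variance larger than its mean, whereas a single Poisson has the two equal.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def mean_and_variance(pmf, max_k=60):
    """Mean and variance of a count distribution, summed over 0..max_k-1."""
    mean = sum(k * pmf(k) for k in range(max_k))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(max_k))
    return mean, var


# Hypothetical rates, chosen only for illustration (not real team data).
STRONG_OFFENSE = 0.60
WEAK_OFFENSE = 0.30


def mixed_pmf(k):
    """Half of all innings come from each kind of team (an assumption)."""
    return 0.5 * poisson_pmf(k, STRONG_OFFENSE) + 0.5 * poisson_pmf(k, WEAK_OFFENSE)


mix_mean, mix_var = mean_and_variance(mixed_pmf)
# A single Poisson with the same mean, for comparison.
single_mean, single_var = mean_and_variance(lambda k: poisson_pmf(k, mix_mean))

print(f"mixture: mean={mix_mean:.4f} variance={mix_var:.4f}")
print(f"single:  mean={single_mean:.4f} variance={single_var:.4f}")
```

With these made-up rates the blend has a mean of 0.45 but a variance of about 0.47. It also puts more weight on zero runs and less on one run than a single Poisson with the same mean, which is the same direction of misfit described above.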
The gold line in the graph above is a distribution I obtained when I regressed the count data against the number of runs and obtained a ‘new’ mean. This distribution is a little bit closer to the empirical data, though it still suffers from overdispersion.
I’ve put all the counts and frequencies/probabilities into a table so that it is easier to reference. If you wanted to calculate the probability that an entire game (full 9 innings) would include an inning with 7 or more runs (like last night’s Pirates game), you would use the following formula:

$$P(\text{at least one 7+-run inning}) = 1 - P(\text{not having any 7+-run innings})$$
So there’s a roughly 2% chance that any baseball game you attend will have an inning with 7 or more runs scored in it.
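The at-least-one calculation can be sketched as follows, assuming innings are independent of one another so that the chance of no big inning is the per-inning chance multiplied together. The per-inning probability used here is a hypothetical placeholder, not the value from my table, and counting 18 half-innings (both teams batting 9 times) is also an assumption.

```python
def prob_at_least_one(p_inning, innings):
    """Chance of at least one 'hit' in a number of independent innings.

    p_inning: probability that a single inning is a 'hit' (e.g. 7+ runs).
    innings:  how many innings are played.
    """
    if not 0.0 <= p_inning <= 1.0:
        raise ValueError("p_inning must be between 0 and 1")
    # 1 minus the probability that every single inning misses.
    return 1.0 - (1.0 - p_inning) ** innings


# Hypothetical placeholder; swap in the observed 7+-run frequency.
P_BIG_INNING = 0.001
# Assumption: both teams bat in each of 9 innings.
HALF_INNINGS = 18

chance = prob_at_least_one(P_BIG_INNING, HALF_INNINGS)
print(f"Chance of at least one 7+-run inning: {chance:.2%}")
```

Plugging the empirical frequency of 7+-run innings into `P_BIG_INNING` is what ties this back to the figure quoted above.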