Count Data Distribution Primer — Binomial / Negative Binomial / Poisson

Count data is exclusively whole-number data where each increment represents one of something. It could be a car accident, a run in baseball, or an insurance claim. The critical thing here is that these are discrete, distinct items. Count data behaves differently than continuous data, and the distribution [the frequency of different values] is different between the two. Random continuous data typically follows the normal distribution, which is the bell curve everyone remembers from high school grade systems. [Which is a really bad way to grade, but I digress.] Count data generally follows the Binomial, Negative Binomial, or Poisson distribution, depending on the context in which you are viewing the data; all three distributions are mathematically related.

Binomial Distribution:

The binomial distribution (BD) is the collection of probabilities of getting a certain number of successes in a given number of Bernoulli trials [a yes/no event similar to a coin flip, but it’s not necessarily 50/50]. My favorite example for understanding the binomial distribution is using it to determine the probability that you’d get exactly 5 HEADS if you flipped a coin 10 times [it’s NOT 50%!].

It’s actually 24.61%. The probability of getting heads in any given coin flip is 50%, but over 10 flips, you’ll only get exactly 5 HEADS and 5 TAILS about 25% of the time. The equation below gives the two popular notations for the binomial probability mass function. $n$ is the total number of trials [the graph above used $n$ = 10]. $r$ is the number of successes you want to know the probability for. You calculate this function for each number of HEADS [0-10] for $r$ to get the distribution above. $p$ is the probability of success on each individual trial [$p$ = .5 for the coin flip].

$P(X=r) = {{n}\choose{r}} p^{r} (1-p)^{n-r} = \frac{n!}{r!(n-r)!} p^{r} (1-p)^{n-r}$

The equation has three parts. The first part is the combination ${{n}\choose{r}}$, which is the number of combinations when you have $n$ total items taken $r$ at a time. Combinations disregard order, so the set {1, 4, 9} is the same as {4, 9, 1}. This part of the equation tells you how many possible ways there are to get to a certain outcome, since there are many ways to get 5 HEADS in 10 tosses. With a fair coin every individual sequence of 10 flips is equally likely, so because ${{10}\choose{5}}$ is larger than any other combination, 5 HEADS will have the largest probability.

There are two more terms in the equation. $p^r$ is the joint probability of getting $r$ successes in a particular order, and $(1-p)^{n-r}$ is the corresponding probability of getting the $n-r$ failures, also in a particular order. I find it helpful to conceptualize the equation as having three parts accounting for different things: the total combinations of successes and failures, the probability of the successes, and the probability of the failures.
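To make the three parts concrete, here is a minimal Python sketch of the binomial probability mass function. The function name `binomial_pmf` and its variable names are my own choices for illustration; the only library call is `math.comb` from the standard library.

```python
from math import comb

def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials."""
    ways = comb(n, r)                   # orderings that contain r successes
    success_part = p ** r               # r successes in one particular order
    failure_part = (1 - p) ** (n - r)   # the remaining n - r failures
    return ways * success_part * failure_part

# Distribution of HEADS in 10 flips of a fair coin
for r in range(11):
    print(r, round(binomial_pmf(r, 10, 0.5), 4))
```

The row for 5 HEADS works out to 252/1024, which is the 24.61% quoted earlier.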

Negative Binomial Distribution:

While there is a good reason for it, the name of the negative binomial distribution (NBD) is confusing. Nothing I will present will involve making anything negative, so let’s just get that out of the way and ignore it. The binomial distribution uses the probability of successes in the total number of ATTEMPTS. To contrast this, the negative binomial distribution uses the probability that a certain number of FAILURES occur before the $r$th SUCCESS. This has many applications, specifically when a sequence terminates after the $r$th success, such as modeling the probability that you will sell out of the 25 cups of lemonade you have stocked for a given number of cars that pass by. The idea is that you would pack up your lemonade stand after you sell out, so cars that pass by after the final success won’t matter. Another good example is modeling the win probability of a 7-game sports playoff series. The team that wins the series must win 4 games, and specifically the last game played in the series, since the playoff series terminates after one team reaches 4 wins.

One of the more important restrictions on the NBD is that the last event must be a success. Going back to the sports playoff series example, the team that wins the series will NEVER lose the last game. With the 10 coin-flip example, the BD was looking for the probability of getting a certain number of HEADS within a set number of coin flips. Using the NBD, we will look for the probability of getting a certain number of TAILS before the 5th HEADS. The total number of flips is no longer fixed at 10; it can be as few as 5 and can exceed 10, as seen below.

The probability mass function that describes the NBD graph above is given below:

$P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}$

The equation for the NBD has the same parts as the BD: the combinations, the successes, and the failures. In the NBD case there are fewer combinations than in the BD [for the same total number of coin flips]. This is because the last outcome is held fixed at a success, so only the first $r+k-1$ outcomes can be rearranged. The probability of success and failure parts of the equation are conceptually the same as in the BD. The failure portion is written differently because the number of failures is a parameter $k$ instead of a derived quantity like [$n-r$].
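As a rough sketch of the NBD equation, the Python below computes the probability of $k$ failures before the $r$th success, and then reuses it for the 7-game series example. The function names are hypothetical, and the series calculation assumes every game is independent with the same win probability, which is a simplification I am adding for illustration.

```python
from math import comb

def neg_binomial_pmf(k: int, r: int, p: float) -> float:
    """Probability of exactly k failures before the r-th success."""
    # The final trial is fixed as a success, so only the first
    # r + k - 1 trials can be rearranged.
    ways = comb(r + k - 1, k)
    return ways * p ** r * (1 - p) ** k

def series_win_probability(p: float, wins_needed: int = 4) -> float:
    """Chance of reaching wins_needed wins before the opponent does.

    Assumes independent games with a constant win probability p.
    """
    # The series winner can lose 0 to wins_needed - 1 games along the way.
    return sum(neg_binomial_pmf(k, wins_needed, p) for k in range(wins_needed))

# TAILS seen before the 5th HEADS with a fair coin
for k in range(11):
    print(k, round(neg_binomial_pmf(k, 5, 0.5), 4))

print(series_win_probability(0.6))
```

For 5 TAILS before the 5th HEADS (10 flips in total) the combination term is ${{9}\choose{5}} = 126$, half of the ${{10}\choose{5}} = 252$ arrangements the BD counts for 5 HEADS in 10 flips.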

Poisson Distribution:

The Poisson Distribution (PD) is directly related to both the BD and the NBD because it is the limiting case of both of them. As the number of trials (for the BD) or the number of required successes (for the NBD) goes to infinity while the mean is held fixed, the Poisson distribution emerges. The graph for the PD will look similar to the NBD or the BD, but there is no coin-flip example for comparison, since the PD describes events arriving in a continuous process like traffic flow or earthquakes rather than a fixed number of discrete trials. The major difference is not what is represented, but how it is viewed and calculated. The Poisson distribution is described by the equation:

$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$

$\lambda$ is the expected value [or the mean] of the number of events in the interval and $x$ is the count value. If you knew that an average of 0.2 car crashes happen at an intersection on a given day, then you could solve the equation for $x$ = {0, 1, 2, 3, 4, 5, … } and get the PD for the problem.

One of the restrictions and major issues with the use of the PD is that the model assumes the mean and the variance are equal. In most real data the variance is greater than the mean, so the PD concentrates more probability around the expected value than the real data reflects.
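Here is a small Python sketch of the car crash example, along with a numerical look at the limiting-case relationship to the BD. The function names are my own, and the comparison simply holds the binomial mean $np$ at $\lambda$ while $n$ grows, which is the assumption under which the limit applies.

```python
from math import comb, exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x events when lam events are expected."""
    return exp(-lam) * lam ** x / factorial(x)

def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

lam = 0.2  # average crashes per day at the intersection

# Poisson distribution for 0 through 5 crashes in a day
for x in range(6):
    print(x, round(poisson_pmf(x, lam), 5))

# Binomial probability of exactly 1 success with n * p held at lam
for n in (10, 100, 10_000):
    print(n, round(binomial_pmf(1, n, lam / n), 5), round(poisson_pmf(1, lam), 5))
```

As $n$ grows, the binomial column should settle toward the Poisson value in the last column.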
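The equal mean and variance property can be checked numerically. The sketch below computes both moments directly from the probability mass functions and, for contrast, does the same for an NBD tuned to the same mean. The parameter values ($\lambda$ = 3, $r$ = 2, $p$ = 0.4) are arbitrary choices of mine for illustration, and the sums are truncated at 100, which leaves a negligible tail for these values.

```python
from math import comb, exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    return exp(-lam) * lam ** x / factorial(x)

def neg_binomial_pmf(k: int, r: int, p: float) -> float:
    return comb(r + k - 1, k) * p ** r * (1 - p) ** k

def mean_and_variance(pmf, support):
    """Mean and variance of a distribution over a finite support."""
    mean = sum(x * pmf(x) for x in support)
    variance = sum((x - mean) ** 2 * pmf(x) for x in support)
    return mean, variance

support = range(100)  # truncated support; assumed large enough here

pois_mean, pois_var = mean_and_variance(lambda x: poisson_pmf(x, 3.0), support)
nb_mean, nb_var = mean_and_variance(lambda k: neg_binomial_pmf(k, 2, 0.4), support)

print("Poisson:", round(pois_mean, 3), round(pois_var, 3))
print("NBD:    ", round(nb_mean, 3), round(nb_var, 3))
```

Both distributions have a mean of 3, but the NBD variance comes out at 7.5 while the Poisson variance stays at 3.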

If you are interested in the derivations and math behind these I recommend this site: http://statisticalmodeling.wordpress.com/. I feel like they explain the derivation of the negative binomial better than most places I’ve found. It addresses why it’s called the NEGATIVE binomial distribution as well. The site also contains derivations of the PD being the limiting case of the BD and NBD.

Against All Odds — Upsets

It’s a great time to be a Dayton fan! It’s the first time the school has reached the Sweet Sixteen since before all the Dayton fans I know were born…and they did it as an 11 seed! Their game against Ohio State was the first game of the tournament to tip. A little over two hours later, everyone’s brackets were busted. After watching the tournament this weekend, I felt like there were a lot of upsets this year. (Or at the very least my bracket was getting busted up pretty quickly.) But are there really more upsets this year than normal?

First, I’m going to define an upset as any lower seed beating a higher seed. I’m of the personal belief that 8/9, 4/5, and 1/2 match-ups shouldn’t count as upsets, but for this analysis, I’m going to consider these as possible upsets. Now let’s look at how many upsets there were this year. Through two rounds, there have been 13 upsets. That’s one fewer than last year at this time, and right at the average (if you round). So this is a rather average year. Three of four 1-seeds are still alive, which is not too different from what you might expect.

NCAA Tournament Upsets By Year

Historically, 1999 had the most upsets with 19 in the first weekend of play. Nothing that year really stuck out the way Florida Gulf Coast did when it got to the Sweet Sixteen as a 15-seed last year. 1991 had the fewest upsets in the first weekend with just nine. All the upsets throughout the years appear to be random noise fluctuating around an average of 12.8 upsets (out of 48 games played) in the first two rounds per year. The conclusion you can draw from this is that the number of upsets is rather consistent over the years, with not much systematic change from year to year.

Thirteen upsets is a lot; it’s more than a quarter of all the games played this weekend (13 of 48). Last week, I posted the probability that a seed would win in the first round of the tournament. This was a linear relationship starting with an almost certain probability for the 1-seeds and then going to a 50/50 split for an 8/9 game. On the surface it doesn’t seem like more than a quarter of the games would be upsets, but if you look at all the possibilities it will make more sense.

Let’s look at Dayton’s 11-seed. An 11-seed has a historical 34% chance of upsetting a 6-seed in the first round, but when you consider that there are four distinct 6-11 seed match-ups each year, there’s only a 19% chance that all 6-seeds will win their first round games. In fact, the most likely scenario is that just one 11-seed will upset a 6-seed. This year there were two 6-11 upsets, which is the second most likely scenario at 30% (still more likely than not getting any upsets).

The following table depicts the probability of different scenarios for each first round seeding combination.  All the green area on the table is why everyone’s brackets bust every year.  Keep reading if you are interested in the math, otherwise you might want to bounce, because it’s gonna get boring.

NCAA Tournament First Round Win Probability by Seed

Still here? Ok. The basis for determining the probability of the upset scenario is the binomial distribution. A binomial distribution requires two things: a binary outcome (hence the bi- prefix) and a set probability of how that outcome is achieved. The simplest example of a binomial distribution is determining the probability of getting a given number of heads in successive coin flips. The probability function is given as

$P(X=k) = {{n}\choose{k}} p^{k} q^{n-k}$

The ${{n}\choose{k}}$ term is the combination of $n$ terms taken $k$ at a time.
$p$ is the probability of the event happening, here the probability that the lower seed wins.
$q$ is the complement of the event, $q = 1-p$, so in this case it would be the probability of the lower seed losing.
$n$ will be 4 since there are four games for a seed match-up.
$k$ will be 0-4 depending on how many upsets we are looking for.

The probability that two (and only two) 11-seeds upset 6-seeds is

$P(X=2) = {{4}\choose{2}} (0.34)^{2} (1-0.34)^{2} = 6 \times 0.34^{2} \times 0.66^{2} = 0.302 \approx 30\%$
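The same calculation can be repeated for every possible number of 6-11 upsets. This is a minimal Python sketch; `upset_pmf` is a hypothetical helper name, and the defaults of four games and a 34% upset chance come from the numbers above.

```python
from math import comb

def upset_pmf(k: int, n: int = 4, p: float = 0.34) -> float:
    """Probability of exactly k upsets in n independent match-ups."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

for k in range(5):
    print(f"{k} upsets: {upset_pmf(k):.1%}")

# 0 upsets: 19.0%
# 1 upsets: 39.1%
# 2 upsets: 30.2%
# 3 upsets: 10.4%
# 4 upsets: 1.3%
```

One upset is the most likely outcome, two is next, and a clean sweep by the 6-seeds happens less than one year in five.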

You can derive this equation by writing out probability trees (if you remember those from high school math). The problem with that method is that for each outcome (# of upsets = [0, 1, 2, 3, 4]) you have to write out the different combinations of games for each outcome. This can get unwieldy quickly. Binomial distributions can be used for many different applications, including the aforementioned coin flip, the likelihood of combinations of boy/girl babies, the probability that the ‘better’ team loses a 7-game playoff series, the likely number of winners for the lottery…so this will rear its head again for NHL, NBA, or MLB playoffs.
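For the curious, the probability tree approach can be sketched in Python by walking every branch and adding up the leaves. The function name `tree_probabilities` is my own, and the code assumes the four games are independent with the same 34% upset chance.

```python
from itertools import product
from math import comb

def tree_probabilities(n: int, p: float) -> dict:
    """Total probability for each upset count, found by enumerating
    every leaf of the probability tree (2**n branches)."""
    totals = {k: 0.0 for k in range(n + 1)}
    for outcome in product([True, False], repeat=n):
        branch = 1.0
        for upset in outcome:
            branch *= p if upset else (1 - p)
        totals[sum(outcome)] += branch  # True counts as 1 upset
    return totals

tree = tree_probabilities(4, 0.34)
for k, prob in tree.items():
    formula = comb(4, k) * 0.34 ** k * 0.66 ** (4 - k)
    print(k, round(prob, 4), round(formula, 4))
```

With four games there are only 16 branches, but the count doubles with every added game, which is why the combination term is the practical shortcut.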