There are three doors. And hidden behind them are two goats and a car. Your objective is to win the car. Here’s what you do:
- Pick a door.
- The host opens one of the doors you didn’t pick that has a goat behind it.
- Now there are just two doors to choose from.
- Do you stay with your original choice or switch to the other door?
- What’s the probability you get the car if you stay?
- What’s the probability you get the car if you switch?
It’s not a 50/50 choice. I won’t digress into the math behind it, but instead let you play with the simulator below. The game will tally up how many times you win and lose based on your choice.
What’s going on here? Marilyn vos Savant wrote the solution to this game in 1990. You can read vos Savant’s explanations and some of the ignorant responses. But in short, because the door that’s opened is not opened randomly, the host gives you additional information about the set of doors you didn’t choose. Effectively, if you switch, you are select all the other doors. If you choose to stay, you are select just one door.
In her answer, she suggests:
Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?
To illustrate that in the simulation, you can increase number of number of doors in the simulator. It becomes pretty clear that switch is the correct choice.
Finally, here’s some Kevin Spacey:
Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings to a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, Both X & Y will have positive covariance. Negative covariance will result from having two variables move in opposite directions, and zero covariance will result from the variables have no relationship with each other. Variance is also a specific case of covariance where both input variables are the same. All of this is rather abstract, so let’s look at more concrete definitions.
Covariance — Summation Notation
The definition you will find in most introductory stat books is a relatively simple equation using the summation operator (Σ). This shows covariances as the sum of the product of a paired data point relative to its mean. First, you need to find the mean of both variables. Then take all the data points and subtract the mean from its respective variable. Finally, you multiply the differences together
N is the number of data points in the population.
n is the sample number. μX is the population mean for X; μY for Y.
Ȳ are the mean as well but this notation designates it as a sample mean rather than a population mean. Calculating the covariance of any significant data set can be tedious if done by hand, but we can set-up the equation in R and see it work. I used modified version of Anscombe’s Quartet data set.
#get a data set
X <- c(anscombe$x1, 6,4,10)
Y = c(anscombe$y1, 10,8,6)
#get the means
X.bar = mean(X)
Y.bar = mean(Y)
#calculate the covariance
sum((X-X.bar)*(Y-Y.bar)) / (length(X)) #manually population
sum((X-X.bar)*(Y-Y.bar)) / (length(X) - 1) #manually sample
cov(X,Y) #built-in function USES SAMPLE COVARIANCE
Obviously, since covariance is used so much within statistics, R has a built-in function
cov(), which yields the sample covariance for two vectors or even a matrix.
Covariance — Expected Value Notation
[Trying to explain covariance in expected value notation makes me realize I should back up and explain the expected value operator, but that will have to wait for another post. Quickly and oversimplified, the expect value is the mean value of a random variable.
E[X] = mean(X). The expected value notation below describes the population covariance of two variables (not sample covariance):
The above formula is just the population covariance written differently. For example,
E[X] is the same as μx. And the
E acts the same as taking the average of
(X-E[X])(Y-E[Y]). After some algebraic transformations you can arrive at the less intuitive, but still useful formula for covariance:
This formula can be interpreted as the product of the means of variables X and Y subtracted from the average of signed areas of variables X and Y. This probably isn’t very useful if you are trying to interpret covariance. But you’ll see it from time to time. And it works! Try it in R and compare it to the population covariance from above.
mean(X*Y) - mean(X)*mean(Y) #expected value notation
Covariance — Signed Area of Rectangles
Covariance can also be thought of as the sum of the signed area of the rectangles that can be drawn from the data points to the variables respective means. It’s called the signed area because we will get two types of rectangles, ones with a positive value and ones with negative values. Area is always a positive number, but these rectangles take on a sign by virtue of their geometric position. This is more of an academic exercise, in that it provides an understanding of what the math is doing and less of a practical interpretation and application of covariance. If you plot paired data points, in this case we will use the X and Y variables we have already used, you can tell just be looking there is probably some positive covariance because it looks like there is a linear relationship in the data. I’ve chosen to scale the plot so that zero is not included. Since this is a scatter plot including zero isn’t necessary.
First, we can draw the lines for the means of both variables as straight lines. These lines effectively create a new set of axes and will be used to draw the rectangles. The sides of the rectangles will be the difference between a data point and it’s mean [
Xi - X̄]. When that is multiplied by [
Yi - Ȳ], you can see that gives you an area of a rectangle. Do that for every point in your data set, add them up and divide by the number of data points, and you get the population covariance.
The following is a plot has a rectangle for each data point, and it is coded red for negative and blue for positive signs.
The overlapping rectangles need to be considered separately so the opacity is reduced so that all the rectangles are visible. For this data set there is much more blue area than there is red area, so there is positive covariance, which jives with what we calculated earlier in R. If you were to take the areas of those rectangles and add/subtract according to the blue/red color then divide by the number of rectangles, you would arrive the population covariance: 3.16. To get the sample covariance you’d subtract one from the number of rectangles when you divide.
Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.
Covariance As Signed Area Of Rectangles. http://www.davidchudzicki.com/posts/covariance-as-signed-area-of-rectangles/
How would you explain covariance to someone who understands only the mean?
The signed area of rectangles on Chudzicki’s site and statexchange use a different covariance formulation, but similar concept than my approach.
The full code I used to write up this tutorial is available on my GitHub .
Sports have a constant uncertainty and randomness in every aspect of the game including determining champions. This is one area you wouldn’t expect to have a lot of variability, since you would want the team that has the best roster composition and played the hardest to win the championship. This concept is usually brought up in the arguments against the one-game Wild Card round that MLB introduced in 2012 saying there’s too much that can happen in one game to determine the fate of a season. [The counter-argument to this specifically is the division winners now have a reward for winning the division, besides having cool sweatshirts.]
The Sports Side
The basis for championship series in MLB, NBA, and NHL is the an odd-numbered of games series with the champion being the team that wins the majority of games in those series. Most sports use a 7-game series; so for example the Boston Red Sox had to win 4 games to win the World Series last year. Using an example of randomness I got from Leonard Mlodinow’s The Drunkard’s Walk: How Randomness Rules Our Lives, I can illustrate how a team that’s clearly an underdog can win a playoff series against a superior opponent. Mlodinow has a recorded lecture where he explains what he wrote in his book. [It’s a good book, you should read it.]
Let’s use two teams; one is the Favorite, and it is assumed they will beat the Underdog 55% of the time [given enough games]. This also means that the Underdog will win 45% of the time. This represents win probabilities more uneven that you are likely to find in a playoff game since teams are typically much more evenly matched [at least in baseball]. The last assumption of this example is that the teams win probabilities don’t change with a different starting pitcher or home field/court/ice advantage. These are terrible assumptions if you wanted to project real playoff series, but the underlying principle of random sequencing will still hold true.
In order to win the playoff series, a team has to win a certain number of games before the opponent wins that number. To model this distribution based on pure randomness, you can use the negative binomial distribution to determine the probability that the Favorite will win a 7-game playoff series in 4, 5, 6, or 7 games. If you wanted to design a playoff series to minimize the chances that the underdog will win, you’d want to choose a number of games which would have the smallest probability of the underdog winning the series.
This chart shows the probability for all 8 possible outcomes of a 7-game playoff series based of a 55/45 winning percentage split and pure randomness with no home-field advantage. As you can see there’s a substantial chance [39% probability] that the Underdog will win a 7-game series. 39% is rather large, and this is a 7-game series. Baseball also employs a 5-game series for their division series (LDS) and a one-game playoff for the the Wild Card round (WC). The chances of an upset becomes more likely as the number of games decreases. I’ve also added another set of teams (60/40 split — greater disparity) for comparison’s sake.
It should be obvious that the 1-game series has the greatest chance of an upset, hence the objections to its use in baseball. Though my contention would be that a 3-game series does not offer much more certainty that the best team will win.
The Math Side
I first calculated these probabilities by writing out all the possible combinations then adding up those probabilities. I have since realized there was a much easier way to determine these probabilities, and that is by using the negative binomial distribution (NBD). If you want to familiarize yourself with what the distribution represents please read the count data primer. In short, the NBD will determine the probability that a team will lose a certain number of games [0-3] before the other team wins 4 games. The NBD is defined by the following function:
where X is the random variable whose probability we are calculating, k is number of Team A losses [this will vary], r is the number of Team A wins [for the 7-game series, it will be 4 games], and p is the probability of Team A winning. In the case of this example we will be determining the probability of Team A winning a 7-game series, when Team A has a 55%/45% advantage over Team B.
This is the probability for just one possible outcome, Team A wins the series in 6 games. To determine the probability that Team A wins the series, you add the probabilities for each outcome Team A wins in 4, 5, 6, or 7 games. So this calculation then repeated for every loss possibility:
From these calculations, there is a 60.83% chance that the Team A wins just by randomness. Conversely, there is a 39.17% [100% – 60.83%] chance that Team B, the inferior team, wins because of random sequencing.
The MLB Wild Card game rightfully gets criticized for being too susceptible to having a bad day or getting a bad bounce. I wanted to illustrate that any playoff series has a lot of randomness in it. Beyond the numbers, people remember the bad bounces way more than they remember the positive or neutral events that occur [negativity bias]. A bad bounce or a pitcher having a bad day could easily benefit the team you are rooting for. The only real way to root out the randomness you would need to play hundreds of games, and somehow I don’t think that is feasible.
One of the more fan-accessible advanced stats are playoff odds [technically postseason probabilities]. Playoff odds range from 0% – 100% telling the fan the probability that a certain team will reach the MLB postseason. These are determined by creating a Monte Carlo simulation which runs the baseball season thousands of times [FanGraph runs theirs 10,000 times]. In those simulations, if a team reaches the postseason 5,000 times, then the team is predicted to have a 50% probability for making the postseason. FanGraphs and Baseball Prospectus run these every day, so playoff odds can be collected every day and show the story of a team’s season if they are graphed.
Above is a composite graph of the three different types of teams. The Dodgers were identified as a good team early in the season and their playoff odds stayed high because of consistently good play. The Brewers started their season off strong but had two steep drop offs in early July and early September. Even though the Brewers had more wins than the Dodgers, the FanGraphs playoff odds never valued the Brewers more than the Dodgers. The Royals started slow and had a strong finish to secure themselves their first postseason birth since 1985. All these seasons are different and their stories are captured by the graph. Generally, this is how fans will remember their team’s season — by the storyline.
Since the playoff odds change every day and become either 100% or 0% by the end of the season, the projections need to be compared to the actual results at the end of the season. The interpretation of having a playoff probability of 85% means that 85% of the time teams with the given parameters will make the postseason.
I gathered the entire 2014 season playoff odds from FanGraphs, put their predictions in buckets containing 10% increments of playoff probability. The bucket containing all the predictions for 20% bucket means that 20% of all the predictions in that bucket will go on to postseason. This can be applied to all the buckets 0%, 10%, 20%, etc.
Above is a chart comparing the buckets to the actual results. Since this is only using one year of data and only 10 teams made the playoffs, the results don’t quite match up to the buckets. The desired pattern is encouraging, but I would insist on looking at multiple years before making any real conclusions. The results for any given year is subject to the ‘stories’ of the 30 teams that play that season. For example, the 2014 season did have a team like the 2011 Red Sox, who failed to make the postseason after having a > 95% playoff probability. This is colloquially considered an epic ‘collapse’, but the 95% probability prediction not only implies there’s chance the team might fail, but it PREDICTS that 5% of the teams will fail. So there would be nothing wrong with the playoff odds model if ‘collapses’ like the Red Sox only happened once in a while.
The playoff probability model relies on an expected winning percentage. Unlike a binary variable like making the postseason, a winning percentage has a more continuous quality to the data, so this will make the evaluation of the model easier. For the most part most teams do a good job staying around the initial predicted winning percentage coming really close to the prediction by the end of the season. Not every prediction is correct, but if there are enough good predictions the predictive model is useful. Teams also aren’t static, so bad teams can become worse by trading away players at the trade deadline or improve by acquiring those good players who were traded. There are also factors like injuries or player improvement, that the prediction system can’t account for because they are unpredictable by definition. The following line graph allows you to pick a team and check to see how they did relative to the predicted winning percentage. Some teams are spot on, but there are a few like the Orioles or Red Sox which are really far off.
The residual distribution [the actual values – the predicted values] should be a normal distribution centered around 0 wins. The following graph shows the residual distribution in numbers of wins, the teams in the middle had their actual results close to the predicted values. The values on the edges of the distribution are more extreme deviations. You would expect that improved teams would balance out the teams that got worse. However, the graph is skewed toward the teams that become much worse implying that there would be some mechanism that makes bad teams lose more often. This is where attitude, trades, and changes in strategy would come into play. I’d would go so far to say this is evidence that soft skills of a team like chemistry break down.
Since I don’t have access to more years of FanGraphs projections or other projection systems, I can’t do a full evaluation of the team projections. More years of playoff odds should yield probability buckets that reflect the expectation much better than a single year. This would allow for more than 10 different paths to the postseason to be present in the data. In the absence of this, I would say the playoff odds and predicted win expectancy are on the right track and a good predictor of how a team will perform.
Probability and odds are two basic statistic terms to describe the likeliness that an event will occur. They are often used interchangeably in causal conversation or even in published material. However, they are not mathematically equivalent because they are looking at likeliness in different contexts. In everyday conversation when numbers or values aren’t given, the two terms are synonymous . If an event has a high probability, then it has high odds for happening. The incorrect usage arises when a person ascribes a mathematical value to either the odds or probability they are discussing. Hopefully, if you aren’t quite sure what the exact mathematical difference is, this will clear it up for you.
Probability is defined as the fraction of desired outcomes in the context of every possible outcome with a value between 0 and 1, where 0 would be an impossible event and 1 would represent an inevitable event. Probabilities are usually given as percentages. [ie. 50% probability that a coin will land on HEADS.] Odds can have any value from zero to infinity and they represent a ratio of desired outcomes versus the field. Odds are a ratio, and can be given in two different ways: ‘odds in favor’ and ‘odds against’. ‘Odds in favor’ are odds describing the if an event will occur, while ‘odds against’ will describe if an event will not occur. If you are familiar with gambling, ‘odds against’ are what Vegas gives as odds. More on that later. For the coin flip odds in favor of a HEADS outcome is 1:1, not 50%.
Simple probability of event A occurring is mathematically defined as:
The best way to illustrate this is with the classic marbles-in-a-bag example. The graphic below depicts all the marbles in an opaque bag that one marble will be pulled out of. There are 6 blue, 3 red, 2 yellow, and 1 green for a total of 12 marbles in the bag.
The probability of pulling a red marble would be calculated by taking the total number of red marbles and dividing it by the total number of marbles.
Notice that the probability calculation includes the red marbles in the denominator of the calculation, because probability considers the context of the entire event space. Odds, on the other hand, are the ratio of favorable outcomes to unfavorable outcomes. The denominator contains ONLY the marbles that aren’t the favorable outcomes. Odds uses the contexts of good outcomes and bad outcomes. Written as fractions, these two values are completely different. Probability is 1/4 while odds in favor are 1/3. You can see how mistakenly interchanging the terms could give the wrong information. The ‘odds in favor’ of RED would be mathematically calculated by
To find ‘odds against’ you would simply flip odds in favor upside down and this describes the odds of the event not occurring.
‘Odds against’ are commonly are used in the context of gambling. When you hear that the Seattle Seahawks Vegas odds to win the Super Bowl are 5:1 [Retrieved 9/19/2014], the 5:1 is referring to the ‘odds against’ Seattle winning the Super Bowl. Using some quick math we could determine the probability of Seattle winning the Super Bowl would be 1/6 or 16.7%.
Vegas odds are technically payoff odds, because they describe the payout if you were to win the bet. The payout on the Seahawks would win you $5 for every $1 bet on the Seattle winning the Super Bowl. They aren’t true odds, since no one is really sure what the true odds are, because you can’t simply count and weigh the possibilities like with the bag of marbles. The payoff will increase when the event becomes less likely. If you could create a reliable predictive model that told you the Seahawks actually had a 20% probability to win the Super Bowl, you could bet on the Seahawks, knowing that their actual probability to win is better than what Vegas is giving them. And if you made enough bets like this you could beat Vegas.
I stated earlier that probability and odds were colloquially interchangeable when values aren’t given. This is true, because the two are mathematically related. Odds can be computed from probability and probability from odds.
Using the RED marble example [P(RED) = 1/4 and Odds_Favor(RED) = 1/3] we can demonstrate how these are equivalent: