I’ve been in contact with the team over at Stattleship. They have a cool API that allows you to get various stats for basketball, football, and hockey. I used data from that API to create the following data visualization for their blog. The visualization shows the offensive and special teams yards gained by each team remaining in the playoffs. The yardage is totaled for the entire season as well as the one playoff game each team has played. I’ve displayed the points off of offensive TDs and special teams scoring, and that score is color coded by wins and losses: a black background is a win, and a white background is a loss.
Game 4 of the Penguins-Rangers series featured a brief overtime period that overshadowed the rest of the game as far as tweet volume goes. Rangers fans were more negative at the beginning of the game after the Penguins scored their first goal. Twitter volume picked up for both teams during the overtime period and Rangers fans’ tweets spiked when they won the game and continued throughout the night.
Unfortunately, my Twitter scraper wasn’t looking for the most viral story of the Penguins’ loss to the Rangers. [I’m not linking to it, but it involves a columnist and the Penguins’ GM.] I was able to get general sentiment over the course of the game. There isn’t too much to analyze. There are more Rangers tweets overall, most likely due to increased interest and a larger market, and I’ve annotated where there was scoring during the game. I’ll probably have a few more updates throughout the playoffs.
The Steelers-Ravens playoff game gave me a chance to test out a new analytics server and some of the tools I’ve been working on to make Twitter analysis easy using ad hoc Python scripts. So here goes:
There were a lot of Steelers- or Ravens-colored emojis: black and gold hearts or buttons, and the purple devils. Though for some reason the ‘crying my eyes out’ emoji is by far the most popular in this collection of tweets. The yellow line represents how many unique tweets featured that emoji. For example, 14 of the same emoji in one tweet would count as 14 in the blue bar, while it would count as just 1 toward the yellow line.
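The difference between the two counts can be sketched in a few lines of Python. The tweets below are made up for illustration; the real counts came from the captured data set:

```python
from collections import Counter

# Made-up tweets for illustration; the real counts came from the captured data.
tweets = [
    "That interception \U0001F62D\U0001F62D\U0001F62D",
    "\U0001F49B\U0001F5A4 Here we go Steelers \U0001F49B\U0001F5A4",
    "Cannot believe this game \U0001F62D",
]

total_uses = Counter()   # blue bar: every occurrence of an emoji counts
tweet_reach = Counter()  # yellow line: each tweet counts at most once per emoji

for tweet in tweets:
    emojis = [ch for ch in tweet if ord(ch) > 0x2600]  # crude emoji check
    total_uses.update(emojis)
    tweet_reach.update(set(emojis))

print(total_uses["\U0001F62D"])   # 4 -> blue-bar count
print(tweet_reach["\U0001F62D"])  # 2 -> yellow-line count
```

The 'crying my eyes out' emoji appears four times across the three sample tweets but in only two distinct tweets, which is exactly the gap between the blue bars and the yellow line in the chart.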
Here’s the hashtag use: #steelers exceeded #ravens. This looks cool, but it doesn’t tell you much.
Here’s a bar chart that’s a lot easier to read if you want the information.
Sports have constant uncertainty and randomness in every aspect of the game, including determining champions. This is one area where you wouldn’t expect a lot of variability, since you would want the team with the best roster that played the hardest to win the championship. This concept usually comes up in arguments against the one-game Wild Card round that MLB introduced in 2012: there’s too much that can happen in one game to determine the fate of a season. [The counter-argument to this, specifically, is that division winners now have a reward for winning the division, besides having cool sweatshirts.]
The Sports Side
The basis for a championship series in MLB, the NBA, and the NHL is an odd-numbered series of games, with the champion being the team that wins the majority of those games. Most sports use a 7-game series; for example, the Boston Red Sox had to win 4 games to win the World Series last year. Using an example of randomness I got from Leonard Mlodinow’s The Drunkard’s Walk: How Randomness Rules Our Lives, I can illustrate how a team that’s clearly an underdog can win a playoff series against a superior opponent. Mlodinow has a recorded lecture where he explains what he wrote in his book. [It’s a good book; you should read it.]
Let’s use two teams; one is the Favorite, which is assumed to beat the Underdog 55% of the time [given enough games]. This also means that the Underdog will win 45% of the time. These win probabilities are more uneven than you are likely to find in a real playoff game, since playoff teams are typically much more evenly matched [at least in baseball]. The last assumption of this example is that the teams’ win probabilities don’t change with a different starting pitcher or home field/court/ice advantage. These are terrible assumptions if you wanted to project a real playoff series, but the underlying principle of random sequencing still holds.
In order to win the playoff series, a team has to win a certain number of games before the opponent wins that number. To model this distribution based on pure randomness, you can use the negative binomial distribution to determine the probability that the Favorite will win a 7-game playoff series in 4, 5, 6, or 7 games. If you wanted to design a playoff series to minimize the chances that the underdog will win, you’d want to choose a number of games which would have the smallest probability of the underdog winning the series.
This chart shows the probability for all 8 possible outcomes of a 7-game playoff series based on a 55/45 winning-percentage split and pure randomness with no home-field advantage. As you can see, there’s a substantial chance [39% probability] that the Underdog wins a 7-game series. 39% is rather large, and this is a 7-game series. Baseball also employs a 5-game series for its division series (LDS) and a one-game playoff for the Wild Card round (WC). The chances of an upset become greater as the number of games decreases. I’ve also added another set of teams (a 60/40 split, a greater disparity) for comparison’s sake.
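The chart’s values can be reproduced in a few lines of Python. The helper name `series_win_prob` is mine, for illustration, not code from the original analysis:

```python
from math import comb

def series_win_prob(p, wins_needed):
    """Chance the favorite (per-game win probability p) takes the series:
    sum the negative binomial outcomes where the underdog wins k games
    (k = 0 .. wins_needed - 1) before the favorite's clinching win."""
    return sum(comb(k + wins_needed - 1, k) * p**wins_needed * (1 - p)**k
               for k in range(wins_needed))

for games in (1, 3, 5, 7):
    need = games // 2 + 1
    for p in (0.55, 0.60):
        print(f"{games}-game series, {p:.0%} favorite: "
              f"{1 - series_win_prob(p, need):.1%} upset chance")
        # e.g. 7-game series, 55% favorite: 39.2% upset chance
```

The output shows the same pattern as the chart: the shorter the series, the closer the upset chance climbs toward the underdog's single-game probability.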
It should be obvious that the 1-game series has the greatest chance of an upset, hence the objections to its use in baseball. Though my contention would be that a 3-game series does not offer much more certainty that the best team will win.
The Math Side
I first calculated these probabilities by writing out all the possible combinations and then adding up their probabilities. I have since realized there is a much easier way to determine these probabilities: the negative binomial distribution (NBD). If you want to familiarize yourself with what the distribution represents, please read the count data primer. In short, the NBD gives the probability that a team will lose a certain number of games [0–3] before the other team wins 4 games. The NBD is defined by the following function:

P(X = k) = C(k + r − 1, k) × p^r × (1 − p)^k
where X is the random variable whose probability we are calculating, k is the number of Team A losses [this will vary], r is the number of Team A wins [for the 7-game series, it will be 4 games], and p is the probability of Team A winning. In this example we are determining the probability of Team A winning a 7-game series, where Team A has a 55%/45% advantage over Team B.
For just one possible outcome, Team A winning the series in 6 games, Team B must win k = 2 games first, and the formula gives:

P(X = 2) = C(5, 2) × (0.55)^4 × (0.45)^2 ≈ 0.185

To determine the probability that Team A wins the series at all, you add the probabilities for Team A winning in 4, 5, 6, or 7 games, repeating this calculation for k = 0 through 3.
From these calculations, there is a 60.83% chance that Team A wins the series just by randomness. Conversely, there is a 39.17% [100% – 60.83%] chance that Team B, the inferior team, wins because of random sequencing.
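As a check, the formula can be coded directly in standard-library Python; `nbd_pmf` is just an illustrative name:

```python
from math import comb

def nbd_pmf(k, r, p):
    """P(X = k) = C(k + r - 1, k) * p**r * (1 - p)**k  (the NBD above)."""
    return comb(k + r - 1, k) * p**r * (1 - p)**k

# Team A wins the series in 6 games -> Team B won k = 2 games first
print(round(nbd_pmf(2, 4, 0.55), 4))   # 0.1853

# Team A wins the series at all: sum over k = 0..3
p_series = sum(nbd_pmf(k, 4, 0.55) for k in range(4))
print(round(p_series, 4))              # 0.6083
print(round(1 - p_series, 4))          # 0.3917
```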
The MLB Wild Card game rightfully gets criticized for being too susceptible to a bad day or a bad bounce, but I wanted to illustrate that any playoff series has a lot of randomness in it. Beyond the numbers, people remember the bad bounces far more than the positive or neutral events [negativity bias], yet a bad bounce or a pitcher having a bad day could just as easily benefit the team you are rooting for. To really root out the randomness, you would need to play hundreds of games, and somehow I don’t think that is feasible.
I’ve been listening to 93.7 The Fan while running the analysis for this, and I never realized that people can say the same thing over and over again but in slightly different ways. Also all tweets were captured AFTER THE CONCLUSION OF THE 1st PERIOD.
Everyone knows Twitter is the best venue to vent your anger about sports teams. I was able to use the statistical programming language R to scrape tweets that had certain keywords or hashtags in them, put them in a database, and then flag the tweets that contained certain keywords or collections of words. I had about 20 keywords, including: “penguins”, “pens”, “rangers”, “game 7”, and “firebylsma”. I also searched for the handles of the local hockey writers, because a lot of people reply to the sports writers during the game with their own opinions.
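The flagging step might look something like this minimal sketch. The original pipeline used R; the function and keyword list here are illustrative Python, not the actual code:

```python
import re

# A minimal sketch of the flagging step. The original pipeline used R;
# the function and keyword list here are illustrative, not the actual code.
KEYWORDS = ["penguins", "pens", "rangers", "game 7", "firebylsma"]
PATTERNS = {kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
            for kw in KEYWORDS}

def flag_tweet(text):
    """Return the keywords found in a tweet's text."""
    return [kw for kw, pat in PATTERNS.items() if pat.search(text)]

print(flag_tweet("LETS GO PENS!!! #firebylsma"))  # ['pens', 'firebylsma']
```

Each stored tweet gets tagged with every keyword it matches, so later queries (swearing, team mentions, writer replies) are just filters on those tags.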
The first graph has the total number of tweets that I scraped and the tweets that I flagged as ‘swearing’. For the most part, I feel that if someone swore in a tweet, it indicates anger or at the very least aggressiveness. As mentioned before, these graphs begin after the 1st period ends, and tweets containing the keyword ‘rangers’ have been filtered out of this first graph to capture a greater proportion of Penguins fans.
The quickest and most basic analysis is the number of tweets as a time-series. As soon as you look at the time-series line graph, you can tell when the game ends [9:41 PM]. It’s like Mt. Everest in the graph. Looking closer you can see the spikes where each team scores. An interesting occurrence happened right before the end of the game. Twitter got quiet. The tweets per minute dropped below 500 right before it exploded to a few thousand per minute. I am attributing this silence to people actually watching the game during the tense last minute.
The tweets peaked about 2 minutes after the game ended indicating a minimal lag which includes the time of picking up the smart phone, unlocking it, and composing the tweet. However, the angry, swearing tweets peaked right as the game ended indicating more visceral emotions instead of more thought-out, 140-character commentary. There are two severe dips that I can’t account for at 9:49 PM and 10:04 PM. If anyone knows something that occurred at this time, please let me know. Since there is a clear downward trend and the game was no longer being played, I am going to write those dips off as some technical difficulties that didn’t allow a lot of tweets to be sent at those times.
Since Twitter isn’t an invention specific to Penguins fans, I separated and compared two sets of tweets: one containing the word “penguins” and one containing “rangers”. There appear to be a lot more Rangers fans than Penguins fans, since the “rangers” tweets outnumber the “penguins” tweets at almost every point in time. The “penguins” tweets did spike when the Penguins scored their only goal. Interestingly enough, there was a lot of swearing right when that goal was scored. Penguins fans are just so angry!
Bottom line: calm down; it’s just sports.
It’s a great time to be a Dayton fan! It’s the first time the school has reached the Sweet Sixteen since before all the Dayton fans I know were born…and they did it as an 11-seed! Their game against Ohio State was the first game of the tournament to tip. A little over two hours later, everyone’s brackets were busted. After watching the tournament this weekend, I felt like there were a lot of upsets this year. (Or at the very least, my bracket was getting busted up pretty quickly.) But are there really more upsets this year than normal?
First, I’m going to define an upset as any lower seed beating a higher seed. I’m of the personal belief that 8/9, 4/5, and 1/2 match-ups shouldn’t count as upsets, but for this analysis, I’m going to consider these as possible upsets. First, let’s look at how many upsets there were this year. Through two rounds, there have been 13 upsets. That’s one less than last year at this time, and just at the average (if you round). So this is a rather average year. Three of four 1-seeds are still alive — not too much different from what you might expect.
Historically, 1999 had the most upsets, with 19 in the first weekend of play. Nothing this year really stuck out the way Florida Gulf Coast did last year by reaching the Sweet Sixteen as a 15-seed. 1991 had the fewest upsets in the first weekend, with just nine. All the upsets throughout the years appear to be random noise fluctuating around an average of 12.8 upsets (out of 48 games played) in the first two rounds per year. The conclusion you can draw is that the number of upsets is rather consistent over the years, with no real systematic change from year to year.
Thirteen upsets is a lot; it’s almost 1/3 of all the games played this weekend. Last week, I posted the probability that each seed would win in the first round of the tournament. This was a linear relationship, starting with an almost certain probability for the 1-seeds and ending at a 50/50 split for an 8/9 game. On the surface it doesn’t seem like almost 1/3 of the games should be upsets, but if you look at all the possibilities, it will make more sense.
Let’s look at Dayton’s 11-seed. An 11-seed has a historical 34% chance of upsetting a 6-seed in the first round, and when you consider that there are four distinct 6-11 match-ups each year, there’s only a 19% chance that all four 6-seeds will win their first-round games. In fact, the most likely scenario is that exactly one 11-seed will upset a 6-seed. This year there were two 6-11 upsets, which is the second most likely scenario at 30% (still more likely than no upsets at all).
The following table depicts the probability of different scenarios for each first round seeding combination. All the green area on the table is why everyone’s brackets bust every year. Keep reading if you are interested in the math, otherwise you might want to bounce, because it’s gonna get boring.
Still here? Ok. The basis for determining the probability of each upset scenario is the binomial distribution. A binomial distribution requires two things: a binary outcome (hence the bi- prefix) and a set probability of that outcome occurring. The simplest example of a binomial distribution is determining the probability of successive coin flips. The probability function is given as

P(X = k) = C(n, k) × p^k × (1 − p)^(n − k)

where:

C(n, k) is the combination of n terms taken k at a time.
p is the probability of the event happening (the win probability).
(1 − p) is the complement of the event, so in this case the probability of losing.
n will be 4, since there are four games for each seed match-up.
k will be 0–4, depending on how many upsets we are looking for.
Looking at the probability that two (and only two) 11-seeds upset 6-seeds, that will be

P(X = 2) = C(4, 2) × (0.34)^2 × (0.66)^2 ≈ 0.302
You can derive this equation by writing out probability trees (if you remember those from high school math). The problem with that method is that for each outcome (# of upsets = [0, 1, 2, 3, 4]) you have to write out the different combinations of games for each outcome. This gets unwieldy quickly. Binomial distributions can be used for many different applications, including the aforementioned coin flip, the likelihood of combinations of boy/girl babies, the probability that the ‘better’ team loses a 7-game playoff series, and the likely number of lottery winners…so this will rear its head again for the NHL, NBA, or MLB playoffs.
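Here is the 6-vs-11 calculation carried out for every possible number of upsets, using the historical p = 0.34 from above (standard-library Python):

```python
from math import comb

p, n = 0.34, 4  # historical 11-over-6 upset probability; four such games

probs = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
for k, prob in enumerate(probs):
    print(f"P({k} upsets) = {prob:.3f}")
# P(1 upsets) = 0.391 is the most likely scenario;
# P(2 upsets) = 0.302 is what actually happened this year.
```

Summing the five probabilities gives 1, as it must, since exactly one of the scenarios has to occur.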
The process of simulating the NCAA tournament involves two steps. The first is choosing a statistical prediction model to determine the outcome of each game. The second is simulating the entire tournament. Running the entire tournament multiple times and keeping track of each outcome is called a Monte Carlo simulation. Here the tournament is simulated 10,000 times and the results are tabulated for each round.
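The two-step structure can be sketched as follows. Note that `win_prob` below is a placeholder standing in for the real prediction model, which isn’t shown in the post; it just favors the better seed proportionally:

```python
import random
from collections import Counter

# `win_prob` is a stand-in for the real prediction model, which is not
# shown in the post; here it simply favors the lower (better) seed.
def win_prob(seed_a, seed_b):
    return seed_b / (seed_a + seed_b)

def simulate_round(teams):
    # Step one: decide each game with the model's win probability.
    return [a if random.random() < win_prob(a, b) else b
            for a, b in zip(teams[::2], teams[1::2])]

def simulate_tournament(teams):
    while len(teams) > 1:
        teams = simulate_round(teams)
    return teams[0]

# Step two: Monte Carlo -- run the whole bracket many times and tabulate.
region = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]  # by seed
champs = Counter(simulate_tournament(region) for _ in range(10_000))
print(champs.most_common(3))
```

Tabulating the winners (and, in the full version, every round's survivors) over 10,000 runs gives the advancement probabilities shown in the bracket graphics.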
On to what the computer says [entire bracket png]! Surprise, surprise the computer says almost everything that you might surmise by using your gut. It predicts Florida winning the entire tournament with almost a 20% probability, and it also predicts very few upsets. Everything is pretty much what you would expect. As I said earlier in the week, the committee does a pretty good job seeding everybody overall.
There are a few ‘undervalued’ teams the simulation has picked. Villanova, a 2-seed, is projected to go to the Final Four. All the other Final Four teams are 1-seeds, which is historically the seed with the highest probability of reaching the Final Four. The most dramatic prediction, I think, is the North Dakota State upset. NDSt is a 12-seed, and 12-seeds are the most frequently undervalued seed, performing well above their expected winning percentage. In betting on an upset, I’d be looking for a lower seed’s win probability to be higher than average for that seed. NDSt’s win probability is not only higher than average, it’s higher than Oklahoma’s. I’m picking NDSt to win its first game in all my brackets today.
The rest is pretty obvious, but I’m going to be interested to see how this bracket does against what really happens, and how the simulations look after each round.
The bracket is broken into its four regions for viewing ease. The winners are in green and the upset winners are in yellow. The Final Four is the last graphic.
All of the analysis looks only at the 64-team field, from the 1985 tournament through 2013. Before 1985, fewer than 64 teams were invited. Opening Round games are also ignored.
The NCAA Selection Committee just released the seeding for the upcoming tournament, and over the next few days everyone will be filling out brackets to win $50 from their office pool. But who to pick? Later this week, I’m going to look at a basic way to predict the point spread for an NCAA tournament game. Before that, let’s examine how the different seeds perform in the tournament.
Seeds are ordinal ranks (1 through 16) placed on the 64 teams; those teams are then pitted against each other so that the highest seeds won’t face each other until the final round in the Regional Championship (Elite Eight). So how well do the seeds perform? The graphs above show how each seed has done since 1985, the first year of the 64-team bracket. The 1-seeds are clearly the most successful in any round, and beyond that there’s a linear pattern in first-round wins. So the better the seed, the more likely the team is to go further.
In fact the likelihood of the highest seed winning the first game goes from 100% for the 1-16 game to a coin-flip for the 8-9 game.
Looking at the first-round wins, there are two interesting seeds, the 9 and 12 seeds, which outperform the pattern. This most likely happens because the Selection Committee undervalues the teams it assigns to those slots. There’s an asymmetry in the information available for power conferences vs. mid-majors: mid-majors play fewer games against top-50 teams, so the committee has less information to judge them.
Now, as far as selecting teams to win tournament games: the safest pick is to always take the highest seed. The argument against this is that it’s very boring and you still might not win; being the most likely to win does not guarantee winning in any single instance. But the data point to 1-seeds being the overwhelming favorite to reach the Final Four over any other seed, by a 2:1 margin.
Picking upsets. The underlying principle of the upset in this tournament is the small sample size of just one game. A lot can happen in one game. Even if a team has a 78% chance of winning, which is rather favorable (like a 4-seed), there’s still a 22% chance that team will lose. With 64 teams and 63 games, there’s a lot of room for upsets to happen by random chance. The randomness is in how the team feels that day, how fouls are called, whether a foul is called at a critical juncture, or simply how a ball bounces or falls.
Given the way the seeds have performed since the tournament expanded to 64 teams in 1985, the selection committee does a good job seeding the teams, since the results follow the pattern you would expect from seeding the best teams against the worst.
How does a team like the Steelers go from 0-4, to a dark horse for the final AFC playoff spot, to inches away from clinching a playoff berth? And how did the Chargers, a dark horse themselves, go on to secure the 6th seed?
Philip Rivers mentioned, in an endorphin-high interview, that no one gave them a chance, saying the odds were against the Chargers. This delves into the realms of motivation versus analysis that I want to stay away from, but how did these odds really work, and how did they change throughout the day? Week 17 of the 2013 NFL season was a great example of Bayes’ theorem.
Here’s the math for Bayes’ theorem:

P(A | B) = P(B | A) × P(A) / P(B)
This is read: the probability of A given that B happens is equal to the probability of B given that A happens times the probability of A divided by the probability of B. We are going to use this basic form of Bayes’ theorem to understand what happened to the Steelers and Chargers playoff odds throughout the day on December 29th.
The important point I’m going to illustrate here is that probability is not a property inherent in an event. It is rather a guess or calculation based on known facts or a frame of reference at the time. When that frame of reference changes (in this case, as the NFL schedule plays out), we can calculate new odds. I’ll be using the above mathematical formula to calculate new probabilities for the Steelers as we discover new information throughout the day. Then I’ll compare this with a similar chart for the Chargers, who made the playoffs.
The first and largest problem for this exercise is determining the win/loss probability for the games. I’ve searched online, and there is much disagreement on what the playoff odds were at the beginning of the day, let alone what any single game’s odds were. I could use Vegas odds as crowd-sourced odds; however, I’ll make this part easy and just make up numbers for illustrative purposes. I have the Steelers’ and Chargers’ win probabilities weighted high because they were playing the Browns and the Chiefs’ back-ups. The Ravens and Dolphins I have at even odds of 0.5. These could be endlessly debated, but let’s just assume they are correct.
For the Steelers to make the playoffs, four things needed to happen. First, they had to win. [P(SteelersW) = .85] Next, the Ravens and Dolphins had to lose. [P(DolphinsL) = P(RavensL) = .5] And then San Diego had to lose later in the day. [P(ChargersL) = .25, since I gave the Chargers a .75 chance to win] If you multiply all these together, you get roughly P(SteelersPlayoffs) = .05. So there’s a 5% chance, with the information at the beginning of the day, that the Steelers make the playoffs.
So let’s use Bayes’ theorem to calculate what the playoff odds are if we know the Dolphins lost their game — P(SteelersPlayoffs | DolphinsL):

P(SteelersPlayoffs | DolphinsL) = P(DolphinsL | SteelersPlayoffs) × P(SteelersPlayoffs) / P(DolphinsL) = 1 × .05 / .5 = .10
Just from knowing the Dolphins lost their game, you can infer that the Steelers’ chance of gaining a playoff berth is twice what it was before. I should also explain why P(DolphinsL | SteelersPlayoffs) is equal to 1. This term assumes the Steelers made the playoffs and asks what the probability of the Dolphins’ loss is given that information. The Dolphins must lose if the Steelers make the playoffs, so the term is equal to 1. This will be true for every game we are considering. (The problem becomes more complicated if there are multiple paths to the playoffs, because the term will no longer be 1.)
Starting at P(SteelersPlayoffs) = .05, you can calculate the conditional probability in a chain as the Steelers’ win, the Dolphins’ loss, and the Ravens’ loss occur. At that point, the probability of the Steelers making the playoffs depends solely on the remaining Chiefs/Chargers game. The in-game win probabilities are from advancednflstats.com; they depend on time left in the game, score, field position, and down, and they are independent of team skill. This is the probability graph for the Chargers game. I used it for the final two Steelers calculations: just before the Chiefs missed a FG, and then in OT during the Chiefs’ final drive after the Chargers kicked a FG to take the lead.
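The chain of updates can be made concrete with a small Python sketch, using the made-up probabilities from earlier (with the Chargers as a .75 favorite to win their game, which is what makes the ~5% starting figure work out):

```python
# The made-up win probabilities from the text; note the Chargers are the
# favorite in their game (a .75 chance to win, so a .25 chance to lose).
p_events = {
    "Steelers win": 0.85,
    "Dolphins lose": 0.50,
    "Ravens lose": 0.50,
    "Chargers lose": 0.25,
}

# Prior: all four independent events must happen.
p = 1.0
for prob in p_events.values():
    p *= prob
print(f"start of day: {p:.3f}")  # 0.053

# Bayes update as each result becomes known:
# P(playoffs | event) = P(event | playoffs) * P(playoffs) / P(event),
# and P(event | playoffs) = 1 because each event is required.
for name in ("Steelers win", "Dolphins lose", "Ravens lose"):
    p = 1.0 * p / p_events[name]
    print(f"after {name}: {p:.3f}")
# ends at 0.250: only the Chargers result remains undecided
```

Each known result simply divides out its prior probability, so after the first three events the Steelers' playoff odds collapse to exactly P(ChargersL) = .25, the one game still undecided.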
You can see how the Steelers’ probability changed after each event, and how small their chances were until the Chargers almost lost their game. Compare this with the Chargers, who had a better chance all through the day, except when the Chiefs were threatening to win. The large green area is the probability at the end of the Chargers game when they won; its value is 1.0, denoting they had clinched a playoff berth.
To respond to Philip Rivers’ on-field comments about the Chargers having long odds: it’s misleading to think the Chargers somehow overcame those odds themselves, when they were the favorites to capture the final wild card spot by the time their game started. It was the other teams’ losses that increased the odds in the Chargers’ favor before they even played.
This exercise serves as an example of how and why probabilities change over time. It illustrates that probability relies on known information, a frame of reference, and that changing that frame of reference changes the calculated probability.