All posts by Sean Dolinar

NCAA Tournament Seed Graphs

NCAA Tournament — Seeding

All of the analysis looks only at the 64-team field from the 1985 tournament through 2013.  Before 1985 there were fewer than 64 teams invited.  Opening Round games are also ignored.

 

The NCAA Selection Committee just released the seeding for the upcoming tournament, and over the next few days everyone will be filling out brackets to win $50 from their office pool.  But who to pick?  Later this week, I’m going to look at a basic way to predict the point spread for an NCAA tournament game.  Before I look at that prediction, let’s examine how the different seeds perform in the tournament.

Seeds are ordinal ranks (1 through 16) placed on the 64 teams, and those teams are then pitted against each other so the highest seeds won’t face each other until the Regional Championship (Elite Eight).  So how well do the seeds perform?  The graphs above show how well each seed has done since 1985, the first year of the 64-team bracket.  The 1 seeds are clearly the most successful in any round, and beyond that there’s a roughly linear pattern in first-round wins.  So the better the seed, the more likely the team is to go further.

In fact, the likelihood of the higher seed winning the first game goes from 100% for the 1-16 game to a coin flip for the 8-9 game.

 

Looking at the first-round wins, there are two interesting seeds, the 9 and 12 seeds, which outperform the pattern.  This most likely happens because the Selection Committee undervalues the teams it assigns to those slots.  There’s an asymmetry in the information available for power conferences vs. mid-majors.  Mid-majors play fewer games against top 50 teams, so the committee has less information with which to judge them.

Now, as far as selecting teams to win tournament games, the safest choice is to always pick the higher seed.  The argument against this is that it’s very boring and you might not win.  Being the most likely to win does not guarantee winning in any single instance.  But the data points to 1 seeds being the overwhelming favorite to reach the Final Four, ahead of any other particular seed by a 2:1 margin.

Picking upsets.  The underlying principle of the upset in this tournament is the small sample size of just one game.  A lot can happen in one game.  Even if a team has a 78% chance of winning, which is rather favorable (like a 4 seed), there’s still a 22% chance that team will lose.  With 64 teams and 63 games, there’s a lot of room for upsets to happen by random chance.  The randomness is in how the team feels that day, how fouls are called, whether a foul is called at a critical juncture, or simply how a ball bounces or falls.
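To put a rough number on that, here is a minimal Python sketch.  It assumes, purely for illustration, that all four 4 seeds are 78% favorites in their first games and that the games are independent of each other; the function name is my own.

```python
def prob_at_least_one_upset(p_favorite_wins: float, n_games: int) -> float:
    """Chance that at least one favorite loses, assuming independent games."""
    return 1.0 - p_favorite_wins ** n_games

# Four 4-seed games (one per region), each assumed to be a 78% favorite.
p = prob_at_least_one_upset(0.78, 4)
print(f"{p:.2f}")
```

Under those assumptions the chance that at least one 4 seed loses its first game comes out to about 63%, even though each one is a solid favorite on its own.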

NCAA First Round win probability by seed

 

Given the way the seeds have performed since the tournament expanded to 64 teams in 1985, the selection committee does a good job seeding the teams, since the results follow the pattern you would expect from matching the best teams against the worst.


MLB 2013 Tree Map

2013 MLB Wins Visualized

 


 

Want to know where all the wins in MLB came from last year?  This chart shows where all 2,431 wins (30 teams x 81 wins on average + 1 Gm163) came from last season.  The chart is fully interactive.  I do advise using a large screen since the chart is so large.
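The total can be checked with some quick arithmetic.  This sketch assumes the standard 162-game schedule and counts each game once, since every game produces exactly one win; the variable names are mine.

```python
TEAMS = 30
GAMES_PER_TEAM = 162
TIEBREAKER_GAMES = 1  # the extra "game 163"

# Each game involves two teams, so divide by 2 to avoid double counting.
scheduled_games = TEAMS * GAMES_PER_TEAM // 2
total_wins = scheduled_games + TIEBREAKER_GAMES
print(total_wins)  # 2431
```

That works out to an average of 81 wins per team before the tiebreaker is added.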

This is called a tree map, and the area of each cell or group of cells represents wins.  The teams with the most wins have the largest area.  They are sorted left to right and then top to bottom, so Boston is in the top left while Houston is in the bottom right.  The teams are grouped by their team color.  Then there are sub-cells which denote the wins against specific opponents.  Those sub-cells are sized by wins.  So if you look at the first sub-cell in Boston’s group you’ll see NYY, because the Red Sox got the most wins against the Yankees.  You’ll find that division foes usually account for the most wins since they play each other the most.  The winning percentage against that opponent is also listed in the cell so you can evaluate how well the team actually did against that opponent.  Did they win more or lose more?  The answer depends on whether the number is above or below .500.

2013 AFC Playoffs and Bayesian Statistics

How does a team like the Steelers go from 0-4 to a dark horse for the final AFC playoff spot to inches away from clinching the berth?  And how do the Chargers, who were a dark horse themselves, go on and secure the 6th seed?

Philip Rivers mentioned, in an endorphin-high interview, that no one gave them a chance, saying the odds were against the Chargers.  This delves into the different realms of the motivational and the analytical that I want to stay away from, but how did these odds really work and how did they change throughout the day?  Week 17 of the 2013 NFL season was a great example of Bayes’ theorem.

Here’s the math for Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

 

This is read: the probability of A given that B happens is equal to the probability of B given that A happens times the probability of A divided by the probability of B.  We are going to use this basic form of Bayes’ theorem to understand what happened to the Steelers and Chargers playoff odds throughout the day on December 29th.

The important point I’m going to illustrate here is that probability is not a property inherent in an event.  It is rather a guess or calculation based on the known facts or frame of reference at the time.  When that frame of reference changes (in this case, as the NFL schedule plays out) we can calculate new odds.  I’ll be using the above formula to calculate new probabilities for the Steelers as we discover new information throughout the day.  Then I’ll compare this with a similar chart for the Chargers, who made the playoffs.

The first and largest problem for this exercise is determining the win/loss probability for the games.  I’ve searched online and there is much disagreement on what the playoff odds were at the beginning of the day, let alone what any single game’s odds were.  I could use Vegas odds as crowd-sourced odds; however, I’ll make this part easy and just make up numbers for illustrative purposes.  I have the Steelers’ and Chargers’ win probabilities weighted high because they were playing the Browns and the Chiefs’ back-ups.  The Ravens and Dolphins I have at even odds of 0.5.  These could be endlessly debated, but let’s just assume they are correct.

Assumed Probability for Week 17

Assumed probability chart for week 17.

For the Steelers to make the playoffs, four things needed to happen.  First, they had to win.  [P(SteelersW) = .85]  Next, the Ravens and Dolphins had to lose.  [P(DolphinsL) = P(RavensL) = .5]  And then San Diego had to lose later in the day.  [P(ChargersL) = .25]  If you multiply all these together you will get roughly P(SteelersPlayoffs) = .05.  So with the information available at the beginning of the day, there’s a 5% chance that the Steelers will make the playoffs.

So let’s use Bayes’ theorem to calculate what the playoff odds are if we know the Dolphins lose their game: P(SteelersPlayoffs | DolphinsL).
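As a quick check on that multiplication, here is a minimal Python sketch.  It treats the four games as independent, and it takes P(ChargersL) as .25, which is my assumption: it is the complement of a heavily favored Chargers team and the value that reproduces the roughly .05 result.  All four numbers are the made-up illustrative probabilities, not real odds.

```python
import math

# Assumed (made-up) probabilities for the four results the Steelers needed.
needed = {
    "SteelersW": 0.85,
    "DolphinsL": 0.50,
    "RavensL": 0.50,
    "ChargersL": 0.25,
}

# With independent games, the joint probability is just the product.
p_steelers_playoffs = math.prod(needed.values())
print(round(p_steelers_playoffs, 3))
```

The product is a little over .05, which is the starting point for everything that follows.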

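$$P(\text{SteelersPlayoffs} \mid \text{DolphinsL}) = \frac{P(\text{DolphinsL} \mid \text{SteelersPlayoffs})\,P(\text{SteelersPlayoffs})}{P(\text{DolphinsL})}$$

$$P(\text{SteelersPlayoffs} \mid \text{DolphinsL}) = \frac{1 \times 0.05}{0.5}$$

$$P(\text{SteelersPlayoffs} \mid \text{DolphinsL}) = 0.10$$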

Just from knowing that the Dolphins lost their game, you can infer that the Steelers’ chances of gaining a playoff berth are twice as high as they were before.  I should also explain why P(DolphinsL | SteelersPlayoffs) is equal to 1.  This term assumes the Steelers made the playoffs and asks what the probability of the Dolphins’ loss is given this information.  The Dolphins must lose if the Steelers make the playoffs, so the term is equal to 1.  This will be true for every game we are considering.  (This problem becomes more complicated if there are multiple paths to the playoffs, because the term will no longer be 1.)

Starting at P(SteelersPlayoffs) = .05, you can calculate the conditional probability in a chain as the Steelers’ win, Dolphins’ loss, and Ravens’ loss occur.  After that, the probability of the Steelers making the playoffs depends solely on the remaining game, Chiefs/Chargers.  In-game win probabilities are calculated on advancednflstats.com.  They depend on time left in the game, score, field position, and down, and they are independent of team skill.  This is the probability graph for the Chargers game.  I used this for the final two Steelers calculations: just before the Chiefs missed a FG, and then in OT during the Chiefs’ final drive after the Chargers kicked a FG to take the lead.
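The chain of updates can be sketched in a few lines of Python.  This is only an illustration using the made-up game probabilities, with P(ChargersL) assumed to be .25; it also assumes the games are independent and that there is a single path to the playoffs, so the likelihood term is always 1.  The `bayes_update` helper is my own name.

```python
def bayes_update(prior: float, likelihood: float, evidence: float) -> float:
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Starting point: product of the four assumed game probabilities.
p_playoffs = 0.85 * 0.5 * 0.5 * 0.25

# Each result the Steelers needed, with its assumed probability.
# P(result | SteelersPlayoffs) is 1 because every result was required.
events = [("SteelersW", 0.85), ("DolphinsL", 0.5), ("RavensL", 0.5)]

history = [p_playoffs]
for name, p_event in events:
    p_playoffs = bayes_update(p_playoffs, 1.0, p_event)
    history.append(p_playoffs)
    print(f"after {name}: {p_playoffs:.4f}")
```

Once all three early results are in, the number that is left is simply the assumed chance of a Chargers loss, which is why the last game is the only thing that matters from that point on.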

Steelers Playoff Probability

Steelers Playoff Probability — Area chart

 

You can see how the Steelers’ probability changed after each event, and how small the area was until the Chargers almost lost their game.  I will compare this with the Chargers, who had a better chance all through the day, except when the Chiefs were threatening to win.  The large green area is the probability at the end of the Chargers game when they won, and its value is 1.0, denoting they have clinched the playoff berth.

Chargers Playoff Probability

Chargers Playoff Probability — Area chart

 

To respond to Philip Rivers’ on-field comments about the Chargers having long odds, it’s misleading to think the Chargers somehow overcame those odds themselves, when they were the favorites to capture the final wild card spot by the time their game started.  It was the other teams’ losses that shifted the odds in the Chargers’ favor before they even played.

This exercise serves as an example of how and why probabilities change over time, and it illustrates how probability relies on known information or a reference point, and how changing the reference point changes the probability.

2012 Toronto Raptors Correlation

2012 Basketball Scoring Correlation

 

Below are correlation graphs illustrating a significant (albeit slightly weak) correlation between how many points one team will score and how many points their opponents will score in a given game.

This isn’t anything novel, but rather an illustration and confirmation of what you might surmise: teams that play faster score more, and their opponents in turn score more, because there are more possessions over the course of the game.  The Pearson correlation coefficients are:

Knicks   .2362
Jazz     .3837
Raptors  .4004
Wizards  .4753

(A higher value means the two teams’ scores in a game are more strongly correlated with each other.  Typically .30 is a good correlation; less than that is rather weak.)
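For anyone who wants to reproduce this kind of number, here is a minimal sketch of the Pearson calculation in plain Python.  The scores below are made up for illustration and are not real box scores, and `pearson_r` is just my name for the helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for five games (hypothetical, not real results).
team_points = [98, 105, 91, 110, 102]
opponent_points = [95, 101, 97, 104, 99]
print(round(pearson_r(team_points, opponent_points), 3))
```

With a full season you would pass one list of the team’s points and one list of the opponents’ points, one entry per game.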

 

To contrast this with a sport that doesn’t exhibit it, here’s a breakdown of the Penguins’ season two years ago (the last full season).  The trend line is virtually flat and the confidence interval on the trend line dips negative, suggesting there is no statistically significant correlation between how many goals the Penguins score and how many goals their opponents score in a given game.

2011 Penguins Scatter Plot

 

The correlation coefficient for the Penguins is .0572, which is not statistically significant.  We can conclude that hockey offenses operate essentially independently of each other.
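One common way to check significance for a correlation is a t-test on r.  The sketch below is not necessarily how the charting software did it; it assumes an 82-game regular season as the sample size and uses an approximate two-sided 5% critical value of 1.99 for 80 degrees of freedom.

```python
import math

def correlation_t_stat(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

N_GAMES = 82        # assumption: one full regular season
T_CRITICAL = 1.99   # approximate two-sided 5% cutoff for 80 degrees of freedom

t = correlation_t_stat(0.0572, N_GAMES)
print(f"t = {t:.2f}, significant: {abs(t) > T_CRITICAL}")
```

Under those assumptions the t statistic is only about 0.5, well short of the cutoff, which agrees with the flat trend line.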

To further analyze basketball scoring, it would be good to eliminate overtime games, and to see if a team’s correlation with its schedule is related to how good or bad the team is, since over the course of a season a team plays a rather balanced schedule.  My thinking is that mediocre teams will correlate better over the course of a season than good or bad teams.