A Monty Hall Probability Simulation

There are three doors, and hidden behind them are two goats and a car. Your objective is to win the car. Here’s what you do: pick a door. The host opens one of the doors you didn’t pick that has a goat behind it. Now there are just two doors to choose from. Do you […]
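The game described above can be simulated in a few lines. This is a minimal sketch in Python rather than the post's own code; the function names `play` and `win_rate`, the trial count, and the seed are all illustrative assumptions.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20_000, seed=1):
    """Estimate the share of rounds won under a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

With enough trials, the estimate for staying should settle near 1/3 and the estimate for switching near 2/3.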

Covariance — Different Ways to Explain or Visualize It

Covariance is the less-understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings of a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, both X and Y will have positive […]
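To make "change in relation to one another" concrete, here is a minimal Python sketch of the sample covariance (the n - 1 denominator). The function name `sample_covariance` and the toy data are assumptions for illustration, not taken from the post.

```python
def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum the products of paired deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - 1)


if __name__ == "__main__":
    print(sample_covariance([1, 2, 3], [2, 4, 6]))  # variables rise together
    print(sample_covariance([1, 2, 3], [6, 4, 2]))  # one rises, one falls
```

The sign carries the direction of the relationship: positive when the variables move together, negative when they move in opposite directions.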

Introduction to Correlation with R | Anscombe’s Quartet

Correlation is one of the most commonly [over]used statistical tools. In short, it measures how strong the relationship between two variables is. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other. Basic Calculation and Definition: Covariance also measures the relationship between two variables, but it is not scaled, so […]

Making a Correlation Matrix in R

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on under the hood in R. There are built-in functions within R that make this process much quicker and easier. The correlation […]

Making a Covariance Matrix in R

The full R code for this post is available on my GitHub. Understanding what a covariance matrix is can be helpful in understanding some more advanced statistical concepts. First, let’s define the data matrix, which is essentially a matrix with n rows and k columns. I’ll define the rows as being the subjects, while […]

One-Sample t-Test [With R Code]

The one-sample t-test is very similar to the one-sample z-test. A sample mean is being compared to a claimed population mean. The t-test is required when the population standard deviation is unknown. The t-test uses the sample’s standard deviation (not the population’s standard deviation) and the Student t-distribution as the sampling distribution to […]

One Mean Z-test [with R code]

I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki. Building on finding z-scores for individual measurements or values within a population, a z-test can determine if there is a statistically significant difference between a sample mean and a population mean with a known population standard deviation. [Those […]

Calculating Z-Scores [with R code]

I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki. Normal distributions are convenient because they can be scaled to any mean or standard deviation, meaning you can use the exact same distribution for weight, height, blood pressure, white-noise errors, etc. Obviously, the means and standard deviations of […]

Using a Genetic Algorithm to Minimize an OLS Regression in R

A genetic algorithm allows you to optimize parameters by using an algorithm that mimics biological evolution. It will run through several generations of values trying to find the values that minimize [or maximize, depending on the algorithm] its fitness or evaluation function, which is just any function that returns a value from the parameters the […]

OLS Derivation

Ordinary Least Squares (OLS) is a great way to obtain estimates for coefficients in a linear regression model with little computing power. I wanted to detail the derivation of the solution since it can be confusing for anyone not familiar with matrix calculus. First, the initial matrix equation is set up below. With X being a matrix […]

Getting Lucky in a Playoff Series

Sports have constant uncertainty and randomness in every aspect of the game, including determining champions. This is one area where you wouldn’t expect to see a lot of variability, since you would want the team that has the best roster composition and played the hardest to win the championship. This concept is usually brought up […]

Statistics — Probability vs. Odds

Probability and odds are two basic statistical terms used to describe the likelihood that an event will occur. They are often used interchangeably in casual conversation or even in published material. However, they are not mathematically equivalent because they look at likelihood in different contexts. In everyday conversation when numbers or values aren’t given, the […]
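The difference is easy to see numerically. The sketch below is in Python and uses the standard definition of odds in favor of an event, p / (1 - p); the function names `probability_to_odds` and `odds_to_probability` are illustrative assumptions, not from the post.

```python
def probability_to_odds(p):
    """Odds in favor of an event with probability p, i.e. p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_to_probability(odds):
    """Probability of an event given odds in favor, i.e. odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


if __name__ == "__main__":
    # A probability of 0.75 corresponds to odds of 3 (3 to 1 in favor).
    print(probability_to_odds(0.75))
    print(odds_to_probability(3.0))
```

Probability is bounded by 0 and 1, while odds are unbounded above, so the two numbers diverge quickly as an event becomes more likely.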

MLB — Run Distribution Per Game & Per Inning — Negative Binomial

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in the […]

Count Data Distribution Primer — Binomial / Negative Binomial / Poisson

Count data is exclusively whole number data where each increment represents one of something. It could be a car accident, a run in baseball, or an insurance claim. The critical thing here is that these are discrete, distinct items. Count data behaves differently than continuous data, and the distribution [frequency of different values] is […]

Moving Average Time Series — Baseball

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference. A time series is data that has been collected at a regular […]

Where Do People Tweet?

This is a representative map of Twitter from 11am to 11pm EDT yesterday.

Chicago Transit Authority — Ridership

Waiting for the break of day…oooOOOOO…25 or 6 to 4! -Chicago (formerly The Chicago Transit Authority) I was lucky to live in Chicago during the summer of 2012. The thing I miss most from Chicago is the transit system. Taking the ‘L’ to work every day was much more relaxing and interesting than having to drive […]

Text Message Analytics — Numbers

People communicate a lot through text messages, and lucky for me iPhones keep track of those text messages I’ve sent. iPhones store your text messages in a SQLite database, and this database is readily accessible in your iPhone backup on your computer. [This is why encrypting your backup might be a good idea if you […]

#SeanTrek GeoTracks 2012

You might remember #SeanTrek — the 46-day, 12,000-mile, 34-state excursion I took back at the very end of 2012. I didn’t know how I was going to use this at the time, but I geotagged just about everything I did on the trip. I checked in to every place on Foursquare […]