React Context API: Multiple Consumer and Providers

With React’s Context API you are able to use multiple Providers and Consumers using different context instances. This could be useful if you want to separate different data models into distinct parts of your app. For example, you could have a data context and a table context. One context instance could be used to control […]

React Context API: Three Way Light Switch

In the Way Too Simple Context API example, we made a simple light switch. This post will show why flux, the single source of truth, and Context API are really useful. The last post had just two components (not counting the App component): a light switch and light bulb. Here we are going to add […]

A Way Too Simple React Context API Example

React’s Context API is convenient built-in state management for React projects. It has its advantages and disadvantages compared with a library like Redux for sending props and changing the app’s state. I’m going to focus on the advantages of using Context API and getting an overly simple example to work. The example in this post only […]

My Dead Simple Redux Example

If you are here, I assume you are banging your head against the wall trying to figure out Redux for a React project. If you are looking for a quick start for a React project that has Redux already set up, then this is a good boilerplate: http://mikechabot.github.io/react-boilerplate/. But you are probably still a little confused […]

A Monty Hall Probability Simulation

There are three doors. And hidden behind them are two goats and a car. Your objective is to win the car. Here’s what you do: Pick a door. The host opens one of the doors you didn’t pick that has a goat behind it. Now there are just two doors to choose from. Do you […]
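The game described above can be sketched as a short simulation. This is a minimal Python sketch, not the code from the post itself; the function names `play_round` and `win_rate`, the trial count, and the seed are all assumptions made for illustration.

```python
import random


def play_round(switch, rng):
    """Play one game; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)    # door hiding the car
    pick = rng.choice(doors)   # player's first pick
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the share of games won when always switching or always staying."""
    rng = random.Random(seed)
    wins = sum(play_round(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

With enough trials, the switching strategy should win roughly two thirds of the time and staying roughly one third.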

Make a HTML Table with jQuery

For a project I was working on, I needed a quick, simple solution to make a dynamic table based on data sent back from an AJAX call. I used jQuery to build and manipulate the table HTML, since it was quick to use jQuery and it’s already in my project. After considering a few different […]

D3 Visualization Basics — First Steps

D3 visualizations work by manipulating elements in the browser window. This short tutorial will demonstrate the very basics of that. This is also a working, simple demonstration of the interplay of HTML, CSS and JavaScript from the introduction page in this D3 tutorial set. For the sake of making this simple, everything will come from […]

D3 Visualization Basics — Introduction

Data visualization is important, really important. I can’t be more blunt than that. We are able to process much more information faster by seeing a visual representation than we could by looking at a table or database or interacting with a spreadsheet. I will be writing a series of posts that explore some of the foundations D3 […]

Stattleship! Sport Stats API

I’ve been in contact with the team over at Stattleship. They have a cool API that allows you to get various stats for basketball, football and hockey. I used data from that API to create the following data visualization for their blog. The visualization shows the offensive and special team yards gained by each team […]

The Backwards K — Baseball Strikeout Looking

The backwards K is normally used to denote a called third strike in a strikeout. It’s typically written on a scorecard. I’ve been looking for the backwards K so I can denote the strikeout looking on Twitter, and I finally found it: ꓘ (for unsupported browsers — Chrome) The easiest way to use this character […]

Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page] The main drawback […]

Collecting Twitter Data: Converting Twitter JSON to CSV — ASCII

I outlined some […]

Collecting Twitter Data: Converting Twitter JSON to CSV — Possible Errors

ASCII JSON-to-CSV | […]

R Bootcamp: Making a Subset

Data Manipulation: Subsetting. Making a subset of a data frame is one of the most basic and necessary data manipulation techniques you can use in R. If you are brand new to data analysis, a data frame is the most common data storage object in R, and subsets are a collection of rows […]

Emoji iOS 9.1 Update — The Taco Emoji Analysis

Before I get too far: I don’t actually analyze taco emojis. At least not yet. I do, however, give you the tools to start parsing them from tweets, text or anything you can get into Python. This past month Apple released iOS 9.1 and their latest OS X 10.11.1 El Capitan update. That update included […]

R Bootcamp — A Quick Introduction

R, a statistical programming language and environment, is becoming more popular as organizations, governments and businesses have increased their use of data science. In an effort to provide a quick bootcamp for learning the basics of R, I’ve assembled some of the most basic processes to give a new user a quick introduction to the […]

Covariance — Different Ways to Explain or Visualize It

Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings to a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, both X and Y will have positive […]
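The idea that covariance describes how two variables move together can be sketched with the standard sample covariance formula. This is a minimal Python illustration rather than code from the post; the function name `sample_covariance` and the toy data are assumptions.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of paired deviations from the means over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Each term is positive when x and y sit on the same side of their means.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


if __name__ == "__main__":
    print(sample_covariance([1, 2, 3], [2, 4, 6]))  # both rise together
    print(sample_covariance([1, 2, 3], [6, 4, 2]))  # one rises as the other falls
```

When both variables rise together the result is positive; when one rises while the other falls it is negative.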

Introduction to Correlation with R | Anscombe’s Quartet

Correlation is one of the most commonly [over]used statistical tools. In short, it measures how strong the relationship between two variables is. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other. Basic Calculation and Definition: Covariance also measures the relationship between two variables, but it is not scaled, so […]

Baseball Twitter Roller Coaster

Because Twitter is fun and so are graphs, I have tweet volume graphs from my Twitter scraper that collects tweets with the team-specific nicknames and Twitter handles. After a trade (or non-trade), the data can be collected and a graphical picture of the reaction can be produced. The graph represents the volume of sampled tweets […]

Making a Correlation Matrix in R

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra necessary to create the matrices, so you can better understand what is going on underneath the hood in R. There are built-in functions within R that make this process much quicker and easier. The correlation […]

Making a Covariance Matrix in R

The full R code for this post is available on my GitHub. Understanding what a covariance matrix is can be helpful in understanding some more advanced statistical concepts. First, let’s define the data matrix, which is essentially a matrix with n rows and k columns. I’ll define the rows as being the subjects, while […]

One-Sample t-Test [With R Code]

The one sample t-test is very similar to the one sample z-test. A sample mean is being compared to a claimed population mean. The t-test is required when the population standard deviation is unknown. The t-test uses the sample’s standard deviation (not the population’s standard deviation) and the Student t-distribution as the sampling distribution to […]
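The comparison described above can be sketched as a calculation of the t statistic. The post itself uses R; this is a minimal Python illustration, and the function name `one_sample_t` and the sample values are assumptions. It assumes the sample is not constant, so the sample standard deviation is nonzero.

```python
import math
import statistics


def one_sample_t(sample, claimed_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least 2 observations")
    sample_mean = statistics.mean(sample)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    sample_sd = statistics.stdev(sample)
    standard_error = sample_sd / math.sqrt(n)
    t_stat = (sample_mean - claimed_mean) / standard_error
    return t_stat, n - 1


if __name__ == "__main__":
    t_stat, df = one_sample_t([2, 4, 6], claimed_mean=3)
    print(t_stat, df)
```

The statistic would then be compared against the Student t-distribution with n - 1 degrees of freedom.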

Twitter Sentiment — Penguins VS. Rangers Gm 4

Game 4 of the Penguins-Rangers series featured a brief overtime period that overshadowed the rest of the game as far as tweet volume goes. Rangers fans were more negative at the beginning of the game after the Penguins scored their first goal. Twitter volume picked up for both teams during the overtime period and Rangers […]

Twitter Sentiment — Penguins vs. Rangers Gm 3

Unfortunately, my Twitter scraper wasn’t looking for the most viral story of the Penguins’ loss to the Rangers. [I’m not linking to it, but it involves a columnist and the Penguins’ GM.] I was able to get general sentiment over the course of the game. There isn’t too much to analyze. There are more Rangers […]

Using New, Diverse Emojis for Analysis in Python

I haven’t been updating this site often since I started performing a similar job over at FanGraphs. All non-baseball stat work that I do will continue to be housed here. Over the past week, Apple has implemented new emojis with a focus on diversity in their iOS 8.3 and OS X 10.10.3 updates. […]

Collecting Twitter Data: Using a Python Stream Listener

I use the […]

Collecting Twitter Data: Getting Started

The R code […]

Collecting Twitter Data: Introduction

Collecting Twitter data […]

2015 State of the Union Address — Text Analytics

I collected tweets about the 2015 State of the Union address [SOTU] in real time from 10am to 2am using the keywords [obama, state of the union, sotu, sotusocial, ernst]. The tweets were analyzed for sentiment, content, emoji, hashtags, and retweets. The graph below shows Twitter activity over the course of the night. The volume […]

MLB — Pace of Play [Working Post]

This post is a work in progress. The data concerning the pace of play is rather messy and this project is rather large compared to what I normally tackle. For that reason I’m going to start this post and update it as a ‘working post’. Please feel free to contact me if anyone has any input: […]

2015 Steelers-Ravens Playoff Twitter Infographics

The Steelers-Ravens playoff game gave me a chance to test out a new analytics server and some of the tools I’ve been working on to make Twitter analysis easy using ad hoc Python scripts. So here goes: There were a lot of Steelers or Ravens colored emojis, black and gold hearts or buttons and the […]

One Mean Z-test [with R code]

I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki. Building on finding z-scores for individual measurements or values within a population, a z-test can determine if there is a statistically significant difference between a sample mean and a population mean with a known population standard deviation. [Those […]

Calculating Z-Scores [with R code]

I’ve included the full R code, and the data set can be found on UCLA’s Stats Wiki. Normal distributions are convenient because they can be scaled to any mean or standard deviation, meaning you can use the exact same distribution for weight, height, blood pressure, white-noise errors, etc. Obviously, the means and standard deviations of […]

The Most Popular Emoji Characters on Twitter

On Twitter, about 10% of general-topic tweets contain emoji characters, the tiny icons and emoticons that are starting to get more attention when analyzing tweets, Facebook messages, or text messages. An emoji can capture an emotion or completely change the meaning of the written text. Before exploring how different emojis are used and what […]

Emoji, UTF-8, and Python

I have updated [better] code that allows for easy counting of emojis in string objects in Python; it can be found on my GitHub. I have two counting classes in a mini-package loaded there. Emoji, those ubiquitous emoticons that popped up when iPhone users found them in 2011 with iOS 5, are a […]

James Bond — Graph Theory

If you have ever wondered if you could watch every James Bond movie without watching the same actor play James Bond twice in a row, or how many different possibilities there were, you’ve unsuspectingly ventured into graph theory. Graph theory is basically the study of connected things. These can be bridges, social networks, or in this […]

Visualization of CNN’s 2014 Midterm Election Coverage

Adding to the basic text analytics I wrote about last week, I ran a bag-of-words sentiment analysis on CNN’s midterm election coverage using transcripts found on their site. Fortunately, all the transcripts have a time stamp on them denoting what hour of programming the transcript covers, so I was able to attach a time of […]

Basic Text Analytics for News Bias

Bias is a problem every news media outlet has in some form beyond the well-debated political slants that Fox News and MSNBC are renowned for. I’ve been attempting to quantify biases using text analytics. By looking at the frequency and topics of articles, word choices, and associated words, I believe that you can find analytical […]

Using a Genetic Algorithm to Minimize an OLS Regression in R

A genetic algorithm allows you to optimize parameters by using an algorithm that mimics biological evolution. It will run through several generations of values trying to find the values that minimize [or maximize, depending on the algorithm] its fitness or evaluation function, which is just any function that returns a value from the parameters the […]

OLS Derivation

Ordinary Least Squares (OLS) is a great way to obtain estimates for coefficients in a linear regression model with little computing power. I wanted to detail the derivation of the solution since it can be confusing for anyone not familiar with matrix calculus. First, the initial matrix equation is set up below. With X being a matrix […]

2014 ALWCG Twitter Graphs

The Royals and A’s had quite the entertaining 12-inning game Tuesday night. These are a few graphs I made from Twitter data. Yellow is Oakland; blue is Kansas City. The proportions of tweets between teams might be off, but I would venture to guess the Royals had much more social media activity than the A’s. […]

Getting Lucky in a Playoff Series

Sports have a constant uncertainty and randomness in every aspect of the game including determining champions. This is one area you wouldn’t expect to have a lot of variability, since you would want the team that has the best roster composition and played the hardest to win the championship. This concept is usually brought up […]

Do MLB Playoff Odds Work?

One of the more fan-accessible advanced stats is playoff odds [technically postseason probabilities]. Playoff odds range from 0% to 100%, telling the fan the probability that a certain team will reach the MLB postseason. These are determined by creating a Monte Carlo simulation which runs the baseball season thousands of times [FanGraphs runs theirs 10,000 […]

Statistics — Probability vs. Odds

Probability and odds are two basic statistical terms that describe the likelihood that an event will occur. They are often used interchangeably in casual conversation or even in published material. However, they are not mathematically equivalent because they are looking at likelihood in different contexts. In everyday conversation when numbers or values aren’t given, the […]
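The difference between the two terms can be made concrete with the standard conversion: odds in favor of an event are p / (1 - p), where p is its probability. This is a minimal Python sketch; the function names `probability_to_odds` and `odds_to_probability` are assumptions chosen for illustration.

```python
def probability_to_odds(p):
    """Convert a probability in [0, 1) to odds in favor, p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("probability must be in [0, 1)")
    return p / (1 - p)


def odds_to_probability(odds):
    """Convert non-negative odds in favor back to a probability."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


if __name__ == "__main__":
    # A 75% probability corresponds to odds of 3 (that is, 3 to 1 in favor).
    print(probability_to_odds(0.75))
    print(odds_to_probability(3.0))
```

A probability is bounded by 1, while odds grow without limit as the event becomes more certain, which is why the two numbers cannot be used interchangeably.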

MLB — Run Distribution Per Game & Per Inning — Negative Binomial

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in the […]

Count Data Distribution Primer — Binomial / Negative Binomial / Poisson

Count data is exclusively whole number data where each increment represents one of something. It could be a car accident, a run in baseball, or an insurance claim. The critical thing here is that these are discrete, distinct items. Count data behaves differently than continuous data, and the distribution [frequency of different values] is […]

Twitter Retweet Decay

This uses the same data set I obtained from my NU Data Mining final project [summary]. Recently, @MLBcathedrals tweeted a photo I submitted to them: Seat at exact spot where #Twins Harmon Killebrew hit a 522′ HR. Now in Mall of America via @seandolinar: @ckamka pic.twitter.com/OluZOUVUkb” — Baseball Cathedrals (@MLBcathedrals) August 4, 2014 I got […]

Twitter Sentiment Analysis

This is a summary of a final project I did for my Introduction to Data Mining class at NU. The goal of the project was to find a business need and execute a data mining process. The general process I used is outlined here and the sentiment lexicon is found here. The lexicon is from […]

Pirates 2014 — Bullpen

All the graphs are pulled from this FanGraphs leaderboard. The Pirates bullpen has been a source of problems and criticism for the Pirates this year. At the beginning of 2014, the bullpen had almost the same personnel as the 2013 season. Bullpens can vary wildly from year to year, and the Pirates relievers pitched […]

MLB — Bases Loaded. No Outs. No Runs.

Bases loaded, no outs is one of the most tenuous points of a close baseball game. If you are rooting for the team at the plate, you feel confident your team will score here. Anything else would be a huge disappointment. If you are rooting for the fielding team and your pitcher gets out of […]

Moving Average Time Series — Baseball

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference. A time series is data that has been collected at a regular […]

Pirates Do Not Need Help Against LHP

Stats in this post are current up to right before the July 31, 2014 PIT-ARZ game. The MLB non-waiver trade deadline just passed. I’m not interested in debating what teams should or should not have done except to say the price for quality players was very high this year. The whole supply & demand, free […]

MLB — Poisson Distribution To Model Runs Scored Per Inning

I have recently written a much more mathematically involved post using the negative binomial and wrote up a discrete probability distribution primer. These are a more complete treatment of the topic. However, this post is a good overview of the basics. My friend sparked my recent interest in Poisson distributions by mentioning how rare […]

Where Do People Tweet?

This is a representative map of Twitter from 11am to 11pm EDT yesterday.

Predicting Baseball Wins with WAR

There is a lot of debate about the usefulness of the comprehensive baseball statistic, WAR — Wins Above Replacement. I don’t think that WAR is the end-all statistic, but it is a useful tool. Why? Because it can describe relatively accurately how a player contributes to a team. It also can help fans understand […]

Twitter Analysis – Penguins Game 7

I’ve been listening to 93.7 The Fan while running the analysis for this, and I never realized that people can say the same thing over and over again but in slightly different ways. Also all tweets were captured AFTER THE CONCLUSION OF THE 1st PERIOD. Everyone knows Twitter is the best venue to vent your […]

Probability and Sunday Night Baseball

There’s nothing I like more than a bases-loaded, no-outs situation in baseball. This might be my favorite situation/stat no one realizes. There’s around a 15% chance that the team who has the bases loaded will not score at all that inning! 15% might not seem like much, but over the course of the season it […]

Chicago Transit Authority — Ridership

Waiting for the break of day…oooOOOOO…25 or 6 to 4! -Chicago (formerly The Chicago Transit Authority) I was lucky to live in Chicago during the summer of 2012. The thing I miss most from Chicago is the transit system. Taking the ‘L’ to work every day was much more relaxing and interesting than having to drive […]

Chicago Transit Authority — Ridership By Station

Pirates 2014 — Take Your Finger Off The Panic Button

EVERYONE TAKE THEIR FINGER OFF THAT PANIC BUTTON. The Pirates did really, really well last year. They won 94 games, the NLWCG, and took the Cardinals to 5 games in the NLDS. Expectations for right or wrong reasons have been raised for the following year. With April coming to a close the Pirates are looking […]

Text Message Analytics — Numbers

People communicate a lot through text messages, and lucky for me iPhones keep track of those text messages I’ve sent. iPhones store your text messages in a SQLite database, and this database is readily accessible in your iPhone backup on your computer. [This is why encrypting your backup might be a good idea if you […]

Charlie Morton — PitchFX

I’m in a predictive modeling class for my grad program at NU, and we are learning a statistical programming language called SAS. One of the things we are trying early on is cluster analysis to determine if variables are related. I decided to play around with data that’s a little more interesting than housing prices. […]

#SeanTrek GeoTracks 2012

You might remember #SeanTrek — the 46-day, 12,000-mile, 34-state excursion I took back at the very end of 2012. I didn’t know how I was going to use this at the time, but I geotagged just about everything I did on the trip. I checked in to every place on Foursquare […]

Pirates — Run Probability

Presented without much commentary or analysis. This is how the Pirates fared last year given a certain number of outs and with runners on specific bases. So, for example, with no outs and nobody on base the Pirates had a 26% chance of scoring a run from that point in the inning until […]

Pirates — Pitch Count

At the Pirates, we like to try to guess what the pitch count will be for the Pirates’ starting pitcher. In honor of opening day today, I present a cheat sheet! The graphs might be a little bit overkill, but it’s cool to see all the different ways you can visualize this simple data. The number […]

Runners in Scoring Position [RISP]

These graphs were constructed with data from retrosheet.org. If you like baseball data, I suggest you go there. Also, any individual player must have at least 100 at-bats in the season to appear in the graphs and analysis. Last year a lot of emphasis was placed on how bad the Pirates were offensively, specifically with runners […]

Against All Odds — Upsets

It’s a great time to be a Dayton fan! It’s the first time the school has reached the Sweet Sixteen since before all the Dayton fans I know were born…and they did it as an 11 seed! Their game against Ohio State was the first game of the tournament to tip. A little over two […]

2014 NCAA Tournament Predictions — Monte Carlo

The process of simulating the NCAA tournament involves two steps. The first is determining what statistical prediction model to use to determine the outcome of a game. The second step is to simulate the entire tournament. Simulating the tournament multiple times and keeping track of each outcome is called a Monte Carlo simulation. This simulates […]

NCAA Tournament — Seeding

All of the analysis looks only at the 64-team field from the 1985 tournament through 2013. Before 1985, fewer than 64 teams were invited. Opening Round games are also ignored. The NCAA Selection Committee just released the seeding for the upcoming tournament, and everyone over the next few days will be […]

2013 MLB Wins Visualized

Want to know where all the wins in MLB came from last year? This chart shows where all 2431 (30 teams x 81 wins + 1 Gm163) wins came from last season. The chart is fully interactive. I do advise using a large screen since the chart is so large. This is called […]

2013 AFC Playoffs and Bayesian Statistics

How does a team like the Steelers go from 0-4 to a dark horse for the final AFC playoff spot to inches away from clinching the berth? Then the Chargers, who were a dark horse themselves, go on and secure the 6th seed? Philip Rivers mentioned, in an endorphin-high interview, that no one gave […]

2012 Basketball Scoring Correlation

Below are correlation graphs illustrating a significant (albeit slightly weak) correlation between how many points one team will score and how many points their opponents will score in a given game. This isn’t anything novel, but rather an illustration and confirmation of what you might surmise: teams that play faster score more and […]

2013 Pirates R/RA Graphs