Tag Archives: featured

A Monty Hall Probability Simulation

There are three doors. And hidden behind them are two goats and a car. Your objective is to win the car. Here’s what you do:

  • Pick a door.
  • The host opens one of the doors you didn’t pick that has a goat behind it.
  • Now there are just two doors to choose from.
  • Do you stay with your original choice or switch to the other door?
  • What’s the probability you get the car if you stay?
  • What’s the probability you get the car if you switch?

It’s not a 50/50 choice. I won’t digress into the math behind it, but instead let you play with the simulator below. The game will tally up how many times you win and lose based on your choice.


What’s going on here? Marilyn vos Savant wrote the solution to this game in 1990. You can read vos Savant’s explanations and some of the ignorant responses. But in short, because the door that’s opened is not opened randomly, the host gives you additional information about the set of doors you didn’t choose. Effectively, if you switch, you are selecting all the other doors. If you choose to stay, you are selecting just one door.

In her answer, she suggests:

Here’s a good way to visualize what happened. Suppose there are a million doors, and you pick door #1. Then the host, who knows what’s behind the doors and will always avoid the one with the prize, opens them all except door #777,777. You’d switch to that door pretty fast, wouldn’t you?

To illustrate that in the simulation, you can increase the number of doors in the simulator. It becomes pretty clear that switching is the correct choice.
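If you’d rather run the numbers than click through the simulator, here is a minimal JavaScript sketch of the same game. It is not the code behind the simulator above, and the door and trial counts are arbitrary.

// Simulate one Monty Hall game; returns true if the player wins the car.
// The host opens every unchosen door except one, never revealing the car.
function playGame(numDoors, switchDoors) {
  var car = Math.floor(Math.random() * numDoors);   // door hiding the car
  var pick = Math.floor(Math.random() * numDoors);  // player's first pick

  if (switchDoors) {
    return pick !== car;  // switching wins whenever the first pick was a goat
  }
  return pick === car;    // staying wins only if the first pick was the car
}

function simulate(numDoors, trials, switchDoors) {
  var wins = 0;
  for (var i = 0; i < trials; i++) {
    if (playGame(numDoors, switchDoors)) { wins++; }
  }
  return wins / trials;
}

console.log('Stay (3 doors):   ' + simulate(3, 100000, false));  // ~0.33
console.log('Switch (3 doors): ' + simulate(3, 100000, true));   // ~0.67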

Finally, here’s some Kevin Spacey:

Make an HTML Table with jQuery

For a project I was working on, I needed a quick, simple solution to make a dynamic table based on data sent back from an AJAX call. I used jQuery to build and manipulate the table HTML, since it was quick and jQuery was already in my project.

After considering a few different ways to approach this, I decided different arrays would be the easiest way to handle the data. The data looks like this:
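The actual AJAX response isn’t embedded in this archive view, but a stand-in version of the object described below might look something like this (the column names and values are made up for illustration):

var data = {
  k: ['Player', 'Team', 'HR'],       // header row: an array of strings
  v: [                               // table body: an array of arrays
    ['Player A', 'Team X', 10],
    ['Player B', 'Team Y', 25]
  ]
};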

It’s a JavaScript object with two keys: one for the header, which I abbreviated k, and one for the main data values, which have the key v. The header is just an array of strings, while the values are an array of arrays. I specifically designed this code to work within these parameters, so more checks could be built in, but the data source is rather rigid.

To make the Table class, I defined the attributes:

Using this prototype code is a little bit of overkill, but it can be reused and extended. I plan on having the application update with new data and possibly add other features. Creating a prototype makes that a little easier and cleaner.

I have three setter methods, which just allow the Table object to have its attributes and data set.

All the methods I’ve written have return this in them. That allows method chaining, which makes the implementation of the code a lot simpler. The meat of the code is in the build method.

I’ve annotated most of the code, but basically this creates jQuery objects for each part of the table structure: the table (table), a row (tr), a header cell (th) and a normal table cell (td). The clone() method is necessary so that jQuery creates another HTML element. Otherwise it would keep removing, modifying and appending the same element.
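Since the original class isn’t embedded in this archive view, here is a stripped-down sketch of what a prototype like the one described could look like; the attribute names and the default container selector are my own stand-ins.

// Constructor: holds the header array, the data (array of arrays)
// and the jQuery selection string for the container to append to.
function Table() {
  this.header = [];
  this.data = [];
  this.container = '.table-container';
}

// Setters return `this` so calls can be chained.
Table.prototype.setHeader = function(header) {
  this.header = header;
  return this;
};

Table.prototype.setData = function(data) {
  this.data = data;
  return this;
};

Table.prototype.setContainer = function(container) {
  this.container = container;
  return this;
};

// Build the table HTML and append it to the container.
Table.prototype.build = function(container) {
  var $table = $('<table>');
  var $tr = $('<tr>');
  var $th = $('<th>');
  var $td = $('<td>');

  // Header row: clone() the th template so each append creates a new element.
  var $headerRow = $tr.clone();
  $.each(this.header, function(i, text) {
    $headerRow.append($th.clone().text(text));
  });
  $table.append($headerRow);

  // Body rows: one tr per inner array, one td per value.
  $.each(this.data, function(i, row) {
    var $row = $tr.clone();
    $.each(row, function(j, value) {
      $row.append($td.clone().text(value));
    });
    $table.append($row);
  });

  $(container || this.container).append($table);
  return this;
};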

Using the prototype we just created is rather easy; we did the hard part already. We use the new keyword to instantiate a new object. This allows us to create many different independent Table objects which can be manipulated individually within the application.
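Assuming the data object and the Table sketch above, the instantiation and chained calls might look something like this:

var table = new Table();
table.setHeader(data.k).setData(data.v).build();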

The snippet above shows method chaining, which saves us from writing a separate line of code for each call, like table.setData(). I used setHeader() to set the array (data.k) which populates the table’s header. The setData() method sets the array of arrays (data.v) as the source of the data for the rest of the table.

Finally, the build() method uses the data we just set to actually run the code that manipulates the HTML and this is what you see in your web browser.

Before using it on a web page, there has to be some HTML. (And some CSS so that the table looks decent.) The most important part is that the div container has the class "table-container". The Table class looks for that class by default to append the table to. You can customize that by passing a jQuery selection string as a parameter to the table.build([jQuery selection string]) method.
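A minimal version of that markup is just an empty container div (plus whatever CSS you want to add):

<div class="table-container"></div>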

Above is a working version of all the code from above. The full code I used for this post can be found on my GitHub.

D3 Visualization Basics — First Steps

D3 visualizations work by manipulating elements in the browser window. This short tutorial will demonstrate the very basics of that. This is also a working, simple demonstration of the interplay of HTML, CSS and JavaScript from the introduction page in this D3 tutorial set.

For the sake of making this simple, everything will come from one HTML document, which can be found in my GitHub. This will contain the HTML and JavaScript. You can [and should] separate the JavaScript into its own file on bigger projects.

In this small project, we will start with a simple div container with the class of “container”.

Right now this doesn’t do anything, so it’s not worth showing. But if D3 code is added to create a blue box, it might look a little more interesting. [Provided you find blue boxes interesting.]
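The code block isn’t embedded in this archive view, but a minimal sketch of the call described below might look like this; it assumes d3 is loaded and the “container” div already exists, and the size values are stand-ins.

d3.selectAll('.container')        // select every element with the class "container"
  .append('div')                  // add a child div inside it
  .attr('class', 'new-block')     // give the new div a class
  .style('width', '100px')        // a few in-line styles to make it a blue box
  .style('height', '100px')
  .style('background-color', 'blue')
  .text('Blue block');            // put text inside the div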

Blue Block

Above is a simple example of a basic D3 procedure. It can be helpful to think about this command as having two parts: getting a DOM [browser] element to manipulate and giving instructions for those manipulations. Here is what the code is doing:

  1. The select statement [d3.selectAll()] code is finding every instance of the class of “container”. [There is only one element on the page.]
  2. The append() method adds another div as a child element inside of the container div.
  3. The attr() method gives the new div a class of “new-block”.
  4. The style() method gives the new div several style properties.
  5. The text() method puts the text “Blue block” into the div.

The style() and attr() methods can be a little confusing since they do similar things, but attr() will place attributes in the HTML tag, while style() writes in-line CSS for the element, which will override any CSS stylesheet you load in the header. These are just a few of the different methods you can use for a D3 DOM selection object, and you can find more in the D3 API reference.

Blue Block with Functions

Creating the blue block was a blast, but adding some functionality might make this a little more useful. Let’s make the block change color on a click and then remove it on a double click.
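The script itself isn’t embedded here either, so below is a hedged sketch of the listener part. For brevity it creates a single block rather than the two described in the next paragraph, and everything except the orange [#FF9821] is a stand-in.

// One blue block with click and double-click listeners attached.
d3.selectAll('.container')
  .append('div')
  .attr('class', 'new-block')
  .style('width', '100px')
  .style('height', '100px')
  .style('background-color', 'blue')
  .text('Blue block')
  .on('click', function() {
    // `this` is the specific block that was clicked
    d3.select(this).style('background-color', '#FF9821');
  })
  .on('dblclick', function() {
    d3.select(this).remove();   // delete the block (and any child elements)
  });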

The first part of the script to create the blue block is the same, except I’ve doubled the code to have two blue blocks, and I’ve added new code that interacts with them. The on() method allows you to attach an event listener to elements rendered in the browser. These wait until a certain event happens. In the example it executes a function to turn the box orange when the blue box is clicked. You can put many different instructions in here and aren’t limited to manipulating the element that is being clicked. Below is an illustration that I find useful for visualizing how event listeners are attached. I will devote another post to the details of D3 event listeners.

attach event listeners

You might notice the d3.select(this) in the code. this is a fun keyword in JavaScript which deals with scope, and the this in the code refers to the specific DOM element which was clicked or double clicked. If you click on the left block, only that block turns orange. You could change the code to replace this with '.block-new', and clicking one block would change both blocks to orange.

Having the d3.select(this) code blocks in the event listener function makes it so they are only executed when the event happens. The block’s background color is changed to orange [#FF9821] when it’s clicked. The remove() method deletes any DOM elements within the selection including the children elements. This comes in handy when you need to rebuild or update a data visualization.

[Next]

Data! D3’s most powerful tool.

D3 Visualization Basics — Introduction

Data visualization is important, really important. I can’t be more blunt than that. We are able to process much more information, faster, by seeing a visual representation than by looking at a table or database or interacting with a spreadsheet. I will be writing a series of posts that explore some of the foundations D3 is built on along with how to create engaging data visualizations using it.

D3 is a powerful tool that allows you to create interactive data visualizations for the web. Understanding how D3 works starts with understanding how modern web pages are designed.

If you have found this page, you probably have at least some knowledge of how to make a modern website: HTML, CSS, JavaScript, responsive design, etc. D3 uses basic elements from these components of web design to create its visualizations. This is by no means the only way to create interactive visualizations, but it is an effective way to produce them.

Before jumping into D3 nuts and bolts, let’s look at what each of these components does. [If you already know this stuff, feel free to skip ahead…once I get the other posts built out.]

Basic Web Programming

In the most simplistic terms, HTML provides the structure of the webpage, CSS provides the styling and formatting, and JavaScript provides the functionality of the site. The browser brings these three components together and interprets them into something the end user (you) can understand and use. Sometimes one component can accomplish what the other does, but if you stick to this generalization you’ll be in good shape.

To produce a professional-looking, fully-functional D3 data visualization you will need to understand, write and manipulate all three components.

HTML

The most vivid memories I have of HTML are from the websites of the late 90s: Geocities, Angelfire, etc. HTML provides instructions on how browsers should interpret information; it organizes the information. Everything you see on a webpage has corresponding HTML code.

If you look at the source HTML or inspect one of this site’s pages you’ll see some of the structure. When HTML renders in the browser these elements are referred to as DOM elements. DOM stands for Document Object Model, which is the structure of a webpage.

HTML containers

Looking at the DOM tree you can see many of the div containers that provide structure for how the site is laid out. The p tags contain each paragraph in the content of my posts. h1, h2 and h3 are subheadings I’ve made to make the post more organized. You’ll also notice some attributes, especially class, which has many uses for CSS, JavaScript and D3. Classes in particular are used to identify what function a DOM element plays in JavaScript or how to style it in CSS.

CSS

A house without painted walls or decorations is pretty boring. The same thing happens with bare bones HTML. You can organize the information, but it won’t be in an appealing format.

Most sites have style sheets (CSS) which sets margins, colors, display options, etc. Style sheets have a specific syntax which identifies HTML elements by type, class or id. This identification and selection concept is used extensively in D3.

CSS example

Above is some CSS from this site. It contains formatting instructions for elements of the class “page-links”. It includes instructions for the font size, margins, height, width and to make the text all uppercase. The advantage of CSS is that it keeps formatting away from the structure of the HTML allowing you to format many elements at once. For example if you wanted to change the color of every link, you could easily do that by modifying the CSS.
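The screenshot isn’t reproduced here, but a rule of that shape would look roughly like this; the exact values are made up.

.page-links {
  font-size: 12px;            /* font size */
  margin: 0 10px;             /* margins */
  height: 30px;               /* height */
  width: 100px;               /* width */
  text-transform: uppercase;  /* make the text all uppercase */
}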

There is an alternative to using CSS style sheets and that’s by using inline style definitions.

Inline styles use the same markup as the CSS in the style sheets. Inline styles

  • control only the element they are in
  • OVERRIDE any CSS styles [without an !important tag]
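The example isn’t embedded in this archive view, but an inline style on a paragraph looks something like this; the right alignment is just an arbitrary illustration.

<p style="text-align: right;">This paragraph ignores the stylesheet and aligns right.</p>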

The code above overrides the normal paragraph style, which aligns it left. Using inline styles is generally bad for web design, but it’s important to understand how they work since D3 manipulates inline styles often.

JavaScript

JavaScript breathes life into your web page. It’s certainly not the only way to make your website interactive or to build programming into it, but it is widely used and supported in the popular browsers. D3 is a JavaScript library, so you will inevitably have to write JavaScript to use it.

For D3 visualizations, JavaScript will be used to

  • Manage and manipulate data for the visualization
  • Create DOM elements
  • Manipulate DOM elements
  • Destroy DOM elements
  • Attach data to DOM elements

JavaScript will be used to insert elements onto the page, and it will also be used to change the colors and styles of those elements. You might be able to see how this could be useful. For example, JavaScript could map data points to an element’s position for a scatter plot or to an element’s height or width for a bar chart.
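To make that concrete, here is a small, hedged sketch of a D3 data join; the numbers and the “.chart” container are made up, and the data join itself gets its own post later in the series.

var scores = [10, 25, 40];                 // made-up data

d3.select('.chart')                        // assumes a <div class="chart"> exists
  .selectAll('div')
  .data(scores)                            // attach one data point to each div
  .enter()                                 // create a div for each unbound data point
  .append('div')
  .style('width', function(d) { return d * 5 + 'px'; })   // width driven by the data
  .style('background-color', 'steelblue')
  .text(function(d) { return d; });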

I bolded the last function D3 performs, attaching data to elements, because it’s so critical to D3. It allows you to attach data beyond just x, y values, which makes rich visualizations possible.

__data__

Above is data attached to a D3 visualization I made for FanGraphs. This is a simple example, but I was able to attach data detailing the team’s name, id, league, ERA and FIP. Using the attached data I was able to create the graph and tooltips. More complex designs can take advantage of the robust data structure D3 provides.

[Next]

I’ll look at how to set up a basic project by organizing data, files and code.

Stattleship! Sport Stats API

I’ve been in contact with the team over at Stattleship. They have a cool API that allows you to get various stats for basketball, football and hockey. I used data from that API to create the following data visualization for their blog. The visualization shows the offensive and special teams yards gained by each team remaining in the playoffs. The yardage is totaled for the entire season as well as the one playoff game each team played. I’ve displayed the points off of offensive TDs and special teams scoring, and that score is color coded according to wins and losses. A black background is a win, and a white background is a loss.


The Backwards K — Baseball Strikeout Looking

The backwards K is normally used to denote a called third strike in a strikeout. It’s typically written on a scorecard. I’ve been looking for the backwards K so I can denote the strikeout looking on Twitter, and I finally found it:

ꓘ

(for unsupported browsers — Chrome)

The easiest way to use this character is to copy and paste the backwards K from above and save it in a note or something you can copy and paste from routinely. This character actually comes from Apple’s implementation of the Unicode block for the artificial, Latinized version of the Lisu alphabet. This alphabet contains an upside-down, turned K which looks similar enough to a backwards K that I think it passes on Twitter.
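If you’d rather generate the character than copy it, the code point I believe it corresponds to is U+A4D8 (LISU LETTER KHA); check it against the Unicode chart linked below before relying on it. A one-liner in Python:

# U+A4D8 is, as far as I can tell, the turned K from the Lisu block.
print(u'\uA4D8')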

If you don’t see the backwards K in the block above, your computer or mobile device probably isn’t using a font that supports that specific character. It’s supported on Macs and iPhones (as well as the Edge browser in Windows 10).

References:
http://unicode.org/charts/PDF/UA4D0.pdf
https://en.wikipedia.org/wiki/Fraser_alphabet

R Bootcamp: Making a Subset

Data Manipulation: Subsetting

Making a subset of a data frame is one of the most basic and necessary data manipulation techniques you can use in R. If you are brand new to data analysis, a data frame is the most common data storage object in R, and a subset is a collection of rows from that data frame selected based on certain criteria.

[Diagram: a data frame with rows Row1 through Row6 and columns V1 through V7, with an arrow pointing to a subset that keeps only Row2, Row5 and Row6.]

The Data

For this example, I’m using data from FanGraphs. You can get the exact data set here, and it’s provided in my GitHub. This data set has players’ names, teams, seasons and stats. We are able to create a subset based on any one or more of these variables.

The Code

I’m going to show four different ways to subset data frames: using a boolean vector, using the which() function, using the subset() function and using the filter() function from the dplyr package. All of these are different ways to do the same thing. The dplyr package is fast and easy to code, and it is my recommended subsetting method, so let’s start with that. This is especially true when you have to loop an operation or run something repeatedly.

dplyr

The filter() function requires the dplyr package to be loaded in your R environment, and loading dplyr masks the filter() function from the default stats package. You don’t need to worry about that, but R does tell you when you first install and load the package.

Aside from loading the package, you’ll have to load the data in as well.
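The code isn’t embedded in this archive view, so here is a sketch of the kind of thing filter() is doing; the column names (Team, Season, HR) and file name are stand-ins for whatever is in the FanGraphs file.

# load the package and the data
library(dplyr)
fangraphs <- read.csv("fangraphs_leaderboard.csv")   # hypothetical file name

# keep only one team's rows
angels <- filter(fangraphs, Team == "Angels")

# multiple conditions: one team, one season, a stat threshold
big_hr_2015 <- filter(fangraphs, Team == "Angels", Season == 2015, HR >= 20)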

The filter() function is rather simple to use. The examples above illustrate a few simple cases where you specify the data frame you want to use and create true/false expressions, which filter() uses to find which rows it should keep. The output of the function is saved into a separate variable, so we can reuse the original data frame for other subsets. I put a few other examples in the code to demonstrate how it works.

Built-in Functions

If you don’t want to use the dplyr package, you can accomplish the same thing using the basic functionality of R. #method 1 uses a boolean vector to select rows for the subset. #method 2 uses the which() function, which finds the indexes of the TRUE values in a boolean vector. Both of these techniques use the original data frame and use row indexes to create a subset.
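A sketch of those two base-R approaches, using the same stand-in column names as above:

#method 1 -- a boolean vector: TRUE/FALSE for every row, used as a row index
keep <- fangraphs$Team == "Angels"
angels_bool <- fangraphs[keep, ]

#method 2 -- which(): converts the boolean vector into the indexes of the TRUE rows
angels_which <- fangraphs[which(fangraphs$Team == "Angels"), ]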

The subset() function works much like the filter() function, except the syntax is slightly different and you don’t have to download a separate package.
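For completeness, the same filter written with the built-in subset() function (same stand-in column names):

angels_subset <- subset(fangraphs, Team == "Angels" & Season == 2015)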

Efficiency

While subset() works in a similar fashion, it doesn’t perform the same way. Some data manipulation might only happen once or a few times throughout a project, but many projects require constant subsetting, possibly from inside a loop. So while the gains might seem insignificant for one run, multiply that difference over many runs and it adds up quickly.

I timed how long it would take to run the same [complex] subset of a 500,000 row data frame using the four different techniques.

Time to Subset 500,000 Rows
Subset Method Elapsed Time (sec)
boolean vector 0.87
which() 0.33
subset() 0.81
dplyr filter() 0.21

The dplyr filter() function was by far the quickest, which is why I prefer to use it.

The full code I used to write up this tutorial is available on my GitHub.

References:

Introduction to dplyr. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html


Emoji iOS 9.1 Update — The Taco Emoji Analysis

Before I get too far: I don’t actually analyze taco emojis. At least not yet. I do, however, give you the tools to start parsing them from tweets, text or anything you can get into Python.

This past month Apple released iOS 9.1 and their latest OS X 10.11.1 El Capitan update. That update included a bunch of new emojis. I’ve made a quick primer on how to handle emoji analysis in Python. Then, when Apple released an update to their emojis to include diversity modifiers, I updated my small Python class for emoji counting to include the newest emojis. I also looked at what is actually happening with the unicode when diversity modifier patches are used.

Click for Updated socialmediaparse Library

With this latest update, Apple and the Unicode Consortium didn’t really introduce any new concepts, but I did update the Python class to include the newest emojis. In my GitHub the data folder includes a text file with all the emojis delimited by '\n'. The class uses this file to find any emojis in a unicode string which has been passed to the add_emoji_count() method.

Building off of the diversity emoji update, I added a skin_tone_dict property to the EmojiDict class. This property returns a dictionary with the number of unique human emojis per tweet and their skin tones. This property will not catch multiple human emojis if they are in the same execution of the add_emoji_count() method.
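The usage snippet isn’t embedded in this archive view; a sketch of how the class is meant to be driven, assuming EmojiDict is importable from the socialmediaparse module in the GitHub repo (the import path and example strings are stand-ins), looks like this:

# -*- coding: utf-8 -*-
from socialmediaparse import EmojiDict   # assumed import path; check the repo layout

emoji_counter = EmojiDict()

# feed each tweet (or any unicode string) to the counter
for tweet in [u'I \U0001F44D\U0001F3FD this', u'taco night \U0001F32E']:
    emoji_counter.add_emoji_count(tweet)   # counts any emojis found in the string

print(emoji_counter.skin_tone_dict)        # unique human emojis per tweet and their skin tones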

Above is an example of how to use the new attribute. It is a dictionary so you can work that into your analysis however you like. I will eventually create better methods and outputs to make this feature more robust and useful.

The full code / class I used in this post can be found on my GitHub.

R Bootcamp — A Quick Introduction

R, a statistical programming language and environment, is becoming more popular as organizations, governments and businesses increase their use of data science. In an effort to provide a quick bootcamp covering the basics, I’ve assembled some of the most fundamental processes to give a new user a quick introduction to the R language.

This post assumes that you have already installed R and have it running correctly on your computer. I recommend getting RStudio to write and execute your code. It will make your life much easier.

Getting Started

First, R is an interactive programming environment, which means you are able to send commands to its interpreter to tell it what to do.

There are two basic methods to send commands to R. The first is to use the console, which is like old-school command-line computing. The second method, more typically used by R coders, is to write a script. An R script isn’t fancy; at its core it’s a text document that contains a collection of R commands. When the script is executed, it is treated like a collection of individual commands being fed one-by-one into the R interpreter. This differs from how other, more fundamental programming languages work.

R - How R Works

Basics

Comments are probably the best place to start, especially because my code is chock-full of them. A comment is code that is fed to R, but it’s not executed and has no bearing on the function of your script or command. In R comments are lines prefaced with a #.

One of the most basic things you could use R for is a calculator. For instance, if we run the code 9-3, R will display 6 as the result. All of this is rather straightforward. The only operator you might not be familiar with if you are new to coding is the modulus operator, which yields the remainder when you divide the first number by the second. This gets used often when dealing with data. For example, you can get a 0 for even numbers and a 1 for odd numbers if you take your variable and use the modulus operator with the number 2.
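A few lines you can type straight into the console:

# basic arithmetic
9 - 3        # 6
4 * 5        # 20

# the modulus operator returns the remainder
7 %% 2       # 1, so 7 is odd
8 %% 2       # 0, so 8 is even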

Beyond the basic math and numeric operations, R has several fundamental data types. NULL and NA represent empty objects or missing data. These two data types aren’t the same: NA will fill a position in a vector or data frame, while NULL will not. The details are best left for another entry.

Numeric values can have mathematical operations performed on them. Strings are essentially non-numeric values: you can’t add strings together or find the average of a string. In any type of data analysis, you’ll typically have some string data. It can be used to classify entries categorically, such as male/female or Mac/Windows/Linux, and R will treat these as factors.

Finally, boolean values (True or False) are binary logical values. They work like normal logic operations you might have learned in math or a logic class with AND (&&) and OR (||) operators. These can be used in conditional statements and various other data manipulation operations such as subsetting.

Now that we’ve covered the basic operations and data types, let’s look at how to store them: variables. Assigning a value to a variable is rather easy. You can use a simple equals sign or the traditional R notation using an arrow.
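Both assignment styles look like this:

x <- 5    # traditional R arrow notation
y = 10    # equals sign also works
x + y     # 15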

Variables must begin with a letter and they are case-sensitive. Periods are acceptable faux separators in variable names, but that doesn’t translate to other programming languages like Python or JavaScript, so that might factor in how you establish naming conventions.

I’ve mentioned vectors a few times already. They are an important data structure within R. A vector is an ordered list of data, typically thought of as numeric data, although character (string) vectors are often used in R. The c() function creates a vector. It’s important that vectors contain the same type of data: boolean, numeric or character. If you mix types, R will coerce the values into a single type. And you can assign your vectors to variables; in fact, you can store just about anything in R in a variable.

Lists are created with the list() function. They are used more for storage and organization than as a data structure for computation. For example, you could store the mean, median and range for a set of data in a list, while a vector would house the data used to calculate those summary stats. Lists are useful when you begin to write bigger programs and need to shuffle a lot of things around.
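A quick sketch of both structures, with made-up values:

points <- c(22, 30, 25, 35)                 # a numeric vector
teams  <- c("Lakers", "Celtics")            # a character vector

summary_list <- list(mean = mean(points),   # a list holding named results
                     median = median(points),
                     range = range(points))
summary_list$mean                            # 28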

The basic statistic operators are listed below. All of these require a vector to operate on.
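A sketch of the usual suspects, run on the points vector from above:

mean(points)     # average
median(points)   # middle value
sd(points)       # standard deviation
var(points)      # variance
sum(points)      # total
length(points)   # number of elements
max(points)      # largest value
min(points)      # smallest value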

Handling Data

Above we discussed some of the building blocks of basic analysis in R. Beyond introductory statistics classes, R isn’t very useful unless you can import data. There are many ways to do this, since data exists in many different formats. A .csv file is one of the most basic, compatible ways data is stored to be used between different analytical tools.

Before loading this data file into R, it’s a good idea to set your working directory. This is where the data file is stored.

Next you can use the read.csv() function to ingest a .csv file into R. This call won’t save the data in a variable; it just brings it in as a data frame and shows it to you.
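Something like this, with the path and file name swapped for your own (both are hypothetical here):

setwd("~/projects/r-bootcamp")     # the folder that holds the data file
read.csv("kobe_bryant_stats.csv")  # prints the data frame without saving it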

Data frames are the primary form of data structure you’ll encounter in R. Data frames are like tables in Excel or SQL in that they are rectangular and have a rigid schema. However, at a data frame’s core is a collection of equal-length vectors.

If you assign the data frame output of the read.csv() function to a variable, you can pass the data frame around to different data manipulation functions or modeling functions. One of the most basic ways to manipulate the data is to access different values within the data frame. Below are several examples of how to access values, rows or columns in a data frame.
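Using a data frame saved to a variable (the file name and column names below, like PTS, are stand-ins for whatever is in the basketball-reference file):

kobe <- read.csv("kobe_bryant_stats.csv")

kobe[1, 3]      # value in the first row, third column
kobe[1, ]       # the entire first row
kobe[, 3]       # the entire third column
kobe$PTS        # the PTS column, referenced by name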

The basic concept is that data frames can be accessed by row and column number: [row, column]. An entire row or column can be accessed by omitting the dimension you aren’t trying to retrieve. You can also retrieve individual fields (variables) by using the $ sign with the variable name. This is the method I use most often; it requires knowing and using the names of the variables, which can make your code easier to read.

By accessing rows, you can create a subset of data by using a logical argument to filter out your data set.
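With a hypothetical Age column, that looks like:

kobe_young <- kobe[kobe$Age < 25, ]    # seasons before he turned 25
kobe_older <- kobe[kobe$Age >= 25, ]   # seasons at age 25 and over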

The code above creates two new data frames which separate Kobe Bryant’s season stats into an under-25 data set and a 25-and-over data set.

Relationships Between Variables

Correlation is often used to summarize the linear relationship between two variables. Getting the correlation in R is simple. Use the cor() function with two equal length vectors. R uses the corresponding elements in each vector to get a Pearson correlation coefficient.

A simple linear model can be made by using the lm() function. The linear model function requires two things: a formula and a data frame. The formula uses a tilde (~) instead of an equals sign and represents the variables you would use in your standard ordinary least squares regression. The data parameter is the data frame which contains all the data.

The summary() function takes the linear model object and displays information about the coefficients of your linear model.
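Sticking with the same hypothetical column names, the whole sequence looks like:

cor(kobe$MP, kobe$PTS)               # Pearson correlation of minutes and points

fit <- lm(PTS ~ MP, data = kobe)     # ordinary least squares: points modeled by minutes
summary(fit)                         # coefficients, R-squared, p-values, etc.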

NOTES: The data set used in this tutorial is from basketball-reference.com.

The full code I used in this tutorial can be found on my GitHub.

Covariance — Different Ways to Explain or Visualize It

Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings to a lot of different statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, X and Y will have positive covariance. Negative covariance results from two variables moving in opposite directions, and zero covariance results from the variables having no relationship with each other. Variance is a specific case of covariance where both input variables are the same. All of this is rather abstract, so let’s look at more concrete definitions.

Covariance — Summation Notation

The definition you will find in most introductory stats books is a relatively simple equation using the summation operator (Σ). This shows covariance as the sum of the products of each paired data point’s deviations from its mean. First, you need to find the mean of both variables. Then take each data point and subtract its variable’s mean from it. Finally, multiply the differences together for each pair, sum those products and divide by the number of data points (or by n-1 for the sample covariance).

Population Covariance:

$latex
cov(X, Y) = \frac{1}{N} \sum\limits_{i}^{N}{(X_i - \mu_x)(Y_i - \mu_y)}
&s=2$

Sample Covariance:

$latex
cov(X, Y) = \frac{1}{n-1} \sum\limits_{i}^{n}{(X_i - \bar X)(Y_i - \bar Y)}
&s=2$

N is the number of data points in the population; n is the sample size. μX is the population mean for X; μY for Y. X̄ and Ȳ are means as well, but this notation designates them as sample means rather than population means. Calculating the covariance of any significant data set can be tedious if done by hand, but we can set up the equation in R and see it work. I used a modified version of Anscombe’s Quartet data set.
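The code isn’t embedded in this archive view, but the summation formula translates almost directly into R. The vectors below are simple stand-ins rather than the modified Anscombe data, so the result won’t match the 3.16 quoted later in the post.

x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 5, 9, 12)

N <- length(x)

# population covariance: average product of deviations from the means
pop_cov <- sum((x - mean(x)) * (y - mean(y))) / N

# sample covariance: divide by n - 1 instead; this matches R's built-in cov()
samp_cov <- sum((x - mean(x)) * (y - mean(y))) / (N - 1)
cov(x, y)    # same value as samp_cov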

Obviously, since covariance is used so much within statistics, R has a built-in function cov(), which yields the sample covariance for two vectors or even a matrix.

Covariance — Expected Value Notation

Trying to explain covariance in expected value notation makes me realize I should back up and explain the expected value operator, but that will have to wait for another post. Quickly and oversimplified, the expected value is the mean value of a random variable: E[X] = mean(X). The expected value notation below describes the population covariance of two variables (not the sample covariance):

$latex
cov(X, Y) = \textnormal{E}[(X-\textnormal{E}[X])(Y-\textnormal{E}[Y])]
&s=2$

The above formula is just the population covariance written differently. For example, E[X] is the same as μx, and the E[ ] acts the same as taking the average of (X-E[X])(Y-E[Y]). After some algebraic transformations you can arrive at the less intuitive, but still useful, formula for covariance:

$latex
cov(X, Y) = \textnormal{E}[XY] - \textnormal{E}[X]\textnormal{E}[Y]
&s=2$

This formula can be interpreted as the product of the means of X and Y subtracted from the average of the signed areas (the products) of X and Y. This probably isn’t very useful if you are trying to interpret covariance, but you’ll see it from time to time. And it works! Try it in R and compare it to the population covariance from above.
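Using the same stand-in vectors from the sketch above, the check is one line:

mean(x * y) - mean(x) * mean(y)   # equals the population covariance computed earlier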

Covariance — Signed Area of Rectangles

Covariance can also be thought of as the sum of the signed areas of the rectangles that can be drawn from the data points to the variables’ respective means. It’s called the signed area because we get two types of rectangles: ones with positive values and ones with negative values. Area is always a positive number, but these rectangles take on a sign by virtue of their geometric position. This is more of an academic exercise, in that it provides an understanding of what the math is doing and less of a practical interpretation and application of covariance. If you plot the paired data points, in this case the X and Y variables we have already used, you can tell just by looking that there is probably some positive covariance, because it looks like there is a linear relationship in the data. I’ve chosen to scale the plot so that zero is not included; since this is a scatter plot, including zero isn’t necessary.

XY Scatter Plot

First, we can draw the lines for the means of both variables as straight lines. These lines effectively create a new set of axes and will be used to draw the rectangles. The sides of the rectangles will be the difference between a data point and its mean [Xi - X̄]. When that is multiplied by [Yi - Ȳ], you can see that gives you the area of a rectangle. Do that for every point in your data set, add them up and divide by the number of data points, and you get the population covariance.

Scatterplot Covariance Rectangle

The following plot has a rectangle for each data point, coded red for negative and blue for positive signs.

All Covariance Rectangles

The overlapping rectangles need to be considered separately, so the opacity is reduced so that all the rectangles are visible. For this data set there is much more blue area than red area, so there is positive covariance, which jibes with what we calculated earlier in R. If you were to take the areas of those rectangles, add or subtract them according to the blue/red color, then divide by the number of rectangles, you would arrive at the population covariance: 3.16. To get the sample covariance you’d subtract one from the number of rectangles when you divide.

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.

Covariance. http://math.tutorvista.com/statistics/covariance.html

Covariance As Signed Area Of Rectangles. http://www.davidchudzicki.com/posts/covariance-as-signed-area-of-rectangles/
How would you explain covariance to someone who understands only the mean?
https://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean

Notes:

The signed-area-of-rectangles explanations on Chudzicki’s site and Stack Exchange use a different covariance formulation than my approach, but a similar concept.

The full code I used to write up this tutorial is available on my GitHub.