Introduction to Correlation with R | Anscombe’s Quartet

Correlation is one the most commonly [over]used statistical tool. In short, it measures how strong the relationship between two variables. It’s important to realize that correlation doesn’t necessarily imply that one of the variables affects the other.

Basic Calculation and Definition

Covariance also measures the relationship between two variables, but it is not scaled, so it can’t tell you the strength of that relationship. For example, Let’s look at the following vectors a, b and c. Vector c is simply a 10X-scaled transformation of vector b, and vector a has no transformational relationship with vector b or c.

a b c
1 4 40
2 5 50
3 6 60
4 8 80
5 8 80
5 4 40
6 10 100
7 12 120
10 15 150
4 9 90
8 12 120

Plotted out, a vs. b and a vs. c look identical except the y-axis is scaled differently for each. When the covariance is taken of both a & b and a & c, you get different a large difference in results. The covariance between a & b is much smaller than the covariance between a & c even though the plots are identical except the scale. The y-axis on the c vs. a plot goes to 150 instead of 15.

$latex
cov(X, Y) = \frac{\Sigma_i^N{(X_i – \bar X)(Y_i – \bar Y)}}{N-1}
&s=2$

$latex
cov(a, b) = 8.5
&s=2$

$latex
cov(a, c) = 85
&s=2$

correlation b vs acorrelation c vs a

To account for this, correlation is takes covariance and scales it by the product of the standard deviations of the two variables.

$latex
cor(X, Y) = \frac{cov(X, Y)}{s_X s_Y}
&s=2$

$latex
cor(a, b) = 0.8954
&s=2$

$latex
cor(a, c) = 0.8954
&s=2$

Now, correlation describes how strong the relationship between the two vectors regardless of the scale. Since the standard deviation in vector c is much greater than vector b, this accounts for the larger covariance term and produces identical correlations terms. The correlation coefficient will fall between -1 and 1. Both -1 and 1 indicate a strong relationship, while the sign of the coefficient indicates the direction of the relationship. A correlation of 0 indicates no relationship.

Here’s the R code that will run through the calculations.

#covariance vs correlation
a <- c(1,2,3,4,5,5,6,7,10,4,8)
b <- c(4,5,6,8,8,4,10,12,15,9,12)
c <- c(4,5,6,8,8,4,10,12,15,9,12) * 10

data <- data.frame(a, b, c)

cov(a, b)  #8.5
cov(a, c)  #85

cor(a,b)  #0.8954
cor(a,c)  #0.8954

Caution | Anscombe's Quartet

Correlation is great. It's a basic tool that is easy to understand, but it has its limitations. The most prominent being the correlation =/= causation caveat. The linked BuzzFeed article does a good job explaining the concept some ridiculous examples, but there are real-life examples being researched or argued in crime and public policy. For example, crime is a problem that has so many variables that it's hard to isolate one factor. Politicians and pundits still try.

Another famous caution about using correlation is Anscombe's Quartet. Anscombe's Quartet uses different sets of data to achieve the same correlation coefficient (0.8164 give or take some rounding). This exercise is typically used to emphasize why it's important to visualize data.

Anscombe Correlation R

The graphs demonstrates how different the data sets can be. If this was real-world data, the green and yellow plots would be investigated for outliers, and the blue plot would probably be modeled with non-linear terms. Only the red plot would be consider appropriate for a basic, linear model.

I created this plot in R with ggplot2. The Anscombe data set is included in base R, so you don't need to install any packages to use it. Ggplot2 is a fantastic and powerful data visualization package which can be download for free using the install.packages('ggplot2') command. Below is the R code I used to make the graphs individually and combine them into a matrix.

#correlation
cor1 <- format(cor(anscombe$x1, anscombe$y1), digits=4)
cor2 <- format(cor(anscombe$x2, anscombe$y2), digits=4)
cor3 <- format(cor(anscombe$x3, anscombe$y3), digits=4)
cor4 <- format(cor(anscombe$x4, anscombe$y4), digits=4)

#define the OLS regression
line1 <- lm(y1 ~ x1, data=anscombe)
line2 <- lm(y2 ~ x2, data=anscombe)
line3 <- lm(y3 ~ x3, data=anscombe)
line4 <- lm(y4 ~ x4, data=anscombe)

circle.size = 5
colors = list('red', '#0066CC', '#4BB14B', '#FCE638')

#plot1
plot1 <- ggplot(anscombe, aes(x=x1, y=y1)) + geom_point(size=circle.size, pch=21, fill=colors[[1]]) +
  geom_abline(intercept=line1$coefficients[1], slope=line1$coefficients[2]) +
  annotate("text", x = 12, y = 5, label = paste("correlation = ", cor1))

#plot2
plot2 <- ggplot(anscombe, aes(x=x2, y=y2)) + geom_point(size=circle.size, pch=21, fill=colors[[2]]) +
  geom_abline(intercept=line2$coefficients[1], slope=line2$coefficients[2]) +
  annotate("text", x = 12, y = 3, label = paste("correlation = ", cor2))

#plot3
plot3 <- ggplot(anscombe, aes(x=x3, y=y3)) + geom_point(size=circle.size, pch=21, fill=colors[[3]]) +
  geom_abline(intercept=line3$coefficients[1], slope=line3$coefficients[2]) +
  annotate("text", x = 12, y = 6, label = paste("correlation = ", cor3))

#plot4
plot4 <- ggplot(anscombe, aes(x=x4, y=y4)) + geom_point(size=circle.size, pch=21, fill=colors[[4]]) +
  geom_abline(intercept=line4$coefficients[1], slope=line4$coefficients[2]) +
  annotate("text", x = 15, y = 6, label = paste("correlation = ", cor4))

grid.arrange(plot1, plot2, plot3, plot4, top='Anscombe Quadrant -- Correlation Demostration')

The full code I used to write up this tutorial is available on my GitHub .

References:

Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.