Covariance is the less understood sibling of correlation. While correlation is commonly used in reporting, covariance provides the mathematical underpinnings for many statistical concepts. Covariance describes how two variables change in relation to one another. If variable X increases and variable Y increases as well, X and Y will have positive covariance. Negative covariance results from two variables moving in opposite directions, and zero covariance results from the variables having no linear relationship with each other. Variance is also a special case of covariance where both input variables are the same. All of this is rather abstract, so let’s look at more concrete definitions.
Covariance — Summation Notation
The definition you will find in most introductory stat books is a relatively simple equation using the summation operator (Σ). It expresses covariance as the sum of the products of each paired data point’s deviations from the means, divided by the number of data points. First, find the mean of each variable. Then subtract each variable’s mean from each of its data points. Finally, multiply the paired differences together, add up the products, and divide by N for a population or by n - 1 for a sample.
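Written out in LaTeX, the steps above give the standard population and sample covariance formulas. The left-hand symbols (σ with subscript XY for the population, s with subscript XY for the sample) are a common convention assumed here; the remaining notation is explained just below.

```latex
% Population covariance: average product of deviations over all N points
\sigma_{XY} = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)

% Sample covariance: same sum of products, divided by n - 1 instead of n
s_{XY} = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
```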
N is the number of data points in the population.
n is the number of data points in the sample. μX is the population mean for X; μY for Y.
X̄ and Ȳ are means as well, but this notation designates them as sample means rather than population means.
Calculating the covariance of any sizable data set can be tedious if done by hand, but we can set up the equation in R and see it work. I used a modified version of the Anscombe’s Quartet data set.
# get a data set (anscombe ships with R's built-in datasets package)
X <- c(anscombe$x1, 6, 4, 10)
Y <- c(anscombe$y1, 10, 8, 6)
# get the means
X.bar <- mean(X)
Y.bar <- mean(Y)
# calculate the covariance
sum((X - X.bar) * (Y - Y.bar)) / length(X)        # manual, population
sum((X - X.bar) * (Y - Y.bar)) / (length(X) - 1)  # manual, sample
cov(X, Y)                                         # built-in function; uses SAMPLE covariance
Since covariance is used so much within statistics, R has a built-in function, cov(), which yields the sample covariance for two vectors or the covariance matrix for the columns of a matrix.
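As a quick illustration of the points above, the following sketch checks that variance is covariance of a variable with itself, shows the matrix form of cov(), and rescales the sample covariance to the population value. The names S, n, and pop_cov are illustrative choices, not part of the original write-up.

```r
# Same modified Anscombe data as above
X <- c(anscombe$x1, 6, 4, 10)
Y <- c(anscombe$y1, 10, 8, 6)

# Variance is the covariance of a variable with itself
cov(X, X)
var(X)

# Passing a two-column matrix returns the 2x2 sample covariance matrix:
# variances on the diagonal, cov(X, Y) off the diagonal
S <- cov(cbind(X, Y))
S

# cov() divides by n - 1; rescale by (n - 1) / n to get the population covariance
n <- length(X)
pop_cov <- cov(X, Y) * (n - 1) / n
pop_cov
```

The rescaling step is handy because base R does not expose a population option on cov() for this purpose.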
Covariance — Expected Value Notation
[Trying to explain covariance in expected value notation makes me realize I should back up and explain the expected value operator, but that will have to wait for another post.] Quickly and oversimplified, the expected value is the mean value of a random variable: E[X] = mean(X). The expected value notation below describes the population covariance of two variables (not sample covariance):
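In LaTeX, the standard expected-value definition of covariance reads as follows (Cov is the usual operator name, assumed here):

```latex
\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]
```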
The above formula is just the population covariance written differently. For example, E[X] is the same as μX, and the outer E acts the same as taking the average of (X-E[X])(Y-E[Y]). After some algebraic transformations you can arrive at the less intuitive, but still useful formula for covariance:
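The algebra is short. A sketch of the steps, relying only on the linearity of expectation and the fact that E[X] and E[Y] are constants:

```latex
\begin{aligned}
\operatorname{Cov}(X, Y) &= E\big[(X - E[X])(Y - E[Y])\big] \\
&= E\big[XY - X\,E[Y] - E[X]\,Y + E[X]\,E[Y]\big] \\
&= E[XY] - E[X]\,E[Y] - E[X]\,E[Y] + E[X]\,E[Y] \\
&= E[XY] - E[X]\,E[Y]
\end{aligned}
```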
This formula can be interpreted as the mean of the products of X and Y minus the product of the means of X and Y. That probably isn’t very useful if you are trying to interpret covariance, but you’ll see it from time to time. And it works! Try it in R and compare it to the population covariance from above.
mean(X*Y) - mean(X)*mean(Y) #expected value notation
Covariance — Signed Area of Rectangles
Covariance can also be thought of as the sum of the signed areas of the rectangles that can be drawn from the data points to the variables’ respective means. It’s called signed area because we get two types of rectangles: ones with positive values and ones with negative values. Area is always a positive number, but these rectangles take on a sign by virtue of their geometric position. This is more of an academic exercise, in that it provides an understanding of what the math is doing rather than a practical interpretation or application of covariance. If you plot the paired data points (here, the X and Y variables we have already used), you can tell just by looking that there is probably some positive covariance, because the data appear to have a positive linear relationship. I’ve chosen to scale the plot so that zero is not included; since this is a scatter plot, including zero isn’t necessary.
First, we can draw the means of both variables as straight lines. These lines effectively create a new set of axes and will be used to draw the rectangles. One side of each rectangle is the difference between a data point and its mean, [Xi - X̄]. When that is multiplied by the other side, [Yi - Ȳ], you get the signed area of a rectangle. Do that for every point in your data set, add the areas up, divide by the number of data points, and you get the population covariance.
The following plot has a rectangle for each data point, color-coded red for negative and blue for positive signs.
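A plot along these lines can be sketched with base R graphics. This is a minimal sketch under my own assumptions, not necessarily the code from the GitHub repository: the names positive and fill are illustrative, and the 0.2 alpha level is an arbitrary choice.

```r
X <- c(anscombe$x1, 6, 4, 10)
Y <- c(anscombe$y1, 10, 8, 6)
X.bar <- mean(X)
Y.bar <- mean(Y)

# A rectangle is positive when the point lies above-right or below-left of the means
positive <- (X - X.bar) * (Y - Y.bar) > 0
fill <- ifelse(positive,
               adjustcolor("blue", alpha.f = 0.2),
               adjustcolor("red", alpha.f = 0.2))

plot(X, Y, pch = 19)
abline(v = X.bar, h = Y.bar, lty = 2)  # the means act as a new set of axes
# One rectangle per point, spanning from the point to the intersection of the means
rect(xleft = pmin(X, X.bar), ybottom = pmin(Y, Y.bar),
     xright = pmax(X, X.bar), ytop = pmax(Y, Y.bar),
     col = fill, border = NA)
points(X, Y, pch = 19)                 # redraw the points on top of the rectangles
```

Using pmin() and pmax() keeps the corner arguments ordered regardless of which side of the mean a point falls on.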
The overlapping rectangles need to be considered separately, so the opacity is reduced to keep all the rectangles visible. For this data set there is much more blue area than red area, so there is positive covariance, which jibes with what we calculated earlier in R. If you were to take the areas of those rectangles, add or subtract them according to their blue or red color, and then divide by the number of rectangles, you would arrive at the population covariance: 3.16. To get the sample covariance you’d subtract one from the number of rectangles when you divide.
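The same bookkeeping can be done numerically rather than by eye. The sketch below totals the blue and red areas separately and then combines them; signed_area, blue_area, and red_area are illustrative names introduced here.

```r
X <- c(anscombe$x1, 6, 4, 10)
Y <- c(anscombe$y1, 10, 8, 6)

# Signed area of each rectangle: width (X - mean) times height (Y - mean)
signed_area <- (X - mean(X)) * (Y - mean(Y))

blue_area <- sum(signed_area[signed_area > 0])  # total of the positive rectangles
red_area  <- sum(signed_area[signed_area < 0])  # total of the negative rectangles (<= 0)

n <- length(signed_area)
(blue_area + red_area) / n        # population covariance
(blue_area + red_area) / (n - 1)  # sample covariance, same value as cov(X, Y)
```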
Chatterjee, S., Hadi, A. S., & Price, B. (2000). Regression analysis by example. New York: Wiley.
Covariance As Signed Area Of Rectangles. http://www.davidchudzicki.com/posts/covariance-as-signed-area-of-rectangles/
How would you explain covariance to someone who understands only the mean?
The signed-area-of-rectangles explanations on Chudzicki’s site and Stack Exchange use a different covariance formulation from mine, but the concept is similar.
The full code I used to write up this tutorial is available on my GitHub.