Making a Correlation Matrix in R

This tutorial is a continuation of making a covariance matrix in R. These tutorials walk you through the matrix algebra needed to create the matrices, so you can better understand what is going on under the hood in R. There are built-in functions within R that make this process much quicker and easier.

The correlation matrix is rather popular for exploratory data analysis, because it can quickly show you the correlations between variables in your data set. From a practical application standpoint, this entire post is unnecessary, because R’s built-in cor() function returns the correlation matrix directly. The point here is to show how to derive it using matrix algebra in R.

First, the starting point will be the covariance matrix that was computed from the last post.

{\bf C } =   \begin{bmatrix}  V_a\ & C_{a,b}\ & C_{a,c}\ & C_{a,d}\ & C_{a,e} \\  C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\  C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\  C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\  C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e  \end{bmatrix}

This matrix has all the information that’s needed to get the correlations for all the variables and create a correlation matrix [V is a variance, C is a covariance]. We are using the Pearson version of correlation, which is calculated from the covariance between two vectors and their standard deviations [s, the square root of the variance]:

cor(X, Y) = \frac{cov(X,Y)}{s_{X}s_{Y}}
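As a quick check of the formula for a single pair of variables, here is a minimal sketch. The data from the previous post is not shown here, so two columns of R's built-in mtcars data set are used as a stand-in (an assumption for illustration only).

```r
# Stand-in data (assumption): two columns from the built-in mtcars data set
x <- mtcars$mpg
y <- mtcars$wt

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in result for comparison
r_builtin <- cor(x, y)

# Compare the two values up to floating point tolerance
all.equal(r_manual, r_builtin)
```

The last line should report that the hand-computed value and the built-in value agree.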

The trick will be using matrix algebra to easily carry out these calculations. The variance components are all on the diagonal of the covariance matrix, so in matrix algebra notation we want to use this:

{\bf V} = diag({\bf C}) = \begin{bmatrix}  V_a\ & 0\ & 0\ & 0\ & 0 \\  0 & V_b & 0 & 0 & 0 \\  0 & 0 & V_c & 0 & 0 \\  0 & 0 & 0 & V_d & 0 \\  0 & 0 & 0 & 0 & V_e  \end{bmatrix}

R doesn’t quite work the same way as matrix algebra notation: the diag() function creates a vector from a matrix and a matrix from a vector, so it’s used twice to create the diagonal matrix. The first call extracts a vector of the variances, and the second call turns a vector back into a diagonal matrix like the one above. Since the standard deviations are needed, the square root of the variances is taken. The standard deviations are then inverted to facilitate division, so the diagonal matrix S used below has 1/s for each variable on its diagonal and zeros everywhere else.
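The two uses of diag() might look like the sketch below. The data is a stand-in (five columns of the built-in mtcars data set, an assumption since the previous post's data is not shown here), and cov() is used for brevity in place of the hand-built covariance matrix.

```r
# Stand-in data (assumption): five numeric columns from the built-in mtcars data set
X <- as.matrix(mtcars[, 1:5])
C <- cov(X)              # covariance matrix; built-in used here for brevity

v <- diag(C)             # first use: matrix -> vector of variances
V <- diag(v)             # second use: vector -> diagonal matrix of variances

# Square root gives the standard deviations; 1 / ... inverts them
S <- diag(1 / sqrt(v))   # diagonal matrix of inverse standard deviations
```

Note that the square root and the inversion are applied to the vector before the second diag() call, which avoids dividing by the zeros off the diagonal.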

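After getting the diagonal matrix of inverse standard deviations, S, basic matrix multiplication is used to make every term in the covariance matrix reflect the basic correlation formula from above. Multiplying by S on the left divides each row by that variable’s standard deviation, and multiplying by S on the right does the same for each column.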

{\bf R } = {\bf S} \times {\bf C} \times {\bf S}
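In R, matrix multiplication is written with the %*% operator (a plain * multiplies element by element). A minimal sketch of the product, again using five columns of the built-in mtcars data set as stand-in data (an assumption), with a comparison against the built-in cor():

```r
# Stand-in data (assumption): five numeric columns from the built-in mtcars data set
X <- as.matrix(mtcars[, 1:5])
C <- cov(X)                    # covariance matrix
S <- diag(1 / sqrt(diag(C)))   # diagonal matrix of inverse standard deviations

# Correlation matrix via matrix multiplication
R <- S %*% C %*% S

# S carries no row or column names, so copy them over from C
dimnames(R) <- dimnames(C)

# Compare with the built-in function
all.equal(R, cor(X))
```

The comparison should report that the two matrices agree. The stats package also provides cov2cor(), which performs the same scaling of a covariance matrix in one call.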

And the correlation matrix is symbolically represented as:

{\bf R } =   \begin{bmatrix}  r_{a,a}\ & r_{a,b}\ & r_{a,c}\ & r_{a,d}\ & r_{a,e} \\  r_{a,b} & r_{b,b} & r_{b,c} & r_{b,d} & r_{b,e} \\  r_{a,c} & r_{b,c} & r_{c,c} & r_{c,d} & r_{c,e} \\  r_{a,d} & r_{b,d} & r_{c,d} & r_{d,d} & r_{d,e} \\  r_{a,e} & r_{b,e} & r_{c,e} & r_{d,e} & r_{e,e}  \end{bmatrix}

The diagonal entries, where the variances were in the covariance matrix, are now all 1, since a variable’s correlation with itself is always 1.