# Calculating Z-Scores [with R code]

I’ve included the full R code and the data set can be found on UCLA’s Stats Wiki

Normal distributions are convenient because they can be scaled to any mean or standard deviation meaning you can use the exact same distribution for weight, height, blood pressure, white-noise errors, etc. Obviously, the means and standard deviations of these measurements should all be completely different. In order to get the distributions standardized, the measurements can be changed into z-scores.

Z-scores are a stand-in for the actual measurement, and they represent the distance of a value from the mean measured in standard deviations. So a z-score of 2.0 means the measurement is 2 standard deviations away from the mean.

To demonstrate how this is calculated and used, I found a height and weight data set on UCLA’s site. They have height measurements from children from Hong Kong. Unfortunately, the site doesn’t give much detail about the data, but it is an excellent example of normal distribution as you can see in the graph below. The red line represents the theoretical normal distribution, while the blue area chart reflects a kernel density estimation of the data set obtained from UCLA. The data set doesn’t deviate much from the theoretical distribution.

The z-scores are also listed on this normal distribution to show how the actual measurements of height correspond to the z-scores, since the z-scores are simple arithmetic transformations of the actual measurements. The first step to find the z-score is to find the population mean and standard deviation. It should be noted that the sd function in R uses the sample standard deviation and not the population standard deviation, though with 25,000 samples the different is rather small.

Using just the population mean [μ = 67.99] and standard deviation [σ = 1.90], you can calculate the z-score for any given value of x. In this example I’ll use 72 for x.

$latex z = \frac{x – \mu}{\sigma} &s=2$

This gives you a z-score of 2.107. To put this tool to use, let’s use the z-score to find the probability of finding someone who is 72 inches [6-foot] tall. [Remember this data set doesn’t apply to adults in the US, so these results might conflict with everyday experience.] The z-score will be used to determine the area [probability] underneath the distribution curve past the z-score value that we are interested in.
[One note is that you have to specify a range (72 to infinity) and not a single value (72). If you wanted to find people who are exactly 6-foot, not taller than 6-foot, you would have to specify the range of 71.5 to 72.5 inches. This is another problem, but this has everything to do with definite integrals intervals if you are familiar with Calc I.]

The above graph shows the area we intend to calculate. The blue area is our target, since it represents the probability of finding someone taller than 6-foot. The yellow area represents the rest of the population or everyone is is under 6-feet tall. The z-score and actual height measurements are both given underscoring the relationship between the two.

Typically in an introductory stats class, you’d use the z-score and look it up in a table and find the probability that way. R has a function ‘pnorm’ which will give you a more precise answer than a table in a book. [‘pnorm’ stands for “probability normal distribution”.] Both R and typical z-score tables will return the area under the curve from -infinity to value on the graph this is represented by the yellow area. In this particular problem, we want to find the blue area. The solution to this is an easy arithmetic function. The area under the curve is 1, so by subtracting the yellow area from 1 will give you the area [probability] for the blue area.

Yellow Area:

Blue Area [TARGET]:

Both of these techniques in R will yield the same answer of 1.76%. I used both methods, to show that R has some versatility that traditional statistics tables don’t have. I personally find statistics tables antiquated, since we have better ways to determine it, and the table doesn’t help provide any insight over software solutions.

Z-scores are useful when relating different measurement distributions to each acting as a ‘common denominator’. The z-scores are used extensively for determining area underneath the curve when using text book tables, and also can be easily used in programs such as R. Some statistical hypothesis tests are based on z-scores and the basic principles of finding the area beyond some value.

# OLS Derivation

Ordinary Least Squares (OLS) is a great low computing power way to obtain estimates for coefficients in a linear regression model. I wanted to detail the derivation of the solution since it can be confusing for anyone not familiar with matrix calculus.

First, the initial matrix equation is setup below. With X being a matrix of the data’s p covariates plus the regression constant. [This will be represented as a column of ones if you were to look at the data in the X matrix.] Y is the column matrix of the target variable and β is the column matrix of unknown coefficients. e is a column matrix of the residuals.

$latex \mathbf{Y = X} \boldsymbol{\beta} + \boldsymbol{e} &s=1$

Before manipulating the equation it is important to note you are not solving for X or Y, but instead β and will do this by minimizing the sum of squares for the residuals (SSE). So the equation can be rewritten by moving the error term to the left side of the equation.

$latex \boldsymbol{e} = \mathbf{Y – X} \boldsymbol{\beta}&s=1$

The SSE can be written as the product of the transposed residual column vector and its original column vector. [This is actually how you would obtain the sum of squares for any vector.]

$latex \mathrm{SSE} = \boldsymbol{e}’\boldsymbol{e} &s=1$

Since you transpose and multiply one side of the equation, you have to follow suit on the other side. Yielding

$latex \boldsymbol{e’e} = (\mathbf{Y – X} \boldsymbol{\beta})'(\mathbf{Y – X} \boldsymbol{\beta})&s=1$

The transpose operator can be distributed through out the quantity on the right side, so the right side can be multiplied out.

$latex \boldsymbol{e’e} = (\mathbf{Y’ – \boldsymbol{\beta}’X’})(\mathbf{Y – X} \boldsymbol{\beta})&s=1$

Using the rule that A’X = X’A, you can multiple out the right side and simplify it.

$latex \boldsymbol{e’e} = (\mathbf{Y’Y – Y’X\boldsymbol{\beta} – \boldsymbol{\beta}’X’Y} + \boldsymbol{\beta’\mathbf{X’X}\beta})&s=1$
$latex \boldsymbol{e’e} = (\mathbf{Y’Y – \boldsymbol{\beta}’X’Y – \boldsymbol{\beta}’X’Y} + \boldsymbol{\beta’\mathbf{X’X}\beta})&s=1$
$latex \boldsymbol{e’e} = (\mathbf{Y’Y – 2\boldsymbol{\beta}’X’Y} + \boldsymbol{\beta’\mathbf{X’X}\beta})&s=1$

To minimize the SSE, you have to take the partial derivative relative to β. Any terms without a β term in them will go to zero. Using the transpose rule from before you can see how the middle term yields -2X’Y using differentiation rules from Calc1. The last term is a bit tricky, but it derives to +2X’Xβ.

$latex \frac{\delta\boldsymbol{e’e}}{\delta\boldsymbol{\beta}} = \frac{\delta\mathbf{Y’Y}}{\delta\boldsymbol{\beta}} – \frac{2\boldsymbol{\beta}’\mathbf{X’Y}}{\delta\boldsymbol{\beta}} + \frac{\delta\boldsymbol{\beta’\mathbf{X’X}\beta}}{\delta\boldsymbol{\beta}}&s=1$

$latex \frac{\delta\boldsymbol{e’e}}{\delta\boldsymbol{\beta}} = – 2\mathbf{X’Y} + 2\boldsymbol{\mathbf{X’X}\beta}&s=1$

To find the minimum (it will never be a maximum if you have all the requirements for OLS fulfilled), the derivative of the SSE is set to zero.

$latex 0 = – 2\mathbf{X’Y} + 2\mathbf{X’X}\boldsymbol{\beta}&s=1$

$latex 0 = \mathbf{- X’Y} + \mathbf{X’X}\boldsymbol{\beta}&s=1$

Using some basic linear algebra and multiplying both sides by the inverse of (X’X)…

$latex (\mathbf{X’X})^{-1}\mathbf{X’X}\boldsymbol{\beta} = (\mathbf{X’X})^{-1}\mathbf{X’Y}&s=1$

…yields the solution for β

$latex \boldsymbol{\beta} = (\mathbf{X’X})^{-1}\mathbf{X’Y}&s=1$

References:

Chatterjee, S & Hadi, A. (2012). Regression analysis by example. Hoboken, NJ: John Wiley & Sons, Inc.

# Count Data Distribution Primer — Binomial / Negative Binomial / Poisson

Count data is exclusively whole number data where each increment represents one of something. It could be a car accident, a run in baseball, or an insurance claim. The critical thing here is that these are discrete, distinct items. Count data behaves differently than continuous data, and the distribution [frequency of of different values] is different between the two. Random continuous data typically follows the normal distribution, which is the bell curve everyone remembers from high school grade systems. [Which is a really bad way to grade, but I digress.] Count data generally follows the Binomial/Negative Binomial/Poisson distribution depending what context you are viewing the data; all three distributions are mathematically related.

Binomial Distribution:

The binomial distribution (BD) is the collection of probabilities of getting a certain number of successes in a given number of trials specifically measuring Bernoulli trials [a yes/no event similar to a coin flip, but it’s not necessarily 50/50]. My favorite example to understand the binomial distribution is using it to determine the probability that you’d get exactly 5 HEADS if you flipped a coin 10 times [it’s NOT 50%!].

It’s actually 24.61%. The probability of getting heads in any given coin flip is 50%, but over 10 flips, you’ll only get exactly 5 HEADS and 5 TAILS about 25% of the time. The equation below gives the two popular notations for the binomial probability mass function. $latex n&s=1$ is total number of trials. [the graph above used n=10]. $latex r&s=1$ is the number of successes you want to know the probability for. You calculate this function for each number of HEADS [0-10] for $latex r&s=1$ to get the distribution above. $latex p&s=1$ is the simple probability for each event. [$latex p&s=1$ = .5 for the coin flip.]

$latex P(X=r) = {{n}\choose{r}} p^{r} (1-p)^{n-r} = \frac{n!}{r!(n-r)!} p^{r} (1-p)^{n-r}&s=2$

The equation has three parts. The first part is the combination $latex {{n}\choose{r}} &s=1$, which is the number of combinations when you have $latex n&s=1$ total items taken $latex r&s=1$ at a time. Combination disregard order, so the set {1, 4, 9} is the same as {4, 9, 1}. This part of the equation tells you how many possible ways there are to get to a certain outcome since there are many way to get 5 HEADS in 10 tosses. Since $latex {{10}\choose{5}}&s=1$ is larger than any other combination, 5 HEADS will have the largest probability.

There are two more terms in the equation. $latex p^r&s=1$ is joint probability of getting r successes in a particular order, and $latex (1-p)^{n-r}&s=1$ is the corresponding probably of also getting the failures also in a particular order. I find it helpful to conceptualize the equation as having three parts accounting for different things: total combinations of successes and failures, the probabilities of successes, and the probability of failures.

Negative Binomial Distribution:

While there is a good reason for it, the name of the negative binomial distribution (NBD) is confusing. Nothing I will present will involve making anything negative so, let’s just get that out of the way and ignore it. The binomial distribution uses the probability of successes in the total number of ATTEMPTS. To contrast this, the negative binomial distribution uses the probability that a certain number of FAILURES occur before the $latex r&s=1$th SUCCESS. This has many applications specifically when a sequence terminates after the $latex r&s=1$th success such as modeling the probability that you will sell out of the 25 cups of lemonade you have stocked for a given number of cars that pass by. The idea is that you would pack up your lemonade stand after you sell out, so cars that would pass by after the final success won’t matter. Another good example is modeling the win probability of a 7-game sports playoff series. The team that wins the series must win 4 games and specifically the last game played in the series, since the playoff series terminates after one team reaches 4 wins.

One of the more important restrictions on the NBD is that the last event must be a success. Going back to the sports playoff series example, the team that wins the series will NEVER lose the last game. With the 10 coin-flip example, the BD was looking for the probability of getting a certain number of HEADS within a set number of coin flips. Using the NBD, we will look for the probability of 5 HEADS before getting a certain number of TAILS. The total number of flips will not ALWAYS equal 10 and actually exceeds 10 as seen below.

The probability mass function that describes the NBD graph above is given below:

$latex P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}&s=2$

The equation for the NBD has the same parts as the BD: the combinations, the success, and the failures. In the NBD case the combinations are less than the BD [for the same total number of coin flips]. This is because the last outcome is held fix at a success. The probability of success and failure parts of the equation are conceptually the same as the BD. The failure portion is written differently because the number of failures is a parameter $latex k&s=1$ instead of a derived quantity like [$latex n-r&s=1$].

Poisson Distribution:

The Poisson Distribution (PD) is directly related to both the BD and the NBD, because it is the limiting case of both of them. As the number of trials goes to infinity, then the Poisson distribution emerges. The graph for the PD will look similar to the NBD or the BD, and there is no example comparing the coin flip since there has to be some non-discrete process like traffic flow or earthquakes. The major difference is not what is represented, but how it is viewed and calculated. The Poisson distribution is described by the equation:

$latex P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}&s=2$

$latex \lambda&s=1$ is the expected value [or the mean] for an event and $latex x&s=1$ is the count value. If you knew an average of 0.2 car crashes happen at an intersection at a given day then you could solve the equation for $latex x&s=1$ = {0, 1, 2, 3, 4, 5, … } and get the PD for the problem.

One of the restrictions and major issues with the use of the PD is that the model assumes the mean and the variance are equal. In most real data instances the variance is greater than the mean, so the PD tends to favor more values around the expected value than real data reflects.

If you are interested in the derivations and math behind these I recommend this site: http://statisticalmodeling.wordpress.com/. I feel like they explain the derivation of the negative binomial better than most places I’ve found. It addresses why it’s called the NEGATIVE binomial distribution as well. The site also contains derivations of the PD being the limiting case of the BD and NBD.