
Correlation


What is correlation?


How do we know whether two variables are related? Correlation is a measure of dependence of one variable or measurement upon another (or others).

If, when one part of some system is changed, another also changes in a predictable way, then we say that the two parts are correlated.

For example, in the graph below, the maximum heart rates of 47 men aged 15-75 are plotted against their ages. The data show that as a man ages his maximum heart rate tends to drop in a reasonably predictable way.

The graph shows a correlation between maximum heart rate (MHR) and age. It's clear that while the data doesn't form a perfect line, it looks like it might "want to be" a line with a negative slope. That is, it sure seems like maximum heart rate decreases as men age, and that we might even be able to find a mathematical relationship between the two.

It's important to realize that the data doesn't prove that aging causes MHR to drop, only that in these subjects it does drop, and pretty convincingly, as a function of age.

It turns out that a well known rule of thumb for predicting maximum heart rate (for women or men) is

MHR = 220 - age,

which is a linear relationship.
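
If you want to play with that rule of thumb, here's a tiny Python sketch (the function name is my own, just for illustration):

def predicted_mhr(age):
    # Rule-of-thumb estimate of maximum heart rate, in beats per minute
    return 220 - age

print(predicted_mhr(40))    # a 40-year-old's predicted MHR: 220 - 40 = 180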

Beware post hoc ergo propter hoc!


This Latin phrase means "after this, therefore because of this." It's a warning about a common logical fallacy, confusion of cause and effect. It's important to know when things are correlated, but correlation does not prove causation. Just because one thing is correlated with another doesn't mean one caused the other – either way around. We have to be careful about drawing that kind of conclusion. More evidence is usually needed.

The spectrum of correlation


There is a spectrum of levels of correlation of two variables, from "very highly correlated" to "uncorrelated". Three examples are shown in the plots below.

For now, our definition of "how correlated?" is a little soft, but pretty easy to understand: if you can squint your eyes and make out a pattern in the plotted data (a linear one, in this case), then there probably is some mathematical correlation.

The points in the plot on the left lie pretty close to an imaginary straight line (green) that can be drawn through them. In the middle plot, the line is a little messy, but it's clearly there. On the right, it's really kind of a guess as to whether there is any relationship between x and y in those points; they are uncorrelated or at best only very weakly correlated.


Perfectly correlated data would lie on a line that we could define with a linear equation like $y = mx + b$, where m is the slope and b the y-intercept. This data isn't perfectly correlated, but it sure is suggestive of a linear relationship between the y and x data.

This data is a little less correlated, but if we squint, we can still see something of a linear relationship there. It's possible to "fit" a straight line to this data so that it comes as close to all of the points as possible.

This data might be correlated and it might not. It's difficult to tell. It would be hard to place a line through the data that made any sense.


This data is negatively correlated. As the independent (x) variable increases, the dependent variable (y) decreases. Negatively-correlated data can also be strongly or weakly correlated.

Positive correlation

As the independent variable increases, so does the dependent variable.


Positive correlation

Plot $y$ vs. $x$

If $x \uparrow,$ then $y \uparrow$.

Negative correlation

As the independent variable increases, the dependent variable decreases.


Negative correlation

Plot $y$ vs. $x$

If $x \uparrow,$ then $y \downarrow$.


Nonlinear correlations


Sometimes when we look at a plot of data there is an obvious nonlinear relationship — the plotted data have an obvious curved appearance.

That appearance might be polynomial (like quadratic or cubic), exponential or logarithmic in form, or it might have the shape of some other curve. That curved shape might also be difficult to pin down or it might even be accidental, so we have to be careful when we interpret such graphs and try to analyze them.

The graph below shows automobile stopping distance plotted against the speed of the vehicle before the brakes are engaged. It's easy to see that a curved model would fit the data better than a linear one. In fact, in this case we know from the laws of physics that the relationship is quadratic: stopping distance grows with the square of the speed.

Looking at graphs of data can enlighten us about what kind of underlying relationship might exist. Likewise, having some foreknowledge of such a relationship from other sources might help, too. We have to balance all sources of information to do the best analysis we can.
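
As a rough sketch of how a curved model might be fit, here's a short Python example using NumPy's polyfit (assuming NumPy is installed). The numbers are synthetic, generated from a quadratic rule plus random noise just so there is something to fit; only the fitting step is the point.

import numpy as np

# Synthetic, purely illustrative data: stopping distances made to grow with
# the square of the speed, plus a little random noise (not real measurements)
rng = np.random.default_rng(0)
speed = np.arange(10, 90, 10)
distance = 0.05 * speed**2 + rng.normal(0, 5, speed.size)

# Least-squares fit of a quadratic model, d = a*v^2 + b*v + c
a, b, c = np.polyfit(speed, distance, 2)
print("d = {:.3f} v^2 + {:.2f} v + {:.1f}".format(a, b, c))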



The correlation coefficient, $R$


Later, after you've learned about probability distributions, you'll learn how to put a number to correlation. Phrases like "somewhat correlated" and "well correlated" aren't really what we're looking for when we're trying to make data-based decisions, so a number is what we want.

That number, at least for data that is linearly correlated, will be called the correlation coefficient. You might not really have the background to understand it right now, but you can understand how to interpret it using the sliding scale below.

The correlation coefficient, usually labeled $R$, has a range from -1 to +1. $R = -1$ means that the data is perfectly correlated and that the correlation is negative. All points lie exactly on a downward-sloping line. $R = 1$ means perfect positive correlation, and $R = 0$ means no dependence of one variable on the other at all (at least according to the data in the set). $R$ can take on any value between -1 and 1.




The arithmetic of correlation


Most of the time we are interested in correlation between two data sets:

Does maximum heart rate change predictably with age?

Does the number of salmon surviving the downriver trip depend on the number of days of peak power generation at the dam?

Does understanding improve with study time?

... and so on. To determine a numerical value for the correlation coefficient in these cases, we rely on the Pearson correlation coefficient (usually we just call it the correlation coefficient).

The correlation coefficient between two data sets – we'll call them x and y, each containing N elements – is defined using the standard deviations of the x & y data sets, and a new measure, the covariance between x and y.

The covariance, $\sigma_{xy}$, is defined like this:

$$\sigma_{xy} = \sum_{i = 1}^N \, (x_i - \bar{x})(y_i - \bar{y})$$

You can see that it's a lot like the variance, which we write here two ways to better illustrate the point:

$$ \begin{align} \sigma_x^2 &= \sum_{i = 1}^N \, (x_i - \bar{x})^2 \\[5pt] &= \sum_{i = 1}^N \, (x_i - \bar{x})(x_i - \bar{x}) \end{align}$$

The covariance, unlike the variance, can be positive or negative. Notice that the variance is just a special case of the covariance — it's the covariance of a set of data with itself.
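
Here's a small Python sketch (the function name is my own) that computes the covariance exactly as written above, and shows that the variance really is just the covariance of a data set with itself:

def covariance_sum(x, y):
    # Sum of (x_i - xbar)(y_i - ybar), matching the formula above (no 1/N factor)
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

data = [2, 4, 4, 4, 5, 5, 7, 9]       # illustrative numbers
print(covariance_sum(data, data))     # 32.0, the sum of squared deviations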

Now the correlation coefficient is the ratio of the covariance between the x and y data sets to the product of the standard deviations of the x and y data sets:

$$R = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y} = \frac{\sum_{i = 1}^N \, (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^N \, (x_i - \bar{x})^2} \, \sqrt{\sum_{i = 1}^N \, (y_i - \bar{y})^2}}$$

Now if you think about it for a while, you should see that $|R|$ cannot be greater than 1. Because the covariance can be negative, the range of the correlation coefficient, $R$, is $-1 \le R \le 1$, or $R \in [-1, 1]$.
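
One way to see why is the Cauchy-Schwarz inequality, which guarantees that the covariance sum can never be larger in magnitude than the product of the two square-root terms in the denominator of $R$:

$$\left| \sum_{i = 1}^N \, (x_i - \bar{x})(y_i - \bar{y}) \right| \le \sqrt{\sum_{i = 1}^N \, (x_i - \bar{x})^2} \, \sqrt{\sum_{i = 1}^N \, (y_i - \bar{y})^2}$$

so the numerator of $R$ can never exceed the denominator in absolute value.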

A positive correlation coefficient means that when x (which we presume to be the independent variable) increases, y tends to increase as well. When $R$ is negative (we call that a negative correlation), then when x increases, y tends to decrease.

When $R = -1$, a data set is perfectly negatively correlated. When $R = +1$, the set is perfectly positively correlated, and when $R = 0$ (the covariance = 0), there is no statistical relationship between x and y.


Here are some examples of data (made-up maximum heart rate vs. age data for a group of men). The data were generated by adding some "noise" to values from the formula MHR = 220 - age, the well-known rule of thumb mentioned above. The noise was added using a random number generator in a spreadsheet.

The data are negatively correlated (y decreases as x increases). They are well correlated on the left, somewhat less well-correlated in the center, and completely uncorrelated on the right.
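
If you'd like to make similar practice data yourself, here's one way to do it in Python (with NumPy) instead of a spreadsheet. The amount of noise is arbitrary; turn it up, or replace the rule-of-thumb values with pure random numbers, and the correlation fades toward zero.

import numpy as np

rng = np.random.default_rng()
ages = rng.integers(20, 81, size=40)        # 40 random ages from 20 to 80
noise = rng.normal(0, 6, size=ages.size)    # arbitrary amount of noise
mhr = 220 - ages + noise                    # rule of thumb plus noise

print(ages[:5], mhr[:5].round(1))           # peek at the first few pairs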



Calculating $R$

There are a few ways that the correlation coefficient can be calculated. Many calculators have a statistics package that will allow you to automatically calculate the correlation between two arrays of data. Likewise, most spreadsheet programs are also so-equipped.

When y vs. x data are plotted in most spreadsheet programs, a linear least-squares fit can be performed automatically (usually by selecting something like "insert trend line" when a scatter (x-y) plot is selected).
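
The same automatic route is available in Python, if NumPy is installed. In this quick sketch the x and y values are made up, standing in for your two data columns; np.corrcoef returns a 2 x 2 matrix whose off-diagonal entry is $R$, and np.polyfit gives the least-squares trend line.

import numpy as np

x = np.array([20, 30, 40, 50, 60, 70])           # illustrative ages
y = np.array([199, 192, 178, 171, 162, 148])     # illustrative max heart rates

R = np.corrcoef(x, y)[0, 1]    # correlation coefficient from the 2x2 matrix
m, b = np.polyfit(x, y, 1)     # least-squares trend line: slope and intercept

print("R =", round(R, 3), " trend line: y =", round(m, 2), "x +", round(b, 1))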

You might also try calculating a correlation coefficient the long way using a spreadsheet program.

You'll want columns for the x- and y-data, from which you can calculate their averages, and then columns for the differences $(x_n - \bar x)$ and $(y_n - \bar y)$, their squares, and their product $(x_n - \bar x)(y_n - \bar y)$. Then it will just be a matter of plugging the sums of those columns into the formula for the correlation coefficient.

The table below shows just such a calculation using a spreadsheet. I've left some of the values out, but they are the data used on the left plot (r = -0.95) above. The key to such a calculation is organization.




$R$ and $R^2$


There is a difference in the interpretation of the correlation coefficient $R$ and its square, $R^2$, often referred to as the coefficient of determination. $R$ is the correlation coefficient: its absolute value, on a scale from 0 to 1 (or 0% to 100%), tells us how strongly two variables are correlated, and its sign tells us the direction of the relationship.

$R^2$ is the fraction of the variation in the y variable that can be accounted for by variation in the x variable. For example, if $R^2 = 0.75$, then 75% of the variation in the y-variable can be attributed to variation in the x-variable. It's handy to have such a qualification on our data because a low $R^2$ value can hint that something other than x is also contributing to the change in y.

$R$: The correlation coefficient is between -1 and 1 and gives the degree of correlation between two variables. 0 means uncorrelated; ±1 means perfect (positive or negative) correlation.

$R^2$: The coefficient of determination tells us what fraction of the variation in the y variable is accounted for by variation in the x variable.
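
As a quick illustration with a made-up value of $R$:

R = -0.87           # say, a correlation coefficient from some data set
R_squared = R**2    # 0.7569, or about 76%

# About 76% of the variation in y is associated with variation in x;
# the sign of R (negative here) tells us the direction of the relationship.
print(R_squared)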


A programming example

Here is an example of how to code a correlation coefficient calculation. This is a simple Python program that calculates the correlation coefficient for the same data as the spreadsheet example above. The data are in the lists x and y.

#!/usr/bin/env python3
# program: correlation.py for xaktly.com/Correlation.html

__author__ = 'Your name'

import math   # Needed for the square-root function

# Ages (x) and maximum heart rates (y); x[i] pairs with y[i]
x = [20,21,23,28,29,29,31,33,33,34,36,40,40,41,42,43,44,
     44,44,45,47,48,50,51,52,55,58,61,62,66,67,67,72,74,77,80]

y = [204,189,200,191,198,184,188,188,189,195,178,186,178,184,172,
     171,183,174,170,181,169,175,165,164,173,174,176,163,156,155,
     159,149,160,156,142,139,149,134]

nData = min(len(x), len(y))   # Number of (x, y) pairs used in the calculation

xsum  = 0.0       # sum of all x[i]
ysum  = 0.0       # sum of all y[i]
x2sum = 0.0       # sum of all (x[i] - xavg)^2
y2sum = 0.0       # sum of all (y[i] - yavg)^2
xysum = 0.0       # sum of (x[i] - xavg)*(y[i] - yavg)

# First pass: sums, then averages
for i in range(nData):
    xsum += x[i]
    ysum += y[i]

xavg = xsum / nData   # average value of the x's
yavg = ysum / nData   # average value of the y's

# Second pass: sums of squared deviations and of their products
for i in range(nData):
    x2sum += (x[i] - xavg) * (x[i] - xavg)
    y2sum += (y[i] - yavg) * (y[i] - yavg)
    xysum += (x[i] - xavg) * (y[i] - yavg)

# calculate r, remembering to take the square roots:
r = xysum / (math.sqrt(x2sum) * math.sqrt(y2sum))

print('correlation coefficient: r =', r)
      

More complicated data sets


Often we work with data sets in which a dependent variable depends on more than one independent variable. Maybe it's a data set that models, say, a function of six variables,

$$y = f(x_1, x_2, x_3, x_4, x_5, x_6)$$

In such a case, we would want to get an idea about the independence of the variables $x_1 \ldots x_6$. To do that we can calculate (I won't say how here) a matrix of correlation coefficients, one for each pair of variables. It might look like this one (which is made up).

Notice that $x_1$ is perfectly correlated with itself, and so on; those coefficients lie along the diagonal (upper left to lower right) of our matrix.

In analyzing such a matrix, we're making sure that (1) the positive and negative coefficients are more or less randomly distributed, i.e. that there's no significant bias one way or the other.

And (2), we're flagging any large correlations. In this matrix, the correlations between $x_1$ & $x_4$ and $x_2$ & $x_6$ are 0.6 or above. While not terribly worrisome, it would be a good plan to look at those relationships more closely to make sure that those variables are actually as independent as we've assumed.
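
Here's a minimal sketch of how such a matrix could be computed in Python with NumPy, assuming the measurements are arranged with one row per observation and one column per variable. The random stand-in data here are independent by construction, so the flagged list will usually come up empty; real measurements would go where the random numbers are.

import numpy as np

# Stand-in data: 100 observations of 6 variables x1 ... x6, one per column
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 6))

# rowvar=False tells corrcoef that the columns (not the rows) are the variables
corr_matrix = np.corrcoef(data, rowvar=False)    # 6 x 6, with 1's on the diagonal

# Flag any pair of distinct variables whose correlation is 0.6 or more in magnitude
n = corr_matrix.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if abs(corr_matrix[i, j]) >= 0.6:
            print("x{} and x{}: R = {:.2f}".format(i + 1, j + 1, corr_matrix[i, j]))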



xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012-2019, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.