What is correlation?

In math, correlation is a measure of dependence of one variable or measurement upon another (or others).

If when one part of some system is changed, another also changes in a predictable way, then we say that the two parts are correlated.

For example, in the graph below, the maximum heart rates of 47 men are plotted against their ages. The data show that as a man ages his maximum heart rate tends to drop.

The graph shows a strong correlation between maximum heart rate (MHR) and age. It's important to realize that the data doesn't prove that aging causes MHR to drop, only that in these subjects it does drop, and pretty convincingly as a function of age.



Beware post hoc ergo propter hoc!


This Latin phrase literally means "after this, therefore because of this." It's a warning about a common logical fallacy, confusion of cause and effect. It's important to know when things are correlated — that's important information. But correlation does not prove causation. Just because one thing is correlated with another doesn't mean one caused the other. We have to be careful about drawing that kind of conclusion. More evidence is usually needed.


Degrees of correlation

There are degrees of correlation of two variables, from "very highly correlated" to "uncorrelated". Three examples are shown in the plots below.

For now, our definition of "how correlated?" is a little soft, but pretty easy to understand. If you can squint your eyes and make out a pattern (a linear one, in this case), then there is probably some mathematical correlation.

The points in the plot on the left lie pretty close to an imaginary straight line that can be drawn through them. In the middle, the line is a little messy, but it's clearly there. On the right, it's really kind of a guess as to whether there is any relationship between x and y in those points; they are uncorrelated or only very weakly correlated.



Correlation can be negative or positive

In the plots above, as the value of the independent variable (x-axis) increases, the y-axis values of the points increase. We call that a positive correlation.

In the plots below, the trend goes the other way, and we say that these data are negatively correlated.

These terms become a little less important when we talk about non-linear correlations below, but in a great many cases we can reduce a complicated data analysis situation to one where we're looking for linear correlations.



Nonlinear correlations

Sometimes when we look at a plot of data there is an obvious nonlinear relationship. In other words, the plotted data have an obvious curved appearance.

That appearance might be polynomial — quadratic or cubic, say — exponential or logarithmic, or it might have the shape of some other curve. That curved shape might also be difficult to pin down or even accidental, so we have to be careful when we interpret such graphs and try to analyze them.

The graph below shows automobile stopping distance plotted against the speed of the vehicle before hitting the brakes. It's easy to see that a curved model would fit the data better than a linear one.

Looking at graphs of data can enlighten us about what kind of underlying relationship might exist. Likewise, having some foreknowledge of such a relationship from other sources might help, too. We have to balance all sources of information to do the best analysis we can.



The correlation coefficient

Later, after you've learned about probability distributions, you'll learn how to put a number to correlation. Phrases like "somewhat correlated" and "well correlated" aren't really what we're looking for when we're trying to make data-based decisions, so a number is what we want.

That number, at least for data that is linearly correlated, will be called the correlation coefficient. You might not really have the background to understand it right now, but you can understand how to interpret it using the sliding scale below.

The correlation coefficient, usually labeled R, has a range from -1 to +1. R = -1 means that the data is perfectly correlated and that the correlation is negative: all points lie exactly on a downward-sloping line. R = +1 means perfect positive correlation (all points lie exactly on an upward-sloping line), and R = 0 means no dependence of one variable on the other at all (at least according to the data in the set). R can take on any value between -1 and 1.



The mathematics of correlation

Most of the time we are interested in correlation between two data sets:

Does maximum heart rate change predictably with age?

Does the number of salmon surviving the downriver trip depend on the number of days of peak power generation at the dam?

Does understanding improve with study time?

... and so on. To determine a numerical value for the correlation coefficient in these cases, we rely on the Pearson correlation coefficient (usually we just call it the correlation coefficient).

The correlation coefficient between two data sets – we'll call them x and y, each containing N elements – is defined using the standard deviations of the x & y data sets and a new measure, the covariance between x and y.

The covariance, $\sigma_{x,y}$, is defined like this:

$$\sigma_{x,y} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$

You can see that it's a lot like the variance, which we write here two ways to better illustrate the point:

$$\sigma_x^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})$$

The covariance, unlike the variance, can be positive or negative. Notice that the variance is just a special case of the covariance — it's the covariance of a set of data with itself.
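Here is a minimal Python sketch of these two definitions (the x and y values are made up just for illustration); it shows that computing the covariance of x with x reproduces the variance of x:

```python
# A minimal sketch of the covariance definition (illustrative values only)

x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0]

def covariance(a, b):
    """Population covariance: the mean of the products of the deviations."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / n

print(covariance(x, y))   # covariance of x and y -- may be positive or negative
print(covariance(x, x))   # covariance of x with itself -- identical to the variance of x
```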

Now the correlation coefficient is the ratio of the covariance between the x and y data sets to the product of the standard deviations of the x and y data sets:

$$r = \frac{\sigma_{x,y}}{\sigma_x \, \sigma_y}$$



Now if you think about it for a while, you should see that |r| cannot be greater than 1. Because the covariance can be negative, the range of the correlation coefficient, r, is -1 ≤ r ≤ 1, or [-1, 1].
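One way to see why |r| can't exceed 1 is the Cauchy-Schwarz inequality (stated here without proof), which says that the sum of the products of the deviations can never be larger in size than the product of the square roots of the two sums of squared deviations:

$$\left| \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) \right| \;\le\; \sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

Dividing both sides by N (splitting it as $\sqrt{N} \cdot \sqrt{N}$ on the right) gives $|\sigma_{x,y}| \le \sigma_x \, \sigma_y$, so the ratio that defines r can never be larger than 1 in size.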

A positive correlation coefficient means that when x (which we presume to be the independent variable) increases, then y also increases. When r is negative (we call that a negative correlation), then when x increases, y tends to decrease.

When r = -1, a data set is perfectly negatively correlated. When r = +1, the set is perfectly positively correlated, and when r = 0 (the covariance = 0), there is no statistical relationship between x and y.


Here are some examples of made-up data: maximum heart rate vs. age for a group of men. The data were generated by adding some "noise" to numbers produced by the formula MHR = 220 - age, a well-known rule of thumb. The noise was added using a random number generator on a spreadsheet.

The data are negatively correlated (y decreases as x increases). They are well correlated on the left, somewhat less well-correlated in the center, and completely uncorrelated on the right.
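If you'd like to make similar data yourself, here is a rough Python sketch of the idea (the noise amplitude here is an arbitrary choice, not necessarily the one used for the plots above):

```python
# Generate noisy synthetic maximum-heart-rate data from the rule of thumb MHR = 220 - age
import random

random.seed(1)                 # fix the seed so the "random" data is reproducible
ages = list(range(20, 70))     # ages 20 through 69
noise = 8                      # scatter, in beats per minute -- an arbitrary choice

mhr = [220 - a + random.uniform(-noise, noise) for a in ages]

for a, m in zip(ages[:5], mhr[:5]):
    print(a, round(m, 1))      # print the first few (age, MHR) pairs
```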



Calculating r

There are a few ways that the correlation coefficient can be calculated. Many calculators have a statistics package that will let you calculate the correlation between two arrays of data automatically, and most spreadsheet programs are similarly equipped.

When y vs. x data are plotted in most spreadsheet programs, a linear least-squares fit can be performed automatically (usually by selecting something like "insert trendline" when a scatter (x-y) plot is selected).
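The same kind of least-squares fit can be done outside of a spreadsheet; this sketch assumes the NumPy library is installed and uses placeholder data:

```python
# Linear least-squares fit, analogous to a spreadsheet "trendline"
import numpy as np

x = np.array([20, 30, 40, 50, 60], dtype=float)        # e.g., ages (placeholder values)
y = np.array([201, 188, 182, 169, 161], dtype=float)   # e.g., maximum heart rates

slope, intercept = np.polyfit(x, y, 1)    # degree-1 (linear) least-squares fit
r = np.corrcoef(x, y)[0, 1]               # Pearson correlation coefficient

print(f"best-fit line: y = {slope:.2f} x + {intercept:.2f}")
print(f"r = {r:.3f}")
```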

You might also try calculating a correlation coefficient the long way using a spreadsheet program.

You'll want columns for the x- and y-data, from which you can calculate their averages, and then columns for the differences $(x_n - \bar{x})$ and $(y_n - \bar{y})$, their squares, and their products. Then it will just be a matter of plugging the sums of those columns into the formula for the correlation coefficient.

The table below shows just such a calculation using a spreadsheet. I've left some of the values out, but they are the data used on the left plot (r = -0.95) above. The key to such a calculation is organization.




A programming example

Here is an example of how to code something like a correlation coefficient calculation: a short Python program that computes the correlation coefficient the same way the spreadsheet calculation above does, with the data stored in the arrays x and y.
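A minimal version might look like the sketch below; the x and y arrays hold small placeholder values, so substitute your own data (for example, the values from your spreadsheet):

```python
# Compute the Pearson correlation coefficient "by hand"
import math

# age (x) and maximum heart rate (y) -- placeholder values; substitute your own data
x = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
y = [199, 196, 192, 183, 178, 172, 170, 166, 160, 154]

N = len(x)
x_bar = sum(x) / N
y_bar = sum(y) / N

# covariance of x and y, and the variances of x and y
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / N
var_x = sum((xi - x_bar) ** 2 for xi in x) / N
var_y = sum((yi - y_bar) ** 2 for yi in y) / N

# r is the covariance divided by the product of the standard deviations
r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

print(f"r = {r:.3f}")
```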



More complicated data sets

Often we work with data sets in which a dependent variable depends on more than one independent variable. Maybe it's a data set that models, say, a function of six variables,

$$y = f(x_1, x_2, x_3, x_4, x_5, x_6)$$

In such a case, we would want to get an idea about the independence of the variables x1 ... x6. To do that we can calculate (I won't say how here) a matrix of correlation coefficients. It might look like this one (which is made up).

Notice that x1 is perfectly correlated with x1, and so on, those coefficients lying along the diagonal (upper left to lower right) of our matrix.

In analyzing such a matrix, we're making sure that (1) the positive and negative coefficients are more or less randomly distributed, i.e. that there's no significant bias one way or the other.

And (2), we're flagging any large correlations. In this matrix, the correlations between x1 & x4 and x2 & x6 are 0.6 or above. While not terribly worrisome, it would be a good plan to look at those relationships more closely to make sure that those variables are actually as independent as we've assumed.
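A correlation matrix like this can be computed in a few lines with NumPy; in this sketch the data are random placeholders, so the off-diagonal values will be small, but the structure of the matrix is the same:

```python
# Correlation matrix for several variables (random placeholder data)
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 100))   # 100 observations of six variables x1 ... x6, one per row

R = np.corrcoef(data)              # 6 x 6 matrix of pairwise correlation coefficients

print(np.round(R, 2))              # the diagonal entries are all 1: each variable is
                                   # perfectly correlated with itself
```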



