What is correlation?

In math, correlation is a measure of how strongly one variable or measurement depends on another (or on several others).

If one part of a system changes and another part changes along with it in a predictable way, then we say that the two parts are correlated.

For example, in the graph below, the maximum heart rates of 47 men are plotted against their ages. The data show that as a man ages his maximum heart rate tends to drop.

The graph shows a strong correlation between maximum heart rate (MHR) and age. It's important to realize that the data don't prove that aging causes MHR to drop, only that among these subjects MHR is lower at higher ages, and pretty convincingly so.



Beware post hoc ergo propter hoc!

This Latin phrase literally means "after this, therefore because of this." It's a warning about a common logical fallacy: assuming that because one thing follows another, the first must have caused the second. Knowing that two things are correlated is valuable information, but correlation does not prove causation. Just because one thing is correlated with another doesn't mean that one caused the other. We have to be careful about drawing that kind of conclusion. More evidence is usually needed.

Degrees of correlation

There are degrees of correlation of two variables, from "very highly correlated" to "uncorrelated". Three examples are shown in the plots below.

For now, our definition of "how correlated?" is a little soft, but it's pretty easy to understand. If you can squint your eyes and make out a pattern (linear, in this case), then there probably is some mathematical correlation.

The points in the plot on the left lie pretty close to an imaginary straight line that can be drawn through them. In the middle, the line is a little messy, but it's clearly there. On the right, it's really kind of a guess as to whether there is any relationship between x and y in those points; they are uncorrelated or only very weakly correlated.
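One way to make "pretty close to an imaginary straight line" a little more concrete is to find the best-fit line and measure how far the points stray from it. The Python sketch below does that with an ordinary least-squares fit. The function names and the data are made up for illustration; they are not the data in the plots.

```python
import math

def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def rms_scatter(xs, ys):
    """Root-mean-square vertical distance of the points from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Invented data: the same upward trend (y = 2x) with a small and a large wiggle.
xs = list(range(1, 11))
wiggle = [1, -1, 2, -2, 1, -1, 2, -2, 1, -1]
tight = [2 * x + 0.2 * w for x, w in zip(xs, wiggle)]
loose = [2 * x + 3.0 * w for x, w in zip(xs, wiggle)]

print("scatter about the line, tight data:", rms_scatter(xs, tight))
print("scatter about the line, loose data:", rms_scatter(xs, loose))
```

For data sets that follow the same underlying trend, the one with the smaller scatter about its line is the more strongly correlated of the two.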



Correlation can be negative or positive

In the plots above, as the value of the independent variable (x-axis) increases, the y-axis values of the points increase. We call that a positive correlation.

In the plots below, the trend goes the other way, and we say that these data are negatively correlated.
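The direction of a trend can also be read off numerically. A minimal sketch, assuming we judge direction by whether x and y tend to sit on the same side of their averages at the same time: multiply each point's x-deviation by its y-deviation and add the products up. The function names and the two small data sets are invented for illustration.

```python
def co_deviation_sum(xs, ys):
    """Sum of (x - mean of x) * (y - mean of y) over all points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def direction(xs, ys):
    """Label the trend by the sign of the co-deviation sum."""
    s = co_deviation_sum(xs, ys)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "none"

# Invented data: one set trends upward, the other downward.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]
falling = [9, 7, 8, 5, 4]

print(direction(xs, rising))
print(direction(xs, falling))
```

A positive sum means that larger x values tend to go with larger y values; a negative sum means that larger x values tend to go with smaller y values.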

These terms become a little less important when we talk about nonlinear correlations below, but in a great many cases we can reduce a complicated data analysis situation to one where we're looking for linear correlations.



Nonlinear correlations

Sometimes when we look at a plot of data there is an obvious nonlinear relationship. In other words, the plotted data have an obvious curved appearance.

That appearance might be polynomial (quadratic or cubic, say), exponential or logarithmic, or it might have the shape of some other curve. The curved shape might also be difficult to pin down, or even accidental, so we have to be careful when we interpret such graphs and try to analyze them.

The graph below shows automobile stopping distance plotted against the speed of the vehicle before hitting the brakes. It's easy to see that a curved model would fit the data better than a linear one.
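As a rough sketch of how we might check that impression numerically, the Python code below fits two deliberately simple one-parameter models, a straight line d = b*v and a parabola d = a*v^2, and compares how badly each one misses the points. The speeds and distances are invented for illustration (they are not the data in the graph), and forcing both models through the origin is an assumption made only to keep the arithmetic short.

```python
# Invented illustration data: speed before braking and stopping distance.
speeds = [10, 20, 30, 40, 50, 60, 70]
distances = [6, 18, 47, 79, 128, 177, 246]

def fit_through_origin(xs, ys, power):
    """Least-squares coefficient c for the model y = c * x**power."""
    numerator = sum((x ** power) * y for x, y in zip(xs, ys))
    denominator = sum((x ** power) ** 2 for x in xs)
    return numerator / denominator

def sum_squared_error(xs, ys, c, power):
    """Total squared miss of the model y = c * x**power."""
    return sum((y - c * x ** power) ** 2 for x, y in zip(xs, ys))

b = fit_through_origin(speeds, distances, 1)   # straight-line model
a = fit_through_origin(speeds, distances, 2)   # curved (quadratic) model

sse_line = sum_squared_error(speeds, distances, b, 1)
sse_curve = sum_squared_error(speeds, distances, a, 2)

print("squared error, straight line:", sse_line)
print("squared error, parabola:     ", sse_curve)
```

For these made-up numbers the parabola misses by far less than the straight line does, which is the numerical version of seeing the curve in the plot.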

Looking at graphs of data can enlighten us about what kind of underlying relationship might exist. Likewise, having some foreknowledge of such a relationship from other sources might help, too. We have to balance all sources of information to do the best analysis we can.



The correlation coefficient

Later, after you've learned about probability distributions, you'll learn how to put a number to correlation. Terms like "somewhat correlated" and "well correlated" aren't really what we're looking for when we're trying to make data-based decisions, so a number is what we want.

That number, at least for data that are linearly correlated, will be called the correlation coefficient. You might not have the background to understand how it's calculated right now, but you can learn how to interpret it using the sliding scale below.

The correlation coefficient, usually labeled R, has a range from -1 to +1. R = -1 means that the data are perfectly correlated and that the correlation is negative: all points lie exactly on a downward-sloping line. R = +1 means perfect positive correlation, with all points lying exactly on an upward-sloping line, and R = 0 means no linear dependence of one variable on the other at all (at least according to the data in the set). R can take on any value between -1 and +1.
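For readers who want to try the scale out before seeing the theory, here is a minimal Python sketch. It assumes that R is computed with the most commonly used formula (usually called the Pearson coefficient); the text above doesn't say which formula it has in mind, so treat that choice, the function name and the tiny data sets as illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of paired data (Pearson formula, assumed here).

    Fails with a division by zero if either variable never changes.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    co_sum = sum(a * b for a, b in zip(dx, dy))
    spread_x = sum(a * a for a in dx)
    spread_y = sum(b * b for b in dy)
    return co_sum / math.sqrt(spread_x * spread_y)

# Invented data sets sharing the same x values.
xs = [-2, -1, 0, 1, 2]
upward_line = [2 * x + 1 for x in xs]     # points exactly on a rising line
downward_line = [-2 * x + 1 for x in xs]  # points exactly on a falling line
parabola = [x * x for x in xs]            # a perfect curve, but not a line

print(pearson_r(xs, upward_line))    # at the +1 end of the scale
print(pearson_r(xs, downward_line))  # at the -1 end of the scale
print(pearson_r(xs, parabola))       # at the middle of the scale
```

The parabola is worth a second look: y depends completely on x, yet R comes out at the middle of the scale, because R only measures how well a straight line describes the data.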



xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.