As a starting point, let's remember that only two points in a plane are required to define a line in that plane. Our general equation for a line is something like
$$y = ax + b$$
where $a$ (often called $m$) is the slope of the line from left to right, and $b$ is the location at which it crosses the y-axis (the y-intercept).* The details of an equation of a line don't really matter, so long as the power of the independent variable, $x$, is one: $x^1 = x$.
Now often in data analysis, we have more than one data point to define some linear relationship between $x$ and $y$. Such a situation, illustrated in the plot below, is called an overdetermined system, one in which we have more data points than we need to define a line.
Our task on this page is to figure out how to determine the equation of the line that best runs through all of these points, or at least as close to all of them as possible. That will be our best-fit line, and we'll use it to make predictions from the data.
Having the equation of that best-fit line will come in handy. For example, say we have data for points $(x_1, y_1)$ and $(x_2, y_2)$. Let's say we wanted to find the y-value that goes with a point in the domain, $x_3$, where $x_1 \lt x_3 \lt x_2$. In this case we can just use the equation of our best fit line to find $y_3 = a x_3 + b$ by plugging in the desired value of $x_3$. We'll also want to estimate the error in that procedure, which we'll get to in due in time.
^{*}If you're familiar with functions and their transformations, the parameter $a$ is a vertical scaling parameter; it stretches or compresses the line vertically, which changes the slope. The parameter $b$ is the familiar vertical translation. Only these two parameters are needed to completely specify a line in a plane: where is it? $(b)$ and how tilted (or vertically stretched) is it $(a)$?
Linear regression or linear least-squares fitting is a method for finding a linear relationsip between the x and y values in a set of data consisting of ordered pairs (x, y). The general idea is that we have several data points (more than two, so that we have an overdetermined system) that look approximately like they form a line when plotted as $y$ vs. $x$. We use linear regression to find the equation of that line, usually in a form like
$$y = ax + b,$$
where $a$ is the slope and $b$ is the y-intercept. Once we have that equation, and possibly some statistics to help us evaluate its quality, we can use it to predict y-values in regions of the domain of the problem for which we have no actual data.
Let's imagine that we have a set of data, some ordered pairs in the form of $(x, y)$. We'd like to know if there is a relationship between $x$ and $y$, so we make a plot of the y's vs. the x's like the one below.
Recall that when we say "y vs. x", we mean that the y values are on the vertical axis ("ordinate") and the x values are on the horizontal axis (the "abcissa").
We can usually just look and see if there's some kind of correlation, but we can also quantify that by calculating the correlation coefficient. Here's the basic idea:
The problem we're trying to solve is two-fold:
Here's an illustration of a guess, but it's a terrible guess, as you can see:
Here is a slightly better guess, but it's still just a guess. The red line still doesn't pass as close as possible to all of the observed data points. Well, this is mathematics, so we don't want to just guess. Fortunately, there is a way to ensure the best-fitting line through such data each time.
The plot below shows the best-fit line. The dashed vertical lines are called residuals. A residual is the vertical distance between the observed y-value of a data point, $y_i$ (in this case the value of $y$ from the i^{th} ordered pair) and the corresponding y-value that would be calculated from the best-fit line, $\hat y_i$. The value of the i^{th} residual is
$$r_i = y_i - \hat y_i$$
where the hat on $\hat y_i$ indicates the y-value that comes from the model, our line.
The idea behind finding the best-fit line is to find the line that minimizes the sum of all of those distances. There is a problem, though. Imagine that we do a really poor job of selecting our line, like the one in the graph below. If we use our formula to calculate the seven residuals, labeled $-a, -b, \dots, b, c$, we see that for each value $a$ there is an opposite residual $-a$, and so on. In this case the sum of our residuals is
$$a + b + c + 0 - a - b - c = 0$$
That's a nice, small value of the sum of the residuals, which ought to point to a good fitting line, but this line is obviously a lousy fit to the data, which obviously forms a perfect line with a positive slope.
The problem is having those negative residual values. We could fix that by using absolute values, but the more common way is to square the residuals and find the minimum of that sum; thus we're always working with positive numbers. We seek to find
$$\text{min} \bigg( \sum_{i=1}^n (y_i - \hat y_i)^2 \bigg)$$
where $\text{min}$ means "find the minimum of", and the sum is from the first residual to the n^{th} one, where we have $n$ total ordered pairs. This is the essence of finding the best-fit line – to find the line that minimizes the sum of the squares of the residuals. This method is often called the method of least squares or least-squares fitting.
If the function representation for a best-fit line is $f(x)$, then the residual of the $i^{th}$ data point is
$$r_i = y_i - f(x_i)$$
The residual is the difference between the data y-value and the y-value calculated from the best-fit line. Positive residuals indicate that the line falls below the datum and negative residuals that the line falls above it.
In practice, the linear fitting problem has been solved and we don't need to reinvent it each time we use it. Many software packages, including specific statistics packages like minitab and R, spreadsheets like Excel and Numbers, and calculator routines like those available on the TI-84 calculator.
Let's do an example using the TI-84. Here are some data collected between 1988 and 2003 comparing median home price in a region with the interest rate on a 30-year mortgage.
Rate (%) |
Median price (\$) |
Rate (%) |
Median price (\$) |
---|---|---|---|
10.30 | 183,300 | 7.60 | 177,900 |
10.30 | 183,200 | 7.60 | 174,500 |
10.10 | 174,900 | 6.90 | 188,100 |
9.30 | 173,500 | 7.40 | 203,200 |
8.40 | 173,900 | 8.10 | 230,200 |
7.30 | 173,200 | 7.00 | 258,200 |
8.40 | 173,200 | 6.50 | 309,800 |
7.90 | 169,700 | 5.80 | 329,800 |
First, we'll enter the data sets into two lists, L_{1} and L_{2}:
To edit a list go to [STAT] and select EDIT. You can clear a whole list by arrowing up to its header (where it says L_{1}, for example) and hitting [CLEAR] then [ENTER]. Now we have our x-values in list 1 (L_{1}) an our y-values in L_{2}.
Now we're all set to perform a linear regression. Hit [STAT] and select CALC, then choose LinReg(ax+b), for "linear regression with result in the form $y = ax + b$."
To set up the regression, select the two lists containing the data. The six list buttons are the [2ND] functions of the lower number keys. It's also useful to fill in the Store RegEq field so that you can plot the fitted line. Hit [VARS] (for "variables") and select Y-VARS then Function then Y_{1} to store the equation of the fitted line in the function variable Y_{1}. The fitting program will enter that function into Y_{1} just as you would enter any other function for graphing. That way you can plot the best-fit line over your data scatterplot.
Finally, arrow down to CALCULATE and hit [ENTER]. Here are the results of the regression analysis. The slope of the fitted line is $a = -23,409$. The negative value indicates that the median home price drops by about \$23,000 for every 1-percent increase in interest rate. The correlation coefficient is $r = 0.62$, a moderately-strong correlation. The coefficient of determination is $r^2 = 0.38$, which suggests that only about 38% of the change in median home price is due to the interest rate change, according to this data.
Here is the plot. To get a plot like this, turn on one of the STAT-PLOTs and make sure to highlight the scatterplot option (the first option). To center the plot, select ZOOM, then the STAT option (or ZOOM + 9). That's usually a good first try at scaling statistics plots.
There is another linear regression algorithm that's just as easy to use, but that gives a little more output data, including the standard deviation of the residuals, which is useful for estimating the error of a predection made using the best-fit line.
A key way to judge the "goodness" of a linear fit is to create a residual plot, which is nothing more than a plot of residuals, $r_i$ v.s the domain values, $x_i$. Here is a typical residual plot for a good linear regression.
The red center line of the plot indicates a residual of zero. That is, the data point is right on the best-fit line. In this plot, the residuals have two features we're looking for in a good fit:
In the residual plot below, the residuals are still more-or-less randomly distributed, but they're larger. It's up to you to judge the scale of the deviations relative to your data, but if these two residual plots represent similar data, we'd conclude that the first reflects a better fit.
In the following residual plot we have a problem. There is clear bias toward positive residuals for large and small x, and bias toward negative residuals in between. In this case, the data may be telling us that it fits to a
After performing the example linear regression above, we can also plot the residuals of that fit. The trick is to replace L_{2} in the STATPLOT menu with the set called RESID, which is automatically stored any time a regression is done on the TI-84.
To insert the set named RESID (the residuals), punch [2ND]-[STAT], which is the LIST button, select NAMES from the top menu and scroll down to option 8, RESID.
Keep the STATPLOT option on the scatterplot icon (the first graph option) and hit [ZOOM]-STAT. You should get a nicely scaled residual plot like this one.
To see the residual x- and y- values, use [TRACE] and the arrow keys to shuffle through the data points. The x- and y-values will be displayed like in this screen image.
There is one more option on the TI-84 calculator for performing a regression analysis. It's called the linear regression T-test (LinRegTTest). Some of its features are beyond what you may have covered so far, so I'll skip those and just point out one that is relevant to this discussion.
To perform the test (on the same data set as the example above), hit [STAT] and select TESTS, then LinRegTTest. Enter lists L_{1} and L_{2} as before. Enter Y_{1} for the regression equation (optional), and hit CALCULATE.
I'll skip past some of the output and just focus on the line highlighted in red below, $s = 41456.5$. That's the standard error of the residuals, and a good estimate of the error associated with any prediction we would make using the equation of the best-fit line.
Here's the idea. Let's say we want to predict the average mortgage amount from an interest rate of 8.00%, a value that is not in our table above. Plugging that rate into our formula, we get
The
$$ \begin{align} y &= -23,409.45 x + 393,348.6 \\[5pt] &= -23,409.45(8.00) + 393,348.6 \\[5pt] &= -187,275.6 + 393,348.6 \\[5pt] &= \$206,073 \end{align}$$
Now that's our estimate, but there is some uncertainty. We'll report our value with the standard error of the residuals, which reflects the kinds of fluctuations we might see in our predictions: $\text{mortgage} = \$206,073 ± 41,456$. So we're confident that the mortgage value given such an interest rate would be between $\$206,073 - 41,456$ and $\$206,073 + 41,456$, or between $\$164,617$ and $\$247,529$.
The essential equations (the normal equations) for writing a computer program to calculate the equation of a linear best fit line – to perform a linear regression – are derived using a little bit of calculus. Ignore this section if that's not familiar to you, but it's included here for those who are curious.
Let's write the equation for the line of our regression model like this:
$$y = b + a x + \epsilon$$
The parameters are $b$, the y-intercept, $a$, the slope, and $\epsilon$, a random error term which we include here for completeness, but we'll omit it for the rest of this derivation. Thus we have
$$y = b + a x$$
Now for our overdetermined system, in which we have a set of $n$ y-values and $n$ x values – in ordered pairs, $(x_i, y_i)$, we can write the sum of the residuals using summations:
$$\sum (y_i - \hat y) = \sum [y_i - (\hat b + \hat a x_i)]$$
Now we're interested in finding the minimum of the sum of the squares of $y_i - \bar y$, so we'll square both sides:
$$\sum (y_i - \hat y)^2 = \sum [y_i - (\hat b + \hat a x_i)]^2$$
The $y_i$ are the observed values and the $\hat y_i$ are the predicted values.
Now we want to find the minimum of this square of differences, and we'll do that by finding the partial derivatives with respect to the parameters $b$ and $a$, and setting those equal to zero. First the derivatives.
$$ \begin{align} \frac{\partial}{\partial \hat b} &\sum (y_i - (\hat b + a x_i))^2 \\[5pt] &= \sum \frac{\partial}{\partial \hat b} (y_i - (\hat b + \hat a x_i))^2 \\[5pt] &= \sum 2 (y_i - (\hat b + \hat a x_i))(-1) \\[5pt] &= -2 \sum (y_i - (\hat b + \hat a x_i)) \end{align}$$
So our first derivative, set equal to zero, is
$$-2 \sum (y_i - (\hat b + \hat a x_i)) = 0 \tag{1}$$
The partial derivative with respect to $\hat a$ is found similarly:
$$ \begin{align} \frac{\partial}{\partial \hat a} &\sum \frac{\partial}{\partial \hat a} (y_i - (\hat b + \hat a x_i))^2 \\[5pt] &= \sum \frac{\partial}{\partial \hat a} (y_i - (\hat b + \hat a x_i))^2 \\[5pt] &= \sum 2(y_i - (\hat b + \hat a x_i))(-x_i) \\[5pt] &= -2 \sum x_i (y_i - (\hat b + \hat a x_i)) \end{align}$$
So our second derivative, set equal to zero, is
$$-2 \sum x_i (y_i - (\hat b + \hat a x_i)) = 0 \tag{2}$$
Now let's solve equation (1) for $\hat b$:
$$ \require{cancel} \begin{align} \cancel {-2} &\sum (y_i - (\hat b + \hat a x_i)) = 0 \\[5pt] \sum y_i &= \sum (\hat b + \hat \beta_i x_i) \\[5pt] \sum y_i &= \sum \hat b + \sum \hat a x_i \\[5pt] \sum y_i &= n \hat b + \hat a \sum x_i \end{align}$$
In the first step we divided both sides by $-2$, and in the last we used the fact that
$$\sum_{i=1}^n \hat b = n \hat b$$
So we have our expression for $\hat b$:
$$\hat b = \frac{\sum (y_i - \hat a x_i)}{n} \tag{3}$$
We can write that in a more informative way like this:
$$ \begin{align} \hat b &= \frac{\sum y_i}{n} - \frac{\hat \sum a x_i}{n} \\[5pt] \hat b &= \bar y - \hat a \bar x \tag{4} \end{align}$$
where the expressions $\sum y_i / n$ and $\sum x_i / n$ are just the averages over the x's and y's, $\bar x$ and $\bar y$. Now we'll use equation (2) to solve for $\hat a$, replacing $\hat b$ by equation (4) above (in
$$ \begin{align} \cancel{-2} \sum x_i (y_i - (\hat b + \hat a x_i)) &= 0 \\[5pt] \sum x_i( y_i - (\color{red}{\bar y - \hat a \bar x} + \hat a x_i)) &= 0 \\[5pt] \sum x_i (y_i - (\bar y - (\bar y - \hat a (\bar x - x_i)))) &= 0 \\[5pt] \sum x_i (y_i - \bar y - \hat a(x_i - \bar x)) &= 0 \\[5pt] \sum x_i (y_i - \bar y) - \sum x_i \hat a (x_i - \hat x) &= 0 \end{align}$$
Then rearrange that last expression to
$$ \begin{align} \sum x_i (y_i - \bar y) &= \sum x_i \hat a (x_i - \hat x) \\[5pt] \hat a &= \frac{\sum x_i (y_i - \bar y)}{\sum x_i (x_i - \bar x)} \end{align}$$
Now this last expression can be rewritten as
$$a = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}$$
That last step is because
$$ \begin{align} \sum &(x_i - \bar x)(y_i - \bar y) \\[5pt] &= \sum x_i (y_i - \bar y) - \color{red}{\sum \bar x (y_i - \bar y)} \\[5pt] &= \sum x_i (y_i - \bar y) - \color{red}{\bar x \sum (y_i - \bar y)} \end{align}$$
where the part in red is zero because the sum is zero.
So finally, our normal equations are
$$ \begin{align} \hat b &= \bar y - a \bar x \tag{5} \\[8pt] \hat a &= \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2} \tag{6} \end{align}$$
These equations form the basis of a simple computer program to determine the equation of the best fit line. We simply loop through the data to calculate averages $\bar x$ and $\bar y$, and then once again to calculate the other sums. It's not difficult at all.
In a certain city, a research firm interested in a bike-share program recorded the maximum "feels like" temperature (a mixture of temperature and humidity) and the number of bike-share users for that day. A random sampling of 27 results out of the over 17,000 days surveyed are given in the table below. Construct a scatterplot from these data and calculate the equation of the best-fit line. Is the correlation between users and temperature a good one? That is, how comfortable would you be in predicting the number of bike users given the temperature?
Temperature | # Riders | Temperature | # Riders |
---|---|---|---|
19.4 | 88 | 46.4 | 163 |
30.2 | 5 | 51.8 | 141 |
32.0 | 119 | 53.6 | 13 |
33.8 | 15 | 53.6 | 142 |
35.6 | 65 | 55.4 | 33 |
37.4 | 70 | 55.4 | 15 |
39.2 | 26 | 59.0 | 162 |
41.0 | 166 | 60.8 | 13 |
41.0 | 4 | 62.6 | 198 |
41.0 | 49 | 66.2 | 44 |
42.8 | 46 | 66.2 | 130 |
42.8 | 80 | 75.2 | 9 |
46.4 | 163 | 75.2 | 270 |
84.2 | 280 |
Let's fit this data on a TI-84 calculator. The first task is to enter the x- (temperature) and y-values (number of riders) data in two lists; I used L_{1} and L_{2}. Once that is done, set up a [STATPLOT] to plot L_{2} vs. L_{1} as a scatterplot.
Here is the resulting scatterplot:
Just at a glance, it looks correlated, but it's not a terrific correlation — probably in the neighborhood of r = 0.5. We have a couple of options for the linear regression, but let's do the full regression and t-test. If you haven't done t-testing or looked at P values yet, don't worry about it. The rest will make sense anyway. Go to STAT → TESTS and select LinRegTTest:
Here are the parameters: Lists L_{1} and L_{2} for the data. The "Freq: 1" line says that we don't have a frequency list — a list of probabilities or weights for each y-value. We'll store the equation of the regression line in equation Y_{1} for graphing later.
Here are the results. The regression line is
$$y = 2.1736 x - 15.6088$$
Scrolling down gives us the rest of the output.
The P value is 0.0152, which is small compared to an α level of 0.05 for a one or two-tailed test, so we reject our null hypothesis that this data is uncorrelated and conclude that there is a real correlation here.
The corellation coefficient is r = 0.417, so this comports with our visual inspection; it's a middling correlation — not that great. Here is the scatterplot again, including the fitted line.
The output parameter $s$ is the standard deviation of the residuals. It's the ± error we should report when making a prediction from the fitted data. Finally, here is the same graph and output calculated with a spreadsheet (Numbers).
The period of a pendulum ($T$, the time it takes to swing from one side to the other, then back) is related to the length of the string that connects the mass (the "bob") to the pivot point by the equation
$$T = 2 \pi \sqrt{\frac{L}{g}},$$
where $g$ is the acceleration of gravity and $L$ is the length of the string. Now we can rearrange that equation by squaring both sides and solving for $T^2$ to get
$$T^2 = \frac{4 \pi^2}{g} L$$
Now if we collect string lengths and periods for a pendulum on Earth and we plot $L$ vs. $T^2$, we find that the slope of the graph will be $4 \pi^2/g$, so that we can use the pendulum to measure the acceleration of gravity,
$$g = \frac{4 \pi^2}{\text{slope}}$$
Here is some data for a pendulum. Use it to compute the acceleration of gravity on Earth.
length (m) | period (s) |
---|---|
0.1 | 0.7160 |
0.2 | 0.9216 |
0.3 | 1.1738 |
0.4 | 1.1384 |
0.5 | 1.5752 |
0.6 | 1.7551 |
0.7 | 1.9112 |
0.8 | 1.8869 |
0.9 | 2.1362 |
1.0 | 2.0954 |
If we plot that data as $T^2$ vs. $L$, we get a graph like this:
The fitted equation is
$$T^2 = 4.7044 L + 0.051$$
We can then calculate the acceleration of gravity:
$$g = \frac{4 \pi^2}{4.7044} = 8.39 \, \frac{m}{s^2}$$
This is just an estimate of $g$, of course.
xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012-2024, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.