Stem plots (sometimes "stemplots") or stem-and-leaf plots are a quick way of sorting and organizing data, and a quick way of visualizing the shape of a distribution — or of comparing two distributions. We'll learn them by example using a sample data set. Consider this set of hypothetical test scores between 0 and 100:
38 | 78 | 97 | 100 | 80 |
94 | 83 | 56 | 67 | 84 |
85 | 85 | 89 | 93 | 72 |
68 | 71 | 73 | 90 | 77 |
To make a stem plot of this data, follow these procedures.
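First, we have to decide what the stems are. These (except for the 100) are all two-digit numbers, so we'll choose the first digit as the stem; for 100, the stem is 10. That gives us stems of 3, 5, 6, 7, 8, 9 and 10. We also include a stem of 4, even though no score uses it, so that the scale has no gaps. Here they are: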
3|
4|
5|
6|
7|
8|
9|
10|
The leaves are our set of second digits, each associated with its own first digit. For example, the stem of 3 has a leaf of 8 that goes with it. The stem value of 4 has nothing associated with it – there are no scores in the 40s. It looks like this:
3|8
4|
5|6
6|78
7|82137
8|034559
9|7430
10|0
Finally, put all of the leaves in numerical order, and add a key for the reader:
3|8
4|
5|6
6|78
7|12378
8|034559
9|0347
10|0
Key: 7|2 means a score of 72
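The same steps (pick the stems, attach the leaves, sort them) can be sketched in a few lines of Python. This is only an illustration, not part of the original procedure; the function name `stem_plot` is made up here, and it assumes whole-number data with the last digit used as the leaf.

```python
from collections import defaultdict

scores = [38, 78, 97, 100, 80, 94, 83, 56, 67, 84,
          85, 85, 89, 93, 72, 68, 71, 73, 90, 77]


def stem_plot(values):
    """Return stem-plot rows: leading digits are stems, last digit is the leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting first keeps each row in order
        stem, leaf = divmod(v, 10)    # e.g. 72 -> stem 7, leaf 2
        leaves[stem].append(leaf)
    # Include empty stems (such as 4) so gaps in the data stay visible.
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{s}|{''.join(str(d) for d in leaves.get(s, []))}" for s in stems]


lines = stem_plot(scores)
print("\n".join(lines))
print("Key: 7|2 means a score of 72")
```

Sorting the data before splitting it into stems and leaves combines the last two hand steps into one.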
So that's our stem plot. Starting from it, we can learn a number of things about our distribution of scores. First, its appearance: it is more or less normal (Gaussian), with perhaps a slight skew to the left. The median is the center value, which is easy to find by counting leaves. With 20 scores, it is the average of the 10th and 11th values:
$$\frac{80 + 83}{2} = 81.5$$
The first quartile (Q1) is the median of the first half of the data, or the average of the 5th and 6th values:
$$\frac{71 + 72}{2} = 71.5$$
Likewise, the third quartile is the average of the 15th and 16th values, or
$$\frac{89 + 90}{2} = 89.5$$
From there we can calculate the interquartile range, IQR = Q3 − Q1 = 89.5 − 71.5 = 18, and so on. The min and max values are 38 and 100, respectively. From the look of the distribution, we don't expect any outlying data points on the high side, but we ought to inspect that 38:
$$Q1 - 1.5 \times IQR = 71.5 - 1.5(18) = 44.5,$$
so a score of 38 is indeed an outlier. From this information, we can construct a quick box plot to further characterize this data set:
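The numbers a box plot needs (minimum, Q1, median, Q3, maximum) and the outlier check can be verified with a short Python sketch. It assumes the same "median of each half" definition of the quartiles used above; the helper name `median` and the variables `low_fence` and `high_fence` are chosen here for illustration. Library routines sometimes use other quartile conventions and can give slightly different values.

```python
scores = [38, 78, 97, 100, 80, 94, 83, 56, 67, 84,
          85, 85, 89, 93, 72, 68, 71, 73, 90, 77]


def median(sorted_vals):
    """Median of an already-sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2:                         # odd count: the single middle value
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2


data = sorted(scores)
half = len(data) // 2
q2 = median(data)
q1 = median(data[:half])               # median of the lower half
q3 = median(data[len(data) - half:])   # median of the upper half
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr             # values below this are outliers
high_fence = q3 + 1.5 * iqr            # values above this are outliers
outliers = [v for v in data if v < low_fence or v > high_fence]

print("min, Q1, median, Q3, max:", data[0], q1, q2, q3, data[-1])
print("IQR:", iqr, " fences:", low_fence, high_fence)
print("outliers:", outliers)
```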
Sometimes when we create a stem plot from data we find that a few stems have most of the leaves, perhaps distorting our view of the shape of the underlying distribution. To fix that we can split some or all of the stems. We have to take great care when doing this, however, lest we mislead ourselves or confirm a bias. Here's how it works. First a data set, then the stem plots:
261 | 232 | 269 | 287 | 232 |
252 | 242 | 257 | 287 | 255 |
277 | 269 | 275 | 269 | 254 |
246 | 254 | 245 | 251 | 261 |
231 | 257 | 250 | 251 | 255 |
250 | 232 | 220 | 242 | 253 |
268 | 253 | 267 | 279 | 245 |
Here is the stem plot with one row per stem, the leaves in numerical order, and a key for the reader:
22|0
23|1222
24|22556
25|0011233445577
26|1178999
27|579
28|77
Key: 23|2 means 232
More than half of the leaves sit on the 25 and 26 stems. Here is the same plot with those two stems split, so that leaves 0 to 4 go on the first row and leaves 5 to 9 on the second:
22|0
23|1222
24|22556
25|001123344
25|5577
26|11
26|78999
27|579
28|77
Key: 23|2 means 232
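Splitting stems is easy to sketch in Python as well. The function below is a hypothetical helper (`split_stem_plot` is not a library routine); it assumes whole-number data, uses the last digit as the leaf, and splits only the stems named in `split`, sending leaves 0 to 4 to the first row and 5 to 9 to the second. It is run on the data table as listed above.

```python
data = [261, 232, 269, 287, 232, 252, 242, 257, 287, 255,
        277, 269, 275, 269, 254, 246, 254, 245, 251, 261,
        231, 257, 250, 251, 255, 250, 232, 220, 242, 253,
        268, 253, 267, 279, 245]


def split_stem_plot(values, split=()):
    """Stem-plot rows; stems listed in `split` get separate 0-4 and 5-9 rows."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)     # e.g. 257 -> stem 25, leaf 7
        half = 1 if (stem in split and leaf >= 5) else 0
        rows.setdefault((stem, half), []).append(leaf)
    # Sorting the (stem, half) keys puts the low half of a split stem first.
    return [f"{stem}|{''.join(map(str, rows[(stem, half)]))}"
            for stem, half in sorted(rows)]


lines = split_stem_plot(data, split={25, 26})
print("\n".join(lines))
print("Key: 23|2 means 232")
```

Calling `split_stem_plot(data)` with no `split` argument gives the ordinary one-row-per-stem plot, which makes it easy to compare the two views before deciding whether the split is justified.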
Pie charts – they look like a sliced pie – are a common way of showing proportion among categories. Here is a pie chart, using data from 2022, that shows the broad categories of the federal budget of the United States. About 62% of all federal funds are spoken for each year, including this one. These are things like salaries, Social Security, Medicare, etc. Another 8% of the budget has to go to "debt service," that is, interest on borrowed money, and the rest, 30%, can be reallocated by Congress and other agencies with authority to do so.
The pie chart is a nice way to display this data. It's understood that the full circle represents 100% of the budget, and we can see right away that mandatory spending takes up the largest portion of it, followed by discretionary spending and interest payments. Not only do we get a good look at which categories are bigger and smaller, but we can also see how much bigger or smaller.
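Because the full circle stands for 100%, each wedge's angle is just its share of 360 degrees. Here is a minimal Python sketch of that arithmetic, using the three percentages quoted above; the dictionary name and category labels are chosen here for illustration.

```python
# Shares of the budget, in percent, as quoted in the text.
budget_share = {"Mandatory": 62, "Discretionary": 30, "Interest": 8}

total = sum(budget_share.values())
# A pie chart should account for the whole, so the shares must sum to 100.
assert total == 100, "shares in a pie chart should cover the whole"

# Each wedge gets the same fraction of 360 degrees as its share of the total.
angles = {name: pct * 360 / total for name, pct in budget_share.items()}

for name, angle in angles.items():
    print(f"{name}: {budget_share[name]}% -> {angle:.1f} degrees")
```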
There are plenty of options for pie charts. For example, we can pull a pie wedge out for emphasis, as in the chart below.
Fancier pie charts can be rendered in 3-D like this one, with emphasis, as above, on the "Discretionary" wedge. A chart like this doesn't impart any more insight into the data; it's just more visually interesting.
A word on color: Sometimes the over-use of color can be distracting. It should be used with care. Sometimes color can be used to emphasize a point about data, like this:
It's important that all categories are represented in a pie chart, or if not, that the context is clearly given in a caption or the text. For example, we might create another pie chart breaking out the discretionary spending part of the budget, but we'd need to be clear that that's what we're presenting. A pie chart should represent 100% of all possibilities for any well-defined category.
For comparison's sake, of course, all categories displayed in a pie chart must have the same units. In these cases, the unit is percent of the whole budget.
Bar charts are ubiquitous in nearly all fields that involve data. The reader of a bar chart compares the relative heights of the bars against a scale to make judgements about the data. They can be used just like pie charts to show proportion, but more often they're used for comparison and sometimes to show growth. Here's an example using data on the sales of personal computers in 2021.
It is a little easier to see the ordering of total sales (most to least, right-to-left) in the bar chart, but to get a sense of how much of the "pie" each company owns, nothing really beats a pie chart.
The stacked bar chart is a hybrid of pie and bar charts. It shows each category's share of the whole, and the trend in those shares over (in this case) time. This chart shows the proportion of personal computer sales by major company over five years.
Sometimes little lines are drawn linking the stacked bars. They can help to illustrate the trends in the changes of proportion.
Here is a shot of a TI-84 Plus calculator. Some important keyboard areas are highlighted in red. These buttons are of particular use in entering and manipulating statistical data. In following sections, we'll do a couple of video examples of how to accomplish various tasks.
The LIST / STAT button ( LIST is 2nd+[STAT] ) opens up ways of entering and manipulating lists of data. There is a bit of redundancy between these two sets of functions, but in general, if you'd like to enter new data into a list, clear a list to accept new data, or perform math operations on lists, use [STAT]. LIST is a way to select lists of data for use in other functions. For example, if you wanted to add items in list L1 to those in list L2 and store the result in list L3, your key sequence could be:
2nd [STAT] L1 [+] 2nd [STAT] L2 STO→ 2nd [STAT] L3
Here's what the LIST screen looks like:
The STATPLOT button (2nd+[Y=]) opens up a set of ways to graph one set of data against another, and to do a number of statistical operations on those graphs. Here's a typical STATPLOT window.
In STATPLOT mode, you can graph "lists" of data – that's what the L1 and L2 are in the example. A number of different statistical graph types are available, as you can see. We'll explore those elsewhere.
The 2nd functions of the number buttons [1], [2], ... , [6] provide direct access to lists. The list operation done above using the LIST button could also be accomplished just using direct access to lists, like this:
2nd [1] [+] 2nd [2] STO→ 2nd [3]
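For comparison, here is what that list operation does, sketched in Python. The list contents are hypothetical stand-ins for whatever is stored in L1 and L2; the point is only that the sum is taken element by element, so both lists need the same length.

```python
# Hypothetical contents; on the calculator these would live in L1 and L2.
L1 = [1, 2, 3, 4]
L2 = [10, 20, 30, 40]

# Element-by-element addition only makes sense for lists of equal length.
if len(L1) != len(L2):
    raise ValueError("lists must be the same length to add element by element")

# Equivalent of: L1 + L2 STO→ L3
L3 = [a + b for a, b in zip(L1, L2)]
print(L3)
```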
A complete alphabetical catalog of all of the TI-84 calculator functions can be accessed by entering 2nd [0]. Once the catalog comes up, the ALPHA function is enabled, so if, for example, you want a function that starts with the letter D, just press the [x⁻¹] button, which is ALPHA-D, and the list will hop to the D's, making it easier to find what you want in this extensive list.
Here is an example of how to code a correlation coefficient calculation. This is a simple Python program that calculates the correlation coefficient for the same data as the spreadsheet example above. The data is in the arrays x and y.
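For reference, the quantity the program computes can be written as follows, where $\bar{x}$ and $\bar{y}$ are the averages of the x and y values (xavg and yavg in the code) and n is the number of data pairs (nData). This is the standard Pearson form, written out here to match the sums the code accumulates:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$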
#!/usr/bin/env python3
# program: correlation.py for xaktly.com/Correlation.html (this is a comment)
__author__ = 'Your name'

import math  # Needed for the square-root function

# The data are stored in lists (square brackets) so they can be indexed.
x = [20,21,23,28,29,29,31,33,33,34,36,40,40,41,42,43,44,
     44,44,45,47,48,50,51,52,55,58,61,62,66,67,67,72,74,77,80]
y = [204,189,200,191,198,184,188,188,189,195,178,186,178,184,172,
     171,183,174,170,181,169,175,165,164,173,174,176,163,156,155,
     159,149,160,156,142,139,149,134]

# Every x needs a partner y. As listed here, x holds 36 values and y holds 38,
# so the two missing x values must be restored before the program will run.
if len(x) != len(y):
    raise ValueError('x and y must contain the same number of data points')
nData = len(x)  # Number of data points in each array

xsum = 0.0   # sum of all x[i]
ysum = 0.0   # sum of all y[i]
x2sum = 0.0  # sum of all (x[i] - xavg)^2
y2sum = 0.0  # sum of all (y[i] - yavg)^2
xysum = 0.0  # sum of (x[i] - xavg)(y[i] - yavg)

# First pass: add up the values to get the averages
for i in range(nData):
    xsum += x[i]
    ysum += y[i]

xavg = xsum / nData  # average value of x's
yavg = ysum / nData  # average value of y's

# Second pass: add up the squared deviations and the cross products
for i in range(nData):
    x2sum += (x[i] - xavg) * (x[i] - xavg)
    y2sum += (y[i] - yavg) * (y[i] - yavg)
    xysum += (x[i] - xavg) * (y[i] - yavg)

# calculate r, remembering to take roots:
r = xysum / (math.sqrt(x2sum) * math.sqrt(y2sum))
print('correlation coefficient: r =', r)
Often we work with data sets in which a dependent variable depends on more than one independent variable. Maybe it's a data set that models, say, a six-dimensional function,
$$y = f(x_1, x_2, x_3, x_4, x_5, x_6)$$
In such a case, we would want to get an idea about the independence of the variables x1 ... xn. To do that we can calculate (I won't say how here) a matrix of correlation coefficients. It might look like this one (which is made up).
Notice that x1 is perfectly correlated with x1, and so on, those coefficients lying along the diagonal (upper left to lower right) of our matrix.
In analyzing such a matrix, we're making sure that (1) the positive and negative coefficients are more or less randomly distributed, i.e. that there's no significant bias one way or the other.
And (2), we're flagging any large correlations. In this matrix, the correlations between x1 & x4 and x2 & x6 are 0.6 or above. While not terribly worrisome, it would be a good plan to look at those relationships more closely to make sure that those variables are actually as independent as we've assumed.
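One possible way to build and screen such a matrix is to compute the correlation coefficient for every pair of variables and then flag the large ones. The Python sketch below does that under stated assumptions: the three short columns are made up purely for illustration (they are not the data behind the matrix above), the helper `pearson_r` is written here, and the 0.6 cutoff simply mirrors the value mentioned in the text.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    xavg, yavg = sum(x) / n, sum(y) / n
    xysum = sum((a - xavg) * (b - yavg) for a, b in zip(x, y))
    x2sum = sum((a - xavg) ** 2 for a in x)
    y2sum = sum((b - yavg) ** 2 for b in y)
    return xysum / (math.sqrt(x2sum) * math.sqrt(y2sum))


# Made-up columns, purely for illustration (not data from this page).
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [5.0, 3.0, 6.0, 1.0, 2.0, 4.0],
}

names = list(columns)
# Full matrix, including the diagonal (each variable against itself).
matrix = {(a, b): pearson_r(columns[a], columns[b])
          for a in names for b in names}

for a in names:
    print(a, "  ".join(f"{matrix[(a, b)]:6.2f}" for b in names))

# Flag each distinct pair whose correlation is large in either direction.
THRESHOLD = 0.6
flagged = [(a, b) for a, b in combinations(names, 2)
           if abs(matrix[(a, b)]) >= THRESHOLD]
print("pairs to inspect:", flagged)
```

Only the pairs above the diagonal are checked, because the matrix is symmetric and the diagonal entries are always 1.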
xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012-2019, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.