#### xaktly | Probability & Statistics

Stem & Leaf diagrams

### A quick way to organize data and visualize distributions

Stem plots (sometimes "stemplots") or stem-and-leaf plots are a quick way of sorting and organizing data, and a quick way of visualizing the shape of a distribution — or of comparing two distributions. We'll learn them by example using a sample data set. Consider this set of hypothetical test scores between 0 and 100:

 38 78 97 100 80 94 83 56 67 84 85 85 89 93 72 68 71 73 90 77

### How to do it

To make a stem plot of this data, follow these steps.

#### 1. Write the stems in order

First, we have to decide what the stems are. These (except for the 100) are all two-digit numbers, so we'll choose the tens digit as the stem; for 100, the stem is 10. The data have stems of 3, 5, 6, 7, 8, 9 and 10, and we also list the empty stem 4 so that the scale has no gaps. Here they are:

     3|
     4|
     5|
     6|
     7|
     8|
     9|
    10|


#### 2. Now add the leaves

The leaves are our set of second digits, each attached to its own first digit. For example, the stem 3 has an 8 that goes with it (the score 38). The stem 4 has nothing attached to it because there are no scores in the 40s. For the score of 100, the stem is 10 and the leaf is 0. Before sorting, it looks like this:

     3|8
     4|
     5|6
     6|78
     7|82137
     8|034559
     9|7430
    10|0


#### 3. Order the leaves

Finally, put all of the leaves in numerical order, and add a key for the reader:

     3|8
     4|             Key: 7|2 means a
     5|6            score of 72
     6|78
     7|12378
     8|034559
     9|0347
    10|0
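The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_plot` is made up here, and it assumes non-negative whole-number data in which the last digit is the leaf and everything before it is the stem.

```python
from collections import defaultdict

def stem_plot(data):
    """Return the rows of a stem plot (stem = all but the last digit, leaf = last digit)."""
    leaves = defaultdict(list)
    for value in sorted(data):            # sorting first leaves every row in order (step 3)
        stem, leaf = divmod(value, 10)    # e.g. 72 -> stem 7, leaf 2
        leaves[stem].append(leaf)
    rows = []
    # Walk every stem from smallest to largest so empty stems (like 4) still appear.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2}|{row}")
    return rows

scores = [38, 78, 97, 100, 80, 94, 83, 56, 67, 84,
          85, 85, 89, 93, 72, 68, 71, 73, 90, 77]
plot = stem_plot(scores)
print("\n".join(plot))
```

Sorting the data before splitting it into stems and leaves merges steps 2 and 3, which is convenient in code even though doing them separately is easier by hand.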


So that's our stem plot. Starting from it, we can learn a number of things about our distribution of scores. First, its appearance: it is more or less normal (Gaussian), with perhaps a slight skew to the left. The median is the center value, and it's easy to find by counting leaves. With 20 scores, the median is the average of the 10th and 11th values:

$$\frac{80 + 83}{2} = 81.5$$

The first quartile (Q1) is the median of the first half of the data, or the average of the 5th and 6th values:

$$\frac{71 + 72}{2} = 71.5$$

Likewise, the third quartile is the average of the 15th and 16th values, or

$$\frac{89 + 90}{2} = 89.5$$

From there we can calculate the interquartile range, IQR = Q3 - Q1 = 89.5 - 71.5 = 18, and so on. The min and max values are 38 and 100, respectively. From the look of the distribution, we don't expect any outlying data points on the high side, but we ought to inspect that 38, using the usual rule that a value below Q1 - 1.5 × IQR is an outlier:

$$Q1 - 1.5 \times IQR = 71.5 - 1.5(18) = 44.5,$$

so a score of 38, which falls below 44.5, is indeed an outlier. From this information, we can construct a quick box plot to further characterize this data set.
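The quartile arithmetic above can be checked with a short Python sketch. It assumes the same convention as the text, namely that Q1 and Q3 are the medians of the lower and upper halves of the sorted data; other conventions exist and can give slightly different quartiles. The upper fence, Q3 + 1.5 × IQR, is not computed in the text; it is included here as the standard companion to the lower fence. The helper name `median_of_sorted` is made up for this example.

```python
def median_of_sorted(values):
    """Median of a list that is already sorted."""
    n = len(values)
    mid = n // 2
    if n % 2:                                   # odd count: the single middle value
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2  # even count: average the middle pair

scores = sorted([38, 78, 97, 100, 80, 94, 83, 56, 67, 84,
                 85, 85, 89, 93, 72, 68, 71, 73, 90, 77])

half = len(scores) // 2
q1 = median_of_sorted(scores[:half])    # median of the lower half
q2 = median_of_sorted(scores)           # the median itself
q3 = median_of_sorted(scores[-half:])   # median of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print("min, Q1, median, Q3, max:", scores[0], q1, q2, q3, scores[-1])
print("IQR:", iqr, " fences:", lower_fence, upper_fence, " outliers:", outliers)
```

The five numbers printed on the first line are exactly what a box plot draws, and any value outside the fences would be plotted as a separate point.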

### Splitting stems

Sometimes when we create a stem plot from data we find that a few stems have most of the leaves, perhaps distorting our view of the shape of the underlying distribution. To fix that we can split some or all of the stems. We have to take great care when doing this, however, lest we mislead ourselves or confirm a bias. Here's how it works. First a data set, then the stem plots:

 261 232 269 287 232 252 242 257 287 255 277 269 275 269 254 246 254 245 251 261 231 257 250 251 255 250 232 220 242 253 268 253 267 279 245

Here is the ordinary stem plot, with the first two digits as the stem, the leaves in numerical order, and a key for the reader:

    22|0
    23|1222        Key: 23|2 means 232
    24|22556
    25|0011233445577
    26|1178999
    27|579
    28|77


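Most of the leaves pile up on the 25 and 26 stems, so here is the same data with those two stems split. Leaves 0 to 4 go on the first row of a split stem and leaves 5 to 9 on the second:

    22|0
    23|1222        Key: 23|2 means 232
    24|22556
    25|001123344
    25|5577
    26|11
    26|78999
    27|579
    28|77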
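Splitting can be sketched in Python as well. The function name `split_stem_plot` and its `split_stems` argument are made up for this example; it assumes three-digit whole numbers like the ones above and splits only the stems it is told to, sending leaves 0 to 4 to the first row and leaves 5 to 9 to the second.

```python
from collections import defaultdict

def split_stem_plot(data, split_stems):
    """Stem plot rows; stems listed in split_stems get a 0-4 row and a 5-9 row."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)   # e.g. 232 -> stem 23, leaf 2
        leaves[stem].append(leaf)

    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = leaves.get(stem, [])
        if stem in split_stems:
            groups = [[leaf for leaf in row if leaf <= 4],
                      [leaf for leaf in row if leaf >= 5]]
        else:
            groups = [row]
        for group in groups:
            rows.append(f"{stem}|" + "".join(str(leaf) for leaf in group))
    return rows

data = [261, 232, 269, 287, 232, 252, 242, 257, 287, 255, 277, 269,
        275, 269, 254, 246, 254, 245, 251, 261, 231, 257, 250, 251,
        255, 250, 232, 220, 242, 253, 268, 253, 267, 279, 245]

lines = split_stem_plot(data, split_stems={25, 26})
print("\n".join(lines))
```

Passing the stems to split as an argument keeps the choice explicit, which matters given the warning above about splitting stems in a way that confirms a bias.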


### Pie charts

Pie charts, which look like a sliced pie, are a common way of showing proportion among categories. Here is a pie chart, using data from 2022, that shows the broad categories of the federal budget of the United States. About 62% of all federal funds are already committed each year, including this one; this is "mandatory" spending on things like salaries, Social Security, Medicare, etc. Another 8% of the budget has to go to "debt service," that is, interest on borrowed money, and the rest, 30%, is "discretionary" spending that can be reallocated by Congress and other agencies with authority to do so.

The pie chart is a nice way to display this data. It's understood that the full circle represents 100% of the budget, and we can see right away that mandatory spending takes up the largest portion of it, followed by discretionary spending and interest payments. We see not only which categories are bigger and smaller, but also by how much.

There are plenty of options for pie charts. For example, we can pull a pie wedge out for emphasis, as in the chart below.

Fancier pie charts can be rendered in 3-D like this one, with emphasis, as above, on the "Discretionary" wedge. A chart like this doesn't impart any more insight into the data; it's just more visually interesting.

A word on color: Sometimes the over-use of color can be distracting. It should be used with care. Sometimes color can be used to emphasize a point about data, like this:

It's important that all categories are represented in a pie chart, or if not, that the context is clearly given in a caption or the text. For example, we might create another pie chart breaking out the discretionary spending part of the budget, but we'd need to be clear that that's what we're presenting. A pie chart should represent 100% of all possibilities for any well-defined category.

For comparison's sake, of course, all categories displayed in a pie chart must have the same units. In these cases, the unit is percent of the whole budget.
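The "100% of the whole" rule is easy to check numerically. The sketch below uses the three percentages quoted above, confirms that they account for the full budget, and converts each share to the angle of its wedge (a full circle is 360 degrees). The dictionary name `budget_share` and the short category labels are chosen just for this example.

```python
# Percent of the 2022 federal budget in each broad category (figures from the text).
budget_share = {"Mandatory": 62, "Discretionary": 30, "Interest": 8}

total = sum(budget_share.values())
if total != 100:
    raise ValueError("the categories of a pie chart should add up to 100%")

# Each wedge gets the same fraction of 360 degrees as its share of the whole.
angles = {name: 360 * pct / 100 for name, pct in budget_share.items()}

for name, angle in sorted(angles.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{budget_share[name]:>3}%  {angle:6.1f} degrees")
```

The same check applies to a chart that breaks out only one category, such as discretionary spending: its wedges should add up to 100% of that category.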

### Bar charts

Bar charts are ubiquitous in nearly all fields that involve data. The reader of a bar chart compares the relative heights of the bars against a scale to make judgements about the data. They can be used just like pie charts to show proportion, but more often they're used for comparison and sometimes to show growth. Here's an example using data on the sales of personal computers in 2021.

It is a little easier to see the ordering of total sales (most to least, right-to-left) in the bar chart, but to get a sense of how much of the "pie" each company owns, nothing really beats a pie chart.

The stacked bar chart is a hybrid of pie and bar charts. It shows each category's share of the whole, and the trend in those shares over (in this case) time. This chart shows the proportion of personal computer sales by major company over five years.

##### Proportion of personal computer sales, 2017-2021

Sometimes little lines are drawn linking the stacked bars. They can help to illustrate the trends in the changes of proportion.

### TI-84 calculator

#### Statistics-related buttons

Here is a shot of a TI-84 Plus calculator. Some important keyboard areas are highlighted in red. These buttons are of particular use in entering and manipulating statistical data. In the following sections, we'll do a couple of video examples of how to accomplish various tasks.

#### LIST / STAT

The LIST / STAT button ( LIST is 2nd+[STAT] ) opens up ways of entering and manipulating lists of data. There is a bit of redundancy between these two sets of functions, but in general, if you'd like to enter new data into a list, clear a list to accept new data, or perform math operations on lists, use [STAT]. LIST is a way to select lists of data for use in other functions. For example, if you wanted to add items in list L1 to those in list L2 and store the result in list L3, your key sequence could be:

2nd [STAT]   L1   [+]   2nd [STAT]   L2   STO→   2nd [STAT]   L3

Here's what the LIST screen looks like:

#### STATPLOT

The STATPLOT button (2nd+[Y=]) opens up a set of ways to graph one set of data against another, and to do a number of statistical operations on those graphs. Here's a typical STATPLOT window.

In STATPLOT mode, you can graph "lists" of data – that's what the L1 and L2 are in the example. A number of different statistical graph types are available, as you can see. We'll explore those elsewhere.

#### LIST buttons

The 2nd functions of the number buttons [1], [2], ... , [6] provide direct access to lists. The list operation done above using the LIST button could also be accomplished just using direct access to lists, like this:

2nd [1]   [+]   2nd [2]   STO→   2nd [3]
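For readers who think more easily in code, the calculator operation L1 + L2 → L3 is an element-by-element sum. Here is a rough Python equivalent; the list contents are hypothetical values chosen only to show the pattern, and the length check is an assumption that both lists must be the same size for the sum to make sense.

```python
# Hypothetical list contents, standing in for data typed into L1 and L2.
L1 = [1, 2, 3, 4]
L2 = [10, 20, 30, 40]

# Adding lists item by item only makes sense if they are the same length.
if len(L1) != len(L2):
    raise ValueError("L1 and L2 must have the same number of items")

# Pair up the items and add each pair, storing the results in L3.
L3 = [a + b for a, b in zip(L1, L2)]
print(L3)
```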

#### CATALOG

A complete alphabetical catalog of all of the TI-84 calculator functions can be accessed by pressing 2nd [0]. Once the catalog comes up, the ALPHA function is enabled, so if, for example, you want a function that starts with the letter D, just press the [x⁻¹] button, which is ALPHA-D, and the list will hop to the D's, making it easier to find what you want in this extensive list.

### A programming example

Here is an example of how to code something like a correlation coefficient calculation. This is a simple Python program that calculates the correlation coefficient for the same data as the spreadsheet example above. The data is in the lists x and y.

    #!/usr/bin/env python3
    # program: correlation.py for xaktly.com/Correlation.html (this is a comment)
    __author__ = 'Your name'

    import math   # Needed for the square-root function

    # The data go in lists (square brackets), which keep their order and can be
    # indexed; curly braces would create unordered sets instead.
    x = [20,21,23,28,29,29,31,33,33,34,36,40,40,41,42,43,44,
         44,44,45,47,48,50,51,52,55,58,61,62,66,67,67,72,74,77,80]
    y = [204,189,200,191,198,184,188,188,189,195,178,186,178,184,172,
         171,183,174,170,181,169,175,165,164,173,174,176,163,156,155,
         159,149,160,156,142,139,149,134]

    # NOTE: as printed here, x holds 36 values and y holds 38, so two x values
    # are missing. Restore them from the original data before running.
    if len(x) != len(y):
        raise ValueError('x and y must contain the same number of data points')

    nData = len(x)   # Number of data points in each list

    xsum = 0.0    # sum of all x[i]
    ysum = 0.0    # sum of all y[i]
    x2sum = 0.0   # sum of all (x[i] - xavg)^2
    y2sum = 0.0   # sum of all (y[i] - yavg)^2
    xysum = 0.0   # sum of (x[i] - xavg)(y[i] - yavg)

    for i in range(nData):
        xsum += x[i]
        ysum += y[i]

    xavg = xsum / nData   # average value of x's
    yavg = ysum / nData   # average value of y's

    for i in range(nData):
        x2sum += (x[i] - xavg) * (x[i] - xavg)
        y2sum += (y[i] - yavg) * (y[i] - yavg)
        xysum += (x[i] - xavg) * (y[i] - yavg)

    # calculate r, remembering to take roots:
    r = xysum / (math.sqrt(x2sum) * math.sqrt(y2sum))

    print('correlation coefficient: r =', r)
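The program computes

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$$

and the same calculation can be wrapped in a reusable function. The sketch below is one way to do that; the name `pearson_r` and the small check data are made up for this example. Points that fall exactly on a rising line should give r very close to +1, and points on a falling line r very close to -1, which makes a handy sanity test.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient of two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of data points")
    n = len(x)
    xavg = sum(x) / n
    yavg = sum(y) / n
    xysum = sum((xi - xavg) * (yi - yavg) for xi, yi in zip(x, y))
    x2sum = sum((xi - xavg) ** 2 for xi in x)
    y2sum = sum((yi - yavg) ** 2 for yi in y)
    return xysum / (math.sqrt(x2sum) * math.sqrt(y2sum))

# Hypothetical check data: y values lying exactly on a straight line.
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * v + 1 for v in xs])     # rising line
r_down = pearson_r(xs, [10 - 3 * v for v in xs])  # falling line
print("rising line:  r =", r_up)
print("falling line: r =", r_down)
```

Because of floating-point rounding, the printed values may differ from exactly 1 and -1 in the last decimal place or two.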

### More complicated data sets

Often we work with data sets in which a dependent variable depends on more than one independent variable. Maybe it's a data set that models, say, a function of six variables,

$$y = f(x_1, x_2, x_3, x_4, x_5, x_6)$$

In such a case, we would want to get an idea about the independence of the variables x1 ... x6. To do that we can calculate (I won't say how here) a matrix of correlation coefficients, one for each pair of variables. It might look like this one (which is made up).

Notice that x1 is perfectly correlated with x1, x2 with x2, and so on. Those coefficients, all equal to 1, lie along the diagonal (upper left to lower right) of our matrix.

In analyzing such a matrix, we're making sure that (1) the positive and negative coefficients are more or less randomly distributed, i.e. that there's no significant bias one way or the other.

And (2), we're flagging any large correlations. In this matrix, the correlations between x1 & x4 and x2 & x6 are 0.6 or above. While not terribly worrisome, it would be a good plan to look at those relationships more closely to make sure that those variables are actually as independent as we've assumed.
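One simple way to build such a matrix is to compute the correlation coefficient for every pair of variables. The text above leaves the method unstated, so treat the following as an illustrative sketch, not as the author's procedure. The data are made up, only three variables are used to keep it short, and the names `pearson_r`, `correlation_matrix` and `flagged` are chosen for this example. The 0.6 threshold is the one mentioned above.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient of two equal-length sequences of numbers."""
    n = len(x)
    xavg, yavg = sum(x) / n, sum(y) / n
    xysum = sum((a - xavg) * (b - yavg) for a, b in zip(x, y))
    x2sum = sum((a - xavg) ** 2 for a in x)
    y2sum = sum((b - yavg) ** 2 for b in y)
    return xysum / (math.sqrt(x2sum) * math.sqrt(y2sum))

def correlation_matrix(columns):
    """Map each pair of variable names (a, b) to the correlation of their data."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}

# Made-up values for three independent variables (six observations each).
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [3, 5, 1, 6, 2, 4],
}
matrix = correlation_matrix(columns)

# Print the matrix as a grid; the diagonal holds each variable against itself.
names = list(columns)
for a in names:
    print(a, " ".join(f"{matrix[(a, b)]:6.2f}" for b in names))

# Flag each off-diagonal pair once (a < b) if its correlation is 0.6 or above in size.
flagged = [(a, b, round(r, 2)) for (a, b), r in matrix.items()
           if a < b and abs(r) >= 0.6]
print("pairs to look at more closely:", flagged)
```

The matrix is symmetric, so only the pairs above the diagonal need to be inspected.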

### ubiquitous

"Ubiquitous" means present, appearing, or found everywhere.

xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012-2019, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.