xaktly | Probability & Statistics

Describing data

Making measurements


Whenever we make a measurement, whether it's a length, a time, a pressure, or a measure of the likelihood that a certain political candidate will win, we generally want to have replicates of the measurement. That is, we want to make the measurement several times under similar conditions so that we can get an idea of the reliability of our methods and their results.

Generally, that means doing some averaging. We're usually much more confident in an average of 100 measurements than we are a single measurement. Or at least we ought to be.

In general, when we repeat a measurement, the results won't be exact copies of each other. Instead, we'll always have random fluctuations (hopefully small) that cluster around some average value. We say that our results are distributed around some central value, often the mean, which we'll be more likely to accept as representing the "true" value of what we're measuring.

In making any measurement, we're mostly interested in two things: (1) What is the center of our distribution of measurements? And (2) how far are the measurements scattered from that central value?

Measurements of center


Whenever we make some meaningful measurement, it's very unlikely that we'll just take one measurement and accept the result. It's at least poor practice. What if we'd made a mistake? A more reliable and believable method is to make several measurements, which will have some sort of distribution around a central value, and take an average or sometimes a median. Which of those we use will be case-dependent, as we shall see ...

An average of a set of numbers is just the sum of all of them divided by their total number, $n$. The average can be defined in two ways:

$$\bar x = \sum_{i=1}^n \frac{x_i}{n} \; \; \color{magenta}{\text{ or }} \; \; \bar x = \frac{1}{n} \sum_{i=1}^n x_i$$

These really aren't different; it's just that in the second, the factor of $1/n$ that appears in every term has been factored out.

Here is a symmetric distribution of 38 measurements centered at $x=0$. We don't have to do a calculation to find the center; it's obvious from the graph that the center and most-likely value of the measurement is zero. That's always the case for a symmetric distribution like this one.

$$ \begin{align} \bar x &= 0 \\[5pt] \text{median} &= 0 \end{align}$$

Here is a non-symmetric distribution of 38 measurements ($n=38$), that is skewed to the right, meaning that there are more measurements on the right side of the maximum of the distribution (at $x=0$) than on the left.

In this case, we'll have to calculate the mean. We can do it as a weighted average:

$$ \begin{align} \bar x &= \frac{1}{38} [-2(4)-1(5)+0(7)+1(6)+2(5) \\[5pt] &\phantom{000000} + 3(3)+4(3)+5(2)+6(2)+7(1) ] \\[5pt] &= \frac{1}{38}[-13+66] = 1.39 \end{align}$$

The median is the average of the 19th and 20th data points, counted from either end of the graph. It turns out to be

$$\text{median} = 1$$
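As a quick check of the arithmetic above, here is a minimal Python sketch. The value-to-count table is read off the weighted-average calculation in the text; the variable names are only illustrative choices.

```python
from statistics import median

# Value -> count pairs, read from the weighted-average calculation in the text
counts = {-2: 4, -1: 5, 0: 7, 1: 6, 2: 5, 3: 3, 4: 3, 5: 2, 6: 2, 7: 1}

n = sum(counts.values())  # total number of measurements
weighted_mean = sum(x * c for x, c in counts.items()) / n

# Expand the table into a sorted list of individual measurements
data = sorted(x for x, c in counts.items() for _ in range(c))
center = median(data)  # averages the two middle values when n is even

print(n, round(weighted_mean, 2), center)  # 38 1.39 1.0
```

Expanding the table into a flat list is wasteful for large counts, but it keeps the median step as simple as counting dots on the graph.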

In situations where we have a skewed distribution, the median is usually how we want to calculate the center. Think of it this way: Let's say that a neighborhood composed of middle-class households gets a certain level of service from the city, and less if the price of housing is higher. The idea would be that richer neighborhoods might be able to pay more for their own upkeep. Now let's say that in this neighborhood of, say 50 homes, a billion-dollar mansion is built — a home that is worth far more than any of the other 50 homes. The average price of homes in this neighborhood might go up a lot, but that's unfair to the 50 out of 51 homeowners who are still solidly middle class. The median home value, on the other hand — the one in the middle — would be unlikely to change that much, and thus more accurately represent the bulk of the homes and families in the neighborhood.

Measures of center

For a given distribution of measurements, we can describe the center of the data in two ways:

Mean


Sum the data points $(x_i)$ and divide by the number of measurements, $n$.

$$ \begin{align} \bar x &= \sum_{i=1}^n \frac{x_i}{n} \\[5pt] &= \frac{1}{n} \sum_{i=1}^n x_i \end{align}$$

Usually, a bar over a quantity will indicate the mean of that quantity.

Median


The median of a set of measurements is the middle value. For an odd number of measurements $(n \; \text{odd})$, the median is the actual middle value. For an even number of measurements, we just average the middle two.

For example, for a set of 31 measurements arranged in numerical order, the median is the 16th value. For a set of 30, the median is the average of the 15th and 16th values.
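The two definitions can be sketched in a few lines of Python. The function names `mean_of` and `median_of` are hypothetical, chosen just for this illustration, and the test lists are made-up data.

```python
def mean_of(values):
    """Sum the data points and divide by the number of measurements."""
    return sum(values) / len(values)


def median_of(values):
    """Return the middle value of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle two


# Made-up data: the integers 1..31 and 1..30, already in numerical order
print(median_of(list(range(1, 32))))  # 16   (the 16th value)
print(median_of(list(range(1, 31))))  # 15.5 (average of the 15th and 16th values)
```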


Practice problems

Determine the mean and median of these data sets:


Set 1

8 11 15 17
9 11 15 17
10 12 15 17
10 12 15 18
10 13 15 18
10 13 15 20
10 13 15 22
10 14 16 25
11 14 16 27
11 14 16 35
11 15 17 38
Solutions

Here is a dotplot of this data set:

The mean is $\bar x = 15.36$ and the median is $15$. These are very close, indicating that, despite the outlying data on the right side of the distribution, it is close to symmetric.

Set 2

4 9 11
6 9 11
7 9
7 9
7 9
7 10
7 10
7 10
9 10
9 10
9 11
Solutions

Here is a dotplot of this data set:

The mean of this distribution of data is $\bar x = 8.62$ and the median is $9$.


Example 1: Mean vs. Median


First, consider the distribution of 25 home prices, reported to the nearest \$50,000, represented by this dotplot:

The median of this distribution is just the 13th home value, counted from the left and moving upward through each stack of dots. That dot (magenta) is at \$350,000. The mean can be calculated using a weighted average:

$$ \begin{align} \bar x &= \frac{1}{25} \big( 2(200) + 3(250) + 5(300) \\[5pt] &+ 7(350) + 4(400) + 2(450) + 1(500) \\[5pt] &+ 1(550) \big) = 346 \; \rightarrow \; \$346,000 \end{align}$$

Note that in the mean calculation we just worked in thousands of dollars. The mean and median are nearly identical in this nearly symmetric distribution.

Now let's modify our distribution by including a single \$5-million house in the data set. Here's the new distribution:




Now, as shown on the graph, the mean has shifted to a value of \$525,000 (the 26 prices now sum to \$13,650,000), which is higher than the price of most houses in the area. In this case, the mean might not be representative of the true housing situation for a family looking to buy a home in the area.

The median home value, however, didn't change. It's now the average of the 13th and 14th values, which remains \$350,000. A family looking to move into the neighborhood can be aware of that \$5-million home, but should also know that the price they pay for a home will more likely be in the \$350,000 range.

This is precisely why we commonly use these two measures of the center of a distribution. When the mean and median are the same, we usually have a left-right symmetric distribution. When the mean is higher than the median, the distribution is skewed to the right, as is our wide home-value graph above. That \$5-million home really stretches the distribution out in that direction. When the mean is smaller than the median, the distribution is usually skewed to the left.

The median is more resistant to change produced by outliers – data points far outside the more clustered data.
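The home-price comparison in Example 1 can be reproduced with Python's standard `statistics` module. This is only a sketch: the price table is taken from the weighted-average calculation in the example (in thousands of dollars), and the variable names are illustrative.

```python
from statistics import mean, median

# Home prices in thousands of dollars -> number of homes at that price
price_counts = {200: 2, 250: 3, 300: 5, 350: 7, 400: 4, 450: 2, 500: 1, 550: 1}
prices = [p for p, c in price_counts.items() for _ in range(c)]

# Add the single $5-million home (5000 in thousands of dollars)
with_mansion = prices + [5000]

print("original:    ", mean(prices), median(prices))
print("with mansion:", mean(with_mansion), median(with_mansion))
```

Running it shows the mean jumping when the one expensive home is added while the median stays put, which is the resistance to outliers described above.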


Describing a distribution


Probability distributions come in all shapes and sizes. We need to be able to describe and classify them. In general, we want to know four things about any distribution:

  • Shape – consider symmetry and skewness
  • Center – often the mean or median. The center gives us the location of the distribution in the domain.
  • Spread – what extent does the data cover – what are the minimum and maximum? The spread of a distribution is usually characterized by one or both of two measures, the standard deviation $(\sigma)$ and the interquartile range (IQR).
  • Outliers – are any points suspiciously far from the rest? Might we have reason not to consider them in our analysis of the data?

The acronyms SCSO or SOCS might help you remember these features.

The most common distribution of data or measurements with variations that occur by random chance is the Normal or Gaussian distribution, shown above. We'll cover its mathematical form on another page.

The height of the normal curve represents the probability of some event occurring or of finding some value of the data. High means likely and low means unlikely. The standard Gaussian curve is left-right symmetric, and the mean and median of the data are the same – right in the middle. Gaussian distributions can be very narrow or very spread out, representing high or low precision of the data, respectively.

Not all distributions are symmetric. Some are skewed one direction or another. We say that the distribution below is "skewed to the left" because it looks like it's being dragged to the left by its left-most point or "tail."



Likewise, the distribution below is "skewed to the right."


For skewed distributions, the mean of the data moves in the direction of the skewing, but the median remains closer to the peak of the curve, so the two differ as the distribution is skewed. Skewness is a key descriptor of a probability distribution.

  • Right skew → mean is to the right of the median
  • Left skew → mean is to the left of the median
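The mean-versus-median comparison can be turned into a rough skew check. The sketch below is a rule of thumb only, not a formal skewness statistic; the function name `skew_direction`, its `tolerance` parameter, and both data lists are made up for illustration.

```python
from statistics import mean, median


def skew_direction(data, tolerance=0.0):
    """Label the skew from the sign of (mean - median); a rule of thumb only."""
    diff = mean(data) - median(data)
    if diff > tolerance:
        return "right"
    if diff < -tolerance:
        return "left"
    return "symmetric"


right_tailed = [1, 2, 2, 3, 3, 3, 4, 12]        # made-up data, long right tail
left_tailed = [-12, -4, -3, -3, -3, -2, -2, -1]  # mirror image, long left tail

print(skew_direction(right_tailed))     # right
print(skew_direction(left_tailed))      # left
print(skew_direction([1, 2, 3, 4, 5]))  # symmetric
```

With real measurements the mean and median are almost never exactly equal, so a small nonzero `tolerance` is usually a sensible choice.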

Just as a quick example of an important skewed distribution, the distribution of atomic or molecular speeds in any sample of a gas has the Maxwell-Boltzmann distribution shown below, a distinctly right-skewed distribution that varies with the temperature of the sample.



Some distributions look quite different from the Normal distribution. They can be triangular, decaying exponentials, ... you name it. Our job will be to describe distributions as they appear.


Width of a distribution


The width of a distribution corresponds to the precision of the data. A distribution that is narrow means that there is a low probability of values far from the mean or median. A wider distribution means that data points far from the middle are more probable. Consider these two distributions of the distance of each dart from the bullseye of a dart board:

The player on the left is very good. She throws near the center of the board on every throw. The player on the right isn't as good. His throws sometimes hit the bullseye, but on average they're more widely scattered.

We use two methods to measure and report the width of a distribution, the standard deviation and the interquartile range. Both are discussed in other sections (standard deviation, interquartile range), so we won't go into them in great detail here.


Standard deviation

The standard deviation (denoted by the Greek lower case "s", $\sigma$) can be calculated for any data set, but it has its clearest geometric meaning for the Normal or Gaussian distribution. For that curve, it is the distance along the domain axis from the mean to either of the two inflection points. (An inflection point is a point on the graph where the curvature changes from concave upward to concave downward, or vice versa.)

The standard deviation is the square root of the average of the squared distances of each member of a data set from the mean of that set. Here is the formula in summation notation:

$$\sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \bar x)^2}$$

We use the squares of the distances, $x_i - \bar x$, so that the positive and negative deviations of a wide distribution can't cancel to a very small width, one that doesn't comport with the precision of the data. We further divide not by $N$, the number of data points, as we would when calculating a mean, but by $N - 1$ instead. This is because we've essentially already used one of our $N$ pieces of data to calculate the mean, leaving only $N - 1$ for other calculations. For more about the standard deviation, go here.

What we're doing here is calculating the sum of the squares of all of the distances of each point from the center, then dividing by the number of data points (minus one), then taking the square root to undo the earlier squaring. We could have taken an absolute value rather than squaring (after all, $|x| = \sqrt{x^2}$), but it turns out that the squaring trick makes a lot of the mathematics of statistics easier as things get more complicated.
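The recipe just described translates almost word for word into code. Here is a minimal sketch; `sample_std` is a hypothetical name and the data list is made up. As a sanity check, it is compared with `statistics.stdev` from Python's standard library, which also divides by $N - 1$.

```python
from math import sqrt
from statistics import stdev


def sample_std(data):
    """Standard deviation with the N - 1 divisor, following the formula above."""
    n = len(data)
    x_bar = sum(data) / n                                # the mean
    sum_of_squares = sum((x - x_bar) ** 2 for x in data)  # squared distances from the mean
    return sqrt(sum_of_squares / (n - 1))                # divide by N - 1, undo the squaring


measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

print(round(sample_std(measurements), 3))  # 2.138
print(round(stdev(measurements), 3))       # 2.138, same result from the library
```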

Interquartile range (IQR)

The interquartile range (IQR) is appropriate for any distribution and not too difficult to calculate. The basic idea (see bar chart below) is that we divide the data in an ordered list into quartiles, groups that all contain, by number, one quarter of the data points. The block from Q0 to Q1 is the first quartile of the data. The median is the second (Q2 and median are synonyms), cutting the data set in half, and Q3 marks the point between the bottom 3/4 and the top quarter of the data. Of those, Q1, the median and Q3 are the important ones.



Let's use the dot plot below, illustrating the number of barrels of oil recovered per year from 37 wells in an oil field, to calculate some quartiles and find the IQR.



The median (green dot) is the 19th data point, the middle one. Because $N$ is odd, we divide the data into two halves that both include the median (see gray box below), then find Q1 and Q3 as the medians of the lower and upper halves of the data, respectively. Those are the magenta lines at values of 35 and 65. So the IQR is $Q3 - Q1 = 65 - 35 = 30$.

Finding quartiles

Median (Q2): When the number of data points, $N$, is odd, the median is the center data point of an ordered list. When the number is even, Q2 is the average of the two center points.

Q1, Q3, $N$ odd: When $N$ is odd, divide the data in half, including the center point (Q2) in each half. Find the median of the first half; that's Q1. Find the median of the second half; that's Q3.

Q1, Q3, $N$ even: When $N$ is even, divide the data set into two halves at the median, then find the median of the first half; that's Q1. Find the median of the second half; that's Q3.

Q0 and Q4 are just the extreme ends of the sorted data list.

The IQR

For any sorted data set divided into quartiles, the IQR is a measure of the width of the set. It is

$$\text{IQR} = Q3 - Q1$$
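The quartile rules in the box above can be sketched in Python as follows. The function names are hypothetical and the data are made up. Be aware that calculators and software libraries do not all use the same quartile convention (some exclude the median from both halves when $N$ is odd, others interpolate), so their results can differ slightly from this version.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(data):
    """Return (Q1, Q2, Q3) using the rules in the box above."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # N odd: both halves include the center point (Q2)
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # N even: split cleanly into two halves at the median
        lower, upper = ordered[:half], ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


def iqr(data):
    """Interquartile range, Q3 - Q1."""
    q1, _, q3 = quartiles(data)
    return q3 - q1


print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2.5, 4, 5.5)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
print(iqr([1, 2, 3, 4, 5, 6, 7]))           # 3.0
```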


Outliers


Often when we collect data there are one or more points that just don't seem to fit. For example, let's say we're measuring reaction times in a chemistry experiment. We run the same reaction under the same conditions ten times, finding an average time of 8.2 ± 1.1 minutes for nine of the measurements, but a time of 21 minutes for the tenth. That last measurement just doesn't look right — it's an outlier, and we might be correct to think about what might have gone wrong with our technique. We might be tempted to throw that datum out of our data set. Would we be justified?

It's important to have some objective measure of what constitutes an outlier, some criterion that we all agree upon. One such criterion is the 1.5 IQR rule. It works like this:

  • A data point is an outlier to the left if it lies more than 1.5 times the IQR to the left of the first quartile (Q1).

    $x \lt \text{Q1} - 1.5 \cdot \text{IQR} $

  • A data point is an outlier to the right if it is greater than 1.5 times the IQR to the right of the third quartile (Q3).

    $x \gt \text{Q3} + 1.5 \cdot \text{IQR} $

Let's look at our oil barrels example above and ask whether there are any outliers. For this data set, Q1 = 35 and Q3 = 65, so the IQR is 30 units. The condition for outliers on the left is

$$ \begin{align} x &\lt \text{Q1} - 1.5 \cdot \text{IQR} \\[5pt] x &\lt 35 - 1.5 \cdot 30 \\[5pt] x &\lt 35 - 45 \\[5pt] x &\lt -10 \end{align}$$

There are no values less than -10, so there are no outliers to the left. Likewise, the condition for outliers on the right is

$$ \begin{align} x &\gt \text{Q3} + 1.5 \cdot \text{IQR} \\[5pt] x &\gt 65 + 1.5 \cdot 30 \\[5pt] x &\gt 65 + 45 \\[5pt] x &\gt 110 \end{align}$$

There are no values greater than 110, so there are no outliers to the right, either; this data set has no outliers.
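The same check is easy to automate. The sketch below assumes Q1 and Q3 have already been found; the function names and the short test list are hypothetical, while the quartile values 35 and 65 come from the oil-well example.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs of the 1.5 IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3, k=1.5):
    """List the data points that fall outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return [x for x in data if x < low or x > high]


# Oil-well example: Q1 = 35, Q3 = 65
print(outlier_fences(35, 65))  # (-10.0, 110.0)

# Made-up values tested against those fences: only 120 lies outside
print(find_outliers([2, 40, 55, 70, 120], 35, 65))  # [120]
```

Note that the comparison is strict, matching the inequalities above: a point sitting exactly on a fence is not flagged.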


Example 2

Consider the dot plot showing the distribution of 52 scores on a chemistry test. Describe the distribution in terms of its shape, center, spread and outliers (SCSO).



Solution: The shape of this dot plot seems roughly symmetric and Normal (that is, it has the shape of the Normal or Gaussian curve). If anything, it is skewed just a bit to the right.

The center of the distribution can be measured in two ways: by calculating the mean and/or by finding the median. To find the mean, we just sum the values and divide by their number to get 70.9. The median is the average of the 26th and 27th data points, for a value of 70. The mean is just a bit larger than the median, confirming just a little bit of rightward skew.

The first and third quartiles are the averages of the 13th and 14th points (Q1 = 62.5) and of the 39th and 40th points (Q3 = 80). The IQR is then 80 - 62.5 = 17.5. We can check for outliers:

  • Left:

    $$ \begin{align} x &\lt Q1 - 1.5 \cdot \text{IQR} \\[5pt] x &\lt 62.5 - 1.5(17.5) \\[5pt] x &\lt 36.25 \end{align}$$

    There are no scores less than 36.25, so there are no outliers on the left.

  • Right:

    $$ \begin{align} x &\gt Q3 + 1.5 \cdot \text{IQR} \\[5pt] x &\gt 80 + 1.5(17.5) \\[5pt] x &\gt 106.25 \end{align}$$

    There are no scores greater than 106.25, so there are no outliers on the right, either.

The spread of the data is between scores of 45 and 100, inclusive.

Because this data looks at least vaguely Normal with the mean approximately equal to the median, a Normal analysis is appropriate, and we can report the standard deviation of the scores, $\sigma = 11.7$.

Outliers

We have to settle on some objective standard for identifying outlier data that we might want to reject from our data sets. The 1.5 IQR test is one such standard.

We can reasonably conclude that a datum is an outlier if it is smaller than the Q1 value minus 1.5 times the IQR, or if it is greater than Q3 plus 1.5 times the IQR.

$$x \lt Q1 - 1.5(IQR) \phantom{00} \text{ or } \phantom{00} x \gt Q3 + 1.5(IQR)$$

Practice problems

For each of the data sets below, sketch a dotplot. Calculate the mean, median and IQR, and determine whether there are any outliers in the data.


Set 1

2 8 10 12
3 8 10 12
4 8 10 13
4 8 10 13
5 9 10 13
5 9 10 14
5 9 11 14
6 9 11 15
6 9 11 20
6 9 11
7 9 12
7 10 12
Solution

Mean: $\bar x = 9.31$

min max Q1 med Q3
$2$ $20$ $7$ $9$ $11.5$

The IQR is $11.5-7 = 4.5$

Outliers on the left would be less than $7-1.5(4.5) = 0.25$ and outliers on the right would be any datum greater than $11.5+1.5(4.5) = 18.25$. There are no outliers on the left, but we would be justified in dropping the point at 20 as an outlier.

Set 2

1 5 8 11
2 6 8 12
3 6 8 14
3 6 9 26
4 7 9 30
4 7 10
4 7 10
5 7 11
Solution

Mean: $\bar x = 8.38$

min max Q1 med Q3
$1$ $30$ $4.5$ $7$ $10$

The IQR is $10-4.5 = 5.5$

Outliers on the left would be less than $4.5-1.5(5.5) = -3.75$ and outliers on the right would be any datum greater than $10+1.5(5.5) = 18.25$. There are no outliers on the left, but we would be justified in dropping the points at x=26 and x=30 as outliers to this data.


Set 3

1 21 24 27
7 24 24 27
7 24 25 28
17 24 25 28
20 24 25
20 24 27
Solution

Mean: $\bar x = 21.5$

min max Q1 med Q3
$1$ $28$ $20$ $24$ $25$

The IQR is $25-20 = 5$

Outliers on the left would be less than $20-1.5(5) = 12.5$ and outliers on the right would be any datum greater than $25+1.5(5) = 32.5$. There are no outliers on the right, but we would be justified in dropping the points at x=1 and x=7 as outliers to this data. You can see how the outliers skew this data to the left in the difference between the median and the mean.

Set 4

6 14 15 19
8 14 15 19
10 14 16 19
10 14 16 23
11 15 16 23
11 15 18 25
12 15 18
12 15 18
Solution

Mean: $\bar x = 15.2$

min max Q1 med Q3
$6$ $25$ $12$ $15$ $18$

The IQR is $18-12 = 6$

Outliers on the left would be less than $12-1.5(6) = 3$ and outliers on the right would be any datum greater than $18+1.5(6) = 27$. There are no outliers in this dataset.

xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2016-2025, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.