Why median?


Imagine you're one of, say, 100 people in a room, all of whom earn about the same income, let's say \$100,000 per year. If you all earn the same income, then the average income is \$100,000.

Now in walks Oprah Winfrey, who made (according to Forbes Magazine) about \$77 million in 2013. Now the average income in the room has risen by quite a lot.

The new average is

$$ \begin{aligned} \bar{I} &= \frac{100(100{,}000) + 77{,}000{,}000}{101} \\ &\approx \$ 861{,}386 \end{aligned}$$

Now think about that. Just because a very rich person walked into the room does not mean that your income increased by more than 700%!
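If you'd like to check that arithmetic yourself, here's a minimal sketch in Python. It uses the numbers from the example above; the variable names are just my own choices for illustration.

```python
# Incomes in the room: 100 people at $100,000 each (figures from the example).
incomes = [100_000] * 100
mean_before = sum(incomes) / len(incomes)

# One very large income, about $77 million, joins the room.
incomes.append(77_000_000)
mean_after = sum(incomes) / len(incomes)

# Percent change in the mean caused by that single new data point.
percent_increase = 100 * (mean_after - mean_before) / mean_before

print(f"mean before: ${mean_before:,.0f}")
print(f"mean after:  ${mean_after:,.0f}")
print(f"increase:    {percent_increase:.0f}%")
```

One data point out of 101 moves the mean by several hundred percent, which is the sensitivity to outliers that the example is meant to show.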

So sometimes we need a better way to describe the situation of most people (or the state of most data) than the average.

One of the simplest ways to achieve this goal is just to think about the data point "in the middle." That's what median means—in the middle.

The middle in this case means the middle of a sorted list of all data points. In our example, with 101 incomes sorted from lowest to highest, that is the 51st value, which is one of the \$100,000 incomes.

Now the median income is \$100,000, and that makes a lot more sense. The introduction of a much higher income really shouldn't have a disproportionate effect on how all of the others are interpreted.

Median

The median value of a dataset is the value in the middle of a sorted list. If there is an odd number of data points, it's exactly the value in the middle. If there is an even number, the median is the average of the middle two points.
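The definition above translates almost word for word into code. Here's a minimal sketch in Python; the function name `median` and the small test lists are my own assumptions for illustration. (Python's standard library also provides `statistics.median`, which you'd normally use instead.)

```python
def median(values):
    """Return the median of a non-empty collection of numbers."""
    ordered = sorted(values)          # the median is defined on the sorted list
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle value
    # even count: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# The room from the example: 100 incomes of $100,000 plus one of $77 million.
room = [100_000] * 100 + [77_000_000]
print(median(room))           # 101 values, so the 51st sorted value

print(median([3, 1, 2]))      # odd number of points
print(median([4, 1, 3, 2]))   # even number of points: average of 2 and 3
```

Notice that the \$77 million income sits at the very end of the sorted list, so it has no effect on which value lands in the middle.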


Convergence of mean and median — what does it mean?


In a small sample, or one that doesn't follow the kind of rules that we'd expect for "normal" statistical fluctuations of data (see normal distribution), the mean and median can be quite different, as you saw in the example above.

In general, if the data do follow those rules and the sample size gets large enough, the mean and median will be roughly the same, and they'll tend to get closer as the number of data points gets larger.

So in a sense, you can judge whether a sample is large enough by how close the median and mean are. If they're roughly the same, you've probably got enough data from which to draw some valid conclusions. The only problem is that the mean and median can be equal accidentally, so later we'll develop some other ways to judge the quality of a data set.

If a dataset is large enough and meets the normal expectations, the mean and median will be roughly equal.
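You can watch this happen with a small simulation. The sketch below is only an illustration under my own assumptions: the data are drawn from a normal distribution with a center of 50 and a spread of 10 (arbitrary choices), and `mean_median_gap` is a hypothetical helper name. It uses Python's standard `random` and `statistics` modules.

```python
import random
import statistics


def mean_median_gap(n, rng):
    """Draw n normally distributed values and return |mean - median|."""
    # Center 50 and spread 10 are arbitrary values chosen for illustration.
    sample = [rng.gauss(50, 10) for _ in range(n)]
    return abs(statistics.mean(sample) - statistics.median(sample))


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: |mean - median| = {mean_median_gap(n, rng):.4f}")
```

The gap won't necessarily shrink at every single step, because each sample is random, but it should tend toward zero as n grows. If you swap in a strongly lopsided set of data (like the incomes in the room above), the gap generally won't close just because you collect more of it.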


The mode


The mode of a data set is the number that occurs most frequently.

To find the mode, it's best to put the data in numerical order so that like data points can be counted. Here's an example:

Notice that the mode is not necessarily unique. If there were one more value of 8 in our dataset, for example, we'd have two modes, at 5 and 8. We refer to such a dataset as bimodal (in this case) or multimodal (in the general case where there could be many modes).

The mode is another number that converges to the mean and median if a distribution of data is large enough and predictable enough.

I've referred to "distributions" of data and random fluctuations that are "predictable" and "large enough" a few times in this section. I'll try to tell you just a little about that below.

The mode

The mode of a dataset is the most frequently occurring member of the set. There can be more than one mode. If there is one mode, the data is described as unimodal. If there are two, it's "bimodal", and so on.
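Counting like values is easy to hand off to a computer. Here's a minimal sketch in Python using `collections.Counter` from the standard library. The function name `modes` and the dataset are hypothetical, made up for illustration. The function returns a list because there can be more than one mode.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, in sorted order."""
    counts = Counter(values)          # maps each value to how often it occurs
    if not counts:
        return []                     # an empty dataset has no mode
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical data, invented for this sketch.
data = [2, 3, 5, 5, 5, 7, 8, 8, 9]
print(modes(data))          # 5 occurs three times, more than any other value
print(modes(data + [8]))    # one more 8 ties it with 5, so there are two modes
```

The standard library's `statistics.multimode` does much the same job, though it returns the modes in the order they first appear rather than sorted.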

xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.