Imagine you're in a room with, say, 100 people, all of whom earn about the same income as you, let's say \$100,000 per year. If you all earn the same income, then the average income is \$100,000.

Now in walks Oprah Winfrey, who made (according to Forbes Magazine) about \$77 million in 2013. Now the average income in the room has risen by quite a lot.

The new average is

$$ \begin{align} \bar{I} &= \frac{100(100,000) + 77,000,000}{101} \\ &= \$ 861,386 \end{align}$$

Now think about that. Just because a very rich person walked into the room does *not* mean that your income increased by more than 700%!
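If you'd like to check that arithmetic yourself, here is a minimal Python sketch of the same calculation. The variable names are just illustrative choices; the numbers are the ones from the story above.

```python
# 100 people who each earn $100,000 per year
incomes = [100_000] * 100
old_mean = sum(incomes) / len(incomes)

# Now the $77 million earner walks in
incomes_with_guest = incomes + [77_000_000]
new_mean = sum(incomes_with_guest) / len(incomes_with_guest)

# Percent change in the "average income" of the room
percent_increase = 100 * (new_mean - old_mean) / old_mean

print(f"old mean: ${old_mean:,.0f}")
print(f"new mean: ${new_mean:,.0f}")
print(f"increase: {percent_increase:.0f}%")
```

Nobody's paycheck changed, but the mean moved by a factor of more than eight because of a single data point.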

So sometimes we need a better way to describe the situation of most people (or the state of most data) than the average.

One of the simplest ways to achieve this goal is just to think about the data point "in the middle." That's what median means—in the middle.

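The middle in this case means the middle of a sorted list of all data points. In the case of our example, that would be the 51st of the 101 sorted incomes: 100 values of \$100,000 followed by a single value of \$77 million.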

Now the median income is \$100,000, and that makes a lot more sense. The introduction of a much higher income really shouldn't have a disproportionate effect on how all of the others are interpreted.

The **median** value of a dataset is the value in the **middle** of a sorted list. If there is an __odd__ number of data points, it's exactly the value in the middle. If there is an __even__ number, the median is the average of the middle two points.
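The odd/even rule can be sketched in a few lines of Python. This is only an illustration of the definition (the function name `median` and the sample lists are my own choices); Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Return the middle value of a dataset, per the odd/even rule."""
    ordered = sorted(values)          # the median is defined on sorted data
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: exactly the middle value
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


# The room from the example: 100 incomes of $100,000 plus one of $77 million
room = [100_000] * 100 + [77_000_000]
room_median = median(room)

# A small even-sized list, to exercise the other branch
even_median = median([7, 1, 5, 3])

print(room_median)
print(even_median)
```

Notice that the \$77 million value only adds one entry to the end of the sorted list, so it barely shifts where the middle falls.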

In a small sample, or one that doesn't follow the kind of rules that we'd expect for "normal" statistical fluctuations of data (see normal distribution), the **mean** and **median** can be quite different, as you saw in the example above.

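In general, if the sample size gets large enough and the data are spread symmetrically about their center (as they are in a normal distribution), the mean and median will be roughly the same, and they'll tend to get closer as the number of data points gets larger.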
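Here is a small simulation sketch of that idea in Python, assuming the data really are normally distributed. The center (100), spread (15), sample sizes and random seed are arbitrary values I picked for illustration, not anything measured.

```python
import random
from statistics import mean, median

rng = random.Random(42)   # fixed seed so repeated runs give the same samples


def mean_median_gap(n, mu=100.0, sigma=15.0):
    """Draw n normally distributed values; return |mean - median|."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    return abs(mean(sample) - median(sample))


for n in (10, 1_000, 100_000):
    print(f"n = {n:>7}: |mean - median| = {mean_median_gap(n):.3f}")
```

Any single run can buck the trend, because each sample is random, but the gap tends to shrink as the sample grows. For lopsided data, like the incomes in the room above, the gap need not close no matter how many points you collect.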

So in a sense, you can judge whether a sample is large enough by how close its median and mean are. If they're roughly the same, you've probably got enough data from which to draw some valid conclusions. The only problem is that mean and median can be equal *accidentally*, so later we'll develop some other ways to judge the quality of a data set.

In a large enough, well-behaved sample, the **mean** and **median** will be roughly equal.

The **mode** of a data set is the number that occurs **most frequently**.

To find the mode, it's best to put the data in numerical order so that like data points can be counted. Here's an example:

Notice that the mode is not necessarily unique. If there were one more value of 8 in our dataset, for example, we'd have two modes, at 5 and 8. We refer to such a dataset as **bimodal** (in this case) or **multimodal** (in the general case where there could be many modes).
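Counting like data points is easy to sketch in Python. The dataset below is a made-up one, chosen only so that 5 is the single mode and one extra 8 produces a tie; the helper name `modes` is also my own.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most frequently, in ascending order."""
    counts = Counter(values)              # tally how often each value occurs
    if not counts:
        raise ValueError("an empty dataset has no mode")
    highest = max(counts.values())        # the largest tally
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical data: 5 occurs three times, 8 occurs twice
data = [2, 3, 5, 5, 5, 7, 8, 8, 9]

one_mode = modes(data)           # only 5 has the highest count
two_modes = modes(data + [8])    # one more 8 ties it with 5

print(one_mode)
print(two_modes)
```

If you're on Python 3.8 or later, `statistics.multimode` from the standard library does a similar job, though it returns the modes in the order they first appear rather than sorted.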

The mode is another number that converges to the mean and median if a distribution of data is large enough and predictable enough.

I've referred to "distributions" of data and random fluctuations that are "predictable" and "large enough" a few times in this section. I'll try to tell you just a little about that below.

The **mode** of a dataset is the **most frequently occurring** member of the set. There can be more than one mode. If there is one mode, the data is described as **unimodal** (or mono-modal). If there are two, it's **bimodal**, and so on.

**xaktly.com** by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.