Most events that depend on random chance ultimately tend to conform to a normal or Gaussian distribution. For example, if we measured the heights of all 25-year-old males in America, we'd come up with an average, and the other heights would be distributed normally around it, like this:

Think about it: if we measure just two heights and plot them, that's just two points, and our "distribution" of heights won't look anything like that purple curve. But as we measure more and more 25-year-olds, right up to the roughly 2-3 million 25-year-old males in the U.S., the distribution will get smoother and smoother, ultimately looking just like a normal distribution. It will look smooth because with that many data points, we'll see only very small differences between one measured height and the next, say 6'-0" and 6'-1/64". The distribution looks practically **continuous**.

In the example below, we can see the same type of behavior in a **discrete probability** problem: the sum of two dice. If we toss two dice, we obtain one of 11 different sums between 2 and 12. There's only *one* way to get a 2 (1 and 1), but *six* ways to roll a 7 (1-6, 6-1, 2-5, 5-2, 3-4, 4-3). If order matters – say one die is red and one is green – there are 36 different arrangements of the faces of two dice. That means the probability of rolling a 2 is 1/36, while the probability of rolling a 7 is 6/36, or 1/6.

In the graph below, the yellow circles show those exact probabilities. The gray curves show the results of trials of 100, 200 and 1,000 two-dice throws, respectively, and the red curve shows a trial of 10,000 throws. Notice the considerable fluctuations from the expected values in the smaller trials, but that when we do a very large number of throws (10,000), we get very close to the expected values.
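The exact probabilities and the way trial results converge toward them can be checked with a short simulation. This is a sketch of the idea, not the code behind the graph; the throw counts and the fixed seed are my choices:

```python
import random
from fractions import Fraction

# Exact probabilities: count the ordered (red, green) pairs that give each sum.
exact = {s: Fraction(sum(1 for r in range(1, 7) for g in range(1, 7) if r + g == s), 36)
         for s in range(2, 13)}

def simulate(n_throws, seed=0):
    """Empirical frequency of each two-dice sum over n_throws throws."""
    rng = random.Random(seed)
    counts = {s: 0 for s in range(2, 13)}
    for _ in range(n_throws):
        counts[rng.randint(1, 6) + rng.randint(1, 6)] += 1
    return {s: c / n_throws for s, c in counts.items()}

for n in (100, 1000, 10000):
    freqs = simulate(n)
    worst = max(abs(freqs[s] - float(exact[s])) for s in range(2, 13))
    print(f"{n:6d} throws: largest gap from exact probability = {worst:.4f}")
```

The largest gap between the simulated frequencies and the exact probabilities shrinks as the number of throws grows, which is exactly the behavior the gray and red curves show.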

- If we take a large enough random sample from a bigger distribution, the mean of the sample will be very nearly the same as the mean of the distribution.

- The same is true of the standard deviation (**σ**). The standard deviation of a large enough sample will be very nearly equal to the **σ** of the larger distribution.

- A distribution of sample means, provided we take enough of them, will be distributed normally.

Below we'll go through each of these consequences in turn using the following set of data:

Take a binary situation: each member of a population can choose one option ("yes") or the other ("no"). Let's say that 50% choose "yes" and 50% choose "no." I generated 10,000 members of the population using a spreadsheet, where for each member of the population, 1 means "yes" and 0 means "no." The following examples come from considerations of that data set.
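The spreadsheet step can be mimicked in a few lines of code. This is a sketch with an arbitrary seed, not the actual data set used in the examples:

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# 10,000 members, each answering "yes" (1) or "no" (0) with 50/50 odds.
population = [rng.randint(0, 1) for _ in range(10_000)]

# The population mean should land very close to 0.5.
print(f"population mean = {sum(population) / len(population):.4f}")
```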

**1. If we take a large enough random sample from a bigger distribution, the mean of the sample will be the same as the mean of the distribution.**

In this little experiment, we'll draw some small and large samples from our list of 10,000 yes/no decisions. Remember, the average of the data is 50% "yes." The table on the right shows the average number of "yes" answers (1's) in variously-sized random samples drawn from it, from 100 points up to 5,000.

It's not surprising that choosing 5,000 of the 10,000 data points yields an average "yes" response near 50%. The averages are only a little worse for samples of 2,500 or 1,000 points. After all, that's still quite a bit of data. Even the means of 100-point samples are pretty close to the mean of the "parent" data set, particularly if we draw ten 100-point samples and average the results.

While this doesn't prove our assertion, it is a pretty good example of it at work.
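One way to watch the claim at work is a quick simulation. This is a sketch, not the spreadsheet experiment described above; the seed and the regenerated 0/1 population are my stand-ins:

```python
import random

rng = random.Random(1)
population = [rng.randint(0, 1) for _ in range(10_000)]  # stand-in for the spreadsheet data
pop_mean = sum(population) / len(population)

# Draw samples of increasing size (without replacement) and compare means.
for size in (100, 1000, 2500, 5000):
    sample = rng.sample(population, size)
    mean = sum(sample) / size
    print(f"sample of {size:4d}: mean = {mean:.3f}   (population mean = {pop_mean:.3f})")
```

As the sample size grows, the sample mean closes in on the population mean, just as the table illustrates.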

**2. If we take a large enough random sample from a bigger distribution, the standard deviation of the sample will be the same as the standard deviation of the distribution.**

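The same kind of simulation works for the standard deviation. Again, this is a hedged sketch with a stand-in population, not the site's spreadsheet data:

```python
import random
from statistics import pstdev, stdev

rng = random.Random(2)
population = [rng.randint(0, 1) for _ in range(10_000)]  # stand-in yes/no data
pop_sigma = pstdev(population)  # σ of the full 0/1 population, close to 0.5

# Each sample's standard deviation should approach the population σ.
for size in (100, 1000, 2500, 5000):
    sample = rng.sample(population, size)
    print(f"sample of {size:4d}: s = {stdev(sample):.4f}   (population σ = {pop_sigma:.4f})")
```

For a 50/50 yes/no population the σ sits near 0.5, and even modest samples reproduce it closely.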

**3. A set of sample means, if we take enough of them, will be distributed normally.**

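To see the normality of sample means numerically, we can draw many samples, record each sample's mean, and check how closely those means follow the 68/95 rule of a normal distribution. This is a sketch with stand-in data; the number of samples and sample size are my choices:

```python
import random
from statistics import mean, pstdev

rng = random.Random(3)
population = [rng.randint(0, 1) for _ in range(10_000)]  # stand-in yes/no data

# Draw 2,000 samples of 100 points each and record every sample's mean.
means = [mean(rng.sample(population, 100)) for _ in range(2000)]

# For a normal distribution, ~68% of values fall within 1σ of the
# center and ~95% within 2σ.
mu, sigma = mean(means), pstdev(means)
within_1 = sum(abs(m - mu) <= sigma for m in means) / len(means)
within_2 = sum(abs(m - mu) <= 2 * sigma for m in means) / len(means)
print(f"fraction within 1σ: {within_1:.2f}  (normal: ~0.68)")
print(f"fraction within 2σ: {within_2:.2f}  (normal: ~0.95)")
```

Even though each underlying data point is just a 0 or a 1 – about as non-normal as a distribution gets – the sample means cluster into a bell shape, which is the heart of the central limit theorem.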

**xaktly.com** by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.