This page belongs to a group of pages designed to help you learn basic statistics. In rough learning order, here they are.
Discrete random variables  Types of variables we encounter in statistics 
Averages  There are many ways of averaging data. 
Median  The median of one-dimensional data and what it means 
Correlation  How does a change in one variable affect another? 
Linear least-squares  Determine the equation of a line from 2D data. 
Law of large numbers  As we take more samples, averages become clearer, and often closer to what we'd expect. 
The Gaussian distribution  The most important probability distribution in statistics 
The standard deviation  $\sigma$, a measure of the width of a Gaussian distribution 
Basic probability  Basic axioms and rules of probability 
Sampling distributions  Important lead-in to the Central Limit Theorem 
The Central Limit Theorem  Treated correctly, any distribution can take on the characteristics of a Normal distribution. 
The binomial distribution  Distribution to use in binary-choice situations 
Whenever we make measurements, whether that means quantitative measurements (numbers) or qualitative ones (things like categories), we need to be able to organize and present our data. Most of the time, that means organizing into tables or charts of various kinds. The idea is always to communicate something about the data, and to do so powerfully. That is, we'd like a reader to be able to learn something meaningful from our work with as little effort as possible. We also want to strive to present data and our analytical results as honestly as we can.
There are basically two kinds of data that we can organize into tables and charts: categorical data and quantitative or numerical data.
Categorical data is not numerical — it is qualitative. This kind of data places individuals or the results of some measurement into one of several categories. Some examples might be:
Numerical data might include measurements. With numerical or quantitative data it often makes sense to calculate an average. For example, if several people measure the height of a very tall building using geometry and trigonometry, then we'd be inclined to calculate the average of those measurements, perhaps after throwing away any obviously outlying results.
We use the word "measurement" more loosely here than you might be accustomed to. Any experiment is really a measurement, whether we actually use a measuring device or not. An opinion survey, for example, is a measurement in which the results could be organized into categories like "vote Democrat," "vote Republican," or "neither."
Numerical or quantitative data is data for which it makes sense to find an average.
We use tables to organize data in preparation for analysis and to present it so that trends and conclusions are obvious. When organizing data, it's worthwhile to try a few table layouts so that you can judge which best highlights the trends in the data, and which are easier for a reader to decipher.
The table below shows data from a sample of men and women who were asked which of baseball, basketball or football is their favorite sport. This is categorical data organized into what's known as a two-way table. There are two sets of categories: the first is gender (male and female) and the second is the sport.
Gender   Baseball  Basketball  Football  Total
Male           23          28        33     84
Female         37          22        25     84
Total          60          50        58    168
We can do some easy calculations with a table like this. These include so-called marginal distributions, which consider differences among all individuals in the table or dataset, and conditional distributions, which consider differences within one classification of individuals (like males only, or females only).
Marginal distributions involve all individuals represented in the table, 168 in this case. We might then ask, what is the marginal distribution of the two genders shown, male (♂) and female (♀)? Because there are 84 in each group, the distribution is {50%, 50%}.
We could also ask how the whole sample is divided among those who prefer baseball, basketball and football: 60, 50 and 58 individuals, respectively. Those percentages are easily calculated:
$$ \begin{align} \text{baseball} \phantom{000} \frac{60}{168} &= 35.7 \% \\[5pt] \text{basketball} \phantom{000} \frac{50}{168} &= 29.8 \% \\[5pt] \text{football} \phantom{000} \frac{58}{168} &= 34.5 \% \end{align}$$
Think of a conditional distribution calculated from a data set like this one as having conditions. For example, under the condition that we're only considering males, what is the distribution of favorite sports among them? In that case, we have
$$ \begin{align} \text{baseball} \phantom{000} \frac{23}{84} &= 27.4 \% \\[5pt] \text{basketball} \phantom{000} \frac{28}{84} &= 33.3 \% \\[5pt] \text{football} \phantom{000} \frac{33}{84} &= 39.3 \% \end{align}$$
A marginal distribution considers differences in a categorical variable among all individuals in the data set. A conditional distribution considers differences in one category of individuals. That category (such as gender) is the condition.
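The calculations above are easy to script. Here is a minimal Python sketch that reproduces them from the counts in the two-way table; the names `table` and `percent` are just illustrative choices, not part of any library.

```python
# Two-way table from the text: favorite sport by gender (counts).
table = {
    "Male":   {"Baseball": 23, "Basketball": 28, "Football": 33},
    "Female": {"Baseball": 37, "Basketball": 22, "Football": 25},
}

def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of gender: row totals over the grand total.
marginal_gender = {g: percent(sum(row.values()), grand_total)
                   for g, row in table.items()}

# Marginal distribution of sport: column totals over the grand total.
sports = list(table["Male"])
marginal_sport = {s: percent(sum(table[g][s] for g in table), grand_total)
                  for s in sports}

# Conditional distribution of sport, given that the individual is male.
male_total = sum(table["Male"].values())
sport_given_male = {s: percent(n, male_total)
                    for s, n in table["Male"].items()}

print(marginal_gender)
print(marginal_sport)
print(sport_given_male)
```

Notice that the only difference between the marginal and conditional calculations is the denominator: the grand total (168) for a marginal distribution, and the total for the conditioning category (84 males) for a conditional one.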
There are two categories of tea drinkers, male and female. Of the total number of tea drinkers (20), 2/20 = 10% are male and 18/20 = 90% are female.
There are three categories of beverages. The total number of males who stated a preference is 2 + 15 + 39 = 56. Of those males, 2/56 = 3.6% are tea drinkers, 15/56 = 26.8% are coffee drinkers, and 39/56 = 69.6% prefer soft drinks.
The table represents data from 56 males and 44 females for a total of 100 individuals. Thus the distribution of males and females is 56% ♂ and 44% ♀.
A total of 45 people reported preferring soft drinks. 39/45 = 86.7% were males and 13.3% were females.
Gender   Tea  Coffee  Soft drink
Male       2      15          39
Female    18      20           6
Total     20      35          45
This table shows the assigned classes (cabin types) of passengers on the Titanic, which sank in the North Atlantic Ocean in 1912, along with the number in each class who survived and who did not. Use it to calculate:
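There were 500 total survivors. Their distribution across classes is 201/500 = 40.2% first class, 118/500 = 23.6% second class, and 181/500 = 36.2% third class.
817 people died in the sinking. The distribution of deaths across classes is 123/817 = 15.1% first class, 166/817 = 20.3% second class, and 528/817 = 64.6% third class.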
There were 709 total third-class passengers. The fraction who survived was 181/709 = 25.5%. That means that the fraction who were lost was 74.5%.
The marginal distribution of second-class passengers is just the fraction of all passengers with second-class tickets. It's 284/1317 = 21.6%.
Class                    Survived  Did not survive  Total
First-class passengers        201              123    324
Second-class passengers       118              166    284
Third-class passengers        181              528    709
Total passengers              500              817   1317
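If you'd like to check your answers to the Titanic problems, here is a short Python sketch that works them out from the counts in the table. The variable names are my own and purely illustrative.

```python
# Counts from the Titanic table: (survived, did not survive) for each class.
titanic = {
    "First class":  (201, 123),
    "Second class": (118, 166),
    "Third class":  (181, 528),
}

def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

total_survived = sum(s for s, _ in titanic.values())
total_lost = sum(d for _, d in titanic.values())
total_passengers = total_survived + total_lost

# Conditional distributions of class, given survival and given death.
survivors_by_class = {c: percent(s, total_survived) for c, (s, _) in titanic.items()}
deaths_by_class = {c: percent(d, total_lost) for c, (_, d) in titanic.items()}

# Survival rate under the condition "third-class passenger".
third_survived, third_lost = titanic["Third class"]
third_survival_rate = percent(third_survived, third_survived + third_lost)

# Marginal share of second-class passengers among all passengers.
second_share = percent(sum(titanic["Second class"]), total_passengers)

print(survivors_by_class)
print(deaths_by_class)
print(third_survival_rate, second_share)
```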
Very often in science and data presentation, a chart or "graph" can be more powerful than a wordy explanation of trends in the data. At the very least, an appropriate chart can be complementary to the text, helping a reader to catch the meaning of a description or discussion. Entire books have been written about the presentation of data in charts.
In this section we'll look at just a few of the main chart types used to present data. You'll want to be very familiar with these chart types and how to interpret them.
While it is used commonly to refer to all kinds of charts, the word "graph" in mathematics is reserved for diagrams showing linkages between states of some system, like nodes in a neural network or in a binary tree, such as is used in data analysis and searching algorithms.
Pie charts – they look like a sliced pie – are a common way of showing proportion among categories. Here is a pie chart, using data from 2022, that shows the broad categories of the federal budget of the United States. About 62% of all federal funds are spoken for each year, including this one. These are things like salaries, Social Security, Medicare, etc. 8% of the budget has to go to "debt service," that is, interest on borrowed money, and the rest, 30%, can be reallocated by Congress and other agencies with authority to do so.
The pie chart is a nice way to display this data. It's understood that the full circle represents 100% of the budget, and we can see right away that mandatory spending takes up the largest portion of it, followed by discretionary spending and interest payments. Not only do we get a good look at which categories are bigger and smaller, but how much bigger or smaller.
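Because the full circle stands for 100%, each wedge's central angle is just its percentage of 360°. Here is a small Python sketch of that arithmetic using the three budget percentages quoted above; the category labels are shorthand I've chosen for this example.

```python
# Budget shares (in percent) quoted in the text for 2022.
shares = {"Mandatory": 62, "Interest on debt": 8, "Discretionary": 30}

# A pie chart should account for 100% of the whole.
assert sum(shares.values()) == 100

# Each wedge gets the same fraction of 360 degrees as its share of the budget.
angles = {category: 360 * pct / 100 for category, pct in shares.items()}

for category, angle in angles.items():
    print(f"{category:18s} {shares[category]:3d}%  {angle:6.1f} degrees")
```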
There are plenty of options for pie charts. For example, we can pull a pie wedge out for emphasis, as in the chart below.
Fancier pie charts can be rendered in 3D like this one, with emphasis, as above, on the "Discretionary" wedge. A chart like this doesn't impart any more insight into the data; it's just more visually interesting.
A word on color: Sometimes the overuse of color can be distracting. It should be used with care. Sometimes color can be used to emphasize a point about data, like this:
It's important that all categories are represented in a pie chart, or if not, that the context is clearly given in a caption or the text. For example, we might create another pie chart breaking out the discretionary spending part of the budget, but we'd need to be clear that that's what we're presenting. A pie chart should represent 100% of all possibilities for any well-defined category.
For comparison's sake, of course, all categories displayed in a pie chart must have the same units. In these cases, the unit is percent of the whole budget.
Bar charts are ubiquitous in nearly all fields that involve data. The reader of a bar chart compares the relative heights of the bars against a scale to make judgements about the data. They can be used just like pie charts to show proportion, but more often they're used for comparison and sometimes to show growth. Here's an example using data on the sales of personal computers in 2021.
It is a little easier to see the ordering of total sales (most to least, right-to-left) in the bar chart, but to get a sense of how much of the "pie" each company owns, nothing really beats a pie chart.
The stacked bar chart is a hybrid of pie and bar charts. It shows each category's share of the whole, and the trend in those shares over (in this case) time. This chart shows the proportion of personal computer sales by major company over five years.
Sometimes little lines are drawn linking the stacked bars. They can help to illustrate the trends in the changes of proportion.
A histogram is a special kind of bar chart. It represents all of the data from a given observation or experiment, breaking the independent variable into "bins," usually of equal size. The heights of the bars are proportional to the frequency of observations in each bin. Here's an example.
Here are 41 measurements — of what doesn't really matter for this example:
8.25  7.83  8.42  8.50 
8.67  8.17  9.00  9.00 
8.17  7.92  9.00  8.50 
9.00  7.75  7.92  8.00 
8.08  8.42  8.75  8.08 
9.75  8.33  7.83  7.92 
8.58  7.83  8.42  7.75 
7.42  6.75  7.42  8.50 
8.67  10.17  8.75  8.58 
8.67  9.17  9.08  8.83 
8.67 
Here is a sample histogram from that data. It's divided into eight bins of width 0.5 units each. This is a pretty informative histogram. It shows us that the data is distributed approximately in a bell-shaped or normal curve.
We can adjust the number of bins to suit our needs. Here is the same data organized into 12 bins. Be cautious about bin-number choice, however: Too many bins could cause the distribution of your data to be less obvious. On the other hand, it is possible to manipulate – sometimes dishonestly – the impression given by the data in order to convince a reader of something that might not be true.
A histogram is just a special type of bar chart. Stylistically, we put spaces between the bars of bar charts, whereas in histograms the bars are squeezed together so that they touch.
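To see what binning actually does, here is a minimal Python sketch that counts the 41 measurements above into eight bins of width 0.5 and prints a rough text histogram. The left edge of 6.5 is my assumption (the text gives only the bin width and the number of bins), so the bars may not line up exactly with the histogram pictured.

```python
# The 41 measurements listed above.
data = [
    8.25, 7.83, 8.42, 8.50, 8.67, 8.17, 9.00, 9.00,
    8.17, 7.92, 9.00, 8.50, 9.00, 7.75, 7.92, 8.00,
    8.08, 8.42, 8.75, 8.08, 9.75, 8.33, 7.83, 7.92,
    8.58, 7.83, 8.42, 7.75, 7.42, 6.75, 7.42, 8.50,
    8.67, 10.17, 8.75, 8.58, 8.67, 9.17, 9.08, 8.83,
    8.67,
]

# Bin width and count come from the text; the starting edge is assumed.
start, width, n_bins = 6.5, 0.5, 8

counts = [0] * n_bins
for x in data:
    # Index of the bin containing x; a value exactly on an edge
    # is counted in the bin to its right.
    index = int((x - start) / width)
    counts[index] += 1

# One row per bin, with one '#' per observation.
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low:5.2f} to {low + width:5.2f} | {'#' * count}")
```

Try changing `width` and `n_bins` (keeping the full range of the data covered) to see how the apparent shape of the distribution changes with the choice of bins.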
Using [STAT], access the lists. Clear and enter your data in one of the six lists. You can use one of the sorting functions to sort it if you'd like, but you don't have to.
Make sure that one of the STAT PLOTs is on using [2nd][STAT PLOT]. Select the histogram option for the stat plot.
You can make an initial histogram using ZOOM 9 (the 9 selects the option for scaling statistics-related charts). That will give you a good start, but you'll likely want to refine your histogram.
To refine the scaling, choose WINDOW and fill in the x and y min and max, and the bin width, which is the x-scale (Xscl) value. In this histogram I've forced the bin width to be 0.5 units. Don't press ZOOM 9 again or you'll lose this scaling.
A quick and easy way to visualize distributions
A dot plot (sometimes "dotplot") is related to a histogram, but simpler to construct; one can just be jotted down with paper and pencil if needed. To construct a dot plot, we simply make a number line that includes the full range of the data, then begin stacking dots above it as data points accumulate.
The dot plot at right shows the results of 44 independent measurements of the length of a room. At a glance we can tell that the distribution of lengths is approximately bell-shaped or "normal," with a mean of about 14 units.
We can also tell that there are some measurements that seem like outliers that we might want to do some follow-up work on. For little effort, a dot plot allows us to get a good idea of the kind and quality of the data it represents.
Dot plots can be generated using a number of software packages, and some, such as Mathematica or the TI-84 calculator, can be tricked into making a dot plot by sorting the data, then assigning increasing y-values to each set of identical x-values.
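The "trick" described above, sorting the data and then giving each repeated x-value the next higher y-value, can be sketched in a few lines of Python. The measurements here are made up for illustration (they are not the 44 room lengths in the figure), and the sketch assumes integer-valued data so that the number line is easy to draw.

```python
from collections import Counter

def dot_plot_points(data):
    """Return (x, y) pairs in which the k-th occurrence of a value x gets y = k."""
    seen = Counter()
    points = []
    for x in sorted(data):
        seen[x] += 1
        points.append((x, seen[x]))
    return points

def render(points):
    """Draw the points as rows of text above a number line of integers."""
    xs = range(min(x for x, _ in points), max(x for x, _ in points) + 1)
    height = max(y for _, y in points)
    filled = set(points)
    rows = []
    for level in range(height, 0, -1):  # tallest level first
        rows.append(" ".join(" * " if (x, level) in filled else "   " for x in xs))
    rows.append(" ".join(f"{x:^3}" for x in xs))  # the number line itself
    return "\n".join(rows)

# Hypothetical measurements, for illustration only.
lengths = [13, 14, 14, 15, 14, 13, 14, 15, 16, 14, 12, 15]
points = dot_plot_points(lengths)
print(render(points))
```

The list returned by `dot_plot_points` is exactly what you'd feed to a scatter plot in any charting package to get a dot plot.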
Stem plots, or stem & leaf plots
Stem plots are another quick way to visualize the shape of a distribution of data and to quickly calculate a few simple statistics. Let's show how it works by example. Here's a small data set of integers:
{18, 18, 19, 23, 24, 24, 24, 25, 25, 26, 26, 26, 26, 27, 28, 29, 29, 30, 31, 31, 33, 33, 35, 39, 40}
This set has been sorted for convenience, but it's not necessary. In this case, the first (tens) digit will form our "stems." These are simply 1, 2, 3 and 4, and we'll write them vertically like this:
1 |
2 |
3 |
4 |
Now we'll use the second (ones) digits to form the "leaves." The leaves that go with the stem of 1, for example, are 8, 8, 9. Here's how that part looks:
1 | 8 8 9                            Key: 1|9 = 19
2 | 3 4 4 4 5 5 6 6 6 6 7 8 9 9
3 | 0 1 1 3 3 5 9
4 | 0
That's it. This simple way of plotting our data gives us a quick glance at the shape of its distribution.
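Here is a minimal Python sketch of the same procedure, applied to the data set above. It assumes non-negative integers, with the ones digit as the leaf and everything to its left as the stem; the function name `stem_plot` is just an illustrative choice.

```python
from collections import defaultdict

def stem_plot(values):
    """Return stem-and-leaf rows: ones digit as the leaf, the rest as the stem."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Include every stem in the range, even one with no leaves,
    # so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(row.rstrip())
    return rows

data = [18, 18, 19, 23, 24, 24, 24, 25, 25, 26, 26, 26, 26, 27, 28,
        29, 29, 30, 31, 31, 33, 33, 35, 39, 40]

rows = stem_plot(data)
print("\n".join(rows))
```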
There are a few modifications we can make. Here's an example of a stem plot with split stems, in which each stem appears twice: once for its lower leaves (here, up to 5) and once for its higher leaves (6 through 9). It's a more informative representation of the distribution of the data in some cases.
1 | 3 3 4 5                          Key: 2|1 = 21
1 | 6 7 7 7 7 8 8 9
2 | 1 1 2 2 2 3 4 4 4 5 5
2 | 6 6 6 6 7 8 9 9
3 | 0 1 1 3 3 5 5 5 5
3 | 6 7 7 7 7 8 8 8 9 9 9 9 9
4 | 1 1 2 3 4 5 5
4 | 6 7 7 8 9
5 | 1 2 4
For comparing two distributions of similar data, a back-to-back stem plot can be handy. One set of leaves, representing one data set, is on the left of the stems, and the other is on the right.
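          9 7 6 6 4 2 | 2 | 1 2 5 8
        9 8 8 2 2 1 1 | 3 | 1 1 2 5 7 8 8
  9 9 8 8 7 7 3 2 1 1 | 4 | 1 2 4 8 8 9 9
9 9 8 6 6 6 4 4 3 2 1 | 5 | 1 1 1 5 5 7 8 9 9
    9 7 7 6 6 4 4 4 3 | 6 | 2 3 3 3 8
      8 5 5 5 4 3 2 2 | 7 | 1 4 5 5
            6 5 3 3 1 | 8 | 1 7 9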
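A back-to-back stem plot can be built the same way as an ordinary one, with the left-hand leaves written in reverse so that they grow outward from the shared stems. Here is a minimal Python sketch; the two small data sets are made up for illustration, and the function assumes non-negative integers and two non-empty data sets.

```python
from collections import defaultdict

def back_to_back(left_values, right_values):
    """Return rows of a back-to-back stem plot sharing one column of stems."""
    left, right = defaultdict(list), defaultdict(list)
    for v in sorted(left_values):
        left[v // 10].append(v % 10)
    for v in sorted(right_values):
        right[v // 10].append(v % 10)

    first = min(min(left), min(right))
    last = max(max(left), max(right))
    stems = range(first, last + 1)

    # Left-hand leaves are reversed so the smallest leaf sits next to the stem.
    left_text = {s: " ".join(str(leaf) for leaf in reversed(left[s])) for s in stems}
    width = max(len(text) for text in left_text.values())

    rows = []
    for s in stems:
        right_text = " ".join(str(leaf) for leaf in right[s])
        rows.append(f"{left_text[s]:>{width}} | {s} | {right_text}".rstrip())
    return rows

# Hypothetical data sets, for illustration only.
group_a = [21, 24, 35, 36, 38, 42]
group_b = [23, 31, 33, 47, 49]

rows = back_to_back(group_a, group_b)
print("\n".join(rows))
```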
xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012-2019, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.