#### xaktly | Probability & Statistics

Tables & charts


### Data and tables

Whenever we make measurements, whether that means quantitative measurements (numbers) or qualitative ones (things like categories), we need to be able to organize and present our data. Most of the time, that means organizing into tables or charts of various kinds. The idea is always to communicate something about the data, and to do so powerfully. That is, we'd like a reader to be able to learn something meaningful from our work with as little effort as possible. We also want to strive to present data and our analytical results as honestly as we can.

#### Kinds of data

There are basically two kinds of data that we can organize into tables and charts: categorical data and quantitative or numerical data.

#### Categorical data

Categorical data is not numerical — it is qualitative. This kind of data places individuals or the results of some measurement into one of several categories. Some examples might be:

• hair or eye color
• gender or ethnicity
• marital status
• level of education
• country of birth

#### Numerical data

Numerical data might include measurements. With numerical or quantitative data it often makes sense to calculate an average. For example, if several people measure the height of a very tall building using geometry and trigonometry, then we'd be inclined to calculate the average of those measurements, perhaps after throwing away any obviously outlying results.

#### "Measurements"

We use the word "measurement" more loosely here than you might be accustomed to. Any experiment is really a measurement, whether we actually use a measuring device or not. An opinion survey, for example, is a measurement in which the results could be organized into categories like "vote Democrat," "vote Republican," or "neither."

Numerical or quantitative data is data for which it makes sense to find an average.

### Tables

We use tables to organize data in preparation for analysis and to present it so that trends and conclusions are obvious. When organizing data, it's worthwhile to try a few table layouts so that you can judge which best highlights the trends in the data, and which are easier for a reader to decipher.

The table below shows data taken from a sample of men and women who were asked which of baseball, basketball or football is their favorite sport. This is categorical data organized into what's known as a two-way table. There are two sets of categories: the first is gender (male and female), and the second is the sports.

| | Baseball | Basketball | Football | Total |
|---|---|---|---|---|
| Male | 23 | 28 | 33 | 84 |
| Female | 37 | 22 | 25 | 84 |
| Total | 60 | 50 | 58 | 168 |

We can do some easy calculations with a table like this. These include so-called marginal distributions, which consider differences among all individuals in the table or dataset, and conditional distributions, which consider differences among one classification of individuals (like males and females).

#### Marginal distributions

Marginal distributions involve all individuals represented in the table, 168 in this case. We might then ask, what is the marginal distribution of the two genders shown, male (♂) and female (♀)? Because there are 84 in each group, the distribution is {50%, 50%}.

We could also ask how preferences for baseball, basketball and football are distributed across the whole sample: 60, 50 and 58 individuals, respectively. Those percentages are easily calculated:

\begin{align} \text{baseball} \phantom{000} \frac{60}{168} &= 35.7 \% \\[5pt] \text{basketball} \phantom{000} \frac{50}{168} &= 29.8 \% \\[5pt] \text{football} \phantom{000} \frac{58}{168} &= 34.5 \% \end{align}
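The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function names `marginal_by_column` and `marginal_by_row` are assumptions made for this example, not part of any statistics library.

```python
# Two-way table from the text: rows are gender, columns are favorite sport.
table = {
    "Male":   {"Baseball": 23, "Basketball": 28, "Football": 33},
    "Female": {"Baseball": 37, "Basketball": 22, "Football": 25},
}

def grand_total(table):
    """Number of individuals represented in the whole table."""
    return sum(sum(row.values()) for row in table.values())

def marginal_by_column(table):
    """Each column's share (in percent) of all individuals."""
    totals = {}
    for row in table.values():
        for column, count in row.items():
            totals[column] = totals.get(column, 0) + count
    n = grand_total(table)
    return {column: 100 * count / n for column, count in totals.items()}

def marginal_by_row(table):
    """Each row's share (in percent) of all individuals."""
    n = grand_total(table)
    return {name: 100 * sum(row.values()) / n for name, row in table.items()}

sports = marginal_by_column(table)
genders = marginal_by_row(table)
for sport, percent in sports.items():
    print(f"{sport:<10} {percent:.1f}%")
```

Both functions divide by the grand total (168 here), which is what makes these distributions marginal rather than conditional.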

#### Conditional distributions

Think of a conditional distribution calculated from a data set like this one as having conditions. For example, under the condition that we're only considering males, what is the distribution of favorite sports among them? In that case, we have

\begin{align} \text{baseball} \phantom{000} \frac{23}{84} &= 27.4 \% \\[5pt] \text{basketball} \phantom{000} \frac{28}{84} &= 33.3 \% \\[5pt] \text{football} \phantom{000} \frac{33}{84} &= 39.3 \% \end{align}

A marginal distribution considers differences in a categorical variable among all individuals in the data set. A conditional distribution considers differences in one category of individuals. That category (such as gender) is the condition.
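As a rough sketch of the difference, the Python below computes conditional distributions from the sports table. The only change from a marginal calculation is the denominator: we divide by the total for the conditioning category instead of the grand total. The function names are hypothetical, chosen for this example.

```python
# Two-way table from the text: rows are gender, columns are favorite sport.
table = {
    "Male":   {"Baseball": 23, "Basketball": 28, "Football": 33},
    "Female": {"Baseball": 37, "Basketball": 22, "Football": 25},
}

def conditional_on_row(table, row_name):
    """Distribution across columns (percent), looking at one row only."""
    row = table[row_name]
    row_total = sum(row.values())
    return {column: 100 * count / row_total for column, count in row.items()}

def conditional_on_column(table, column_name):
    """Distribution across rows (percent), looking at one column only."""
    column = {name: row[column_name] for name, row in table.items()}
    column_total = sum(column.values())
    return {name: 100 * count / column_total for name, count in column.items()}

males = conditional_on_row(table, "Male")
baseball_fans = conditional_on_column(table, "Baseball")
for sport, percent in males.items():
    print(f"{sport:<10} {percent:.1f}%")
```

Conditioning on a column works the same way: `baseball_fans` gives the split between males and females among the 60 people who chose baseball.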

### Practice 1 – Tables

1. Use the table of beverage preferences by gender (below) to calculate
    1. the conditional distribution of tea drinkers.
    2. the conditional distribution of beverages preferred by males.
    3. the marginal distribution of males and females.
    4. the marginal distribution of those who prefer soft drinks.

Solutions

1. There are two categories of tea drinkers, male and female. Of the total number of tea drinkers (20), 2/20 = 10% are male and 18/20 = 90% are female.

2. There are three categories of beverages. The total number of males is 2 + 15 + 39 = 56. Of those males, 2/56 = 3.6% are tea drinkers, 15/56 = 26.8% are coffee drinkers, and 39/56 = 69.6% prefer soft drinks.

3. The table represents data from 56 males and 44 females for a total of 100 individuals. Thus the distribution of males and females is 56% ♂ and 44% ♀.

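4. A total of 45 of the 100 people reported preferring soft drinks, so their marginal share is 45/100 = 45%. Among those 45, 39/45 = 86.7% were males and 6/45 = 13.3% were females; that breakdown is the conditional distribution of soft-drink drinkers.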

##### Beverage preferences by gender

| | Tea | Coffee | Soft drink |
|---|---|---|---|
| Male | 2 | 15 | 39 |
| Female | 18 | 20 | 6 |
| Total | 20 | 35 | 45 |

2. This table shows the ticket classes (cabin types) of passengers on the Titanic, which sank in the North Atlantic Ocean in 1912, and how many passengers in each class survived. Use it to calculate:
    1. the conditional distribution of survivors.
    2. the conditional distribution of non-survivors.
    3. the conditional distribution of third-class passengers.
    4. the marginal distribution of second-class passengers.

Solutions

1. There were 500 total survivors. Their distribution across classes is

• First class: 201/500 = 40.2%
• Second class: 118/500 = 23.6%
• Third class: 181/500 = 36.2%

2. 817 people died in the sinking. The distribution of deaths across classes is

• First class: 123/817 = 15.1%
• Second class: 166/817 = 20.3%
• Third class: 528/817 = 64.6%

3. There were 709 total third-class passengers. The fraction who survived was 181/709 = 25.5%. That means that the fraction who were lost was 528/709 = 74.5%.

4. The marginal distribution of second-class passengers is just the fraction of all passengers with second-class tickets. It's 284/1317 = 21.6%.

##### Survivorship on the Titanic

| | Survived | Did not survive | Total |
|---|---|---|---|
| First-class passengers | 201 | 123 | 324 |
| Second-class passengers | 118 | 166 | 284 |
| Third-class passengers | 181 | 528 | 709 |
| Total passengers | 500 | 817 | 1317 |

### A picture is worth 1,000 words

Very often in science and data presentation, a chart or "graph" can be more powerful than a wordy explanation of trends in the data. At the very least, an appropriate chart can be complementary to the text, helping a reader to catch the meaning of a description or discussion. Entire books have been written about the presentation of data in charts.

In this section we'll look at just a few of the main chart types used to present data. You'll want to be very familiar with these chart types and how to interpret them.

#### "graph" vs. "chart"

While it is commonly used to refer to all kinds of charts, the word "graph" also has a narrower meaning in mathematics: a diagram showing linkages between states of some system, like the nodes in a neural network or in a binary tree of the kind used in data analysis and searching algorithms.

### Pie charts

Pie charts – they look like a sliced pie – are a common way of showing proportion among categories. Here is a pie chart, using data from 2022, that shows the broad categories of the federal budget of the United States. About 62% of all federal funds are spoken for each year, including this one. These are things like salaries, Social Security, Medicare, and so on. Another 8% of the budget has to go to "debt service," that is, interest on borrowed money, and the rest, 30%, can be reallocated by Congress and other agencies with authority to do so.

The pie chart is a nice way to display this data. It's understood that the full circle represents 100% of the budget, and we can see right away that mandatory spending takes up the largest portion of it, followed by discretionary spending and interest payments. Not only do we get a good look at which categories are bigger and smaller, but how much bigger or smaller.
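To see how the percentages turn into wedges, here is a minimal sketch that converts the three budget shares quoted above into angles, using the fact that the full circle (360°) stands for 100%. The category labels and the function name `wedge_angles` are assumptions for this illustration; charting software does this step for you.

```python
# Budget shares quoted in the text (percent of the 2022 U.S. federal budget).
budget_percent = {"Mandatory": 62, "Discretionary": 30, "Interest on debt": 8}

def wedge_angles(percentages):
    """Convert percentages that sum to 100 into wedge angles in degrees."""
    total = sum(percentages.values())
    if total != 100:
        # A pie chart only makes sense if the wedges cover the whole.
        raise ValueError(f"categories cover {total}% of the whole, not 100%")
    return {name: 360 * percent / 100 for name, percent in percentages.items()}

angles = wedge_angles(budget_percent)
for name, angle in angles.items():
    print(f"{name:<17} {angle:6.1f} degrees")
```

The check on the total is a small guard against leaving out a category, which would make the wedges misleading.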

There are plenty of options for pie charts. For example, we can pull a pie wedge out for emphasis, as in the chart below.

Fancier pie charts can be rendered in 3-D like this one, with emphasis, as above, on the "Discretionary" wedge. A chart like this doesn't impart any more insight into the data; it's just more visually interesting.

A word on color: Sometimes the over-use of color can be distracting. It should be used with care. Sometimes color can be used to emphasize a point about data, like this:

It's important that all categories are represented in a pie chart, or if not, that the context is clearly given in a caption or the text. For example, we might create another pie chart breaking out the discretionary spending part of the budget, but we'd need to be clear that that's what we're presenting. A pie chart should represent 100% of all possibilities for any well-defined category.

For comparison's sake, of course, all categories displayed in a pie chart must have the same units. In these cases, the unit is percent of the whole budget.

### Bar charts

Bar charts are ubiquitous in nearly all fields that involve data. The reader of a bar chart compares the relative heights of the bars against a scale to make judgements about the data. They can be used just like pie charts to show proportion, but more often they're used for comparison and sometimes to show growth. Here's an example using data on the sales of personal computers in 2021.

It is a little easier to see the ordering of total sales (most to least, right-to-left) in the bar chart, but to get a sense of how much of the "pie" each company owns, nothing really beats a pie chart.

The stacked bar chart is a hybrid of pie and bar charts. It shows each category's share of the whole, and the trend in those shares over (in this case) time. This chart shows the proportion of personal computer sales by major company over five years.

##### Proportion of personal computer sales, 2017-2021

Sometimes little lines are drawn linking the stacked bars. They can help to illustrate the trends in the changes of proportion.

### Histograms

A histogram is a special kind of bar chart. It represents all of the data from a given observation or experiment, breaking the independent variable into "bins," usually of equal size. The heights of the bars are proportional to the frequency of observations in each bin. Here's an example.

Here are 41 measurements — of what doesn't really matter for this example:

 8.25 7.83 8.42 8.50 8.67 8.17 9.00 9.00 8.17 7.92 9.00 8.50 9.00 7.75 7.92 8.00 8.08 8.42 8.75 8.08 9.75 8.33 7.83 7.92 8.58 7.83 8.42 7.75 7.42 6.75 7.42 8.50 8.67 10.17 8.75 8.58 8.67 9.17 9.08 8.83 8.67

Here is a sample histogram from that data. It's divided into eight bins of width 0.5 units each. This is a pretty informative histogram. It shows us that the data is distributed approximately in a bell-shaped or normal curve.

We can adjust the number of bins to suit our needs. Here is the same data organized into 12 bins. Be cautious about bin-number choice, however: Too many bins could cause the distribution of your data to be less obvious. On the other hand, it is possible to manipulate – sometimes dishonestly – the impression given by the data in order to convince a reader of something that might not be true.
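The binning step itself is easy to sketch in Python. The example below counts the 41 measurements into eight bins of width 0.5. The starting edge of 6.5 is an assumption made for this sketch (the bin edges used in the charts are not stated), and `bin_counts` is a hypothetical helper, not a library function. Each bin includes its left edge and excludes its right edge.

```python
# The 41 measurements listed in the text.
measurements = [
    8.25, 7.83, 8.42, 8.50, 8.67, 8.17, 9.00, 9.00, 8.17, 7.92,
    9.00, 8.50, 9.00, 7.75, 7.92, 8.00, 8.08, 8.42, 8.75, 8.08,
    9.75, 8.33, 7.83, 7.92, 8.58, 7.83, 8.42, 7.75, 7.42, 6.75,
    7.42, 8.50, 8.67, 10.17, 8.75, 8.58, 8.67, 9.17, 9.08, 8.83,
    8.67,
]

def bin_counts(data, low, width, n_bins):
    """Count how many values fall in each bin [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - low) // width)
        if not 0 <= index < n_bins:
            raise ValueError(f"{x} falls outside the binned range")
        counts[index] += 1
    return counts

low, width, n_bins = 6.5, 0.5, 8   # assumed bin layout: 6.5 up to 10.5
counts = bin_counts(measurements, low, width, n_bins)

# Print a sideways text histogram, one row per bin.
for i, n in enumerate(counts):
    left = low + i * width
    print(f"{left:5.2f} to {left + width:5.2f} | {'#' * n}")
```

Changing `width` and `n_bins` (for example to 12 narrower bins) is a quick way to see how the apparent shape of the distribution depends on the binning.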

#### Bar charts vs. histograms

A histogram is just a special type of bar chart. Stylistically, we put spaces between the bars of bar charts, whereas in histograms the bars are squeezed together so that they touch.

### Making a histogram on the TI-84

Using [STAT], access the lists. Clear one of the six lists and enter your data in it. You can use one of the sorting functions to sort it if you'd like, but you don't have to.

Make sure that one of the STAT PLOTs is on using [2nd][STAT PLOT]. Select the histogram option for the stat plot.

You can make an initial histogram using ZOOM-9 (The 9 selects the option for scaling statistics-related charts). That will give you a good start, but you'll likely want to refine your histogram.

To refine the scaling, choose WINDOW and fill in the x- and y- min and max, and the bin width, which is the x-scale (Xscl) value. In this histogram I've forced the bin width to be 0.5 units. Don't press ZOOM-9 again or you'll lose this scaling.

### Dot plots

A quick and easy way to visualize distributions

A dot plot (sometimes "dotplot") is related to a histogram, but simpler to construct — they can just be jotted down with paper and pencil if needed. To construct a dot plot, we simply make a number line that includes the full range of the data, then begin stacking dots above it as data points accumulate.

The dot plot at right shows the results of 44 independent measurements of the length of a room. At a glance we can tell that the distribution of lengths is approximately bell-shaped or "normal," with a mean of about 14 units.

We can also tell that there are some measurements that seem like outliers that we might want to do some follow-up work on. For little effort, a dot plot allows us to get a good idea of the kind and quality of the data it represents.

Dot plots can be generated using a number of software packages, and some tools, such as Mathematica or the TI-84 calculator, can be tricked into making a dot plot by sorting the data, then assigning increasing y-values to each set of repeated x-values.
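The sort-and-stack trick can be sketched in Python as follows. To avoid inventing data, the example borrows the small integer data set from the stem plot section of this page. The function names are hypothetical, and the text rendering assumes integer data with at most two digits.

```python
# Integer data set taken from the stem plot section of this page.
data = [18, 18, 19, 23, 24, 24, 24, 25, 25, 26, 26, 26, 26,
        27, 28, 29, 29, 30, 31, 31, 33, 33, 35, 39, 40]

def dot_plot_points(values):
    """Sort the values, then stack repeats: the k-th copy of x gets y = k."""
    points = []
    previous, height = None, 0
    for x in sorted(values):
        height = height + 1 if x == previous else 1
        points.append((x, height))
        previous = x
    return points

def render_dot_plot(points):
    """Draw the points as rows of text above an integer number line."""
    filled = set(points)
    xs = range(min(x for x, _ in points), max(x for x, _ in points) + 1)
    tallest = max(y for _, y in points)
    rows = []
    for y in range(tallest, 0, -1):  # top row first
        rows.append("".join(" o " if (x, y) in filled else "   " for x in xs))
    rows.append("".join(f"{x:^3}" for x in xs))  # the number line
    return "\n".join(rows)

points = dot_plot_points(data)
print(render_dot_plot(points))
```

The list returned by `dot_plot_points` is exactly what you would hand to a scatter plot in other software: one (x, y) pair per observation.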

### Stem plots

or Stem & leaf plots

Stem plots are another quick way to visualize the shape of a distribution of data and to quickly calculate a few simple statistics. Let's show how it works by example. Here's a small data set of integers:

{18, 18, 19, 23, 24, 24, 24, 25, 25, 26, 26, 26, 26, 27, 28, 29, 29, 30, 31, 31, 33, 33, 35, 39, 40}

This set has been sorted for convenience, but it's not necessary. In this case, the first (tens) digit will form our "stems." These are simply 1, 2, 3 and 4, and we'll write them vertically like this

    1|
    2|
    3|
    4|


Now we'll use the second (ones) digits to form the "leaves." The leaves that go with the stem of 1, for example, are 8, 8, 9. Here's how that part looks:

    1|889             Key: 1|9 = 19
    2|34445566667899
    3|0113359
    4|0


That's it. This simple way of plotting our data gives us a quick glance at the shape of its distribution.
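For larger data sets the same construction is easy to automate. The sketch below assumes non-negative integers, with the tens (and higher) digits as the stem and the ones digit as the leaf. The function name `stem_plot` is hypothetical. Stems with no leaves are still printed, so that gaps in the data stay visible.

```python
from collections import defaultdict

# The data set from the text.
data = [18, 18, 19, 23, 24, 24, 24, 25, 25, 26, 26, 26, 26,
        27, 28, 29, 29, 30, 31, 31, 33, 33, 35, 39, 40]

def stem_plot(values):
    """Return the lines of a stem plot for non-negative integers."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 19 -> stem 1, leaf 9
        leaves[stem].append(str(leaf))
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem}|{''.join(leaves.get(stem, []))}")
    return lines

for line in stem_plot(data):
    print(line)
```

Because the values are sorted before they are split, the leaves on each stem come out in increasing order, which makes it easy to read off the median or the extremes.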

There are a few modifications we can make. Here's an example of a stem plot with split stems, in which each stem appears twice: once for the leaves 0-5 and once for the leaves 6-9. In some cases it's a more informative representation of the distribution of the data.

    1|3345          Key: 2|1 = 21
    1|67777889
    2|11222344455
    2|66667899
    3|011335555
    3|6777788899999
    4|1123455
    4|67789
    5|124


For comparing two distributions of similar data, a back-to-back stem plot can be handy. The leaves of one data set are on the left of the stems, and the leaves of the other are on the right.

         976642|2|1258
        9882211|3|1125788
     9988773211|4|1248899
    99866644321|5|111557899
      977664443|6|23338
       85554322|7|1455
          65331|8|179



xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012-2019, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.