xaktly | Probability & Statistics

Significance testing:
Errors & power

What can go wrong with a significance test?

On this page we'll add some critical analysis to our previous section on hypothesis testing. Mainly, we'll analyze what kinds of errors we can make, and what the consequences of those errors might be. We know that in significance tests we choose some confidence level, but we ought to be mindful that, by definition, that level allows for errors to sneak through, however unlikely. Recall, for example, that a 95% confidence level means that for about 95% of all samples, the interval formed by the sample mean plus or minus the margin of error will capture the true mean of the distribution. For the other 5% it will not.

The figure above illustrates this point. Each dot and horizontal bar represents a sample mean, $\bar x$, taken from a population.

The bar represents the mean plus and minus the margin of error, $\bar x \pm \text{MOE}$, calculated at some confidence level, $1 - \alpha$. For a 95% confidence interval, we expect that the true mean of the population, $\mu$, will fall within this range for about 95% of the samples we take. That means that for about 5% of those samples, we'll miss: the true mean won't lie within our confidence interval. Worse, we won't know it unless we take a large number of samples, and that might be time consuming or expensive.
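The coverage idea can be checked with a small simulation. This is only a sketch under assumed conditions: a normal population with made-up values `mu = 50` and `sigma = 10`, a known-sigma z-interval, and a hypothetical helper named `ci_coverage`. None of these come from the figure.

```python
import random
from statistics import NormalDist

def ci_coverage(mu=50.0, sigma=10.0, n=30, confidence=0.95, trials=2000, seed=1):
    """Fraction of z-intervals (sigma assumed known) that contain the true mean."""
    rng = random.Random(seed)
    # Critical value z* for the chosen confidence level (about 1.96 for 95%)
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    moe = z_star * sigma / n ** 0.5          # margin of error
    hits = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - moe <= mu <= x_bar + moe:  # did this interval capture mu?
            hits += 1
    return hits / trials

coverage = ci_coverage()
print(f"fraction of intervals capturing mu: {coverage:.3f}")
```

The printed fraction should land near 0.95, with the remaining intervals being the "misses" described above.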

So it's important that we try to understand the kinds of errors that can occur, their impact on the problem we're studying, and our "power" to discern accurately.

We will be concerned here with errors we can sort into Type I and Type II errors, and with calculating the "power" of a significance test to determine accurately whether a challenge to a hypothesis is meaningful.

Because these concepts can be tricky, we'll try to look at them in a few different ways – using language, pictures and the rules of conditional probability. Here we go ...

A word on errors

Don't be bothered by making the kinds of errors we're talking about here; they're normal. We live in a stochastic* universe. There is no measurement we can ever make that won't include some random fluctuations or "noise." We cannot escape this. We can only hope to manage the reality intelligently.

Type I errors

A Type I error occurs when our significance test rejects the null hypothesis, $H_o$, when it is actually true. There are various ways to describe a Type I error, including a "false positive" (in the sense of the hypothesis test), or having a "hair trigger," a perhaps oversized propensity to reject.

The probability of making a Type I error is the conditional probability that, given that $H_o$ is true, we then conclude that the alternate hypothesis, $H_a$, is true instead:

$$P(H_a | H_o) = \alpha.$$

Here are a couple of examples.

  • Consider a capital case in a court in the United States, where capital punishment (the death penalty) is still normal. The status quo in the U.S. system is that the accused is not guilty of the crime until proven to be so in a court — that's $H_o$. A trial is a test of the significance of the evidence against the null hypothesis. In U.S. courts, juries must reject the null hypothesis under the "beyond a reasonable doubt" standard. If $H_o$ is rejected but the null hypothesis (innocence) is actually true, then a grave error has been made and an accused person may be deprived of life by the government, which, in the U.S., is the citizens.
  • When we test drugs for efficacy and safety, our null hypothesis is generally something like "There is no difference between using the drug being tested and the conventional treatment, or no treatment at all." We are usually trying to minimize Type I errors that would reject such a null hypothesis. To wrongly reject this hypothesis (that is, to incorrectly accept that the drug is of value) could be to expose patients to a drug that is either not effective, or possibly does more harm than good. It would certainly incur unnecessary costs for patients and health-care providers.

Type I errors often lead to unnecessary costs or unnecessary consequences. They can occur by random chance or sometimes by malpractice on the part of researchers, though mathematics and science as disciplines are generally pretty good at snooping that kind of thing out; they are "self-policing."

In our first example, the Type I error leads to the state-sponsored killing of an innocent human, and in the second, to significant expense and exposure of humans to a possibly unnecessary drug.

Type II errors

A Type II error occurs when our significance test fails to reject a null hypothesis which was in fact false. As a conditional probability, it is:

$$P(!H_a | !H_o) = \beta.$$

Consider the capital murder case from above. A Type II error in that instance will let a guilty person go rather than convict, when in fact, the null hypothesis — presumed innocence — was false. Many would consider this the preferred outcome if an error was to be made at all, making a Type II error preferable to a Type I error – in that case.

Now think about the pharmaceutical example. In that case, if we assume that the null hypothesis is false — that the drug is better than the existing treatment or no treatment at all — and we miss that, then we've incurred an opportunity cost, a chance to improve treatment and for patients to feel better.

Tolkien on life and death

Author J.R.R. Tolkien (1892-1973) wrote in "The Fellowship of the Ring":

“Many who live deserve death. And some who die deserve life. Can you give it to them? Then do not be too eager to deal out death in judgement.”

In the pharmaceutical example, the Type II error is significant. Pharmaceutical and medical-device studies are fraught with all kinds of ethical and moral peril. It's important to get those right.

A word about hypotheses

We ought to distinguish hypotheses here, because the difference between typical null and alternative hypotheses can bear on the sizes of the potential errors we encounter.

  • Null hypotheses are usually particular, like $\mu = 23.4$ or $p = 0.82$.

  • Alternative hypotheses are usually more vague – they present a range of alternatives. For example, $\mu \ne 23.4$ or $p \gt 0.82$, and so on.

Here is a more visual way to look at these significance decisions.

Type I errors

Look at the red square on the chart. It represents a Type I error, for which the null hypothesis was true, but was rejected by the test. The probability of that rejection (of that error) is $\alpha$.

You can see from the figure on the right (we'll simplify to one-tailed tests) that the probability of rejecting $H_o$ when it is true is $\alpha$. That figure shows two probability distributions: the one on the left assumes that the null hypothesis holds, and the one on the right that the alternative is true.

We can view the probability of making a Type I error as a conditional probability,

$$P(H_a | H_o) = \alpha.$$

$H_o$ true, correct decision

Below the red square is the outcome of a correct decision from the data. The null hypothesis is true and the test fails to reject it. We can use conditional probabilities to find that probability, too. The conditional probability that, given that $H_o$ is true, then $H_a$ is rejected can be written as $P(!H_a | H_o)$. Now we have

$$ \begin{align} P(!H_a|H_o) + P(H_a|H_o) &= 1 \\[5pt] P(!H_a|H_o) &= 1 - P(H_a|H_o) \\[5pt] P(!H_a|H_o) &= 1 - \alpha \end{align}$$

On the graph, that probability is everything under the left-side distribution except the $\alpha$ area.
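To see these two probabilities appear as long-run frequencies, here is a minimal simulation sketch. It assumes a one-tailed z-test of $H_a: \mu > \mu_0$ with known standard deviation; the function name `type_i_rate` and the numbers `mu0 = 100`, `sigma = 15`, `n = 25` are illustrative assumptions, not values from the chart.

```python
import random
from statistics import NormalDist

def type_i_rate(mu0=100.0, sigma=15.0, n=25, alpha=0.05, trials=4000, seed=2):
    """Simulate a one-tailed z-test many times while H_o (mu = mu0) is actually true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # reject H_o when z exceeds this cutoff
    rejections = 0
    for _ in range(trials):
        # Samples are drawn from the null distribution, so every rejection is a Type I error
        x_bar = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (x_bar - mu0) / (sigma / n ** 0.5)
        if z > z_crit:
            rejections += 1
    return rejections / trials

rate = type_i_rate()
print(f"estimated P(reject H_o | H_o true):         {rate:.3f}")
print(f"estimated P(fail to reject H_o | H_o true): {1 - rate:.3f}")
```

The first number should come out close to the chosen $\alpha$, and the second close to $1 - \alpha$.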

$H_o$ false, correct decision

This probability can be written as $P(H_a | !H_o)$. Now we have

$$ \begin{align} P(H_a|!H_o) + P(!H_a|!H_o) &= 1 \\[5pt] P(H_a|!H_o) &= 1 - P(!H_a|!H_o) \\[5pt] P(H_a|!H_o) &= 1 - \beta \end{align}$$

Look at the graph, where $\beta$ is defined. It is the probability of failing to reject the null hypothesis when it is false, so $1 - \beta$ is the probability of correctly rejecting it.

Type II errors

A Type II error can be described by the conditional probability $P(!H_a|!H_o) = \beta$. From the graph you can see that's the blue area of the alternate distribution.
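The area $\beta$ can be computed directly once a single alternative distribution is pinned down. The sketch below assumes a one-tailed z-test with known standard deviation and one specific alternative mean; the function `type_ii_error` and the values passed to it are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

def type_ii_error(mu0, mu_a, sigma, n, alpha=0.05):
    """beta for a one-tailed z-test of H_o: mu = mu0 against one specific alternative mu_a > mu0."""
    se = sigma / n ** 0.5                             # standard error of the sample mean
    # Cutoff on the sample-mean scale: reject H_o when x_bar is above x_crit
    x_crit = NormalDist(mu0, se).inv_cdf(1 - alpha)
    # beta is the part of the alternative distribution that falls below the cutoff
    return NormalDist(mu_a, se).cdf(x_crit)

beta = type_ii_error(mu0=100, mu_a=106, sigma=15, n=25)
print(f"beta  = {beta:.3f}")
print(f"power = {1 - beta:.3f}")
```

Moving `mu_a` closer to `mu0` makes the two distributions overlap more, so $\beta$ grows.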

All of these probabilities are given in the table above.

Statistical power

The power of a statistical test is the probability that it correctly rejects the null hypothesis, $H_o$, when some alternative hypothesis, $H_a$ is true. According to our table above, that's $1 - \beta$. Like any probability, statistical power is between 0 and 1. As the power gets closer to one, $\beta$, the probability of making a Type II error, decreases. So increasing the power of a test decreases the Type II error probability.

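We can't really calculate a single statistical power in most situations. This is because of the nature of hypotheses (see box above). A null and alternative hypothesis are usually something like $H_o: \; \mu = a, \phantom{00} H_a: \mu \ne a$. Because $H_a$ covers a whole range of values, $\beta$ depends on which value in that range is actually true, so power can be calculated only for a specific, assumed alternative.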
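As a sketch of how power is tied to one particular alternative, consider a one-tailed z-test with known standard deviation. The symbols here are assumptions introduced for illustration: $\mu_0$ is the hypothesized mean, $\mu_a \gt \mu_0$ is one specific assumed true mean, $\sigma$ is the population standard deviation, $n$ is the sample size, $\Phi$ is the standard normal cumulative distribution, and $z_\alpha$ is the critical value with $\Phi(z_\alpha) = 1 - \alpha$. We reject $H_o$ when $\bar x \gt \mu_0 + z_\alpha \sigma / \sqrt{n}$, so

$$ \begin{align} \beta(\mu_a) &= P\left(\bar x \le \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} \; \Big| \; \mu = \mu_a \right) = \Phi\left( z_\alpha - \frac{(\mu_a - \mu_0)\sqrt{n}}{\sigma} \right) \\[5pt] \text{power}(\mu_a) &= 1 - \beta(\mu_a) = 1 - \Phi\left( z_\alpha - \frac{(\mu_a - \mu_0)\sqrt{n}}{\sigma} \right) \end{align}$$

Every choice of $\mu_a$ gives a different power, which is why no single number can be quoted for a vague alternative.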



Things that affect the resolving power of a significance test

  1. Significance level (α)

    If all other things are held constant, then as $\alpha$ increases, so does the power of the test. Consider the distribution below, in which two $\alpha$ levels are marked off:

    As $\alpha$ increases, the probability of rejecting the null hypothesis increases, therefore the power of the test increases. BUT: As the area of the rejection region (red) increases, so does the probability of rejecting $H_o$ when it is actually true (the probability of a Type I error).

  2. Sample size, $n$

    As $n$ increases, the width of the distribution of the test statistic shrinks in proportion to $1/\sqrt{n}$. The hypothesized distribution and the true distribution of the test statistic (we're assuming that $H_o$ is false) then overlap less, and it's easier to tell whether the observed statistic comes from one or the other.

  3. Inherent variability in the measured variable

    If there is a lot of fluctuation or uncertainty in the variable we're trying to measure, then the width of its distribution will be on the wide side. One way to get around this is by making a matched pairs design, in which one half of each data pair exhibits only the "background" noise. In such a design, we can do a pretty good job of separating signal from noise.

    One way to think about it is night-time digital photography. With very long timed exposures of the camera's sensor, it's inevitable that some of the electronic fluctuations inherent in any digital camera's recorder will cause noise in that sensor. To get around it we sometimes record a "dark frame", in which we simply cover the lens and record a frame: a record of the noise present in the sensor when no light is coming into the camera. Then we can subtract that from the real images later.

  4. The difference between the hypothesized value of a parameter (the statistic) and its true value

    The larger the difference between the hypothesized value of a parameter and its true value, the easier it can be to detect differences between the two. This is an inherent aspect of any problem, and not one we can usually manipulate.
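The four effects listed above can be compared numerically. What follows is a rough sketch, again assuming a one-tailed z-test with known standard deviation; the function `power`, the `base` settings and each variation are made-up values for illustration only.

```python
from statistics import NormalDist

def power(effect, sigma, n, alpha):
    """Power of a one-tailed z-test when the true mean exceeds the hypothesized mean by `effect`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical value for significance level alpha
    return 1 - NormalDist().cdf(z_alpha - effect * n ** 0.5 / sigma)

# Assumed baseline scenario, then one factor changed at a time
base = dict(effect=5.0, sigma=15.0, n=25, alpha=0.05)
variations = {
    "baseline": {},
    "larger alpha (0.10)": {"alpha": 0.10},
    "larger n (100)": {"n": 100},
    "less variability (sigma = 10)": {"sigma": 10.0},
    "larger true difference (10)": {"effect": 10.0},
}
results = {name: power(**{**base, **change}) for name, change in variations.items()}
for name, value in results.items():
    print(f"{name:32s} power = {value:.3f}")
```

Each variation should report a higher power than the baseline. Only the first two (the significance level and the sample size) are choices we typically control directly.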



*Stochastic means randomly determined.

xaktly.com by Dr. Jeff Cruzan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. © 2012-2019, Jeff Cruzan. All text and images on this website not specifically attributed to another source were created by me and I reserve all rights as to their use. Any opinions expressed on this website are entirely mine, and do not necessarily reflect the views of any of my employers. Please feel free to send any questions or comments to jeff.cruzan@verizon.net.