4. Hypothesis and Significance Testing

A scientific hypothesis makes a testable statement about the observable universe. A statistical hypothesis is more restricted in that it concerns the behavior of a measurable (or observable) random variable. Much of the work that we do is directed toward rephrasing a scientific hypothesis in terms that allow us to construct an appropriate statistical hypothesis. Say that we are concerned with a random variable x which falls in a sample space W. We can define (at our choosing) a subregion of the sample space, w. Since x is a random variable whose behavior in W is governed by a probability distribution, we can compute the probability that x will fall within our subregion w (i.e., P(x ∈ w)). Any hypothesis concerning P(x ∈ w) is a statistical hypothesis.

4.1. The Hypothesis Test

We will begin our discussion of statistical tests with a brief description of the classical hypothesis test scheme developed by Jerzy Neyman and E.S. Pearson in a series of classic papers published in the 1930s (reviewed by Lehmann, 1993). Although this approach is perhaps more appropriate to industrial applications and quality control situations than to science, it is fairly easy to understand at the basic level and includes important concepts that carry over into the somewhat looser approach of significance tests that we will follow in this book.

A Neyman-Pearson hypothesis test may be described as comprising four elements:

  1. A null hypothesis.
  2. An alternative hypothesis.
  3. A partitioning of the sample space (the set of all possible experimental outcomes) into two regions: the acceptance region and the rejection region.
  4. A decision rule: If your experimental observation falls in the acceptance region, accept the null hypothesis as true; if the observation falls in the rejection region, reject the null hypothesis in favor of the alternative hypothesis.

These four elements are to be decided upon before you do your experiment, and the conclusion of the whole process is meant to be a decision—thumbs up or thumbs down. Now, even in this rather rigid setup, we should not take too seriously the terms accept and reject—they are technical terms within the theory and should not be interpreted too literally. In science we never accept an hypothesis as true completely and without reservation.

Example 4.1

Suppose our experiment is to determine the sex ratio in a particular cross yielding 10 progeny; let π be the true probability of a female (F) and 1-π the true probability of a male (M). We know that either the cross has a normal sex ratio (π= 0.5), or it is biased in favor of females with π = 0.8. Admittedly, this is a bit artificial, but it will allow us to discuss all the more important concepts without getting tangled up in a morass of complex calculations.

  1. Null hypothesis: π = 0.5
  2. Alternate hypothesis: π = 0.8
  3. Acceptance region: Number of F's observed in set {0,1,2,3,4,5,6,7},
    Rejection region: Number of F's observed in set {8,9,10}.
  4. Decision rule: We collect and sex our 10 progeny. If the number of F's is ≤ 7, we accept the cross as normal; if the number of F's is >7, we reject the idea of a normal sex ratio in favor of the hypothesis that the ratio is biased and π=0.8.

Is this a reasonable procedure?  What properties should a reasonable procedure have?  

First of all, it should be clear that in any finite experiment, we can never be 100% sure whether we have a "0.5" cross or a "0.8" cross. If there were a way of determining that type without error, the problem would not be a statistical one. That consideration leads to the first set of important concepts:

Type I error: The probability, α, of rejecting the null hypothesis when it is true.

Type II error: The probability, β, of accepting the null hypothesis when it is false.

In our example,

    α = P(F ≥ 8 | π = 0.5) = 0.0439 + 0.0098 + 0.0010 = 0.0547,
    β = P(F ≤ 7 | π = 0.8) = 1 − (0.3020 + 0.2684 + 0.1074) = 0.3222.

We find that we will commit a Type I error about 5% of the time, a reasonable rate of error, but that we will incorrectly accept the null hypothesis almost 1/3 of the time when, in fact, the cross is biased. That result doesn't seem very satisfactory and leads to the next important concept:

The power of a test:  The power is the probability of rejecting the null hypothesis when it is false, i.e., 1-β.

In our case the power is about 2/3 – not very impressive. The reason, of course, is that the sample is quite small. In any hypothesis test there is a trade-off between α and β: if you lower α, then β will increase, and vice versa. The only way to increase the power for a fixed α is to increase the sample size (or find a better test).
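
The numbers quoted above are easy to verify directly from the binomial distribution. Below is a minimal Python sketch (not part of the original example; it uses only the standard library) that computes α, β, and the power for Example 4.1.

```python
# Verify the error rates for Example 4.1: n = 10 progeny,
# H0: pi = 0.5, H1: pi = 0.8, rejection region {8, 9, 10 females}.
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
alpha = sum(binom_pmf(k, n, 0.5) for k in range(8, 11))   # P(F >= 8 | pi = 0.5)
beta  = sum(binom_pmf(k, n, 0.8) for k in range(0, 8))    # P(F <= 7 | pi = 0.8)

print(f"alpha = {alpha:.4f}")      # ~0.0547
print(f"beta  = {beta:.4f}")       # ~0.3222
print(f"power = {1 - beta:.4f}")   # ~0.6778, i.e. about 2/3
```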

4.2. The Significance Test

There is a slightly different way of approaching the same problem. Instead of thinking of it as a decision problem, we simply construct a measure of how well the data agree with the null hypothesis. Much of the reasoning is very similar to the Neyman-Pearson approach, but the outlook is somewhat different. Throughout the rest of this book we will tend to favor this approach.

We assign to each possible outcome of the experiment a significance level or P-value. This value is a number between 0 and 1 that indicates how well the data conform to the assumptions of the null hypothesis. This approach involves two operations:

  1. Ordering all possible outcomes from least significant to most significant. (The ordering may be partial.)
  2. Assigning P-values to each outcome such that more significant (less supportive of the null hypothesis) outcomes receive smaller P-values.

If we don't utilize any particular alternate hypothesis, the process leads to what is sometimes called a pure test of significance. Generally, however, we have some alternate hypothesis in mind, and it guides how we order the possible outcomes.

A reasonable (but not unique) procedure would be as follows:

  1. Order the possible outcomes, x, according to the likelihood ratio

        P(x | H1) / P(x | H0),

    with outcomes having higher ratios deemed more significant than those with smaller ratios.
  2. Evaluate the P-value of each outcome, x, as the sum of the probabilities, under the null hypothesis, of all outcomes at least as significant as x according to the above ordering.
  3. Do the experiment, observe the outcome, and report the P-value for the observed outcome. For example, if P=0.61, the data evidently are in reasonable conformity with the null hypothesis. However, if P=0.003, then doubt is raised as to the validity of the null hypothesis and the alternate hypothesis starts looking more appealing. You may well still have some cutoff point, like P=0.05, in mind, at which point you start to doubt the null hypothesis and prefer the alternative, but that is not an intrinsic part of this approach.

Below is the calculation of the significance test P-values for Example 4.1. Notice the close similarity to the hypothesis testing scheme in this simple example. Our rejection region in that scheme was {8,9,10 females}, and these three events indeed correspond to the outcomes with the three smallest P-values. However, in our attempt to define a rejection region with probability around 5%, we left out of it the outcome F=7, which according to the likelihood ratio actually favors the alternative hypothesis! So the Neyman-Pearson approach would have obliged us to accept the null hypothesis based on an observed outcome (7 females) that really argues for the alternative hypothesis. Anomalies like this often come up, especially when dealing with discrete distributions.

Example 4.1 (cont.)

 F   P(F|H0)   P(F|H1)   Likelihood Ratio   P-value
 0   0.0010    0.0000          0.0001       1.0000
 1   0.0098    0.0000          0.0004       0.9990
 2   0.0439    0.0000          0.0017       0.9893
 3   0.1172    0.0007          0.0067       0.9453
 4   0.2051    0.0055          0.0268       0.8281
 5   0.2461    0.0264          0.1074       0.6230
 6   0.2050    0.0881          0.4295       0.3769
 7   0.1172    0.2013          1.7179       0.1719
 8   0.0439    0.3020          6.8719       0.0547
 9   0.0098    0.2684         27.4870       0.0107
10   0.0010    0.1074        109.9512       0.0010
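
The table can be reproduced with a few lines of code. The sketch below (an illustration using only the Python standard library, not part of the original text) computes the two binomial distributions, the likelihood ratios, and the P-values defined by the likelihood-ratio ordering; the printed values agree with the table above up to rounding.

```python
# Reproduce the Example 4.1 table: H0 is pi = 0.5, H1 is pi = 0.8, n = 10.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
outcomes = list(range(n + 1))
p0 = [binom_pmf(f, n, 0.5) for f in outcomes]   # P(F | H0)
p1 = [binom_pmf(f, n, 0.8) for f in outcomes]   # P(F | H1)
lr = [p1[f] / p0[f] for f in outcomes]          # likelihood ratio P(F|H1)/P(F|H0)

# P-value of outcome f: total H0 probability of all outcomes whose
# likelihood ratio is at least as large as that of f.
pvals = [sum(p0[g] for g in outcomes if lr[g] >= lr[f]) for f in outcomes]

for f in outcomes:
    print(f"{f:2d}  {p0[f]:.4f}  {p1[f]:.4f}  {lr[f]:9.4f}  {pvals[f]:.4f}")
```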

Example 4.1 was picked for its simplicity in order to introduce basic concepts. Both of our hypotheses, null and alternative, were, in the jargon of statistics, simple; they each completely specified their corresponding probability distribution. Real testing problems, whichever approach is used, are generally more complex and involve so-called composite hypotheses.

4.3. Simple versus Composite Hypotheses

A probability distribution is often determined by the values of a set of parameters. A simple statistical hypothesis specifies a unique value for each parameter of the distribution. For example, we may wish to test the hypothesis that a set of observations comes from a Poisson distribution with a mean of 4.0. This hypothesis is simple because it specifies the sole parameter for the distribution and thus the entire distribution. Alternatively, we may wish to compare two sets of observations and test the hypothesis that the two sets come from the same Poisson distribution with mean m, m unspecified. Since we don't care what the value of m is but only that a single value can account for both of the sample distributions, this hypothesis is composite. In general, if a random variable follows a probability distribution with r parameters and our hypothesis specifies k of these parameters, the hypothesis is simple if r = k and is composite if r > k. Most of the rest of the course will be devoted to defining statistical tests of various types, with composite alternative hypotheses.

Example 4.2

A first approach to discerning developmentally detrimental effects of genes that have an easily observed dominant phenotype is to determine whether the number of progeny homozygous for the gene in a cross between heterozygotes is smaller than would be predicted by Mendelian segregation. Suppose that we have isolated a mutant, D, and, to make matters simple, we can determine all three genotypes (D/D, D/+, and +/+) by virtue of an RFLP on Southern blots or PCR analysis. We mate two D/+ animals and determine the genotypes of 20 of the progeny. We find that 2 are D/D, 12 are D/+ and 6 are +/+.

4.4. Choosing the Null and Alternative Hypotheses

Prior to doing the above experiment, we should have a scientific hypothesis in mind, the nature of which will play a role in how we perform our experiment. We could construct a variety of scientific hypotheses, some of which are stated below.

S1:  As stated in Example 4.2 the hypothesis of interest is that the mutant gene D has a detrimental effect during development in its homozygous state such that "few" D/D embryos survive to term.

S2:  The allele D has developmentally detrimental effects, such that both D/D and D/+ offspring are underrepresented.

S3:  Segregation of the D and + alleles is non-Mendelian.

The first step toward testing any of the above is to decide on the nature of the random variable to be measured and to determine its probability distribution under some defined (and relevant to our scientific hypothesis) set of conditions. For any of the above, we could look at the number of D/D progeny obtained, although this random variable might not be the best choice in all circumstances. Note that these three scientific hypotheses make quite different assertions about the numbers of each genotype that would occur. In spite of these distinctions, the contrary, or null hypothesis, is the same, i.e., the number of D/D progeny is binomially distributed with a success probability of p = 0.25.

H0:  p = 0.25

In the cases of S1 and S2 above, we are interested in the same composite alternative hypothesis

HA1:  p < 0.25

that is, the D/D class is underrepresented. Note that this approach to S2 does not take into account all of the useful information (the number of D/+ progeny is not specifically considered), so our hypothesis test is not optimal. In the case of S3, we are interested in a more general composite alternative hypothesis

HA2:  p ≠ 0.25

and again we are not making optimal use of the available information. The comparison between H0 and HA1 is said to be a one-sided test, while that between H0 and HA2 is a two-sided test.

4.5. Performing the Statistical Test

Now it would seem that we have to decide whether we are going to perform a hypothesis test or a significance test. To be honest, most scientists (and statisticians too!) do not always make a clear distinction between the two, probably because they usually involve largely the same calculations and tend to lead one to the same conclusion.

4.5.1. The hypothesis test approach.

To construct an hypothesis test, we first specify the null hypothesis, H0, and then decide on a value of α, the Type I error rate. This value, also referred to as the significance level or size of the test, will typically be fairly small (e.g., 0.05) so that we don't often make a Type I error. We then determine the set of outcomes (the critical or rejection region), with a combined probability of at most α, that will cause us to reject the null hypothesis. If our observed result falls within this set of outcomes, we reject the null hypothesis; otherwise we accept it.

4.5.2. The significance test approach.

Here we just calculate the P-value that corresponds to our observed outcome. How we view that value, i.e., what we "do" after seeing it and thinking about it, is not really a formal part of the theory. Usually our behavior will be very similar to the Neyman-Pearson decision-maker, but we are not formally bound by our approach to make any particular decision.

4.5.3. A complication arising out of composite hypotheses.

Whichever approach we are inclined to use, a complication has arisen in Example 4.2 that was not present in Example 4.1. In Example 4.2, the alternative hypothesis is composite—it specifies a set (actually an interval) of values of the binomial probability, not one particular value. So now how do we use the alternative hypothesis to calculate Neyman-Pearson power or to order our outcomes for a significance test?

The power question is fairly straightforward. Clearly there is now no single power for our test of the hypothesis, but a different power for each possible value of the binomial probability included in the alternative hypothesis. The power is a function rather than a single value. This function is described in detail in the next section for Example 4.2.

If we are significance test animals, and are using the likelihood ratio method to order our outcomes, how do we do it when the alternative is composite? Obviously, the ratio

    P(x | H1) / P(x | H0)

no longer specifies a unique number for each outcome because H1 refers to a set of values of the binomial probability. The method most often used is to order the outcomes according to

    max over p  [ P(x | p) / P(x | p0) ],

the maximum being taken over all values of the binomial probability, p, in H1 (p0 being the value specified by the null hypothesis).
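
As an illustration for Example 4.2 (one-sided alternative p < 0.25), the sketch below computes this maximized likelihood ratio in Python. It is not from the original text, and for simplicity it takes the maximum over the closed interval [0, 0.25]; on that interval the binomial likelihood is maximized at the sample proportion x/N whenever x/N lies below 0.25, and at 0.25 itself otherwise.

```python
# Maximized likelihood ratio for Example 4.2: N = 20, H0: p = 0.25, HA: p < 0.25.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

N, p0 = 20, 0.25

def max_likelihood_ratio(x):
    """max over 0 <= p <= p0 of P(x | p) / P(x | p0); the maximizing p is the
    MLE x/N when it lies below p0, and p0 itself otherwise."""
    p_hat = min(x / N, p0)
    return binom_pmf(x, N, p_hat) / binom_pmf(x, N, p0)

for x in range(0, 8):
    print(x, round(max_likelihood_ratio(x), 2))
```

Note that every outcome with x/N at or above 0.25 receives a maximized ratio of 1, so the ordering ranks only the outcomes in the direction of the alternative, as befits a one-sided test.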

In any event, it is useful to provide a rough verbal definition of the meaning of the P-value obtained in such a test:

The significance level, or P-value, is the probability of obtaining the observed result or a more extreme result (one less consistent with the null hypothesis) under the assumption that the null hypothesis is true.

Left unspecified in this definition, of course, is exactly what is meant by "more extreme," and whether or not its definition makes use of an alternative hypothesis. In our example we used a likelihood ratio criterion, but others are possible. The most obvious is simply the difference |p̂ - p0|, where p0 is the value of the binomial probability assigned by the null hypothesis and p̂ denotes the proportion observed in the experiment. Another possibility is simply to order the possible outcomes by their probabilities under the null hypothesis, as is done in the so-called "pure" test of significance (this method, of course, doesn't take into account any specific alternative hypothesis).

In Example 4.2, the P-value for the test of the null hypothesis

H0:   p = 0.25

against the alternative

HA1:  p < 0.25

would be obtained by summing the probabilities of obtaining 0, 1, and 2 D/D progeny for a binomial distribution with N=20 and p=0.25. These probabilities are 0.00317, 0.0211, and 0.0669, respectively, giving us a significance level of P=0.091. This P-value may be interpreted as a sort of inverse measure of the confidence with which we can discard the null hypothesis in favor of our alternative. Although it is customary to use a P-value of 0.05 (or less) to reject the null hypothesis (the hypothesis test approach), the choice of this value is arbitrary. In our case, we could interpret the results of this experiment to say that we are suspicious (if not entirely convinced) that D/D animals are lost during development (the significance test approach), and it would be reasonable at this point to do another experiment that might tilt us one way or the other on the question.
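
As a check on the arithmetic, here is a short Python sketch (an illustration, not part of the original text) that sums the three binomial probabilities to obtain the one-sided P-value.

```python
# One-sided P-value for Example 4.2: P(X <= 2) with N = 20, p = 0.25 under H0.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_value = sum(binom_pmf(k, 20, 0.25) for k in range(0, 3))
print(round(p_value, 3))   # ~0.091  (0.00317 + 0.0211 + 0.0669)
```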

4.5.4. Choosing between one-sided and two-sided tests.

In the example above, we discussed testing the same null hypothesis (i.e., p=0.25) against either a one-sided (p<0.25) or a two-sided (p≠0.25) alternative, depending on the biological question we were asking. It will likely have occurred to you that a one-sided test will give a smaller P-value (generally about half) than the two-sided test when the deviation is in the "desired" direction. As discussed later (Chapter 10), it will also have greater power, since the entire rejection probability is concentrated in the one tail you care about. You should avoid the temptation to routinely perform one-sided tests (in the direction of your preconceived idea of how the experiment should work out). When you do so, you are asserting that you would have absolutely no interest in pursuing further studies if your results pointed in the other direction. Thus, for most of the experiments you do, performing a two-sided test would be more appropriate. However, there are some questions that are intrinsically one-sided. For example, suppose you want to test whether a specific chemical causes cancer and might pose a health risk to people. You treat groups of mice with the chemical or a vehicle control and, after an appropriate period of time, you assess the incidence of cancer at a particular organ site. Because of the reason you are doing the experiment (safety testing), a one-sided test of whether the incidence in treated mice exceeds that in control mice would be entirely appropriate.
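
The contrast can be made concrete with the data of Example 4.2. The sketch below (standard-library Python, not part of the original text) computes both P-values; the two-sided value here uses the common "double the one-sided tail" convention, capped at 1, and other conventions exist for discrete distributions.

```python
# One-sided versus two-sided P-values for 2 D/D progeny out of 20, H0: p = 0.25.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

N, p0, x = 20, 0.25, 2
one_sided = sum(binom_pmf(k, N, p0) for k in range(0, x + 1))
two_sided = min(1.0, 2 * one_sided)   # simple doubling convention

print(round(one_sided, 3))   # ~0.091
print(round(two_sided, 3))   # ~0.183
```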

4.6. The Power of Statistical Tests

The concept of power comes out of the hypothesis testing scheme. The idea, as we saw above, is to set the size of the test and then look for a test with the greatest power.

One way of looking at the above example is that we simply didn't do a large enough experiment to detect an effect of the D/D genotype on viability. Before doing an experiment, it is useful to consider the power, the chance that our experiment will detect an effect, of the statistical test we plan to use. In this case, the power is simply 1 - β, the complement of the probability of making a Type II error. Unlike the Type I error probability, which we can view as being fixed by the investigator, the probability of a Type II error depends both on the value of α that we consider acceptable (that is, the largest P-value at which we would reject the null hypothesis) and on the degree of deviation from the null hypothesis that we would like to be able to detect (in this case, the largest value of the frequency of D/D offspring that we would find interesting). We will spend more time on this problem later (Chapter 10), but for our simple example we can compute the power of the experiment fairly easily.

Example 4.2 continued

In our example, say that we would like to fix α at 0.05 and be able to detect a reduction in the frequency of D/D progeny under the alternative hypothesis to 0.1 (as against the null hypothesis value of 0.25). We can compute the power (1-β) of the experiment as a function of N as follows. For a given value of N, we determine the critical value of x, the number of D/D offspring, that would allow us to reject the null hypothesis with a significance level of at most 0.05. For N = 20, using the binomial probabilities given above, we would reject the null hypothesis only if the number of D/D offspring was less than or equal to 1, since P(x ≤ 1) = 0.024 and P(x ≤ 2) = 0.091. Thus, the critical value of x is 1 for N = 20. Under the alternative hypothesis that the proportion of D/D progeny is 0.1, we can use a binomial distribution with N = 20, p = 0.1 to calculate the probability of 1 or fewer D/D animals as 0.391. Thus, the power of our experiment to detect a frequency of D/D animals less than or equal to 0.1 when N = 20 is 0.391. The table below summarizes power calculations for various sample sizes in this experiment. You can see from the table that in order to have a greater than 90% probability of detecting the desired alternative we would require a sample size of about 55 progeny.

 N    Critical Value (α = 0.05)    Power
15               0                 0.206
20               1                 0.391
30               3                 0.647
40               5                 0.794
50               7                 0.877
60               9                 0.926
70              11                 0.956
80              13                 0.973
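
The table above can be generated by a direct search. The sketch below (an illustrative standard-library Python script, not part of the original text) finds, for each N, the largest critical value whose null-hypothesis tail probability does not exceed 0.05 and then evaluates the power under the alternative p = 0.1; the printed values agree with the table up to rounding.

```python
# Power of the one-sided binomial test of H0: p = 0.25 against the alternative p = 0.1,
# as a function of the number of progeny N, at size alpha = 0.05.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(c, n, p):
    return sum(binom_pmf(k, n, p) for k in range(0, c + 1))

alpha, p_null, p_alt = 0.05, 0.25, 0.10

for N in (15, 20, 30, 40, 50, 60, 70, 80):
    # Largest critical value c with P(X <= c | p_null) <= alpha.
    c = -1
    while c + 1 <= N and binom_cdf(c + 1, N, p_null) <= alpha:
        c += 1
    # Power: probability of falling in the rejection region under the alternative.
    power = binom_cdf(c, N, p_alt) if c >= 0 else 0.0
    print(N, c, round(power, 3))
```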

4.7. Sample Problems

  1. It seems reasonable to suppose that the more lethal the D/D genotype, the fewer progeny must be analyzed to demonstrate a deviation from Mendelian segregation. If we use a binomial probability of p=0.01 for our alternative hypothesis, what is the power of an experiment involving the analysis of 20 offspring? Assume that all of the other parameters of the above example are the same:  the binomial probability under the null hypothesis is p=0.25, and the desired value for the significance level is α=0.05.
  2. Again using the above example, consider the case in which the survival of both the D/D and D/+ progeny are reduced relative to the +/+ offspring. You do an experiment mating D/+ parents and examine the genotypes of 20 offspring. You observe that 2 are D/D, 9 are D/+, and 9 are +/+ in genotype. There are two alternative tests using the binomial distribution to compare these results to the expected proportions of 0.25:0.5:0.25, depending on whether you test for the presence of too few D/D offspring or too many +/+ offspring. What significance level (P-value) do you obtain for each test? In the case that the survival of D/D offspring is 10% and that for D/+ offspring is 50% (with that for +/+ being 100%), what is the power of each test when 20 offspring are analyzed and an α= 0.05 is used?