5. Testing for Differences in Location or Dispersion
The most commonly encountered statistical problem is that of testing for differences in location between measurements made on two independent populations. We are accustomed to thinking of this problem as testing for a difference in the mean values for the two populations, but we could also phrase the problem in a slightly different way. For example, we might ask whether a randomly chosen member of population 1 is likely to be larger in magnitude than a randomly chosen member of population 2. The classical normal theory test for a difference in the mean (μ) of two normally distributed random variables is the Student t-test, which defines a test statistic
t = (m1 - m2) / √[sp²(1/n1 + 1/n2)], with sp² = [(n1-1)s1² + (n2-1)s2²] / ν
where mi is the mean for group i, si² is the variance, ni is the number of observations, and ν, the number of degrees of freedom, is n1+n2-2. The null hypothesis, that μ1 is equal to μ2, is rejected for large or small (negative) values of the statistic. Although
this familiar statistic is easy to compute, the two assumptions required for
using this test are often violated by data from biological experiments.
First, the test is quite sensitive to "heavy tails" in the
distribution for the data. The frequency of observations for normally
distributed data falls off rapidly as the value becomes much less than or
greater than the mean value, making the test sensitive to outlier or extreme
values. Second, the assumption that the variances for the two populations
are equal and essentially independent of the mean values for the populations
is contrary to the dependence of the variance on the mean for many simple
distributions (e.g., Poisson) of biological interest.
5.1. Permutation Tests
The second version of the location question posed above, that randomly chosen members of population 1 are likely to be larger than those from population 2, provides the rationale for an alternative approach based on examining all possible permutations of the observed data set.
Consider the following small experiment, in which we make three independent observations under each of two conditions:
Group A: 29, 52, 49
Group B: 15, 36, 18
We want to test the null hypothesis that there is no difference in "location" between the two treatments against the one-sided alternative that the observations in group A are larger than those in group B. We observed that the mean of group A (mA) is 43.3 and that for group B is 23; thus, we could define a statistic for testing our hypothesis, D = mA - mB, which is equal to 20.3 for the above case.
A (mA) | B (mB) | D | A (mA) | B (mB) | D |
29,52,49 (43.3) | 15,36,18 (23) | 20.3 | 52,49,15 (38.7) | 29,36,18 (27.7) | 11 |
29,52,15 (32) | 49,36,18 (34.3) | -2.3 | 52,49,36 (45.7) | 29,15,18 (20.7) | 25 |
29,52,36 (39) | 49,15,18 (27.3) | 11.7 | 52,49,18 (39.7) | 29,15,36 (26.7) | 13 |
29,52,18 (33) | 49,15,36 (33.3) | -0.3 | 52,15,36 (34.3) | 29,49,18 (32) | 2.3 |
29,49,15 (31) | 52,36,18 (35.3) | -4.3 | 52,15,18 (28.3) | 29,49,36 (38) | -9.7 |
29,49,36 (38) | 52,15,18 (28.3) | 9.7 | 52,36,18 (35.3) | 29,49,15 (31) | 4.3 |
29,49,18 (32) | 52,15,36 (34.3) | -2.3 | 49,15,36 (33.3) | 29,52,18 (33) | 0.3 |
29,15,36 (26.7) | 52,49,18 (39.7) | -13 | 49,15,18 (27.3) | 29,52,36 (39) | -11.7 |
29,15,18 (20.7) | 52,49,36 (45.7) | -25 | 49,36,18 (34.3) | 29,52,15 (32) | 2.3 |
29,36,18 (27.7) | 52,49,15 (38.7) | -11 | 15,36,18 (23) | 29,52,49 (43.3) | -20.3 |
Under the null hypothesis that the two sets of observations come from the
same population, we could randomly draw any three of the six observations
and label them as "A". The distribution of our test statistic,
D, can be computed for all
permutations of the data set. The significance level for our hypothesis test
can then be obtained by dividing the number of permutations for which
D≥20.3 by the total number of permutations, 20. The 20 permutations of
our data set, and the corresponding value of our test statistic, are
enumerated in the table above. From the table, you can see that only 2
permutations give test statistics that equal or exceed our observed value,
20.3. Thus, the P-value for our one-sided statistical test is 0.1. If
we had been testing against a two-sided alternative hypothesis, we would
count the number of permutations for which |D| ≥ 20.3 (4 in this case).
The above permutation test has the advantage that it provides an exact test of our hypothesis without requiring us to make any assumptions about the underlying model that generated the data (e.g., the assumption of normality in the t-test). This approach can be applied quite generally as long as we can define an appropriate test statistic. For example, we might be interested in comparing two analytical methods for their precision in measuring some parameter and we've made multiple observations using each method for the same sample. To test for a difference in precision, we could define a test statistic D=V(A)-V(B), where V(x) is the variance of x.
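As a concrete illustration, here is a minimal sketch of the full enumeration in Python (our choice of language for illustration; any language would do), using only the standard library and the small data set above:

```python
from itertools import combinations

# Pooled observations: Group A = 29, 52, 49; Group B = 15, 36, 18
data = [29, 52, 49, 15, 36, 18]
observed_d = (29 + 52 + 49) / 3 - (15 + 36 + 18) / 3   # observed D = 20.33

labelings = list(combinations(range(6), 3))  # all 20 ways to label 3 values as "A"
count = 0
for idx in labelings:
    a = [data[i] for i in idx]
    b = [data[i] for i in range(6) if i not in idx]
    d = sum(a) / 3 - sum(b) / 3
    if d >= observed_d - 1e-9:               # tolerate floating-point error
        count += 1

print(count / len(labelings))                # one-sided P-value: 2/20 = 0.1
```

The same skeleton serves for any statistic; to test for a difference in precision, replace the difference of means with the difference of variances.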
5.1.1. Monte Carlo methods.
A major disadvantage to the permutation test above is that the magnitude of the computational problem scales up very rapidly with the sample size. The numbers of permutations that we would have to enumerate for two groups of 3, 5, 10, and 20 observations each are 20, 252, 184756, and 1.37847×10¹¹, respectively.
The complete set of permutations defines the sample space for our statistical test under the null hypothesis. One way of thinking about our hypothesis test is that we want to know what fraction of the points in the sample space would provide a test statistic that equals or exceeds our observed result. In the case that the sample space is very large, it is possible to estimate this fraction (our P-value) by simply examining a suitably large number of randomly chosen points within the sample space. This approach falls under the general heading of "Monte Carlo" methods. Given access to a desktop computer and with a little programming, this approach can be used as generally as the strict permutation test described above.
Example 5.1
We are studying the induction of cell proliferation in a target tissue following treatment of animals with a particular chemical. We treat groups of 14 and 13 animals with the agent or solvent vehicle, respectively, along with BUdR to label replicating cells. Animals are sacrificed, the tissue is prepared, and we examine 10,000 cells from each animal and count the number of labeled cells. We obtained the following data:
Treated: 563, 504, 837, 262, 435, 283, 218, 1296, 1310, 1311, 658, 426, 794, 297
Control: 231, 79, 290, 119, 346, 493, 349, 299, 747, 121, 109, 204, 114.
The means and standard deviations for the treated and control groups are 657±399 and 269±189, respectively.
We want to test the null hypothesis that the levels of proliferation in treated and control animals are the same against the one-sided alternative that the treated animals show a higher labeling index. We can define a test statistic, D=Mtreated-Mcontrol, as the mean of the treated group minus the mean of the control group.
Our sample space under the null hypothesis consists of more than 20 million points. We decide to use a Monte Carlo approach and run 100,000 random trials on our data set and compute the value of D for each (at the cost of about a half hour of programming and one minute of computation). The complement (upper tail) of the cumulative distribution for the test statistic is shown in the figure below. Our observed test statistic is 388 and the fraction of trials for which D ≥ 388 is approximately 0.001, which is our P-value for the statistical test.
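A sketch of such a Monte Carlo computation, using Python's standard random module and the data from this example:

```python
import random

treated = [563, 504, 837, 262, 435, 283, 218, 1296, 1310, 1311, 658, 426, 794, 297]
control = [231, 79, 290, 119, 346, 493, 349, 299, 747, 121, 109, 204, 114]

pooled = treated + control
n_t = len(treated)
observed_d = sum(treated) / len(treated) - sum(control) / len(control)  # ~388

trials = 100_000
exceed = 0
for _ in range(trials):
    random.shuffle(pooled)          # a random point in the permutation sample space
    d = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / (len(pooled) - n_t)
    if d >= observed_d:
        exceed += 1

print(exceed / trials)              # estimated one-sided P-value, ~0.001
```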
5.2. Wilcoxon Rank Sum Test
The permutation tests described above require that we define (or approximate) a unique distribution for our test statistic for every experiment. A much more convenient approach is to base the statistical tests on the ranks (or ordered magnitudes) of the observations rather than on the observed values. As will become evident, this approach requires very few assumptions concerning the distribution(s) followed by the data and hence the methods are often referred to as distribution-free or nonparametric tests. A further advantage to tests based on ranks is that they are relatively insensitive to the presence of extreme values (outliers) in the samples.
5.2.1. Rationale and a simple example.
Consider an experiment to test the ability of a particular gene to confer the property of anchorage-independent growth on a mammalian cell line that normally has a low frequency of colony formation when cultured in semisolid medium. Two plasmid constructions, "A", which contains an antibiotic resistance marker, and "B", which is identical to the "A" plasmid except that it also includes the test gene, are each introduced into populations of cells. Those cells that express antibiotic resistance (and hence carry the A or B DNA) are selected and samples of 10⁴ cells are plated in semisolid medium. In this small experiment, 4 independent plates of "A" cells and 2 of "B" cells are scored for the number of colonies after a suitable period of time. The following data (number of colonies / 10⁴ cells) are obtained:
A plasmid: 32 24 28 40
B plasmid: 116 120
The null hypothesis of interest is that the observations for the two groups of data come from identical populations against an alternative hypothesis that cells carrying the B plasmid are more likely to grow in agarose. If the null hypothesis were true, we might imagine that the second sample (B plasmid) could consist of any 2 of the 6 observations made in the experiment and that all possible permutations of the data would be equally likely.
Because we are interested only in the possible permutations of the observations between the two groups, the actual values are irrelevant and we can consider just the ranks of the observations (from 1 to 6, smallest to largest). Replacing the observed values by their ranks (keeping the data in the same order as above) gives:
A plasmid: 3 1 2 4
B plasmid: 5 6
It is convenient to have a single value or statistic derived from the data that gives a consistent measure of the relative magnitude of the observations in the sample. The simplest such statistic would be the sum of the ranks for the sample; for example, 11 in the case of the B plasmid.
The 15 possible permutations of the data for the B plasmid sample, along with the statistic, are given in the table below. We can then tabulate the distribution under the null hypothesis for the rank sum statistic for the case of two samples of 4 and 2 observations. Each of the 15 permutations is equally likely (with frequency 1/15) under the null hypothesis.
Permutations
| B sample ranks | Rank sum |
| 6, 5 | 11 |
| 6, 4 | 10 |
| 6, 3 | 9 |
| 6, 2 | 8 |
| 6, 1 | 7 |
| 5, 4 | 9 |
| 5, 3 | 8 |
| 5, 2 | 7 |
| 5, 1 | 6 |
| 4, 3 | 7 |
| 4, 2 | 6 |
| 4, 1 | 5 |
| 3, 2 | 5 |
| 3, 1 | 4 |
| 2, 1 | 3 |
Rank Sum Distribution | |
Sum | Frequency |
11 | 1/15 (0.067) |
10 | 1/15 (0.067) |
9 | 2/15 (0.133) |
8 | 2/15 (0.133) |
7 | 3/15 (0.2) |
6 | 2/15 (0.133) |
5 | 2/15 (0.133) |
4 | 1/15 (0.067) |
3 | 1/15 (0.067) |
In the case of our small experiment, the statistic for the observed data is equal to 11. From the distribution of the statistic under the null hypothesis, the probability of obtaining a rank sum this large (it cannot, in fact, be larger) is 0.067.
This approach to the two-sample location problem is generally referred to as the Wilcoxon rank sum test (Wilcoxon, 1945). Permutation tests for differences in location, such as the Wilcoxon rank sum test, have the advantage of requiring few assumptions concerning the distribution of the data obtained in an experiment. The broad applicability of these tests exacts a relatively small cost in power (that is, they generally require only a few more observations) when compared with parametric tests that are based on explicit knowledge of the form of the distribution of the data. In this chapter, we will provide a more formal description of the Wilcoxon rank sum test and discuss analogous approaches to testing for differences in location in blocked data and for multiple samples.
5.2.2. Wilcoxon rank sum test — formal description.
Data and assumptions: Observations are taken on two independent groups. The first group consists of n observations Xi (i=1…n) and the second consists of m observations Yj (j=1…m). Without loss of generality, we can assume that n≤m. Within each group the observations are assumed to be mutually independent and the Xi are assumed to come from a population with continuous cumulative distribution F(x) while the Yj are from a distribution F(y). We want to test the null hypothesis
H0: F(x) = F(y)
against either the two-sided alternative
H2: F(x) ≠ F(y)
or one-sided alternatives
H1: F(x) < F(y) or F(x) > F(y)
Test statistic: The n+m observations are jointly assigned ranks from 1 to (n+m). The test statistic WX is the sum of the ranks for the first group,
WX = R1 + R2 + … + Rn
where Ri is the rank of observation Xi.
The exact significance level for the appropriate alternative hypotheses can be obtained from the table in Appendix 4 as follows.
For the one-sided hypothesis F(x)<F(y) (i.e., the X's are larger) enter the appropriate place in the table with x=WX.
For the one-sided hypothesis F(x)>F(y) (i.e., the Y's are larger) enter the appropriate place in the table with the value x=[n(m+n+1)-WX].
The distribution of the test statistic under the null hypothesis is symmetrical, so the significance level for the two-sided test is most simply obtained by doubling the P-value for the appropriate one-sided alternative.
Alternative form of the statistic: Recall that one of the ways to phrase the location question is in terms of the probability that a randomly chosen X-value will be larger than a randomly chosen Y-value. A statistic equivalent to the one described above, the Mann-Whitney statistic (Mann and Whitney, 1947), can be defined as
UXY = number of pairs for which (Xi>Yj)
such that UXY/nm is the probability stated above. This is equivalent to the statistic given above since
WX = UXY + [n(n+1)/2]
Large-sample approximation: In cases where the number of observations exceeds the range of the table for the Wilcoxon rank sum statistic, we can take advantage of the fact that, for large n and m, the statistic WX is approximately normally distributed. The expected value for the statistic is
E(WX) = n(n+m+1)/2
while the variance is
V(WX) = nm(m+n+1)/12
We can thus define an approximate statistic
W* = [WX - E(WX)] / √V(WX)
where W* follows the standard normal distribution with mean of 0 and variance of 1. In the simple example given above, WX was 11, n was 2 and m was 4. Thus, the approximate statistic would be
W* = (11-7)/√4.67 = 1.85
From the table of the standard normal distribution, this would correspond to a P-value of 0.032 which rather overstates the true significance level of 0.067. Applying a continuity correction in this case (i.e., subtracting 0.5 from the numerator) improves the approximation and gives a P-value of 0.053. For values of n and m where both exceed 6 or so, the normal approximation is quite good, with or without a continuity correction.
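The large-sample computation is easy to script; a minimal sketch follows (the helper function name is ours, for illustration):

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_normal_approx(w_x, n, m, continuity=True):
    """Upper-tail (one-sided) P-value for the rank sum W_X of the group of n observations."""
    e_w = n * (n + m + 1) / 2                 # E(WX)
    v_w = n * m * (n + m + 1) / 12            # V(WX), no ties
    num = w_x - e_w
    if continuity and num != 0:               # shift the numerator 0.5 toward 0
        num -= 0.5 if num > 0 else -0.5
    return 1 - NormalDist().cdf(num / sqrt(v_w))

print(rank_sum_normal_approx(11, 2, 4, continuity=False))  # ~0.032
print(rank_sum_normal_approx(11, 2, 4))                    # ~0.053
```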
Correction for tied observations: One of the assumptions made above was that the Xi and Yj came from continuous distributions. Thus, there is no chance that any two values will be identical (i.e., tied). However, all measurement data are at some level discrete, and some experiments necessarily give rise to discretely distributed data. In the case where two or more observations in the experiment have identical values, the "exact" significance level obtained from the table for the WX statistic is conservative. We could, of course, obtain an exact significance level by recomputing the null distribution for the test statistic as we did in our simple example, but in that case the result would be conditional on the pattern of tied values in the observations. The normal approximation described above remains unconditional in the presence of ties as long as we correct for the presence of tied groups of values.
Consider the following data set
Group 1: 5 8 10 12
Group 2: 4 0 5
We can rank the observations as before, except that average ranks (mid-ranks) will be assigned to tied values. The ranks are thus
Group 1: 3.5 5 6 7
Group 2: 2 1 3.5
The test statistic (Group 2) in this case is WX = 6.5. The expected value for the test statistic, E(WX), remains unchanged when some of the values are tied, but the value of the variance decreases (since the number of possible values of the statistic is reduced) by an amount depending on the number of tied values:
V(WX) = [nm/12] × [(n+m+1) - Σtk(tk²-1)/((n+m)(n+m-1))]
where the sum is over the g groups of tied values and tk is the number of observations in the kth tied group. Note that this is identical to the variance given above when there are no ties (every tk = 1, so the second term vanishes). We will follow Lehmann's (1998) lead and not apply the continuity correction to the normal approximation when the data set contains tied values.
Examples 5.2
A. Consider an experiment identical to the one for our transformation example above, but with a larger number of observations. The data are
A plasmid: 70 55 80 140
B plasmid: 116 220 405 410 550 735
We wish to test the null hypothesis that the two plasmids yield the same colony forming efficiency against the one-sided alternative that the plating efficiency of the cells carrying the A plasmid is smaller. The ranks for the observations are
A plasmid: 2 1 3 5
B plasmid: 4 6 7 8 9 10
Thus, the value of the test statistic is WX=11. We can enter the table using the value x=44-11=33, which indicates a P-value of 0.0095. Note that in this case, the normal approximation would give W*=-2.34 for an approximate P-value of 0.0095.
B. In Example 5.1, we used Monte Carlo methods to compare the BUdR labeling index for tissue from control animals and those treated with a test chemical. Ranking the observations for the two groups we obtain
Treated: 20, 19, 24, 9, 17, 10, 7, 25, 26, 27, 21, 16, 23, 12
Control: 8, 1, 11, 4, 14, 18, 15, 13, 22, 5, 2, 6, 3
The sum of the ranks for the treated group is 256. Using the normal approximation, we obtain
W* = (256-196)/√424.67 = 2.912
From Appendix 2, the one-sided P-value is 0.0018, similar to the significance level we obtained by our Monte Carlo approach.
We can use midranks and the approximate statistic in a useful way for constructing a one-sided test in a 2 × t contingency table (see Chapter 6) where the t columns differ in an ordered way.
Example 5.3
We are interested in a gene a that we believe causes a variety of developmental defects when present in the homozygous state in the animal. The effect of the gene is pleiotropic but generally doesn't interfere with our ability to produce Aa and aa progeny. A number of individuals of each genotype are evaluated by an outside observer (who doesn't know the genotype) and classified according to whether the animals are normal, or defective to a mild, moderate, or severe extent. We obtain the following data
Genotype | Defect | ||||
None | Mild | Moderate | Severe | Sum | |
Aa | 2 | 10 | 4 | 2 | 18 |
aa | 0 | 3 | 7 | 10 | 20 |
Sum | 2 | 13 | 11 | 12 | 38 |
We could think of these data in terms of assigning a value of 0, 1, 2, or 3 to each animal according to the severity of its defect. The midranks for the individuals in each category would be
None | Mild | Moderate | Severe | |
Midrank | 1.5 | 9 | 21 | 32.5 |
We can compute the test statistic
WX = 2(1.5) + 10(9) + 4(21) + 2(32.5) = 242
Using the normal approximation (and correcting for the ties), W* = -3.35, or P<0.0004.
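A sketch of this midrank calculation in Python, using the tie-corrected variance from section 5.2.2 (the variable names are ours):

```python
from math import sqrt

row_Aa = [2, 10, 4, 2]     # counts for Aa: None, Mild, Moderate, Severe
row_aa = [0, 3, 7, 10]     # counts for aa

totals = [a + b for a, b in zip(row_Aa, row_aa)]
N = sum(totals)

# Midrank of each ordered category
midranks, start = [], 0
for t in totals:
    midranks.append(start + (t + 1) / 2)     # 1.5, 9, 21, 32.5
    start += t

n, m = sum(row_Aa), sum(row_aa)
w_x = sum(c * r for c, r in zip(row_Aa, midranks))        # 242
e_w = n * (N + 1) / 2                                     # 351
tie = sum(t * (t**2 - 1) for t in totals) / (N * (N - 1))
v_w = n * m / 12 * ((N + 1) - tie)                        # tie-corrected variance
print((w_x - e_w) / sqrt(v_w))                            # about -3.35
```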
Alternative approximations: While the normal approximation performs very well in estimating moderately small P-values and for similar sample sizes in the two groups, it becomes highly conservative for small (<<0.01) P-values. Hodges et al. (1990) have provided a correction, based on an Edgeworth expansion, to the normal approximation that greatly improves performance at small P-values. Note that when this approximation is used, a continuity correction should be applied, even when there are tied values. Widely disparate sample sizes may also lead to conservatism in the usual normal approximation. Such situations arise in gene-set analysis of microarray data, in which the two sample sizes may be 5-20 versus thousands. Fang et al. (2012) showed that a uniform approximation performs better than the normal approximation under such circumstances.
5.3. Analysis of Paired Data
In some experiments we would like to analyze a treatment effect under conditions where we expect that uninteresting sources of variation may also operate. We can test for such a treatment effect in spite of confounding variation by blocking the data according to the suspected confounding variables. A simple case of this approach is provided by Problem 7 in section 2.5.
Each of the n samples in this type of experiment consists of two observations (Xi, Yi) (i=1…n) and we are interested in testing the null hypothesis
H0: Zi = Yi - Xi = 0
against an appropriate one-sided or two-sided alternative. The Wilcoxon signed-rank test (Wilcoxon, 1945) provides a means of considering this hypothesis. For the one-sided alternative that the Yi are larger than the Xi, we would expect that the magnitudes of the positive deviations (Zi) would generally be larger than the negative deviations. Under the null hypothesis, each Zi is equally likely to be positive or negative; thus, there are 2ⁿ equally likely assignments of signs to the ranks of the magnitudes of the Zi. As before, we can define a test statistic based on the sum of a subset of the ranks (e.g., the positive Zi) and determine the null distribution of this test statistic by adding up the appropriate permutations.
5.3.1. Signed rank statistic.
To compute the statistic, determine the values of the n differences Zi, and rank the absolute values of these differences from 1 to n to give Ri. If, for example, the number of negative ranks is smaller, compute the statistic as the sum of the ranks for which Zi<0.
Ws = Σ Ri for all Zi<0.
The significance level can then be determined by consulting the table in Appendix 5 or by using the normal approximation given below. Samples for which Zi=0 are dropped from the analysis, reducing the value of n. In the case of tied values, midranks are used and the approximate procedure below should be used for determination of the significance level.
Large sample approximation and correction for ties: An alternative method for computing the signed rank statistic, suggested by Conover (1999), is convenient when the sample size is large or when ties are present in the Zi. As above, samples for which Zi=0 are discarded, leaving n nonzero observations. The n samples are ranked according to the magnitude of the difference, |Zi|, and for each sample, the value Ri is assigned as the above rank, but given the sign of the corresponding Zi. The test statistic
T = ΣRi / √(ΣRi²)
follows a standard normal distribution.
Example 5.4
Consider an experiment in which the level of transcription for a test plasmid is compared for a series of cell lines that have been constructed to express (or not express) a putative transactivating protein. Because we are generally interested in the ratio of expression, we will define the Zi as log(Yi)-log(Xi). We obtain the following data
RNA (Arbitrary units) | |||||
Line | -transact. | +transact. | Zi | Rank | Ri |
1 | 20 | 30 | +0.176 | 3 | +3 |
2 | 14 | 35 | +0.397 | 5 | +5 |
3 | 47 | 46 | -0.009 | 1 | -1 |
4 | 5 | 50 | +1.0 | 7 | +7 |
5 | 11 | 9 | -0.087 | 2 | -2 |
6 | 6 | 18 | +0.477 | 6 | +6 |
7 | 8 | 16 | +0.301 | 4 | +4 |
Because there are fewer negative values, we determine the sum of the ranks for Zi<0 as
Ws = 1 + 2 = 3.
Entering the table in the appendix with the value 3, we obtain a P-value of 0.039.
Using the large sample approximation, we would obtain
T = 1.859 and a one-sided P-value of approximately 0.031.
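A sketch of this large-sample calculation (no ties here, so simple ranks suffice; midranks would replace them otherwise):

```python
from math import log10, sqrt

minus = [20, 14, 47, 5, 11, 6, 8]     # -transactivator
plus  = [30, 35, 46, 50, 9, 18, 16]   # +transactivator

z = [log10(y) - log10(x) for x, y in zip(minus, plus) if y != x]  # drop Zi = 0

# Rank |Z| from 1..n, then attach the sign of the corresponding Z
order = sorted(range(len(z)), key=lambda i: abs(z[i]))
r = [0] * len(z)
for rank, i in enumerate(order, start=1):
    r[i] = rank if z[i] > 0 else -rank

print(sum(r) / sqrt(sum(ri * ri for ri in r)))   # T = 22/sqrt(140) = 1.86
```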
5.4. Multiple Samples
The discussion above focused on simple experimental designs in which we were interested in testing for differences in location between two independent samples. Often our experiments will consist of three or more samples representing, e.g., expression data for a series of mutant constructs or responses of animals to several different dose levels of a drug. We will discuss the problem of making inferences regarding various pair-wise combinations of treatment groups in Chapter 9. Tests based on sample ranks for the hypothesis that all of the groups in an experiment show the same response are described below.
5.4.1. A multisample test against a general alternative.
Our data consist of k samples, each with ni observations with values xij (i = 1 … k, j = 1 … ni). The total number of observations in our experiment is N, the sum of the ni. We want to test the null hypothesis that all of the samples are taken from the same population against the alternative that at least one sample is drawn from a population with a different location from the others. The Kruskal-Wallis test is based on jointly ranking all N observations and determining whether the ranks are randomly distributed across the k samples (Kruskal and Wallis, 1952).
For each sample, i, compute the sum of the ranks
Ri = Σrij (j = 1…ni)
where rij is the rank of observation xij (1 ≤ rij ≤ N). When no ties are present among the xij, the Kruskal-Wallis test statistic, H, is
H = [12/(N(N+1))] × Σ(Ri²/ni) - 3(N+1)
which follows a χ² distribution with k-1 degrees of freedom. If some of the xij are tied, mid-ranks are used for the rij and Ri is computed as above. The test statistic, H, is
H = [Σ(Ri²/ni) - N(N+1)²/4] / S²
where
S² = [Σrij² - N(N+1)²/4] / (N-1)
and the sum in S² runs over all N observations. Unless the number of ties is large, there will be little difference in the results for the two formulas for H.
Example 5.5
We have been studying the genetics of blood pressure in rats and have become concerned that the various diets used by our group and other labs working on hypertensive rats may affect the measured blood pressure. To test this hypothesis, we place groups of rats on three different diets and, after several weeks, measure the blood pressure for animals from each group. Our results are (mm Hg)
Diet 1
139 167 132 144 129 113 126 153 174 127 149 133
Diet 2
126 93 147 154 98 128 126 132 169 133
Diet 3
116 152 143 125 141 154 152 147 159 141 115 113
171 114 140 139 149 96 140 152
The ranks for these observations (using mid-ranks for tied values) are
Diet 1
20.5 39 16.5 27 15 4.5 11 35 42 13 30.5 18.5
Diet 2
11 1 28.5 36.5 3 14 11 16.5 40 18.5
Diet 3
8 33 26 9 24.5 36.5 33 28.5 38 24.5 7 4.5 41 6 22.5 20.5
30.5 2 22.5 33
The sums of the ranks for the three diets are 272.5, 180, and 450.5, respectively. The Kruskal-Wallis test statistic is approximately H = 1.07.
Using the χ² distribution with 2 degrees of freedom, we obtain a P-value of 0.58 and conclude that the blood pressure values do not differ for these three diets.
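Library implementations are convenient here; for example, scipy.stats.kruskal computes H with the tie correction applied automatically. A sketch with the diet data:

```python
from scipy import stats

diet1 = [139, 167, 132, 144, 129, 113, 126, 153, 174, 127, 149, 133]
diet2 = [126, 93, 147, 154, 98, 128, 126, 132, 169, 133]
diet3 = [116, 152, 143, 125, 141, 154, 152, 147, 159, 141,
         115, 113, 171, 114, 140, 139, 149, 96, 140, 152]

h, p = stats.kruskal(diet1, diet2, diet3)   # tie-corrected H
print(h, p)                                 # H about 1.07, P about 0.59
```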
5.4.2. A multisample test against an ordered alternative.
In many experiments there is a natural ordering among the treatment groups, e.g., in the case of a dose-response. In this case, we obtain data for s groups, each consisting of ni (i=1…s) observations and wish to test the null hypothesis
H0: F(X1) = F(X2) = F(X3) = … = F(Xs)
against the one-sided alternative (for example)
H1: F(X1) ≥ F(X2) ≥ F(X3) ≥ … ≥ F(Xs)
(where at least one of the inequalities holds).
The Jonckheere-Terpstra test can be performed for this set of hypotheses as follows (Jonckheere, 1954):
Define the value
Uij = number of pairs for which (Xia < Xjb), where a=1…ni and b=1…nj
(Recall that this is the Mann-Whitney form of the rank sum statistic for the pair of samples i and j.)
The Jonckheere-Terpstra statistic is computed using the s(s-1)/2 values Uij for i<j as
U = ΣUij (summed over all pairs with i<j)
Since the distribution of U depends on the pattern of the ni, it is most convenient to use the usual normal approximation
U* = [U - E(U)] / √V(U)
where
E(U) = 0.25×(N² - Σni²)
V(U) = [N²(2N+3) - Σni²(2ni+3)] / 72
N is the total number of observations and the summation is over i=1 … s.
In the case of tied observations, add 0.5 × the number of ties in each i,j comparison to the value of Uij. If there are g groups of tied values, each of size tk, the modified formula for the variance of U is
V(U) = [N(N-1)(2N+5) - Σni(ni-1)(2ni+5) - Σtk(tk-1)(2tk+5)] / 72
+ [Σni(ni-1)(ni-2)] × [Σtk(tk-1)(tk-2)] / [36N(N-1)(N-2)]
+ [Σni(ni-1)] × [Σtk(tk-1)] / [8N(N-1)]
which reduces to the formula above when there are no ties.
As a final note, this test is most powerful when the treatment groups are, in some sense, equally spaced.
Example 5.6
We are studying a mutant gene that we suspect has a quantitative effect on male fertility. We collect a number of males (3 or 4) having 2, 1, or 0 copies of the mutant allele and mate each to 3 females. The observation for each male is the total number of progeny obtained in the matings.
i | Number of mutant alleles | Total number of progeny |
1 | 2 | 16,8,6 |
2 | 1 | 27,16,15 |
3 | 0 | 31,29,18,42 |
We compute the values Uij as
U12 = 7.5
U13 = 12
U23 = 11
For this example U = 7.5 + 12 + 11 = 30.5, E(U) = 16.5, and V(U) = 27.25. Thus, U* = 2.69, giving a significance level of P<0.0036.
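The Jonckheere-Terpstra test is not part of every statistics library, but it is short to code. A sketch with the data above (ties in the Uij count 0.5, and the untied variance is used, as in the example):

```python
from math import sqrt
from statistics import NormalDist

groups = [
    [16, 8, 6],           # 2 mutant alleles
    [27, 16, 15],         # 1 mutant allele
    [31, 29, 18, 42],     # 0 mutant alleles
]

def u_pair(x, y):
    """Mann-Whitney count of pairs with x < y; ties count 0.5."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0
               for a in x for b in y)

u = sum(u_pair(groups[i], groups[j])
        for i in range(len(groups)) for j in range(i + 1, len(groups)))

n = [len(g) for g in groups]
N = sum(n)
e_u = (N**2 - sum(k**2 for k in n)) / 4
v_u = (N**2 * (2 * N + 3) - sum(k**2 * (2 * k + 3) for k in n)) / 72

u_star = (u - e_u) / sqrt(v_u)          # (30.5 - 16.5)/sqrt(27.25) = 2.68
print(1 - NormalDist().cdf(u_star))     # one-sided P about 0.0036
```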
5.5. Tests for Differences in Dispersion
In some cases, you may be more interested in differences between two groups in the degree of variability in the data, independent of whether the two groups may also differ in location. For example, you are comparing two different assay methods and want to know which of the two gives results with greater precision. The most commonly used tests for differences in variance between two groups are based on the F distribution. However, such tests are very sensitive to non-normality in the data sets and may not be suitable for the types of data in our experiments. The Ansari-Bradley test provides a distribution-free alternative, but requires that the medians for the two groups are either identical or known. Below we discuss two distribution-free tests that do not make assumptions about the location of the data sets, a two-sample test due to Miller (1968) based on the jackknife method, and a multi-sample test due to Conover (1999) based on the squared ranks of the distances to the median for the data in each group. It should be noted that these tests for differences in dispersion will typically require larger sample sizes than tests for differences in location to achieve reasonable power. For example, at least 20 samples per group would be required to achieve 90% power for two groups with a five-fold difference in variance.
5.5.1. Miller jackknife test for two samples.
The starting point for Miller's (1968) application of the jackknife to testing for differences in variance was the utility of that method in problems involving estimation of ratios. In particular, given two data sets, Xi and Yj, we want to test the null hypothesis that the ratio γ² of variances, V(X)/V(Y), is equal to 1 against either one- or two-sided alternatives. A more complete discussion of this test is in Miller's original paper and in Hollander et al. (2013); we will follow the notation used in the latter, below.
We have two data sets. The first group consists of m observations Xi (i=1…m) and the second consists of n observations Yj (j=1…n). We assume that the X and Y observations are independent and come from continuous distributions with finite fourth moments. We want to test the null hypothesis that the variance ratio is equal to 1
H0: γ² = 1
against the one-sided alternatives
H1: γ² < 1 or γ² > 1
or the two-sided alternative
H2: γ² ≠ 1
For each group, compute the "drop-one" estimates of the log variance. Let Di² be the sample variance of the m-1 X observations remaining when Xi is removed, and Ej² the sample variance of the n-1 Y observations remaining when Yj is removed, and define
Si = ln(Di²) and Tj = ln(Ej²)
where S0 = ln[V(X)] and T0 = ln[V(Y)] are the corresponding log variances for the full data sets. Next compute the jackknifed pseudovalues
Ai = mS0 - (m-1)Si
Bj = nT0 - (n-1)Tj
along with their means, mA and mB, and sample variances, V(A) and V(B). The test statistic, Q, is
Q = (mA - mB) / √[V(A)/m + V(B)/n]
Q follows a standard normal distribution; the null hypothesis is rejected when Q exceeds zα for a one-sided test, or when |Q| exceeds zα/2 for a two-sided test.
An estimate of the variance ratio for the two groups is
γ² = exp(mA - mB)
Example 5.7
We are interested in comparing two mouse strains for their reproductive potential. For each strain, we set up 10 independent matings and count the number of live born offspring in each litter. Our results are:
Strain A: 12 12 12 10 12 13 11 11 10 13
Strain B: 6 3 3 2 10 7 4 5 7 3
Based on the Wilcoxon rank sum test, there is indeed a significant difference between the two strains in the number of offspring per litter, with P(two-sided)≈3×10-5. Do the strains also differ in the variability of litter sizes?
The sample sizes, m and n, are both equal to 10. The values for the variances for the drop-one subsets are
Di2: 1.278 1.278 1.278 0.944 1.278 1.028 1.25 1.25 0.944 1.028
Ej2: 6.861 6.444 6.444 5.75 3.528 6.444 6.861 7 6.444 6.444
The logs of the subset variances are
Si: 0.2451 0.2451 0.2451 -0.0572 0.2451 0.0274 0.2231 0.2231 -0.0572 0.0274
Tj: 1.9259 1.8632 1.8632 1.7492 1.2607 1.8632 1.9259 1.9459 1.8632 1.8632
The logs of the variances for the whole data sets are S0=0.1446 and T0=1.8281.
Jackknifed estimates for the logs of the variances are
Ai: -0.7603 -0.7603 -0.7603 1.9602 -0.7603 1.1992 -0.5625 -0.5625 1.9602 1.1992
Bj: 0.9485 1.5123 1.5123 2.5385 6.9353 1.5123 0.9485 0.7681 1.5123 1.5123
with means (mA, mB) of 0.2153 and 1.9700 and with V(A)/m and V(B)/n equal to 0.1449 and 0.3284, respectively. Our test statistic Q is -2.55 with a two-sided P-value of 0.01. The estimated variance ratio for the two strains is exp(0.2153 - 1.9700) = 0.17. We conclude that there is a significant difference between the two strains in the variability of litter sizes.
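A sketch of the jackknife computation for this example (the helper pseudovalues is our name, for illustration):

```python
from math import log, sqrt
from statistics import NormalDist, mean, variance

strain_a = [12, 12, 12, 10, 12, 13, 11, 11, 10, 13]
strain_b = [6, 3, 3, 2, 10, 7, 4, 5, 7, 3]

def pseudovalues(x):
    """Jackknife pseudovalues k*S0 - (k-1)*Si for the log sample variance."""
    k = len(x)
    s0 = log(variance(x))                        # log variance, full sample
    return [k * s0 - (k - 1) * log(variance(x[:i] + x[i + 1:]))
            for i in range(k)]

a, b = pseudovalues(strain_a), pseudovalues(strain_b)
q = (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))
print(q)                                         # about -2.55
print(2 * NormalDist().cdf(q))                   # two-sided P about 0.01
```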
5.5.2. Conover test for multiple samples.
To test for differences in location among multiple groups, the Kruskal-Wallis test jointly ranks all of the observations and determines whether the ranks are randomly distributed among the groups. To examine differences in dispersion, it would make sense to similarly rank deviations between the observations and the means or medians of each group. Conover (1999) has proposed such a test, with the variation that squared ranks are used in order to improve the power of the test.
We have a set of observations, Xij, where i=1…k are the k groups and j=1…ni are the samples within each group. For each group, convert the observations to the distance from the group mean or median
Uij = |Xij - xmi|
where xmi is either the mean or the median for group i. We will use the median value because Conover et al. (1981) found that use of the median improved the power of the test, particularly for skewed distributions. Jointly rank all N = Σni of the Uij values to obtain ranks Rij, using mid-ranks for tied values. For each group, compute the sum of the squared ranks
Si = ΣRij² (j = 1…ni)
The test statistic, T2, is
T2 = [Σ(Si²/ni) - N×Sm²] / D²
where Sm is the mean of the squared ranks for all N observations,
Sm = (1/N) × ΣΣRij²
and
D² = [ΣΣRij⁴ - N×Sm²] / (N-1)
This test statistic, T2, follows a χ² distribution with k-1 degrees of freedom.
Example 5.8
We are developing an assay for the presence of a respiratory virus in saliva samples. We plan to use three target sequences for PCR analysis and want to be sure that the targets exhibit similar variability in the results. We add a fixed amount of viral nucleic acid to 20 replicate control samples and measure the Ct value at which a signal is detected. Our results are
N gene (i=1): 28.5 28.6 28.5 28.7 29.7 29.3 29.3 29.2 29.8 29.5 29.4 29.3 27.1 27.1 27.1 27 28.4 28.4 28.6 28.9
S gene (i=2): 29.9 29.8 29.9 29.6 30.9 31 31.1 31.5 31.5 31.7 31 30.9 28.1 28.3 28.5 28.1 29.6 29.6 29.8 30
Orf1 gene (i=3): 29.3 29.3 29.5 29.2 30.9 30.4 30.5 30.1 30.7 30.8 30.8 31 27.5 27.9 27.8 27.7 29.4 29.3 29.1 30.2
The median values for our three targets are 28.65, 29.9, and 29.45, respectively. The absolute differences between the Ct values and the medians for each group are
U1j: 0.15 0.05 0.15 0.05 1.05 0.65 0.65 0.55 1.15 0.85 0.75 0.65 1.55 1.55 1.55 1.65 0.25 0.25 0.05 0.25
U2j: 0 0.1 0 0.3 1 1.1 1.2 1.6 1.6 1.8 1.1 1 1.8 1.6 1.4 1.8 0.3 0.3 0.1 0.1
U3j: 0.15 0.15 0.05 0.25 1.45 0.95 1.05 0.65 1.25 1.35 1.35 1.55 1.95 1.55 1.65 1.75 0.05 0.15 0.35 0.75
Jointly ranking all 60 observations, using mid-ranks for tied values, we obtain
R1j: 13 3.5 13 6 35.5 26.5 26.5 24 39 31 29.5 26.5 47 47 47 54.5 17.5 17.5 3.5 17.5
R2j: 1.5 8.5 1.5 21 33.5 37.5 40 52.5 52.5 59 37.5 33.5 57.5 51 44 57.5 21 21 8.5 10
R3j: 13 13 6 17.5 45 32 35.5 26.5 41 42.5 42.5 49.5 60 49.5 54.5 56 6 13 23 29.5
The sums of the squared ranks for our three targets are 18209.8, 28372, and 27196.2, respectively. The mean of the squared ranks across all 60 observations is 1229.63. Computing D² as 1.20721×10⁶, we obtain our test statistic,
T2 = 2.56,
which follows a chi-square distribution with 2 degrees of freedom. Thus, our P-value is 0.28 and we conclude that the variability of the results for the three targets is not significantly different.
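A sketch of the squared-ranks computation (the function name is ours; deviations are rounded before ranking so that values that should tie are not separated by floating-point error):

```python
from statistics import median

def squared_ranks_T2(groups):
    """Conover's squared-ranks statistic T2 (chi-square, k-1 df)."""
    # Absolute deviations from each group's median
    u = [[round(abs(x - median(g)), 6) for x in g] for g in groups]

    # Joint ranking with midranks for ties
    flat = sorted(v for g in u for v in g)
    def midrank(v):
        lo = flat.index(v) + 1
        return (lo + lo + flat.count(v) - 1) / 2
    r = [[midrank(v) for v in g] for g in u]

    n = [len(g) for g in r]
    N = sum(n)
    s = [sum(x**2 for x in g) for g in r]         # sums of squared ranks, Si
    sm = sum(s) / N                               # mean squared rank, Sm
    d2 = (sum(x**4 for g in r for x in g) - N * sm**2) / (N - 1)
    return (sum(si**2 / ni for si, ni in zip(s, n)) - N * sm**2) / d2

n_gene = [28.5, 28.6, 28.5, 28.7, 29.7, 29.3, 29.3, 29.2, 29.8, 29.5,
          29.4, 29.3, 27.1, 27.1, 27.1, 27.0, 28.4, 28.4, 28.6, 28.9]
s_gene = [29.9, 29.8, 29.9, 29.6, 30.9, 31.0, 31.1, 31.5, 31.5, 31.7,
          31.0, 30.9, 28.1, 28.3, 28.5, 28.1, 29.6, 29.6, 29.8, 30.0]
orf1   = [29.3, 29.3, 29.5, 29.2, 30.9, 30.4, 30.5, 30.1, 30.7, 30.8,
          30.8, 31.0, 27.5, 27.9, 27.8, 27.7, 29.4, 29.3, 29.1, 30.2]

print(squared_ranks_T2([n_gene, s_gene, orf1]))   # about 2.56
```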
5.6. Sample Problems
- There has been concern for some time that nitrite present in cured meats
may result in the formation of carcinogenic N-nitrosamines in the
stomach by reaction with amines in the diet. It is possible to safely
measure the amount of nitrosamine production in humans by administering
a large dose of proline and measuring the level of N-nitroso-proline, a
non-toxic reaction product, excreted in the urine. You decide to test
the hypothesis that eating a reasonable amount of a cured meat will
increase the amount of nitrosation taking place in the stomach in the
following way. Each of 6 subjects, who have been asked to avoid cured
meats for the previous week, is given a dose of proline and the
excretion of N-nitroso-proline is measured for 24 hours. Each subject is
then fed a breakfast including 6 strips of bacon and a dose of proline;
excretion of the nitrosated amino acid is again measured for 24 hours.
You obtain the data listed below in the form (basal μg excreted, μg
excreted after bacon).
(3.1,2.8) (2.0,6.1) (5.1,4.5) (4.2,2.9) (8.5,8.9) (1.1,6.0)
- The recessive mutation bg (beige) in mice results in a decrease in
the activity of natural killer cells in homozygous mutant mice. You test
the hypothesis that this cell type plays an important role in rejection
of tumor cells (derived from a tumor induced in an isogenic animal) by
injecting animals with genotypes bg/bg or bg/+ with
10⁶ cells and
measuring the time required for the development of a palpable tumor in
each animal. You obtain the following data:
| Genotype | Time to tumor development (days) |
| bg/+ | 28, 51, 47, 80 |
| bg/bg | 14, 35, 17, 37, 16 |
- You have obtained data for four samples as follows:
| Sample 1 | 32 24 28 40 0 15 20 55 |
| Sample 2 | 20 36 5 15 30 25 |
| Sample 3 | 35 25 130 160 820 825 |
| Sample 4 | 116 220 405 410 550 735 835 705 |
- You are studying a viral oncogene that induces anchorage-independent
growth when expressed in fibroblasts. In order to map the functional
domains within that oncogene, you construct a series of mutants and
compare the abilities of the mutants to induce anchorage independent
growth with that of the wild type gene. After cotransfecting the wild
type or mutant oncogene with an antibiotic resistance marker into
fibroblasts and selecting for antibiotic resistant cells, you plate
10⁴ cells in agarose in
each of a series of dishes. After allowing time for growth, you obtain
the following numbers of colonies per dish.
| wild type | 120 310 440 413 500 |
| mutant | 48 15 120 15 23 |
- You have tested a particular chemical for its mutagenic activity in
mammalian cells. In your experiment, you treated cells with various
doses of the agent, allowed the cells to grow for a few generations and
then plated 10⁶ cells
in the presence of 6-thioguanine to select for hprt mutants.
After two weeks, you counted the mutant colonies. After doing the
experiment several times, you obtained the following data:
Dose Mutants/106 cells 0 2,0,5,3,1 10 3,9,8,8 20 17,41,12,5 40 80,79,44,56