5. Testing for Differences in Location or Dispersion
The most commonly encountered statistical problem is that of testing for differences in location between measurements made on two independent populations. We are accustomed to thinking of this problem as testing for a difference in the mean values for the two populations, but we could also phrase the problem in a slightly different way. For example, we might ask whether a randomly chosen member of population 1 is likely to be larger in magnitude than a randomly chosen member of population 2. The classical normal theory test for a difference in the mean (μ) of two normally distributed random variables is the Student t-test, which defines a test statistic
t = (m1 - m2) / √[sp²(1/n1 + 1/n2)], with sp² = [(n1-1)s1² + (n2-1)s2²] / ν
where mi is the mean for group i, si² is the variance, ni is the number of observations, and ν, the number of degrees of freedom, is n1+n2-2. The null hypothesis, that μ1 is equal to μ2, is rejected for large or small (negative) values of the statistic. Although
this familiar statistic is easy to compute, the two assumptions required for
using this test are often violated by data from biological experiments.
First, the test is quite sensitive to "heavy tails" in the
distribution for the data. The frequency of observations for normally
distributed data falls off rapidly as the value becomes much less than or
greater than the mean value, making the test sensitive to outlier or extreme
values. Second, the assumption that the variances for the two populations
are equal and essentially independent of the mean values for the populations
is contrary to the dependence of the variance on the mean for many simple
distributions (e.g., Poisson) of biological interest.
5.1. Permutation Tests
The second version of the location question posed above, that randomly chosen members of population 1 are likely to be larger than those from population 2, provides the rationale for an alternative approach based on examining all possible permutations of the observed data set.
Consider the following small experiment, in which we make three independent observations under each of two conditions:
Group A: 29, 52, 49
Group B: 15, 36, 18
We want to test the null hypothesis that there is no difference in "location" between the two treatments against the one-sided alternative that the observations in group A are larger than those in group B. We observed that the mean of group A (mA) is 43.3 and that for group B is 23; thus, we could define a statistic for testing our hypothesis, D = mA - mB, which is equal to 20.3 for the above case.
A (mA) | B (mB) | D | A (mA) | B (mB) | D |
29,52,49 (43.3) | 15,36,18 (23) | 20.3 | 52,49,15 (38.7) | 29,36,18 (27.7) | 11 |
29,52,15 (32) | 49,36,18 (34.3) | -2.3 | 52,49,36 (45.7) | 29,15,18 (20.7) | 25 |
29,52,36 (39) | 49,15,18 (27.3) | 11.7 | 52,49,18 (39.7) | 29,15,36 (26.7) | 13 |
29,52,18 (33) | 49,15,36 (33.3) | -0.3 | 52,15,36 (34.3) | 29,49,18 (32) | 2.3 |
29,49,15 (31) | 52,36,18 (35.3) | -4.3 | 52,15,18 (28.3) | 29,49,36 (38) | -9.7 |
29,49,36 (38) | 52,15,18 (28.3) | 9.7 | 52,36,18 (35.3) | 29,49,15 (31) | 4.3 |
29,49,18 (32) | 52,15,36 (34.3) | -2.3 | 49,15,36 (33.3) | 29,52,18 (33) | 0.3 |
29,15,36 (26.7) | 52,49,18 (39.7) | -13 | 49,15,18 (27.3) | 29,52,36 (39) | -11.7 |
29,15,18 (20.7) | 52,49,36 (45.7) | -25 | 49,36,18 (34.3) | 29,52,15 (32) | 2.3 |
29,36,18 (27.7) | 52,49,15 (38.7) | -11 | 15,36,18 (23) | 29,52,49 (43.3) | -20.3 |
Under the null hypothesis that the two sets of observations come from the
same population, we could randomly draw any three of the six observations
and label them as "A". The distribution of our test statistic,
D, can be computed for all
permutations of the data set. The significance level for our hypothesis test
can then be obtained by dividing the number of permutations for which
D≥20.3 by the total number of permutations, 20. The 20 permutations of
our data set, and the corresponding value of our test statistic, are
enumerated in the table above. From the table, you can see that only 2
permutations give test statistics that equal or exceed our observed value,
20.3. Thus, the P-value for our one-sided statistical test is 0.1. If
we had been testing against a two-sided alternative hypothesis, we would
count the number of permutations for which |D| ≥ 20.3 (4 in this case).
The above permutation test has the advantage that it provides an exact test of our hypothesis without requiring us to make any assumptions about the underlying model that generated the data (e.g., the assumption of normality in the t-test). This approach can be applied quite generally as long as we can define an appropriate test statistic. For example, we might be interested in comparing two analytical methods for their precision in measuring some parameter and we've made multiple observations using each method for the same sample. To test for a difference in precision, we could define a test statistic D=V(A)-V(B), where V(x) is the variance of x.
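As a concrete illustration, here is a minimal sketch of the full enumeration in Python (our choice of language for illustration; any language would do), using only the standard library and the small data set above:

```python
from itertools import combinations

# Pooled observations: Group A = 29, 52, 49; Group B = 15, 36, 18
data = [29, 52, 49, 15, 36, 18]
observed_d = (29 + 52 + 49) / 3 - (15 + 36 + 18) / 3   # observed D = 20.33

labelings = list(combinations(range(6), 3))  # all 20 ways to label 3 values as "A"
count = 0
for idx in labelings:
    a = [data[i] for i in idx]
    b = [data[i] for i in range(6) if i not in idx]
    d = sum(a) / 3 - sum(b) / 3
    if d >= observed_d - 1e-9:               # tolerate floating-point error
        count += 1

print(count / len(labelings))                # one-sided P-value: 2/20 = 0.1
```

The same skeleton serves for any statistic; to test for a difference in precision, replace the difference of means with the difference of variances.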
5.1.1. Monte Carlo methods.
A major disadvantage to the permutation test above is that the magnitude of the computational problem scales up very rapidly with the sample size. The numbers of permutations that we would have to enumerate for two groups of 3, 5, 10, and 20 observations each are 20, 252, 184756, and 1.37847×10¹¹, respectively.
The complete set of permutations defines the sample space for our statistical test under the null hypothesis. One way of thinking about our hypothesis test is that we want to know what fraction of the points in the sample space would provide a test statistic that equals or exceeds our observed result. In the case that the sample space is very large, it is possible to estimate this fraction (our P-value) by simply examining a suitably large number of randomly chosen points within the sample space. This approach falls under the general heading of "Monte Carlo" methods. Given access to a desktop computer and with a little programming, this approach can be used as generally as the strict permutation test described above.
Example 5.1
We are studying the induction of cell proliferation in a target tissue following treatment of animals with a particular chemical. We treat groups of 14 and 13 animals with the agent or solvent vehicle, respectively, along with BUdR to label replicating cells. Animals are sacrificed, the tissue is prepared, and we examine 10,000 cells from each animal and count the number of labeled cells. We obtained the following data:
Treated: 563, 504, 837, 262, 435, 283, 218, 1296, 1310, 1311, 658, 426, 794, 297
Control: 231, 79, 290, 119, 346, 493, 349, 299, 747, 121, 109, 204, 114.
The means and standard deviations for the treated and control groups are 657±399 and 269±189, respectively.
We want to test the null hypothesis that the levels of proliferation in treated and control animals are the same against the one-sided alternative that the treated animals show a higher labeling index. We can define a test statistic, D=Mtreated-Mcontrol, as the mean of the treated group minus the mean of the control group.
Our sample space under the null hypothesis consists of more than 20 million points. We decide to use a Monte Carlo approach and run 100,000 random trials on our data set and compute the value of D for each (at the cost of about a half hour of programming and one minute of computation). The complement (upper tail) of the cumulative distribution for the test statistic is shown in the figure below. Our observed test statistic is 388 and the fraction of trials for which D ≥ 388 is approximately 0.001, which is our P-value for the statistical test.
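A sketch of such a Monte Carlo computation, using Python's standard random module and the data from this example:

```python
import random

treated = [563, 504, 837, 262, 435, 283, 218, 1296, 1310, 1311, 658, 426, 794, 297]
control = [231, 79, 290, 119, 346, 493, 349, 299, 747, 121, 109, 204, 114]

pooled = treated + control
n_t = len(treated)
observed_d = sum(treated) / len(treated) - sum(control) / len(control)  # ~388

trials = 100_000
exceed = 0
for _ in range(trials):
    random.shuffle(pooled)          # a random point in the permutation sample space
    d = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / (len(pooled) - n_t)
    if d >= observed_d:
        exceed += 1

print(exceed / trials)              # estimated one-sided P-value, ~0.001
```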
5.2. Wilcoxon Rank Sum Test
The permutation tests described above require that we define (or approximate) a unique distribution for our test statistic for every experiment. A much more convenient approach is to base the statistical tests on the ranks (or ordered magnitudes) of the observations rather than on the observed values. As will become evident, this approach requires very few assumptions concerning the distribution(s) followed by the data and hence the methods are often referred to as distribution-free or nonparametric tests. A further advantage to tests based on ranks is that they are relatively insensitive to the presence of extreme values (outliers) in the samples.
5.2.1. Rationale and a simple example.
Consider an experiment to test the ability of a particular gene to confer the property of anchorage-independent growth on a mammalian cell line that normally has a low frequency of colony formation when cultured in semisolid medium. Two plasmid constructions, "A", which contains an antibiotic resistance marker, and "B", which is identical to the "A" plasmid except that it also includes the test gene, are each introduced into populations of cells. Those cells that express antibiotic resistance (and hence carry the A or B DNA) are selected and samples of 10⁴ cells are plated in semisolid medium. In this small experiment, 4 independent plates of "A" cells and 2 of "B" cells are scored for the number of colonies after a suitable period of time. The following data (number of colonies / 10⁴ cells) are obtained:
A plasmid: 32 24 28 40
B plasmid: 116 120
The null hypothesis of interest is that the observations for the two groups of data come from identical populations against an alternative hypothesis that cells carrying the B plasmid are more likely to grow in agarose. If the null hypothesis were true, we might imagine that the second sample (B plasmid) could consist of any 2 of the 6 observations made in the experiment and that all possible permutations of the data would be equally likely.
Because we are interested only in the possible permutations of the observations between the two groups, the actual values are irrelevant and we can consider just the ranks of the observations (from 1 to 6, smallest to largest). Replacing the observed values by their ranks (keeping the data in the same order as above) gives:
A plasmid: 3 1 2 4
B plasmid: 5 6
It is convenient to have a single value or statistic derived from the data that gives a consistent measure of the relative magnitude of the observations in the sample. The simplest such statistic would be the sum of the ranks for the sample; for example, 11 in the case of the B plasmid.
The 15 possible permutations of the data for the B plasmid sample, along with the statistic, are given in the table below. We can then tabulate the distribution under the null hypothesis for the rank sum statistic for the case of two samples of 4 and 2 observations. Each of the 15 permutations is equally likely (with frequency 1/15) under the null hypothesis.
Permutations
| B sample ranks | Rank sum |
| 6, 5 | 11 |
| 6, 4 | 10 |
| 6, 3 | 9 |
| 6, 2 | 8 |
| 6, 1 | 7 |
| 5, 4 | 9 |
| 5, 3 | 8 |
| 5, 2 | 7 |
| 5, 1 | 6 |
| 4, 3 | 7 |
| 4, 2 | 6 |
| 4, 1 | 5 |
| 3, 2 | 5 |
| 3, 1 | 4 |
| 2, 1 | 3 |
Rank Sum Distribution | |
Sum | Frequency |
11 | 1/15 (0.067) |
10 | 1/15 (0.067) |
9 | 2/15 (0.133) |
8 | 2/15 (0.133) |
7 | 3/15 (0.2) |
6 | 2/15 (0.133) |
5 | 2/15 (0.133) |
4 | 1/15 (0.067) |
3 | 1/15 (0.067) |
In the case of our small experiment, the statistic for the observed data is equal to 11. From the distribution of the statistic under the null hypothesis, the probability of obtaining a rank sum this large (it cannot, in fact, be larger) is 0.067.
This approach to the two-sample location problem is generally referred to as the Wilcoxon rank sum test (Wilcoxon, 1945). Permutation tests for differences in location, such as the Wilcoxon rank sum test, have the advantage of requiring few assumptions concerning the distribution of the data obtained in an experiment. The broad applicability of these tests exacts a relatively small cost in power (that is, they generally require only a few more observations) when compared with parametric tests that are based on explicit knowledge of the form of the distribution of the data. In this chapter, we will provide a more formal description of the Wilcoxon rank sum test and discuss analogous approaches to testing for differences in location in blocked data and for multiple samples.
5.2.2. Wilcoxon rank sum test — formal description.
Data and assumptions: Observations are taken on two independent groups. The first group consists of n observations Xi (i=1…n) and the second consists of m observations Yj (j=1…m). Without loss of generality, we can assume that n≤m. Within each group the observations are assumed to be mutually independent and the Xi are assumed to come from a population with continuous cumulative distribution F(x) while the Yj are from a distribution F(y). We want to test the null hypothesis
H0: F(x) = F(y)
against either the two-sided alternative
H2: F(x) ≠ F(y)
or one-sided alternatives
H1: F(x) < F(y) or F(x) > F(y)
Test statistic: The n+m observations are jointly assigned ranks from 1 to (n+m). The test statistic WX is the sum of the ranks for the first group,
WX = R1 + R2 + … + Rn
where Ri is the rank of observation Xi.
The exact significance level for the appropriate alternative hypotheses can be obtained from the table in Appendix 4 as follows.
For the one-sided hypothesis F(x)<F(y) (i.e., the X's are larger) enter the appropriate place in the table with x=WX.
For the one-sided hypothesis F(x)>F(y) (i.e., the Y's are larger) enter the appropriate place in the table with the value x=[n(m+n+1)-WX].
The distribution of the test statistic under the null hypothesis is symmetrical, so the significance level for the two-sided test is most simply obtained by doubling the P-value for the appropriate one-sided alternative.
Alternative form of the statistic: Recall that one of the ways to phrase the location question is in terms of the probability that a randomly chosen X-value will be larger than a randomly chosen Y-value. A statistic equivalent to the one described above, the Mann-Whitney statistic (Mann and Whitney, 1947), can be defined as
UXY = number of pairs for which (Xi>Yj)
such that UXY/nm is the probability stated above. This is equivalent to the statistic given above since
WX = UXY + [n(n+1)/2]
Large-sample approximation: In cases where the number of observations exceeds the range of the table for the Wilcoxon rank sum statistic, we can take advantage of the fact that, for large n and m, the statistic WX is approximately normally distributed. The expected value for the statistic is
E(WX) = n(n+m+1)/2
while the variance is
V(WX) = nm(m+n+1)/12
We can thus define an approximate statistic
W* = [WX - E(WX)] / √V(WX)
where W* follows the standard normal distribution with mean of 0 and variance of 1. In the simple example given above, WX was 11, n was 2 and m was 4. Thus, the approximate statistic would be
W* = (11-7)/√4.67 = 1.85
From the table of the standard normal distribution, this would correspond to a P-value of 0.032 which rather overstates the true significance level of 0.067. Applying a continuity correction in this case (i.e., subtracting 0.5 from the numerator) improves the approximation and gives a P-value of 0.053. For values of n and m where both exceed 6 or so, the normal approximation is quite good, with or without a continuity correction.
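The large-sample computation is easy to script; a minimal sketch follows (the helper function name is ours, for illustration):

```python
from math import sqrt
from statistics import NormalDist

def rank_sum_normal_approx(w_x, n, m, continuity=True):
    """Upper-tail (one-sided) P-value for the rank sum W_X of the group of n observations."""
    e_w = n * (n + m + 1) / 2                 # E(WX)
    v_w = n * m * (n + m + 1) / 12            # V(WX), no ties
    num = w_x - e_w
    if continuity and num != 0:               # shift the numerator 0.5 toward 0
        num -= 0.5 if num > 0 else -0.5
    return 1 - NormalDist().cdf(num / sqrt(v_w))

print(rank_sum_normal_approx(11, 2, 4, continuity=False))  # ~0.032
print(rank_sum_normal_approx(11, 2, 4))                    # ~0.053
```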
Correction for tied observations: One of the assumptions made above was that the Xi and Yj came from continuous distributions. Thus, there is no chance that any two values will be identical (i.e., tied). However, all measurement data are at some level discrete, and some experiments necessarily give rise to discretely distributed data. In the case where two or more observations in the experiment have identical values, the "exact" significance level obtained from the table for the WX statistic is conservative. We could, of course, obtain an exact significance level by recomputing the null distribution for the test statistic as we did in our simple example, but in that case the result would be conditional on the pattern of tied values in the observations. The normal approximation described above remains unconditional in the presence of ties as long as we correct for the presence of tied groups of values.
Consider the following data set
Group 1: 5 8 10 12
Group 2: 4 0 5
We can rank the observations as before, except that average ranks (mid-ranks) will be assigned to tied values. The ranks are thus
Group 1: 3.5 5 6 7
Group 2: 2 1 3.5
The test statistic (Group 2) in this case is WX = 6.5. The expected value for the test statistic, E(WX), remains unchanged when some of the values are tied, but the value of the variance decreases (since the number of possible values of the statistic is reduced) by an amount depending on the number of tied values:
V(WX) = [nm/12] × [(n+m+1) - Σtk(tk²-1)/((n+m)(n+m-1))]
where the sum is over the g groups of tied values and tk is the number of observations in the kth tied group. Note that this is identical to the variance given above when there are no ties (every tk = 1, so the second term vanishes). We will follow Lehmann's (1998) lead and not apply the continuity correction to the normal approximation when the data set contains tied values.
Examples 5.2
A. Consider an experiment identical to the one for our transformation example above, but with a larger number of observations. The data are
A plasmid: 70 55 80 140
B plasmid: 116 220 405 410 550 735
We wish to test the null hypothesis that the two plasmids yield the same colony forming efficiency against the one-sided alternative that the plating efficiency of the cells carrying the A plasmid is smaller. The ranks for the observations are
A plasmid: 2 1 3 5
B plasmid: 4 6 7 8 9 10
Thus, the value of the test statistic is WX=11. We can enter the table using the value x=44-11=33, which indicates a P-value of 0.0095. Note that in this case, the normal approximation would give W*=-2.34 for an approximate P-value of 0.0095.
B. In Example 5.1, we used Monte Carlo methods to compare the BUdR labeling index for tissue from control animals and those treated with a test chemical. Ranking the observations for the two groups we obtain
Treated: 20, 19, 24, 9, 17, 10, 7, 25, 26, 27, 21, 16, 23, 12
Control: 8, 1, 11, 4, 14, 18, 15, 13, 22, 5, 2, 6, 3
The sum of the ranks for the treated group is 256. Using the normal approximation, we obtain
W* = (256-196)/√424.67 = 2.912
From Appendix 2, the one-sided P-value is 0.0018, similar to the significance level we obtained by our Monte Carlo approach.
We can use midranks and the approximate statistic in a useful way for constructing a one-sided test in a 2 × t contingency table (see Chapter 6) where the t columns differ in an ordered way.
Example 5.3
We are interested in a gene a that we believe causes a variety of developmental defects when present in the homozygous state in the animal. The effect of the gene is pleiotropic but generally doesn't interfere with our ability to produce Aa and aa progeny. A number of individuals of each genotype are evaluated by an outside observer (who doesn't know the genotype) and classified according to whether the animals are normal, or defective to a mild, moderate, or severe extent. We obtain the following data
Genotype | Defect | ||||
None | Mild | Moderate | Severe | Sum | |
Aa | 2 | 10 | 4 | 2 | 18 |
aa | 0 | 3 | 7 | 10 | 20 |
Sum | 2 | 13 | 11 | 12 | 38 |
We could think of these data in terms of assigning a value of 0, 1, 2, or 3 to each animal according to the severity of its defect. The midranks for the individuals in each category would be
None | Mild | Moderate | Severe | |
Midrank | 1.5 | 9 | 21 | 32.5 |
We can compute the test statistic
WX = 2(1.5) + 10(9) + 4(21) + 2(32.5) = 242
Using the normal approximation (and correcting for the ties), W* = -3.35, or P<0.0004.
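A sketch of this midrank calculation in Python, using the tie-corrected variance from section 5.2.2 (the variable names are ours):

```python
from math import sqrt

row_Aa = [2, 10, 4, 2]     # counts for Aa: None, Mild, Moderate, Severe
row_aa = [0, 3, 7, 10]     # counts for aa

totals = [a + b for a, b in zip(row_Aa, row_aa)]
N = sum(totals)

# Midrank of each ordered category
midranks, start = [], 0
for t in totals:
    midranks.append(start + (t + 1) / 2)     # 1.5, 9, 21, 32.5
    start += t

n, m = sum(row_Aa), sum(row_aa)
w_x = sum(c * r for c, r in zip(row_Aa, midranks))        # 242
e_w = n * (N + 1) / 2                                     # 351
tie = sum(t * (t**2 - 1) for t in totals) / (N * (N - 1))
v_w = n * m / 12 * ((N + 1) - tie)                        # tie-corrected variance
print((w_x - e_w) / sqrt(v_w))                            # about -3.35
```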
Alternative approximations: While the normal approximation performs very well in estimating moderately small P-values and for similar sample sizes in the two groups, it becomes highly conservative for small (<<0.01) P-values. Hodges et al. (1990) have provided a correction, based on an Edgeworth expansion, to the normal approximation that greatly improves performance at small P-values. Note that when this approximation is used, a continuity correction should be applied, even when there are tied values. Widely disparate sample sizes may also lead to conservatism in the usual normal approximation. Such situations arise in gene-set analysis of microarray data, in which the two sample sizes may be 5-20 versus thousands. Fang et al. (2012) showed that a uniform approximation performs better than the normal approximation under such circumstances.
5.3. Analysis of Paired Data
In some experiments we would like to analyze a treatment effect under conditions where we expect that uninteresting sources of variation may also operate. We can test for such a treatment effect in spite of confounding variation by blocking the data according to the suspected confounding variables. A simple case of this approach is provided by Problem 7 in section 2.5.
Each of the n samples in this type of experiment consists of two observations (Xi, Yi) (i=1…n) and we are interested in testing the null hypothesis
H0: Zi = Yi - Xi = 0
against an appropriate one-sided or two-sided alternative. The Wilcoxon signed-rank test (Wilcoxon, 1945) provides a means of considering this hypothesis. For the one-sided alternative that the Yi are larger than the Xi, we would expect that the magnitudes of the positive deviations (Zi) would generally be larger than the negative deviations. Under the null hypothesis, each Zi is equally likely to be positive or negative; thus, there are 2ⁿ equally likely assignments of signs to the ranks of the magnitudes of the Zi. As before, we can define a test statistic based on the sum of a subset of the ranks (e.g., the positive Zi) and determine the null distribution of this test statistic by adding up the appropriate permutations.
5.3.1. Signed rank statistic.
To compute the statistic, determine the values of the n differences Zi, and rank the absolute values of these differences from 1 to n to give Ri. If, for example, the number of negative ranks is smaller, compute the statistic as the sum of the ranks for which Zi<0.
Ws = Σ Ri for all Zi<0.
The significance level can then be determined by consulting the table in Appendix 5 or by using the normal approximation given below. Samples for which Zi=0 are dropped from the analysis, reducing the value of n. In the case of tied values, midranks are used and the approximate procedure below should be used for determination of the significance level.
Large sample approximation and correction for ties: An alternative method for computing the signed rank statistic, suggested by Conover (1999), is convenient when the sample size is large or when ties are present in the Zi. As above, samples for which Zi=0 are discarded, leaving n nonzero observations. The n samples are ranked according to the magnitude of the difference, |Zi|, and for each sample, the value Ri is assigned as the above rank, but given the sign of the corresponding Zi. The test statistic
T = ΣRi / √(ΣRi²)
follows a standard normal distribution.
Example 5.4
Consider an experiment in which the level of transcription for a test plasmid is compared for a series of cell lines that have been constructed to express (or not express) a putative transactivating protein. Because we are generally interested in the ratio of expression, we will define the Zi as log(Yi)-log(Xi). We obtain the following data
RNA (Arbitrary units) | |||||
Line | -transact. | +transact. | Zi | Rank | Ri |
1 | 20 | 30 | +0.176 | 3 | +3 |
2 | 14 | 35 | +0.397 | 5 | +5 |
3 | 47 | 46 | -0.009 | 1 | -1 |
4 | 5 | 50 | +1.0 | 7 | +7 |
5 | 11 | 9 | -0.087 | 2 | -2 |
6 | 6 | 18 | +0.477 | 6 | +6 |
7 | 8 | 16 | +0.301 | 4 | +4 |
Because there are fewer negative values, we determine the sum of the ranks for Zi<0 as
Ws = 1 + 2 = 3.
Entering the table in the appendix with the value 3, we obtain a P-value of 0.039.
Using the large sample approximation, we would obtain
T = 1.859 and a one-sided P-value of approximately 0.031.
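A sketch of this large-sample calculation (no ties here, so simple ranks suffice; midranks would replace them otherwise):

```python
from math import log10, sqrt

minus = [20, 14, 47, 5, 11, 6, 8]     # -transactivator
plus  = [30, 35, 46, 50, 9, 18, 16]   # +transactivator

z = [log10(y) - log10(x) for x, y in zip(minus, plus) if y != x]  # drop Zi = 0

# Rank |Z| from 1..n, then attach the sign of the corresponding Z
order = sorted(range(len(z)), key=lambda i: abs(z[i]))
r = [0] * len(z)
for rank, i in enumerate(order, start=1):
    r[i] = rank if z[i] > 0 else -rank

print(sum(r) / sqrt(sum(ri * ri for ri in r)))   # T = 22/sqrt(140) = 1.86
```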
5.4. Multiple Samples
The discussion above focused on simple experimental designs in which we were interested in testing for differences in location between two independent samples. Often our experiments will consist of three or more samples representing, e.g., expression data for a series of mutant constructs or responses of animals to several different dose levels of a drug. We will discuss the problem of making inferences regarding various pair-wise combinations of treatment groups in Chapter 9. Tests based on sample ranks for the hypothesis that all of the groups in an experiment show the same response are described below.
5.4.1. A multisample test against a general alternative.
Our data consist of k samples, each with ni observations with values xij (i = 1 … k, j = 1 … ni). The total number of observations in our experiment is N, the sum of the ni. We want to test the null hypothesis that all of the samples are taken from the same population against the alternative that at least one sample is drawn from a population with a different location from the others. The Kruskal-Wallis test is based on jointly ranking all N observations and determining whether the ranks are randomly distributed across the k samples (Kruskal and Wallis, 1952).
For each sample, i, compute the sum of the ranks
Ri = Σrij (j = 1…ni)
where rij is the rank of observation xij (1 ≤ rij ≤ N). When no ties are present among the xij, the Kruskal-Wallis test statistic, H, is
H = [12/(N(N+1))] × Σ(Ri²/ni) - 3(N+1)
which follows a χ² distribution with k-1 degrees of freedom. If some of the xij are tied, mid-ranks are used for the rij and Ri is computed as above. The test statistic, H, is
H = [Σ(Ri²/ni) - N(N+1)²/4] / S²
where
S² = [Σrij² - N(N+1)²/4] / (N-1)
and the sum in S² runs over all N observations. Unless the number of ties is large, there will be little difference in the results for the two formulas for H.
Example 5.5
We have been studying the genetics of blood pressure in rats and have become concerned that the various diets used by our group and other labs working on hypertensive rats may affect the measured blood pressure. To test this hypothesis, we place groups of rats on three different diets and, after several weeks, measure the blood pressure for animals from each group. Our results are (mm Hg)
Diet 1
139 167 132 144 129 113 126 153 174 127 149 133
Diet 2
126 93 147 154 98 128 126 132 169 133
Diet 3
116 152 143 125 141 154 152 147 159 141 115 113
171 114 140 139 149 96 140 152
The ranks for these observations (using mid-ranks for tied values) are
Diet 1
20.5 39 16.5 27 15 4.5 11 35 42 13 30.5 18.5
Diet 2
11 1 28.5 36.5 3 14 11 16.5 40 18.5
Diet 3
8 33 26 9 24.5 36.5 33 28.5 38 24.5 7 4.5 41 6 22.5 20.5
30.5 2 22.5 33
The sums of the ranks for the three diets are 272.5, 180, and 450.5, respectively. The Kruskal-Wallis test statistic is approximately H = 1.07.
Using the χ² distribution with 2 degrees of freedom, we obtain a P-value of 0.58 and conclude that the blood pressure values do not differ for these three diets.
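Library implementations are convenient here; for example, scipy.stats.kruskal computes H with the tie correction applied automatically. A sketch with the diet data:

```python
from scipy import stats

diet1 = [139, 167, 132, 144, 129, 113, 126, 153, 174, 127, 149, 133]
diet2 = [126, 93, 147, 154, 98, 128, 126, 132, 169, 133]
diet3 = [116, 152, 143, 125, 141, 154, 152, 147, 159, 141,
         115, 113, 171, 114, 140, 139, 149, 96, 140, 152]

h, p = stats.kruskal(diet1, diet2, diet3)   # tie-corrected H
print(h, p)                                 # H about 1.07, P about 0.59
```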
5.4.2. A multisample test against an ordered alternative.
In many experiments there is a natural ordering among the treatment groups, e.g., in the case of a dose-response. In this case, we obtain data for s groups, each consisting of ni (i=1…s) observations and wish to test the null hypothesis
H0: F(X1) = F(X2) = F(X3) = … = F(Xs)
against the one-sided alternative (for example)
H1: F(X1) ≥ F(X2) ≥ F(X3) ≥ … ≥ F(Xs)
(where at least one of the inequalities holds).
The Jonckheere-Terpstra test can be performed for this set of hypotheses as follows (Jonckheere, 1954):
Define the value
Uij = number of pairs for which (Xia < Xjb), where a=1…ni and b=1…nj
(Recall that this is the Mann-Whitney form of the rank sum statistic for the pair of samples i and j.)
The Jonckheere-Terpstra statistic is computed using the s(s-1)/2 values Uij for i<j as
U = ΣUij (summed over all pairs with i<j)
Since the distribution of U depends on the pattern of the ni, it is most convenient to use the usual normal approximation
U* = [U - E(U)] / √V(U)
where
E(U) = 0.25×(N² - Σni²)
V(U) = [N²(2N+3) - Σni²(2ni+3)] / 72
N is the total number of observations and the summation is over i=1 … s.
In the case of tied observations, add 0.5 × the number of ties in each i,j comparison to the value of Uij. If there are g groups of tied values, each of size tk, the modified formula for the variance of U is
V(U) = [N(N-1)(2N+5) - Σni(ni-1)(2ni+5) - Σtk(tk-1)(2tk+5)] / 72
+ [Σni(ni-1)(ni-2)] × [Σtk(tk-1)(tk-2)] / [36N(N-1)(N-2)]
+ [Σni(ni-1)] × [Σtk(tk-1)] / [8N(N-1)]
which reduces to the formula above when there are no ties.
As a final note, this test is most powerful when the treatment groups are, in some sense, equally spaced.
Example 5.6
We are studying a mutant gene that we suspect has a quantitative effect on male fertility. We collect a number of males (3 or 4) having 2, 1, or 0 copies of the mutant allele and mate each to 3 females. The observation for each male is the total number of progeny obtained in the matings.
i | Number of mutant alleles | Total number of progeny |
1 | 2 | 16,8,6 |
2 | 1 | 27,16,15 |
3 | 0 | 31,29,18,42 |
We compute the values Uij as
U12 = 7.5
U13 = 12
U23 = 11
For this example U = 7.5 + 12 + 11 = 30.5, E(U) = 16.5, and V(U) = 27.25. Thus, U* = 2.69, giving a significance level of P<0.0036.
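The Jonckheere-Terpstra test is not part of every statistics library, but it is short to code. A sketch with the data above (ties in the Uij count 0.5, and the untied variance is used, as in the example):

```python
from math import sqrt
from statistics import NormalDist

groups = [
    [16, 8, 6],           # 2 mutant alleles
    [27, 16, 15],         # 1 mutant allele
    [31, 29, 18, 42],     # 0 mutant alleles
]

def u_pair(x, y):
    """Mann-Whitney count of pairs with x < y; ties count 0.5."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0
               for a in x for b in y)

u = sum(u_pair(groups[i], groups[j])
        for i in range(len(groups)) for j in range(i + 1, len(groups)))

n = [len(g) for g in groups]
N = sum(n)
e_u = (N**2 - sum(k**2 for k in n)) / 4
v_u = (N**2 * (2 * N + 3) - sum(k**2 * (2 * k + 3) for k in n)) / 72

u_star = (u - e_u) / sqrt(v_u)          # (30.5 - 16.5)/sqrt(27.25) = 2.68
print(1 - NormalDist().cdf(u_star))     # one-sided P about 0.0036
```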
5.5. Tests for Differences in Dispersion
In some cases, you may be more interested in differences between two groups in the degree of variability in the data, independent of whether the two groups may also differ in location. For example, you are comparing two different assay methods and want to know which of the two gives results with greater precision. The most commonly used tests for differences in variance between two groups are based on the F distribution. However, such tests are very sensitive to non-normality in the data sets and may not be suitable for the types of data in our experiments. The Ansari-Bradley test provides a distribution-free alternative, but requires that the medians for the two groups are either identical or known. Below we discuss two distribution-free tests that do not make assumptions about the location of the data sets, a two-sample test due to Miller (1968) based on the jackknife method, and a multi-sample test due to Conover (1999) based on the squared ranks of the distances to the median for the data in each group. It should be noted that these tests for differences in dispersion will typically require larger sample sizes than tests for differences in location to achieve reasonable power. For example, at least 20 samples per group would be required to achieve 90% power for two groups with a five-fold difference in variance.
5.5.1. Miller jackknife test for two samples.
The starting point for Miller's (1968) application of the jackknife to testing for differences in variance was the utility of that method in problems involving estimation of ratios. In particular, given two data sets, Xi and Yj, we want to test the null hypothesis that the ratio γ² of variances, V(X)/V(Y), is equal to 1 against either one- or two-sided alternatives. A more complete discussion of this test is in Miller's original paper and in Hollander et al. (2013); we will follow the notation used in the latter, below.
We have two data sets. The first group consists of m observations Xi (i=1…m) and the second consists of n observations Yj (j=1…n). We assume that the X and Y observations are independent and come from continuous distributions with finite fourth moments. We want to test the null hypothesis that the variance ratio is equal to 1
H0: γ² = 1
against the one-sided alternatives
H1: γ² < 1 or γ² > 1
or the two-sided alternative
H2: γ² ≠ 1
For each group, compute the "drop-one" estimates of the log variance. Let Di² be the sample variance of the m-1 X observations remaining when Xi is removed, and Ej² the sample variance of the n-1 Y observations remaining when Yj is removed, and define
Si = ln(Di²) and Tj = ln(Ej²)
where S0 = ln[V(X)] and T0 = ln[V(Y)] are the corresponding log variances for the full data sets. Next compute the jackknifed pseudovalues
Ai = mS0 - (m-1)Si
Bj = nT0 - (n-1)Tj
along with their means, mA and mB, and sample variances, V(A) and V(B). The test statistic, Q, is
Q = (mA - mB) / √[V(A)/m + V(B)/n]
Q follows a standard normal distribution; the null hypothesis is rejected when Q exceeds zα for a one-sided test, or when |Q| exceeds zα/2 for a two-sided test.
An estimate of the variance ratio for the two groups is
γ² = exp(mA - mB)
Example 5.7
We are interested in comparing two mouse strains for their reproductive potential. For each strain, we set up 10 independent matings and count the number of live born offspring in each litter. Our results are:
Strain A: 12 12 12 10 12 13 11 11 10 13
Strain B: 6 3 3 2 10 7 4 5 7 3
Based on the Wilcoxon rank sum test, there is indeed a significant difference between the two strains in the number of offspring per litter, with P(two-sided)≈3×10-5. Do the strains also differ in the variability of litter sizes?
The sample sizes, m and n, are both equal to 10. The values for the variances for the drop-one subsets are
Di2: 1.278 1.278 1.278 0.944 1.278 1.028 1.25 1.25 0.944 1.028
Ej2: 6.861 6.444 6.444 5.75 3.528 6.444 6.861 7 6.444 6.444
The logs of the subset variances are
Si: 0.2451 0.2451 0.2451 -0.0572 0.2451 0.0274 0.2231 0.2231 -0.0572 0.0274
Tj: 1.9259 1.8632 1.8632 1.7492 1.2607 1.8632 1.9259 1.9459 1.8632 1.8632
The logs of the variances for the whole data sets are S0=0.1446 and T0=1.8281.
Jackknifed estimates for the logs of the variances are
Ai: -0.7603 -0.7603 -0.7603 1.9602 -0.7603 1.1992 -0.5625 -0.5625 1.9602 1.1992
Bj: 0.9485 1.5123 1.5123 2.5385 6.9353 1.5123 0.9485 0.7681 1.5123 1.5123
with means (mA, mB) of 0.2153 and 1.9700 and with V(A)/m and V(B)/n equal to 0.1449 and 0.3284, respectively. Our test statistic Q is -2.55 with a two-sided P-value of 0.01. The estimated variance ratio for the two strains is exp(0.2153 - 1.9700) = 0.17. We conclude that there is a significant difference between the two strains in the variability of litter sizes.
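A sketch of the jackknife computation for this example (the helper pseudovalues is our name, for illustration):

```python
from math import log, sqrt
from statistics import NormalDist, mean, variance

strain_a = [12, 12, 12, 10, 12, 13, 11, 11, 10, 13]
strain_b = [6, 3, 3, 2, 10, 7, 4, 5, 7, 3]

def pseudovalues(x):
    """Jackknife pseudovalues k*S0 - (k-1)*Si for the log sample variance."""
    k = len(x)
    s0 = log(variance(x))                        # log variance, full sample
    return [k * s0 - (k - 1) * log(variance(x[:i] + x[i + 1:]))
            for i in range(k)]

a, b = pseudovalues(strain_a), pseudovalues(strain_b)
q = (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))
print(q)                                         # about -2.55
print(2 * NormalDist().cdf(q))                   # two-sided P about 0.01
```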
5.5.2. Conover test for multiple samples.
To test for differences in location among multiple groups, the Kruskal-Wallis test jointly ranks all of the observations and determines whether the ranks are randomly distributed among the groups. To examine differences in dispersion, it would make sense to similarly rank deviations between the observations and the means or medians of each group. Conover (1999) has proposed such a test, with the variation that squared ranks are used in order to improve the power of the test.
We have a set of observations, Xij, where i=1…k are the k groups and j=1…ni are the samples within each group. For each group, convert the observations to the distance from the group mean or median
Uij = |Xij - xmi|
where xmi is either the mean or the median for group i. We will use the median value because Conover et al. (1981) found that use of the median improved the power of the test, particularly for skewed distributions. Jointly rank all N = Σni of the Uij values to obtain ranks Rij, using mid-ranks for tied values. For each group, compute the sum of the squared ranks
Si = ΣRij² (j = 1…ni)
The test statistic, T2, is
T2 = [Σ(Si²/ni) - N×Sm²] / D²
where Sm is the mean of the squared ranks for all N observations,
Sm = (1/N) × ΣΣRij²
and
D² = [ΣΣRij⁴ - N×Sm²] / (N-1)
This test statistic, T2, follows a χ² distribution with k-1 degrees of freedom.
Example 5.8
We are developing an assay for the presence of a respiratory virus in saliva samples. We plan to use three target sequences for PCR analysis and want to be sure that the targets exhibit similar variability in the results. We add a fixed amount of viral nucleic acid to 20 replicate control samples and measure the Ct value at which a signal is detected. Our results are
N gene (i=1): 28.5 28.6 28.5 28.7 29.7 29.3 29.3 29.2 29.8 29.5 29.4 29.3 27.1 27.1 27.1 27 28.4 28.4 28.6 28.9
S gene (i=2): 29.9 29.8 29.9 29.6 30.9 31 31.1 31.5 31.5 31.7 31 30.9 28.1 28.3 28.5 28.1 29.6 29.6 29.8 30
Orf1 gene (i=3): 29.3 29.3 29.5 29.2 30.9 30.4 30.5 30.1 30.7 30.8 30.8 31 27.5 27.9 27.8 27.7 29.4 29.3 29.1 30.2
The median values for our three targets are 28.65, 29.9, and 29.45, respectively. The absolute differences between the Ct values and the medians for each group are
U1j: 0.15 0.05 0.15 0.05 1.05 0.65 0.65 0.55 1.15 0.85 0.75 0.65 1.55 1.55 1.55 1.65 0.25 0.25 0.05 0.25
U2j: 0 0.1 0 0.3 1 1.1 1.2 1.6 1.6 1.8 1.1 1 1.8 1.6 1.4 1.8 0.3 0.3 0.1 0.1
U3j: 0.15 0.15 0.05 0.25 1.45 0.95 1.05 0.65 1.25 1.35 1.35 1.55 1.95 1.55 1.65 1.75 0.05 0.15 0.35 0.75
Jointly ranking all 60 observations, using mid-ranks for tied values, we obtain
R1j: 13 3.5 13 6 35.5 26.5 26.5 24 39 31 29.5 26.5 47 47 47 54.5 17.5 17.5 3.5 17.5
R2j: 1.5 8.5 1.5 21 33.5 37.5 40 52.5 52.5 59 37.5 33.5 57.5 51 44 57.5 21 21 8.5 10
R3j: 13 13 6 17.5 45 32 35.5 26.5 41 42.5 42.5 49.5 60 49.5 54.5 56 6 13 23 29.5
The sums of the squared ranks for our three targets are 18209.8, 28372, and 27196.2, respectively. The mean of the squared ranks across all 60 observations is 1229.63. Computing D² as 1.20721×10⁶, we obtain our test statistic,
T2 = 2.56,
which follows a chi-square distribution with 2 degrees of freedom. Thus, our P-value is 0.28 and we conclude that the variability of the results for the three targets is not significantly different.
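A sketch of the squared-ranks computation (the function name is ours; deviations are rounded before ranking so that values that should tie are not separated by floating-point error):

```python
from statistics import median

def squared_ranks_T2(groups):
    """Conover's squared-ranks statistic T2 (chi-square, k-1 df)."""
    # Absolute deviations from each group's median
    u = [[round(abs(x - median(g)), 6) for x in g] for g in groups]

    # Joint ranking with midranks for ties
    flat = sorted(v for g in u for v in g)
    def midrank(v):
        lo = flat.index(v) + 1
        return (lo + lo + flat.count(v) - 1) / 2
    r = [[midrank(v) for v in g] for g in u]

    n = [len(g) for g in r]
    N = sum(n)
    s = [sum(x**2 for x in g) for g in r]         # sums of squared ranks, Si
    sm = sum(s) / N                               # mean squared rank, Sm
    d2 = (sum(x**4 for g in r for x in g) - N * sm**2) / (N - 1)
    return (sum(si**2 / ni for si, ni in zip(s, n)) - N * sm**2) / d2

n_gene = [28.5, 28.6, 28.5, 28.7, 29.7, 29.3, 29.3, 29.2, 29.8, 29.5,
          29.4, 29.3, 27.1, 27.1, 27.1, 27.0, 28.4, 28.4, 28.6, 28.9]
s_gene = [29.9, 29.8, 29.9, 29.6, 30.9, 31.0, 31.1, 31.5, 31.5, 31.7,
          31.0, 30.9, 28.1, 28.3, 28.5, 28.1, 29.6, 29.6, 29.8, 30.0]
orf1   = [29.3, 29.3, 29.5, 29.2, 30.9, 30.4, 30.5, 30.1, 30.7, 30.8,
          30.8, 31.0, 27.5, 27.9, 27.8, 27.7, 29.4, 29.3, 29.1, 30.2]

print(squared_ranks_T2([n_gene, s_gene, orf1]))   # about 2.56
```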
5.6. Sample Problems
- There has been concern for some time that nitrite present in cured meats
may result in the formation of carcinogenic N-nitrosamines in the
stomach by reaction with amines in the diet. It is possible to safely
measure the amount of nitrosamine production in humans by administering
a large dose of proline and measuring the level of N-nitroso-proline, a
non-toxic reaction product, excreted in the urine. You decide to test
the hypothesis that eating a reasonable amount of a cured meat will
increase the amount of nitrosation taking place in the stomach in the
following way. Each of 6 subjects, who have been asked to avoid cured
meats for the previous week, is given a dose of proline and the
excretion of N-nitroso-proline is measured for 24 hours. Each subject is
then fed a breakfast including 6 strips of bacon and a dose of proline;
excretion of the nitrosated amino acid is again measured for 24 hours.
You obtain the data listed below in the form (basal μg excreted, μg
excreted after bacon).
(3.1,2.8) (2.0,6.1) (5.1,4.5) (4.2,2.9) (8.5,8.9) (1.1,6.0)
- The recessive mutation bg (beige) in mice results in a decrease in
the activity of natural killer cells in homozygous mutant mice. You test
the hypothesis that this cell type plays an important role in rejection
of tumor cells (derived from a tumor induced in an isogenic animal) by
injecting animals with genotypes bg/bg or bg/+ with
10⁶ cells and
measuring the time required for the development of a palpable tumor in
each animal. You obtain the following data:
| Genotype | Time to tumor development (days) |
| bg/+ | 28, 51, 47, 80 |
| bg/bg | 14, 35, 17, 37, 16 |
- You have obtained data for four samples as follows:
| Sample 1 | 32 24 28 40 0 15 20 55 |
| Sample 2 | 20 36 5 15 30 25 |
| Sample 3 | 35 25 130 160 820 825 |
| Sample 4 | 116 220 405 410 550 735 835 705 |
- You are studying a viral oncogene that induces anchorage-independent
growth when expressed in fibroblasts. In order to map the functional
domains within that oncogene, you construct a series of mutants and
compare the abilities of the mutants to induce anchorage independent
growth with that of the wild type gene. After cotransfecting the wild
type or mutant oncogene with an antibiotic resistance marker into
fibroblasts and selecting for antibiotic resistant cells, you plate
10⁴ cells in agarose in
each of a series of dishes. After allowing time for growth, you obtain
the following numbers of colonies per dish.
| wild type | 120 310 440 413 500 |
| mutant | 48 15 120 15 23 |
- You have tested a particular chemical for its mutagenic activity in
mammalian cells. In your experiment, you treated cells with various
doses of the agent, allowed the cells to grow for a few generations and
then plated 10⁶ cells
in the presence of 6-thioguanine to select for hprt mutants.
After two weeks, you counted the mutant colonies. After doing the
experiment several times, you obtained the following data:
Dose Mutants/106 cells 0 2,0,5,3,1 10 3,9,8,8 20 17,41,12,5 40 80,79,44,56