9. Multiple Samples and Multiple Experiments
9.1. Multiple Comparisons
The methods described in the previous chapters have concentrated on comparing two independent samples for a difference in location. However, many experiments involve measurements made under a number of different conditions and we often want to make inferences regarding the relationships among these various treatment groups. Some of the relevant situations are:
1. Independent samples are obtained for a number of treatment groups and we want to test the null hypothesis that all of the treatments give the same result against the alternative that at least one pair of treatments differs in location.
2. The situation described in case (1) applies, but we wish to determine which pairs of treatments differ.
3. A number of treatments are each compared to a single control sample and we want to test, for each treatment, the null hypothesis that the treated sample is the same as the control against the alternative that they differ.
4. The treatments applied to the groups follow some natural order, e.g., as part of a dose-response experiment, and we want to test the hypothesis that all of the treatments result in the same level of response against the ordered alternative that the response for the first treatment is less than or equal to that for the second, the response for the second is less than or equal to that for the third, and so on.
In case (1), the object of the experiment is to determine whether a series of treatments all give equivalent results, for example, when you are concerned that some nuisance variable will affect the interpretation of subsequent experiments. For this case, the Kruskal-Wallis test (see section 5.4.1) provides an efficient approach to analyzing the data. The null hypothesis that all of the groups yield the same response may be tested against the ordered alternative described in case (4) using the Jonckheere-Terpstra test (section 5.4.2).
Cases (2) and (3) are related, differing only in the number of desired comparisons, k. For example, if we have collected data on s treatments (one of which is the control for case (3)), we want to make k = s(s−1)/2 comparisons for case (2), while the number of relevant comparisons is only k = s−1 for case (3). Note that the comparisons for case (3) will often be one-sided, while those for case (2) would generally be two-sided. One approach to analyzing the data in these cases would be to consider each comparison independently and apply the two-sample Wilcoxon rank sum test to each relevant pair of groups. The difficulty with this approach is that we now have to consider two types of error rate for the experiment. The first, α, is the error rate per comparison, that is, the usual significance level for the two-sample test. However, we also need to consider the experiment-wise error rate, α'. That these two error rates are different is obvious from the definition of the significance level (or P-value). Consider the case in which there are truly no differences among the treatment groups. If we set the error rate per comparison at α = 0.05, the chance of falsely stating that at least one pair of groups differs increases markedly with the number of comparisons; for k independent comparisons it is 1 − (1 − 0.05)^k, which already exceeds 0.40 for k = 10.
We are thus left with two not very satisfying alternatives. First, we could fix the value of α at some reasonable level (e.g., 0.05) and run the risk of falsely stating that a particular pair of treatments differ. Alternatively, we could fix α' at some level (e.g., 0.05) and avoid that risk. However, in that case we will require so stringent a per-comparison α that we may easily fail to identify two treatments that truly differ.
9.1.1. Bonferroni and related methods.
The simplest approach to taking multiple comparisons into account takes advantage of the Bonferroni inequality, which states that the experiment-wise error rate, α', is less than or equal to the sum of the k per-comparison error rates. Dividing the allowable Type I error equally among the comparisons, we obtain α = α'/k as the per-comparison error rate required to achieve an experiment-wise error rate of α'. This highly conservative approach ensures that the probability of incorrectly rejecting even one of the null hypotheses corresponding to our k comparisons is no greater than α'.
When all of the hypotheses to be tested are independent, the somewhat less conservative Dunn-Šidák method can be used (Ury, 1976). The relationship between α and α' in this case is

α' = 1 − (1 − α)^k, or equivalently, α = 1 − (1 − α')^(1/k)
For example, if we compare 4 treatments each to a control sample and find that one of the treatments gives a significance level smaller than our criterion of α = 0.05, there is slightly less than one chance in five (0.19) that our assertion that the treatment is better than the control will be false. In order to obtain an experiment-wise error rate of 0.05, we would need to set the per-comparison error rate to 0.0127. Our advice (somewhat off the cuff) for dealing with this case is to perform the appropriate Wilcoxon rank sum test to compare each treatment group with the control. Any treatments with per-comparison P-values less than 0.05 that would not be significant at α' = 0.1 should be retested in a separate experiment.
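Both adjustments are simple to compute. A minimal Python sketch (the function names are ours):

```python
def bonferroni_alpha(alpha_exp, k):
    """Per-comparison alpha giving an experiment-wise rate of alpha_exp over k tests."""
    return alpha_exp / k

def sidak_alpha(alpha_exp, k):
    """Dunn-Sidak per-comparison alpha; assumes the k tests are independent."""
    return 1 - (1 - alpha_exp) ** (1 / k)

# Four treatments each compared to a control (k = 4), as in the text:
print(bonferroni_alpha(0.05, 4))        # 0.0125
print(round(sidak_alpha(0.05, 4), 4))   # 0.0127
# Experiment-wise error rate if each of the 4 tests is run at alpha = 0.05:
print(round(1 - (1 - 0.05) ** 4, 2))    # 0.19
```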
In case (2) we are interested in the relationships among the various treatment groups. For this case, the number of comparisons, k, is s(s-1)/2. The relationship given above may be used as a guide to determining the amount of weight to be given to assertions that a particular pair of treatments differ. A reasonable way to summarize the results of multiple comparisons of this type is provided by the graphical approach shown in the example below.
Example 9.1
Consider an experiment in which plasmids expressing Neoʳ and various forms (wild type or deleted) of a transforming gene are introduced into fibroblasts. Plasmid-bearing cells are selected by growth in G418 and 10⁴ resistant cells are plated in agarose. The colonies in each plate are enumerated after a suitable growth period. Four plasmids are tested in multiple experiments, with each assay consisting of two or more plates. The plasmids tested are SV2Neo (no transforming gene), SV2BNLF1 (wild-type gene), N43b (N-terminal deletion), and C174 (C-terminal deletion). The following data are obtained.
No. of colonies / 10⁴ cells
Expt. | SV2Neo | SV2BNLF1 | N43b | C174 |
1 | 32, 24, 28, 40 | 116, 220 | 20, 36 | |
2 | 0, 15 | 405, 410 | 5, 15 | 35, 25 |
3 | 20, 55 | 550, 735 | 30, 25 | 130, 160 |
4 | 70, 55, 80, 140 | 835, 705 | | 820, 825 |
5 | 75 | 790, 235 | | 585, 340 |
6 | 40, 30 | 635, 330 | 125, 105 | 215, 185 |
7 | 60, 90 | 815, 695 | 60, 75 | 245, 220 |
[Data adapted from Baichwal and Sugden, 1989.]
The above example reflects several features of the way we actually do experiments. First, in contrast to the simple two-sample cases we have discussed so far, we were really interested in asking several different questions when we did the experiments. Second, the data were not all obtained at the same time, and inspection of the data indicates that there are systematic experiment-to-experiment variations. We will first deal with the multiple-comparisons aspect of the first point, pooling the data across experiments. A method for analyzing the data that maintains the experiment-specific information will be discussed in section 9.2.
First consider the case of comparing three treatments against a control sample (SV2Neo). We want to test the null hypothesis that the treatment and control give the same response against the alternative that the response for the "treatment" plasmid is greater than that for SV2Neo. Using the one-sided version of the Wilcoxon rank sum test for each of the treatments vs. control (and pooling the data across the various experiments) we obtain significance levels of P = 0.37, P < 0.0003, and P < 2×10⁻⁶ for the N43b, C174, and SV2BNLF1 plasmids, respectively. At an experiment-wise error rate of α' = 0.05, we require that α < 0.017; we find that the C174 and SV2BNLF1 plasmids give greater responses than the control SV2Neo plasmid.
Next consider the more general problem of determining the relationship among the responses for all of the plasmids. Performing a two-sided Wilcoxon rank sum test for each pair of plasmids, we obtain the following significance levels:

 | N43b | C174 | SV2BNLF1 |
SV2Neo | 0.74 | 0.0006 | 3×10⁻⁶ |
N43b | | 0.0014 | 5×10⁻⁵ |
C174 | | | 0.048 |
As a convenient way to summarize the data, simply write down the group names in order of increasing response and draw a line under those pairs that fail to be significant at some level of α (e.g., α = 0.05):

N43b   SV2Neo      C174      SV2BNLF1
_____________
For this example, the plasmids fall into three discrete groups, but note that in other cases the lines may overlap. From the discussion above, with k = 6 pairwise comparisons, C174 would differ from the first two plasmids at an experiment-wise error rate of α' ≤ 0.0084 (6 × 0.0014) and would differ from SV2BNLF1 with an error rate of α' ≤ 0.288 (6 × 0.048). Note that for an α' of 0.05, we would require α ≤ 0.0085.
As noted above, the Bonferroni and Dunn-Šidák methods for controlling the experiment-wise error rate are highly conservative in that they are concerned with the general null hypothesis that all k null hypotheses corresponding to our comparisons are simultaneously true. Perneger (1998) has argued that this general null hypothesis is rarely of interest in research and points out the difficulty inherent in the decreasing power that comes from increasing the number of tests performed. Thus, we have the Sisyphean dilemma of greater and greater difficulty in demonstrating significant differences between groups as we study more experimental conditions. We take the middle ground in this debate. For experiments such as the one discussed in the above example, report the per-comparison P-values. However, you should be mindful of the experiment-wise error rate. Those comparisons that would not be judged significant using an α' less than 0.1 or 0.2 (or whatever risk you're willing to take) should be re-tested in an independent study. Two alternative approaches discussed below deal with the cases in which we really do care about the general null hypothesis but need improved power (e.g., linkage analysis) or in which we are willing to accept a certain level of false-positive results (e.g., microarray analyses of gene expression).
9.1.2. Estimating experiment-wise error rates by permutation.
A more complicated problem of multiple comparisons arises in the context of linkage analysis for quantitative traits. In such experiments, we are interested in mapping the genes responsible for a quantitative trait, such as blood pressure, body weight, or the number of tumors induced in a particular tissue, to specific chromosomal regions. We perform a test cross between two inbred lines of animals (or plants), such as an N2 backcross or F2 intercross, and analyze all of the progeny for both the phenotype of interest and their genotypes at a large number of marker loci (often 100 or more) that are distributed across all of the chromosomes. For each marker locus, we stratify the progeny by genotype and use an appropriate statistical test to determine whether the magnitudes of the phenotype values differ among the genotypes at that marker. For example, in a backcross the test progeny will either be heterozygous or homozygous at a given marker locus for the allele carried by the recurrent parent. We could use the Wilcoxon rank sum test to compare the phenotypes of the two groups.
After performing our analyses, we have a test statistic and associated P-value for each marker locus. However, the hypothesis that we really want to test is whether there is a significant association between phenotype and genotype anywhere in the genome; i.e., we want to translate our per-marker P-values into genome-wide P-values. Using the approach described in the preceding sections, we could approximate the genome-wide P-values by multiplying the per-marker P-values by the number of comparisons (the number of markers typed) we have performed. The problem with this approach is that our comparisons are not independent: the markers fall into linkage groups and the genotypes at linked markers are correlated. It is also important to note that the costs in time and resources of following up a spurious linkage are very high.
Churchill and Doerge (1994) provide a more satisfying, empirical approach to estimating the genome-wide P-value by permutation of the phenotype data. As above, we determine the value of our test statistic for each marker locus. We can obtain the distribution of this test statistic under the null hypothesis of no linkage between any marker and the quantitative trait locus determining our phenotype as follows. We perform a large number of Monte Carlo trials in which we randomly permute the phenotype data. For each trial, we perform the appropriate test at every marker locus and note the most extreme value obtained for the test statistic. The distribution of these extreme values approximates the distribution of our test statistic under the null hypothesis of no linkage (Lystig, 2003), providing us with an estimate of the genome-wide P-value.
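A sketch of this permutation scheme in Python, assuming a simple 0/1 genotype coding and using scipy's normalized rank sum statistic (the function name and data layout are ours):

```python
import numpy as np
from scipy.stats import ranksums

def genome_wide_p(phenotypes, genotypes, n_trials=1000, rng=None):
    """Churchill-Doerge permutation estimate of the genome-wide P-value.

    phenotypes: length-n array of trait values (e.g., tumor counts per mouse).
    genotypes:  (n, n_markers) array coded 0/1; both codes are assumed to
                occur at every marker.
    """
    phenotypes = np.asarray(phenotypes)
    genotypes = np.asarray(genotypes)
    rng = rng if rng is not None else np.random.default_rng()
    n_markers = genotypes.shape[1]

    def max_abs_w(pheno):
        # Most extreme normalized rank sum statistic over all marker loci
        return max(abs(ranksums(pheno[genotypes[:, j] == 0],
                                pheno[genotypes[:, j] == 1]).statistic)
                   for j in range(n_markers))

    observed = max_abs_w(phenotypes)
    # Null distribution of max(|W*|): permute phenotypes, keep genotypes fixed
    null_max = np.array([max_abs_w(rng.permutation(phenotypes))
                         for _ in range(n_trials)])
    return float(np.mean(null_max >= observed))
```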
Example 9.2
We have performed a backcross between two inbred mouse strains that differ in risk for liver cancer. The progeny are treated with a standard regimen to induce liver cancer, and each mouse is evaluated for the number of tumors arising in the liver and for its genotype at 103 marker loci. Data for one of the marker loci are:
Homozygotes | 19 80 1 2 3 60 59 7 0 3 5 3 1 11 23 9 25 31 58 49 12 13 6 44 18 14 7 14 11 6 |
Heterozygotes | 49 62 61 33 78 44 8 64 61 84 62 40 9 21 29 14 41 35 9 32 19 38 30 19 51 27 48 53 |
Using the Wilcoxon rank sum test, we obtain a normalized value for the test statistic, W* = −3.75, and a per-marker P-value of 0.00018 for the two-sided test.
To obtain the genome-wide null distribution for W*, we analyze 100,000 random permutations of the tumor multiplicities. The distribution of the maximum |W*| is shown in the figure below.
The proportion of trials giving a max(|W*|) larger than our observed value of 3.75 is 0.0067, indicated by the black shaded bars in the histogram. Thus, the genome-wide P-value for linkage is P<0.007.
[Data from Bilger et al., 2004.]
9.1.3. Controlling the false discovery rate.
The application of recently developed, high-throughput technologies to biological problems gives rise to issues of multiple comparisons on an unprecedented scale. Hybridization of labeled cDNA populations, synthesized from mRNA isolated from cells or tissues under various treatment conditions, to microarrays of oligonucleotides or cDNA clones allows the simultaneous measurement of the levels of thousands or tens of thousands of transcripts. Consider the following typical experiment to study differential gene expression. We label mRNA samples isolated from cells or tissues of n1 cultures or animals treated under condition 1 and from n2 samples under condition 2, with the conditions being, for example, two different stages of the cell cycle, mutant and wild-type animals, etc. Hybridization of each labeled sample to a microarray yields measurements of the levels of m transcripts. For each transcript represented on the microarray, we perform an appropriate statistical test to compare the levels under our two conditions, giving m unadjusted (per-comparison) P-values.
There are several noteworthy features of the above experiment. First, m is often very large. Commercially available whole genome arrays for the mouse and human allow analysis of 20,000 to 40,000 transcripts. Second, it's very likely that a large (albeit unknown) fraction of the transcripts are truly not differentially expressed. Third, the m comparisons are not independent given that many sets of genes are coordinately regulated. How do we decide which transcripts are differentially expressed?
Using the unadjusted P-values is a definite non-starter. Depending on m and the proportion of truly null comparisons, we may have up to 2,000 false-positive results among those transcripts with a per-comparison P-value < 0.05. Applying the Bonferroni correction to the per-comparison P-values will be far too conservative to be useful: for m = 40,000, we would need to set α = 1.25×10⁻⁶ in order to achieve an experiment-wise error rate of α' = 0.05. We could use permutation, as in the linkage example discussed above, to estimate the distribution of our test statistic under the general null hypothesis that there are no differentially expressed genes, but that misses the point of our experiment, which is directed toward discovering the set of differentially regulated genes. We are likely to pursue additional studies for this set of genes, including independent measurement of the transcript levels by Northern analysis or quantitative RT-PCR, in silico studies of their regulatory elements, or classification of the genes into various regulatory pathways.
The methods we've discussed so far have focused on controlling the experiment-wise error rate (the probability of falsely rejecting at least one truly null hypothesis) or the per-comparison error rate (the expected proportion of falsely rejected null hypotheses among all m tests). An alternative approach to the problem of multiple comparisons has been formulated by Benjamini and Hochberg (1995). They argue that, in many circumstances, it may be more appropriate to control the false discovery rate (FDR), which is the proportion of incorrectly rejected null hypotheses among all rejected null hypotheses. This approach is ideal for our discovery-oriented gene expression experiment (Storey and Tibshirani, 2003). Depending on the costs associated with our follow-up studies, we may be willing to tolerate a certain amount of chaff among the wheat of differentially expressed genes, to be discarded later by independent measurements of the levels of specific transcripts or tolerated as noise in our pathway analysis.
We've performed an experiment to simultaneously test m null hypotheses, of which m0 are true. Thus, the proportion of truly null hypotheses is π0 = m0/m. We can classify the results of our hypothesis tests according to the following 2×2 table:
 | Declared non-significant | Declared significant | Total |
True null hypotheses | m0 − F | F | m0 |
Non-true null hypotheses | (m − m0) − T | T | m − m0 |
Total | m − S | S | m |
S is an observable random variable that depends on the level α used for each individual hypothesis test; F and T are not observable. Relating these variables to the hypothesis-testing framework discussed above, E[F/m] is the per-comparison error rate and P(F ≥ 1) is the experiment-wise error rate when each hypothesis is tested at a level of α'/m.
The false discovery rate, Q, is defined as

Q = E[F/S]
with the FDR set to 0 when S=0, since no false rejection is possible. Benjamini and Hochberg (1995) prove two important properties of the FDR. First, when all null hypotheses are true, the FDR is equivalent to the experiment-wise error rate; thus, controlling the FDR provides a weak form of control over the experiment-wise error rate. Second, controlling the FDR at a specific level will have significantly more power than controlling the experiment-wise error rate at the same level; the power advantage to the FDR increases as m0/m (i.e., π0) decreases.
We want to estimate the false discovery rate, Q(t), when all hypothesis tests with a P-value less than some threshold, t (0 < t ≤ 1), are declared significant. For our m hypothesis tests, we obtain P-values of p1, p2, …, pm. Then

Q(t) = E[F(t)/S(t)]

When m is large,

Q(t) ≈ E[F(t)] / E[S(t)]
The simplest estimate of E[S(t)] is just the number of null hypotheses declared to be significant, i.e., the number of P-values less than or equal to t. The P-values for truly null hypotheses should be uniformly distributed over the interval [0,1]. Thus, we can estimate E[F(t)] as m0t = π0mt. In the original treatment by Benjamini and Hochberg, π0 is assumed to be 1, which yields a conservative estimate of the false discovery rate. Storey (2002) provides two alternative methods to estimate π0 from the distribution of the P-values, but we'll use the simpler, conservative approach below.
Using the above, we can define a Q-value, qi, for each of the P-values obtained in our experiment in a straightforward way. Order the P-values from smallest to largest, such that
p1 ≤ p2 ≤ … ≤ pi < pi+1 ≤ … ≤ pm
(note the strict less-than relationship above; the same Q-value will apply to all of the members of a set of equal P-values, with i taken as the largest of the set). The Q-values are estimated as

qi = π0 m pi / i = m pi / i (taking π0 = 1)
The Q-value, qi, is an upper limit for the false discovery rate when the P-value, pi (and all of the tests with smaller P-values), is declared to be significant and provides a measure of the confidence we can have in asserting that gene i is, in our example, differentially expressed.
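A minimal sketch of this calculation in Python (the function name is ours; the running minimum taken from the largest P-value downward is the standard step-up refinement, and it also implements the tie rule above):

```python
import numpy as np

def q_values(p, pi0=1.0):
    """Q-values qi = pi0 * m * p(i) / i for the ordered P-values.

    p: array of m per-comparison P-values, in any order.
    Returns the Q-values in the original order of p.
    """
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # indices that sort the P-values
    q_sorted = pi0 * m * p[order] / np.arange(1, m + 1)
    # Enforce monotonicity (and the tie rule): running minimum from the right
    q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]
    q = np.empty(m)
    q[order] = q_sorted
    return q
```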
Example 9.3
We want to determine the influence of growth hormone signaling on hepatic gene expression. Hepatic mRNA was isolated from 3 wild-type and 5 mutant (growth hormone-deficient) mice and labeled cDNA was hybridized to a microarray containing sequences representing 4,608 genes. For each gene, the null hypothesis of equivalent expression in mutant and wild-type was tested using a linear model to give a moderated t-statistic (Smyth, 2004). The resulting P-values were ordered from smallest to largest. For our experiment, 835 genes gave per-comparison P-values less than 0.05. A subset of the results is shown in the table below, with i representing the rank of the gene; pi the unadjusted, per-comparison P-value; qi the Q-value (estimated false discovery rate); and BP the Bonferroni-adjusted P-value.
i | Clone | pi | qi | BP |
1 | 4589 | 9.9×10⁻⁵⁶ | 4.6×10⁻⁵² | 4.6×10⁻⁵² |
2 | 2079 | 1.3×10⁻⁵² | 3.0×10⁻⁴⁹ | 5.9×10⁻⁴⁹ |
3 | 540 | 2.5×10⁻⁵⁰ | 3.9×10⁻⁴⁷ | 1.2×10⁻⁴⁶ |
: | | | | |
265 | 4598 | 9.8×10⁻⁶ | 1.7×10⁻⁴ | 0.045 |
266 | 3703 | 1.0×10⁻⁵ | 1.8×10⁻⁴ | 0.047 |
267 | 1294 | 1.2×10⁻⁵ | 2.0×10⁻⁴ | 0.053 |
: | | | | |
561 | 19 | 5.9×10⁻³ | 0.048 | 1 |
562 | 1266 | 6.0×10⁻³ | 0.050 | 1 |
563 | 3719 | 6.2×10⁻³ | 0.051 | 1 |
Clone 3703, which gave the 266th smallest P-value of 1.03×10⁻⁵, is the last gene that would be judged significant using the Bonferroni-adjusted P-values (pi × 4608). The false discovery rate for this clone is

q266 = (4608 × 1.03×10⁻⁵) / 266 = 1.8×10⁻⁴
Use of the Bonferroni adjustment would dictate that only 266 genes showed significant differential expression. Using a false discovery rate of 0.05, we would be interested in 562 genes. Our follow-up studies would likely find that approximately 28 of these 562 were not expressed at different levels in wild-type and mutant livers.
9.2. Combining Results of the Same Type
We quite frequently perform multiple independent experiments, both for practical reasons and to ensure that the result of an experiment is reproducible. It would be desirable to be able to jointly analyze the results of, for example, multiple experiments comparing two treatment conditions to give a single P-value that reflects the significance of the results as a whole. The following sections describe approaches to combining the results of replicate experiments involving measurement or categorical data.
9.2.1. Combining rank sum tests.
Consider a simple case in which we have performed s experiments, with ni observations for treatment 1 and mi observations for treatment 2 (i = 1 … s), giving a total of Ni observations in experiment i. Within each experiment, the number of possible permutations of the data is Ni!/[ni!(Ni−ni)!], and the total number of permutations for the set of experiments would be the product of this value over the s experiments. We could construct a test based on enumerating this set of permutations (at a cost of some effort), but a simpler approach, suggested by Lehmann (1998), is to take advantage of the normal approximation to the Wilcoxon rank sum statistic and the properties of the normal distribution.
As above, we have performed s experiments comparing two treatments, with ni observations on treatment 1 for experiment i and a total of ni+mi=Ni observations in that experiment. For each experiment i, compute the Wilcoxon rank sum statistic Wi in the usual way and calculate an adjusted statistic
W(i) = Wi / (Ni+1)
[The weighting by (Ni+1)⁻¹ provides a more powerful test for the case that the sizes of the individual experiments differ.]
The expected value and variance of W(i) are
E(W(i)) = E(Wi)/(Ni+1)
V(W(i)) = V(Wi)/(Ni+1)²
where the expected value and variance of Wi are as given previously (see section 5.2.2). For the set of experiments as a whole, we define the statistic

W = W(1) + W(2) + … + W(s)

Because each of the W(i) is approximately normally distributed, the statistic W is approximately normally distributed with

E(W) = E(W(1)) + E(W(2)) + … + E(W(s))

and

V(W) = V(W(1)) + V(W(2)) + … + V(W(s))
Thus, the usual z-statistic, z = [W − E(W)] / √V(W), can be used to determine the significance level.
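A sketch of the whole procedure in Python, using the expressions for E(Wi) and V(Wi) from section 5.2.2 and ignoring ties (the function name is ours):

```python
import math
from scipy.stats import norm, rankdata

def combined_ranksum(experiments):
    """Combine Wilcoxon rank sum tests across independent experiments.

    experiments: list of (sample1, sample2) pairs, one pair per experiment.
    Returns (z, two_sided_p). Ties within an experiment are not corrected for.
    """
    W = EW = VW = 0.0
    for x, y in experiments:
        n, N = len(x), len(x) + len(y)
        ranks = rankdata(list(x) + list(y))
        Wi = ranks[:n].sum()                # rank sum for treatment 1
        W += Wi / (N + 1)                   # W(i) = Wi/(Ni + 1)
        EW += n / 2                         # E(Wi)/(Ni + 1) = n(Ni + 1)/2/(Ni + 1)
        VW += n * (N - n) / (12 * (N + 1))  # V(Wi)/(Ni + 1)^2
    z = (W - EW) / math.sqrt(VW)
    return z, 2 * norm.sf(abs(z))

# Example 9.4 below (C174 vs. SV2BNLF1, Experiments 2-7):
expts = [([25, 35], [405, 410]), ([130, 160], [550, 735]),
         ([820, 825], [835, 705]), ([340, 585], [235, 790]),
         ([185, 215], [330, 635]), ([220, 245], [695, 815])]
print(combined_ranksum(expts))  # z = -2.53, P = 0.011
```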
Example 9.4
Consider the data for the two plasmids C174 and SV2BNLF1 from the example given above. These two plasmids were compared in 6 independent experiments (Experiments 2-7):
Experiment | C174 | SV2BNLF1 |
2 | 25,35 | 405,410 |
3 | 130,160 | 550,735 |
4 | 820,825 | 835,705 |
5 | 340,585 | 235,790 |
6 | 185,215 | 330,635 |
7 | 220,245 | 695,815 |
For each of the experiments, ni is 2 and Ni is 4. We can compute the appropriate rank sums for the experiments as follows.
Experiment | Wi | W(i) |
2 | 3 | 0.6 |
3 | 3 | 0.6 |
4 | 5 | 1.0 |
5 | 5 | 1.0 |
6 | 3 | 0.6 |
7 | 3 | 0.6 |
Thus, for this set of experiments, the value of the statistic W is 4.4. Because each experiment is of the same size, the values of E(W(i)) and V(W(i)) are in each case 1 and 0.0667, respectively, giving E(W) = 6 and V(W) = 0.4. The value of the normalized statistic is

z = (4.4 − 6) / √0.4 = −2.53
and the significance level for the two-sided test is P < 0.012. Note that by preserving the distinctions between experiments we obtain a smaller P-value than the P = 0.048 found above for the pooled comparison, corresponding to an experiment-wise error rate of α' ≤ 6 × 0.012 = 0.072 for the six pairwise comparisons, and giving us some confidence that these mutant and wild-type plasmids differ in activity.
9.2.2. Combining replicate 2 × 2 tables.
Combining the results of replicate r × c tables for a general test of independence is straightforward, given the reproductive property of the χ² distribution. For k replicate tables, each with d degrees of freedom, the overall result can be obtained by simply summing the values of the X² statistics for the tables. This sum also follows a χ² distribution, with kd degrees of freedom.
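As a sketch, assuming the individual X² statistics have already been computed (the function name is ours):

```python
from scipy.stats import chi2

def combine_rxc(x2_stats, d):
    """Combine k independent r x c tests: sum the X2 statistics (each with
    d degrees of freedom) and refer the sum to chi-square with k*d df."""
    return chi2.sf(sum(x2_stats), df=len(x2_stats) * d)
```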
This approach is less satisfactory when testing the hypothesis that success probabilities differ for two treatments and the goal is to combine the results of replicate 2 × 2 tables. Consider a set of k experiments in which we are comparing the success probabilities for two treatments, A and B. For each experiment, i, we obtain a table (note the difference in notation from that in Chapter 6):
Treatment | Success | Failure | Total |
A | xi | ri - xi | ri |
B | ci - xi | Ni - ri - ci + xi | Ni - ri |
Total | ci | Ni - ci | Ni |
We want to test the null hypothesis
H0: pAi = pBi for all i = 1 … k
against the one-sided alternative that pAi > pBi for at least some i, or the two-sided alternative that they are unequal for some i.
Mantel and Haenszel (1959) derived the following test statistic:

M = Σi (xi − ri ci/Ni) / √[ Σi ri ci (Ni − ri)(Ni − ci) / (Ni²(Ni − 1)) ]

Under the null hypothesis, the test statistic, M, approximately follows a standard normal distribution.
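A sketch of the computation, with each table entered as the four counts (xi, ri, ci, Ni) defined above (the function name is ours):

```python
import math
from scipy.stats import norm

def mantel_haenszel(tables):
    """Mantel-Haenszel test over replicate 2 x 2 tables.

    tables: list of (x, r, c, N) tuples in the notation above -- successes
    for treatment A, total for A, total successes, and table total.
    Returns (M, one_sided_p).
    """
    num = sum(x - r * c / N for x, r, c, N in tables)
    var = sum(r * c * (N - r) * (N - c) / (N ** 2 * (N - 1))
              for x, r, c, N in tables)
    M = num / math.sqrt(var)
    return M, norm.sf(M)

# Example 9.5 below: (twitcher mutants, total mutants, total twitchers, total mice)
labs = [(1, 25, 1, 45), (4, 32, 4, 82), (6, 40, 7, 78)]
print(mantel_haenszel(labs))  # M = 3.18, one-sided P = 0.0007
```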
Example 9.5
Mice heterozygous for a mutant allele of a particular signal transduction gene occasionally exhibit a neurological defect: their whiskers twitch more rapidly than those of wild-type mice. The mutant mouse line is carried by three separate laboratories. Each group observes a sample of heterozygous mutant and wild-type mice, and a member of the group (who does not know the genotypes) scores each mouse as a "twitcher" or normal. The null hypothesis is that the frequency of twitchers is the same for both genotypes, and the one-sided alternative is that they are more frequent among the mutants.
Results for the three groups are
Lab. A | |||
Genotype | Twitcher | Normal | Total |
Mutant | 1 | 24 | 25 |
Wild type | 0 | 20 | 20 |
Total | 1 | 44 | 45 |
Lab. B | |||
Genotype | Twitcher | Normal | Total |
Mutant | 4 | 28 | 32 |
Wild type | 0 | 50 | 50 |
Total | 4 | 78 | 82 |
Lab. C | |||
Genotype | Twitcher | Normal | Total |
Mutant | 6 | 34 | 40 |
Wild type | 1 | 37 | 38 |
Total | 7 | 71 | 78 |
Using the Mantel-Haenszel test, our test statistic is

M = (11 − 5.71) / √2.78 = 3.18

Using the table for the standard normal distribution, the one-sided P-value is approximately 0.0007.
9.3. Combining Results of "Different" Types
The previous section described an approach to combining the results of several experiments in which the same statistical hypothesis was tested using the Wilcoxon rank sum test. However, there will often be a variety of experimental (and, hence, statistical) approaches to asking the same, underlying biological questions and it is desirable on scientific grounds to pursue multiple independent ways of testing our biological hypothesis. Methods for combining and interpreting the results of multiple experiments with diverse designs fall under the heading of "meta-analysis." Although meta-analysis has been an area of intense interest in recent years, in particular in its application to clinical trials, part of its origins can be traced to a very simple method proposed by R. A. Fisher in his classic "Statistical Methods for Research Workers," which was first published in 1925 (Fisher, 1973).
We have performed a number, s, of independent experiments that share the same "biological" null hypothesis and have, in each case, performed an appropriate statistical test providing us with a P-value, pi, i.e., the probability of obtaining our observed (or a more extreme) outcome under the condition that the null hypothesis is true. As described by Fisher, we can define a test statistic that allows us to combine these independent probabilities to yield an overall test of significance:

X² = −2 (ln p1 + ln p2 + … + ln ps)
This test statistic follows a χ² distribution with 2s degrees of freedom.
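A sketch of Fisher's method in Python (scipy.stats.combine_pvalues with method='fisher' performs the same computation):

```python
import math
from scipy.stats import chi2

def fisher_combine(pvals):
    """Fisher's method: X2 = -2 * sum(ln pi), chi-square with 2s df."""
    x2 = -2.0 * sum(math.log(p) for p in pvals)
    return x2, chi2.sf(x2, df=2 * len(pvals))

# Example 9.6 below: combining the backcross and intercross P-values
print(fisher_combine([0.083, 0.0154]))  # X2 = 13.3, P = 0.0098
```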
Example 9.6
We want to test the hypothesis that a particular mutant gene tu increases the risk that mice will develop lung cancer when exposed to a specific carcinogenic treatment. In order to test that hypothesis, we perform two experiments in which we treat backcross or intercross animals, genotype them at the tu locus, and enumerate lung tumors.
The lung tumor multiplicities for the backcross were:
tu/+ | 28 42 75 0 36 0 2 44 69 57 30 39 7 5 25 2 8 50 3 10 44 0 12 29 |
+/+ | 23 8 8 7 5 16 8 16 5 45 47 45 5 13 3 4 4 3 4 5 15 3 2 25 26 25 5 0 11 0 47 29 |
Using the Wilcoxon rank sum test, we obtain a value for the normalized test statistic of z = 1.385, and a P-value for our one-sided test of 0.083.
The lung tumor multiplicities for the intercross were:
+/+ | 76 8 25 53 108 81 26 1 139 1 13 10 17 1 21 103 3 51 11 14 4 46 |
tu/+ | 22 47 24 5 72 11 93 57 84 18 37 47 62 12 6 53 86 1 30 4 27 14 28 62 0 4 19 54 22 6 67 |
tu/tu | 59 37 57 72 93 27 4 68 20 96 98 56 63 |
Using the Jonckheere-Terpstra test against the alternative that the tumor yield increases with the number of tu alleles, we obtain a z of 2.159 and a P-value of 0.0154.
To combine the results of these experiments we use

X² = −2 (ln 0.083 + ln 0.0154)
   = −2 (−2.489 − 4.173)
   = 13.32

Using the table of the χ² distribution with 4 degrees of freedom, we obtain a combined P-value of 0.0098.
The above method for jointly analyzing experiments is quite powerful and can be applied even when the types of data collected in the individual experiments are very different. In our example above, we could have used a different treatment protocol for the intercross that yielded a mean tumor multiplicity much less than 1. The data collected in that cross could then have been tumor incidence as a function of genotype rather than tumor multiplicity, and we might have analyzed those data using a contingency table. You would, of course, need to apply some judgment in using this approach. It seems unreasonable to combine the P-values in this way when they are derived from two-sided statistical tests and the direction of the results differs among the experiments (for example, if the backcross demonstrated a decrease in tumor yield in the presence of the tu allele but the intercross showed increasing yield with an increased number of copies of that allele).
9.4. Sample Problems
- You are interested in a gene that you believe may be expressed in a temporally specific manner during development. In a fairly crude initial experiment, you isolate mRNA from embryos of various stages and measure the amount of transcript for the gene by Northern hybridization. You obtain the data below, which are obtained by densitometry of an autoradiogram. Analyze these data as completely as you can. State the hypotheses you are testing and compute the significance levels.

Experiment | Day of Gestation | RNA Level |
1 | 12 | 20, 35, 33 |
 | 14 | 50, 48, 30 |
 | 16 | 60, 31, 90 |
 | 18 | 12, 22, 18 |
2 | 12 | 3, 2, 0 |
 | 14 | 8, 7, 10 |
 | 16 | 19, 12, 9 |
 | 18 | 0, 1, 2 |
3 | 12 | 40, 45, 49 |
 | 14 | 130, 150, 144 |
 | 16 | 220, 120, 350 |
 | 18 | 25, 33, 41 |

- You are studying the effect of folate deficiency on the incidence of neural tube defects in a particular mutant mouse. You provide mothers with a folate-deficient or normal diet and analyze the embryos for neural tube defects. For two independent experiments, your results are:

Experiment 1
Diet | NT defect | Normal |
Control | 0 | 22 |
Folate deficient | 3 | 15 |

Experiment 2
Diet | NT defect | Normal |
Control | 1 | 42 |
Folate deficient | 7 | 55 |