10. Power and Experimental Design

We have generally considered statistical approaches to testing hypotheses underlying a particular experiment in a post hoc manner. However, this way of looking at statistics does not reflect the reality of doing (or trying to do) good science. Prior to performing an experiment directed toward determining the difference between two treatments, you should ask yourself two questions. First, what is the smallest difference between the two treatments that I would find biologically interesting? The answer to this question depends on the kind of measurement you are making, the biology of the system you are studying, and a fair bit on your emotional outlook. Second, how large an experiment must I perform in order to have a reasonable chance of reliably detecting that difference? This second question is the key issue in experimental design, and the answer depends on the statistical method to be used and on the structure of the data obtained in the experiment. We will consider these issues for three commonly used tests: Fisher's exact test for categorical data, McNemar's test for association, and the Wilcoxon rank sum test.

As discussed earlier in the course, there are two errors associated with hypothesis testing. The Type I error, α, is the probability of incorrectly rejecting the null hypothesis; it is the significance level of the test, the threshold against which the commonly quoted P-value is compared. The distributions under the null hypothesis of the test statistics we have discussed are generally quite simple and do not depend on the underlying distributions of the data obtained in an experiment. The Type II error, β, is the probability of failing to reject the null hypothesis when it is false. In order to evaluate β, or the power (1-β), for a particular experiment we need additional information regarding the alternative hypothesis. The power depends on the desired value of α and on the distribution of the test statistic under the alternative hypothesis, which in turn depends on the sample sizes, the distributions of the measurements, and the degree of difference between the two groups. The distribution of a test statistic under the alternative hypothesis is often referred to as its non-central distribution.

10.1. Fisher's Exact Test

Consider an experiment in which we have obtained 20 observations each on two treatments A and B and we classify each observation as a success or failure according to some criterion. We are interested in testing the null hypothesis that p1, the probability of success for treatment A, is the same as p2, the probability of success for treatment B, against a one-sided alternative hypothesis that p1>p2. We have obtained the following data.


             Result
Treatment    Success    Failure    Total
A               10         10        20
B                4         16        20
Total           14         26        40

Keeping the marginal totals fixed, the distribution of the number of successes in treatment A (x) under the null hypothesis follows a hypergeometric distribution. Using Fisher's exact test, we sum up the probabilities for x=10, 11, 12, 13, and 14 to obtain a significance level of 0.048.
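
The sum of hypergeometric probabilities described above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name fisher_one_sided_p and its argument order are our own choices for illustration.

```python
from math import comb

def fisher_one_sided_p(a, b, c, d):
    """One-sided Fisher exact P-value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with at least
    `a` successes in the first row, keeping the marginal totals fixed.
    """
    row1 = a + b              # observations on treatment A
    successes = a + c         # total number of successes
    total = a + b + c + d     # all observations
    upper = min(row1, successes)
    denom = comb(total, row1)
    numer = sum(
        comb(successes, x) * comb(total - successes, row1 - x)
        for x in range(a, upper + 1)
    )
    return numer / denom

# Table from the text: A = 10 successes / 10 failures, B = 4 / 16.
p_value = fisher_one_sided_p(10, 10, 4, 16)
print(round(p_value, 3))  # 0.048
```

The loop runs over x = 10, ..., 14, the same terms that are summed in the text.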

The statistical model for this experiment is that the data for each treatment follow binomial distributions with success probabilities p1 and p2 and numbers of trials N1 and N2, respectively. The distribution of the test statistic under the alternative hypothesis depends on these 4 parameters and is quite difficult to evaluate. One useful link between α and β is that, if the true difference is such that the expected outcome of the experiment lies exactly at the significance level α, the power of the experiment is approximately 0.5, since about half of all repetitions will fall short of significance. Casagrande et al. (1978) have provided an approximate formula for determining the number of observations required (N=N1=N2) to obtain a significance level α for a given β as a function of p2 and p1-p2. In the table below, we have computed the equal sample sizes required for α=0.05 and β=0.1 (90% power) as a function of p1 for various values of p2. From the table, we can note that approximately 70 observations per group are required to obtain a 90% chance of observing a one-sided significance level ≤ 0.05 when p2=0.25 and p1=0.5, values close to the proportions observed in the above example (4/20=0.2 and 10/20=0.5). Suissa and Shuster (1985) provide a table of sample sizes required to achieve 80% power when Barnard's exact test is used. Generally 10-20% fewer samples are required for Barnard's test relative to Fisher's exact test.

Sample sizes (per group) required for α = 0.05, β = 0.1 for Fisher's Exact Test.
The upper and lower entries are for one- and two-sided tests, respectively.

        p1= 0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
p2

0.05        513     94     45     28     19     14     11      8      6      5
            620    113     54     33     23     17     13     10      7      5

0.1           -    236     76     40     25     18     13      9      7      5
              -    286     92     48     30     21     15     11      9      6

0.25          -      -   1404    178     70     38     23     16     11      8
              -      -   1713    216     85     45     28     19     13      9

0.5           -      -      -      -      -    442    111     48     26     15
              -      -      -      -      -    537    134     58     31     18

0.75          -      -      -      -      -      -      -   1232    121     36
              -      -      -      -      -      -      -   1503    146     43
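
Individual entries of this kind can be approximated in code. The sketch below assumes that the approximation is the continuity-corrected formula usually attributed to Casagrande, Pike and Smith, N = A[1 + √(1 + 4δ/A)]² / (4δ²), with δ = |p1-p2| and A = [zα√(2pq) + zβ√(p1q1 + p2q2)]², where p is the mean of p1 and p2 and q = 1-p. That formula is not written out in these notes, so treat it, and the function name casagrande_n, as assumptions.

```python
from math import sqrt
from statistics import NormalDist

def casagrande_n(p1, p2, alpha=0.05, beta=0.10, two_sided=False):
    """Approximate per-group sample size for comparing two proportions.

    Assumed form of the Casagrande-Pike-Smith approximation; returns an
    unrounded value.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(1 - beta)
    p_bar = (p1 + p2) / 2          # mean success probability
    delta = abs(p1 - p2)           # difference to be detected
    a = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return a * (1 + sqrt(1 + 4 * delta / a)) ** 2 / (4 * delta ** 2)

n_one_sided = casagrande_n(0.5, 0.25)
n_two_sided = casagrande_n(0.5, 0.25, two_sided=True)
print(round(n_one_sided, 1), round(n_two_sided, 1))
```

For p2=0.25 and p1=0.5 the two values come out close to 70 and 85, in line with the one- and two-sided entries tabulated above.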

10.2. McNemar's Test

Consider the case-control experimental design discussed in section 6.3.1, often used to assess the association between genotype, environmental exposures, or other potential risk factors and the development of a specific disease. In our earlier example, we were interested in testing for an association between a particular genotype, G, and disease. The resulting 2 × 2 table, in terms of cell probabilities is:



                      Cases
                  G              Not G         Total
Controls  G       p11            p01           p0=p11+p01
          Not G   p10            p00           q0=1-p0
          Total   p1=p11+p10     q1=1-p1       1

To design our study, we need to know how many cases to include in order to achieve a specific power, 1-β, for a given α, prevalence of the genotype in the controls, p0, and odds ratio, ψ.

For this discussion, we make two simplifying assumptions. First, we will match a single control to each case. Second, the genotypes of the cases and controls are uncorrelated. Under the latter assumption of independence, the cell probabilities are determined by the marginal probabilities, i.e., p11=p1p0.

We can specify all of the probabilities in the table in terms of the prevalence of the genotype in the controls, p0, and odds ratio, ψ=p10/p01. Under independence, p10=p1q0 and p01=p0q1, so that ψ=p1q0/(p0q1). Thus, p1 is given by

   p1 = ψp0 / [1 + p0(ψ-1)]

For a two-sided test, the number of discordant pairs, m, required to detect an odds ratio, ψ, with power, 1-β is (Dupont, 1988)

   m = [zα/2(ψ+1) + 2zβ√ψ]² / (ψ-1)²

The number of cases required is

   N = m / (p10+p01) = m / (p1q0 + p0q1)

(use the expression for p1 given above to obtain N in terms of p0 and ψ).

Example 10.1

The genotype of interest is present at a frequency of 20% in the population. How many cases are required to detect an odds ratio of 4 with 90% power (α=0.05)?

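For this example, zα/2=1.96 and zβ=1.28. The number of discordant pairs required is

   m = [1.96 × 5 + 2 × 1.28 × 2]² / 3² = 14.92² / 9 = 24.7

The value of p1 is

   p1 = 4 × 0.2 / (1 + 0.2 × 3) = 0.5

Thus, the number of cases required is

   N = 24.7 / (0.5 × 0.8 + 0.2 × 0.5) = 24.7 / 0.5 = 49.4

and we should design our study to include 50 cases.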
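
The arithmetic of Example 10.1 can be scripted as follows. This is a sketch under stated assumptions: it takes the number of discordant pairs to be m = [zα/2(ψ+1) + 2zβ√ψ]² / (ψ-1)², the usual large-sample expression for 1:1 matched pairs, and it assumes independent case and control genotypes as in the text. The function name matched_pairs_design is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def matched_pairs_design(p0, psi, alpha=0.05, beta=0.10):
    """Return (m, p1, N) for a 1:1 matched case-control design.

    p0  : prevalence of the genotype in the controls
    psi : odds ratio to be detected (two-sided test)
    """
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha / 2), z(1 - beta)
    # Assumed formula for the required number of discordant pairs.
    m = (z_a * (psi + 1) + 2 * z_b * sqrt(psi)) ** 2 / (psi - 1) ** 2
    # Prevalence in the cases implied by p0 and psi.
    p1 = psi * p0 / (1 + p0 * (psi - 1))
    # Probability that a case-control pair is discordant (p10 + p01).
    p_discordant = p1 * (1 - p0) + p0 * (1 - p1)
    return m, p1, m / p_discordant

m, p1, n_cases = matched_pairs_design(p0=0.20, psi=4)
print(round(m, 1), round(p1, 2), ceil(n_cases))
```

With unrounded normal quantiles m is slightly larger than the hand calculation, but the number of cases still rounds up to 50.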

10.3. Wilcoxon Rank Sum Test

Consider two samples, each of size N, with f1(x) the distribution of the data for group 1 and f2(y) that for group 2. In the Mann-Whitney form of the test we want to evaluate the null hypothesis

H0:   P(Xi>Yj) = 0.5

against the one-sided alternative

H1:   P(Xi>Yj) > 0.5

The test statistic in this case is

WXY = number of pairs for which (Xi>Yj)

We can use a normal approximation to the test statistic

WXY* = [WXY - E(WXY)] / √V(WXY)

In the absence of tied observations (continuously distributed data)

E(WXY) = N² / 2

and

V(WXY) = N²(2N+1) / 12

Under the alternative hypothesis, the distribution of the normalized test statistic is shifted away from zero (its non-central distribution). Within the normal approximation, the shift z' required to achieve a given power is zα+zβ for a one-sided test and zα/2+zβ for a two-sided test. For α = 0.05 the values of z' are

Power        0.5      0.8      0.9      0.95
one-sided    1.645    2.486    2.926    3.290
two-sided    1.96     2.8      3.24     3.6
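
The tabulated values match sums of standard normal quantiles, one for the significance level and one for the power. A short sketch that regenerates them (the helper name z_prime is our own):

```python
from statistics import NormalDist

def z_prime(power, alpha=0.05, two_sided=False):
    """Normal quantile for the significance level plus that for the power."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    return z_alpha + z(power)

for power in (0.5, 0.8, 0.9, 0.95):
    one = z_prime(power)
    two = z_prime(power, two_sided=True)
    print(f"{power:<5} {one:.3f} {two:.2f}")
```

At a power of 0.5 the second quantile is zero, so z' reduces to the ordinary critical value (1.645 or 1.96).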

Using a little algebra, we can approximate N as a function of P(Xi>Yj) (which we will symbolize as P) for any value of z' given α=0.05. Noting that under the alternative hypothesis

E(WXY) = N²P

and setting z' = [N²P - N²/2] / √V(WXY) with 2N+1 ≈ 2N, we obtain

N ≈ 2z'² / [12(P-0.5)²]

For discretely distributed data in which tied observations will be frequent (e.g., Poisson distributions with relatively small means), remember to add 1/2 the frequency of ties to the quantity P(Xi>Yj). In addition, the quantity "2" in the numerator of the above expression should be adjusted downward by a small amount (usually 10% or less) to take into account the reduction in variance.

Example 10.2

You want to compare two Poisson populations under conditions where you will achieve 95% power (one-sided α=0.05) for a mean in the first sample of 2.0 against a second sample with a mean of 1.0. The distributions of the observations for the two samples are


      Frequency
i     m=2.0    m=1.0
0     0.135    0.368
1     0.271    0.368
2     0.271    0.184
3     0.180    0.061
4     0.090    0.015
5     0.036    0.003
6     0.012    0.0005
7     0.003    7×10⁻⁵

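To compute P, weight each value i of the first sample by the probability that an observation from the second sample is smaller, counting ties as one half:

   P = Σi f1(i) [ Σj<i f2(j) + ½ f2(i) ]

For the two distributions above, the value of P is 0.710. Thus,

   N = 2 × 3.29² / (12 × 0.210²) = 41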

If a full correction for ties is applied (and more values of i used), the value would be 38.
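
Example 10.2 can be reproduced numerically. The sketch below computes P as P(Xi>Yj) plus one half of the probability of a tie, summing well beyond i=7, and then applies the approximation N ≈ 2z'² / [12(P-0.5)²] without the further variance correction for ties. The function names and the cutoff max_count are our own choices.

```python
from math import ceil, exp, factorial

def poisson_pmf(k, mean):
    """Probability of observing exactly k events for a Poisson mean."""
    return exp(-mean) * mean ** k / factorial(k)

def prob_x_exceeds_y(mean_x, mean_y, max_count=60):
    """P(X > Y) + 0.5 * P(X = Y) for independent Poisson X and Y."""
    p = 0.0
    cdf_y = 0.0  # running value of P(Y < i)
    for i in range(max_count + 1):
        fx = poisson_pmf(i, mean_x)
        fy = poisson_pmf(i, mean_y)
        p += fx * (cdf_y + 0.5 * fy)
        cdf_y += fy
    return p

def wilcoxon_n(p, z_prime):
    """Approximate per-group sample size from P and z'."""
    return 2 * z_prime ** 2 / (12 * (p - 0.5) ** 2)

p = prob_x_exceeds_y(2.0, 1.0)
n = wilcoxon_n(p, z_prime=3.29)
print(round(p, 3), ceil(n))
```

Because the sum is carried past i=7, P comes out slightly above the 0.710 quoted in the example; the required N still rounds up to 41.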

Example 10.3

How many observations would be required to distinguish in a two-sided test between two Poisson populations with means of 10 and 5, respectively, under conditions of 90% power with α=0.05?

The distribution of the Xi in this case is approximately normal with mean and variance of 10, while that for the Yj is normal with mean and variance of 5. In order to compute P, we can note that the distribution of X-Y is approximately normal with a mean of 5 and a variance of 15. Using the standard cumulative normal distribution for z=5/√15, we note that

   P[(X-Y)>0] = P(Xi>Yj) = N(1.29) = 0.901

Thus,

   N = 2 × 3.24² / (12 × 0.401²) = 11
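
The same calculation in code, as a rough check. The sketch uses the normal approximation to both Poisson distributions exactly as in the example; the function name prob_x_exceeds_y_normal is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def prob_x_exceeds_y_normal(mean_x, var_x, mean_y, var_y):
    """P(X > Y) when X - Y is treated as normally distributed."""
    z = (mean_x - mean_y) / sqrt(var_x + var_y)
    return NormalDist().cdf(z)

# Poisson means 10 and 5: each variance equals the corresponding mean.
p = prob_x_exceeds_y_normal(10, 10, 5, 5)
z_prime = 3.24  # two-sided alpha = 0.05, 90% power
n = 2 * z_prime ** 2 / (12 * (p - 0.5) ** 2)
print(round(p, 3), ceil(n))
```

The value of P differs from 0.901 only in the third decimal place because z is not rounded to 1.29, and N again rounds up to 11.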

10.4. Sample Problems

  1. You will be comparing the cloning efficiency (colony forming ability) of two cell lines in agarose. To do this experiment, you plate 100 cells in each of N wells for each cell strain and for each well count the number of colonies that arise after 2 weeks in culture. You will compare the cloning efficiencies of the two cell strains by testing for a difference in the colony counts using the Wilcoxon rank sum test. If past experience indicates that the cloning efficiency of Cell Strain A is 10%, how many wells would you have to plate (how large must N be) in order to detect a significant difference between the two strains at the 5% significance level with power of 95% if the true cloning efficiency of Cell Strain B were 20%?
  2. You are interested in studying the effect of treatment of a cell line with a hormone on the expression of an enzyme activity. The basal level of expression for the cell line is 100 units. How many independent observations (equal numbers for control and hormone-treated) would you need to make in order to have a 95% chance of detecting a two-fold increase in enzyme activity in the hormone-treated cells? Assume that your measurements are normally distributed with a variance in each case of 1/2 the mean number of units.
  3. The incidence of lung tumors in untreated mice of a particular strain is 10%. How many animals would you need to study in order to detect with 90% power (for α = 0.05) an increase in tumor incidence to 30% for animals treated with a carcinogen? Assume that the treated and concurrent control groups are of equal size.