10. Power and Experimental Design

We have generally considered statistical approaches to testing hypotheses underlying a particular experiment in a post hoc manner. However, this way of looking at statistics does not reflect the reality of doing (or trying to do) good science. Prior to performing an experiment directed toward determining the difference between two treatments, you should ask yourself two questions. First, what is the smallest difference between the two treatments that I would find biologically interesting? The answer to this question depends on the kind of measurement you are making, the biology of the system you are studying, and a fair bit on your emotional outlook. Second, how large an experiment must I perform in order to have a reasonable chance of reliably detecting that difference? This second question is the key issue in experimental design, and the answer depends on the statistical method to be used and on the structure of the data obtained in the experiment. We will consider these issues for three commonly used tests: Fisher's exact test for categorical data, McNemar's test for association, and the Wilcoxon rank sum test.

As discussed earlier in the course, there are two errors associated with hypothesis testing. The Type I error, α, is the probability of incorrectly rejecting the null hypothesis; it is the significance level chosen for the test, the threshold against which the commonly quoted P-value is compared. The distributions under the null hypothesis of the test statistics we have discussed are generally quite simple and do not depend on the underlying distributions of the data obtained in an experiment. The Type II error, β, is the probability of incorrectly accepting (failing to reject) the null hypothesis when it is false. In order to evaluate β, or the power (1-β), for a particular experiment we need additional information regarding the alternative hypothesis. The distribution of the test statistic under the alternative hypothesis depends on the sample sizes, the distributions of the measurements, and the degree of difference between the two groups; the power depends, in addition, on the desired value of α. The distribution of a test statistic under the alternative hypothesis is often referred to as its non-central distribution.

10.1. Fisher's Exact Test

Consider an experiment in which we have obtained 20 observations each on two treatments A and B and we classify each observation as a success or failure according to some criterion. We are interested in testing the null hypothesis that p₁, the probability of success for treatment A, is the same as p₂, the probability of success for treatment B, against a one-sided alternative hypothesis that p₁>p₂. We have obtained the following data.

	Result
Treatment	Success	Failure	Total
A	10	10	20
B	4	16	20
Total	14	26	40

Keeping the marginal totals fixed, the distribution of the number of successes in treatment A (x) under the null hypothesis follows a hypergeometric distribution. Using Fisher's exact test, we sum up the probabilities for x=10, 11, 12, 13, and 14 to obtain a significance level of 0.048.
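
The tail sum above can be checked numerically. The following is a minimal Python sketch that assumes only the fixed-margins hypergeometric model just described; the function name fisher_one_sided is illustrative and does not come from the text.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """Upper-tail Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, the count in the top-left cell is hypergeometric;
    the P-value sums the probabilities of the observed count and larger ones.
    """
    row1 = a + b   # observations on treatment A
    row2 = c + d   # observations on treatment B
    col1 = a + c   # total number of successes
    denom = comb(row1 + row2, col1)
    x_max = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, x_max + 1))
    return tail / denom

# Data from the table: A = 10 successes / 10 failures, B = 4 / 16.
p_value = fisher_one_sided(10, 10, 4, 16)
print(round(p_value, 3))  # 0.048
```

The loop runs over x = 10, ..., 14, the same five terms listed in the text.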

The statistical model for this experiment is that the data for each treatment follow binomial distributions with success probabilities p₁ and p₂ and numbers of trials N₁ and N₂, respectively. The distribution of the test statistic under the alternative hypothesis depends on these four parameters and is quite difficult to evaluate. One point of relationship between α and β is that, when the true difference is just large enough that the test is expected to give a significance level of exactly α, the power of the experiment is about 0.5. Casagrande et al. (1978) have provided an approximate formula for determining the number of observations required per group (N=N₁=N₂) to achieve a power of 1-β at significance level α as a function of p₂ and p₁-p₂. In the table below, we have computed the equal sample sizes required for α=0.05 and β=0.1 (90% power) as a function of p₁ for various values of p₂. From the table, we can note that approximately 70 observations per group are required to obtain a 90% chance of observing a one-sided significance level ≤ 0.05 when p₂=0.25 and p₁=0.5, values close to the observed proportions in the above example (0.2 and 0.5). Suissa and Shuster (1985) provide a table of sample sizes required to achieve 80% power when Barnard's exact test is used. Generally 10-20% fewer samples are required for Barnard's test relative to Fisher's exact test.

Sample sizes (per group) required for α = 0.05, β = 0.1 for Fisher's Exact Test.
The upper and lower entries are for one- and two-sided tests, respectively.

p₁=	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9	1.0
p₂
0.05	513	94	45	28	19	14	11	8	6	5
	620	113	54	33	23	17	13	10	7	5

0.1		236	76	40	25	18	13	9	7	5
		286	92	48	30	21	15	11	9	6

0.25			1404	178	70	38	23	16	11	8
			1713	216	85	45	28	19	13	9

0.5						442	111	48	26	15
						537	134	58	31	18

0.75								1232	121	36
								1503	146	43
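
The table entries can be approximated in code. The sketch below assumes the commonly quoted form of the Casagrande et al. approximation, n = A[1 + (1 + 4δ/A)^(1/2)]² / (4δ²) with A = [z_α(2p̄q̄)^(1/2) + z_β(p₁q₁ + p₂q₂)^(1/2)]², δ = |p₁-p₂| and p̄ = (p₁+p₂)/2. That formula is not written out in the text, so treat it as an assumption; the function name casagrande_n is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def casagrande_n(p1, p2, alpha=0.05, beta=0.10, two_sided=False):
    """Approximate per-group sample size for comparing two proportions.

    Assumed form of the Casagrande et al. (1978) approximation; the
    formula itself is not given in the text.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(1 - beta)
    p_bar = (p1 + p2) / 2
    delta = abs(p1 - p2)
    a = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return a * (1 + sqrt(1 + 4 * delta / a)) ** 2 / (4 * delta ** 2)

# Compare with the table row p2 = 0.25, column p1 = 0.5.
print(round(casagrande_n(0.5, 0.25)))                  # one-sided
print(round(casagrande_n(0.5, 0.25, two_sided=True)))  # two-sided
```

Under this assumed formula, rounding to the nearest integer appears to reproduce the tabulated values for the entries checked (for example p₂=0.25, p₁=0.5).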

10.2. McNemar's Test

Consider the case-control experimental design discussed in section 6.3.1, often used to assess the association between genotype, environmental exposures, or other potential risk factors and the development of a specific disease. In our earlier example, we were interested in testing for an association between a particular genotype, G, and disease. The resulting 2 × 2 table, in terms of cell probabilities is:

		Cases
		G	Not G	Total
Controls	G	p₁₁	p₀₁	p₀=p₁₁+p₀₁
Controls	Not G	p₁₀	p₀₀	q₀=1-p₀
	Total	p₁=p₁₁+p₁₀	q₁=1-p₁	1

To design our study, we need to know how many cases to include in order to achieve a specific power, 1-β, for a given α, prevalence of the genotype in the controls, p₀, and odds ratio, ψ.

For this discussion, we make two simplifying assumptions. First, we will match a single control to each case. Second, the genotypes of the cases and controls are uncorrelated. Under the latter assumption of independence, the cell probabilities are determined by the marginal probabilities, i.e., p₁₁=p₁p₀.

We can specify all of the probabilities in the table in terms of the prevalence of the genotype in the controls, p₀, and odds ratio, ψ=p₁₀/p₀₁. Under independence, p₁₀=p₁q₀ and p₀₁=q₁p₀, so that ψ=p₁q₀/(q₁p₀). Thus, p₁ is given by

p₁ = ψp₀ / [1 + p₀(ψ-1)]

For a two-sided test, the number of discordant pairs, m, required to detect an odds ratio, ψ, with power, 1-β is (Dupont, 1988)

m = [z_α/2 (ψ+1) + 2 z_β ψ^(1/2)]² / (ψ-1)²

The number of cases required is

N = m / (p₁₀+p₀₁) = m / (p₁q₀ + p₀q₁)

(use the expression for p₁ given above to obtain N in terms of p₀ and ψ).

Example 10.1

The genotype of interest is present at a frequency of 20% in the population. How many cases are required to detect an odds ratio of 4 with 90% power (α=0.05)?

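For this example, z_α/2=1.96 and z_β=1.28. The number of discordant pairs required is

m = [1.96(4+1) + 2(1.28)(4)^(1/2)]² / (4-1)² = (14.92)² / 9 = 24.7

The value of p₁ is

p₁ = 4(0.2) / [1 + 0.2(4-1)] = 0.5

Thus, the number of cases required is

N = 24.7 / [0.5(0.8) + 0.2(0.5)] = 24.7 / 0.5 = 49.5

and we should design our study to include 50 cases.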
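
Example 10.1 can be reproduced in a few lines of Python. This sketch rests on two assumptions that should be checked against Dupont (1988): that the required number of discordant pairs is m = [z_α/2(ψ+1) + 2z_β ψ^(1/2)]² / (ψ-1)², and that under independence the probability that a pair is discordant is p₁q₀ + p₀q₁. With these assumptions the calculation arrives at the 50 cases quoted above. The function name matched_pairs_design is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def matched_pairs_design(p0, psi, alpha=0.05, power=0.90):
    """Return (p1, discordant pairs m, cases N) for a 1:1 matched design.

    Assumes independence of case and control genotypes and the
    discordant-pair formula stated in the lead-in (an assumption).
    """
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)                       # two-sided critical value
    z_b = z(power)                               # z_beta
    p1 = psi * p0 / (1 + p0 * (psi - 1))         # genotype prevalence in cases
    m = (z_a * (psi + 1) + 2 * z_b * sqrt(psi)) ** 2 / (psi - 1) ** 2
    p_discordant = p1 * (1 - p0) + p0 * (1 - p1) # p10 + p01
    return p1, m, m / p_discordant

p1, m, n_cases = matched_pairs_design(p0=0.2, psi=4)
print(round(p1, 3), round(m, 1), ceil(n_cases))
```

Rounding N up rather than to the nearest integer is a design choice made here so that the planned power is not undershot.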

10.3. Wilcoxon Rank Sum Test

Consider two samples, each of size N, with f₁(x) the distribution of the data for group 1 and f₂(y) that for group 2. In the Mann-Whitney form of the test we want to evaluate the null hypothesis

H₀: P(X_i>Y_j) = 0.5

against the one-sided alternative

H₁: P(X_i>Y_j) > 0.5

The test statistic in this case is

W_XY = number of pairs for which (X_i>Y_j)

We can use a normal approximation to the test statistic

W_XY* = [W_XY - E(W_XY)] / V(W_XY)^(1/2)

In the absence of tied observations (continuously distributed data)

E(W_XY) = N² / 2

and

V(W_XY) = N²(2N+1) / 12

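Under the alternative hypothesis, the normalized test statistic follows a non-central t-distribution. The critical values, z', from this distribution for one-sided and two-sided tests with α = 0.05 are given below. To the precision shown, each entry equals z_α + z_β for the one-sided test and z_α/2 + z_β for the two-sided test.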

Power	0.5	0.8	0.9	0.95
one-sided	1.645	2.486	2.926	3.290
two-sided	1.96	2.8	3.24	3.6

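Using a little algebra, we can approximate N as a function of P(X_i>Y_j) (which we will symbolize as P) for any value of z' given α=0.05. Noting that, under the alternative hypothesis,

E(W_XY) = N²P

and setting the normalized statistic equal to z' with 2N+1 ≈ 2N, we obtain

N ≈ 2z'² / [12(P-0.5)²]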

For discretely distributed data in which tied observations will be frequent (e.g., Poisson distributions with relatively small means), remember to add 1/2 the frequency of ties to the quantity P(X_i>Y_j). In addition, the quantity "2" in the numerator of the above expression should be adjusted downward by a small amount (usually 10% or less) to take into account the reduction in variance.
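
The approximation for N is easy to wrap in a small helper. In the Python sketch below, z_prime computes z' as z_α + z_β (z_α/2 + z_β for two-sided tests), which is an assumption based on the observation that these sums match the tabulated critical values; wilcoxon_n evaluates N ≈ 2z'² / [12(P-0.5)²]. Both function names and the variance_factor parameter (the "2" in the numerator, which can be lowered for tied data) are illustrative.

```python
from statistics import NormalDist

def z_prime(alpha=0.05, power=0.90, two_sided=False):
    """Critical value z' taken as z_alpha + z_beta (assumed form)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    return z_alpha + z(power)

def wilcoxon_n(p, zp, variance_factor=2.0):
    """Approximate per-group N for the Wilcoxon rank sum test.

    p is P(X > Y), including half the tie probability for discrete data;
    zp is the critical value z'; variance_factor is the numerator constant.
    """
    return variance_factor * zp ** 2 / (12 * (p - 0.5) ** 2)

# One-sided alpha = 0.05, 95% power, P = 0.710 (the setting of Example 10.2).
print(round(z_prime(0.05, 0.95), 3))
print(round(wilcoxon_n(0.710, 3.29), 1))
```

Passing a variance_factor slightly below 2 is one way to apply the downward adjustment for ties described above.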

Example 10.2

You want to compare two Poisson populations under conditions where you will achieve 95% power (one-sided α=0.05) for a mean in the first sample of 2.0 against a second sample with a mean of 1.0. The distributions of the observations for the two samples are

	Frequency
i	m=2.0	m=1.0
0	0.135	0.368
1	0.271	0.368
2	0.271	0.184
3	0.180	0.061
4	0.090	0.015
5	0.036	0.003
6	0.012	0.0005
7	0.003	7×10^-5

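To compute P, sum the joint probabilities of all pairs of values (i, j) with i > j and add one half of the probability of a tie:

P = Σ_{i>j} f₁(i) f₂(j) + ½ Σ_i f₁(i) f₂(i)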

For the two distributions above, the value of P is 0.710. Thus,

N = 2 × 3.29² / (12 × 0.210²) = 41

If a full correction for ties is applied (and more values of i used), the value would be 38.
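
The value of P for two Poisson samples can be computed directly rather than from a truncated table. The sketch below sums the Poisson probabilities up to k_max = 40 (an arbitrary cutoff chosen here, far into the tail for these means) and adds half the tie probability; the function names are illustrative. It does not apply the variance correction for ties, so it corresponds to the uncorrected calculation rather than the fully corrected value of 38.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Poisson probability of observing exactly k events."""
    return exp(-mean) * mean ** k / factorial(k)

def prob_x_exceeds_y(mean_x, mean_y, k_max=40):
    """P(X > Y) + 0.5 * P(X = Y) for independent Poisson X and Y."""
    fx = [poisson_pmf(k, mean_x) for k in range(k_max + 1)]
    fy = [poisson_pmf(k, mean_y) for k in range(k_max + 1)]
    greater = sum(fx[i] * fy[j] for i in range(k_max + 1) for j in range(i))
    ties = sum(fx[k] * fy[k] for k in range(k_max + 1))
    return greater + 0.5 * ties

p = prob_x_exceeds_y(2.0, 1.0)
n = 2 * 3.29 ** 2 / (12 * (p - 0.5) ** 2)
print(round(p, 3), round(n, 1))
```

Because more terms and unrounded probabilities are used, P comes out slightly above the 0.710 obtained from the table, and N correspondingly slightly lower.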

Example 10.3

How many observations would be required to distinguish in a two-sided test between two Poisson populations with means of 10 and 5, respectively, under conditions of 90% power with α=0.05?

The distribution of the X_i in this case is approximately normal with mean and variance of 10, while that for the Y_j is normal with mean and variance of 5. In order to compute P, we can note that the distribution of X-Y is approximately normal with a mean of 5 and a variance of 15. Using the standard cumulative normal distribution, Φ, for z=5/15^(1/2), we note that

P[(X-Y)>0] = P(X_i>Y_j) = Φ(1.29) = 0.901

Thus,

N = 2 × 3.24² / (12 × 0.401²) = 11
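
The same calculation can be sketched in Python using the normal approximation described in this example. The helper name normal_approx_p is illustrative, and the value z' = 3.24 is taken from the two-sided, 90% power entry of the table of critical values.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_p(mean_x, var_x, mean_y, var_y):
    """P(X > Y) when X - Y is treated as normal with the given moments."""
    z = (mean_x - mean_y) / sqrt(var_x + var_y)
    return NormalDist().cdf(z)

# Poisson means 10 and 5: each variance equals the corresponding mean.
p = normal_approx_p(10, 10, 5, 5)
n = 2 * 3.24 ** 2 / (12 * (p - 0.5) ** 2)
print(round(p, 3), round(n, 1))
```

The result is rounded up to the next whole observation per group, giving the N of 11 quoted above.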

10.4. Sample Problems

You will be comparing the cloning efficiency (colony forming ability) of two cell lines in agarose. To do this experiment, you plate 100 cells in each of N wells for each cell strain and for each well count the number of colonies that arise after 2 weeks in culture. You will compare the cloning efficiencies of the two cell strains by testing for a difference in the colony counts using the Wilcoxon rank sum test. If past experience indicates that the cloning efficiency of Cell Strain A is 10%, how many wells would you have to plate (how large must N be) in order to detect a significant difference between the two strains at the 5% significance level with power of 95% if the true cloning efficiency of Cell Strain B were 20%?

You are interested in studying the effect of treatment of a cell line with a hormone on the expression of an enzyme activity. The basal level of expression for the cell line is 100 units. How many independent observations (equal numbers for control and hormone-treated) would you need to make in order to have a 95% chance of detecting a two-fold increase in enzyme activity in the hormone-treated cells? Assume that your measurements are normally distributed with a variance in each case of 1/2 the mean number of units.

The incidence of lung tumors in untreated mice of a particular strain is 10%. How many animals would you need to study in order to detect with 90% power (for α = 0.05) an increase in tumor incidence to 30% for animals treated with a carcinogen? Assume that the treated and concurrent control groups are of equal size.