6. Analysis of Categorical Data

A variety of experiments generate data that are qualitative in nature such that the observations may be classified as belonging to one of two or more categories. Such data can be summarized conveniently as a table, generally referred to as a contingency table. We will often want to test hypotheses in which we are interested in the relationship between two or more different classification schemes.

6.1. The 1 × c Table: Goodness of Fit Tests

The simplest contingency table is one dimensional. In this case, we may be interested in comparing the number of observations in each of the categories to those predicted by some predetermined model. As discussed in Chapter 4, our hypothesis may be simple, i.e., the proportions in the categories are completely specified, or composite, i.e., the proportions depend on one or more unknown parameters.

6.1.1. No unknown parameter to estimate.

We consider again, as in Example 4.2, the problem of deviation from Mendelian segregation. In this experiment, we observe particular numbers of animals of the three possible genotypes obtained in an intercross and want to test the hypothesis that the observed frequencies conform to the 1:2:1 proportions expected in the case that all three genotypes are recovered with similar efficiencies. More generally, we might classify n observations according to k categories and want to test the hypothesis that the observed ni (i=1 … k) do not differ from that specified by a multinomial distribution in which all of the pi are determined by the hypothesis. This "goodness of fit" test was first described by Pearson in 1900 and is based on the following statistic:

     X² = Σ (ni − npi)² / (npi),

where the sum is taken over the k categories.

When n is large, the observed frequencies will be normally distributed with an expected value of npi and a variance of approximately the same magnitude. Thus, each term inside the summation is approximately distributed as the square of a standardized normal variate. The sum of such squared normal values follows a χ2 (Chi-square) distribution. Because it is sufficient to determine (k-1) of the pi (they must add up to 1), the test statistic, X2, follows a χ2 distribution with (k-1) degrees of freedom. Although this distribution is difficult to compute, tables of the critical values for the distribution may be found in virtually any statistics text (see Appendix 3). Note that the χ2 distribution is a good approximation only when n is relatively large. In practice, the approximation is quite good when the expected frequencies are all greater than 5 and is still acceptable when the npi are all greater than 1.5.

Example 6.1 Deviation from Mendelian segregation.

We mate D/d animals and recover 1 D/D, 9 D/d, and 9 d/d offspring. If all of the progeny classes are recovered with equal efficiency, we expect proportions of 0.25, 0.5, and 0.25, respectively, giving expected numbers of 4.75, 9.5, and 4.75. We can compute the test statistic as

     X² = (1 − 4.75)²/4.75 + (9 − 9.5)²/9.5 + (9 − 4.75)²/4.75 ≈ 6.79

Interpolating from the χ2 distribution with 2 degrees of freedom, we obtain a significance level of P<0.034. You may wish to compare this analysis for general deviations from Mendelian segregation with the approach using the binomial distribution.
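These calculations are simple to reproduce. The sketch below is a minimal plain-Python check of Example 6.1; it uses the closed-form upper tail P = exp(−X²/2), which is exact for a χ2 distribution with 2 degrees of freedom.

```python
import math

def chisq_goodness_of_fit(observed, probs):
    """Pearson goodness-of-fit statistic X^2 = sum (n_i - n*p_i)^2 / (n*p_i)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, expected

# Example 6.1: 1 D/D, 9 D/d, 9 d/d offspring vs. expected 1:2:1 proportions
x2, expected = chisq_goodness_of_fit([1, 9, 9], [0.25, 0.5, 0.25])
print(expected)                      # [4.75, 9.5, 4.75]
print(round(x2, 2))                  # 6.79
# For 2 degrees of freedom the upper-tail area has the closed form exp(-X2/2)
print(round(math.exp(-x2 / 2), 3))   # 0.034
```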

6.1.2. At least one unknown parameter to estimate.

A very common situation arises when the expected values are not completely determined by the hypothesis, but are functions of unknown parameters that must be estimated from the data. A typical genetic example is the testing of Hardy-Weinberg equilibrium.

Example 6.2 Deviation from Hardy-Weinberg equilibrium.

We draw a random sample from a population and find 20 D/D, 90 D/d, and 90 d/d individuals. The gene frequency for the D allele in the sample is (2×20 + 90)/400 = 0.325. So the expected numbers are 200 × 0.325² = 21.125, 2 × 200 × 0.325 × 0.675 = 87.75, and 200 × 0.675² = 91.125.

We can compute the test statistic as

     X² = (20 − 21.125)²/21.125 + (90 − 87.75)²/87.75 + (90 − 91.125)²/91.125 ≈ 0.13

In this case we refer to a χ2 distribution with 1 degree of freedom. The general rule is that the degrees of freedom equal (the number of categories) - 1 - (the number of parameters estimated). In Example 6.1, no parameters were estimated and so we had 2 df. Here one parameter was estimated and we have only 1 df. Using the table in Appendix 3, we find that P ~ 0.72, and we would not reject the hypothesis that the population conforms to HWE.
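A similar sketch covers Example 6.2. Here the gene frequency is estimated from the data, and the 1-df upper tail is computed with the closed form erfc(√(X²/2)).

```python
import math

# Example 6.2: test for Hardy-Weinberg equilibrium with one estimated parameter
observed = {"D/D": 20, "D/d": 90, "d/d": 90}
n = sum(observed.values())                                # 200 individuals

# Estimate the frequency of the D allele from the sample (2 alleles per animal)
p_D = (2 * observed["D/D"] + observed["D/d"]) / (2 * n)   # 0.325

expected = {"D/D": n * p_D ** 2,
            "D/d": 2 * n * p_D * (1 - p_D),
            "d/d": n * (1 - p_D) ** 2}

x2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in observed)

# 3 categories - 1 - 1 estimated parameter = 1 df;
# upper tail of a chi-square with 1 df: P = erfc(sqrt(X2 / 2))
p_value = math.erfc(math.sqrt(x2 / 2))
print(round(x2, 3), round(p_value, 2))   # 0.131 0.72
```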

6.2. The 2 × 2 Table

Consider an experiment involving the analysis of categorical data, in which we make n observations and classify each observation according to whether or not it possesses two properties, A and B. After the experiment, we can tabulate the data as a 2 × 2 table, with the values nij representing the number of observations in the i,j cell and ri and cj the sum of the number of observations in row i and column j, respectively:


            B     ~B    Total
    A      n11    n12    r1
   ~A      n21    n22    r2
   Total   c1     c2     n

6.2.1. Four underlying sampling distributions.

Four different experimental designs can give rise to the above 2 × 2 table depending on whether under repeated sampling n is variable, n is fixed, n and either the row or column totals are fixed, or n and both row and column totals are fixed. Consider the following examples.

Example 6.3

Model 0: Each cell entry is a random variable.

In this example each cell represents a Poisson variable, and, hence, the grand total is also a Poisson variable, i.e., n is not fixed.

You sample numbers of accidents during 1999 on a particular road in Montana and classify them as to time of day and weather conditions. The results are:

         Day   Night
   Dry    10       5
   Wet     3      12

Since each entry is the realization of a Poisson random variable, the sum is also a random variable. However, if we condition our test on the observed sum, then the entries are multinomial random variables and the problem reduces to Example 6.4, below. This is a standard result and we will not prove it here.

Example 6.4

Model 1: Double dichotomy; row and column totals not fixed.

You are interested in testing the hypothesis that two RFLP markers, M1 and M2, are linked. You perform a backcross in which (M1D/B M2D/B) animals are mated to (M1B/B M2B/B) animals and analyze the genotypes of 30 progeny, i.e., n is fixed. You observe that 12 offspring are (B B / B B), 3 are (B D/B B), 5 are (D B/B B) and 10 are (D D / B B). If Bi indicates homozygosity at locus i and Di indicates heterozygosity at locus i, we want to test the null hypothesis that

     H0:  P[B1B2] = P[B1] × P[B2]

against the one-sided alternative

     H1:  P[B1B2] > P[B1] × P[B2]

Example 6.5

Model 2: Test for homogeneity; row (or column) totals fixed.

You have developed a recombinant vaccine for a viral disease and want to test it for efficacy. You inoculate 15 animals with the vaccine and inject 15 animals with saline. All of the animals are then infected and the presence or absence of virus-induced disease is evaluated after 2 weeks. Among the control animals 12 develop the disease, while only 5 of the inoculated animals become ill. Using pu and pi to denote the probabilities of developing the disease for untreated and inoculated animals, respectively, we want to test the hypothesis

     H0:  pu = pi

against the one-sided alternative

     H1:  pu > pi

Example 6.6

Model 3: Both margins fixed.

It is difficult to define a good biological example for this case. You are studying a mutant that, when homozygous (aa), exhibits a fairly subtle phenotype that can be observed only on microscopic examination by someone who is highly trained. In order to test the ability of the person scoring for the phenotype, you assemble a collection of 30 animals, of which 17 are homozygous mutant and 13 are heterozygous (Aa) as determined by an independent method. You tell your scorer that all of the animals come from a backcross and ask that the animals be classified as homozygous mutant or heterozygous. Because the scorer is genetically trained, s/he will use the expected equal frequencies of the categories in making the assignments. Of the 17 mutant animals, the scorer assigns 12 correctly. Using the subscript conventions in the 2 × 2 table above, we want to test the hypothesis that our scorer does better at assigning the genotype than would be predicted by chance:

     H0:  p11/p1· = p12/p2·

     H1:  p11/p1· > p12/p2·

In spite of the differences in these experimental designs, each of the above examples yields the same 2 × 2 table. Quite remarkably, the same method provides a powerful test of each of the above hypotheses and the distribution of the test statistic is identical for all three null hypotheses. [Note that the same cannot be said of the distribution under the alternative hypothesis, a matter we will revisit when discussing experimental design.] The method has its origin in considering the case with both margins fixed and is named for R. A. Fisher (1973).

6.2.2. Fisher's exact test: the hypergeometric distribution.

If all four of the marginal totals are fixed, the value of any one of the cells, e.g., n11, is sufficient to determine the entire table. Thus, the probability of obtaining any given 2 × 2 table can be obtained by considering the distribution of n11 conditional on n, r1, and c1. The appropriate probability is given by the hypergeometric distribution,

     f(x) = P(n11 = x) = C(r1, x) C(r2, c1 − x) / C(n, c1),

where C(a, b) = a!/[b!(a − b)!] denotes the binomial coefficient; by symmetry, this equals C(c1, x) C(c2, r1 − x) / C(n, r1).

For Example 6.6, we can define the following table


                True Aa   True aa   Total
   Assign Aa       10         5       15
   Assign aa        3        12       15
   Total           13        17       30

Our scorer has made 15 Aa assignments, of which 10 are correct, and there are C(13, 10) = 286 ways to do so. For each of these permutations, there are C(17, 5) = 6188 ways to incorrectly assign an aa genotype. Thus, the probability of making 10 correct Aa assignments is given by

     f(10) = C(13, 10) C(17, 5) / C(30, 15) = (286 × 6188)/155,117,520 ≈ 0.0114

Similarly, we can compute the probabilities of making 11, 12, and 13 correct Aa assignments as 0.0012, 5.7×10-5, and 8.8×10-7, respectively. Thus, the probability of making 10 or more correct Aa assignments is approximately 0.0127, which is the significance level for our statistical test. The distribution for the test statistic is equivalent to that for the hypergeometric distribution, which arises in the case of sampling from a finite population without replacement. Note that this test is exact in the sense that we can directly compute the distribution of our test statistic (the value of n11) under the null hypothesis.
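The hypergeometric computation for Example 6.6 can be scripted directly with binomial coefficients; the sketch below is a minimal one-sided version.

```python
from math import comb

def fisher_exact_one_sided(table):
    """One-sided Fisher's exact test for a 2x2 table [[n11, n12], [n21, n22]].

    Returns P(n11 >= observed value) with all four margins treated as fixed.
    """
    (n11, n12), (n21, n22) = table
    r1, c1, n = n11 + n12, n11 + n21, n11 + n12 + n21 + n22
    c2 = n - c1
    hi = min(r1, c1)          # largest value n11 can take
    denom = comb(n, r1)
    return sum(comb(c1, x) * comb(c2, r1 - x) for x in range(n11, hi + 1)) / denom

# Example 6.6: 10 of 15 Aa assignments correct
p = fisher_exact_one_sided([[10, 5], [3, 12]])
print(round(p, 4))   # 0.0127
```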

The primary virtues of Fisher's exact test are that it makes optimum use of the available information and allows for a straightforward computation of the exact significance level. A disadvantage of this test is that computations by hand (i.e., by pocket calculator) are cumbersome. To simplify computation of the P-value, organize the table such that c1 is the smallest of the 4 marginal totals and n11 can increase in the direction specified by the one-sided alternative hypothesis. Then, the significance level is given by

     P = Σ f(x), summed from x = n11 to c1,  where  f(x) = C(r1, x) C(r2, c1 − x) / C(n, c1)

In the case that the row (or column) totals are equal, the hypergeometric distribution is symmetrical and the two-sided significance level can be obtained by simply doubling the value computed for the upper tail. When the row totals are not equal, the simplest way to compute the two-sided P-value is to compute the upper tail as above and compute the lower tail by summing the f(x) that are no larger than the value for f(n11). This approach is used in the example below.

Example 6.7 Using Fisher's exact test for a two-sided alternative.

In a variation of the experiment described in Example 6.5, we are testing whether or not vaccination prevents or exacerbates a disease caused by infection with a particular agent. We vaccinate 10 animals and infect these animals along with 7 control animals. After two weeks, we find that 6 of the vaccinated animals and 2 of the control animals are healthy and the remaining animals suffer from the disease. We want to test the hypothesis

     H0:  pu = pi

     Ha:  pu ≠ pi

where pu and pi are the probabilities of illness in control and vaccinated animals, respectively. For this experiment, we obtain the following table with the distribution for all possible tables shown below. (Note that the table is rotated so that c1 is the smallest marginal value.)

             Control   Vaccinated   Total
   Ill           5           4         9
   Healthy       2           6         8
   Total         7          10        17

The observed table, in which x=5, has a probability of 0.181. There are 5 additional tables with probabilities that are less than or equal to that for the observed table. Summing these 6 possible outcomes, we obtain a significance level of 0.335.

   x     f(x)
   0     0.0004
   1     0.013
   2     0.104
   3     0.302
   4     0.363
   5     0.181
   6     0.034
   7     0.002
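The two-sided procedure described above (summing all f(x) that are no larger than f(n11)) can be sketched as follows; applied to Example 6.7 it reproduces the P-value of 0.335.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test: sum f(x) over all tables whose
    hypergeometric probability is no larger than that of the observed table."""
    (n11, n12), (n21, n22) = table
    r1, c1, n = n11 + n12, n11 + n21, n11 + n12 + n21 + n22
    c2 = n - c1
    lo, hi = max(0, r1 - c2), min(r1, c1)    # feasible range for n11
    denom = comb(n, r1)
    f = {x: comb(c1, x) * comb(c2, r1 - x) / denom for x in range(lo, hi + 1)}
    f_obs = f[n11]
    return sum(px for px in f.values() if px <= f_obs + 1e-12)

# Example 6.7 (table rotated so that c1 = 7 is the smallest margin):
#            Control  Vaccinated
#   Ill          5         4
#   Healthy      2         6
p = fisher_exact_two_sided([[5, 4], [2, 6]])
print(round(p, 3))   # 0.335
```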

6.2.3. An unconditional exact test.

Fisher's exact test is conditional on both the row and column totals (Model 3 above), but nonetheless provides an appropriate (albeit conservative) test in the case that only one of the margins (rows or columns) is fixed. Barnard (1947) described an exact, unconditional test for homogeneity (Model 2) that is more powerful than Fisher's conditional test under that model (Martín Andrés et al., 2004).

Consider an experiment in which we compare two groups (e.g., treated and control), each with a number of observations fixed by design, for the proportion of subjects exhibiting some response. We can tabulate our results in the familiar 2 × 2 table


                Responder   Non-responder   Total
   Treatment       x1          r1 − x1        r1
   Control         x2          r2 − x2        r2
   Total        x1 + x2     n − x1 − x2        n

We want to test the hypothesis that the probability of responding in the treatment group, p1, is the same as that in the control group, p2,

H0:  p1 = p2

against, for example, the one-sided alternative

H1:  p1 > p2

Here we suppose that, under the null hypothesis, the common probability of responding for the two groups is p. Then, the probability of obtaining our observed table, To, is the product of two binomials:

     P(To) = C(r1, x1) p^x1 (1 − p)^(r1 − x1) × C(r2, x2) p^x2 (1 − p)^(r2 − x2)

To obtain the P-value for our test, we need to consider some critical region (CR) that contains all of the tables that represent an outcome at least as extreme as our observed table. Then, the significance level α(p) can be obtained by summing the above probabilities over all of the tables in the CR.

Of course, the value of p under the null hypothesis is unknown; it is a nuisance parameter. As discussed by Barnard, the significance level, α*, can be determined by finding the value of p that maximizes the above equation, that is,

     α* = sup of α(p) over 0 ≤ p ≤ 1

All that remains is to define the critical region for the test. In Barnard's original paper, the CR is built up iteratively following a few simple rules. Start with the most extreme possible outcome, where x1=0 and x2=r2, and compute α*. The next possible table is added to the CR from among those that satisfy the rule of convexity, i.e., the newly added table must be immediately adjacent to the CR. The table that is added is the one that increases α* by the smallest amount (the rule of minimum). In the case of a two-sided test, a symmetry rule is applied. When a table is added to the CR, the corresponding table from the opposite tail is added simultaneously. This algorithm for building the CR is referred to in Barnard's paper as the CSM method, and it is repeated until the observed table is the last table added to the critical region.

The CSM method described by Barnard is computationally very demanding, because the maximization step to obtain α* must be performed many times to build up the critical region. The 1947 paper includes tables for the P-values for experiments in which r1 and r2 are (7,7), (6,8), and (5,9), which must have taken many weeks to compute by hand. Various less computationally demanding approaches to defining the CR have been proposed based on using an alternative statistic, S(T), and including all tables for which S(T) is more extreme than that for the observed table. Two common choices for S(T) include the Chi-square test (see section 6.2.4) and Fisher's exact test. The latter has been shown to be nearly as powerful as the CSM method over a wide range of r1/r2 (Martín Andrés and Silva Mato, 1994).

Example 6.8

We are interested in comparing the incidence of spontaneous liver tumors in two inbred mouse strains. Male mice of each strain are analyzed at 15 months of age for the presence or absence of liver cancer to obtain the following results

              Tumor   Tumor-free   Total
   Strain 1       1           20      21
   Strain 2       5           12      17
   Total          6           32      38

We want to test the null hypothesis that tumor incidence is the same for the two strains, p1 = p2, against the two-sided alternative that they are different, p1 ≠ p2.

The CR is built up using the CSM method to obtain the following (1 indicates that the table is included; the observed table, corresponding to x1 = 1 and x2 = 5, is the last table added)

   x2 \ x1:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
   17        1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0
   16        1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0
   15        1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0
   14        1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  1
   13        1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  1
   12        1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  1
   11        1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1
   10        1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1
    9        1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1
    8        1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
    7        1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1
    6        1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1
    5        1  1  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1
    4        1  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1
    3        1  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1
    2        0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1
    1        0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
    0        0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

Maximizing α(p), we find that the maximum occurs at p = 0.166, giving a two-sided P-value of 0.049. Note that Fisher's exact test would give a two-sided P-value of 0.071.
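The unconditional test can be sketched in a few lines of Python. For simplicity, the version below replaces Barnard's CSM construction with the Chi-square ordering statistic S(T) discussed above, so its P-value for Example 6.8 need not equal the 0.049 obtained with CSM; the nuisance parameter p is maximized over a simple grid, an approximation to the supremum.

```python
from math import comb

def chisq_stat(x1, x2, r1, r2):
    """Chi-square statistic for the 2x2 table with fixed group sizes r1, r2."""
    n = r1 + r2
    c1, c2 = x1 + x2, n - x1 - x2
    if c1 == 0 or c2 == 0:
        return 0.0
    num = n * (x1 * (r2 - x2) - (r1 - x1) * x2) ** 2
    return num / (r1 * r2 * c1 * c2)

def barnard_chisq_ordered(x1, x2, r1, r2, grid=1000):
    """Unconditional exact test (one fixed margin): the critical region is
    all tables whose Chi-square statistic is at least the observed value."""
    s_obs = chisq_stat(x1, x2, r1, r2)
    # Precompute the binomial weight and total successes for each CR table
    region = [(comb(r1, a) * comb(r2, b), a + b)
              for a in range(r1 + 1) for b in range(r2 + 1)
              if chisq_stat(a, b, r1, r2) >= s_obs - 1e-12]
    best = 0.0
    for i in range(1, grid):
        p = i / grid
        alpha = sum(w * p ** k * (1 - p) ** (r1 + r2 - k) for w, k in region)
        best = max(best, alpha)
    return best

# Example 6.8: 1/21 vs. 5/17 tumor-bearing mice (two-sided via the X2 ordering)
pval = barnard_chisq_ordered(1, 5, 21, 17)
print(round(pval, 3))
```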

6.2.4. The Chi-square test.

The Chi-square test is the classical test for independence in a 2 × 2 table (Model 1), in which only n, the total number of observations, is fixed. This test may also be applied to 2 × 2 tables under the other models in which the row and/or column totals are fixed. The test statistic is:

     X² = n (n11n22 − n12n21)² / (r1 r2 c1 c2)

This test statistic may be referred to the χ2 (Chi-square) distribution with 1 df (in the case of the two-sided alternative). Alternatively, its signed square root,

     X = (n11n22 − n12n21) √n / √(r1 r2 c1 c2),

is approximately distributed as a standard normal variate with mean 0 and variance 1, and may be used to test against one- or two-sided alternatives to the null hypothesis of equal success probabilities for the two treatments defined by the row categories.

6.2.5. Comparison of tests for 2 × 2 tables.

The three tests discussed above are motivated by three different experimental models, depending on whether neither, either, or both of the row and column totals are fixed by design. The tests also range in computational difficulty from trivial (Chi-square) to modest (Fisher's exact test) to substantial (Barnard's exact test). Which test should you use for any given experimental context?

The derivation of Fisher's exact test was based on a model in which both the row and column sums are fixed in advance, a relatively uncommon experimental design. You will more often analyze data for experiments in which either n randomly chosen individuals are classified according to the row and column categories (neither ri nor cj fixed) or predetermined numbers of individuals are drawn at random from two or more populations and classified according to characteristic or response (either ri or cj fixed). For these two experimental designs, the P-value calculated using Fisher's exact test is conditional on the row and column totals. As discussed by Conover (1999), if the conditional P-value under H0 is α, then the unconditional probability under H0 is always less than or equal to α. Thus, the exact test remains valid for the cases in which the row and/or column totals are random variables. However, in these cases, Fisher's exact test tends to be conservative, yielding a larger P-value than the true α. For example, Storer and Kim (1990) found that the power of Fisher's exact test is 10-20% lower than that for the approximate (X or X2) test when moderate sample sizes (more than 20 per group) are tested for a difference in proportions. Note, however, that the latter, approximate test statistics converge to normal or χ2 distributions as n → ∞. The quality of these approximations also depends on the expected numbers of observations per cell, and is relatively poor when there are cell counts smaller than 5. The estimated P-value may be larger or smaller than the true value of α, i.e., Chi-square tests may be too conservative or too liberal when the sample size or cell counts are small.

For a model in which either the row or column totals are fixed, Barnard's exact test provides a reasonable alternative. Barnard's exact test was used rarely in the decades after its publication because the maximization step is computationally very demanding when compared to Fisher's exact test. However, Barnard's test is more powerful than Fisher's for the case of one fixed margin when the marginal totals are small, largely because the highly discrete nature of Fisher's exact test provides fewer possibilities in terms of the computed P-value (Martín Andrés and Silva Mato, 1994; Martín Andrés et al., 2004). The controversy over which of the two exact tests should be used has continued since the initial publication by Barnard (Kempthorne, 1979). One source of contention is the appropriateness of the binomial model. Consider an experiment in which you want to compare the prevalence of some phenotype in two groups of mice of different genotype. One can readily envision each group as having been drawn at random from a notionally infinite population of mice representing that genotype, leading to a binomial model in which only the group sizes are fixed, which would make Barnard's exact test most appropriate.

On the other hand, consider the following clinical trial. You want to compare the effectiveness of a standard treatment regimen with one in which patients are treated with an augmented version that includes an additional drug. You identify the next 80 patients that come into the clinic with the disease of interest and randomly assign them to the two treatments in equal numbers. It may reasonably be argued that these 80 patients do not represent a random sample of all potential patients and that, under the null hypothesis of no difference between treatments, the total number of responders is fixed. In that case, both the row and column totals should be viewed as fixed and Fisher's exact test would be most appropriate.

Thus, there is no simple answer to the question posed above for experiments in which the row and/or column totals are not fixed. Our advice is to use one of the exact tests when sample sizes are small (e.g., ri ≤ 50) or when one or more of the cells contain fewer than 5 observations. Using an exact test for these cases ensures that the true value of α is no larger than the calculated P-value. When the experiment involves fixed group sizes, you should choose between Fisher's and Barnard's exact test based on your view of the appropriateness of the binomial model. In any event, you should decide which test to use before you begin the analysis: It is never appropriate to shop among statistical tests for the P-value you find most desirable.

6.3. Paired data and structural zeros

6.3.1. McNemar's test for analysis of paired data.

For all of the cases described above, the n observations in the 2 × 2 table are independent. Example 6.7 represents a frequently encountered experimental approach. We assign at random members of a population to two distinct groups (vaccinated and control) and are interested in testing the null hypothesis that the probability of some outcome (e.g., illness) is the same for the two groups. An alternative experimental design, in which each sample consists of a pair of observations, may be more appropriate when "uninteresting" sources of variation, in addition to the variable represented by our experimental groups, may contribute to the outcome we observe.

Consider the problem of testing the hypothesis that the incidence of a disease is greater among individuals with a particular genotype at a specific locus when compared with other members of the population. We could test that hypothesis in a prospective study by identifying a sufficient number of members of each group and monitoring them for the development of the disease over some fixed period of time. At the end of the study, our data would consist of the familiar 2 × 2 table, which may be analyzed using Fisher's exact test:


Diseased Healthy Total
Genotype 1 n11 n12 r1
Not Genotype 1 n21 n22 r2
Total c1 c2 n

This experimental design is expensive if the disease we are studying is relatively infrequent in both groups, is also influenced by many factors other than genotype, or develops after a long and variable period of time. An alternative strategy is to test the hypothesis in a retrospective, case-control study, in which we identify a number of individuals suffering from the disease (cases), choose for each an individual (control) that is similar to our case from a disease-free population, and determine the genotypes for both individuals in each paired sample. Four outcomes are possible for each (case, control) pair, where 1 indicates that the individual has Genotype 1: (1,1), (1,0), (0,1), (0,0). We could organize the data as shown in the table below.



                          Controls
                   Gen. 1   Not Gen. 1   Total
   Cases  Gen. 1      a          b       (a+b)
      Not Gen. 1      c          d       (c+d)
   Total           (a+c)      (b+d)        n

This problem can be viewed as a variation on the Sign test alluded to in Problem 7 of Section 2.5. The a and d concordant pairs (both case and control have the same genotype classification) represent tied values and can be discarded from the analysis. Under the null hypothesis that the prevalences of Genotype 1 among the cases and controls are equal, the two types of discordant pairs [events (1,0) and (0,1)] will each occur with a probability of 0.5. Thus, we can use the binomial distribution with parameters p=0.5 and N=b+c to test our hypothesis (McNemar, 1947). For the one-sided alternative that (1,0) pairs are more frequent than (0,1) pairs, the exact P-value can be computed from:

     P1 = Σ C(b + c, x) (1/2)^(b+c), summed from x = b to b + c

The P-value for the two-sided alternative may be obtained by adding to P1 the value for the lower tail of the binomial distribution, i.e., summing the values for x=0 to x=c. Results for experiments of this type are often reported in terms of the odds ratio, which is simply b/c.
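A sketch of the exact McNemar computation follows; the discordant-pair counts b = 15 and c = 5 are hypothetical, chosen only for illustration.

```python
from math import comb

def mcnemar_exact(b, c, two_sided=False):
    """Exact McNemar test: under H0 each discordant pair is (1,0) with
    probability 1/2, so x ~ Binomial(b + c, 1/2)."""
    n = b + c
    upper = sum(comb(n, x) for x in range(b, n + 1)) / 2 ** n
    if not two_sided:
        return upper
    lower = sum(comb(n, x) for x in range(0, c + 1)) / 2 ** n
    return upper + lower

# Hypothetical case-control study: 15 (1,0) pairs and 5 (0,1) pairs
print(round(mcnemar_exact(15, 5), 4))   # 0.0207
print(15 / 5)                           # odds ratio b/c = 3.0
```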

6.3.2. The structural zero.

Not all 2 × 2 tables are contingency tables. Consider the hypothetical example below:



                       Secondary Infection
                         Yes       No
   Primary     Yes        20       60
   Infection   No          0       60

One of the four cells is zero, not because we just happened to pick up a zero through random fluctuation, but because one cannot have a secondary infection without first having had a primary infection; the zero is said to be a structural zero. The hypothesis to be tested is that primary and secondary infections occur with equal probability and independently of one another.

Let p = Pr(infection). Then under the null hypothesis, the likelihood is

     L(p) = (p²)^A [p(1 − p)]^B (1 − p)^C = p^(2A+B) (1 − p)^(B+C),

where A = 20, B = 60, and C = 60.

The maximum likelihood estimator of p is

     p̂ = (2A + B) / (2A + 2B + C),

with a variance of V = p(1-p)/[n(1+p)]. For our data, p̂ = 100/220 = 0.455, and so the expected values of A, B, and C are 28.93, 34.71, and 76.36, respectively. We may now apply the Chi-square goodness of fit test, for which we have three classes and one parameter estimated, yielding 1 df: X2 = 24.69. Clearly, first and second infections are either not equal or not independent; e.g., there are far too many in the "Yes, No" cell.
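The structural-zero analysis above can be checked with a few lines of Python:

```python
# Structural-zero example: p = Pr(infection); cells have expected proportions
# A ~ p^2 (Yes, Yes), B ~ p(1-p) (Yes, No), C ~ (1-p) (No, No)
A, B, C = 20, 60, 60
n = A + B + C

# Maximum likelihood estimate from L(p) = p^(2A+B) (1-p)^(B+C)
p_hat = (2 * A + B) / (2 * A + 2 * B + C)   # 100/220

expected = [n * p_hat ** 2, n * p_hat * (1 - p_hat), n * (1 - p_hat)]
x2 = sum((o - e) ** 2 / e for o, e in zip([A, B, C], expected))

print([round(e, 2) for e in expected])   # [28.93, 34.71, 76.36]
print(round(x2, 2))                      # 24.69
```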

6.3.3. Another goodness of fit test masquerading as a 2 × 2 contingency table.

Here is one more example of a 2 × 2 table which is actually a goodness of fit problem with 3 degrees of freedom. The data are from one of Mendel's original two-factor crosses (F2):


Yellow Green Total
Round 315 108 423
Wrinkled 101 32 133
Total 416 140 556

In this case there are no parameters to estimate: We expect from Mendel's theory that the cell entries should be in a 9:3:3:1 ratio, and that the two marginals (row and column totals) should each be in a 3:1 ratio. Testing for the 9:3:3:1 ratio, the total X2=0.47 with 3 df. The row and column totals yield Chi-squares of 0.3453 and 0.0096, respectively, each with 1 df. The difference 0.47 - 0.3453 - 0.0096 = 0.115 with 1 df is a pure test of linkage. We see that the 3 df Chi-square test may be partitioned into three individual tests, each with 1 df. We will discuss the partitioning of contingency tables more generally later in this chapter.
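The partition of Mendel's data can be verified numerically:

```python
def pearson_x2(observed, expected):
    """Pearson goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Mendel's F2 data: (round yellow, round green, wrinkled yellow, wrinkled green)
obs = [315, 108, 101, 32]
n = sum(obs)                                           # 556

# 9:3:3:1 joint test (3 df)
x2_total = pearson_x2(obs, [n * r / 16 for r in (9, 3, 3, 1)])

# 3:1 tests on the margins (1 df each)
x2_rows = pearson_x2([423, 133], [n * 3 / 4, n / 4])   # round : wrinkled
x2_cols = pearson_x2([416, 140], [n * 3 / 4, n / 4])   # yellow : green

# What remains after removing the two marginal tests is a 1-df test of linkage
x2_linkage = x2_total - x2_rows - x2_cols

print(round(x2_total, 2), round(x2_rows, 4), round(x2_cols, 4), round(x2_linkage, 3))
# 0.47 0.3453 0.0096 0.115
```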

6.4. The r × c Table

We now extend our analysis of contingency tables to the case of any arbitrary number of rows and columns. Our data consist of n observations classified according to r row categories and c column categories as in the table below.

   n11   n12   …   n1c   r1
   n21   n22   …   n2c   r2
    :     :         :     :
   nr1   nr2   …   nrc   rr
   c1    c2    …   cc    n

6.4.1. The exact test: Extended hypergeometric distribution.

For an r × c table, the generalization of the hypergeometric exact probability distribution is given by

     P(table) = (r1! r2! ⋯ rr!) (c1! c2! ⋯ cc!) / (n! × Π nij!),

where the product Π nij! is taken over all of the cells.

This distribution is no longer one-dimensional and how to order the tables is not immediately apparent. Consider the example below.

Example 6.9 Exact Analysis of a 3 × 2 table.

We are interested in the effects of pregnancy on the induction of mammary tumors in rats. Animals with a prior history of 0, 1, or more than 1 pregnancy are treated with a carcinogen at 6 months of age and analyzed for the presence of mammary tumors at 14 months of age. We obtain the following

                 Tumor-bearing   Tumor-free   Total
   Nulliparous         2              2          4
   Uniparous           3              2          5
   Multiparous         0              6          6
   Total               5             10         15

There are exactly 20 tables that correspond to the fixed row and column totals; they are listed below, each with the value of the Chi-square test statistic for that table and its exact probability (out of 3003).

Each table is determined by its first column (n11, n21, n31), since the fixed row totals (4, 5, 6) then determine the second column:

   (n11, n21, n31)   Chi-square   Probability (/3003)
   (0, 0, 5)           11.25         6
   (1, 0, 4)            5.625       60
   (2, 0, 3)            3.75       120
   (3, 0, 2)            5.625       60
   (4, 0, 1)           11.25         6
   (0, 1, 4)            5.4         75
   (1, 1, 3)            1.275      400
   (2, 1, 2)            0.90       450
   (3, 1, 1)            4.275      120
   (4, 1, 0)           11.4          5
   (0, 2, 3)            2.85       200
   (1, 2, 2)            0.225      600
   (2, 2, 1)            1.35       360
   (3, 2, 0)            6.225       40
   (0, 3, 2)            3.6        150
   (1, 3, 1)            2.475      240
   (2, 3, 0)            5.10        60   (observed table)
   (0, 4, 1)            7.65        30
   (1, 4, 0)            8.025       20
   (0, 5, 0)           15.00         1
Now, how might we arrange these tables? At least two methods suggest themselves: by the size of the Chi-square test statistic, or by their exact probabilities. Let us look at the result.

   Exact Pr (/3003)   Chi-square
     1                15.00
     5                11.4
     6 (×2)           11.25
    20                 8.025
    30                 7.65
    40                 6.225
    60 (×3)            5.10, 5.625 (×2)
    75                 5.4
   120 (×2)            4.275, 3.75
   150                 3.6
   200                 2.85
   240                 2.475
   360                 1.35
   400                 1.275
   450                 0.90
   600                 0.225

We see that the two methods lead generally to the same ordering, but not quite. Ordering the tables by Chi-square and summing the probabilities of all tables with Chi-square values at least as large as that of the observed table yields 0.1209, whereas ordering by exact probabilities gives 0.096. Usually, the ordering criterion will make little difference, but the fact that the two need not agree exactly is a complication one must be aware of. Here, had we been testing at the 10% level, it would have made a difference.
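The complete enumeration for Example 6.9, including both orderings, can be sketched as follows:

```python
from math import comb

# Example 6.9: row totals (4, 5, 6), column totals (5, 10), n = 15
rows, (c1, c2), n = (4, 5, 6), (5, 10), 15
E = [[r * cj / n for cj in (c1, c2)] for r in rows]   # expected counts

def x2(t):
    """Chi-square statistic; t = (n11, n21, n31), second column fixed by rows."""
    return sum((t[i] - E[i][0]) ** 2 / E[i][0]
               + (rows[i] - t[i] - E[i][1]) ** 2 / E[i][1] for i in range(3))

# Enumerate every table consistent with the margins; each occurs in
# C(4,n11) C(5,n21) C(6,n31) of the C(15,5) = 3003 equally likely arrangements
tables = []
for n11 in range(5):
    for n21 in range(6):
        n31 = c1 - n11 - n21
        if 0 <= n31 <= 6:
            count = comb(4, n11) * comb(5, n21) * comb(6, n31)
            tables.append(((n11, n21, n31), count, x2((n11, n21, n31))))

total = sum(c for _, c, _ in tables)                    # 3003
obs = next(t for t in tables if t[0] == (2, 3, 0))      # the observed table

p_by_prob = sum(c for _, c, _ in tables if c <= obs[1]) / total
p_by_x2 = sum(c for _, c, s in tables if s >= obs[2] - 1e-9) / total
print(len(tables), total)                       # 20 3003
print(round(p_by_prob, 3), round(p_by_x2, 4))   # 0.096 0.1209
```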

6.4.2. The Chi-square test.

Under the null hypothesis that the row and column classifications are independent, we can determine the expected proportion for each cell from the fixed row and column totals as

     pij = (ri/n)(cj/n),

so that the expected count for cell i,j is Eij = ri cj / n.

As in the case of the 2 × 2 table, we may use the Chi-square test statistic,

     X² = Σi Σj (nij − Eij)² / Eij,  where Eij = ri cj / n

This statistic follows a χ2 distribution with

(rc - 1) - (r - 1) - (c - 1) = (r-1)(c-1)

degrees of freedom. Note, again, we are using the general principle that the number of degrees of freedom is

df = (number of cells) - 1 - (number of parameters estimated).

Example 6.10 Analysis of a 3 × 2 table.

We repeat Example 6.9 with realistic numbers. We are interested in the effects of pregnancy on the induction of mammary tumors in rats. Animals with a prior history of 0, 1, or more than 1 pregnancy are treated with a carcinogen at 6 months of age and analyzed for the presence of mammary tumors at 14 months of age. We obtain the following data (with expected values in parentheses)

                 Tumor-bearing   Tumor-free   Total
   Nulliparous       3 (8.32)     14 (8.68)     17
   Uniparous         8 (7.83)      8 (8.17)     16
   Multiparous      12 (6.85)      2 (7.15)     14
   Total                23            24        47

Summing the statistic over the 6 cells, we obtain a value of 14.25. Using the χ2 distribution with (3-1) (2-1) = 2 degrees of freedom, our significance level is P<0.001 and we would reject the hypothesis that susceptibility to mammary carcinogenesis is independent of parity.

6.4.3. An equivalent likelihood ratio test.

In the discussion above, we were interested in testing the hypothesis that the row and column classifications for the n observations in the r × c table are independent. Consider an alternative experimental design. We have obtained r independent samples, each consisting of ri observations, corresponding to different treatment groups. For each treatment, we classify the observations according to c mutually exclusive categories. We want to test the null hypothesis

H0: p1j = p2j = … = prj for all j

against the general alternative that at least one pair of success probabilities is different.

Under the null hypothesis, the maximum likelihood estimates (see section 3.2) for the column probabilities are n.j/n, while the unrestricted maximum likelihood estimates are nij/ri.

Thus, we can define the likelihood ratio statistic as

G2 = 2 Σ nij ln(nij/eij),

where eij = ri(n.j/n) is the expected count for cell (i,j) under the null hypothesis and the sum is taken over all of the cells.
As with the Chi-square test, the likelihood ratio statistic approximately follows a χ2 distribution with (r-1)(c-1) degrees of freedom. In general, the Chi-square and likelihood ratio tests will give similar results and lead to the same conclusions. For the parity example given above, the value of G2 is 15.63 (compared with 14.25 for X2), yielding a P-value of 4×10⁻⁴. The χ2 approximation for both X2 and G2 improves with increasing sample size and expected cell counts, but the approximation is closer for X2 when some of the expected cell counts are smaller than 5.
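The likelihood ratio statistic for the same parity data can be computed directly from its definition, or through scipy's `lambda_` option. Again, this is a sketch assuming numpy and scipy are available:

```python
# G2 = 2 * sum( n_ij * ln(n_ij / e_ij) ), using the same expected counts
# e_ij as the Chi-square test, computed two ways for the parity data.
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[3, 14], [8, 8], [12, 2]], dtype=float)

# by hand, from the definition
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
expected = row * col / obs.sum()
g2_hand = 2.0 * np.sum(obs * np.log(obs / expected))

# via scipy: lambda_="log-likelihood" selects the G2 statistic
g2, p, df, _ = chi2_contingency(obs, lambda_="log-likelihood")
print(f"G2 = {g2:.2f}, df = {df}, P = {p:.1g}")
```

Both routes give the same value, matching the 15.63 quoted in the text.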

6.4.4. Partitioning r × c tables.

A very useful property of Chi-square statistics is their additivity for multiple, independent tests. That is, the sum of two independent Chi-square statistics with degrees of freedom d1 and d2, respectively, also follows a χ2 distribution with d1+d2 degrees of freedom. This property allows us to partition an r × c table into (r-1)(c-1) tests (each a 2 × 2 table) with 1 degree of freedom.

To partition a table, order the rows and columns in a way that allows you to make the desired comparisons. Start at the upper left corner of the table, and compute X2 or G2 for the resulting 2 × 2 table. For each remaining cell in the (r-1) × (c-1) portion at the lower-right of the original table, form a new 2 × 2 table with the value in the selected cell placed as n22 and the rest of the table filled in by summing the values upward and/or to the left, as appropriate. For example, a 3 × 4 table can be partitioned into 6 tests, with n22 containing the cells labeled 6, 7, 8, 10, 11, or 12, as indicated below.

 1   2   3   4
 5   6   7   8
 9  10  11  12

The 2 × 2 table for cell 11 is

 1+2+5+6   3+7
 9+10      11

The values of G2 for all of the partitions should sum (exactly, apart from rounding) to the value of G2 for the table as a whole, while the sum of the X2 values for the partitions will only approximate the value of X2 for the original table.

Example 6.11

In a study of the relation between the ABO blood groups and certain diseases, a large sample of patients and controls was collected. (The numbers of AB individuals were so small they are not shown).

Blood type   Controls   Peptic Ulcer   Gastric Cancer
A              2625         679             416
B               570         134              84
O              2892         983             383

Analyzing the table as a whole, we obtain a likelihood ratio statistic of G2 = 40.6. Using the χ2 distribution with 4 degrees of freedom, the resulting P-value is 3×10⁻⁸, allowing us to conclude that gastric disease is strongly associated with blood type. Our four partitions are

2625   679      G2 = 0.84,  P = 0.36
 570   134

3195   813      G2 = 28.96, P = 7×10⁻⁸
2892   983

3304   416      G2 = 0.18,  P = 0.67
 704    84

4008   500      G2 = 10.66, P = 0.001
3875   383

From our partitioned analysis, we could conclude that having blood type O is strongly associated with peptic ulcers and, to a lesser extent, gastric cancer.
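The partitioning procedure can be sketched in a few lines. The function below is our own construction (assuming numpy); it builds the 2 × 2 sub-tables for the blood-group data and confirms that their G2 values sum exactly to the full-table G2:

```python
# Partitioning an r x c table into (r-1)(c-1) independent 2x2 tables.
# For each cell (i, j) with i, j >= 1, the sub-table places n_ij as n22,
# the sums of the earlier cells in its row/column as n21/n12, and the sum
# of the upper-left block as n11.
import numpy as np

def g2(t):
    """Likelihood ratio statistic for a two-way table (no zero cells)."""
    t = np.asarray(t, dtype=float)
    e = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    return 2.0 * np.sum(t * np.log(t / e))

def partitions(table):
    c = np.asarray(table, dtype=float)
    for i in range(1, c.shape[0]):
        for j in range(1, c.shape[1]):
            yield np.array([[c[:i, :j].sum(), c[:i, j].sum()],
                            [c[i, :j].sum(),  c[i, j]]])

blood = np.array([[2625, 679, 416],   # A
                  [570,  134,  84],   # B
                  [2892, 983, 383]])  # O

parts = [g2(sub) for sub in partitions(blood)]
print("partition G2 values:", [round(v, 2) for v in parts])
print("sum of partitions =", round(sum(parts), 2))
print("whole-table G2    =", round(g2(blood), 2))
```

The exact additivity of G2 over the partitions (and the merely approximate additivity of X2) can be checked by swapping `g2` for a Pearson version.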

Our example of partitioning an r × c table follows a set of more general rules:

  1. The df of the sub-tables must sum to the df of the original table.
  2. Each cell count in the original table must be a cell count in one and only one sub-table.
  3. Each marginal total of the original table must be a marginal total for one and only one sub-table.

Notice that under these more general rules, the sub-tables need not be 2 × 2 tables.

6.5. Ordered Categories

In the statistical tests we discussed above, the order within the rows or categories had no effect on our analysis. However, for some experiments, the rows, columns, or both sets of categories may follow a natural order (e.g., age, dose of a test substance, the three genotypes in an intercross). Taking that ordering into account can provide us with a more powerful statistical test of particular hypotheses.

6.5.1. Ordered rows or columns: Cochran-Armitage test.

Consider the problem of testing for a trend in a dichotomous response for a data set consisting of r ordered rows. Below, we follow the treatment provided by Agresti (2002). For row i, let π1|i be the true probability of response 1, and p1|i be the observed proportion. Also let xi denote some score assigned to the ith row. Then for a strictly linear trend, we would have

π1|i = α + βxi

and the least squares prediction equation

p̂1|i = p+1 + b(xi - x̄)

where

b = Σ ni+(p1|i - p+1)(xi - x̄) / Σ ni+(xi - x̄)2   and   x̄ = Σ ni+xi / n

In the above formulas, ni+ is the total for row i, x̄ is the weighted mean of the scores, and p+1 is the sum over column 1 divided by the total number of observations. This result is straightforward linear regression theory (see Chapter 8).

The Cochran-Armitage test (Cochran, 1954; Armitage, 1955) provides a way to partition the total Chi-square into two parts: the first, X2 (trend) = b2 Σ ni+(xi - x̄)2 / (p+1(1 - p+1)), has 1 df and tests for a linear trend in the row proportions; the second, X2 (model) = X2 - X2 (trend), has r-2 df (r is the number of rows) and tests the goodness of fit of the linear model itself.

Example 6.12 Trend in ordered rows.

The data below were presented by a graduate student interested in testing whether there is a trend over time in the use of birth control. We have a 10 × 2 contingency table; what is special is that the rows are ordered (ordinal, rather than nominal). Ignoring the ordering for a moment, we get a regular contingency table Chi-square test statistic of 85.07 with 9 df. Clearly, year and use of birth control are not independent. But is there any consistent trend?

Year Birth Control Use at Time of Screen
Do Not Use Use Total Women
1978 158 (91.9%) 14 (8.1%) 172
1981 108 (85%) 19 (15%) 127
1984 103 (85.8%) 17 (14.2%) 120
1987 71 (72.4%) 27 (27.6%) 98
1990 63 (76.9%) 19 (23.1%) 82
1995 65 (64.4%) 36 (35.6%) 101
1996 44 (62.9%) 26 (37.1%) 70
1997 29 (70.7%) 12 (29.3%) 41
1998 24 (68.6%) 11 (31.4%) 35
1999 12 (35.3%) 22 (64.7%) 34
Total 677 (76.9%) 203 (23.1%) 880

For our data, above, we chose as the xi the years themselves (78, 81, 84, …). The value of p+1 is 677/880, or 0.769. We find

b = -0.0162806 and

   X2 (trend) = 68.16 with 1 df

   X2 (model) = 16.91 with 8 df.

These values sum to the total X2  already calculated. There is certainly a trend, but the model doesn't really fit very well, likely because the trend is not really linear.
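The Cochran-Armitage computation for these data can be sketched as follows (plain numpy; the variable names are our own):

```python
# Cochran-Armitage partition of the total Chi-square for the birth-control
# data: a 1-df trend component plus an (r-2)-df lack-of-fit component.
import numpy as np

years  = np.array([1978, 1981, 1984, 1987, 1990, 1995, 1996, 1997, 1998, 1999])
no_use = np.array([158, 108, 103, 71, 63, 65, 44, 29, 24, 12])
total  = np.array([172, 127, 120, 98, 82, 101, 70, 41, 35, 34])

x = years - 1900                 # scores 78, 81, ... as in the text
n = total.sum()
p = no_use / total               # observed row proportions p_{1|i}
pbar = no_use.sum() / n          # p_{+1} = 677/880
xbar = (total * x).sum() / n     # weighted mean score

sxx = (total * (x - xbar) ** 2).sum()
b = (total * (p - pbar) * (x - xbar)).sum() / sxx   # least squares slope

x2_total = (total * (p - pbar) ** 2).sum() / (pbar * (1 - pbar))
x2_trend = b ** 2 * sxx / (pbar * (1 - pbar))
x2_model = x2_total - x2_trend

print(f"b = {b:.7f}")
print(f"X2(trend) = {x2_trend:.2f} (1 df), X2(model) = {x2_model:.2f} ({len(x) - 2} df)")
```

The slope, trend, and model components reproduce the values quoted in the example, and the two components sum to the overall X2 of 85.07.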

6.5.2. Ordered rows and columns: Jonckheere-Terpstra test.

As shown in the example below, you can apply the Jonckheere-Terpstra test (section 5.4.2) to a contingency table that has both rows and columns ordered.

Example 6.13 Ordered rows and columns

Consider the following experiment in which wild-type, heterozygous, and homozygous mutant mice are compared for the severity of motor defect:

Genotype Motor defect
None Mild Severe
AA1020
Aa582
aa378

This is a contingency table with both rows (number of mutant alleles) and columns (severity of defect) naturally ordered, and n=45.

Applying the Jonckheere-Terpstra test to these data (taking into account the ties, of course), we get U = 507.5 with E(U) = 333 and V(U) = 1985.8. So the normal test statistic is Z = 3.92, which corresponds to P = 0.0001 for a two-sided test.
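For tabulated data such as these, the Jonckheere-Terpstra statistic and its variance can be computed directly from the cell counts. The sketch below is our own plain-Python implementation, using the standard tie-corrected variance formula referred to in section 5.4.2:

```python
# Jonckheere-Terpstra test for the genotype data, computed from the table
# counts with the tie-corrected normal approximation.
import math

table = [[10, 2, 0],   # AA
         [5,  8, 2],   # Aa
         [3,  7, 8]]   # aa

r, c = len(table), len(table[0])
row_tot = [sum(row) for row in table]
col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
n = sum(row_tot)

# U: over all pairs of rows i < k, count pairs of observations in which the
# row-k observation falls in a higher column, counting ties as 1/2.
U = 0.0
for i in range(r):
    for k in range(i + 1, r):
        for j in range(c):
            for m in range(c):
                if m > j:
                    U += table[i][j] * table[k][m]
                elif m == j:
                    U += 0.5 * table[i][j] * table[k][m]

EU = sum(row_tot[i] * row_tot[k]
         for i in range(r) for k in range(i + 1, r)) / 2.0

# tie-corrected variance: t = row sizes, u = column tie-group sizes
t, u = row_tot, col_tot
A = (n*(n-1)*(2*n+5) - sum(x*(x-1)*(2*x+5) for x in t)
                     - sum(x*(x-1)*(2*x+5) for x in u))
B = sum(x*(x-1)*(x-2) for x in t) * sum(x*(x-1)*(x-2) for x in u)
C = sum(x*(x-1) for x in t) * sum(x*(x-1) for x in u)
VU = A/72.0 + B/(36.0*n*(n-1)*(n-2)) + C/(8.0*n*(n-1))

Z = (U - EU) / math.sqrt(VU)
P = math.erfc(Z / math.sqrt(2))   # two-sided tail area (Z > 0 here)
print(f"U = {U}, E(U) = {EU}, V(U) = {VU:.1f}, Z = {Z:.2f}, P = {P:.4f}")
```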

6.6. Higher Dimensional Tables

Although our examples so far have consisted of one or two dimensional contingency tables, the approximate methods, based on X2 and G2, may be applied readily to data in higher dimensions. In the discussion below, we analyze a three dimensional table and illustrate an inherent pitfall in the analysis of contingency tables.

6.6.1. The 2 × 2 × 2 table and Simpson's paradox.

Consider the following data on smoking and cancer reported by an imaginary tobacco company in the 1960s.


             Cancer        No Cancer
Non-Smokers  190 (0.118)     1419
Smokers      182 (0.109)     1489

It appears from the table that smokers have a slightly lower cancer rate than non-smokers. The result is not significant, but there is certainly no evidence in the data, as presented, of a risk from smoking.

But now, let us look at the original data which were actually in the form of a 2 × 2 × 2 table in which males and females are counted separately.

Women        Cancer       No Cancer
Non-Smokers  188 (0.12)     1318
Smokers      112 (0.18)      522

Men          Cancer       No Cancer
Non-Smokers    2 (0.02)      101
Smokers       70 (0.07)      967

When the data are subdivided into males and females, we find that in both sexes there is a higher rate of cancer among smokers (in females the difference is even statistically significant). At face value, it appears that smoking is bad for males and bad for females, but good for people in general! Obviously, collapsing the two tables into one gives a totally misleading impression. Fortunately, this sort of extreme reversal upon collapsing a higher dimensional table into a lower dimensional one does not happen often, but it is always a possibility. The sobering fact to remember is that all contingency tables are collapsed tables, collapsed over all those categories you did not think of taking into account or were unable to measure.
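The reversal is easy to verify directly from the counts. The short sketch below (plain Python) computes the within-sex and pooled rates:

```python
# Simpson's paradox in the smoking data: within each sex the smokers have
# the higher cancer rate, yet the pooled table shows the opposite.
women = {"non-smokers": (188, 1318), "smokers": (112, 522)}   # (cancer, no cancer)
men   = {"non-smokers": (2, 101),    "smokers": (70, 967)}

def rate(cancer, no_cancer):
    return cancer / (cancer + no_cancer)

for group in ("non-smokers", "smokers"):
    pooled = tuple(w + m for w, m in zip(women[group], men[group]))
    print(f"{group:12s} women {rate(*women[group]):.3f}  "
          f"men {rate(*men[group]):.3f}  pooled {rate(*pooled):.3f}")
```

The pooled column reverses the ordering seen in both of the sex-specific columns, which is the paradox in miniature.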

6.6.2. A Plethora of Hypotheses

Multidimensional tables, like the one above, lead to many more hypotheses than just the question of independence versus non-independence; various kinds of conditional independence emerge. Using the hypothetical smoking data as an example, we show a more detailed analysis below.


Smoking and Cancer Example
(V = smoking status, S = sex, C = cancer status)

Model             G2        df   P value   df formula         Expected values
1. (S)(V)(C)      1315.64   4    <10⁻³     IJK-I-J-K+2        ni..n.j.n..k/n²
2. (VC)(S)        1314.96   3    <10⁻³     (I-1)(JK-1)        ni..n.jk/n
3. (SC)(V)        1267.85   3    <10⁻³     (J-1)(IK-1)        n.j.ni.k/n
4. (SV)(C)          62.20   3    <10⁻³     (K-1)(IJ-1)        n..knij./n
5. (SC)(VC)       1267.16   2    <10⁻³     K(I-1)(J-1)        ni.kn.jk/n..k
6. (SC)(SV)         14.41   2    0.001     I(J-1)(K-1)        nij.ni.k/ni..
7. (VC)(SV)         61.52   2    <10⁻³     J(I-1)(K-1)        nij.n.jk/n.j.
8. (SC)(VC)(SV)      1.88   1    0.18      (I-1)(J-1)(K-1)    by iteration (see Agresti 2002, chapter 6)

(Here i indexes S, j indexes V, and k indexes C, with I = J = K = 2.)

The meanings of the eight hypotheses are as follows:

  1. Sex, smoking status, and cancer state are mutually independent.
  2. Sex is jointly independent of smoking and cancer.
  3. Smoking is jointly independent of sex and cancer.
  4. Cancer is jointly independent of sex and smoking.
  5. Sex and smoking are independent given cancer.
  6. Cancer and smoking are independent given sex.
  7. Cancer and sex are independent given smoking.
  8. No pair is even conditionally independent.

Only the last hypothesis fits the data, so collapsing the 2 × 2 × 2 table into a 2 × 2 table was definitely unwarranted.
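The closed-form expected values in the table can be evaluated numerically. The sketch below is our own numpy implementation (the array axis order and index assignments are ours); it computes G2 for models 1-7, while model 8 requires iterative fitting and is omitted:

```python
# G2 for the loglinear models of the smoking data, using the closed-form
# expected counts. Axis order of the array: S (sex), V (smoking), C (cancer).
import numpy as np

# counts n[s, v, c]: s = women/men; v = non-smoker/smoker; c = cancer/no cancer
n = np.array([[[188, 1318], [112, 522]],
              [[2, 101], [70, 967]]], dtype=float)
N = n.sum()

def G2(obs, exp):
    return 2.0 * np.sum(obs * np.log(obs / exp))   # no zero cells here

# marginal tables (the dot notation of the text: sum over the omitted axes)
nS = n.sum(axis=(1, 2)); nV = n.sum(axis=(0, 2)); nC = n.sum(axis=(0, 1))
nSV = n.sum(axis=2); nSC = n.sum(axis=1); nVC = n.sum(axis=0)

models = {
    "(S)(V)(C)": nS[:, None, None] * nV[None, :, None] * nC[None, None, :] / N**2,
    "(VC)(S)":   nS[:, None, None] * nVC[None, :, :] / N,
    "(SC)(V)":   nV[None, :, None] * nSC[:, None, :] / N,
    "(SV)(C)":   nC[None, None, :] * nSV[:, :, None] / N,
    "(SC)(VC)":  nSC[:, None, :] * nVC[None, :, :] / nC[None, None, :],
    "(SC)(SV)":  nSV[:, :, None] * nSC[:, None, :] / nS[:, None, None],
    "(VC)(SV)":  nSV[:, :, None] * nVC[None, :, :] / nV[None, :, None],
}
for name, exp in models.items():
    print(f"{name:12s} G2 = {G2(n, exp):8.2f}")
```

Each dictionary entry is the broadcast form of the corresponding closed-form expected value; checking the printed values against the table is a useful exercise.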

6.6.3. Loglinear Models

There is an extensive literature on the analysis of multidimensional tables using what are called "loglinear models." Essentially, these models treat statistical dependencies as interactions by writing the expected values in terms of logs. We will not deal with loglinear models in this book, but see Agresti (2002) for a detailed discussion of this very important method of analysis.

6.7. Sample Problems

  1. Two inbred mouse strains, C57BL and RFM, are compared for their risk for developing spontaneous lymphomas. Forty male animals of each strain are housed under identical conditions and allowed to live out their normal lifespan, which averaged 20 months for both strains. The incidences of lymphoma were 3/40 for C57BL mice and 8/40 for RFM mice. Do these two strains differ in their risk for lymphoma development?
  2. Reanalyze the data in Example 6.8 using the X2-test. Does the significance level differ from that obtained using Fisher's exact test? What might account for this difference?
  3. The X2 test of fit can be used to test composite hypotheses related to, for example, the form of the distribution. You are studying tumor development in the lung and for a particular group of animals, you observe the data below:
    Number of Tumors 0 1 2 3 4 5
    Number of Mice 13 7 3 1 1 1
    You want to test the hypothesis that the tumor multiplicity is distributed according to a Poisson distribution. To perform the test, estimate the Poisson parameter and use it to determine the expected values for the distribution (pool the last 3 categories so that the expected numbers are reasonable). The number of degrees of freedom should be reduced by 1 to account for the fact that you have estimated one of the parameters for the distribution.
  4. You are comparing four independent mutant alleles for their abilities to cause a developmental defect that is scored as absent, moderate, or severe. In the table below, is the expression of the defect independent of the allele?

    Absent Moderate Severe
    allele 1 10 4 4
    allele 2 5 8 2
    allele 3 6 2 1
    allele 4 3 7 8
  5. The following are real data from Radelet and Pierce (1991), quoted in Agresti (2007). The study was initiated to examine the effect of race on whether individuals convicted of homicide receive the death penalty. The data describe 674 subjects convicted of homicide in 20 Florida counties during 1976-1987.


                            Death Penalty
                            Yes      No
    White defendant          53     430
    Black defendant          15     176

    The 2 × 2 table above is actually a summarization of the complete data. The complete data were in the form of the 2 × 2 × 2 table shown below:

    Victim White            Death Penalty
                            Yes      No
    White defendant          53     414
    Black defendant          11      37

    Victim Black            Death Penalty
                            Yes      No
    White defendant           0      16
    Black defendant           4     139

    Examine these tables carefully. What is so strange about these data? How do you explain it? How would you analyze such data?
  6. Different types of 2 × 2 tables. Analyze each of the tables given below. Although a minimum of information is provided, think about how the data were collected. How many degrees of freedom are appropriate for each example?
a. Tea Tasting Lady

             guess
             SM   MS
   set  SM    4   11   15
        MS   11   14   25
             15   25   40

b. Traffic accidents

           Dry   Wet
   Day      10    20   30
   Night    14    32   46
            24    52   76

c. Pneumonia in calves

                2° infection
                 Y     N
   1° inf.  Y   30    63    93
            N    0    63    63
                30   126   156

d. Testing a Drug

           live   die
   Trt.     16     4   20
   Con.      8    12   20
            24    16   40

e. Population Sample

                 Hair
                red   black
   Eyes  blue    10     22   32
         brown   24      9   33
                 34     31   65

f. Nausea (N) & 2 drugs

                    Drug A
                   No N    N
   Drug B  No N     75    13    88
           N         3     9    12
                    78    22   100

g. Effect of drug on HBP

               After
                N    Y
   Before  N   28    6   34
           Y    1   32   33
               29   38   67