8. Correlation and Regression

8.1. Tests of Independence

For categorical data, Fisher's exact test (for 2 × 2 tables) or the Chi-Square test (for r × c contingency tables) provides a means to test for independence between the two methods of classifying a set of observations. We can generalize this problem by considering the case of making bivariate observations on n independent samples. That is, for each sample i (i=1…n), we measure Xi and Yi, where X and Y represent two different random variables. The question of interest is whether or not the X and Y random variables are associated. If the X and Y random variables could only take on values of 0 or 1, we could organize the data in the familiar 2 × 2 table and use Fisher's exact test for independence. We need to develop a different approach for the case that the Xs and Ys are each random variables distributed along some portion of the real line.

A familiar measure of the degree of association between the X and Y variates is the correlation coefficient, r, which was defined by Pearson as

r = Σ(Xi - X̄)(Yi - Ȳ) / [Σ(Xi - X̄)²·Σ(Yi - Ȳ)²]^(1/2)

The value of r ranges from -1.0 to +1.0, depending on whether large X values are associated with small or large Y values, respectively. When X and Y are uncorrelated, the value of r is 0. This measure of correlation is sensitive to "fat tails" in the distributions of the random variables. In addition, traditional tests of the hypothesis that the value of r differs significantly from 0 are based on the assumption that both X and Y are normally distributed. In order to avoid these limitations, we can define both measures of correlation and a test of independence based on permutations of the sample ranks. It is important to point out that such a test measures only the independence of the two variates and does not address issues of causality.

8.1.1. Spearman's rank correlation coefficient.

An equivalent to Pearson's product moment correlation coefficient, r, based on sample ranks rather than the values of the observations was proposed by Spearman (1904). If R(Xi) is the rank of Xi among the X values and R(Yi) the rank of Yi among the Y values (using mid-ranks for tied values), Spearman's rank correlation coefficient is given by

ρ = Σ[R(Xi) - R̄][R(Yi) - R̄] / {Σ[R(Xi) - R̄]²·Σ[R(Yi) - R̄]²}^(1/2)

where R̄ = (n + 1)/2 is the mean rank. In the case that there are no ties among the ranks, this equation simplifies to

ρ = 1 - 6ΣDi²/[n(n² - 1)],   where Di = R(Xi) - R(Yi)

The value of ρ may be used to test the hypothesis that X and Y are independent. The value

z = ρ(n - 1)^(1/2)

approximately follows a standard normal distribution. Depending on the sign of ρ, the value of z will fall within the lower (ρ < 0) or upper (ρ > 0) tail of the normal distribution. For a two-sided test of independence, simply multiply the appropriate tail probability by 2. One-sided tests for positive or negative correlation are provided by the upper- and lower-tail probabilities, respectively.
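
Where software is available, ρ and z are conveniently computed from the ranks. The following is a minimal Python sketch (assuming NumPy and SciPy are installed; the function name is ours); scipy.stats.rankdata assigns mid-ranks to tied values, as required above.

import numpy as np
from scipy.stats import norm, rankdata

def spearman_test(x, y):
    """Spearman's rho and the large-sample test of independence."""
    rx, ry = rankdata(x), rankdata(y)      # mid-ranks for tied values
    rho = np.corrcoef(rx, ry)[0, 1]        # Pearson's r computed on the ranks
    z = rho * np.sqrt(len(rx) - 1)         # approximately N(0,1) under H0
    return rho, z, 2 * norm.sf(abs(z))     # two-sided P; halve for a one-sided test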

Example 8.1

A series of recombinant inbred mouse strains was compared for their spontaneous incidence of liver cancer and their sensitivities to chemically induced tumors at that site. The table below shows each phenotype (with the rank in parentheses):

Strain   Induced    Spont. (%)
3        26 (11)    27 (8)
4        6.3 (5)    82 (11)
6        1.9 (2)    4 (3.5)
7        23 (10)    26 (6.5)
8        20 (7.5)   64 (10)
9        20 (7.5)   26 (6.5)
10       4.1 (4)    7 (5)
14       2.2 (3)    3 (1.5)
19       12 (6)     3 (1.5)
C        22 (9)     52 (9)
B        1.1 (1)    4 (3.5)

Computing ρ from the ranks, we obtain

ρ = 0.56

Thus, z = 1.76, and the one-sided P-value is 0.04.

[Data from Lee and Drinkwater, 1995.]

8.1.2. Kendall's test for independence.

A conceptually simpler approach, described by Kendall (1938), arises when we rephrase our hypothesis of independence as

H0: P[(Xi - Xj)(Yi - Yj) > 0] = 0.5

against, for example, the two-sided alternative

H2: P[(Xi - Xj)(Yi - Yj) > 0] ≠ 0.5

We can use this form of the null hypothesis to define a test statistic, K, based on the n(n-1)/2 pairwise comparisons of the X and Y values as follows. If we define sgn(b) as the sign of b (that is, a value of -1 if the sign is negative, +1 if the sign is positive, and 0 if b = 0), we can compute

K = Σ sgn(Xj - Xi)·sgn(Yj - Yi)

where the sum is over all n(n-1)/2 pairs with i < j.

The value of K ranges from -n(n-1)/2 to +n(n-1)/2 for perfect negative and positive correlations, respectively. Note that by defining the test statistic in this way, we would obtain the same value whether we used the actual values of the observations or their ranks from 1 to n within the X and Y values. For small values of n we can readily compute the distribution of the test statistic under the null hypothesis of independence. If we reorder the X values by increasing rank, we can compute the value of K for all of the n! permutations of the Y values by considering only the ranks of the Y variate. As an example, the permutations of the Y ranks and the distribution of K for n = 4 are shown below.

Permutation K Permutation K
1,2,3,4 6 3,1,2,4 2
1,2,4,3 4 3,1,4,2 0
1,3,2,4 4 3,2,1,4 0
1,3,4,2 2 3,2,4,1 -2
1,4,2,3 2 3,4,1,2 -2
1,4,3,2 0 3,4,2,1 -4
2,1,3,4 4 4,1,2,3 0
2,1,4,3 2 4,1,3,2 -2
2,3,1,4 2 4,2,1,3 -2
2,3,4,1 0 4,2,3,1 -4
2,4,1,3 0 4,3,1,2 -4
2,4,3,1 -2 4,3,2,1 -6

K p(K)
6 1/24
4 3/24
2 5/24
0 6/24
-2 5/24
-4 3/24
-6 1/24
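
For modest n, this enumeration is easy to script. The sketch below (plain Python; the function name is ours) reproduces the table above for n = 4 and can generate the exact null distribution for other small n.

from itertools import permutations
from collections import Counter

def exact_K_distribution(n):
    """Null distribution of K: enumerate all n! permutations of the Y ranks
    against X ranks fixed in increasing order."""
    counts = Counter()
    for perm in permutations(range(1, n + 1)):
        K = sum((perm[j] > perm[i]) - (perm[j] < perm[i])
                for i in range(n) for j in range(i + 1, n))
        counts[K] += 1
    return counts    # exact_K_distribution(4) -> {6: 1, 4: 3, 2: 5, 0: 6, -2: 5, -4: 3, -6: 1}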

The distribution of K for various values of n has been tabulated in Appendix 6 for a one-sided statistical test (simply double the P-values to obtain the two-sided significance level). From above, you can see that the expected value of K is 0 and that the distribution is symmetrical. Thus, for large values of n, we can define an approximate statistic, K*, that follows a standard normal distribution.

K* = K/[n(n-1)(2n+5)/18]^(1/2)

Using the statistic K, we can define a measure, τ, of the degree of association that is analogous to ρ defined above:

τ = 2K/[n(n-1)],   with -1 ≤ τ ≤ +1

This measure does not behave in quite the same way that r or ρ do. As a rule of thumb for evaluating (in your head only) the quality of a correlation, consider (τ)^(1/2) to be the rough equivalent to r or ρ. Hypothesis tests for independence based on K and ρ will generally give very similar results.
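
The statistics K, τ, and K* are likewise simple to compute directly from their definitions. Below is a minimal Python sketch (assuming NumPy and SciPy; the function name is ours); note that scipy.stats.kendalltau computes a tie-corrected version of τ and can serve as a cross-check.

import numpy as np
from scipy.stats import norm

def kendall_test(x, y):
    """Kendall's K over all n(n-1)/2 pairs, tau, and the large-sample K*."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    K = sum(np.sign(x[j] - x[i]) * np.sign(y[j] - y[i])
            for i in range(n) for j in range(i + 1, n))
    tau = 2 * K / (n * (n - 1))
    K_star = K / np.sqrt(n * (n - 1) * (2 * n + 5) / 18)   # approximately N(0,1)
    return K, tau, 2 * norm.sf(abs(K_star))                # two-sided P

For small n, the exact distribution (Appendix 6) should be used in place of the normal approximation.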

Example 8.2

Inbred mouse strains have characteristic patterns of tumor induction for various tissues following treatment with a chemical carcinogen. Young male animals of several inbred strains were treated with a single dose of carcinogen and the number of tumors induced in the liver and lungs of each animal was evaluated at 8 months of age. For each strain, the mean liver and lung tumor multiplicities were calculated and we want to test the hypothesis that the sensitivities for tumor induction at the two sites are independent. In the table below, the strains were ordered on the basis of the liver tumor multiplicity.

             Mean Tumor Multiplicity
Strain       Liver      Lung
SWR/J        0          16
A/J          0.2        23
C57BL/6J     0.3        1.5
C57BR/cdJ    7.1        1.0
P/J          15         1.8
SM/J         17         1.9
CBA/J        45         6.0

The ranks of the lung tumor multiplicities (ordered by increasing liver tumor multiplicity) are

Rank            6    7    2    1    3    4    5
K contribution  -4   -5   +2   +3   +2   +1

where each entry in the second row sums the sgn comparisons of that rank with all ranks to its right.

and the test statistic is K = -1 and P = 1.0. The value of τ is

τ = 2(-1)/[7(6)] = -2/42 = -0.048

[Data adapted from Kemp and Drinkwater, 1989.]

8.2. Regression

In the discussion above, both the X and Y observations were random variables and we were interested in the degree to which the two random variables were associated or "varied together." We might also be interested in whether X and Y displayed a particular functional relationship. Estimation and hypothesis testing related to such a functional relationship can be approached by the use of regression analysis.

8.2.1. Estimation by least squares.

We will discuss only a small (but highly useful) part of the general problem of regression analysis – estimation by the method of least squares of the parameters describing a linear functional relationship between a dependent random variable Yj and one or more independent variates xij. In the equation

Yj = β1x1j + β2x2j + … + βkxkj + εj

Yj represents our measured random variable (e.g., weight, body length, survival), the βi are the parameters we want to estimate, the xij are known coefficients (e.g., dose of drug, age, genotype) that may be fixed by the experimental design, and εj is an "error" random variable. (An intercept is included by setting x1j = 1 for all j.) Note that by linear regression we are referring only to the parameters: for example,

Y = β0 + β1x + ε,   Y = β0 + β1x + β2x² + ε,   and   Y = β0 + β1·log(x) + ε

are all linear models while

Y = β0·exp(β1x) + ε

is not.

Estimates of the parameters, βi, obtained by the method of least squares are unbiased, minimum variance estimators whether or not the error terms, εj, are normally distributed. Unfortunately, hypothesis testing (e.g., that two regression lines have the same slope, that the value of the parameter β2 is 0, etc.) does depend on the form of the distribution of the random variable. Thus, we will focus on the estimation problem using the method of least squares. Nonparametric methods for hypothesis testing are discussed in the next section.

The general linear regression equation above can be written in matrix notation as

Y = Xβ + ε

where Y is the (n×1) vector of observations, X is an (n×k) matrix of known coefficients, β is the (k×1) vector of regression coefficients, and ε is an (n×1) vector of "error" random variables with mean E(ε) = 0 and dispersion matrix V(ε) = σ²I, where I is the (n×n) identity matrix. We assume that n ≥ k and that |X'X| ≠ 0.

The least squares estimates of the parameters are obtained from

b = (X'X)⁻¹X'Y

and an unbiased estimate of σ² is

s² = (Y - Xb)'(Y - Xb)/(n - k)

Using this estimate of σ², the variance for a particular parameter is

V(bi) = s²·cii,   where cii is the ith diagonal element of (X'X)⁻¹

You can use the above results to predict the value of Y that would be obtained for a particular set of xi conditions, represented by the (k×1) vector a:

Ŷ = a'b,   with variance V(Ŷ) = s²·a'(X'X)⁻¹a
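
These matrix results translate directly into a few lines of code. The following NumPy sketch (function names are ours) computes b, s², the parameter variances, and a prediction with its variance:

import numpy as np

def ls_fit(X, Y):
    """Least squares: b = (X'X)^(-1) X'Y, unbiased s2, and V(b_i)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ Y)
    resid = Y - X @ b
    s2 = (resid @ resid) / (n - k)           # unbiased estimate of sigma^2
    return b, s2, s2 * np.diag(XtX_inv), XtX_inv

def ls_predict(a, b, s2, XtX_inv):
    """Predicted Y and its variance for the condition vector a."""
    a = np.asarray(a, float)
    return a @ b, s2 * (a @ XtX_inv @ a)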

Example 8.3

Cultured fibroblasts are plated at low density, treated with varying doses of UV light, and allowed to grow for 2 weeks. The culture dishes are then stained and the number of colonies per dish determined. The data below are the surviving fractions (normalized to the cloning efficiency of untreated cells) as a function of dose. We want to estimate the parameters for the model

  log(Survival) = β0 + β1× dose

Dose (J/m²)   Survival            log(Survival)
2.5           0.76, 0.85, 0.97    -0.12, -0.071, -0.013
5.0           0.67, 0.66, 0.64    -0.17, -0.18, -0.19
7.5           0.35, 0.36, 0.34    -0.46, -0.44, -0.47
10            0.19, 0.20, 0.17    -0.72, -0.70, -0.77

For a simple straight line the parameters can be computed readily by hand using the formulas

b1 = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²]

b0 = ȳ - b1x̄

where the summations are over all of the observations and (in this case) y refers to the log(Survival) values and x to the UV doses.

Using the above, we find that the parameters are b0 = 0.207 and b1 = -0.0905.

The variances for the parameters are

V(b0) = s²·Σx²/[nΣx² - (Σx)²] ≈ 1.6 × 10⁻³   and   V(b1) = s²·n/[nΣx² - (Σx)²] ≈ 3.5 × 10⁻⁵

where s² = Σ(y - b0 - b1x)²/(n - 2) ≈ 3.3 × 10⁻³.
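
As a check, the calculation for Example 8.3 is easily reproduced with NumPy (a sketch; the values in the comments are rounded):

import numpy as np

# Example 8.3: 3 replicate dishes at each of 4 UV doses.
dose = np.repeat([2.5, 5.0, 7.5, 10.0], 3)
logS = np.array([-0.12, -0.071, -0.013, -0.17, -0.18, -0.19,
                 -0.46, -0.44, -0.47, -0.72, -0.70, -0.77])
X = np.column_stack([np.ones_like(dose), dose])   # columns: intercept, dose
b, *_ = np.linalg.lstsq(X, logS, rcond=None)      # b ~ (0.207, -0.0905)
resid = logS - X @ b
s2 = resid @ resid / (len(logS) - 2)              # ~ 3.3e-3
var_b = s2 * np.diag(np.linalg.inv(X.T @ X))      # ~ (1.6e-3, 3.5e-5)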

8.2.2. Nonparametric regression analysis.

For the simple linear regression problem described above, we could use Kendall's test to evaluate the correlation between UV dose and log(Survival), which would be equivalent to testing the null hypothesis that β1 is equal to 0 against either a one- or two-sided alternative. Applying Kendall's test to the data in the example, we obtain a test statistic of K = -54 and a two-sided P-value of approximately 0.0002.

Theil (see Hollander et al., 2013) has extended this approach to testing the hypothesis that the slope of a straight line is equal to any specified value, m. The data consist of n observations, Yi, each associated with an explanatory variable (which may be random or non-random), xi. We want to test the null hypothesis that the slope β1 is equal to a fixed value, m, against the two-sided alternative that it is not (or either of the one-sided alternatives). To compute Theil's statistic, first determine the differences between the observed values and those predicted by the slope under the null hypothesis

Di = Yi - mxi      (i = 1 … n)

The test statistic C is obtained from

C = Σ sgn(xj - xi)·sgn(Dj - Di)

where the sum is over all n(n-1)/2 pairs with i < j and, as above, sgn(a) is equal to -1 if a<0, 1 if a>0, and 0 if a=0. That is, C is simply Kendall's statistic computed between the xi and the Di; its distribution is identical to that for Kendall's K statistic and the same table or large sample approximation should be used. A simple distribution-free estimator for the slope can be obtained by computing the n(n-1)/2 individual slope estimators

Sij = (Yj - Yi)/(xj - xi)   (xi ≠ xj)

and determining the median value. In the case that not all of the xi are distinct, determine the median from among those values for which Sij is defined. For Example 8.3, there are 54 slope estimators (excluding those with xj - xi = 0) and the median value is -0.095.
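
Both C and the median-slope estimator can be computed directly from these definitions; the Python sketch below (the function name is ours) does so, and scipy.stats.theilslopes provides a comparable slope estimator together with a confidence interval.

import numpy as np

def theil(x, y, m=0.0):
    """Theil's C for H0: slope = m, and the median of the pairwise slopes S_ij."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    D = y - m * x
    n = len(x)
    C, slopes = 0, []
    for i in range(n):
        for j in range(i + 1, n):
            C += np.sign(x[j] - x[i]) * np.sign(D[j] - D[i])
            if x[j] != x[i]:                    # S_ij is defined only for distinct x
                slopes.append((y[j] - y[i]) / (x[j] - x[i]))
    return C, np.median(slopes)

Applied to the data of Example 8.3 with m = 0, this returns C = -54 and a median slope of -0.095, matching the values given above.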

8.2.3. Comparing the slopes of two or more regression lines.

As noted above, testing composite hypotheses regarding the parameters for linear regression models by the method of least squares requires that the responses are normally distributed. For the simplest regression model, that of a straight line, Sen (1969) and Adichie (1975) have proposed a nonparametric test of the hypothesis that two or more lines share a common slope, independent of the intercepts for the lines (i.e., the lines are parallel).

Consider an experiment in which we want to compare the dose-responses for mutagenesis by ionizing radiation of k cell lines that are derived from patients with various alleles for loci that we suspect may influence their sensitivity to radiation-induced mutations. For each cell line, we measure the mutation frequency, Yij, at various doses, xij, of ionizing radiation (i=1, …, k; j=1, …, ni). The responses follow the regression model

Yij = β0i + β1i·xij + εij

We want to test the null hypothesis

H0:   β11 = β12 = … = β1k

against the general alternative that at least one of the slopes differs from the others.

To compute the Sen-Adichie statistic, first compute the pooled least squares estimator, b, of the common slope under the null hypothesis:

b = ΣiΣj (xij - x̄i)Yij / ΣiΣj (xij - x̄i)²

where

x̄i = (1/ni)·Σj xij

For each of the k regression lines compute the residuals

Zij = Yij - b·xij

and determine the ranks rij of the Zij within the ith sample. Under the null hypothesis, these ranks will be independent of xij. Next compute for each sample the sum

Ti* = Σj (xij - x̄i)·[rij/(ni + 1) - 1/2]

The Sen-Adichie statistic, V, is computed from

V = Σi 12(Ti*)²/Ci²,   where Ci² = Σj (xij - x̄i)²

which approximately follows a χ² distribution with k - 1 degrees of freedom under the null hypothesis.

The Sen-Adichie test is appropriate only when the responses are reasonably well-modeled by a straight line, which may require transformation of the original data. In the example below, transformed data are analyzed to provide a convenient solution to the double-ratio problem, where the desired test is for the relative responses to treatment of two samples.

Example 8.4

We have observed that a novel gene we are studying is induced following treatment of cultured cells with ionizing radiation (IR) and wish to test the hypothesis that this induction is dependent on the Atm gene. We prepare replicate cultures for Atm mutant and wild type cell lines, and, for each cell line, isolate mRNA from three untreated and three irradiated cultures, and measure the level of our transcript by Northern analysis. We want to test the hypothesis that the relative responses (i.e., the fold-induction) of the transcript differ for the wild type and Atm cell lines. Our data (10⁻³ × phosphorimager counts) are

Genotype   IR   Transcript      ln Transcript
wt         -    32, 34, 7       3.466, 3.526, 1.946
           +    128, 95, 111    4.852, 4.555, 4.710
Atm        -    48, 22, 41      3.871, 3.091, 3.714
           +    51, 47, 58      3.932, 3.850, 4.080

Using the methods described in section 3.4, we observe that IR resulted in a 4.6±2.4 fold induction in wild type cells and 1.4±0.4 fold in Atm cells. Do these cell lines differ in the fold-induction?

Since we have used only a single dose of ionizing radiation, we can conveniently set xij to 0 for untreated cells and 1 for treated cells. If the fold inductions for the two cell lines differ, then the slopes of the log responses will differ. For both cell lines,

x̄i = 1/2

and

Ci² = Σj (xij - x̄i)² = 6·(1/2)² = 3/2

Using the log transcript levels and the formula given above, we obtain a common slope estimate of 1.057. The residuals for the wild type cells are (3.466, 3.526, 1.946) and (3.795, 3.497, 3.652), yielding ranks of (2, 4, 1) and (6, 3, 5), for untreated and treated cells, respectively. Using the formula for T*, we obtain a value of 0.5 for wild type cells. Using the same methods for Atm cells, the value of T* is -0.643. Thus, the value of our test statistic, V, is 5.306. Using the χ2 distribution with 1 df, we obtain a P-value of 0.02. Thus, we would conclude that the fold-induction for the transcript differs between the wild type and Atm cell lines.
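
The full calculation for Example 8.4 can be scripted as follows; this is a minimal Python sketch of the Sen-Adichie procedure as described above (the function name is ours).

import numpy as np
from scipy.stats import chi2, rankdata

def sen_adichie(samples):
    """V for H0: common slope, where samples = [(x_1, Y_1), ..., (x_k, Y_k)]."""
    # Pooled least squares slope using within-sample centered x values.
    data = [(np.asarray(x, float), np.asarray(y, float)) for x, y in samples]
    num = sum((x - x.mean()) @ y for x, y in data)
    den = sum((x - x.mean()) @ (x - x.mean()) for x, y in data)
    b = num / den
    V = 0.0
    for x, y in data:
        xc = x - x.mean()
        r = rankdata(y - b * x)                  # ranks of residuals within sample
        T = xc @ (r / (len(x) + 1) - 0.5)        # T* for this sample
        V += 12 * T**2 / (xc @ xc)
    return V, chi2.sf(V, df=len(data) - 1)

# Example 8.4: x = 0 (untreated) or 1 (irradiated); Y = ln transcript level.
wt  = ([0, 0, 0, 1, 1, 1], [3.466, 3.526, 1.946, 4.852, 4.555, 4.710])
atm = ([0, 0, 0, 1, 1, 1], [3.871, 3.091, 3.714, 3.932, 3.850, 4.080])
V, P = sen_adichie([wt, atm])                    # V ~ 5.3, P ~ 0.02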

8.3. Sample Problems

  1. You have also measured the liver tumor multiplicities induced in female animals of the strains given in the table in Example 8.2. In the same order as the strains in the table, the mean female liver tumor multiplicities were 0, 0.05, 0.2, 5.3, 0.6, 0.1, and 0.7. Is there a significant correlation between the sexes in induced liver tumor multiplicity?
  2. Epstein Barr virus (EBV) infects human B-lymphocytes and the viral genome is present as a multicopy plasmid in latently infected cells. Such cells will rarely enter the lytic cycle, producing infectious virus. The data below have been borrowed (and modified a bit) from Sugden and Metzenberg (1983), who wanted to ask whether the copy number of viral DNA and frequency of lytic growth (as measured by the production of lytic cycle antigens) were independent in a series of infected cell lines.
    Cell Line   Viral DNA (copies/cell)   % Antigen positive
    LCL-721     5                         <0.02
    11/17-3     10                        0.4
    11/17-5     20                        0.2
    3/15-9      110                       1.25
    11/17-4     150                       1.3
    3/15-31     700                       3.8
    a.  Are the average viral copy number and frequency of lytic growth independent?
    b.  Briefly state two alternative models that could account for the observed correlation.