7. Survival Analysis

Approaches discussed in the preceding two chapters generally focused on data generated at a specific time point. An alternative design would be to observe continuously to generate data that consist of the times at which a particular class of events occurred. For example, we may be interested in comparing the times to death or relapse for patients receiving two different treatment regimens, or the latencies of a particular disease in groups of animals of various genotypes. Because the events of interest are often death or failure, this class of problems is generally referred to as survival analysis. However, the methods discussed in this chapter are applicable to any type of time-to-event data, such as time to flowering for two groups of plants raised under different conditions.

We could, of course, analyze our data using the categorical or rank sum methods discussed previously. For example, we could choose a fixed time point and ask whether the proportions of patients surviving to that time differed for our two treatments using Fisher's exact test or a Chi-square test. However, our result will depend on the time point we choose. Consider the case where the long-term survival of patients on the two therapies is similar, but most of the deaths of patients receiving treatment A occur soon after the start of therapy, while those on therapy B occur many months later. We would generally view the latter treatment to be superior to the former, in spite of the similar end result. Alternatively, we could treat the event times as the random variable of interest and use the Wilcoxon rank sum test to compare two groups. This approach would avoid the loss of information inherent in comparing proportions at a particular time, but still may not take full advantage of the available data. A feature of our observational design is that some patients or experimental animals may be lost before the end of the study. For example, a patient may elect to withdraw from a study or an experimental animal suffer an accidental death. Such observations are censored but still contain relevant information. If a subject is censored at time t, we do not know its true event time, but do know that it must be longer than t.

Because of its importance to fields ranging from industrial quality assurance to clinical research, survival analysis has been an active area of statistical research, and there are numerous books on the subject ( e.g., Cox and Oakes, 1984; Lee and Wang, 2003). In this chapter we will focus on just two topics, nonparametric estimation of survival distributions and tests for differences in survival between two treatment groups.

7.1. Estimating Survival Functions

Our experiment consists of n samples, each having an associated T_j, time to event, and C_j, time to censorship. The T_j and C_j are assumed to be mutually independent, i.e., censoring time provides no information on the true survival time. For each sample, we observe

The cumulative distribution of event times is F(x) = P(T≤x). We want to estimate the survival function, S(x) = 1 - F(x).

Within our experiment, we observe k distinct failure times, which we order t₍₁₎ < t₍₂₎ < … <t_(k). At any given time, t_(i), we observe d_i failures among n_i subjects at risk. A natural estimator for the probability of failing at time t_(i) is the ratio d_i/n_i. Thus, the conditional probability of surviving past t_(i), given that the subject has survived to that time, is (1 - d_i / n_i). Multiplying the successive conditional probabilities of survival gives us the unconditional probability of survival past time t_(i). This estimator for survival was originally described by Kaplan and Meier (1958) as the product limit estimator, and is now generally called the Kaplan-Meier estimator. Thus, the survival is estimated as

As above, d_i is the number of failures (events) at time i and n_i the number at risk at that time. By convention, if both failures and censored observations occur at t_(i), the failures are assumed to occur before censoring; i.e., the number of censored observations is included in n_i. Plotting as a function of time gives a stepped curve, with decreases at the distinct failure times. The curve does not change at times of censorship, but the censoring times are sometimes indicated on the survival curve by a tic mark or other symbol.

Note that the assumption specified above that censoring occurs independently of failure events is critical. If imminent failure makes it more likely that a sample will be censored, we will over-estimate survival. We can avoid this problem by properly defining the event of interest. As an obvious example, consider an experiment in which we follow the survival of animals infected with a pathogen. Deaths of individual animals would be noted as the group was inspected each morning. However, we should also humanely sacrifice those animals that are found to be moribund. These animals should not be considered censored observations; we sacrificed them because death would likely occur in a matter of hours or days. Defining our event as death or extensive morbidity would provide a better estimate of the true survival curve.

Example 7.1

You have constructed a transgenic mouse line that expresses a gene in the mammary gland that you believe will increase cancer risk. You follow a cohort of mice for up to 85 weeks, and note the time that a mammary tumor is first detected by palpation, evaluating all of the animals weekly. Among the 14 animals in the study, you observe tumors at 33, 41, 41, 55, 57, 66, 72, 73, and 84 weeks. Four animals are still tumor-free at 85 weeks and one was lost to a handling accident at 42 weeks. We can estimate the survival curve as in the table below:

Time	d_i	n_i	(1-d_i/n_i)
33	1	14	0.929	0.929
41	2	13	0.846	0.786
55	1	10	0.900	0.707
57	1	9	0.889	0.629
66	1	8	0.875	0.550
72	1	7	0.857	0.471
73	1	6	0.833	0.393
84	1	5	0.800	0.314

A plot of the survival curve is shown below.

[Data adapted from Tessier et al. (2004)]

7.1.1. Variance and confidence limits for survival.

To obtain the variance of the Kaplan-Meier estimator, we can consider the hazard function

which may be thought of as the probability that an event will occur in the interval t + Δt, given that an individual has survived to time t. The cumulative hazard function is

The variance of Λ(t) can be obtained as the sum of independent intervals:

Given the relationship between the survival and cumulative hazard functions, we can estimate the variance of S(t) as

(because V(ln f)=V(f)/f²).

We can use a normal approximation to obtain (1-α) confidence limits for survival at a given time. In order to insure that our confidence limits lie between 0 and 1, we can use a log-log transformation of the survival curve (i.e., the log hazard) and estimate the confidence limits in terms of the cumulative hazard:

Confidence limits (1-α) for the transformed survival are

where z_α/2 is from the standard normal distribution (1.96 for the 95% confidence limits). Transforming back to the same scale as survival, we obtain confidence limits of

More complex approaches to obtaining the confidence limits for the survival curve as a whole (rather than at arbitrary time points) are also available (see Hollander et al., 2013).

Example 7.1 (continued)

Using the approximations given above, standard deviations and 95% confidence limits for the survival curve obtained in our tumor study are given by the table below.

Time		s.d.	95% CI
33	0.93	0.07	(0.59, 0.99)
41	0.79	0.11	(0.47, 0.92)
55	0.71	0.12	(0.39, 0.88)
57	0.63	0.13	(0.32, 0.83)
66	0.55	0.14	(0.26, 0.77)
72	0.47	0.14	(0.20, 0.70)
73	0.39	0.14	(0.15, 0.64)
84	0.31	0.13	(0.10, 0.56)

7.1.2. Median survival.

The most commonly used summary statistic for survival data is the median survival and its associated confidence limits. In the absence of censoring, or if the few censored observations occur at the end point of the experiment, we can obtain the median survival in the usual way. The n survival times are ordered (with the terminally censored observations placed at the end of the list). If the number of observations is odd, the median is t_(n+1)/2. For an even number of samples, the median is 0.5(t_n_/2+t_1+n/2).

If our experiment includes censored observations, we can estimate the median survival from the Kaplan-Meier survival curve. The median is the time of the first event at which the Kaplan-Meier estimator is below 0.5. This approach is equivalent to drawing a horizontal line at a survival of 50% on the graph of the survival curve and determining the time at which the line intersects the curve. For our Example 7.1 above, the median survival (tumor latency) is 72 weeks.

The bootstrap method provides the most robust approach to obtaining confidence limits for the median survival. Example 3.4 estimated median survival for two data sets in which no censoring had occurred. Applying the bootstrap method to randomly censored data is equally straightforward (see Efron, 1981). As discussed above, each sample in our experiment consists of two values (x_i, δ_i), where the first in the pair is the time of the event or censoring and the latter is 1 in the case of failure or 0 if the sample is censored. Each bootstrap sample is constructed by drawing n times with replacement from the set of (x_i, δ_i) pairs. The median survival is then estimated from the Kaplan-Meier estimators obtained using the bootstrap samples.

7.2. Comparing Survival Curves

Consider an experiment in which we obtain survival or event times for two groups of subjects. For example, we have developed a new therapy and want to test the hypothesis that patients receiving this treatment regimen survive longer than those treated with a standard therapy. Another example: We are studying a mutation that we suspect may delay sexual maturation in male mice. We house individual mutant or wild-type males with a pair of females and for each animal record the age of the male at the time the first litter is delivered (the event time being age at fatherhood).

The most commonly used statistical test for comparing survival curves is the logrank test, also referred to as the Mantel-Cox test (Peto and Peto, 1972). There are several different approaches to deriving this test statistic. The formulation below is simplest to compute and can easily be extended to testing the null hypothesis that three or more survival distributions are identical.

Consider two groups of observations (x_ij, δ_ij), defined as above except that the subscript i indicates the experimental group and j the sample within the group. We assume that the distributions of failure times and censoring times are mutually independent. Order the event times jointly for the two groups, yielding K distinct event times. For each of these distinct times, determine d_ik, the number of events in group i at time k and n_ik, the number of subjects at risk. For each time, k, we can construct a 2×2 table

Time k	Event	Not Event	Total
Group 1	d_1k	n_1k-d_1k	n_1k
Group 2	d_2k	n_2k-d_2k	n_2k
Total	d_k	N_k-d_k	N_k

We can compute the number of expected events at time k for each group as E_ik=(n_ik×d_k)/N_k. For each group, the total number of observed events is

while the total number of expected events is

We compute the logrank test statistic as

which follows a χ² distribution with 1 degree of freedom. This form of the test makes it easy to compute the hazard ratio, or relative risk, for the two treatments as

This test can readily be extended to more than two groups, the number of degrees of freedom being one less than the number of groups. As formulated above, this test is two-sided; i.e., the null hypothesis that the two survival distributions are identical is tested against the general alternative that they are different.

A different approach to computing the test statistic for two groups allows for a one-sided test. Compute O₁ and E₁, the observed and expected number of events for group 1, as above. The one-sided test statistic, M, which follows a standard normal distribution, is

where the variance, V₁, is

Example 7.2

Extending our study of mammary tumor development in transgenic mice, we obtain a second line that expresses the transgene at higher levels than our first line. We set up cohorts of both lines and examine each animal weekly for the presence of a palpable tumor. The ages (in weeks) at tumor development in each group are given below (a '+' appended to a value indicates a censoring time):

Low expression: 30 34 40 45 56 56 56 56 63 68 78 76 80 81 82 85+ 85+ 85+ 85+ 85+ 85+ 85+ 85+ 85+ 85+

High expression: 24 37 37 39 40 40 40 45 49 50 60 63 66 68 69 70 72 84 85+

The survival curves for the two groups are illustrated below:

There are 23 distinct event times among the two groups. The appropriate 2×2 tables and expected number of events, E_ik, for the first few times are:

Time=24	Tum.	No Tum.	Total	E_ik
Low	0	25	25	0.568
High	1	18	19	0.432
Total	1	43	44

Time=30	Tum.	No Tum.	Total	E_ik
Low	1	24	25	0.581
High	0	18	18	0.419
Total	1	42	43

Time=34	Tum.	No Tum.	Total	E_ik
Low	1	23	24	0.571
High	0	18	18	0.429
Total	1	41	42

Time=37	Tum.	No Tum.	Total	E_ik
Low	0	23	23	1.122
High	2	16	18	0.878
Total	2	39	41

Time=39	Tum.	No Tum.	Total	E_ik
Low	0	23	23	0.590
High	1	15	16	0.410
Total	1	38	39

The number of observed events in the low and high expression lines is 15 and 18, respectively. Summing up the expected number of events for the two groups, we get 22.978 and 10.022. Thus, the value of our test statistic is

From the χ² distribution (1 df), we find that P<0.003. The hazard ratio for mammary tumor development in the high expression line relative to the low expression line is

[Data adapted from Tessier et al. (2004)]

7.3. Sample Problems

Plot the survival curves for the two treatment groups in Example 3.4. Is there a significant difference in survival between the two treatments?

You have developed a novel corn line that has been selected for earlier maturation. The new line is planted in a field together with its progenitor and the number of days to maturation is recorded for each plant. The experiment is terminated at 75 days. The results are (appended + indicates censored sample):

Progenitor	55, 58, 58, 61+, 62, 62, 64, 64+, 66, 69, 70, 74, 75+, 75+
Selected Line	48, 49, 49, 50, 52+, 58, 59, 59, 62, 64, 65, 65, 65, 66+, 68, 69, 75+

Does the new line mature more rapidly than the starting line?