2. Random Variables & Probability Distributions
In the previous chapter, we described probability as a numerical measure of chance and focused on methods for combining the probabilities of simple events. Recall that events are represented as points in a sample space consisting of all possible outcomes and that each event in the sample space has an associated, measurable probability.
Probability distributions provide a kind of shorthand for describing the sample space for an experiment and its associated probabilities. In many cases, the probability distribution for an experiment can be summarized as a mathematical function. We can then use this function to calculate the probability of any particular event (or class of events) in a straightforward manner. In this chapter, we'll discuss the properties of a number of important probability distributions and their use in analyzing biological experiments.
2.1. Some Definitions
2.1.1. Distributions.
A probability distribution is a list of the events that may occur in an experiment together with their frequency or probability of occurrence. We must recognize two kinds of distributions.
First, the empirical or sample distribution of a specific experiment: The frequencies of the different events as they actually occurred in the experiment.
Second, the population or theoretical distribution: The expected frequencies with which the events would occur were we to perform an infinite number of experiments or to sample the entire population.
Example 2.1
You have sequenced a stretch of DNA isolated from a particular organism and find that it contains a sizable open reading frame. For each of the 61 possible non-termination codons you count the number of times that codon occurs within the suspected orf to give the sample distribution for codon usage within this region. You may want to compare this sample distribution to the population distribution for codon usage obtained from the sequences for (ideally) all protein encoding regions or (realistically) all of the sequenced genes for the organism.
Most of the hypotheses we will want to test involve comparing two sample distributions or comparing a sample distribution to a (theoretical) population distribution.
2.1.2. Random variable.
When an event can be described numerically, for example, by counting or as a physical measurement, that event is represented by a random variable. We will denote the probability distribution of a random variable, x, as f(x). A random variable that can take on values from a finite or countably infinite (e.g., the non-negative integers) set is said to be discrete and its associated probability distribution is a discrete distribution. When the allowed values for the random variable are represented by an interval of real numbers, the random variable is continuous.
2.2. Properties of Probability Distributions
A discrete random variable, x, occurs with probability f(x) and is an element of the finite or countably infinite set of all possible events X. The list of ordered pairs (x,f(x)) for all x ∈ X defines the probability distribution for x. In many cases, this probability distribution may be stated as a simple mathematical formula in terms of particular parameters that specify the properties of the distribution.
A continuous random variable is defined for some interval on the line of real numbers, but the probability that a continuous random variable, x, takes on any specified real value in the interval is equal to zero. However, we can define the probability that an observation lies in a subinterval. In particular, the cumulative distribution function, F(x), is simply the probability that an observation has a value less than or equal to x. The probability density function, f(x), is the derivative of the cumulative distribution function
f(x) = dF(x)/dx

or, equivalently,

F(x) = ∫_(-∞)^x f(t) dt
2.2.1. Convergence.
The total probability in the sample space X is equal to 1, i.e.,

Σ_(x∈X) f(x) = 1

for a discrete distribution, or

∫_(-∞)^(+∞) f(x) dx = 1

for a continuous distribution.
2.2.2. Measures of location.
Several measures are used to define the location of a sample or population distribution. The most commonly used measure is the expected or mean value. This measure is defined as the sum (or integral) over all possible values of the product of the random variable and its frequency:

E(x) = Σ_x x f(x) = μ

for a discrete distribution, with the sum replaced by ∫ x f(x) dx in the continuous case.
Another measure, the median, x_m, is defined as the value that divides the total frequency into equal halves, i.e.,

F(x_m) = ∫_(-∞)^(x_m) f(x) dx = 1/2

or a similar equality involving sums for a discrete random variable. This definition is imprecise for a discrete distribution. If there are 2N+1 members of the sample (ordered by increasing value), the median is the value of the x_(N+1) member. If there are 2N members, it is the value (x_N + x_(N+1))/2.
Similarly, quantiles (n-tiles) are defined as the values that divide the distribution into n portions containing equal (1/n) frequencies. For example, the second decile is the value for which 20% of the population is smaller in magnitude and the fifth decile is identical to the median.
2.2.3. Measures of dispersion.
One way to describe the dispersion, or degree of spread, in a sample distribution is to note its range, the difference between the largest and smallest observed values. However, this measure has the disadvantages of depending on only two of the observations and of tending to become larger as more observations are made. A better approach is to use the difference between two quantiles spaced on either side of the median. For example, the difference between the third and first quartiles is a range of values that contains the central half of the observations.
The most common measure of dispersion for a population is the variance, which is the average squared deviation from the mean (μ). For a discrete distribution,

V(x) = σ^2 = Σ_x (x - μ)^2 f(x) = Σ_x x^2 f(x) - μ^2

the second equation providing a convenient formula for computation. For a continuous distribution, the corresponding equations are

V(x) = σ^2 = ∫ (x - μ)^2 f(x) dx = ∫ x^2 f(x) dx - μ^2
Generally, when you present your data, you would report the standard deviation (the square root of the variance) to place it in the same units as your measurements.
The above should be modified slightly in the case of a sample distribution (where you have only a subset of n values drawn from a real or hypothetical population). The variance of the sample computed with the above formula gives a biased estimate (see Chapter 3) of the population variance, slightly underestimating it. An unbiased estimate for the population variance can be obtained by multiplying the result of the above formula by n/(n-1). [It may also be noted that, while the sample variance computed in this way is unbiased, taking its square root (to obtain the standard deviation) re-introduces a bias. The magnitude of this bias is small (less than 1% for n>30) and routinely ignored. See Sokal and Rohlf (2012) for further details.]
Example 2.2
You have treated young mice from a sensitive inbred strain with a carcinogen. At 8 months of age, the animals were found to contain the following numbers of induced lung tumors:
47 29 23 17 25 12 13 14 16 7 12 18 29 19 30 21 19 20 17 0 24 3 9 0 10 10 18 33 26 31
Descriptive statistics for this sample are given below.
| Statistic | Value |
| --- | --- |
| Number of observations | 30 |
| Median | 18 |
| First quartile | 12 |
| Third quartile | 24.5 |
| Mean | 18.4 |
| Standard deviation | 10.4 |
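As a quick check, the values in this table can be reproduced with a few lines of code. The sketch below assumes Python with NumPy (not part of this text); note that different quantile conventions yield slightly different quartile values.

```python
import numpy as np

# Tumor counts from Example 2.2
tumors = np.array([47, 29, 23, 17, 25, 12, 13, 14, 16, 7,
                   12, 18, 29, 19, 30, 21, 19, 20, 17, 0,
                   24, 3, 9, 0, 10, 10, 18, 33, 26, 31])

print("n      =", tumors.size)                      # 30
print("median =", np.median(tumors))                # 18.0
print("Q1, Q3 =", np.percentile(tumors, [25, 75]))  # ~12 and ~24.5-24.75, convention-dependent
print("mean   =", tumors.mean())                    # 18.4
print("sd     =", tumors.std(ddof=1))               # ~10.4, from the unbiased (n-1) variance
```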
2.3. Common Distributions in Biology
A few simple distributions provide reasonable models for a variety of biological experiments.
2.3.1. Binomial distribution.
Consider an experiment that involves some fixed number of independent trials. Each trial has two possible outcomes, conveniently termed success and failure, and the probability of success is a fixed value. We want to know the probability that x of the N trials are successes (0≤x≤N).
Example 2.3
You know the frequency of a dominant gene, A, in a population is equal to 0.1. If you draw 5 members of the population at random, what is the probability that 3 of them will display the dominant phenotype? We assume that the population is sufficiently large that removing 5 individuals is without effect.
If the gene frequency is 0.1 (and the population is in Hardy-Weinberg equilibrium), the probability that an individual is genotype A_ (i.e., AA or Aa) is p^2 + 2pq = 0.01 + 0.18 = 0.19. Suppose that the outcome of the experiment is
Outcome = S F S S F
where S denotes an individual of genotype A_. Since the trials are independent we can obtain the probability for this outcome by multiplying the probabilities of the individual events

P(SFSSF) = (0.19)(0.81)(0.19)(0.19)(0.81) = (0.19)^3 (0.81)^2 = 0.0045

What we really want is the probability of obtaining 3 successes irrespective of order. There are C(5,3) = 10 distinguishable orderings, where C(n,k) = n!/(k!(n-k)!) is the number of ways to choose k items from n, and each ordering has the same probability, so

P(x = 3) = C(5,3) (0.19)^3 (0.81)^2 = 0.045
The generalization of the above example is the binomial distribution. A fixed number, N, of independent trials are conducted, each of which may have one of two possible outcomes (e.g., success or failure) that occur with fixed probability (e.g., p for success and q = (1-p) for failure). The probability of observing exactly x successes is given by

f(x) = C(N,x) p^x q^(N-x),  x = 0, 1, …, N
The expected value and variance of the binomial distribution are
E(x) = Np
V(x) = Npq
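Example 2.3 can be checked numerically. This sketch assumes SciPy (not part of this text), whose scipy.stats.binom implements the distribution above:

```python
from scipy.stats import binom

N, p = 5, 0.19             # 5 individuals drawn; P(A_ phenotype) = 0.19
print(binom.pmf(3, N, p))  # P(exactly 3 dominant phenotypes) ~ 0.045
print(binom.mean(N, p))    # E(x) = Np = 0.95
print(binom.var(N, p))     # V(x) = Npq ~ 0.77
```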
2.3.2. Poisson distribution.
The formal derivation of the Poisson distribution concerns events that occur with constant density in time or space (an example is radioactive decay). In biological experiments this distribution commonly arises as an approximation to the binomial distribution where the number of trials, N, is very large and the success probability, p, is very small. Consider the binomial distribution where p << 1 and N >> x; thus,
N - x + 1 ≈ N, and

N!/(N-x)! = N(N-1)(N-2)⋯(N-x+1) ≈ N^x
You can show by expansion that
(1-p)^(N-x) ≈ e^(-pN)
Substituting in the formula for the binomial distribution we obtain

f(x) ≈ (N^x / x!) p^x e^(-pN) = ((pN)^x e^(-pN)) / x!

and, remembering that pN is simply E(x), which we'll denote m (for mean), we obtain the Poisson distribution

f(x) = (m^x e^(-m)) / x!,  x = 0, 1, 2, …
The mean and variance of the Poisson distribution are equal to m:
E(x) = m
V(x) = m
Example 2.4
You transfect a plasmid containing a selectable marker into a population of 4.5×10^5 cells and expect the plasmid to be functionally integrated into the genome with a frequency of 10^-5. What is the probability that you will get at least two resistant colonies?
In this case, the expected number of colonies, m, is 10^-5 × (4.5×10^5), or 4.5. Using the Poisson distribution, we can calculate the probabilities of obtaining 0 or 1 colony
f(0) = (4.5^0 e^(-4.5)) / 0! = 0.0111
f(1) = (4.5^1 e^(-4.5)) / 1! = 0.0500
Thus,
P(x ≥ 2) = 1 - f(0) - f(1) = 0.939
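The same calculation in code, as a sketch assuming SciPy:

```python
from scipy.stats import poisson

m = 4.5                   # expected colonies: 10^-5 x 4.5x10^5
print(poisson.pmf(0, m))  # f(0) ~ 0.0111
print(poisson.pmf(1, m))  # f(1) ~ 0.0500
print(poisson.sf(1, m))   # P(x >= 2) ~ 0.939
```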
2.3.3. Uniform distribution.
The discrete uniform distribution applies when all of the events in the sample space occur with equal probability. This simple distribution is useful because we can frequently transform data obtained from complex experiments so that it conforms to this distribution. This ability will be useful for hypothesis testing.
If the random variable takes on n possible values (conveniently 1 to n)
f(x) = 1/n,  x = 1, 2, …, n
E(x) = (n+1) / 2
V(x) = (n+1) (n-1) / 12
(The expression for the variance is not intuitively obvious. If you are daring and good at algebra, you might try deriving it.)
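If you would rather check the variance formula numerically than derive it algebraically, this plain-Python sketch (the choice n = 10 is arbitrary) compares brute-force moments with the formulas above:

```python
n = 10
xs = range(1, n + 1)
mean = sum(xs) / n                          # E(x) by direct summation
var = sum((x - mean) ** 2 for x in xs) / n  # V(x) by direct summation
print(mean, (n + 1) / 2)                    # both 5.5
print(var, (n + 1) * (n - 1) / 12)          # both 8.25
```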
2.3.4. Negative binomial distribution.
An interesting variation on the binomial distribution is to consider a sequence of Bernoulli trials in which N is not fixed, but is the number of trials required to obtain r "successes". In that case, the last trial must be a success, so the probability that N trials will be required to obtain r successes is given by

f(N) = C(N-1, r-1) p^r q^(N-r)

which is defined for N ≥ 1, r ≥ 1, and N ≥ r. We can rewrite this distribution by defining s = N - r, the number of failures (taking into account that we must always perform at least r trials),

f(s) = C(s+r-1, r-1) p^r q^s,  s = 0, 1, 2, …

In this form E(s) = rq/p and V(s) = rq/p^2.
Example 2.5
You mate heterozygous animals to obtain a population containing mutant homozygotes that display an interesting behavioral phenotype. You want two mutants to study in detail. Because the apparatus you use to measure this phenotype can only accommodate a single animal, you test animals sequentially from the litter. What is the probability that you will obtain at least two mutants in no more than four chosen progeny?
The mutant homozygotes occur with probability p = 1/4. Using the distribution above with r = 2, the probabilities that the second mutant appears on trial N = 2, 3, or 4 are
f(2) = C(1,1) (1/4)^2 = 0.0625
f(3) = C(2,1) (1/4)^2 (3/4) = 0.0938
f(4) = C(3,1) (1/4)^2 (3/4)^2 = 0.1055
Thus, the probability that you will have at least 2 mutants in the first 4 tested animals is 0.262.
In section 2.4.3, we'll revisit the negative binomial distribution as a mixture of Poisson distributions.
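Example 2.5 can be verified with SciPy, whose nbinom is parameterized by the number of failures, s = N - r, matching the second form above. A sketch, assuming SciPy:

```python
from scipy.stats import nbinom

r, p = 2, 0.25              # want 2 mutants; P(mutant homozygote) = 1/4
# "No more than 4 trials" means no more than s = 4 - 2 = 2 failures
# before the 2nd success.
print(nbinom.cdf(2, r, p))  # P(N <= 4) ~ 0.262
print(nbinom.mean(r, p))    # E(s) = rq/p = 6
print(nbinom.var(r, p))     # V(s) = rq/p^2 = 24
```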
2.3.5. Hypergeometric distribution.
A condition of the binomial distribution (section 2.3.1) is that the success probability, p, is invariant from one trial to the next. This condition can only be met if the trials are independent, i.e., each trial is a sample taken from an infinite population (or a finite population to which we add back each item after choosing it). If instead our sample is taken from a finite population without replacement, the success probability for a particular trial depends on the outcomes of the preceding trials. For a population of size N, in which the proportion of "successes" is p (q = 1-p), the probability of exactly x successes in k trials is given by the hypergeometric distribution

f(x) = [C(Np, x) C(Nq, k-x)] / C(N, k)
The mean and variance of the hypergeometric distribution are kp and kpq(N-k)/(N-1), respectively. For k small relative to the population size N, the frequencies approach those given by the binomial distribution.
Example 2.6
You are studying a mutant mouse that you suspect will exhibit a longer lifespan than wild-type animals. You collect a group of 35 animals, of which 15 are mutant and 20 wild-type, and allow them to age. Of the last 10 animals to expire, 8 are mutant. What is the probability that 8 (or more) of the last 10 mice to die will be mutant if deaths occur at random in the population as a whole?
In this example, we "choose" a sample of 10 (k) from a population of 35 (N) animals and the probability that an animal is mutant (p) is 15/35. Thus, the probability that exactly 8 of the mice are mutant is

f(8) = [C(15,8) C(20,2)] / C(35,10) = 0.0067

Similarly, f(9) and f(10) are 5.4×10^-4 and 1.6×10^-5, respectively. The probability of 8 or more mutants among the last 10 animals to die is 0.0072.
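Example 2.6 in code, as a sketch assuming SciPy (hypergeom takes the population size M, the number of "success" states n, and the sample size N):

```python
from scipy.stats import hypergeom

rv = hypergeom(M=35, n=15, N=10)  # 35 animals, 15 mutant, last 10 deaths
print(rv.pmf(8))                  # f(8) ~ 0.0067
print(rv.sf(7))                   # P(x >= 8) ~ 0.0072
```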
2.3.6. Normal distribution.
The normal, or Gaussian, distribution is widely applicable as a probability model for continuous random variables that may be thought of as resulting from the sum of a large number of small effects. Although the formal derivation of this distribution follows from a small number of simple assumptions, it is mathematically somewhat abstruse. More useful for our purposes, it can also be derived as an approximation to the discrete distributions we have already discussed when the number of observations is fairly large.
The normal probability density function is given by

f(x) = [1/(σ √(2π))] e^(-(x-μ)^2 / (2σ^2))
Thus, the two parameters for this distribution, μ and σ2, are the mean and variance for the distribution. The cumulative distribution is difficult to compute; most statistics books have tables (see Appendix 2) of the cumulative normal distribution standardized to
z = (x-μ)/σ
such that

F(z) = ∫_(-∞)^z [1/√(2π)] e^(-t^2/2) dt
The utility of this distribution is a consequence of the central limit theorem. Briefly stated, this theorem demonstrates that the distribution of the sum of n independent random variables converges to a normal distribution as n becomes large. Because of this property, we will make extensive use of the normal distribution as an approximation to the distributions for test statistics that we will discuss later.
The normal distribution can also be used to approximate the distributions we discussed above. We can consider the binomial distribution in this light if we note that x, the number of successes, may be thought of as the sum of N (the number of trials) random variables, each of which may be a 0 (failure) or a 1 (success). A similar rationalization of the Poisson distribution allows us to use the central limit theorem to approximate it.
Example 2.7
What is the probability of 16 or more successes in 50 trials when the probability of success is 0.25? Using the binomial distribution, we can calculate p(x ≥ 16) = 0.1631. Since the mean and variance for a binomial distribution are
E(x) = Np
V(x) = Npq
we calculate that the mean and standard deviation for this example are 12.5 and 3.061, respectively. The value z for the standardized normal distribution is
z = (16-12.5)/3.061
= 1.143
From a table of the standardized normal distribution
p(z) = 1-F(z)
= 1-0.8735
= 0.1265
We can obtain a better approximation to a discrete distribution by applying a continuity correction. This correction takes into account the stepped shape of a discrete probability function when plotted on a real line. The correction is performed by subtracting 0.5 from the lower bound and adding 0.5 to the upper bound for the desired interval. That is, if P_D represents the probability calculated from the discrete probability distribution, m is the mean for the distribution, s is the standard deviation for that distribution, and P_N represents the probability approximated from the normal distribution,

P_D(a ≤ x ≤ b) ≈ P_N[(a - 0.5 - m)/s ≤ z ≤ (b + 0.5 - m)/s]
For the above example, we would use the corrected z value
z = (16-0.5-12.5)/3.061 = 0.9797
and from the normal table we find that P(z ≥ 0.9797) = 1 - F(0.9797) = 0.1635, which is a closer approximation to the true probability of 0.1631. In general, the normal approximation to the binomial distribution is quite good as long as the value of p is not too close to 0 or 1 and N is moderately large (greater than 10).
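This comparison is easy to automate. A sketch assuming SciPy, computing the exact tail and both normal approximations:

```python
from math import sqrt
from scipy.stats import binom, norm

N, p = 50, 0.25
mu, sd = N * p, sqrt(N * p * (1 - p))  # 12.5 and 3.061

print(binom.sf(15, N, p))              # exact P(x >= 16) ~ 0.1631
print(norm.sf((16 - mu) / sd))         # uncorrected approximation ~ 0.1265
print(norm.sf((16 - 0.5 - mu) / sd))   # continuity-corrected ~ 0.1635
```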
2.3.7. Chi-square distribution.
A final continuous distribution that we will use extensively is the Chi-square, or χ^2, distribution, which is closely associated with the normal distribution. The probability density function for Chi-square is given by

f(x) = [x^(k/2 - 1) e^(-x/2)] / [2^(k/2) Γ(k/2)]
and is defined for x≥0. The parameter k is referred to as the number of degrees of freedom. We won't use the formula above; extensive tables of the Chi-square distribution for various values of k are provided in most statistics books (see Appendix 3). The mean and variance for the distribution are k and 2k, respectively.
The utility of this distribution in statistics stems from two theorems (which we will not prove here). First, if we have k independent random variables that follow the standard normal distribution, then the sum of the squares of those random variables follows a Chi-square distribution with k degrees of freedom. That is, for a set of x_i (i = 1 … k) independent and identically distributed normal (μ=0, σ=1) random variables, the sum

χ^2 = Σ_(i=1)^k x_i^2

has a Chi-square distribution with k degrees of freedom.
Another important feature of this distribution is its reproductive property; the sum of Chi-square random variables also follows a Chi-square distribution. If the k random variables X_i are independent and follow Chi-square distributions with n_i degrees of freedom, then

Y = Σ_(i=1)^k X_i

follows a Chi-square distribution with

n = Σ_(i=1)^k n_i

degrees of freedom.
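The first theorem (and the moments given above) can be checked by simulation. A sketch assuming NumPy and SciPy; k and the number of replicates are arbitrary choices:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
k, reps = 5, 100_000

# Sums of squares of k standard normal random variables
sums = (rng.standard_normal((reps, k)) ** 2).sum(axis=1)

print(sums.mean(), k)                      # sample mean vs. E(x) = k
print(sums.var(), 2 * k)                   # sample variance vs. V(x) = 2k
print(np.mean(sums <= chi2.ppf(0.95, k)))  # fraction below the 95th percentile, ~0.95
```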
2.4. Extensions
More often than not, biological experiments fail to conform to the simple probability models described above. We can consider a few extensions or variations on these simple univariate distributions that can take into account some of this biological variability.
2.4.1. Truncation.
It is not always possible to observe a random variable over its entire range. For example, it may not be possible to observe 0 successes in a binomial experiment, i.e., the data are truncated at 1. We can evaluate the probability of a specified value for an experiment where truncation is present by dividing the probability for the value from the nontruncated distribution by the total probability for the observable values. In the above example,

f_t(x) = f(x) / (1 - f(0)),  x = 1, 2, …

where f_t(x) is the probability in the case of truncation and f(x) is the probability for the non-truncated distribution.
Example 2.8
You are studying the frequency of spontaneous mutations at a marker locus, a, in T cells in vivo. You cross AA × Aa animals, isolate T cells from the progeny, and plate 1.5×10^5 cells to select for aa mutants. Note that only half of the progeny (Aa) can give rise to selectable mutants (assume that it is costly to simply genotype the progeny). Thus, the cultures with 0 mutants aren't really informative. If the mean mutant frequency (among cells from Aa animals) is 1×10^-5, what is the probability of obtaining 2 or fewer mutants among the samples that yielded any mutants?
We can use a truncated Poisson distribution (with no 0 class) to calculate this frequency. The mean number of mutants in cultures from Aa animals is 1.5. Thus,

f_t(1) = f(1) / (1 - f(0)) = 0.3347/0.7769 = 0.431
f_t(2) = f(2) / (1 - f(0)) = 0.2510/0.7769 = 0.323
The chance of obtaining 2 or fewer mutants (given that any were seen) is 0.75.
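A zero-truncated Poisson is simple to compute from the untruncated probabilities. A sketch of Example 2.8, assuming SciPy:

```python
from scipy.stats import poisson

m = 1.5  # mean mutants per culture from Aa animals

def f_t(x):
    """Zero-truncated Poisson probability."""
    return poisson.pmf(x, m) / (1 - poisson.pmf(0, m))

print(f_t(1), f_t(2))   # ~0.431 and ~0.323
print(f_t(1) + f_t(2))  # P(x <= 2 | x >= 1) ~ 0.75
```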
2.4.2. Multivariate distributions.
For some experiments, measurements are taken on more than one property of the system. In that case, a multivariate or joint probability distribution is required for the k variates x_1, x_2, …, x_k. The multivariate extension of the binomial, the multinomial distribution, is obtained when the trials have k possible outcomes, with probabilities of occurrence p_1 … p_k. The probability for a particular set of observations x_1 … x_k in N trials is

f(x_1, …, x_k) = [N! / (x_1! x_2! ⋯ x_k!)] p_1^(x_1) p_2^(x_2) ⋯ p_k^(x_k)

where

Σ_(i=1)^k x_i = N and Σ_(i=1)^k p_i = 1

and

E(x_i) = Np_i, V(x_i) = Np_i(1 - p_i), Cov(x_i, x_j) = -Np_i p_j for i ≠ j

(Cov = covariance)
In the first example given in this chapter, we considered the distribution of codons in a possible open reading frame without regard to the amino acids that might be encoded. A better way to think about this problem would be to look at the distribution of codons used for particular amino acids. For example, we could consider the number of times each of the 6 possible codons for leucine was used, given that there are N leucine codons in the orf, as a multinomial distribution where we might take the values of the p_i from the data for all known protein encoding sequences for the organism.
Example 2.9
In a cross Aa × Aa, what is the probability of obtaining 2 aa, 6 Aa, and 4 AA offspring without regard to order?
The probability that an individual animal has each genotype is 0.25, 0.5, and 0.25, respectively. The desired probability is

f(2, 6, 4) = [12! / (2! 6! 4!)] (0.25)^2 (0.5)^6 (0.25)^4 = 0.0529
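Example 2.9 in code, a sketch assuming SciPy:

```python
from scipy.stats import multinomial

# Counts (aa, Aa, AA) and their probabilities in an Aa x Aa cross
print(multinomial.pmf([2, 6, 4], n=12, p=[0.25, 0.5, 0.25]))  # ~0.0529
```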
2.4.3. Mixtures of distributions.
Even when a biological phenomenon might be expected to follow a simple distribution, heterogeneity in one of the parameters of the parent distribution might cause the data to fail to fit the expected distribution. An example is the distribution of the number of tumors per animal at a particular site in a carcinogenesis experiment. Since the process of tumor development is a rare event occurring among a large number of target cells at risk, we might expect these data to follow a Poisson distribution. In fact, because of variability from animal to animal in responsiveness (particularly for outbred animals), the number of tumors per animal generally deviates quite markedly from a Poisson distribution. In this case, the Poisson parameter, m, varies from animal to animal. If m follows a gamma distribution (a continuous unimodal distribution with a fat tail), the number of tumors per mouse is described well by a negative binomial distribution (where the parameter k, which plays the role of r, is allowed to be non-integral) (Drinkwater and Klotz, 1981). The formula for the negative binomial distribution in this case differs from that shown in Section 2.3.4.
f(t) = [Γ(k+t) / (Γ(k) t!)] [k/(k+m)]^k [m/(k+m)]^t,  t = 0, 1, 2, …
In the above formulation, m is the mean for the distribution and its variance is m + (m^2/k).
The mixture of a Poisson distribution with a variety of distributions for m gives rise to a family of so-called contagious distributions that have been applied to problems of the spread of infection in a population, clustering of accidents, and population growth.
In general if a random variable, x, follows a distribution with parameter t, f(x|t), and t is a random variable with distribution g(t), the probability distribution of x is given by

p(x) = ∫ f(x|t) g(t) dt
where the integration (or summation for a discrete g(t)) is taken over all possible values of t. Mixtures of distributions sometimes take fairly simple forms, but when this is not the case, the values of p(x) may be evaluated numerically.
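The Poisson-gamma mixture of section 2.4.3 can be checked by simulation: draw each animal's Poisson mean from a gamma distribution and compare the resulting counts with the negative binomial. A sketch assuming NumPy and SciPy; the values of k and m are illustrative, not taken from the text:

```python
import numpy as np
from scipy.stats import gamma, nbinom, poisson

rng = np.random.default_rng(1)
k, m = 2.0, 4.0  # gamma shape k; overall mean m

# Per-animal Poisson means drawn from a gamma distribution with mean m
means = gamma.rvs(k, scale=m / k, size=100_000, random_state=rng)
tumors = poisson.rvs(means, random_state=rng)

print(tumors.mean(), m)            # ~4.0
print(tumors.var(), m + m**2 / k)  # ~12.0, the negative binomial variance
# The zero class matches a negative binomial with non-integral r = k
print(np.mean(tumors == 0), nbinom.pmf(0, k, k / (k + m)))  # both ~0.111
```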
2.5. Sample Problems
- In doing an experiment that yields "counting" data, where the observations, x, range in value from 0 to 8, you obtained the distribution indicated below. The value n_x is the number of times that you observed a value of x. Calculate the mean and variance, E(x) and V(x).

| x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_x | 1 | 4 | 2 | 5 | 8 | 3 | 0 | 2 | 2 |

- For a binomial distribution with N trials and success probability p, the mean number of successes is Np. Prove that this assertion is true.
- The most rapid way to construct an inbred strain of animal is by continued brother-sister mating. Assuming that male and female offspring are equally likely, what is the probability that a litter of six animals would not provide a brother and a sister for mating? (Note that all would not be lost, you could set up another mating from the original parents to obtain another litter. In that case, what is the probability that two litters of six each would not provide a brother and a sister?)
- You are interested in mutagenizing an organism at a particular locus. You use a protocol that you expect will give you 15 mutants per 1000 progeny based on experience studying a large number of genes. When you do the experiment and screen 1000 progeny, you find 25 mutants. Do you find this to be an unusually large number of mutants (based on your prior expectations)? Justify your answer.
- You've been studying the function of a mutant gene in mice. The
mutant allele (a) is present in a particular inbred strain Y and
you want to study its effects on a different inbred background X that
has allele A at that locus. You thus decide to make a congenic
mouse strain. You begin by making an F1 hybrid between the
two strains to get mice with genotype aA. The congenic strain is
constructed by repeated backcrossing of aA heterozygotes to
strain X mice. The result of the first backcross is a mixture of
aA and AA progeny. Normally, you would select an aA
offspring and mate to a strain X mouse, but (unfortunately for you) the
phenotype that results from this gene is not manifest until after the
normal breeding age of the animals. To get around this problem, you
decide to randomly select some number N of the mixed group of
offspring for mating to strain X mice. When the phenotypes of these
animals become apparent, you identify a mouse that is aA in
genotype and choose N of its offspring to repeat the process.
a. How large must the number N be in order for you to be 95% sure that at least one of the animals will be aA in genotype?
b. Are you being overconfident? If 5 generations of backcrossing are required, what is the chance that you will have lost the a allele by the end of this experiment?
- You are doing a series of experiments that involve the introduction of
recombinant DNA sequences into mammalian cells by transfection. For some
of the experiments, you need to simultaneously introduce two different
genes into the same cell. You suppose that, with the method you are
using, the few cells that successfully "adopt" the exogenous
DNA take up a large amount of DNA, such that co-introduction of two
markers is a likely event. In order to test this assumption, you do the
following experiment:
Cells are simultaneously transfected with equal quantities of two types of DNA that express independent selectable markers "A" and "B." After treatment, you divide the cells into two parts. For one group, you seed 6 dishes with 100 cells each and select for the presence of marker A. For the other group, you seed 6 dishes with 100 cells each and select for marker B. You observe the following numbers of colonies.
Group 1 (A+): 6, 11, 7, 13, 10, 9
Group 2 (B+): 9, 9, 10, 7, 9, 11

- You believe (from the results of other experiments) that a particular
protein "X" increases the expression of genes linked to a
specific promoter. In order to test this hypothesis, you construct 8
matched pairs of cell lines: one member of each pair expresses
"X" at a high level and in the other, "X" is not
expressed. Into each cell line, you introduce DNA in which the promoter
of interest is linked to a reporter gene and measure the amount of
transcription of the reporter gene by an RNA dot blot.
Construct a test of your hypothesis. [Hint: Consider this to be a binomial experiment with 8 trials and rephrase your null hypothesis (i.e., of no difference within a pair) with that in mind.]

| Cell line pair | cpm hybridization, no "X" | cpm hybridization, with "X" |
| --- | --- | --- |
| 1 | 300 | 450 |
| 2 | 210 | 85 |
| 3 | 900 | 1300 |
| 4 | 750 | 950 |
| 5 | 490 | 375 |
| 6 | 50 | 95 |
| 7 | 195 | 500 |
| 8 | 650 | 1400 |

- We will most often use the normal approximation to a discrete distribution
to compute the probability in the upper tail of a distribution,
i.e., for a value b, the sum of all probabilities from b to
the maximum value for the distribution. It's worth knowing how
good these approximations are for a given distribution.
For the Poisson distribution, compare the exact and normal approximation upper tail probabilities for values of x greater than m + m^0.5, m + 2m^0.5, and m + 3m^0.5 for values of m (the mean for the distribution) of 1 and 8. Round the above values of x to the nearest integer.
[Note: This problem is fairly easy if you are handy with computers. If you would rather do the problem by hand, consider the following helpful hints. First, you only need to compute the exact probabilities once for a given value of m and you don't want to add up all of the exact probabilities to ∞. To save time, compute the normal approximation for m + 3m^0.5 first. Then start computing exact Poisson probabilities from the nearest integer to m + m^0.5 up to a value such that p(x) is 0.001 × the above normal approximation. Second, there is an easy way to compute successive Poisson probabilities. Note that p(x+1) = p(x) × m/(x+1). Using the above formula will save you lots of time.]
- Mice with the genotype a/a +/c^ch will occasionally develop light brown spots in their coats as a consequence of a somatic mutation in the wild type allele at the albino (c) locus that occurs during the proliferation of melanoblasts in utero. You've been studying a mutation in a gene (z) that you believe should increase the rate at which somatic mutations would occur. You do the following cross [Z/z a/a c^ch/c^ch] × [Z/z a/a +/+]. Say that you expect that the number of spots per offspring would be Poisson distributed with the mean number of spots (m) depending on the genotype at the Z locus: Z/Z, m = 0.8; Z/z, m = 2.0; z/z, m = 10. What would be the distribution of the number of spots per mouse from the above cross? Graph the distribution.