3. Estimation
We will often want to use the set of observed values obtained in an experiment, that is, the sample distribution for the random variable, in order to infer, or estimate, properties of the population from which the sample was drawn. These properties might include the mean or variance of the population distribution or parameters of the distribution function (e.g., the success probability for a binomial distribution). In genetics, we will often be interested in estimating such quantities as gene frequencies, recombination frequencies, or mutation rates. While most of this book focuses on nonparametric statistics, in which the form of the population distribution is not assumed to be known, it is worth discussing briefly some properties of estimators and a general method for obtaining a class of estimators with useful properties. We will also discuss the use of confidence limits to provide a measure of the quality of our estimates and describe an approach to using the sample data to obtain confidence limits for parameters when the form of the population distribution is unknown.
3.1. Properties of Estimators
The process of estimation is analogous to that of
measurement. When we measure some quantity (weight, blood pressure,
etc.) we want our result to be accurate (close to the true value) and
precise (highly reproducible from measurement to measurement). The same
considerations apply to problems of estimation in which we seek estimators
that are unbiased and consistent. Consider a sample consisting of
n observations, from which we can define an estimator, p̂, for a parameter p. If this estimator is unbiased, then

E[p̂] = p

for all values of n.
Another desirable property for estimators is consistency, i.e., the estimated value should converge on the true value as the number of observations increases. In terms of probability, for any positive value of ε (however small):

lim_{n→∞} P(|p̂ - p| > ε) = 0
Given a choice of estimators that are unbiased and consistent, we would obviously prefer the one with the smallest variance. In the figure below, the top panel depicts two alternative estimators. One of the estimators, the one with the smaller variance (narrower distribution), is biased because the mean value is different from the true value of p. In the bottom panel, distributions for an unbiased, consistent estimator are depicted. Consider the following question: Which of the two estimators shown in the upper panel (the unbiased one or the one with the smaller variance) is likely to be more useful to you?
3.2. Maximum Likelihood Estimation
Consider an experiment in which we make n independent observations of the random variable x, which follows a distribution f(x|p), where p is the vector of parameters (i.e., {p₁, p₂, …, pₖ}) for the distribution. The probability of obtaining the particular results for this experiment, the likelihood, is given by

L = ∏_{i=1}^{n} f(xᵢ|p)
The above expression is simply the product of the probabilities of obtaining the observations in our experiment, given the parameters for the distribution, and follows from the previous definition of independence and the product rule.
The use of the sample likelihood function in problems of hypothesis testing and estimation was first proposed by R. A. Fisher in 1925 (Fisher, 1973). This approach has a strong intuitive appeal. If we have two hypotheses that differ in the values of the parameters p, it seems quite natural to favor the hypothesis that yields the higher likelihood and that the ratio of the likelihoods would give us a measure of the confidence with which we should prefer that hypothesis. The use of likelihood ratio tests for statistical inference is an important topic because these tests have the general property of being the most powerful (see Chapter 4) for any case in which the form of the distribution function can be specified. Although we won't be able to cover this topic, the efficiencies of the statistical tests we discuss later will be considered relative to the appropriate likelihood ratio test.
We most often will be interested in testing hypotheses that do not specify precisely the values of the parameters of the distributions but only the form. What method should we then use to estimate the parameters under the hypothesis? Again, it seems natural to use an estimate that gives the highest possible likelihood under the hypothesis. This estimator, the maximum likelihood estimator (MLE), has, as will be discussed below, several important properties that make it the estimator of choice in a variety of experimental situations. Traditional and modern linkage analysis is, for example, dominated by the concept of likelihood and maximum likelihood estimation.
3.2.1. Determining the Maximum Likelihood Estimator.
The basic idea underlying maximum likelihood estimation is rather simple: You write down, in terms of the parameter to be estimated, the probability of what you observed, and then choose as your estimate of that parameter the value that maximizes this probability. Although the idea is straightforward, sometimes the implementation is difficult. We will stick to relatively simple examples below and describe three methods for obtaining an MLE: the analytic approach, numerical methods, and the EM algorithm. Our first example uses the analytic method to obtain the MLE for the binomial success probability, p.
Example 3.1
You want to estimate the frequency of recombination between two loci, A and B, and set up the following test cross, in which the father is doubly heterozygous (with A and B in coupling) and the mother is homozygous for both recessive alleles:

AB/ab × ab/ab
You observe the following distribution of genotypes for 30 progeny (only the haplotype from the father is shown)
Genotype | No. of Progeny |
A B | 13 |
A b | 2 |
a B | 1 |
a b | 14 |
Thus, you observed 3 recombinant animals in your set of 30. Since the animals are "independent" and each is either a recombinant (success) or non-recombinant (failure), the number of observed recombinants, x, in a collection of N offspring should follow a binomial distribution with a parameter p, which is the probability of recombination.
For the binomial distribution, we could view our experiment as N trials of one observation each. Thus, the likelihood function is

L(p) = p^x (1-p)^(N-x)

where x is the number of recombinants observed among the N offspring. We can obtain the maximum likelihood estimate, p̂, of the success probability by setting the derivative of the likelihood function to 0 and solving for p.
In deriving the maximum likelihood estimator for a parameter it is often more convenient to work with the log of the likelihood, in order to convert the product in the likelihood function to a sum. The value of p that maximizes ln(L) will also maximize L. For our example, the log likelihood and its derivative are

ln(L) = x ln(p) + (N-x) ln(1-p)

d ln(L)/dp = x/p - (N-x)/(1-p)

To obtain the maximum likelihood estimator, p̂, we can set the derivative to 0 and solve for p:

x/p̂ - (N-x)/(1-p̂) = 0, which gives p̂ = x/N
The above estimator should come as no surprise, since it is the one you would use quite naturally. It is often the case that, when there is an intuitively obvious estimator, it is also the maximum likelihood estimator.
Without deriving it, we'll note that the variance for the maximum likelihood estimator of p can be obtained by taking the second derivative of the log likelihood (see below). The variance is the reciprocal of -1 times the expected value of the second derivative of the log likelihood:

V(p̂) = -1 / E[d²ln(L)/dp²]

For our example, d²ln(L)/dp² = -x/p² - (N-x)/(1-p)², and, since E[x] = Np, the expected value of the second derivative is -N/[p(1-p)]. Thus, our variance is

V(p̂) = p(1-p)/N

which can be estimated by substituting p̂ for p in the right hand side of the equation.
Example 3.1 (continued)
For our test cross, the maximum likelihood estimate for the probability of recombination is

p̂ = x/N = 3/30 = 0.10

and its standard deviation (the square root of the variance) is estimated to be

√[p̂(1-p̂)/N] = √[(0.1)(0.9)/30] = 0.055

We can compute the likelihood for our experiment under the condition that the recombination probability is 0.10 from

L(0.10) = (0.10)^3 (0.90)^27 = 5.815 × 10⁻⁵

You can satisfy yourself that 0.1 is the maximum likelihood estimate by trying a few alternative values. For example, the likelihoods for p=0.11 and 0.09 are 5.724 × 10⁻⁵ and 5.713 × 10⁻⁵, respectively.
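To make the arithmetic concrete, here is a minimal sketch (an added illustration, not part of the original text) that reproduces the numbers above: the MLE p̂ = x/N, its approximate standard deviation, and the likelihood L(p) = p^x (1-p)^(N-x) at a few trial values of p.

```python
import numpy as np

x, N = 3, 30                                 # recombinants observed, total progeny

p_hat = x / N                                # MLE of the success probability
sd_hat = np.sqrt(p_hat * (1 - p_hat) / N)    # sqrt of p(1-p)/N evaluated at p_hat

def likelihood(p):
    """Likelihood of observing x recombinants in N progeny: p^x (1-p)^(N-x)."""
    return p**x * (1 - p)**(N - x)

print(f"MLE = {p_hat:.3f}, SD = {sd_hat:.3f}")       # 0.100, 0.055
for p in (0.09, 0.10, 0.11):
    print(f"L({p:.2f}) = {likelihood(p):.3e}")        # the maximum is at p = 0.10
```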
A similar approach can be used to obtain estimates for multiple parameters of more complex likelihood functions. The partial derivative of the log likelihood with respect to each parameter is set to 0 and the resulting set of equations is solved for the parameters. In this case, the parameters will covary and we will need to invert a matrix of the partial second derivatives to determine the variances and covariances of the parameters.
3.2.2. Properties of MLEs.
If we use MLE theory, is our estimate a good one? Obviously, it depends on what we mean by good. Statisticians have defined various properties of goodness precisely. Under rather general conditions, maximum likelihood estimators (MLE) have many of these properties. Here are some of those properties; the value of most should make intuitive sense to you.
MLEs are consistent. As described in section 3.1, this assertion means that the estimator converges in probability to the true value of the parameter as the sample size increases. In other words, the more information we have, the better our estimator will be. Implicit in this discussion is a concept that is crucially important to an understanding of statistical methods: Estimators have distributions. If you sample over and over, your estimator won't yield the same estimate every time because the values in the sample will vary randomly. If you plot the estimates from a large number of samples, you want the resulting distribution of estimates to be closely concentrated around the true value of the parameter, and it would certainly be useful if the concentration became tighter for larger samples. That, in essence, is the property of consistency (illustrated in the sketch following this list of properties).
MLEs tend to normality as the sample size gets large. This property is useful because the normal distribution is nicely symmetrical and a great deal is known about it (see section 2.3.6).
MLEs have a built-in (asymptotic) variance formula. For reasonably large sample sizes, the variance of an MLE is given by the formula

V = -1 / E[d²ln(L)/dp²]
where V is the variance and E refers to the expectation. We might ask why the variance of the estimator would have anything to do with the second derivative of the log likelihood. Remember, the likelihood is a function of p whose graph is a unimodal curve. The first derivative of the log likelihood gives us the slope of the curve at any value of p. The second derivative tells us how fast that slope is changing with changes in p. If the slope is changing slowly (the second derivative is small), then the likelihood is relatively flat around the maximum. If the slope is changing quickly (the second derivative is large), the likelihood function is rather steep around the maximum. If the likelihood is rather flat, then the value of p corresponding to the maximum likelihood is not much more likely than nearby values of p. On the other hand, if the curve is steep, the maximum likelihood estimate is considerably more likely than most other values of p. The second derivative, thus, is a measure of how much information we have about the unknown p. A large second derivative means lots of information (small variance), while a small second derivative means little information (large variance). Hence the obscure sounding relationship: the large sample variance of the MLE equals the reciprocal of minus the average value of the second derivative of the log of the likelihood. Large, here, is hard to define precisely. The formula is only exact at a sample size of infinity, but N=30 is generally close enough to infinity for us to use the formula.
MLEs are efficient. This statement means that, under rather general assumptions, MLEs have a smaller variance (more information) than other kinds of estimators, at least in large samples. In fact, they often have the smallest variance possible.
MLEs are sufficient. Sufficiency is the most important general property of MLEs, but is also the most subtle. Loosely, it means that an MLE always uses all of the information in the sample. However, that is not a very informative statement unless the term information is carefully defined. What is meant is that the distribution of the sample, given the MLE, is independent of the unknown parameter. This concept is both interesting and deep, but a simple example may make it easier to assimilate. Suppose we wish to estimate the probability, p, of observing heads by flipping a coin 100 times. We observe 40 heads (and our MLE will therefore be 0.4). Note that in deriving this estimate, we do not need to know the order in which the heads or tails were observed (there are 100!/40!60! equally likely orders). In some sense then, the proportion of heads observed embodies all of the information in the sample there is concerning the unknown parameter, p. So the observed proportion of heads is a sufficient estimator of p.
MLEs are invariant. This very useful property means that if p̂ is the MLE of p, and f(p) is some function of p, then f(p̂) is the MLE of f(p). For example, the MLE of p² is p̂².
MLEs are not, in general, unbiased. Unfortunately, it is not, in general, true that the average of the MLE equals the unknown parameter for finite sample size. Of course, even if an MLE is biased at finite sample sizes, it will become less so as the sample size gets larger because MLEs are consistent. Sometimes the exact bias is known. For example, the 'n-1' in the formula for the sample variance is a correction for bias.
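To illustrate the consistency property described above (a small simulation added here for illustration; it is not part of the original text), the sketch below draws many replicate binomial samples and shows that the distribution of p̂ = x/N stays centered on the true p while its spread shrinks as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.1
for N in (30, 300, 3000):
    # 10,000 replicate experiments, each yielding one estimate p_hat = x/N
    estimates = rng.binomial(N, p_true, size=10_000) / N
    print(f"N = {N:5d}: mean of p_hat = {estimates.mean():.4f}, "
          f"SD of p_hat = {estimates.std():.4f}")
```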
3.2.3. Numerical methods for maximum likelihood estimation.
Deriving a closed expression for a maximum likelihood estimator is not always as straightforward as it was above for the binomial distribution. In such cases we can use numerical methods to obtain the estimator by computing the log likelihood as a function of the parameter of interest and finding the value of the parameter that gives the highest (i.e., least negative) log likelihood. As shown in the example below, this procedure is relatively simple to perform for distributions with a single parameter, but it can also be applied simultaneously for any arbitrary number of parameters.
Example 3.2
You have been studying a gene that causes coat color spots in mice as a consequence of somatic mutation in pigment cells, and you are interested in the effect of the genetic background of the animals on the frequency of these somatic mutations. Animals heterozygous for this gene (D/+) develop the spots, but the homozygous mutant mice die before birth. You mate heterozygous mutant mice with wild type animals from a different strain and analyze the progeny for the number of spots. Since these somatic mutations are presumably rare events, you imagine that the number of spots per mutant animal should follow a Poisson distribution, and you wish to estimate the Poisson parameter, m, for a variety of matings of mutant mice to different strains. Note that only half of the offspring will carry the mutant gene, so the "zero" class (animals with no spots) will contain both irrelevant mice (+/+) and D/+ mice that develop no spots. Thus, the distribution of the number of spots, x, per mouse (among mice that contain spots) will be a truncated Poisson distribution

P(x|m) = e^(-m) m^x / [x! (1 - e^(-m))],  x = 1, 2, 3, …
Among 25 animals that display spots, we observe
Number of spots (xᵢ) | Number of mice (nᵢ) |
1 | 8 |
2 | 7 |
3 | 4 |
4 | 3 |
5 | 3 |
We can compute the log likelihood as

ln(L) = Σ_{i} nᵢ [xᵢ ln(m) - m - ln(xᵢ!) - ln(1 - e^(-m))]

We can find the maximum likelihood estimate for m by computing the log likelihood for a set of trial values and zeroing in on the maximum, for example by Newton's method. The log likelihood reaches its maximum at m̂ = 2.16. Estimating the variance of m̂ numerically, we obtain 0.187. Thus, the MLE for m is 2.16 ± 0.43.
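The kind of numerical maximization used in this example is easy to script. The sketch below (assuming SciPy is available; a bounded optimizer stands in for the trial-value/Newton's-method search described above) maximizes the truncated-Poisson log likelihood over m for the data in the table.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

spots = np.array([1, 2, 3, 4, 5])   # number of spots, x_i
mice = np.array([8, 7, 4, 3, 3])    # number of mice with that many spots, n_i

def neg_log_lik(m):
    # truncated Poisson: P(x|m) = exp(-m) m^x / [x! (1 - exp(-m))], x = 1, 2, ...
    log_p = -m + spots * np.log(m) - gammaln(spots + 1) - np.log1p(-np.exp(-m))
    return -np.sum(mice * log_p)

fit = minimize_scalar(neg_log_lik, bounds=(0.01, 10), method="bounded")
print(f"m_hat = {fit.x:.2f}")        # approximately 2.16
```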
3.2.4. The EM algorithm.
Because of dominance and epistasis, genetic experiments often pose the problem of obtaining maximum likelihood estimates in the face of incomplete data. For example, in crosses involving two loci with dominant alleles, the presence of both recombinant and non-recombinant offspring in some phenotypic classes may make it impossible to simply count recombinants, as we did in Example 3.1, to obtain an estimate of the recombination frequency. The Expectation Maximization (EM) algorithm provides an iterative approach to obtaining maximum likelihood estimates in such cases.
In our example of linkage between dominant markers, we would apply the EM algorithm as follows.
- We obtain (or just guess) the value of the recombination fraction, r, by some simple method.
- We use this value of r and our observed phenotypes to calculate the expected numbers of recombinants and non-recombinants under the assumption that the estimate is correct and we can distinguish all of the genotypes.
- We use the above expectations to obtain the MLE for r, which is simply the expected number of recombinants divided by the number of progeny.
- Repeat steps 2 and 3 until successive estimates differ by less than some specified, small amount, giving us the MLE, r̂.
The EM algorithm is a powerful, general method for obtaining maximum likelihood estimates and tends to converge quickly. There are also methods (too complex to provide in detail here) using this approach to obtain estimates of the variance of the estimator. In general, it is wise to try several starting values for the parameter in case the likelihood surface is multimodal.
Example 3.3
We are analyzing the results of the mating AB/ab × Ab/ab (uppercase indicating the dominant alleles)
from which we obtain 5 offspring with the AB phenotype, 3 of Ab, 1 of aB and 1 of ab. The genotype frequencies (as a function of r) for this mating are
 | Ab | ab |
AB | (1-r)/4 | (1-r)/4 |
aB | r/4 | r/4 |
Ab | r/4 | r/4 |
ab | (1-r)/4 | (1-r)/4 |
The AB phenotype comprises three genotypes, AB/Ab, AB/ab, and aB/Ab, the last of which is recombinant. Among these AB offspring, the proportion of recombinants is

r/(2-r)
Thus, the expected number of recombinants in the AB phenotypic class is 5r/(2-r). Similarly, the expected number of recombinants for the Ab class is 3(2r)/(1+r). All of the aB offspring and none of the ab offspring are recombinant.
To apply the EM algorithm to obtaining the MLE of r, let's use 0.5 (i.e., non-linkage) as our starting value.
The expected number of recombinants, R, given r=0.5 and our observed phenotypes is

E[R] = 5r/(2-r) + 3(2r)/(1+r) + 1 = 5(0.5)/1.5 + 3(1.0)/1.5 + 1 = 4.66667

This expected value gives us a new estimate of r of 0.46667 (dividing E[R] by the number of offspring). We use this estimate to again obtain the expected number of recombinants and a new estimate of r. The complete sequence of iterations to estimate r̂ to 5 decimal places is
0.50000 0.46667 0.44308 0.42652 0.41493 0.40684
0.40119 0.39726 0.39452 0.39261 0.39128 0.39036
0.38971 0.38926 0.38895 0.38873 0.38858 0.38847
0.38840 0.38835 0.38831 0.38829 0.38827 0.38826
0.38825 0.38824 0.38824
We converge to the same value, starting from r=0.1.
Our MLE, r̂, is 0.388 (with a variance, V(r̂), of 0.072).
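The iteration above is simple enough to script. The sketch below (my own illustration, not the authors' code) implements the E and M steps for this cross using the expected-recombinant expressions 5r/(2-r), 3(2r)/(1+r), and 1 derived above, stopping when successive estimates agree to five decimal places.

```python
def em_recombination(r, n_offspring=10, tol=1e-5, max_iter=200):
    """Iterate E and M steps until successive estimates of r agree within tol."""
    for _ in range(max_iter):
        # E step: expected number of recombinants given the current estimate of r
        expected_R = 5 * r / (2 - r) + 3 * (2 * r) / (1 + r) + 1
        # M step: treat expected_R as if it had been observed and re-estimate r
        r_new = expected_R / n_offspring
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r

print(round(em_recombination(0.5), 5))   # ~0.38824
print(round(em_recombination(0.1), 5))   # the same value from a different start
```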
3.3. Confidence Limits
The variance of an estimator, obtained above from the second
derivative of the likelihood function, provides a measure of its precision.
Alternatively we could define a range of values for the parameter, a
confidence interval, for which we can assert that the probability is greater
than or equal to 1-α that this interval
contains the true value of the parameter. In the discussion above, we
focused on methods to obtain a unique estimate of a parameter, p̂, based
on our set of observations. We could instead devise a new method that
specifies a range of values for the parameter. If we did our experiment a
large number of times, the estimated range for the parameter is a confidence
interval if the range contained the true value of the parameter in
100(1-α) percent of the trials. In the discussion
below, we will describe methods for determining the confidence interval for
the binomial success probability, but the same approaches can be used in
any case that we can model our experiment with an easily computed
distribution function. We will also discuss a powerful method for obtaining
confidence limits when the form of the appropriate population distribution
is unknown.
3.3.1. Binomial confidence limits.
Estimating the confidence limits for the binomial parameter, p, is straightforward, if somewhat computationally intensive. As an example, consider the linkage experiment discussed above (Example 3.1). We observed 3 recombinant animals in a sample of 30, giving us a maximum likelihood estimate for the recombination frequency equal to 0.1. We wish to compute the 95% confidence limits for this frequency. In order to obtain the upper confidence limit, we want to determine the value pᵤ such that the probability of obtaining 3 or fewer recombinants would be smaller than 0.025 if the true recombination frequency were equal to pᵤ. Using the binomial distribution, we can compute, for various values of p, the probability

P(x ≤ 3 | p) = Σ_{x=0}^{3} C(30, x) p^x (1-p)^(30-x)

where C(30, x) is the binomial coefficient.
We can similarly define the lower bound for the confidence interval, pₗ, as the value of p that gives a probability of 0.025 for obtaining 3 or more recombinants

P(x ≥ 3 | p) = Σ_{x=3}^{30} C(30, x) p^x (1-p)^(30-x)
Setting each of the cumulative binomial probabilities above to 0.025, we need to solve for p to obtain our confidence limits. Although it is not possible to arrive at a solution algebraically, we can use numerical methods to estimate p. The simplest such approach is the bisection method for finding the root of the equation 0.025 - P = 0. We choose a starting value for p and alter it until the difference changes sign, i.e., until we have bracketed the value of p that we seek. We then compute the value of the function at the midpoint of the interval and replace the endpoint of the same sign, yielding a new interval of half the size. We repeat this process until the interval is as small as we desire. The table below shows the results for estimation of pᵤ.
Interval | p | 0.025 - P |
— | 0.4 | 0.0247 |
— | 0.2 | -0.0977 |
[0.2, 0.4] | 0.3 | 0.0157 |
[0.2, 0.3] | 0.25 | -0.0125 |
[0.25, 0.3] | 0.275 | 0.0058 |
[0.25, 0.275] | 0.2625 | -0.0020 |
[0.2625, 0.275] | 0.26875 | 0.0022 |
[0.2625, 0.26875] | 0.26562 | 0.0002 |
[0.2625, 0.26562] | 0.26406 | -0.0008 |
[0.26406, 0.26562] | 0.26484 | -0.0003 |
[0.26484, 0.26562] | 0.26523 | -0.00004 |
Thus, we obtain an upper confidence limit of approximately 0.265. Using an analogous approach, considering the probability of obtaining 3 or more recombinants as a function of p, we obtain a lower confidence limit of 0.021. Thus, the 95% confidence interval for our estimate of the recombination frequency is (0.021, 0.265). Note carefully the meaning of this confidence interval. It is not appropriate to say that there is a 95% probability that the true value for the recombination fraction lies in the interval (0.021, 0.265) because the true value is fixed. We can say that, if we repeated this experiment many times and estimated the confidence interval for each replicate as we did above, 95% of these intervals would contain the true value.
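With a statistics library, the bisection can be delegated to a standard root finder. The sketch below (assuming SciPy; not part of the original text) recovers essentially the same limits by solving P(x ≤ 3 | pᵤ) = 0.025 and P(x ≥ 3 | pₗ) = 0.025 directly.

```python
from scipy.stats import binom
from scipy.optimize import brentq

x, N, alpha = 3, 30, 0.05

# upper limit: probability of 3 or fewer recombinants equals alpha/2
p_upper = brentq(lambda p: binom.cdf(x, N, p) - alpha / 2, 1e-9, 1 - 1e-9)
# lower limit: probability of 3 or more recombinants equals alpha/2
p_lower = brentq(lambda p: 1 - binom.cdf(x - 1, N, p) - alpha / 2, 1e-9, 1 - 1e-9)

print(f"95% CI: ({p_lower:.3f}, {p_upper:.3f})")   # approximately (0.021, 0.265)
```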
Estimating the confidence interval for the binomial parameter as described above would be difficult to perform without the aid of a computer. However, we can take advantage of the normal approximation to the binomial distribution (see Example 2.7) to compute more simply an approximate confidence interval.
Recall that the mean for the binomial distribution is equal to Np, and its variance is Np(1-p). We can use the standard normal distribution (μ=0, σ=1) to approximate the distribution of the number of successes, x, using the transformation z = (x - μ)/σ = (x - Np)/√[Np(1-p)].
In order to define the endpoints for the (1-α) confidence interval, we can use the tabulated, cumulative standard normal distribution (Appendix 2) by finding the value of z that yields an upper tail probability of α/2. For the 90%, 95%, and 99% confidence intervals, the appropriate values of z are 1.64, 1.96, and 2.57, respectively. Including a continuity correction of 1/2, the lower and upper confidence limits for p satisfy the equations

(x - 1/2 - Npₗ)/√[Npₗ(1-pₗ)] = z

(x + 1/2 - Npᵤ)/√[Npᵤ(1-pᵤ)] = -z

Squaring each equation gives a quadratic in p. Solving the above equations for pₗ and pᵤ, we obtain

pₗ = {2x - 1 + z² - z√[z² - 2 - 1/N + 4p̂(N(1-p̂) + 1)]} / [2(N + z²)]

pᵤ = {2x + 1 + z² + z√[z² + 2 - 1/N + 4p̂(N(1-p̂) - 1)]} / [2(N + z²)]

where p̂ = x/N.
In applying these approximations, note that pₗ cannot be less than 0 nor pᵤ greater than 1. For the problem detailed in Example 3.1, we would obtain approximate 95% confidence limits for p of (0.026, 0.277), which may be compared to the interval (0.021, 0.265) we determined above by iteration. When the binomial distribution is more nearly normal, a better approximation to the confidence limits is obtained. For example, if we had observed 30 recombinants in 120 offspring, the approximate 95% confidence limits for the recombination fraction would be (0.177, 0.339), while those obtained iteratively using the binomial distribution are (0.175, 0.337).
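As a quick check, the sketch below implements the continuity-corrected expressions given above (one standard form of the normal approximation; the exact form of the correction is reconstructed here, so treat it as an assumption) and reproduces both of the intervals quoted in this section.

```python
import math

def approx_binomial_ci(x, N, z=1.96):
    """Continuity-corrected normal approximation to the binomial confidence limits."""
    p_hat = x / N
    lower = (2 * x - 1 + z**2
             - z * math.sqrt(z**2 - 2 - 1 / N + 4 * p_hat * (N * (1 - p_hat) + 1))
             ) / (2 * (N + z**2))
    upper = (2 * x + 1 + z**2
             + z * math.sqrt(z**2 + 2 - 1 / N + 4 * p_hat * (N * (1 - p_hat) - 1))
             ) / (2 * (N + z**2))
    return max(0.0, lower), min(1.0, upper)

print(approx_binomial_ci(3, 30))     # about (0.026, 0.277)
print(approx_binomial_ci(30, 120))   # about (0.177, 0.339)
```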
3.3.2. Bootstrap confidence intervals.
Using the above method to determine the confidence limits for an estimator depends on our ability to specify the type of probability distribution that adequately describes the population from which our sample was taken. Unfortunately, in many experiments we will only have a dim idea of the nature of the appropriate population distribution. The bootstrap method, first discussed in detail by Efron (1979; Efron and Tibshirani, 1993), is based on the idea that, in the absence of other information, the distribution of values in our sample (the empirical distribution) is the best model for the population distribution. We can model the distribution of our estimator by repeatedly re-sampling our set of observations and computing the estimator for each new sample.
The most frequent application of the bootstrap method is to determine confidence limits for estimators, such as the mean, variance, or median. Application of this method is straightforward, though computationally intensive. Consider the problem of determining the 95% confidence limits for the mean of a set of N observations. We construct a bootstrap sample by choosing at random one of our observations, noting its value, returning that observation to the pool, and repeating the process until we have a set of N randomly chosen values; that is, we sample with replacement from our original set of observations. We then compute the mean of this bootstrap sample. If we construct 1000 bootstrap samples and determine the mean of each, we can approximate the confidence interval for our sample mean by putting our bootstrap means in increasing order and taking the 25th (i.e., the 2.5 percentile point) and 975th values in the list.
Example 3.4
You are comparing the efficacies of the standard therapy for a particular type of cancer with a new treatment. Patients are assigned at random to the two treatments and you observe the length of time (in months) that each survives. The ordered survival times for the two groups are
Standard (n=32): 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 2.5 2.5 2.5 2.5 3 4.5 4.5 4.5 5 5 6 6 7 7 8 8 8 8 8 11 15 16 26 33
New (n=26): 1.5 2.5 2.5 3.5 3.5 6.5 8.5 8.5 8.5 9.5 10.5 13 13 13 13 14 14 15 15 19 21 26 26 35 35 42
The median survival times for the Standard and New therapies are 4.75 and 13 months, respectively. What are the 95% confidence limits for these medians?
Construct 10000 bootstrap samples by choosing with replacement 32 values from the set of Standard observations and determine the median of each sample. The first few medians are
5 4.5 3 4.75 2.75 4.75 4.75 4.5 4.75 5.5 7 5 4.5 2.5 6 5.5 4.75 6 …
Placing our 10000 medians in order, we find that the 250th and 9750th values are 2.5 and 7, respectively. Thus, the 95% confidence limits for the median survival on the Standard therapy are (2.5,7).
A similar approach yields 95% confidence limits for median survival on the new therapy of (8.5,15). Given that the confidence limits for the two treatments do not overlap, we would have a good reason to prefer the new therapy.
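Here is a sketch of the percentile bootstrap used in this example, assuming NumPy (not the original authors' code). Because the resampling is random, the limits will vary slightly from run to run, but with 10,000 bootstrap samples they should fall very close to the values quoted above.

```python
import numpy as np

standard = np.array([1.5] * 8 + [2.5] * 4 + [3, 4.5, 4.5, 4.5, 5, 5, 6, 6, 7, 7]
                    + [8] * 5 + [11, 15, 16, 26, 33])

rng = np.random.default_rng(1)
boot_medians = np.array([
    np.median(rng.choice(standard, size=standard.size, replace=True))  # resample with replacement
    for _ in range(10_000)
])
boot_medians.sort()
print(boot_medians[249], boot_medians[9749])   # roughly (2.5, 7), as in the example
```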
Although a number of more sophisticated methods for determining the bootstrap confidence limits of an estimator have been devised, largely to reduce bias in the case that the distribution of the data is highly skewed, the simple approach described above has been shown to perform well under most circumstances. A detailed discussion of various bootstrap methods can be found in Bryan Manly's book, Randomization, Bootstrap and Monte Carlo Methods in Biology (2001).
How many bootstrap samples are required? Based on several studies, Manly suggests at least 1000 resamplings for 95% confidence limits, scaling upward as α decreases (e.g., 5000 trials for the 99% confidence limits). The great increase in readily available computing power over the last decade has generated increasing interest in the bootstrap and other "computationally intensive" methods. When first proposed by Efron in 1979, bootstrap methods required access to a mainframe computer. The 10,000 re-samplings used in the example above to generate the confidence interval for a median required less than 0.03 seconds on a mid-range desktop computer.
3.4. Expectations and Variances of Functions
We often are interested in derived quantities obtained from measurements on one or more random variables. Examples include normalizing the number of counts in a particular band on a Southern blot to a reference band in the same lane or calculating the volume of a sphere from its measured diameter. We can broaden the definitions for expectation and variance to include any function g(x) of a random variable x as follows

E[g(x)] = ∫ g(x) f(x) dx

V[g(x)] = ∫ {g(x) - E[g(x)]}² f(x) dx
where f(x) is the distribution of x and the integral is taken to indicate summation for a discrete variable. Calculating these values explicitly for most distributions and functions is not easy. An approximate method is available that is generally accurate to order 1/n where n is the number of measurements.
For a function of a single random variable g(x) where the mean of x is m and the variance of x is V(x)
E[g(x)] ≈ g(m)
V[g(x)] ≈ {dg/dx}2 V(x)
where the derivative is evaluated at m.
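The sketch below (an added illustration; the choice of g(x) = ln(x) is mine) checks the single-variable approximations E[g(x)] ≈ g(m) and V[g(x)] ≈ {dg/dx}² V(x) against a simple simulation.

```python
import numpy as np

rng = np.random.default_rng(2)
m, sd = 50.0, 5.0
x = rng.normal(m, sd, size=100_000)        # measurements of a random variable x

g = np.log(x)                              # the derived quantity g(x) = ln(x)
approx_mean = np.log(m)                    # g(m)
approx_var = (1 / m)**2 * sd**2            # {dg/dx at m}^2 * V(x)

print(f"simulated:    mean = {g.mean():.4f}, variance = {g.var():.5f}")
print(f"approximated: mean = {approx_mean:.4f}, variance = {approx_var:.5f}")
```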
A similar approximate method may be used for functions of several random variables using partial derivatives. For most purposes, it will suffice to consider functions of two random variables, x and y, involving constants, a and b. When working with two random variables it is useful to define their covariance
Cov(x,y) = E(xy) - E(x)E(y)
Note that if x and y are independent, the covariance is equal to 0.
The table below gives the expectation and variance for several common functions.
Function | Expectation | Variance |
ax + b | aE(x) + b | a²V(x) |
x + y | E(x) + E(y) | V(x) + V(y) + 2Cov(x,y) |
x·y (independent) | E(x)E(y) | V(x)E²(y) + V(y)E²(x) + V(x)V(y) |
x/y (independent) | E(x)/E(y) | {E²(x)/E²(y)}{V(x)/E²(x) + V(y)/E²(y)} |
where E²(x) = {E(x)}²
Example 3.5
Extracts are prepared from cells co-transfected with an expression plasmid and with a control vector or one carrying a dominant negative mutant of a particular transcription factor. The extracts are assayed for luciferase activity, giving the following values (relative light units):
Control: 2127120, 1235417, 1053546, 1571762
Dom. Neg: 320310, 265511, 373367, 452025
What is the activity in the dominant negative transfections normalized to the control activity? The means and variances are

E(control) = 1.497×10⁶; V(control) = 2.226×10¹¹

E(mutant) = 3.528×10⁵; V(mutant) = 6.314×10⁹

Applying the formulas in the table above for the ratio of independent variables,

mutant/control = 0.24 ± 0.09
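As a worked check (an added sketch, not from the original text), the calculation below applies the ratio formulas from the table above to the luciferase data, using the sample means and variances.

```python
import numpy as np

control = np.array([2127120, 1235417, 1053546, 1571762])
mutant = np.array([320310, 265511, 373367, 452025])

Ec, Vc = control.mean(), control.var(ddof=1)   # sample mean and variance (n - 1)
Em, Vm = mutant.mean(), mutant.var(ddof=1)

ratio = Em / Ec
var_ratio = (Em**2 / Ec**2) * (Vm / Em**2 + Vc / Ec**2)
print(f"{ratio:.2f} ± {np.sqrt(var_ratio):.2f}")   # 0.24 ± 0.09
```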
3.5. An Aside on Significant Figures
In reporting your results, it is important to be mindful of the number of significant figures presented for measurements or derived quantities, such as means and standard deviations. It is fairly easy to deal with issues related to the precision of the tool that you used to obtain the measurement. You should not imply that your measurement is more precise than it was. For example, if you weighed tissues on a balance that was accurate to the nearest mg, you obviously should not represent the mean for a group of animals as 115.2357 mg.
A more subtle problem arises when experimental variability (as a consequence of randomness in the biological process you are studying) greatly exceeds the variability introduced by your measuring tool. One approach, suggested by Sokal and Rohlf (2012), is to obtain your measurements (i.e., record the value to the number of significant figures) such that the number of unit steps between the smallest and largest values in the set is between 30 and 300. For example, if you are measuring bone length in a group of animals and the observations range from 1.0 to 1.3 cm, you should use a tool that will measure to an accuracy of 0.01 cm (and report means and standard deviations to that precision).
Our bias is to use the magnitude of the standard deviation as a guide and to report the data to one or two significant figures in the standard deviation and the mean value to the same decimal place. For example, in the luminometer data given in Example 3.5, the means (± standard deviation) for the control and mutant constructs would be reported as (15 ± 4)×10⁵ and (3.5 ± 0.7)×10⁵, respectively.
3.6. Sample Problems
- For the Poisson distribution, the natural estimator for the parameter m is the mean of the observations. Prove that this is also the maximum likelihood estimator.
- Consider a double intercross of the type AB/ab × AB/ab which produces the following numbers of offspring for the distinguishable phenotypes:
AB | Ab | aB | ab |
187 | 35 | 37 | 31 |
Find the maximum likelihood estimate of the recombination frequency (assume it is the same in the two sexes). What is the variance of your estimate?
- You want to estimate the frequency of a particular transcript in a cDNA library. You plate out aliquots of the library (10⁴ bacteria per plate), transfer the colonies to filters, hybridize to your probe, and count the number of spots on each filter. For a set of 10 filters you obtain the following results:
15 23 20 20 17 13 22 22 27 19
What is the mean transcript frequency (copies/10⁴ clones) and what are its 95% confidence limits?
- You are studying a series of promoter mutants by transfecting promoter/reporter constructs into cells and measuring reporter gene activity. You want to express the data as the ratio (R) between the mutant and control (wild type) values and have performed three independent experiments. Using the data below, compute the mean value for R and its standard deviation.
Mutant | Activity |
control | 87, 94, 68 |
m1 | 41, 42, 31 |
m2 | 153, 117, 182 |
m3 | 5, 21, 11 |
- You have measured the diameters of a collection of perfectly spherical seeds and want to report the mean volume of the seeds. Based on the material above regarding the expectations and variances of functions, derive an approximate formula for the mean and variance of the volume in terms of the mean and variance of the measured diameters. For the following data set of diameters, use this approximate formula to estimate the mean and variance of the volume and compare these values to what you would get if you computed the volume for each seed and calculated the mean and variance using these calculated volumes. Values for the diameters are 125, 189, 334, 110, 48, and 99.