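... (7) may become inapplicable because of negative arguments of logarithms. Therefore, the application of this equation should be confined to the case of relatively closely related sequences (say, $d < 1$) and a relatively large value of $n$.

Equation (7) was derived under the assumption that the rate of nucleotide substitution $\lambda$ is the same for all sites considered. In the control region, however, $\lambda$ is known to vary extensively (table 2) and approximately follows the gamma distribution, $f(\lambda)$. Therefore, following Nei and Gojobori (1986) and Jin and Nei (1990), we obtain the means of $P_1$, $P_2$, and $Q$ as follows:

\[
\bar{P}_1 = \frac{2 g_A g_G}{g_R}\left[ g_R + g_Y\left(\frac{a}{a + 2\bar{\beta}t}\right)^{a} - \left(\frac{a}{a + 2(g_R\bar{\alpha}_1 + g_Y\bar{\beta})t}\right)^{a} \right], \tag{12}
\]

\[
\bar{P}_2 = \frac{2 g_T g_C}{g_Y}\left[ g_Y + g_R\left(\frac{a}{a + 2\bar{\beta}t}\right)^{a} - \left(\frac{a}{a + 2(g_Y\bar{\alpha}_2 + g_R\bar{\beta})t}\right)^{a} \right], \tag{13}
\]

\[
\bar{Q} = 2 g_R g_Y\left[ 1 - \left(\frac{a}{a + 2\bar{\beta}t}\right)^{a} \right], \tag{14}
\]

where $\bar{\alpha}_i$ and $\bar{\beta}$ are the means of $\alpha_i$ and $\beta$, respectively. From these equations, we can derive the following formula for the average number of nucleotide substitutions per site between two sequences compared:

\[
\hat{d} = 2a\left[ \frac{g_A g_G}{g_R}\left(1 - \frac{g_R}{2 g_A g_G}\hat{P}_1 - \frac{1}{2 g_R}\hat{Q}\right)^{-1/a}
+ \frac{g_T g_C}{g_Y}\left(1 - \frac{g_Y}{2 g_T g_C}\hat{P}_2 - \frac{1}{2 g_Y}\hat{Q}\right)^{-1/a}
+ \left(g_R g_Y - \frac{g_A g_G g_Y}{g_R} - \frac{g_T g_C g_R}{g_Y}\right)\left(1 - \frac{1}{2 g_R g_Y}\hat{Q}\right)^{-1/a}
- g_A g_G - g_T g_C - g_R g_Y \right], \tag{15}
\]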
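where $\hat{P}_1$, $\hat{P}_2$, and $\hat{Q}$ are the estimates of $P_1$, $P_2$, and $Q$, respectively. The large-sample variance of $\hat{d}$ is approximately given by

\[
V(\hat{d}) = \left[ (c_1^2\hat{P}_1 + c_2^2\hat{P}_2 + c_3^2\hat{Q}) - (c_1\hat{P}_1 + c_2\hat{P}_2 + c_3\hat{Q})^2 \right]/n, \tag{16}
\]

where

\[
c_1 = \frac{\partial \hat{d}}{\partial \hat{P}_1} = \left(1 - \frac{g_R}{2 g_A g_G}\hat{P}_1 - \frac{1}{2 g_R}\hat{Q}\right)^{-(1+1/a)}, \tag{17}
\]

\[
c_2 = \frac{\partial \hat{d}}{\partial \hat{P}_2} = \left(1 - \frac{g_Y}{2 g_T g_C}\hat{P}_2 - \frac{1}{2 g_Y}\hat{Q}\right)^{-(1+1/a)}, \tag{18}
\]

\[
c_3 = \frac{\partial \hat{d}}{\partial \hat{Q}} = \frac{g_A g_G}{g_R^2}\left(1 - \frac{g_R}{2 g_A g_G}\hat{P}_1 - \frac{1}{2 g_R}\hat{Q}\right)^{-(1+1/a)}
+ \frac{g_T g_C}{g_Y^2}\left(1 - \frac{g_Y}{2 g_T g_C}\hat{P}_2 - \frac{1}{2 g_Y}\hat{Q}\right)^{-(1+1/a)}
+ \frac{1}{g_R g_Y}\left(g_R g_Y - \frac{g_A g_G g_Y}{g_R} - \frac{g_T g_C g_R}{g_Y}\right)\left(1 - \frac{1}{2 g_R g_Y}\hat{Q}\right)^{-(1+1/a)}. \tag{19}
\]

With the present model, it is possible to estimate the average numbers of transitional ($s$) and transversional ($v$) substitutions separately. They are given by the following equations:

\[
\hat{s} = 2a\left[ \frac{g_A g_G}{g_R}\left(1 - \frac{g_R}{2 g_A g_G}\hat{P}_1 - \frac{1}{2 g_R}\hat{Q}\right)^{-1/a}
+ \frac{g_T g_C}{g_Y}\left(1 - \frac{g_Y}{2 g_T g_C}\hat{P}_2 - \frac{1}{2 g_Y}\hat{Q}\right)^{-1/a}
- \left(\frac{g_A g_G g_Y}{g_R} + \frac{g_T g_C g_R}{g_Y}\right)\left(1 - \frac{1}{2 g_R g_Y}\hat{Q}\right)^{-1/a}
- g_A g_G - g_T g_C \right], \tag{20}
\]

\[
\hat{v} = 2a\, g_R g_Y\left[ \left(1 - \frac{1}{2 g_R g_Y}\hat{Q}\right)^{-1/a} - 1 \right], \tag{21}
\]

where $\hat{s}$ and $\hat{v}$ are the estimates of $s$ and $v$, respectively. The variances for $\hat{s}$ and $\hat{v}$ are approximately given by

\[
V(\hat{s}) = \left[ (c_1^2\hat{P}_1 + c_2^2\hat{P}_2 + c_4^2\hat{Q}) - (c_1\hat{P}_1 + c_2\hat{P}_2 + c_4\hat{Q})^2 \right]/n \tag{22}
\]

and

\[
V(\hat{v}) = c_5^2\,\hat{Q}(1 - \hat{Q})/n, \tag{23}
\]

respectively, where $c_1$ and $c_2$ are the same as those in equation (16) and

\[
c_4 = \frac{\partial \hat{s}}{\partial \hat{Q}} = \frac{g_A g_G}{g_R^2}\left(1 - \frac{g_R}{2 g_A g_G}\hat{P}_1 - \frac{1}{2 g_R}\hat{Q}\right)^{-(1+1/a)}
+ \frac{g_T g_C}{g_Y^2}\left(1 - \frac{g_Y}{2 g_T g_C}\hat{P}_2 - \frac{1}{2 g_Y}\hat{Q}\right)^{-(1+1/a)}
- \left(\frac{g_A g_G}{g_R^2} + \frac{g_T g_C}{g_Y^2}\right)\left(1 - \frac{1}{2 g_R g_Y}\hat{Q}\right)^{-(1+1/a)},
\qquad
c_5 = \frac{\partial \hat{v}}{\partial \hat{Q}} = \left(1 - \frac{1}{2 g_R g_Y}\hat{Q}\right)^{-(1+1/a)}. \tag{24}
\]

We can also compute the variance of the $\hat{s}/\hat{v}$ ratio [equations (25)-(27) are not recoverable from this copy].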
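The estimators in equations (15), (16), (20), and (21) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function name `tn93_gamma_distance`, its argument names, and the returned dictionary are hypothetical. The coefficient `c3` is obtained by differentiating equation (15) with respect to Q, which is an assumption about the form of equation (19).

```python
def tn93_gamma_distance(p1, p2, q, freqs, a, n):
    """Sketch of the gamma-corrected estimates of d, s, v and V(d).

    p1, p2 : observed proportions of A<->G and T<->C transitional differences
    q      : observed proportion of transversional differences
    freqs  : dict of nucleotide frequencies with keys 'A', 'G', 'T', 'C'
    a      : gamma shape parameter
    n      : number of nucleotide sites compared
    (All names are illustrative assumptions, not taken from the paper.)
    """
    gA, gG, gT, gC = (freqs[b] for b in "AGTC")
    gR, gY = gA + gG, gT + gC          # purine and pyrimidine frequencies

    k1 = gA * gG / gR
    k2 = gT * gC / gY
    k3 = gR * gY - k1 * gY - k2 * gR

    # Bracketed terms that are raised to the power -1/a in equation (15).
    w1 = 1 - p1 / (2 * k1) - q / (2 * gR)
    w2 = 1 - p2 / (2 * k2) - q / (2 * gY)
    w3 = 1 - q / (2 * gR * gY)
    if min(w1, w2, w3) <= 0:
        raise ValueError("sequences too divergent: estimator is inapplicable")

    e = -1.0 / a
    d = 2 * a * (k1 * w1**e + k2 * w2**e + k3 * w3**e
                 - gA * gG - gT * gC - gR * gY)               # equation (15)
    s = 2 * a * (k1 * w1**e + k2 * w2**e
                 - (k1 * gY + k2 * gR) * w3**e
                 - gA * gG - gT * gC)                         # equation (20)
    v = 2 * a * gR * gY * (w3**e - 1)                         # equation (21)

    # Partial derivatives of d with respect to P1, P2 and Q.
    f = -(1 + 1.0 / a)
    c1 = w1**f
    c2 = w2**f
    c3 = (k1 / gR) * w1**f + (k2 / gY) * w2**f + (k3 / (gR * gY)) * w3**f
    var_d = ((c1**2 * p1 + c2**2 * p2 + c3**2 * q)
             - (c1 * p1 + c2 * p2 + c3 * q) ** 2) / n         # equation (16)

    return {"d": d, "s": s, "v": v, "var_d": var_d}
```

With equal nucleotide frequencies and p1 = p2, the value of `d` should reduce to the gamma version of Kimura's two-parameter distance, which gives a convenient sanity check. The `ValueError` branch mirrors the caution that the estimator becomes inapplicable for very divergent sequences.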
Estimation of the Number of Nucleotide Substitutions

To apply equation (15) to human and chimpanzee data, we must consider the entire control region, because the parameter $a$ in the equation has been estimated for this region. There are 20 human sequences, 3 common chimpanzee sequences, and 1 pygmy chimpanzee sequence that are complete for the control region (Kocher and Wilson 1991; Vigilant et al. 1991). We therefore computed the average distance between each of the chimpanzee sequences and all human sequences, as well as the pairwise distances for all chimpanzee sequences, using equation (15). In this computation, $a = 0.11$ was used. The results obtained are presented in table 4.

The $\hat{d}$ value for the human-chimpanzee comparison varies considerably with chimpanzee sequence, but the differences among the $\hat{d}$ values are not statistically significant, and the average value is 0.752 ± 0.224. This value is substantially larger than the estimate (0.150) obtained by Kimura's (1980) two-parameter method, which does not take into account unequal nucleotide frequencies or variation in $\lambda$ among different nucleotide sites. This indicates the importance of using a proper mathematical model in estimating the number of nucleotide substitutions in the control region.

Table 4 also shows the $\hat{d}$ values obtained for various chimpanzee sequence comparisons. The $\hat{d}$ value between the common and pygmy chimpanzee sequences is 0.256-0.349, which is approximately one-third of the $d$ value between human and chimpanzee sequences. In this case, too, Kimura's method gives substantial underestimates of nucleotide substitutions. The $d$ values for the comparisons of common chimpanzee sequences are considerably lower than those for the comparison of common and pygmy chi...