9 Scoring Matrices

1 Introduction to Bioinformatics/ Elements of Bioinformatics Scoring Matrices

2 Reference
• Mount, D.W. (2004) Bioinformatics: Sequence and Genome Analysis. 2nd ed. Cold Spring Harbor Lab. Press, N.Y. Chapter 3.
• Baxevanis, A.D., and Ouellette, B.F.F. (2005) Bioinformatics - A practical guide to the analysis of genes and proteins. 3rd ed. John Wiley and Sons, N.Y. Chapter 11.
• Eddy, S.R. (2004) Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 22: 1035-1036.
• Dayhoff, M.O. (1978) Atlas of Protein Sequence and Structure, vol. 5, supplement 3, pp. 345-352. National Biomedical Research Foundation.
• Henikoff, S., and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.
3 BLOSUM62 matrix

4 Background
• Evolutionarily related or random alignment?
• The odds ratio gives us an idea of which one is more likely to be correct:

$$ \text{odds ratio} = \frac{\text{likelihood that the alignment is found in related sequences}}{\text{likelihood that the alignment arises from a chance match}} $$

Example alignment:

···QVKGH···
    || |
···KVKAH···

[Figure: the sequences ···QVKGH··· and ···KVKAH··· shown together with ···QVKAH···]
5 Odds ratio
• Odds ratio for aligning i with j:

$$ \text{odds ratio} = \frac{q_{ij}}{p_i \, p_j} $$

  $q_{ij}$ = probability that residue i is substituted by residue j in related sequences, estimated by counting (i, j) pairs in alignments of related sequences.
  $p_i \, p_j$ = probability of randomly aligning i against j, where $p_i$ and $p_j$ are the frequencies of occurrence of i and j, respectively.
• Take the logarithm of the odds ratio (the log-odds score) for each (i, j) pair, then add the log-odds scores together to score the alignment.

···QVKGH···
    || |
···KVKAH···
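As a worked illustration of adding log-odds scores, and assuming that alignment positions are scored independently (an assumption the slide does not state explicitly), the score S of the five aligned positions of QVKGH against KVKAH is the sum of the per-column log-odds scores. This is the same as the logarithm of the product of the per-column odds ratios:

$$ S = \log\frac{q_{QK}}{p_Q \, p_K} + \log\frac{q_{VV}}{p_V^2} + \log\frac{q_{KK}}{p_K^2} + \log\frac{q_{GA}}{p_G \, p_A} + \log\frac{q_{HH}}{p_H^2} = \log\left( \frac{q_{QK}}{p_Q \, p_K} \cdot \frac{q_{VV}}{p_V^2} \cdot \frac{q_{KK}}{p_K^2} \cdot \frac{q_{GA}}{p_G \, p_A} \cdot \frac{q_{HH}}{p_H^2} \right) $$

No numerical value is given because the slide does not supply amino acid values for $q_{ij}$ or $p_i$.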

6 Scoring matrices
Scoring matrices are log-odds matrices:

$$ s(i,j) = \frac{1}{\lambda} \ln\left( \frac{q_{ij}}{p_i \, p_j} \right) $$

s(i,j) = log-odds score for aligning i with j
• s(i,j) > 0 if $q_{ij} > p_i \, p_j$
• s(i,j) < 0 if $q_{ij} < p_i \, p_j$
• s(i,j) = 0 if $q_{ij} = p_i \, p_j$
λ = scaling constant chosen so that the scores can be rounded to integers
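The log-odds formula on this slide can be sketched in a few lines of Python. The function name `log_odds_score` and its parameter names are illustrative choices, not taken from the slides, and the input values are only examples with a uniform background of 0.25.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, lam: float = 1.0) -> float:
    """Return s(i,j) = (1/lam) * ln(q_ij / (p_i * p_j)).

    q_ij : probability of seeing the pair (i, j) in related sequences
    p_i, p_j : background frequencies of i and j
    lam : scaling constant (lambda); hypothetical default of 1.0
    """
    if min(q_ij, p_i, p_j) <= 0 or lam <= 0:
        raise ValueError("probabilities and lam must be positive")
    return math.log(q_ij / (p_i * p_j)) / lam


# Pair seen more often than chance: positive score
print(log_odds_score(0.1875, 0.25, 0.25))
# Pair seen exactly as often as chance: zero score
print(log_odds_score(0.0625, 0.25, 0.25))
# Pair seen less often than chance: negative score
print(log_odds_score(0.0208, 0.25, 0.25))
```

The three calls correspond to the three sign cases listed on the slide.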
7 Nucleotide scoring matrix

% Identity    Match/Mismatch
99%           1/-3
95%           1/-2
90%           2/-3
85%           3/-4
80%           4/-5
75%           1/-1

8 To construct a log-odds matrix that is optimized to find 75% identity in DNA alignment:
Assume frequency of occurrence for each nucleotide = 0.25

Mutation matrix (probability that i changes to j)
     A       C       G       T
A    0.75    0.083   0.083   0.083
C    0.083   0.75    0.083   0.083
G    0.083   0.083   0.75    0.083
T    0.083   0.083   0.083   0.75

q_ij (probability of finding i, j pairs in related sequences), obtained as 0.25 × the corresponding mutation-matrix entry
     A       C       G       T
A    0.1875  0.0208  0.0208  0.0208
C    0.0208  0.1875  0.0208  0.0208
G    0.0208  0.0208  0.1875  0.0208
T    0.0208  0.0208  0.0208  0.1875

Log-odds matrix (S_ij)
     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
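The step from the mutation matrix to the q_ij values can be checked with a short Python sketch. It assumes, as the slide does, a uniform background frequency of 0.25 and equiprobable mismatches; the function name `joint_pair_probabilities` is a hypothetical helper, not something defined in the slides.

```python
NUCLEOTIDES = "ACGT"


def joint_pair_probabilities(identity: float, background: float = 0.25):
    """Return (mutation, q) as nested dicts keyed by nucleotide.

    mutation[i][j] : probability that i changes to j
    q[i][j]        : probability of finding the pair (i, j) = background * mutation[i][j]
    """
    # Assumption: the non-identical fraction is split evenly over the 3 other bases
    off_diagonal = (1.0 - identity) / (len(NUCLEOTIDES) - 1)
    mutation = {
        i: {j: (identity if i == j else off_diagonal) for j in NUCLEOTIDES}
        for i in NUCLEOTIDES
    }
    q = {
        i: {j: background * mutation[i][j] for j in NUCLEOTIDES}
        for i in NUCLEOTIDES
    }
    return mutation, q


mutation, q = joint_pair_probabilities(0.75)
print(round(mutation["A"]["C"], 3))  # 0.083
print(q["A"]["A"])                   # 0.1875
print(round(q["A"]["C"], 4))         # 0.0208
```

Summing all sixteen q_ij values gives 1 (up to floating-point error), which is a quick check that they form a probability distribution over pairs.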
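9 To construct a log-odds matrix that is optimized to find 75% identity in DNA alignment:

$$ s(i,j) = \frac{1}{\lambda} \ln\left( \frac{q_{ij}}{p_i \, p_j} \right) $$

$$ s(A,A) = s(C,C) = s(G,G) = s(T,T) = \frac{1}{\lambda} \ln\left( \frac{0.1875}{0.25 \times 0.25} \right) = \frac{1}{\lambda}(1.0986) \approx 1 \quad \text{if we set } \lambda = 1 $$

$$ s(i,j) \text{ where } i \neq j = \frac{1}{\lambda} \ln\left( \frac{0.0208}{0.25 \times 0.25} \right) = \frac{1}{\lambda}(-1.0986) \approx -1 \quad \text{for } \lambda = 1 $$

Log-odds matrix (S_ij), with scores rounded to integers
     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1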

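10 To construct a log-odds matrix that is optimized to find 95% identity in DNA alignment:

$q_{AA} = q_{CC} = q_{GG} = q_{TT} = 0.25 \times 0.95 = 0.2375$
Assume all mismatches are equiprobable: $q_{ij}$ (i ≠ j) $= 0.25 \times (0.05/3) = 0.0042$

$$ s(A,A) = s(C,C) = s(G,G) = s(T,T) = \frac{1}{\lambda} \ln\left( \frac{0.2375}{0.25 \times 0.25} \right) = 1 \quad \text{if we set } \lambda = 1.335 $$

$$ s(i,j) \text{ where } i \neq j = \frac{1}{\lambda} \ln\left( \frac{0.0042}{0.25 \times 0.25} \right) \approx -2 \quad \text{for } \lambda = 1.335 $$

Log-odds matrix (S_ij)
     A    C    G    T
A    1   -2   -2   -2
C   -2    1   -2   -2
G   -2   -2    1   -2
T   -2   -2   -2    1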
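The 75% and 95% constructions follow the same recipe, so they can be sketched as one Python function. This is a minimal sketch under the slides' assumptions (uniform background of 0.25, equiprobable mismatches). The name `identity_scoring_matrix` is hypothetical, and the default of choosing λ so that a match scores exactly 1 is an assumption of this sketch that mirrors the 95% slide; the 75% slide sets λ = 1 instead, which can be passed explicitly.

```python
import math

NUCLEOTIDES = "ACGT"


def identity_scoring_matrix(identity, background=0.25, lam=None):
    """Return (lam, matrix) for a target fraction of identical positions.

    matrix[i][j] is the log-odds score (1/lam) * ln(q_ij / (p_i * p_j)),
    rounded to the nearest integer.
    """
    if not background < identity < 1.0:
        raise ValueError("identity must lie between the background frequency and 1")
    expected = background * background            # p_i * p_j
    q_match = background * identity               # q_ii
    q_mismatch = background * (1.0 - identity) / 3  # q_ij, i != j
    raw_match = math.log(q_match / expected)
    raw_mismatch = math.log(q_mismatch / expected)
    if lam is None:
        # Assumption of this sketch: scale so that a match scores exactly 1
        lam = raw_match
    matrix = {
        i: {
            j: round((raw_match if i == j else raw_mismatch) / lam)
            for j in NUCLEOTIDES
        }
        for i in NUCLEOTIDES
    }
    return lam, matrix


lam95, m95 = identity_scoring_matrix(0.95)
print(round(lam95, 3), m95["A"]["A"], m95["A"]["C"])  # 1.335 1 -2

lam75, m75 = identity_scoring_matrix(0.75, lam=1.0)
print(m75["A"]["A"], m75["A"]["C"])  # 1 -1
```

Before rounding, the 95% mismatch score is about -2.03, which is why the matrix entry is -2.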
11 PAM and BLOSUM matrices • The two commonly used protein log-odds scoring matrices.

