Finding motifs in DNA is similar to the problem posed by Edgar Allan Poe (1809 – 1849) in his Gold Bug story.
•  Decipher the message encrypted in the fragment

Hints for The Gold Bug Problem
•  Additional hints:
–  The encrypted message is in English
–  Each symbol corresponds to one letter in the English alphabet
–  No punctuation marks are encoded

The Gold Bug Problem: Symbol Counts
Naive approach to solving the problem:
–  Count the frequency of each symbol in the encrypted message
–  Find the frequency of each letter of the alphabet in the English language
–  Compare the frequencies from the previous two steps, try to find a correlation, and map each symbol to a letter of the alphabet

Symbol Frequencies in the Gold Bug Message
Gold Bug Message:
Symbol     8   ;   4   )   +   *   5   6   (   !   1   0   2   9   3   :   ?   `   -   .
Frequency  34  25  19  16  15  14  12  11  9   8   7   6   5   5   4   4   3   2   1   1

The Gold Bug Message Decoding: First Attempt
•  By simply mapping the most frequent symbols to the most frequent letters of the alphabet:
sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnlt
arhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyentacrmuesotorl
eoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfa
taeoaitdrdtpdeetiwt
English Language (most frequent to least frequent): etaoinsrhldcumfpgwybvkxjqz
•  The result does not make sense

9/2/13

The Gold Bug Problem: L-mer Count
A better approach:
–  Examine the frequencies of L-tuples: combinations of 2 symbols, 3 symbols, etc.
–  "The" is the most frequent 3-tuple in English and ";48" is the most frequent 3-tuple in the encrypted text
–  Infer the unknown symbols by examining other frequent L-tuples
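
The two counting strategies described above (single-symbol frequencies with a rank-for-rank mapping, and L-tuple counts) could be sketched in Python roughly as follows. This is a minimal illustration, not the slides' own implementation: the function names are hypothetical, and `sample` is a short made-up string, not the actual Gold Bug message. The only data taken from the text is the English letter ordering.

```python
from collections import Counter

# English letters from most to least frequent, as listed in the text.
ENGLISH_BY_FREQUENCY = "etaoinsrhldcumfpgwybvkxjqz"


def symbol_counts(message):
    """Count how often each symbol occurs, ignoring whitespace."""
    return Counter(ch for ch in message if not ch.isspace())


def naive_decode(message):
    """Map the i-th most frequent symbol to the i-th most frequent letter.

    Symbols with equal counts are ranked in an arbitrary but fixed order,
    which is one reason this approach is unreliable on short messages.
    """
    ranked = [sym for sym, _ in symbol_counts(message).most_common()]
    mapping = dict(zip(ranked, ENGLISH_BY_FREQUENCY))
    # Anything without a mapping (e.g. whitespace) is passed through unchanged.
    return "".join(mapping.get(ch, ch) for ch in message)


def ltuple_counts(message, length):
    """Count every contiguous run of `length` symbols (the L-tuples)."""
    text = "".join(message.split())
    return Counter(text[i:i + length] for i in range(len(text) - length + 1))


# Hypothetical toy ciphertext (an assumption for illustration only).
sample = ";48(;48)5;48"

print(symbol_counts(sample).most_common())
print(ltuple_counts(sample, 3).most_common(1))
print(naive_decode(sample))
```

On a real message, the most frequent 3-tuple reported by `ltuple_counts` would be the candidate for "the", which fixes three symbols at once and gives a starting point for inferring the rest.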
