Lec13-motiffind

Motif Finding CMSC 423

Motif Finding

Given p sequences, find the most mutually similar length-k substrings, one from each sequence:

\[ \arg\min_{s_1,\dots,s_p} \sum_{i<j} \mathrm{dist}(s_i, s_j) \]

where dist(s_i, s_j) = Hamming distance between s_i and s_j.

1. ttgccacaaaataatccgccttcgcaaattgacc TACCTCAATAGCGGTA gaaaaacgcaccactgcctgacag
2. gtaagtacctgaaagttacggtctgcgaacgctattccac TGCTCCTTTATAGGTA caacagtatagtctgatgga
3. ccacacggcaaataaggag TAACTCTTTCCGGGTA tgggtatacttcagccaatagccgagaatactgccattccag
4. ccatacccggaaagagttactccttatttgccgtgtggttagtcgctt TACATCGGTAAGGGTA gggattttacagca
5. aaactattaagatttttatgcagatgggtattaagga GTATTCCCCATGGGTA acatattaatggctctta
6. ttacagtctgttatgtggtggctgttaa TTATCCTAAAGGGGTA tcttaggaatttactt

Uppercase marks the highlighted motif instance in each sequence (slide label: Transcription factor).

Hundreds of papers, many formulations (Tompa05)
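
To make the objective concrete, here is a minimal Python sketch of the sum-of-pairs Hamming score, with an exhaustive search that tries every choice of one start position per sequence. The function names (`hamming`, `sum_of_pairs`, `brute_force_motif`) are illustrative and not from the slides. The exhaustive search is only practical for tiny inputs, since the number of candidates grows exponentially in p.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def sum_of_pairs(substrings):
    """Objective from the slide: sum of dist(s_i, s_j) over all pairs i < j."""
    return sum(hamming(a, b) for a, b in combinations(substrings, 2))


def brute_force_motif(seqs, k):
    """Try every combination of one start position per sequence.

    Returns (starts, score) for the combination with the smallest
    sum-of-pairs score. Illustrative only: the search space has
    prod(len(s) - k + 1) candidates.
    """
    ranges = [range(len(s) - k + 1) for s in seqs]
    best_starts, best_score = None, None
    for starts in product(*ranges):
        chosen = [s[i:i + k] for s, i in zip(seqs, starts)]
        score = sum_of_pairs(chosen)
        if best_score is None or score < best_score:
            best_starts, best_score = starts, score
    return best_starts, best_score
```

For example, `hamming("TACCTCAATAGCGGTA", "TGCTCCTTTATAGGTA")` compares the uppercase instances of sequences 1 and 2 above, and `brute_force_motif(["ccACGtt", "ACGaaaa", "ttttACG"], 3)` finds the shared `ACG` in a toy input.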
Motif-finding by Gibbs Sampling

“Gibbs sampling” is the basis of a general class of local-search algorithms. It doesn’t guarantee good performance, but often works well in practice.

Assumes:
1. We know the length k of the motif we are looking for.
2. Each input sequence contains exactly 1 real instance of the motif.

Problem. Given p strings and a length k, find the most “mutually similar” length-k substring from each string.
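
As a rough illustration of how such a local search can look, the sketch below follows one common formulation of a Gibbs-style motif sampler; it is an assumption and may differ in detail from the version developed in this lecture. It keeps one start position per sequence (assumption 2), repeatedly picks a sequence at random, builds a column-wise letter-frequency profile from the other sequences' current windows, and resamples the chosen sequence's start in proportion to how well each window fits that profile. All names, the pseudocount of 1, the DNA alphabet `acgt`, and the iteration count are illustrative choices.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(seqs, starts, k):
    """Sum of pairwise Hamming distances between the chosen length-k windows."""
    chosen = [s[i:i + k] for s, i in zip(seqs, starts)]
    return sum(hamming(a, b) for a, b in combinations(chosen, 2))


def build_profile(instances, k, alphabet="acgt", pseudocount=1.0):
    """Per-column letter frequencies; pseudocounts avoid zero probabilities."""
    profile = []
    for col in range(k):
        counts = {ch: pseudocount for ch in alphabet}
        for inst in instances:
            counts[inst[col]] += 1  # assumes every letter is in `alphabet`
        total = sum(counts.values())
        profile.append({ch: c / total for ch, c in counts.items()})
    return profile


def gibbs_motif_search(seqs, k, iterations=1000, seed=0):
    """Randomized local search; returns the best (starts, score) seen."""
    rng = random.Random(seed)
    seqs = [s.lower() for s in seqs]
    # Start from a random window in each sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    best_starts, best_score = list(starts), total_distance(seqs, starts, k)
    for _ in range(iterations):
        i = rng.randrange(len(seqs))  # sequence whose window is resampled
        others = [s[j:j + k]
                  for idx, (s, j) in enumerate(zip(seqs, starts)) if idx != i]
        profile = build_profile(others, k)
        # Weight each window of sequence i by its probability under the profile.
        weights = []
        for j in range(len(seqs[i]) - k + 1):
            w = 1.0
            for col, ch in enumerate(seqs[i][j:j + k]):
                w *= profile[col][ch]
            weights.append(w)
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        score = total_distance(seqs, starts, k)
        if score < best_score:
            best_starts, best_score = list(starts), score
    return best_starts, best_score
```

Because the search is randomized, different seeds can return different answers, and none is guaranteed to be optimal; in practice one would typically run it from several random starts and keep the lowest-scoring result.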

Gibbs Sampling: Profiles 1. ttgccacaaaataatccgccttcgcaaattgacc
