Lec02-seqalign

CMSC 423: Sequence Alignment
Slides By: Carl Kingsford
Department of Computer Science
University of Maryland, College Park
Based on Section 6.6 of Algorithm Design by Kleinberg & Tardos.

The Problem
Given: two strings
    a = a_1 a_2 a_3 a_4 ... a_m
    b = b_1 b_2 b_3 b_4 ... b_n
where each a_i and b_j is drawn from some alphabet L, such as {A, C, G, T}.
Compute how "similar" the two strings are.
What do we mean by similarity between two strings?

Alignment Examples
In each example, "|" marks a match, "X" marks a mismatch (mm), and "-" marks a gap.

    prin-ciple
    |||| |||XX
    prinncipal
    (1 gap, 2 mm)

    misspell
    ||| ||||
    mis-pell
    (1 gap)

    aa-bb-ccaabb
    |X || | | |
    ababbbc-a-b-
    (5 gaps, 1 mm)

    prin-cip-le
    |||| ||| |
    prinncipal-
    (3 gaps, 0 mm)

    prehistoric
       ||||||||
    ---historic
    (3 gaps, 0 mm)

    al-go-rithm-
    || XX ||X |
    alKhwariz-mi
    (4 gaps, 3 mm)
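
The gap and mismatch counts in the examples above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name score_alignment and its parameters are hypothetical, and it assumes that a column containing a "-" in either row counts as one gap and that any other column with differing characters counts as one mismatch.

```python
def score_alignment(top: str, bottom: str, gap: str = "-"):
    """Return (match_line, gaps, mismatches) for two already-aligned rows.

    Assumption for this sketch: a column with a gap symbol in either row
    counts as one gap; any other column with different characters counts
    as one mismatch.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    marks = []
    gaps = mismatches = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            gaps += 1
            marks.append(" ")   # gap column: no mark in the middle row
        elif x == y:
            marks.append("|")   # match
        else:
            mismatches += 1
            marks.append("X")   # mismatch
    return "".join(marks), gaps, mismatches


middle, gaps, mm = score_alignment("prin-ciple", "prinncipal")
print(middle)      # |||| |||XX
print(gaps, mm)    # 1 2
```

This only scores an alignment that is already given; it does not search for the best alignment of two unaligned strings.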

Motivation
- Alignment is used extensively in molecular biology, where a and b are the DNA sequences of two genes (see NCBI BLAST).
- Spell checkers.
- Inexact search of web pages.

NCBI BLAST

NCBI BLAST Alignment
>gb|AC115706.7| Mus musculus chromosome 8, clone RP23-382B3, complete sequence

Query  1650   gtgtgtgtgggtgcacatttgtgtgtgtgtgcgcctgtgtgtgtgggtgcctgtgtgtgt  1709
              ||||||||||  |    ||  | ||||||||| | ||||||||    ||| ||  |||||
Sbjct  56838  GTGTGTGTGGAAGTGAGTTCATCTGTGTGTGCACATGTGTGTGCA--TGCATGCATGTGT  56895

Query  1710   gtg-gggcacatttgtgtgtgtgtgtgtgcctgtgtgtgggtgcacatttgtgtgtgtgc  1768
              ||  |||||      ||  ||| ||||||| |||||||| |||  |||  ||||| || |
Sbjct  56896  GTCCGGGCA------TGCATGTCTGTGTGCATGTGTGTGTGTGTGCAT--GTGTGAGTAC  56947

Query  1769   ctgtgtgtgtgtgcctgtgtgtgggggtgcacatttgtgtgtgtgtgtgcctgtgtgtgg  1828
              |||||||||| ||| ||| |||| | |||  ||| |||||  ||||||   |||||  |
Sbjct  56948  CTGTGTGTGTATGCTTGTATGTGTGTGTGTGCATGTGTGTAGGTGTGTATATGTGTAAGT  57007

Query  1829   gggtgcacatttgtgtgtgtgtgtgcctgtgtgtgtgggtgcacatttgtgtgtgtgtgt  1888
                     ||| ||||||| ||||||  |||| | |||  ||||    |||||||||| ||
Sbjct  57008  T------CATCTGTGTGTATGTGTG--TGTGAGAGTGCATGCA----TGTGTGTGTGAGT  57055

Query  1889   gcctgtgtgt--gtgggtgcacatttgtgtgtgtgtgcctgtg--tgtgt--gggtgcac  1942
               | | |||||  |||  |||  ||  ||  | |   | |  |||||  |||||  | |||  |
Sbjct  57056  TCATCTGTGTCAGTGTATGCTTATGGGTATAACT-TAACTGTGCATGTGTAAGTGTGTTC  57114

Query  1943
