# 16-cs481-similarity_search.pdf - SIMILARITY SEARCH...


SIMILARITY SEARCH
Similarity Search
Given a large text (e.g., a genome) and a set of n ≥ 1 query sequences, find approximate string matches. DNA is a double helix, but genome assemblies represent only one strand, so search both strands!
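Searching both strands usually means searching for the query and for its reverse complement in the same single-strand text. The sketch below illustrates this with exact matching only; the function names `reverse_complement` and `find_both_strands` are hypothetical, and positions are 0-based.

```python
# Illustrative sketch: the helper names are assumptions, not part of the slides.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_both_strands(text: str, query: str) -> list[tuple[int, str]]:
    """Find exact occurrences of the query on both strands.

    Returns (start, strand) pairs with 0-based starts in `text`.
    "+" means the query itself matched; "-" means its reverse
    complement matched, i.e. the query lies on the opposite strand.
    """
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        start = text.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = text.find(pattern, start + 1)  # allow overlapping hits
    return hits
```

A query that equals its own reverse complement (for example `ACGT`) is reported once per strand at the same position.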
Heuristic Similarity Searches
Genomes are huge, and quadratic-time alignment algorithms such as Smith-Waterman are too slow at that scale. An alignment of two similar sequences usually contains short identical or highly similar fragments. Many heuristic methods (e.g., FASTA) are based on the same idea of filtration: find short exact matches and use them as seeds for potential match extension, then “filter” out positions with no extendable matches.
Dot Matrices
Dot matrices show similarities between two sequences. FASTA builds an implicit dot matrix from short exact matches and tries to find long diagonals (allowing for some mismatches).
Dot Matrices (cont’d)
Identify diagonals above a threshold length. Diagonals in the dot matrix indicate exact substring matches.
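As a rough illustration of diagonals above a threshold length, the sketch below walks every diagonal of the dot matrix explicitly and reports maximal runs of exact matches. This is not how FASTA works internally (FASTA builds the matrix implicitly from short exact matches); the function name `exact_diagonals` and the 0-based coordinates are assumptions made for the example.

```python
def exact_diagonals(a: str, b: str, min_len: int) -> list[tuple[int, int, int]]:
    """Return (i, j, length) for each maximal run a[i:i+length] == b[j:j+length]
    along a dot-matrix diagonal, keeping only runs with length >= min_len.
    Coordinates are 0-based. Runs in O(len(a) * len(b)) time.
    """
    runs = []
    # A diagonal is the set of cells (i, j) with constant offset d = j - i.
    for d in range(-(len(a) - 1), len(b)):
        i = max(0, -d)
        j = i + d
        run = 0  # length of the current run of matching cells
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    runs.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # a run that reaches the edge of the matrix
            runs.append((i - run, j - run, run))
    return runs
```

For `a = "GATTACA"`, `b = "TTAC"` and `min_len = 3`, the only diagonal kept is the run `TTAC`, starting at position 2 in `a` and position 0 in `b`.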
Diagonals in Dot Matrices
Extend diagonals and try to link them together, allowing for a small number of mismatches/indels. Linking diagonals reveals approximate matches over longer substrings.
Approximate Pattern Matching Problem
Goal: Find all approximate occurrences of a pattern in a text.
Input: A pattern $p = p_1 \ldots p_n$, a text $t = t_1 \ldots t_m$, and $k$, the maximum number of mismatches.
Output: All positions $1 \le i \le m - n + 1$ such that $t_i \ldots t_{i+n-1}$ and $p_1 \ldots p_n$ have at most $k$ mismatches (i.e., the Hamming distance between $t_i \ldots t_{i+n-1}$ and $p$ is at most $k$).
Approximate Pattern Matching: A Brute-Force Algorithm
ApproximatePatternMatching(p, t, k)
    n ← length of pattern p
    m ← length of text t
    for i ← 1 to m - n + 1
        dist ← 0
        for j ← 1 to n
            if t_{i+j-1} ≠ p_j
                dist ← dist + 1
        if dist ≤ k
            output i
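The brute-force procedure translates almost line for line into Python. The version below is a minimal sketch: it reports 0-based start positions (the slide uses 1-based positions), and the names `hamming_distance` and `approximate_pattern_matching` are chosen for this example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def approximate_pattern_matching(p: str, t: str, k: int) -> list[int]:
    """Return every 0-based position i such that the window t[i:i+n]
    differs from the pattern p in at most k positions.
    Compares p against all m - n + 1 windows, so it takes O(nm) time.
    """
    n, m = len(p), len(t)
    return [
        i
        for i in range(m - n + 1)
        if hamming_distance(t[i:i + n], p) <= k
    ]
```

With `p = "ACGT"` and `t = "ACGTACGA"`, allowing `k = 1` mismatch reports positions 0 and 4, while `k = 0` reports only position 0.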
Approximate Pattern Matching: Running Time
The brute-force algorithm runs in $O(nm)$ time. We can generalize the “Approximate Pattern Matching Problem” into a “Query Matching Problem”: match substrings of a query to substrings of a text with at most $k$ mismatches. Motivation: we want to find similarities to some gene, but we may not know which parts of the gene to look for.
Query Matching Problem
Goal: Find all substrings of the query that approximately match substrings of the text.
Input: A query $q = q_1 \ldots q_w$, a text $t = t_1 \ldots t_m$, $n$ (the length of the matching substrings), and $k$ (the maximum number of mismatches).
Output: All pairs of positions $(i, j)$ such that the $n$-letter substring of $q$ starting at $i$ approximately matches the $n$-letter substring of $t$ starting at $j$, with at most $k$ mismatches.
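A direct, unfiltered solution to the Query Matching Problem compares every $n$-letter window of the query with every $n$-letter window of the text. The sketch below is meant only as a baseline to make the problem statement concrete; the name `query_matching` is hypothetical and positions are 0-based.

```python
def query_matching(q: str, t: str, n: int, k: int) -> list[tuple[int, int]]:
    """Return all 0-based pairs (i, j) such that q[i:i+n] and t[j:j+n]
    differ in at most k positions.
    Exhaustive comparison: O(w * m * n) time for len(q) = w, len(t) = m.
    """
    matches = []
    for i in range(len(q) - n + 1):
        q_window = q[i:i + n]
        for j in range(len(t) - n + 1):
            mismatches = sum(a != b for a, b in zip(q_window, t[j:j + n]))
            if mismatches <= k:
                matches.append((i, j))
    return matches
```

For `q = "ACGTT"`, `t = "TACGAT"`, `n = 4` and `k = 1`, the matching pairs are `(0, 1)` (`ACGT` vs `ACGA`) and `(1, 2)` (`CGTT` vs `CGAT`).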
Query Matching: Main Idea
Approximately matching strings share some perfectly matching substrings. Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy).
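One standard way to quantify "some perfectly matching substrings" is a pigeonhole argument. It is not stated on this slide, so treat it as supporting background. If two $n$-letter strings differ in at most $k$ positions, the mismatches cut the alignment into at most $k + 1$ runs of matching positions, and those runs together cover at least $n - k$ positions:

```latex
\[
\text{longest matching run}
\;\ge\; \left\lceil \frac{n - k}{k + 1} \right\rceil
\;=\; \left\lfloor \frac{n}{k + 1} \right\rfloor .
\]
```

For example, with $n = 20$ and $k = 3$, any two strings that match with at most 3 mismatches must share an exact substring of length at least $\lfloor 20 / 4 \rfloor = 5$ at the same offset.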
Filtration in Query Matching
We want all $n$-matches between a query and a text with up to $k$ mismatches. “Filter” out positions that we know cannot match between text and query.
Potential match (seed) detection: find all matches of $l$-tuples in the query and the text for some small $l$.
Potential match verification: verify each potential match by extending it to the left and right until $(k + 1)$ mismatches are found.
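The two filtration stages can be sketched as follows. This is an illustrative implementation under stated assumptions, not the lecture's reference code: the seed length `l` is supplied by the caller, seeds are found with a dictionary index of the text's $l$-tuples, positions are 0-based, and the names `filtration_query_matching` and `_extend` are hypothetical.

```python
from collections import defaultdict


def _extend(q: str, t: str, a: int, b: int, step: int, k: int) -> int:
    """Walk from (a, b) in direction `step` (+1 or -1) and return how many
    positions are covered before the (k+1)-th mismatch or a sequence end."""
    length = mismatches = 0
    while 0 <= a < len(q) and 0 <= b < len(t):
        if q[a] != t[b]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
        a += step
        b += step
    return length


def filtration_query_matching(q: str, t: str, n: int, k: int, l: int):
    """Return sorted 0-based pairs (i, j) where q[i:i+n] and t[j:j+n] differ
    in at most k positions and share an exact l-tuple at the same offset."""
    # Seed detection: index every l-tuple of the text by its start positions.
    index = defaultdict(list)
    for b in range(len(t) - l + 1):
        index[t[b:b + l]].append(b)

    matches = set()
    for a in range(len(q) - l + 1):
        for b in index.get(q[a:a + l], ()):
            # Verification: extend the seed left and right along its diagonal.
            left = _extend(q, t, a - 1, b - 1, -1, k)
            right = _extend(q, t, a + l, b + l, +1, k)
            # Check each n-window that contains the seed and fits inside
            # the extended region.
            first = max(a - left, a + l - n)
            last = min(a, a + l + right - n)
            for i in range(first, last + 1):
                j = i + (b - a)
                if sum(x != y for x, y in zip(q[i:i + n], t[j:j + n])) <= k:
                    matches.add((i, j))
    return sorted(matches)
```

Choosing `l` no larger than $\lfloor n / (k + 1) \rfloor$ means every valid match contains at least one seed, so none is filtered out; a larger `l` produces fewer seeds but may miss matches. For `q = "ACGTT"`, `t = "TACGAT"`, `n = 4`, `k = 1` and `l = 2`, the sketch returns `[(0, 1), (1, 2)]`, the same pairs an exhaustive comparison finds.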