503_lecture11_S11



UMass Lowell Computer Science 91.503, Analysis of Algorithms
Prof. Karen Daniels, Spring 2011
Tuesday, 4/26/2011

String Matching Algorithms (Chapter 32)

Chapter Dependencies
Automata; Ch 32 String Matching.
You're responsible for material in Sections 32.1-32.4 of this chapter.

String Matching Algorithms: Motivation & Basics

String Matching Problem
Motivations: text editing, pattern matching in DNA sequences. [Textbook Figure 32.1]
Text: array $T[1..n]$, with $n \ge m$
Pattern: array $P[1..m]$
Array element: character from a finite alphabet $\Sigma$
Pattern $P$ occurs with shift $s$ in $T$ if $P[1..m] = T[s+1..s+m]$, where $0 \le s \le n-m$.
source: 91.503 textbook Cormen et al.

String Matching Algorithms: Worst-Case Execution Time
(Text: array $T[1..n]$; pattern: array $P[1..m]$)
Naive Algorithm: preprocessing 0; matching $O((n-m+1)m)$; overall $O((n-m+1)m)$
Rabin-Karp: preprocessing $\Theta(m)$; matching $O((n-m+1)m)$; overall $O((n-m+1)m)$ (better than this on average and in practice)
Finite Automaton: preprocessing $O(m|\Sigma|)$; matching $\Theta(n)$; overall $O(n + m|\Sigma|)$
Knuth-Morris-Pratt: preprocessing $\Theta(m)$; matching $\Theta(n)$; overall $\Theta(n+m)$

Notation & Terminology
$\Sigma^*$ = set of all finite-length strings formed using characters from alphabet $\Sigma$
Empty string: $\varepsilon$
$|x|$ = length of string $x$
$w$ is a prefix of $x$: $w \sqsubset x$ (example: ab $\sqsubset$ abcca)
$w$ is a suffix of $x$: $w \sqsupset x$ (example: cca $\sqsupset$ abcca)
The prefix and suffix relations are transitive.

Overlapping Suffix Lemma
Lemma 32.1 [Textbook Figure 32.3]

String Matching Algorithms: Naive Algorithm

Naive String Matching
[Pseudocode figure; annotation: implicit loop. Textbook Figure 32.4]
Worst-case running time is in $\Theta((n-m+1)m)$.

String Matching Algorithms: Rabin-Karp

Rabin-Karp Algorithm
Assume each character is a digit in radix-$d$ notation (e.g. $d = 10$).
Strategy: convert to a numeric representation for mod operations.
$p$ = decimal value of the pattern
$t_s$ = decimal value of substring $T[s+1..s+m]$ for $s = 0, 1, \ldots, n-m$
- compute $p$ in $O(m)$ time (which is in $O(n)$)
- compute all $t_s$ values in a total of $O(n)$ time
- find all valid shifts $s$ in $O(n)$ time by comparing $p$ with each $t_s$
Compute $p$ in $O(m)$ time using Horner's rule:
$p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + d\,P[1]) \cdots))$
Compute $t_0$ similarly from $T[1..m]$ in $O(m)$ time.
Compute the remaining $t_s$ values in $O(n-m)$ time (rolling window):
$t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]$

Rabin-Karp Algorithm
$p$ and $t_s$ may be large, so mod by a prime $q$. [Textbook Figure 32.5; annotation: pattern match]

Rabin-Karp Algorithm (continued)
$t_{s+1} = d(t_s - d^{m-1} T[s+1]) + T[s+m+1]$, with $d^{m-1}$ taken mod $q$ (32.2)
Example: $p = 31415$; the figure also marks a spurious hit.

Rabin-Karp Algorithm (continued)
Annotations on the textbook pseudocode:
- $d$ is radix; $q$ is modulus
- $\Theta(m)$: high-order digit position for $m$-digit window
- $\Theta(m)$ (preprocessing)
- matching loop invariant: when line 10 is executed, $t_s = T[s+1..s+m] \bmod q$
- $\Theta((n-m+1)m)$: try all possible shifts
- $\Theta(m)$: rule out spurious hit
- stopping condition
Worst-case running time is in $\Theta((n-m+1)m)$.
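The rolling-window idea above can be sketched in Python. This is a minimal illustration under stated assumptions, not the textbook's pseudocode: the function name `rabin_karp_matcher`, the defaults `d=256` and `q=101`, the use of `ord()` as the "digit" value of a character, and the 0-based shifts are all choices made here for the example.

```python
def rabin_karp_matcher(text, pattern, d=256, q=101):
    """Return every 0-based shift s with text[s:s+m] == pattern.

    d is the radix and q the (prime) modulus; both defaults are
    illustrative assumptions, not values taken from the slides.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    h = pow(d, m - 1, q)  # d^(m-1) mod q, weight of the high-order digit
    p = 0                 # hash of the pattern
    t = 0                 # hash of the current text window

    # Horner's rule: build both hashes in O(m) time.
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    shifts = []
    for s in range(n - m + 1):
        # Equal hashes may be a spurious hit, so compare explicitly.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update: drop text[s], shift, append text[s+m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts
```

For instance, `rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13)` returns `[6]`: the pattern starts at 0-based position 6, and any window whose hash merely collides with the pattern's is rejected by the explicit comparison.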
Rabin-Karp Algorithm (continued)
Same annotated pseudocode as on the previous slide, plus the expected-time analysis:
Assume reducing mod $q$ is like a random mapping from $\Sigma^*$ (the set of all finite-length strings formed from $\Sigma$) to $\mathbb{Z}_q$.
Estimate: (chance that $t_s \equiv p \pmod q$) $= 1/q$.
Expected number of spurious hits is in $O(n/q)$.
Expected matching time $= O(n) + O(m(v + n/q))$, where $v$ is the number of valid shifts. The $O(n)$ term covers preprocessing plus the $t_s$ updates; the $O(m(v + n/q))$ term is the time for explicit matching comparisons.
If $v$ is in $O(1)$ and $q \ge m$, the average-case running time is in $\Theta(n+m)$.

String Matching Algorithms: Finite Automata

Finite Automata
[Textbook Figure 32.6]
Strategy: build an automaton for the pattern, then examine each text character once.
Worst-case running time is in $\Theta(n)$ + automaton creation time.

String-Matching Automaton
Pattern $P$ = ababaca. Absent arrows go to state 0. The automaton accepts strings ending in $P$. [Textbook Figure 32.7]

String-Matching Automaton
Suffix function for $P$: $\sigma(x)$ = length of the longest prefix of $P$ that is a suffix of $x$:
$\sigma(x) = \max\{k : P_k \sqsupset x\}$ (32.3)
Transition function: $\delta(q, a) = \sigma(P_q a)$ (32.4)
Automaton's operational invariant: $\phi(T_i) = \sigma(T_i)$ (32.5), where $T_i$ is the $i$-character prefix of $T$. We will build up to this proof...
At each step: keep track of the longest pattern prefix that is a suffix of what has been read so far.

String-Matching Automaton
Simulate the behavior of a string-matching automaton that finds occurrences of pattern $P$ of length $m$ in $T[1..n]$.
We'll show the automaton is in state $\sigma(T_i)$ after scanning character $T[i]$. Since $\sigma(T_i) = m$ iff $P \sqsupset T_i$, the machine is in accepting state $m$ iff it has just scanned pattern $P$.
Assuming the automaton has already been created, the worst-case running time of matching is in $\Theta(n)$.
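A minimal Python sketch of the automaton strategy described above. From state q (meaning q pattern characters are currently matched), reading character a moves to the length of the longest prefix of P that is a suffix of P_q followed by a. The function names, the dictionary-per-state table, and the 0-based shifts are assumptions made for this illustration; the table is built by the straightforward (slow) method, not an optimized one.

```python
def compute_transition_function(pattern, alphabet):
    """Build delta[q][a] for q = 0..m by direct search.

    delta[q][a] is the largest k such that pattern[:k] is a suffix of
    pattern[:q] + a. `alphabet` should contain every pattern character.
    """
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)  # the next state can never exceed q + 1
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            delta[q][a] = k
    return delta


def finite_automaton_matcher(text, delta, m):
    """Scan text once; report 0-based shifts where state m is reached."""
    shifts = []
    q = 0
    for i, ch in enumerate(text):
        # Characters outside the alphabet send the automaton to state 0.
        q = delta[q].get(ch, 0)
        if q == m:
            shifts.append(i - m + 1)
    return shifts
```

With `delta = compute_transition_function("ababaca", "abc")`, the call `finite_automaton_matcher("abababacaba", delta, 7)` returns `[2]`. The matcher touches each text character exactly once, which is where the linear matching time comes from; all the extra cost sits in building the table.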
String-Matching Automaton (continued)
Correctness of matching procedure... (Theorem 32.4)
The automaton keeps track of the longest pattern prefix that is a suffix of what has been read so far in the text. (board work)
Uses equation (32.4) and Lemma 32.3, to be proved next: $\sigma(xa) = \sigma(P_{\sigma(x)} a)$

String-Matching Automaton (continued)
Correctness of matching procedure... Lemma 32.2, to be used to prove Lemma 32.3. [Textbook Figure 32.8]

String-Matching Automaton (continued)
Correctness of matching procedure... Lemma 32.3. [Textbook Figure 32.9; the slide also labels 32.1 and 32.2]

String-Matching Automaton (continued)
Theorem 32.4. Correctness of the matching procedure is now established, using equation (32.4) and Lemma 32.3: $\sigma(xa) = \sigma(P_{\sigma(x)} a)$

String-Matching Automaton (continued)
This procedure computes the transition function $\delta$ from a given pattern $P[1..m]$. [Pseudocode figure]
Worst-case running time of automaton creation is in $O(m^3|\Sigma|)$; this can be improved to $O(m|\Sigma|)$.
Worst-case running time of the entire string-matching strategy is in $O(m|\Sigma|)$ (automaton creation time) $+\ \Theta(n)$ (pattern matching time).

String Matching Algorithms: Knuth-Morris-Pratt

Knuth-Morris-Pratt Overview
Achieve $\Theta(n+m)$ time by shortening automaton preprocessing time below $O(m|\Sigma|)$.
Approach:
- don't precompute the automaton's transition function
- calculate enough transition data "on-the-fly"
- obtain data via "alphabet-independent" pattern preprocessing
- pattern preprocessing compares the pattern against shifts of itself

Knuth-Morris-Pratt Algorithm
Determine how the pattern matches against itself. [Textbook Figure 32.10]

Knuth-Morris-Pratt Algorithm
(32.6) Equivalently, what is the largest $k < q$ such that $P_k \sqsupset P_q$?
The prefix function $\pi$ shows how the pattern matches against itself:
$\pi[q] = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
$\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P_q$.
Example: [figure]

Knuth-Morris-Pratt Algorithm
Somewhat similar in structure to FINITE-AUTOMATON-MATCHER...
Annotations on the pseudocode:
- $\Theta(m)$, using amortized analysis (see next slide)
- number of characters matched
- scan text left to right
- next character does not match
- next character matches
- Is all of $P$ matched?
- Look for next match
- $\Theta(n)$, using amortized analysis*
- $\Theta(m+n)$ in total
*source: 91.503 textbook Cormen et al. The 2nd edition uses a potential function with $\Phi = q$. The 3rd edition uses aggregate analysis.

Knuth-Morris-Pratt Algorithm: Amortized Analysis
Similar in structure to KMP-MATCHER...
Potential method: $\Phi = k$, where $k$ represents the current state of the algorithm.
Annotations on the pseudocode:
- initial potential value
- potential decreases
- potential increases by $\le 1$ in each execution of the for loop body
- the potential is never negative, since $\pi[k] \ge 0$ for all $k$
- amortized cost of the loop body is in $O(1)$
- $\Theta(m)$ loop iterations
- $\Theta(m)$ time
source: 91.503 textbook Cormen et al., 2nd edition. The 3rd edition uses aggregate analysis to show that the while loop executes $O(m)$ times overall.

Knuth-Morris-Pratt Algorithm: Correctness...
Iterated prefix function: $\pi^*[q] = \{\pi[q], \pi^{(2)}[q], \pi^{(3)}[q], \ldots, \pi^{(t)}[q]\}$, where $\pi^{(i)}[q]$ denotes $\pi$ applied $i$ times starting from $q$, and the sequence stops when $\pi^{(t)}[q] = 0$.

Knuth-Morris-Pratt Algorithm: Correctness...
[Figure only]

Correctness of Compute-Prefix-Function. This is nontrivial...

Lemma 32.5 (Prefix-function iteration lemma). Let $P$ be a pattern of length $m$ with prefix function $\pi$. Then, for $q = 1, 2, \ldots, m$, we have $\pi^*[q] = \{k : k < q \text{ and } P_k \sqsupset P_q\}$ (using $\sqsupset$ for the suffix symbol).

Proof.
1. $\pi^*[q] \subseteq \{k : k < q \text{ and } P_k \sqsupset P_q\}$. Let $i = \pi^{(u)}[q]$ for some $u > 0$. We prove the inclusion by induction on $u$. For $u = 1$, we have $i = \pi[q]$, and the claim follows since $i < q$ and $P_{\pi[q]} \sqsupset P_q$.
Assume the inclusion holds for $i = \pi^{(u)}[q]$. We need to prove it for $i = \pi^{(u+1)}[q] = \pi[\pi^{(u)}[q]]$. By the definition of $\pi$, $i < \pi^{(u)}[q]$ and $P_i \sqsupset P_{\pi^{(u)}[q]}$. By the induction assumption, $\pi^{(u)}[q] < q$ and $P_{\pi^{(u)}[q]} \sqsupset P_q$. Hence $i < q$, and transitivity of the $\sqsupset$ relation gives $P_i \sqsupset P_q$, as desired.

2. $\{k : k < q \text{ and } P_k \sqsupset P_q\} \subseteq \pi^*[q]$. By contradiction. Suppose, to the contrary, that there is an integer in $\{k : k < q \text{ and } P_k \sqsupset P_q\} - \pi^*[q]$, and let $j$ denote the largest such integer. Since $\pi[q]$ is the largest value in $\{k : k < q \text{ and } P_k \sqsupset P_q\}$, and $\pi[q] \in \pi^*[q]$, we must have $j < \pi[q]$. Let $j'$ denote the smallest integer in $\pi^*[q]$ such that $j' > j$. Now $j \in \{k : k < q \text{ and } P_k \sqsupset P_q\}$ implies $P_j \sqsupset P_q$, and $j' \in \pi^*[q]$ implies $P_{j'} \sqsupset P_q$. Lemma 32.1 (Overlapping Suffix) implies that $P_j \sqsupset P_{j'}$, and $j$ is the largest value less than $j'$ with this property. This, in turn, forces the conclusion that $\pi[j'] = j$ and, since $j' \in \pi^*[q]$, we must have $j \in \pi^*[q]$. Contradiction.

source: Textbook and Prof. Pecelli (StringMatch slides, 4/25/2011)

We now continue with another lemma. It is clear that, since $\pi[1] = 0$, line 2 of Compute-Prefix-Function provides the correct value. We need to extend this statement to all $q > 1$.

Lemma 32.6. Let $P = P[1..m]$, and let $\pi$ be the prefix function for $P$. For $q = 1, 2, \ldots, m$, if $\pi[q] > 0$, then $\pi[q] - 1 \in \pi^*[q-1]$.

Proof. If $r = \pi[q] > 0$, then $r < q$ and $P_r \sqsupset P_q$. Thus $r - 1 < q - 1$ and $P_{r-1} \sqsupset P_{q-1}$ (by dropping the last characters from $P_r$ and $P_q$). Lemma 32.5 implies that $\pi[q] - 1 = r - 1 \in \pi^*[q-1]$.

We now introduce a new set: for $q = 2, 3, \ldots, m$, define $E_{q-1} \subseteq \pi^*[q-1]$ by
$E_{q-1} = \{k \in \pi^*[q-1] : P[k+1] = P[q]\}$
$\quad = \{k : k < q-1 \text{ and } P_k \sqsupset P_{q-1} \text{ and } P[k+1] = P[q]\}$
$\quad = \{k : k < q-1 \text{ and } P_{k+1} \sqsupset P_q\}$.
In other words, $E_{q-1}$ consists of the values $k < q - 1$ for which $P_k \sqsupset P_{q-1}$ and for which $P_{k+1} \sqsupset P_q$, because $P[k+1] = P[q]$.
$E_{q-1}$ consists of those values $k \in \pi^*[q-1]$ for which we can extend $P_k$ to $P_{k+1}$ and still get a proper suffix of $P_q$.

Corollary 32.7. Let $P$ be a pattern of length $m$, and let $\pi$ be the prefix function for $P$. For $q = 2, 3, \ldots, m$,
$$\pi[q] = \begin{cases} 0 & \text{if } E_{q-1} = \emptyset, \\ 1 + \max\{k \in E_{q-1}\} & \text{if } E_{q-1} \ne \emptyset. \end{cases}$$

Proof.
Case 1: $E_{q-1}$ is empty. There is no $k \in \pi^*[q-1]$ (including $k = 0$) for which we can extend $P_k$ to $P_{k+1}$ and get a proper suffix of $P_q$. Thus $\pi[q] = 0$.

Case 2: $E_{q-1}$ is not empty.
1. Prove $\pi[q] \ge 1 + \max\{k \in E_{q-1}\}$. For each $k \in E_{q-1}$ we have $k + 1 < q$ and $P_{k+1} \sqsupset P_q$. The definition of $\pi[q]$ gives the inequality.
2. Prove $\pi[q] \le 1 + \max\{k \in E_{q-1}\}$. Since $E_{q-1}$ is non-empty, $\pi[q] > 0$. Let $r = \pi[q] - 1$, hence $r + 1 = \pi[q]$. Since $r + 1 > 0$ and $P_{r+1} \sqsupset P_q$, the last characters agree: $P[r+1] = P[q]$. By Lemma 32.6 we also have $r = \pi[q] - 1 \in \pi^*[q-1]$. Therefore $r \in E_{q-1}$, which implies $r \le \max\{k \in E_{q-1}\}$ and, immediately, the desired inequality.
Combining both inequalities, we have the result.

Now glue all these results together to obtain a proof of correctness. ...
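The corollary above is what the prefix-function computation relies on: to find the value for position q, walk through the iterated prefix values of q-1 from largest to smallest and stop at the first k whose next pattern character equals P[q]. A minimal Python sketch follows. The function names, the unused slot `pi[0]` (kept so that `pi[q]` lines up with the 1-based notation of the slides), and the 0-based shifts are assumptions made for this illustration.

```python
def compute_prefix_function(pattern):
    """Return pi with pi[q] for q = 1..m; pi[0] is an unused placeholder.

    pi[q] is the length of the longest prefix of the pattern that is a
    proper suffix of pattern[:q].
    """
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0  # current candidate: a value in the iterated prefix set of q - 1
    for q in range(2, m + 1):
        # 1-based P[k+1] is pattern[k]; 1-based P[q] is pattern[q-1].
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]  # fall back to the next smaller candidate
        if pattern[k] == pattern[q - 1]:
            k += 1     # the candidate extends by one character
        pi[q] = k
    return pi


def kmp_matcher(text, pattern):
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]  # next character does not match
        if pattern[q] == text[i]:
            q += 1     # next character matches
        if q == m:     # all of the pattern is matched
            shifts.append(i - m + 1)
            q = pi[q]  # look for the next match
    return shifts
```

For the pattern ababaca, `compute_prefix_function("ababaca")` returns `[0, 0, 0, 1, 2, 3, 0, 1]`, where the leading 0 is the placeholder, and `kmp_matcher("abababacaba", "ababaca")` returns `[2]`. In both loops k (or q) rises by at most one per iteration and every fallback strictly lowers it, which is the intuition behind the linear-time bounds discussed in the slides.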

This note was uploaded on 02/13/2012 for the course CS 91.503 taught by Professor Staff during the Spring '11 term at UMass Lowell.
