UMass Lowell Computer Science 91.503 Analysis of Algorithms
Prof. Karen Daniels
Spring 2011
Tuesday, 4/26/2011
String Matching Algorithms
Chapter 32

Chapter Dependencies
Automata; Ch 32 String Matching
You're responsible for material in Sections 32.1-32.4 of this chapter.

String Matching Algorithms
Motivation & Basics

String Matching Problem
Motivations: text editing, pattern matching in DNA sequences
[Slide shows textbook Figure 32.1.]
Text: array $T[1..n]$
Pattern: array $P[1..m]$, with $m \le n$
Array element: character from finite alphabet $\Sigma$
Pattern $P$ occurs with shift $s$ in $T$ if $P[1..m] = T[s+1..s+m]$, where $0 \le s \le n-m$
source: 91.503 textbook Cormen et al.

String Matching Algorithms: Worst-Case Execution Time
Algorithm            Preprocessing      Matching           Overall
Naive                0                  $O((n-m+1)m)$      $O((n-m+1)m)$
Rabin-Karp           $\Theta(m)$        $O((n-m+1)m)$      $O((n-m+1)m)$ (better than this on average and in practice)
Finite automaton     $O(m|\Sigma|)$     $\Theta(n)$        $O(n + m|\Sigma|)$
Knuth-Morris-Pratt   $\Theta(m)$        $\Theta(n)$        $\Theta(n+m)$
Text: array $T[1..n]$; pattern: array $P[1..m]$

Notation & Terminology
$\Sigma^*$ = set of all finite-length strings formed using characters from alphabet $\Sigma$
Empty string: $\varepsilon$
$|x|$ = length of string $x$
$w$ is a prefix of $x$: $w \sqsubset x$ (example: ab $\sqsubset$ abcca)
$w$ is a suffix of $x$: $w \sqsupset x$ (example: cca $\sqsupset$ abcca)
Prefix, suffix are transitive.
Overlapping Suffix Lemma
[Slide shows textbook Lemma 32.1 and Figure 32.3.]

Naive Algorithm

Naive String Matching
[Slide shows the naive matcher pseudocode and textbook Figure 32.4.]
The string comparison at each shift is an implicit loop.
Worst-case running time is in $\Theta((n-m+1)m)$.
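The naive strategy can be sketched in a few lines of Python. This is only an illustration, not the slide's pseudocode: the function name `naive_string_matcher` is an assumption, and shifts are reported 0-based (so shift `s` means `pattern == text[s:s+m]`).

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # try all n - m + 1 candidate shifts
        # The slice comparison is the "implicit loop": up to m character
        # comparisons per shift, hence (n - m + 1) * m in the worst case.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts
```

A text such as `"aaaa"` with pattern `"aa"` exercises overlapping matches: every shift is valid and every comparison runs to the end of the pattern.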
Rabin-Karp

Rabin-Karp Algorithm
Assume each character is a digit in radix-$d$ notation (e.g. $d = 10$).
Strategy: convert to numeric representation for mod operations.
$p$ = decimal value of pattern
$t_s$ = decimal value of substring $T[s+1..s+m]$ for $s = 0, 1, \ldots, n-m$
  compute $p$ in $O(m)$ time
  compute all $t_s$ values in a total of $O(n)$ time
  find all valid shifts $s$ in $O(n)$ time by comparing $p$ with each $t_s$
Compute $p$ in $O(m)$ time using Horner's rule:
  $p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + dP[1]) \cdots))$
Compute $t_0$ similarly from $T[1..m]$ in $O(m)$ time.
Compute remaining $t_s$'s in $O(n-m)$ time (rolling window):
  $t_{s+1} = d(t_s - d^{m-1}T[s+1]) + T[s+m+1]$

Rabin-Karp Algorithm (continued)
$p$ and $t_s$ may be large, so mod by a prime $q$:
  $t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) \bmod q$   (32.2)
  where $h \equiv d^{m-1} \pmod q$
[Slide shows textbook Figure 32.5: example with $p = 31415$, marking a pattern match and a spurious hit.]
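Horner's rule and the rolling-window update (32.2) can be illustrated with a short sketch. The function names and the modulus $q = 13$ are assumptions chosen for illustration, and the text is assumed to consist of decimal digit characters with $m \le n$.

```python
def horner_value(digits: str, d: int, q: int) -> int:
    """Value of a digit string in radix d, reduced mod q, via Horner's rule."""
    value = 0
    for ch in digits:
        value = (d * value + int(ch)) % q
    return value


def window_values(text: str, m: int, d: int = 10, q: int = 13) -> list[int]:
    """Return t_s mod q for every length-m window of text (s = 0..n-m)."""
    h = pow(d, m - 1, q)                 # weight of the high-order digit
    t = horner_value(text[:m], d, q)     # t_0, computed in O(m)
    values = [t]
    for s in range(len(text) - m):
        # Drop the leading digit text[s], shift left by one digit,
        # then bring in the new low-order digit text[s + m].
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        values.append(t)
    return values
```

Each update costs constant time, so all window values after the first take $O(n-m)$ time in total. Python's `%` always returns a non-negative result for a positive modulus, so the subtraction needs no extra correction.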
[Slide shows the RABIN-KARP-MATCHER pseudocode with the following annotations.]
$d$ is radix; $q$ is modulus.
$\Theta(m)$: high-order digit position for $m$-digit window.
$\Theta(m)$: preprocessing loop.
$\Theta((n-m+1)m)$: matching loop; try all possible shifts.
Matching loop invariant: when line 10 is executed, $t_s = T[s+1..s+m] \bmod q$.
$\Theta(m)$: explicit comparison to rule out spurious hit.
Stopping condition.
Worst-case running time is in $\Theta((n-m+1)m)$.
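Putting the annotated steps together, a minimal Rabin-Karp sketch might look as follows. It is not the textbook pseudocode verbatim: characters are mapped to numbers with `ord`, the defaults `d = 256` and `q = 1_000_003` are assumptions for illustration, and shifts are 0-based.

```python
def rabin_karp_matcher(text: str, pattern: str,
                       d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all valid shifts of pattern in text using a rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # high-order digit weight, d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # preprocessing: Horner's rule
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):           # matching: try all possible shifts
        # Invariant: t is the hash of text[s:s+m].
        if p == t and text[s:s + m] == pattern:   # rule out spurious hits
            shifts.append(s)
        if s < n - m:                    # stopping condition for the update
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts
```

Choosing a tiny modulus, for example `q=13`, makes spurious hits frequent but does not affect correctness, because every hash hit is verified by an explicit comparison.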
$\Sigma^*$ = set of all finite-length strings formed from $\Sigma$
Assume reducing mod $q$ is like a random mapping from $\Sigma^*$ to $\mathbb{Z}_q$.
Estimate: (chance that $t_s \equiv p \pmod q$) $= 1/q$.
Expected # spurious hits is in $O(n/q)$.
Expected matching time $= O(n) + O(m(v + n/q))$, where
  $O(n)$ covers preprocessing + $t_s$ updates
  $O(m(v + n/q))$ is the time for explicit matching comparisons ($v$ = # valid shifts)
If $v$ is in $O(1)$ and $q \ge m$, average-case running time is in $O(n+m)$.
Finite Automata

Finite Automata
[Slide shows textbook Figure 32.6.]
Strategy: build automaton for pattern, then examine each text character once.
Worst-case running time is in $\Theta(n)$ + automaton creation time.

String-Matching Automaton
Pattern = P = ababaca
[Slide shows textbook Figure 32.7.]
Absent arrows go to state 0.
Automaton accepts strings ending in P.
Suffix function for $P$: $\sigma(x)$ = length of longest prefix of $P$ that is a suffix of $x$
  $\sigma(x) = \max\{k : P_k \sqsupset x\}$   (32.3)
Transition function:
  $\delta(q, a) = \sigma(P_q a)$   (32.4)
Automaton's operational invariant (we will build up to this proof...):
  $\phi(T_i) = \sigma(T_i)$, where $T_i$ is the $i$-character prefix of $T$ and $\phi(T_i)$ is the state reached after reading $T_i$   (32.5)
At each step: keep track of longest pattern prefix that is a suffix of what has been read so far.

[Slide shows the FINITE-AUTOMATON-MATCHER pseudocode.]
Simulate behavior of string-matching automaton that finds occurrences of pattern $P$ of length $m$ in $T[1..n]$.
We'll show automaton is in state $\sigma(T_i)$ after scanning character $T[i]$.
Since $\sigma(T_i) = m$ iff $P \sqsupset T_i$, machine is in accepting state $m$ iff it has just scanned pattern $P$.
Assuming automaton has already been created, worst-case running time of matching is in $\Theta(n)$.
String-Matching Automaton (continued)
Correctness of matching procedure...
[Slide shows textbook Theorem 32.4.]
Automaton keeps track of longest pattern prefix that is a suffix of what has been read so far in the text.
Board work. The proof uses equation (32.4) and Lemma 32.3, $\sigma(xa) = \sigma(P_{\sigma(x)}a)$, to be proved next...

[Slide shows textbook Lemma 32.2 and Figure 32.8; the figure marks $P_{\sigma(xa)}$.]
Lemma 32.2 is to be used to prove Lemma 32.3.

[Slide shows textbook Lemma 32.3 and Figure 32.9; the figure marks $P_{\sigma(x)}$ and $P_{\sigma(xa)}$. The proof cites Lemmas 32.2 and 32.1.]

[Slide shows textbook Theorem 32.4 again.]
Correctness of matching procedure is now established, using equation (32.4) and Lemma 32.3: $\sigma(xa) = \sigma(P_{\sigma(x)}a)$.
[Slide shows the COMPUTE-TRANSITION-FUNCTION pseudocode.]
This procedure computes the transition function $\delta$ from a given pattern $P[1..m]$.
Worst-case running time of automaton creation is in $O(m^3|\Sigma|)$; this can be improved to $O(m|\Sigma|)$.
Worst-case running time of entire string-matching strategy is in $O(m|\Sigma|) + \Theta(n)$ (automaton creation time + pattern matching time).
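The automaton strategy can be sketched as below. This follows the straightforward $O(m^3|\Sigma|)$ construction, computing $\delta(q, a) = \sigma(P_q a)$ by trying candidate prefix lengths from longest to shortest. The function names, the list-of-dicts representation of $\delta$, and 0-based shifts are assumptions for illustration; the alphabet is assumed to contain every pattern character.

```python
def compute_transition_function(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a] = length of the longest prefix of pattern that is a suffix of P_q + a."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)            # the new state can grow by at most one
            while not candidate.endswith(pattern[:k]):
                k -= 1                   # stops at k = 0: "" is a suffix of anything
            row[a] = k
        delta.append(row)
    return delta


def finite_automaton_matcher(text: str, delta: list[dict[str, int]], m: int) -> list[int]:
    """Scan text once; report 0-based shifts where the accepting state m is reached."""
    q = 0
    shifts = []
    for i, ch in enumerate(text):
        q = delta[q].get(ch, 0)          # characters outside the alphabet reset to state 0
        if q == m:
            shifts.append(i - m + 1)
    return shifts
```

For the pattern `ababaca` over the alphabet `abc`, `delta[5]['b']` is 4: after matching `ababa`, reading `b` leaves `abab` as the longest pattern prefix that is a suffix of the text read so far.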
Knuth-Morris-Pratt

Knuth-Morris-Pratt Overview
Achieve $\Theta(n+m)$ time by shortening automaton preprocessing time below $O(m|\Sigma|)$.
Approach:
  don't precompute automaton's transition function
  calculate enough transition data "on-the-fly"
  obtain data via "alphabet-independent" pattern preprocessing
  pattern preprocessing compares pattern against shifts of itself

Knuth-Morris-Pratt Algorithm
Determine how pattern matches against itself.
[Slide shows textbook Figure 32.10.]

[Slide shows textbook equation (32.6).]
Equivalently, what is largest $k < q$ such that $P_k \sqsupset P_q$?
Prefix function $\pi$ shows how pattern matches against itself:
  $\pi(q) = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
$\pi(q)$ is length of longest prefix of $P$ that is a proper suffix of $P_q$.
Example: [shown on slide]
[Slide shows the KMP-MATCHER pseudocode with the following annotations.]
Somewhat similar in structure to FINITE-AUTOMATON-MATCHER...
$\Theta(m+n)$ overall.
$\Theta(m)$ for the prefix function, using amortized analysis (see next slide).
$q$: # characters matched.
Scan text left to right.
$\Theta(n)$ for the scan, using amortized analysis*.
Next character does not match.
Next character matches.
Is all of P matched?
Look for next match.
*source: 91.503 textbook Cormen et al. The 2nd edition uses a potential function with $\Phi = q$; the 3rd edition uses aggregate analysis.
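The annotated steps correspond to a matcher along the following lines. This is a sketch with 0-based indexing rather than the textbook's 1-based arrays: `pi[q - 1]` stores $\pi[q]$, shifts are 0-based, and the function names are assumptions for illustration.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q - 1] = length of the longest prefix of pattern that is a proper suffix of P_q."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                # fall back to the next shorter candidate
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all 0-based shifts at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                # number of characters matched
    for i, ch in enumerate(text):        # scan text left to right
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]                # next character does not match
        if pattern[q] == ch:
            q += 1                       # next character matches
        if q == m:                       # is all of P matched?
            shifts.append(i - m + 1)
            q = pi[q - 1]                # look for next match
    return shifts
```

Unlike the automaton approach, nothing here depends on the alphabet size: the only precomputed table has $m$ entries.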
Amortized Analysis: Potential Method
[Slide shows the COMPUTE-PREFIX-FUNCTION pseudocode with the following annotations.]
$\Phi = k$; $k$ represents current state of algorithm.
Similar in structure to KMP-MATCHER...
Potential is never negative since $\pi(k) \ge 0$ for all $k$.
Initial potential value ($k = 0$).
Potential decreases (while loop).
Potential increases by $\le 1$ in each execution of for loop body.
Amortized cost of loop body is in $O(1)$; $\Theta(m)$ loop iterations; $\Theta(m)$ time.
source: 91.503 textbook Cormen et al., 2nd edition. The 3rd edition uses aggregate analysis to show the while loop executes $O(m)$ times overall.
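The amortized argument can be observed directly by counting while-loop iterations. The sketch below is an instrumented variant written for illustration (the name `prefix_function_with_count` and the returned counter are assumptions): since `k` grows by at most one per for-loop iteration and every while-loop iteration strictly decreases it, the total number of while-loop iterations cannot exceed $m - 1$.

```python
def prefix_function_with_count(pattern: str) -> tuple[list[int], int]:
    """Return (pi, total number of while-loop iterations); pi[q - 1] stores pi[q]."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                # potential starts at 0
    while_iterations = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]                # potential strictly decreases
            while_iterations += 1
        if pattern[k] == pattern[q]:
            k += 1                       # potential increases by at most 1
        pi[q] = k
    return pi, while_iterations
```

For a pattern such as `aaaab`, the last character triggers three fall-backs in a row, yet the total over the whole run is still below the pattern length.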
Correctness...
Iterated prefix function: $\pi^*[q] = \{\pi[q], \pi^{(2)}[q], \pi^{(3)}[q], \ldots, \pi^{(t)}[q]\}$, where $\pi^{(0)}[q] = q$, $\pi^{(i)}[q] = \pi[\pi^{(i-1)}[q]]$ for $i \ge 1$, and the sequence stops when $\pi^{(t)}[q] = 0$.

StringMatch
Correctness of COMPUTE-PREFIX-FUNCTION. This is nontrivial...
Lemma 32.5 (Prefix-function iteration lemma). Let $P$ be a pattern of length $m$ with prefix function $\pi$. Then, for $q = 1, 2, \ldots, m$, we have $\pi^*[q] = \{k : k < q \text{ and } P_k \sqsupset P_q\}$.
Proof.
1. $\pi^*[q] \subseteq \{k : k < q \text{ and } P_k \sqsupset P_q\}$. Let $i = \pi^{(u)}[q]$ for some $u > 0$. We prove the inclusion by induction on $u$. For $u = 1$, we have $i = \pi[q]$, and the claim follows since $i < q$ and $P_{\pi[q]} \sqsupset P_q$. Assume the inclusion holds for $\pi^{(u)}[q]$. We need to prove it for $i = \pi^{(u+1)}[q] = \pi[\pi^{(u)}[q]]$. By the definition of $\pi$, $i < \pi^{(u)}[q]$ and $P_i \sqsupset P_{\pi^{(u)}[q]}$. By the induction assumption, $\pi^{(u)}[q] < q$ and $P_{\pi^{(u)}[q]} \sqsupset P_q$, so $i < q$. Transitivity of the $\sqsupset$ relation gives $P_i \sqsupset P_q$, as desired.
2. $\{k : k < q \text{ and } P_k \sqsupset P_q\} \subseteq \pi^*[q]$. By contradiction. Suppose, to the contrary, that there is an integer in $\{k : k < q \text{ and } P_k \sqsupset P_q\} - \pi^*[q]$, and let $j$ denote the largest such integer. Since $\pi[q]$ is the largest value in $\{k : k < q \text{ and } P_k \sqsupset P_q\}$, and $\pi[q] \in \pi^*[q]$, we must have $j < \pi[q]$. Let $j'$ denote the smallest integer in $\pi^*[q]$ such that $j' > j$. $j \in \{k : k < q \text{ and } P_k \sqsupset P_q\}$ implies $P_j \sqsupset P_q$; $j' \in \pi^*[q]$ implies $P_{j'} \sqsupset P_q$. Lemma 32.1 (Overlapping Suffix) implies that $P_j \sqsupset P_{j'}$, and $j$ is the largest value less than $j'$ with this property. This, in turn, forces the conclusion that $\pi[j'] = j$ and, since $j' \in \pi^*[q]$, we must have $j \in \pi^*[q]$. Contradiction.
source: Textbook and Prof. Pecelli (4/25/2011)

We now continue with another lemma: it is clear that, since $\pi[1] = 0$, line 2 of COMPUTE-PREFIX-FUNCTION provides the correct value. We need to extend this statement to all $q > 1$.
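Lemma 32.5 can be checked on small examples by brute force. In the sketch below, all three function names are assumptions for illustration; `pi` is stored 1-indexed (with `pi[0]` unused) to match the lemma's notation, and the prefix function is computed directly from its definition rather than by COMPUTE-PREFIX-FUNCTION.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[q] for q = 1..m, straight from the definition (pi[0] is unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    for q in range(1, m + 1):
        # Largest k < q such that P_k is a suffix of P_q (k = 0 always qualifies).
        pi[q] = max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
    return pi


def pi_star(pi: list[int], q: int) -> set[int]:
    """Iterate pi starting from q until 0 is reached, collecting the values."""
    result = set()
    k = pi[q]
    while True:
        result.add(k)
        if k == 0:
            break
        k = pi[k]
    return result


def proper_suffix_prefixes(pattern: str, q: int) -> set[int]:
    """The set {k : k < q and P_k is a suffix of P_q}."""
    return {k for k in range(q) if pattern[:q].endswith(pattern[:k])}
```

For the pattern `ababababca` and $q = 8$, iterating the prefix function yields 6, 4, 2, 0, which are exactly the lengths of the prefixes of the pattern that are proper suffixes of `abababab`.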
Lemma 32.6. Let $P = P[1..m]$, and let $\pi$ be the prefix function for $P$. For $q = 1, 2, \ldots, m$, if $\pi[q] > 0$, then $\pi[q] - 1 \in \pi^*[q-1]$.
Proof. If $r = \pi[q] > 0$, then $r < q$ and $P_r \sqsupset P_q$. Thus $r - 1 < q - 1$ and $P_{r-1} \sqsupset P_{q-1}$ (by dropping the last characters from $P_r$ and $P_q$). Lemma 32.5 implies that $\pi[q] - 1 = r - 1 \in \pi^*[q-1]$.

We now introduce a new set: for $q = 2, 3, \ldots, m$, define $E_{q-1} \subseteq \pi^*[q-1]$ by:
  $E_{q-1} = \{k \in \pi^*[q-1] : P[k+1] = P[q]\}$
  $\phantom{E_{q-1}} = \{k : k < q-1 \text{ and } P_k \sqsupset P_{q-1} \text{ and } P[k+1] = P[q]\}$
  $\phantom{E_{q-1}} = \{k : k < q-1 \text{ and } P_{k+1} \sqsupset P_q\}$
In other words, $E_{q-1}$ consists of the values $k < q - 1$ for which $P_k \sqsupset P_{q-1}$ and for which $P_{k+1} \sqsupset P_q$, because $P[k+1] = P[q]$. $E_{q-1}$ consists of those values $k \in \pi^*[q-1]$ for which we can extend $P_k$ to $P_{k+1}$ and still get a proper suffix of $P_q$.
Corollary 32.7. Let $P$ be a pattern of length $m$, and let $\pi$ be the prefix function for $P$. For $q = 2, 3, \ldots, m$,
  $\pi[q] = \begin{cases} 0 & \text{if } E_{q-1} = \emptyset, \\ 1 + \max\{k \in E_{q-1}\} & \text{if } E_{q-1} \ne \emptyset. \end{cases}$
Proof.
Case 1: $E_{q-1}$ is empty. There is no $k \in \pi^*[q-1]$ (including $k = 0$) for which we can extend $P_k$ to $P_{k+1}$ and get a proper suffix of $P_q$. Thus $\pi[q] = 0$.
Case 2: $E_{q-1}$ is not empty.
1. Prove $\pi[q] \ge 1 + \max\{k \in E_{q-1}\}$. For each $k \in E_{q-1}$ we have $k + 1 < q$ and $P_{k+1} \sqsupset P_q$. The definition of $\pi[q]$ gives the inequality.
2. Prove that $\pi[q] \le 1 + \max\{k \in E_{q-1}\}$. Since $E_{q-1}$ is non-empty, $\pi[q] > 0$. Let $r = \pi[q] - 1$, hence $r + 1 = \pi[q]$. Since $r + 1 > 0$ and $P_{r+1} \sqsupset P_q$, $P[r+1] = P[q]$. By Lemma 32.6 we also have $r = \pi[q] - 1 \in \pi^*[q-1]$. Therefore $r \in E_{q-1}$, which implies $r \le \max\{k \in E_{q-1}\}$ and, immediately, the desired inequality.
Combining both inequalities, we have the result.
Now glue all these results together to obtain a proof of correctness.
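Corollary 32.7 translates almost directly into a way of computing the prefix function: walk through $\pi^*[q-1]$ in decreasing order and stop at the first $k$ with $P[k+1] = P[q]$, which is the maximum of $E_{q-1}$. The sketch below is an illustration under assumed names, with `pi` stored 1-indexed (`pi[0]` unused) and the pattern as a 0-based Python string, so $P[k+1]$ is `pattern[k]` and $P[q]$ is `pattern[q - 1]`.

```python
def prefix_function_via_corollary(pattern: str) -> list[int]:
    """Compute pi[1..m] using pi[q] = 1 + max(E_{q-1}), or 0 if E_{q-1} is empty."""
    m = len(pattern)
    pi = [0] * (m + 1)                   # pi[1] = 0; pi[0] is unused
    for q in range(2, m + 1):
        k = pi[q - 1]                    # largest element of pi*[q-1]
        while True:
            if pattern[k] == pattern[q - 1]:   # P[k+1] == P[q], so k is in E_{q-1}
                pi[q] = k + 1            # first hit is the maximum of E_{q-1}
                break
            if k == 0:                   # pi*[q-1] exhausted: E_{q-1} is empty
                break                    # pi[q] stays 0
            k = pi[k]                    # next smaller element of pi*[q-1]
    return pi
```

This is the same computation that COMPUTE-PREFIX-FUNCTION performs, written so that the role of $E_{q-1}$ is explicit.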