10g-fingerprint

# 10g-fingerprint - Algorithms Non-Lecture G: String Matching...


Why are our days numbered and not, say, lettered? (Woody Allen)

## G String Matching

### G.1 Brute Force

The basic object that we’re going to talk about for the next two lectures is a string, which is really just an array. The elements of the array come from a set Σ called the alphabet; the elements themselves are called characters. Common examples are ASCII text, where each character is a seven-bit integer [1]; strands of DNA, where the alphabet is the set of nucleotides {A, C, G, T}; or proteins, where the alphabet is the set of 22 amino acids.

The problem we want to solve is the following. Given two strings, a text T[1..n] and a pattern P[1..m], find the first substring of the text that is the same as the pattern. (It would be easy to extend our algorithms to find all matching substrings, but we will resist.) A substring is just a contiguous subarray. For any shift s, let T_s denote the substring T[s..s + m − 1]. So more formally, we want to find the smallest shift s such that T_s = P, or report that there is no match.

For example, if the text is the string ‘AMANAPLANACATACANALPANAMA’ [2] and the pattern is ‘CAN’, then the output should be 15. If the pattern is ‘SPAM’, then the answer should be ‘none’. In most cases the pattern is much smaller than the text; to make this concrete, I’ll assume that m < n/2.

Here’s the ‘obvious’ brute force algorithm, but with one immediate improvement. The inner while loop compares the substring T_s with P. If the two strings are not equal, this loop stops at the first character mismatch.

    ALMOSTBRUTEFORCE(T[1..n], P[1..m]):
      for s ← 1 to n − m + 1
        equal ← true
        i ← 1
        while equal and i ≤ m
          if T[s + i − 1] ≠ P[i]
            equal ← false
          else
            i ← i + 1
        if equal
          return s
      return ‘none’

[1] Yes, seven. Most computer systems use some sort of 8-bit character set, but there’s no universally accepted standard. Java supposedly uses the Unicode character set, which has variable-length characters and therefore doesn’t really fit into our framework. Just think, someday you’ll be able to write ‘¶ = [ ++]/ f ; ’ in your Java code! Joy!

[2] Dan Hoey (or rather, his computer program) found the following 540-word palindrome in 1984: A man, a plan, a caret, a ban, a myriad, a sum, a lac, a liar, a hoop, a pint, a catalpa, a gas, an oil, a bird, a yell, a vat, a caw, a pax, a wag, a tax, a nay, a ram, a cap, a yam, a gay, a tsar, a wall, a car, a luger, a ward, a bin, a woman, a vassal, a wolf, a tuna, a nit, a pall, a fret, a watt, a bay, a daub, a tan, a cab, a datum, a gall, a hat, a fag, a zap, a say, a jaw, a lay, a wet, a gallop, a tug, a trot, a trap, a tram, a torr, a caper, a top, a tonk, a toll, a ball, a fair, a sax, a minim, a tenor, a bass, a passer, a capital, a rut, an amen, a ted, a cabal, a tang, a sun, an ass, a maw, a sag, a jam, a dam, a sub, a salt, an axon, a sail, an ad, a wadi, a radian, a room, a rood, a rip, a tad, a pariah, a revel, a reel, a reed, a pool, a plug, a pin, a peek, a parabola, a dog, a pat, a cud, a nu, a fan, a pal, a rum, a nod, an eta, a lag, an eel, a
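
The ALMOSTBRUTEFORCE pseudocode from Section G.1 can be sketched in Python roughly as follows. This is only an illustrative translation, not code from the notes: the function name `almost_brute_force` and the choice to return `None` for ‘none’ are assumptions. It keeps the notes’ 1-based shifts, so the indices are adjusted by one when touching Python’s 0-based strings, and it folds the `equal` flag into the loop condition.

```python
def almost_brute_force(text, pattern):
    """Return the smallest 1-based shift s such that
    text[s .. s+m-1] equals pattern, or None if there is no match."""
    n, m = len(text), len(pattern)
    # s runs over 1 .. n - m + 1, exactly as in the pseudocode.
    for s in range(1, n - m + 2):
        i = 1
        # Compare T[s + i - 1] with P[i] (1-based), stopping at the
        # first mismatch; Python indices are one less than these.
        while i <= m and text[s + i - 2] == pattern[i - 1]:
            i += 1
        if i > m:  # every character of the pattern matched
            return s
    return None


print(almost_brute_force("AMANAPLANACATACANALPANAMA", "CAN"))   # 15
print(almost_brute_force("AMANAPLANACATACANALPANAMA", "SPAM"))  # None
```

In the worst case the inner loop runs up to m times for each of the n − m + 1 shifts, so this sketch performs O(mn) character comparisons.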


## This note was uploaded on 12/15/2009 for the course 942 cs taught by Professor A during the Spring '09 term at University of Illinois at Urbana–Champaign.

