10g-fingerprint - Algorithms Non-Lecture G: String Matching...

Algorithms Non-Lecture G: String Matching

    Why are our days numbered and not, say, lettered?
        - Woody Allen

G String Matching

G.1 Brute Force

The basic object that we’re going to talk about for the next two lectures is a string, which is really just an array. The elements of the array come from a set Σ called the alphabet; the elements themselves are called characters. Common examples are ASCII text, where each character is a seven-bit integer¹; strands of DNA, where the alphabet is the set of nucleotides {A, C, G, T}; and proteins, where the alphabet is the set of 22 amino acids.

The problem we want to solve is the following. Given two strings, a text T[1..n] and a pattern P[1..m], find the first substring of the text that is the same as the pattern. (It would be easy to extend our algorithms to find all matching substrings, but we will resist.) A substring is just a contiguous subarray. For any shift s, let T_s denote the substring T[s..s+m−1]. So more formally, we want to find the smallest shift s such that T_s = P, or report that there is no match. For example, if the text is the string ‘AMANAPLANACATACANALPANAMA’² and the pattern is ‘CAN’, then the output should be 15. If the pattern is ‘SPAM’, then the answer should be ‘none’. In most cases the pattern is much smaller than the text; to make this concrete, I’ll assume that m < n/2.

Here’s the ‘obvious’ brute force algorithm, but with one immediate improvement. The inner while loop compares the substring T_s with P. If the two strings are not equal, this loop stops at the first character mismatch.

    ALMOSTBRUTEFORCE(T[1..n], P[1..m]):
        for s ← 1 to n − m + 1
            equal ← true
            i ← 1
            while equal and i ≤ m
                if T[s + i − 1] ≠ P[i]
                    equal ← false
                else
                    i ← i + 1
            if equal
                return s
        return ‘none’

¹ Yes, seven. Most computer systems use some sort of 8-bit character set, but there’s no universally accepted standard. Java supposedly uses the Unicode character set, which has variable-length characters and therefore doesn’t really fit into our framework. Just think, someday you’ll be able to write ‘¶ = [ ++]/ f ; ’ in your Java code! Joy!

² Dan Hoey (or rather, his computer program) found the following 540-word palindrome in 1984: A man, a plan, a caret, a ban, a myriad, a sum, a lac, a liar, a hoop, a pint, a catalpa, a gas, an oil, a bird, a yell, a vat, a caw, a pax, a wag, a tax, a nay, a ram, a cap, a yam, a gay, a tsar, a wall, a car, a luger, a ward, a bin, a woman, a vassal, a wolf, a tuna, a nit, a pall, a fret, a watt, a bay, a daub, a tan, a cab, a datum, a gall, a hat, a fag, a zap, a say, a jaw, a lay, a wet, a gallop, a tug, a trot, a trap, a tram, a torr, a caper, a top, a tonk, a toll, a ball, a fair, a sax, a minim, a tenor, a bass, a passer, a capital, a rut, an amen, a ted, a cabal, a tang, a sun, an ass, a maw, a sag, a jam, a dam, a sub, a salt, an axon, a sail, an ad, a wadi, a radian, a room, a rood, a rip, a tad, a pariah, a revel, a reel, a reed, a pool, a plug, a pin, a peek, a parabola, a dog, a pat, a cud, a nu, a fan, a pal, a rum, a nod, an eta, a lag, an eel, a
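
For readers who want to run the ALMOSTBRUTEFORCE pseudocode from Section G.1, here is a minimal Python sketch. It is one possible transcription, not part of the original notes: the function name almost_brute_force and the choice to return None for ‘none’ are assumptions made for illustration. Python strings are 0-indexed, so the sketch shifts indices by one while still reporting the 1-based shift used in the text.

```python
def almost_brute_force(text, pattern):
    """Return the smallest 1-based shift s such that
    text[s..s+m-1] equals pattern, or None if there is no match.

    Hypothetical helper: the name and the None return value are
    illustrative choices, not taken from the notes.
    """
    n, m = len(text), len(pattern)
    # s runs over the 1-based shifts 1 .. n-m+1, as in the pseudocode.
    for s in range(1, n - m + 2):
        i = 1
        # Compare T[s+i-1] with P[i]; the "-1" offsets convert the
        # 1-based indices of the notes to Python's 0-based indexing.
        # The loop stops at the first mismatching character.
        while i <= m and text[s + i - 2] == pattern[i - 1]:
            i += 1
        if i > m:  # every one of the m characters matched
            return s
    return None


if __name__ == "__main__":
    print(almost_brute_force("AMANAPLANACATACANALPANAMA", "CAN"))   # 15
    print(almost_brute_force("AMANAPLANACATACANALPANAMA", "SPAM"))  # None
```

The sketch folds the pseudocode's boolean flag into the loop condition: the inner loop exits either on a mismatch or when i exceeds m, and the second case is exactly the one in which the flag would still be true.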

This note was uploaded on 12/15/2009 for the course 942 cs taught by Professor A during the Spring '09 term at University of Illinois at Urbana–Champaign.
