6.851: Advanced Data Structures, Spring 2010
Lecture 7 – February 26, 2010
Prof. Andre Schulz
Scribe: Mark Chen

1 Overview

In this lecture, we consider the string matching problem: finding all places in a text where some query string occurs. From the perspective of a one-shot approach, we can solve string matching in O(|T|) time, where |T| is the size of our text. This purely algorithmic approach has been studied extensively in the papers by Knuth-Morris-Pratt [6], Boyer-Moore [1], and Rabin-Karp [4].

However, we entertain the possibility that multiple queries will be made to the same text. This motivates the development of data structures that preprocess the text to allow for more efficient queries. We will show how to construct, use, and analyze these string data structures.

2 Storing Strings and String Matching

First, we introduce some notation. Throughout these notes, Σ will denote a finite alphabet. An example of a finite alphabet is the standard set of English letters Σ = {a, b, c, ..., z}. A fixed string of characters T ∈ Σ* will comprise what we call a text. Another string of characters P ∈ Σ* will be called a search pattern. For integers i and j, define T[i : j] as the substring of T starting from the i-th character and ending with the j-th character, inclusive. We will often omit j and write T[i :] to denote the suffix of T starting at the i-th character. Finally, we let the symbol ◦ denote concatenation. As a simple illustration of our notation, (abcde[0 : 2]) ◦ (cde[1 :]) = abcde.

Now we can formally state the string matching problem: given an input text T ∈ Σ* and a pattern P ∈ Σ*, we want to find all occurrences of P in T. Closely related variants of the string matching problem ask for the first, first k, or some occurrences, rather than for all occurrences.
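To make the notation concrete, here is a minimal Python sketch. The helper names `substring` and `find_all` are hypothetical and do not come from the notes. Note that `find_all` is a naive scan taking O(|T| · |P|) time, not one of the linear-time algorithms cited in the overview; it only illustrates what "all occurrences of P in T" means.

```python
from typing import List, Optional


def substring(t: str, i: int, j: Optional[int] = None) -> str:
    """Return T[i : j] with 0-based, inclusive endpoints, or T[i :] if j is omitted."""
    # Python slices exclude the right endpoint, so add 1 to match the notes.
    return t[i:] if j is None else t[i:j + 1]


def find_all(text: str, pattern: str) -> List[int]:
    """Return every start index at which pattern occurs in text (naive scan)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


# The illustration from the notes: (abcde[0 : 2]) ◦ (cde[1 :]) = abcde
example = substring("abcde", 0, 2) + substring("cde", 1)
```

For instance, `find_all("banana", "ana")` reports the two overlapping occurrences starting at positions 1 and 3.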
2.1 Tries and Compressed Tries

A commonly used string data structure is called a trie, a tree where each edge stores a letter, each node stores a string, and the root stores the empty string. The recursive relationship between the values stored on the edges and the values stored in the nodes is as follows: given a path of increasing depth p = r, v_1, v_2, ..., v from the root r to a node v, the string stored at node v_i is the concatenation of the string stored in v_{i-1} with the letter stored on v_{i-1} v_i. We will denote the strings stored in the leaves of the trie as words, and the strings stored in all other nodes as prefixes.

If there is a natural lexicographical ordering on the elements in Σ, we order the edges of every node's fan-out alphabetically, from left to right. With respect to this ordering, in-order traversal of the leaves gives us every word stored in the trie in alphabetical order. In particular, it is easy to see that the fan-out of any node must be bounded above by the size of the alphabet |Σ|. ...
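The trie described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the construction from the lecture: the names `TrieNode`, `insert`, and `sorted_words` are hypothetical, and the terminator character `$` (assumed to sort before every letter and to appear in no word) is an assumption added here so that every word ends at a leaf even when it is a prefix of another word.

```python
from typing import Dict, List


class TrieNode:
    """A trie node; each outgoing edge is labeled with one letter."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}


def insert(root: TrieNode, word: str, end: str = "$") -> None:
    """Insert word, followed by the assumed terminator, one letter per edge."""
    node = root
    for ch in word + end:
        node = node.children.setdefault(ch, TrieNode())


def sorted_words(root: TrieNode, end: str = "$") -> List[str]:
    """Collect the words at the leaves, visiting edges in alphabetical order."""
    out: List[str] = []

    def visit(node: TrieNode, prefix: str) -> None:
        if not node.children:
            # A leaf: the concatenated edge labels spell word + terminator.
            if prefix.endswith(end):
                out.append(prefix[:-len(end)])
            return
        for ch in sorted(node.children):
            visit(node.children[ch], prefix + ch)

    visit(root, "")
    return out


root = TrieNode()
for w in ["ten", "tea", "to", "a", "te"]:
    insert(root, w)
words_in_order = sorted_words(root)
```

Because the edges at each node are visited in sorted order, `words_in_order` comes out alphabetically, matching the traversal property stated above. With the terminator included, each node has at most |Σ| + 1 children.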
This note was uploaded on 03/31/2011 for the course EECS 6.851 taught by Professor Erik Demaine during the Spring '10 term at MIT.