(To appear in ALGORITHMICA)

On–line construction of suffix trees¹

Esko Ukkonen
Department of Computer Science, University of Helsinki, P. O. Box 26 (Teollisuuskatu 23), FIN–00014 University of Helsinki, Finland
Tel.: +35807084172, fax: +35807084441
Email: [email protected]

Abstract. An on–line algorithm is presented for constructing the suffix tree for a given string in time linear in the length of the string. The new algorithm has the desirable property of processing the string symbol by symbol from left to right. It has always the suffix tree for the scanned part of the string ready. The method is developed as a linear–time version of a very simple algorithm for (quadratic size) suffix tries. Regardless of its quadratic worst case this latter algorithm can be a good practical method when the string is not too long. Another variation of this method is shown to give in a natural way the well–known algorithms for constructing suffix automata (DAWGs).

Key Words. Linear time algorithm, suffix tree, suffix trie, suffix automaton, DAWG.

¹ Research supported by the Academy of Finland and by the Alexander von Humboldt Foundation (Germany).

1. INTRODUCTION

A suffix tree is a trie–like data structure representing all suffixes of a string. Such trees have a central role in many algorithms on strings, see e.g. [3, 7, 2]. It is quite commonly felt, however, that the linear–time suffix tree algorithms presented in the literature are rather difficult to grasp. The main purpose of this paper is to be an attempt in developing an understandable suffix tree construction based on a natural idea that seems to complete our picture of suffix trees in an essential way. The new algorithm has the important property of being on–line. It processes the string symbol by symbol from left to right, and has always the suffix tree for the scanned part of the string ready.
The algorithm is based on the simple observation that the suffixes of a string $T^i = t_1 \cdots t_i$ can be obtained from the suffixes of string $T^{i-1} = t_1 \cdots t_{i-1}$ by catenating symbol $t_i$ at the end of each suffix of $T^{i-1}$ and by adding the empty suffix. The suffixes of the whole string $T = T^n = t_1 t_2 \cdots t_n$ can be obtained by first expanding the suffixes of $T^0$ into the suffixes of $T^1$ and so on, until the suffixes of $T$ are obtained from the suffixes of $T^{n-1}$. This is in contrast with the method by Weiner [13] that proceeds right–to–left and adds the suffixes to the tree in increasing order of their length, starting from the shortest suffix, and with the method by McCreight [9] that adds the suffixes to the tree in the decreasing order of their length. It should be noted, however, that despite the clear difference in the intuitive view on the problem, our algorithm and McCreight’s algorithm are in their final form functionally rather closely related....
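
The observation that the suffixes of each prefix come from those of the previous prefix can be illustrated with a minimal Python sketch. It only maintains an explicit list of suffixes, which takes quadratic space; it is not the suffix tree construction itself, and the function names `extend_suffixes` and `online_suffixes` are hypothetical, chosen here for illustration.

```python
def extend_suffixes(prev_suffixes, symbol):
    """Turn the suffixes of the previous prefix into those of the next one.

    Appends `symbol` to every previous suffix, then adds the empty suffix.
    """
    return [s + symbol for s in prev_suffixes] + [""]


def online_suffixes(text):
    """Yield the suffix list of each prefix of `text`, scanning left to right."""
    suffixes = [""]  # the empty prefix has only the empty suffix
    for symbol in text:
        suffixes = extend_suffixes(suffixes, symbol)
        yield suffixes


if __name__ == "__main__":
    for i, suffixes in enumerate(online_suffixes("cacao"), start=1):
        print(i, suffixes)
```

After each symbol is read, the list holds exactly the suffixes of the scanned prefix, longest first, which is the on–line property described above in its simplest form.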
This note was uploaded on 01/13/2012 for the course CMSC 423 taught by Professor Staff during the Fall '07 term at Maryland.