1 Suffix Trees and Suffix Arrays

Srinivas Aluru
Iowa State University

1.1 Basic Definitions and Properties
1.2 Linear Time Construction Algorithms
    Suffix Trees vs. Suffix Arrays
    Linear Time Construction of Suffix Trees
    Linear Time Construction of Suffix Arrays
    Space Issues
1.3 Applications
    Pattern Matching
    Longest Common Substrings
    Text Compression
    String Containment
    Suffix-Prefix Overlaps
1.4 Lowest Common Ancestors
1.5 Advanced Applications
    Suffix Links from Lowest Common Ancestors
    Approximate Pattern Matching
    Maximal Palindromes

1.1 Basic Definitions and Properties

Suffix trees and suffix arrays are versatile data structures fundamental to string processing applications. Let s' denote a string over the alphabet Σ. Let $ ∉ Σ be a unique termination character, and let s = s'$ be the string resulting from appending $ to s'. We use the following notation: |s| denotes the size of s, s[i] denotes the i-th character of s, and s[i..j] denotes the substring s[i] s[i+1] ... s[j]. Let suff_i = s[i] s[i+1] ... s[|s|] be the suffix of s starting at the i-th position.

The suffix tree of s, denoted ST(s) or simply ST, is a compacted trie of all suffixes of the string s. Let |s| = n. The tree has the following properties:

1. The tree has n leaves, labelled 1 ... n, one corresponding to each suffix of s.
2. Each internal node has at least 2 children.
3. Each edge in the tree is labelled with a substring of s.
4. The concatenation of edge labels from the root to the leaf labelled i is suff_i.
5. The labels of the edges connecting a node with its children start with different characters.

The paths from the root to the leaves labelled i and j coincide up to the longest common prefix of suff_i and suff_j, at which point they bifurcate.
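The five properties above can be checked on a small example. The following is a minimal sketch, assuming a naive quadratic-time insertion of one suffix at a time; it is not one of the linear time constructions listed in Section 1.2. The names `Node`, `build_suffix_tree`, `leaves` and `internal_nodes` are hypothetical and chosen only for illustration.

```python
class Node:
    """A suffix tree node (hypothetical helper class, not from the text)."""

    def __init__(self, leaf_label=None):
        # Maps the first character of an edge label to (edge label, child).
        # Keying on the first character enforces property 5.
        self.children = {}
        # 1-based start position of the suffix for a leaf, None otherwise.
        self.leaf_label = leaf_label


def build_suffix_tree(s):
    """Naively insert every suffix of s; s must end with a unique terminator."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                node.children[first] = (rest, Node(leaf_label=i + 1))
                break
            edge, child = node.children[first]
            # Length of the common prefix of the edge label and the remainder.
            k = 0
            while k < len(edge) and k < len(rest) and edge[k] == rest[k]:
                k += 1
            if k == len(edge):
                # The whole edge matches: continue from the child.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it at a new internal node.
            mid = Node()
            mid.children[edge[k]] = (edge[k:], child)
            mid.children[rest[k]] = (rest[k:], Node(leaf_label=i + 1))
            node.children[first] = (edge[:k], mid)
            break
    return root


def leaves(node, path=""):
    """Yield (leaf label, concatenated edge labels from the root)."""
    if node.leaf_label is not None:
        yield node.leaf_label, path
    for edge, child in node.children.values():
        yield from leaves(child, path + edge)


def internal_nodes(node):
    """Yield every non-leaf node, including the root."""
    if node.leaf_label is None:
        yield node
        for _, child in node.children.values():
            yield from internal_nodes(child)


s = "mississippi$"
tree = build_suffix_tree(s)
path_of = dict(leaves(tree))
print(sorted(path_of))            # leaf labels (property 1)
print(path_of[5] == s[4:])        # path to leaf 5 spells suff_5 (property 4)
print(min(len(v.children) for v in internal_nodes(tree)))  # property 2
```

Storing the full edge label as a Python string keeps the sketch short; a practical implementation would more likely store a pair of indices into s for each edge so that the tree uses space linear in n.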
If a suffix of the string is a prefix of another, longer suffix, the shorter suffix must end at an internal node rather than at a leaf, which is not what we want. It is to avoid this possibility that the unique termination character is added to the end of the string. Keeping this in mind, we use the notation ST(s') to denote the suffix tree of the string obtained by appending $ to s'.

0-8493-8597-0/01/$0.00+$1.50 © 2001 by CRC Press, LLC

FIGURE 1.1: Suffix tree, suffix array and Lcp array of the string mississippi. The suffix links in the tree are given by x → z → y → u → r, v → r, and w → r. ...
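The role of the terminator, and the two arrays named in Figure 1.1, can be illustrated with a short sketch. It assumes the usual definitions, which this preview does not spell out: the suffix array SA lists the starting positions of the suffixes in lexicographic order (with $ smaller than every other character), and Lcp[j] is the length of the longest common prefix of the suffixes at SA[j] and SA[j+1]. All function names are hypothetical, and the sorting approach is a naive one rather than a linear time construction.

```python
def some_suffix_prefixes_another(s):
    """True if some suffix of s is a proper prefix of a longer suffix."""
    suffixes = [s[i:] for i in range(len(s))]
    return any(a != b and b.startswith(a) for a in suffixes for b in suffixes)


def suffix_array(s):
    """1-based suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s, sa):
    """Longest common prefix lengths of lexicographically adjacent suffixes."""
    def lcp(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    return [lcp(s[sa[j] - 1:], s[sa[j + 1] - 1:]) for j in range(len(sa) - 1)]


# Without the terminator, the suffix "i" is a prefix of the suffix "ippi".
print(some_suffix_prefixes_another("mississippi"))   # True
print(some_suffix_prefixes_another("mississippi$"))  # False

s = "mississippi$"
sa = suffix_array(s)
print(sa)                # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(s, sa))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```

The sort relies on `$` preceding lowercase letters in ASCII order, which matches the assumed convention that the terminator is the smallest character.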
This note was uploaded on 08/08/2011 for the course CAP 6938 taught by Professor Staff during the Spring '08 term at University of Central Florida.