Suffix Trees
CMSC 423

Preprocessing Strings

• Over the next few lectures, we’ll see several methods for preprocessing string data into data structures that make many questions (like searching) easy to answer:
  • Suffix Tries
  • Suffix Trees
  • Suffix Arrays
  • Burrows-Wheeler transform
• Typical setting: a long, known, and fixed text string (like a genome) and many unknown, changing query strings.
• Allowed to preprocess the text string once in anticipation of the future unknown queries.
• Data structures will be useful in other settings as well.

Suffix Tries

• A trie, pronounced “try”, is a tree that exploits some structure in the keys. E.g., if the keys are strings, a binary search tree would compare the entire strings, but a trie would look at their individual characters.
• Suffix tries are a data structure for storing a string that allows many kinds of queries to be answered quickly; their compressed form, the suffix tree, is the space-efficient version.
• Hugely important for searching large sequences like genomes.
• The basis for a tool called “MUMmer” (developed by UMD faculty).

Suffix Tries

s = abaaba$

[Figure: suffix trie for s = abaaba$, with solid nodes marking the ends of suffixes]

• SufTrie(s) = suffix trie representing string s.
• Edges of the suffix trie are labeled with letters from the alphabet Σ (say {A,C,G,T}).
• Every path from the root to a solid node represents a suffix of s.
• Every suffix of s is represented by some path from the root to a solid node.

Why are all the solid nodes leaves? How many leaves will there be? How many nodes can the trie have?

Processing Strings Using Suffix Tries

Given a suffix trie for a text T, and a query string q, how can we:
• determine whether q is a substring of T?
• check whether q is a suffix of T?
• count how many times q appears in T?
• find the longest repeat in T?
• find the longest common substring of T and q?

Main idea: every substring of s is a prefix of some suffix of s.

Searching Suffix Tries

s = abaaba$

[Figure: suffix trie for s = abaaba$]

Is “baa” a substring of s? Follow the path given by the query string.
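The construction and the “follow the path” search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: the trie is a nested dict with one edge per character, it is built naively by inserting every suffix (quadratic time and space), and the names `build_suffix_trie`, `follow_path`, `is_substring`, `count_leaves`, and `count_nodes` are hypothetical helpers introduced here.

```python
def build_suffix_trie(s):
    """Naive suffix trie as nested dicts (assumed representation).

    Each dict is a node; each key is a one-character edge label.
    s is assumed to end with a unique terminator '$'.
    """
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:                 # insert the suffix starting at i
            node = node.setdefault(ch, {})
    return root


def follow_path(root, q):
    """Walk q from the root; return the node reached, or None if we fall off."""
    node = root
    for ch in q:
        if ch not in node:
            return None
        node = node[ch]
    return node


def is_substring(root, q):
    """q is a substring iff it is a prefix of some suffix, i.e. a path from the root."""
    return follow_path(root, q) is not None


def count_leaves(node):
    """A node with no outgoing edges is a leaf (the end of one suffix)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_nodes(node):
    """Count this node plus everything below it."""
    return 1 + sum(count_nodes(child) for child in node.values())


trie = build_suffix_trie("abaaba$")
print(is_substring(trie, "baa"))   # True
print(is_substring(trie, "bab"))   # False
print(count_leaves(trie))          # 7, one leaf per suffix
print(count_nodes(trie))           # 22, including the root
```

Because every suffix ends in the unique `$`, no suffix is a prefix of another, so each suffix ends at its own leaf. That is why the leaf count equals the number of suffixes.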
Once the suffix trie is built, a query q can be answered in O(|q|) time, regardless of the text size.

Applications of Suffix Tries (1)

• Check whether q is a suffix of T:
• Check whether q is a substring of T: Follow the path for q starting from the root....
• Count # of occurrences of q in T:
• Find the longest repeat in T:
• Find the lexicographically (alphabetically) first suffix:
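The preview cuts off before the slide's answers, so the following Python is only one plausible way to answer these queries on a suffix trie, not necessarily the approach the lecture gives. It assumes the same nested-dict trie as a naive construction would produce, a text terminated by `$`, and that `$` sorts before every other letter (true for ASCII letters). All function names are hypothetical.

```python
def build_suffix_trie(s):
    """Naive nested-dict suffix trie; s is assumed to end with '$'."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def walk(root, q):
    """Follow q from the root; None if the path does not exist."""
    node = root
    for ch in q:
        node = node.get(ch)
        if node is None:
            return None
    return node


def is_suffix(root, q):
    """q is a suffix iff its path exists and can be ended immediately by '$'."""
    node = walk(root, q)
    return node is not None and "$" in node


def count_leaves(node):
    """Leaves are nodes with no outgoing edges."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_occurrences(root, q):
    """Each leaf below q's node is one suffix that starts with q."""
    node = walk(root, q)
    return 0 if node is None else count_leaves(node)


def longest_repeat(root):
    """Deepest node with at least two children: its path occurs at least twice."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(node) >= 2 and len(prefix) > len(best):
            best = prefix
        for ch, child in node.items():
            stack.append((child, prefix + ch))
    return best


def first_suffix(root):
    """Always take the smallest edge label until a leaf is reached."""
    node, out = root, []
    while node:
        ch = min(node)
        out.append(ch)
        node = node[ch]
    return "".join(out)


trie = build_suffix_trie("abaaba$")
print(is_suffix(trie, "aba"))          # True
print(is_suffix(trie, "ab"))           # False
print(count_occurrences(trie, "a"))    # 4
print(longest_repeat(trie))            # aba
print(first_suffix(trie))              # $
```

With `$` as the smallest character, the lexicographically first suffix is always `$` itself. Counting occurrences this way takes time proportional to the subtree size; storing a leaf count at each node during preprocessing would be one way to avoid that.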
 Fall '07
 staff
 Algorithms on strings, Suffix tree
