Lec07-suffixtrees

Lec07-suffixtrees - Suffix Trees CMSC 423 Preprocessing...


Suffix Trees
CMSC 423

Preprocessing Strings

Over the next few lectures, we'll see several methods for preprocessing string data into data structures that make many questions (like searching) easy to answer:
- Suffix tries
- Suffix trees
- Suffix arrays
- Burrows-Wheeler transform

Typical setting: a long, known, and fixed text string (like a genome) and many unknown, changing query strings. We are allowed to preprocess the text string once in anticipation of the future unknown queries.

The data structures will be useful in other settings as well.

Suffix Tries

- A trie, pronounced "try", is a tree that exploits some structure in the keys. E.g., if the keys are strings, a binary search tree would compare the entire strings, but a trie would look at their individual characters.
- Suffix tries are a space-efficient data structure to store a string that allows many kinds of queries to be answered quickly.
- Hugely important for searching large sequences like genomes. The basis for a tool called MUMmer (developed by UMD faculty).

Suffix Tries

s = abaaba$

[Figure: suffix trie for s = abaaba$]

SufTrie(s) = suffix trie representing string s.

Edges of the suffix trie are labeled with letters from the alphabet (say {A,C,G,T}).

Every path from the root to a solid node represents a suffix of s. Every suffix of s is represented by some path from the root to a solid node.

Why are all the solid nodes leaves? How many leaves will there be? How many nodes can the trie have?

Processing Strings Using Suffix Tries

Given a suffix trie T, and a string q, how can we:
- determine whether q is a substring of T?
- check whether q is a suffix of T?
- count how many times q appears in T?
- find the longest repeat in T?
- find the longest common substring of T and q?

Main idea: every substring of s is a prefix of some suffix of s.

Searching Suffix Tries

s = abaaba$

[Figure: suffix trie for s = abaaba$]

Is baa a substring of s? Follow the path given by the query string.

After we've built the suffix trees, queries can be answered in time O(|query|), regardless of the text size.

Applications of Suffix Tries (1)

- Check whether q is a suffix of T:
- Check whether q is a substring of T: Follow the path for q starting from the root....
- Count # of occurrences of q in T:
- Find the longest repeat in T:
- Find the lexicographically (alphabetically) first suffix:
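The queries above can be illustrated with a small sketch. The code below is not taken from the lecture; it is a minimal, naive suffix trie (quadratic time and space to build) written only to show the "follow the path from the root" idea. The names Node, build_suffix_trie, walk, is_substring, is_suffix, count_occurrences, and longest_repeat are all hypothetical, and the per-node count field is one assumed way to answer the counting question. Queries are assumed not to contain the terminator $.

```python
class Node:
    def __init__(self):
        self.children = {}  # edge label (one character) -> child Node
        self.count = 0      # number of suffixes whose path passes through this node


def build_suffix_trie(s):
    """Naively insert every suffix of s into a trie; s must end in '$'."""
    assert s.endswith("$")
    root = Node()
    for i in range(len(s)):
        node = root
        node.count += 1
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root


def walk(root, q):
    """Follow the path spelled by q from the root; return None if we fall off."""
    node = root
    for ch in q:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def is_substring(root, q):
    # Every substring of s is a prefix of some suffix of s.
    return walk(root, q) is not None


def is_suffix(root, q):
    # q is a suffix exactly when its path can be extended by the terminator.
    node = walk(root, q)
    return node is not None and "$" in node.children


def count_occurrences(root, q):
    # Each suffix passing through q's node marks one starting position of q.
    node = walk(root, q)
    return 0 if node is None else node.count


def longest_repeat(root):
    """Deepest path whose end node is shared by at least two suffixes."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            if child.count >= 2:
                stack.append((child, path + ch))
    return best


root = build_suffix_trie("abaaba$")
print(is_substring(root, "baa"))       # True
print(is_suffix(root, "ba"))           # True
print(count_occurrences(root, "aba"))  # 2
print(longest_repeat(root))            # aba
```

Each query costs time proportional to the length of q once the trie exists, which matches the O(|query|) bound stated above; only the construction depends on the text length.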