ieeeBioInfo2002dnaBWT - DNA Sequence Compression using the...

1 DNA Sequence Compression using the Burrows-Wheeler Transform
  Don Adjeroh, Yong Zhang
    Computer Science and Electrical Engineering, West Virginia University, Morgantown, USA
  Amar Mukherjee
    Computer Science, University of Central Florida, Orlando, USA
  Matt Powell and Tim Bell
    Computer Science, University of Canterbury, Christchurch, New Zealand
  August 16, 2002
2 Outline
  Introduction and The Problem
  Background
  Overview of Approach
  BWT and Repeat Analysis
  Parsing Strategies
  Results
  Conclusion
3 Introduction
  DNA as an information storage medium
  Draft sequence of the human genome now available
  Complete genomes available for many other organisms
  Implications and Possibilities
    Genome-wide analysis of entire genomes
    Cross-genome analysis with complete genomes
    Drug discovery and medical science
    Potential cure for gene-related diseases, such as sickle-cell anemia, Huntington’s disease, Fragile-X mental retardation syndrome, cancer …
4 The problem
  Size!
5 The problem of size ...
  Some important genomes are on the order of billions of base pairs
  Exponential growth in the number of complete genomes
  Exponential growth in the size of available DNA sequence data
  We need ...
    Efficient and effective algorithms for sequence analysis and interpretation
    Efficient techniques for management, organization, and distribution of huge-volume sequence data
  Source: GenBank website
6 Nature of DNA Sequences
  Four types of nucleotide bases: A - adenine, C - cytosine, G - guanine, T - thymine
  Various forms of repetitions
    Direct and tandem repeats
    Reverse repeats, complemented repeats, palindromes
    Combinations
    Approximate repeats
  Introns and Exons
    Coding areas - generally less repetitive
    Non-coding areas - generally more redundant
    Non-coding areas make up > 50% of genomes
  DNA - only part of the whole story
  [Figure: example sequence divided into segments P1 to P7; recoverable segment text: AACTGTCAA, AA, GTCAA, AACTG, TTGACAGTT, AA]
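The repeat types named on this slide can be made concrete with a short Python sketch. This is an illustration only, not the authors' repeat-analysis method: the function names are hypothetical, and the test strings are taken from the figure text on the slide (AACTGTCAA, GTCAA, AACTG, TTGACAGTT). "Palindrome" is shown in both senses: a string equal to its own reverse, and the biological sense of a string equal to its own reverse complement.

```python
# Hypothetical helper names for illustration; none of these come from the slides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")  # Watson-Crick pairing: A<->T, C<->G


def reverse(seq: str) -> str:
    """Reverse repeat: the same bases read backwards."""
    return seq[::-1]


def complement(seq: str) -> str:
    """Complemented repeat: each base swapped for its pairing partner."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Combination of the two: complement, then read backwards."""
    return complement(seq)[::-1]


def is_text_palindrome(seq: str) -> bool:
    """True if the sequence equals its own reverse."""
    return seq == reverse(seq)


def is_dna_palindrome(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


def tandem_copies(seq: str, unit: str) -> int:
    """Count back-to-back copies of `unit` at the start of `seq`."""
    if not unit:
        raise ValueError("unit must be non-empty")
    count = 0
    while seq.startswith(unit, count * len(unit)):
        count += 1
    return count


if __name__ == "__main__":
    print(reverse("AACTG"))                 # GTCAA
    print(complement("AACTGTCAA"))          # TTGACAGTT
    print(is_text_palindrome("AACTGTCAA"))  # True
    print(is_dna_palindrome("GAATTC"))      # True
    print(tandem_copies("ACACACGT", "AC"))  # 3
```

Under these definitions, GTCAA is the reverse of AACTG and TTGACAGTT is the complement of AACTGTCAA, which matches the strings shown in the slide's figure. Approximate repeats would need an edit-distance or mismatch tolerance, which this sketch leaves out.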
7 DNA Sequence Compression
  General Data Compression
    Symbol-wise substitution (Huffman codes)
    Dictionary-based (LZ family)
    Context-based (BWT and PPM)
  DNA Sequence Compression
    4 symbols (A, C, G, T): at most 2 bpc (bits per character) on average
    Generally dictionary-based
    Exploit the different forms of repeat
  Key Issues
    Speedy identification of the repeats
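The 2 bpc figure follows from the alphabet size: four symbols need log2(4) = 2 bits each under a fixed-length code, so any DNA-specific compressor has to do better than that to be worthwhile. A minimal Python sketch of this baseline is given below. The particular bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for the example, not something the slides specify.

```python
# Assumed 2-bit code for illustration; the slides do not prescribe a mapping.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"  # index in this string is the 2-bit code


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (2 bits per base)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte has the same layout.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


if __name__ == "__main__":
    seq = "AACTGTCAA"
    packed = pack(seq)
    # 9 bases at 2 bits each is 18 bits, which rounds up to 3 bytes.
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack(packed, len(seq)) == seq)
```

The sequence length has to be stored separately because the last byte may be padded. The sketch handles only the four unambiguous bases; real sequence files can contain other symbols such as N, which would need an escape mechanism.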

This note was uploaded on 06/12/2011 for the course CAP 5510 taught by Professor Staff during the Spring '08 term at University of Central Florida.
