repeatscout-ismb - De novo identification of repeat...

De novo identification of repeat families in large genomes
Alkes L. Price, Neil C. Jones and Pavel A. Pevzner
June 28, 2005

What is a repeat family?
A repeat family is a collection of similar sequences which appear many times in a genome. For example, the Alu repeat family has over 1 million approximate occurrences in the human genome:
[Figure: Alu occurrences along a genome]

Identifying repeat families: problem formulation
[Figure: Alu occurrences along a genome]
INPUT: Genome containing approximate Alu occurrences
OUTPUT: 282bp Alu consensus sequence GGCCGGGCGCGGTGGCTCACG..GCGAGACTCCGTCTC + consensus sequences of all other repeat families in genome

Identifying repeat families: an easy problem?
[Figure: Alu occurrences stacked to give the Alu consensus]
Difficulties:
  Regions containing repeat occurrences are not known a priori
  Repeat boundaries are not known a priori
  Many repeat occurrences appear as partial copies

Identifying repeat families: a difficult problem
"The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack." (Bao and Eddy, 2002)
In this talk, we present a simple and efficient algorithm for solving this problem.

Why is identifying repeat families important?
1. Repeats are biologically meaningful
Repeats are drivers of genome evolution (Kazazian, 2004) which can play a beneficial (rather than parasitic) role (Holmes, 2002). In particular, repeats have been implicated in:
  Genome rearrangements (Kazazian, 2004)
  Drift to new biological function (Kidwell and Lisch, 2001)
  Increased rate of evolution under stress (Capy et al, 2000)

Why is identifying repeat families important?
2. Repeat masking
Repeats need to be masked prior to performing most single-species or multi-species analyses.
"Every time we compare two species that are closer to each other than either is to humans, we get nearly killed by unmasked repeats." Webb Miller (personal communication)

Why is identifying repeat families important?...
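
To make the "easy problem" version of the formulation concrete, here is a minimal sketch of what consensus finding would look like if every repeat occurrence were already located, trimmed to its boundaries, and aligned to equal length. This is not the algorithm presented in the talk; the function name majority_consensus and the toy sequences are assumptions made purely for illustration. The listed difficulties (unknown regions, unknown boundaries, partial copies) are exactly the preconditions this sketch takes for granted.

```python
from collections import Counter


def majority_consensus(occurrences):
    """Column-wise majority vote over pre-aligned, equal-length occurrences.

    Hypothetical helper: assumes the occurrences are already found,
    trimmed, and aligned, which is not true for a real genome.
    """
    if not occurrences:
        raise ValueError("need at least one occurrence")
    length = len(occurrences[0])
    if any(len(seq) != length for seq in occurrences):
        raise ValueError("occurrences must be pre-aligned to equal length")

    consensus = []
    for column in zip(*occurrences):
        counts = Counter(column)
        # Most frequent base wins; ties are broken alphabetically
        # so the result is deterministic.
        best = min(counts, key=lambda base: (-counts[base], base))
        consensus.append(best)
    return "".join(consensus)


# Toy data (made up, not real Alu sequence): three noisy copies.
toy_occurrences = ["GGCCGGGC", "GGCTGGGC", "GACCGGGC"]
print(majority_consensus(toy_occurrences))
```

The equal-length check fails as soon as one occurrence is a partial copy or has a misplaced boundary, which is why the real problem calls for more than a column vote.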

This note was uploaded on 04/06/2010 for the course COMPUTER S COMP5647 taught by Professor Dr.ping during the Spring '10 term at York University.
