FASTA_BLAST - An Introduction to Bioinformatics Biological...

An Introduction to Bioinformatics
Biological Database Searching: FASTA, BLAST
Database Searching

Dynamic programming algorithms give “correct” solutions but are very slow unless the sequences are quite short. Current common protein sequence databases contain more than 100 M residues. For a “query” sequence of 1000 residues, we need to evaluate 10^11 matrix cells. Even at 10 M cells/second, that takes 10^4 seconds, or about 3 hours, for just one query.

Goal: search only a small fraction of the possible high-scoring alignments. The vast literature on exact and approximate matching algorithms can be used, but with scoring matrices, distant matches are hard to find.

We need heuristic algorithms. FASTA and BLAST are two such classes of algorithms; BLAST is more popular, but we still use the FASTA data format.
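The cost estimate above can be checked with a few lines of arithmetic. This is only a sketch: the function name `estimate_dp_search_time` is hypothetical, and the figures (100 M residues, a 1000-residue query, 10 M cells per second) are the ones quoted on the slide.

```python
def estimate_dp_search_time(db_residues, query_residues, cells_per_second):
    """Return (matrix cells, seconds, hours) for a full dynamic programming scan.

    Assumes one matrix cell per (query residue, database residue) pair,
    as in the slide's estimate.
    """
    cells = db_residues * query_residues
    seconds = cells / cells_per_second
    hours = seconds / 3600
    return cells, seconds, hours


# Figures quoted on the slide.
cells, seconds, hours = estimate_dp_search_time(
    db_residues=100_000_000,      # 100 M residues in the database
    query_residues=1_000,         # query of 1000 residues
    cells_per_second=10_000_000,  # 10 M cells per second
)
print(f"{cells:.0e} cells, {seconds:.0f} s, {hours:.1f} h")
```

The result is a little under 3 hours per query, which is why an exhaustive scan is impractical for routine searches.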
Database searching

Core: pairwise alignment algorithm
Speed (fast sequence comparison)
Relevance of the search results (statistical tests)
Recovering all information of interest

The results depend on the search parameters, such as the gap penalty and the scoring matrix. Sometimes searches with more than one matrix should be performed.
The Basic Idea

“Filtration”: find short sequences in the database that match segments of the query sequence. Use these short stretches of matches (found using a hash table or a suffix tree) as “seeds” from which to extend outward to get good longer alignments.

Approximate Pattern Matching Problem: Given a pattern P = p_1 p_2 ... p_m and a text T = t_1 t_2 ... t_n, m <= n, find all positions in the text such that P occurs in T beginning at these positions with at most k mismatches.
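As a baseline for the Approximate Pattern Matching Problem, the following sketch simply slides the pattern along the text and counts mismatches in every window. It is a brute-force reference, not the filtration method; the function names are hypothetical, and positions are reported 1-based to match the slide's notation.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def approximate_pattern_matching(pattern, text, k):
    """Return all 1-based positions j where pattern occurs in text
    starting at j with at most k mismatches (O(m * n) time)."""
    m, n = len(pattern), len(text)
    return [
        j + 1  # convert the 0-based window start to a 1-based position
        for j in range(n - m + 1)
        if hamming_distance(pattern, text[j:j + m]) <= k
    ]


print(approximate_pattern_matching("ATG", "ATGCATCATA", 1))  # [1, 5, 8]
```

With k = 0 this reduces to exact matching, and the same call returns only position 1.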
Approximate Query Matching

Find all substrings of a query that approximately match a substring of a text string.

Given
1. two integers m and k,
2. a query of length q, Q = x_1 x_2 ... x_q,
3. a text string of length n, T = t_1 t_2 ... t_n,

find all pairs of positions (i, j), 1 <= i <= q-m+1 and 1 <= j <= n-m+1, such that the m-letter substring of Q starting at position i approximately matches the m-letter substring of T starting at position j, with at most k mismatches.

[Figure: the m-letter window starting at position i of the query Q (length q) aligned with the m-letter window starting at position j of the text T.]
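The problem statement above translates directly into a brute-force search over all window pairs. The sketch below is meant only to make the definition concrete; the function name `approximate_query_matching` is hypothetical, and the cost is on the order of q * n * m character comparisons, which is what filtration aims to avoid.

```python
def approximate_query_matching(query, text, m, k):
    """Return all 1-based pairs (i, j) such that the m-letter substring of
    query starting at i and the m-letter substring of text starting at j
    differ in at most k positions."""
    q, n = len(query), len(text)
    pairs = []
    for i in range(q - m + 1):
        q_window = query[i:i + m]
        for j in range(n - m + 1):
            t_window = text[j:j + m]
            mismatches = sum(x != y for x, y in zip(q_window, t_window))
            if mismatches <= k:
                pairs.append((i + 1, j + 1))  # report 1-based positions
    return pairs


print(approximate_query_matching("ACGT", "ACCT", m=3, k=1))  # [(1, 1), (2, 2)]
```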
Filtration Algorithm

Stage 1: Preselect a set of positions in the text that are potentially similar to the query.
Stage 2: Verify each potential position and reject it if the number of mismatches exceeds k.

The underlying idea is that if an m-letter substring of Q approximately matches an m-letter substring of T, then the two substrings share at least one l-mer for an appropriately large value of l, given in the following theorem. All l-mers shared by the query and the text can be found by hashing. If the number of shared l-mers is relatively small, then all matches with at most k mismatches can be found rapidly.

[In the text, the symbol n has been used for m. Make a note of this.]
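A minimal sketch of the two stages follows. The theorem slide is not part of this preview, so the l-mer length used here is an assumption: l = floor(m / (k + 1)), the usual pigeonhole bound (split an m-letter window into k + 1 blocks of length l; with at most k mismatches, at least one block matches exactly). The function name `filtration_search` and the use of a Python dictionary as the hash table are likewise illustrative choices.

```python
from collections import defaultdict


def filtration_search(query, text, m, k):
    """Return all 1-based pairs (i, j) whose m-letter windows in query and
    text differ in at most k positions, using l-mer filtration."""
    l = m // (k + 1)  # assumed l-mer length (pigeonhole bound)
    if l == 0:
        raise ValueError("need m >= k + 1 so that the l-mer length is at least 1")
    q, n = len(query), len(text)

    # Stage 1a: hash every l-mer of the text to its 0-based start positions.
    index = defaultdict(list)
    for b in range(n - l + 1):
        index[text[b:b + l]].append(b)

    # Stage 1b: each l-mer shared at (a, b) nominates every window pair
    # (i, j) that contains it at the same offset d in both windows.
    candidates = set()
    for a in range(q - l + 1):
        for b in index.get(query[a:a + l], ()):
            for d in range(m - l + 1):
                i, j = a - d, b - d
                if 0 <= i <= q - m and 0 <= j <= n - m:
                    candidates.add((i, j))

    # Stage 2: verify each candidate and reject it if mismatches exceed k.
    matches = []
    for i, j in sorted(candidates):
        mismatches = sum(x != y for x, y in zip(query[i:i + m], text[j:j + m]))
        if mismatches <= k:
            matches.append((i + 1, j + 1))  # report 1-based positions
    return matches


print(filtration_search("ACGT", "ACCT", m=3, k=1))  # [(1, 1), (2, 2)]
```

Verification only touches window pairs that share an l-mer, so the saving over brute force grows as shared l-mers become rarer, that is, with larger l or a larger alphabet.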
FASTA
This note was uploaded on 06/12/2011 for the course CAP 5510 taught by Professor Staff during the Spring '08 term at University of Central Florida.
