Notes About Demo: How You Specify the Mate-Pair Information


Even using mate-pair information, the de Bruijn graph is highly tangled. There are the following options for detangling the de Bruijn graph:
1.  Error correction of reads
2.  Bulge and whirl removal
PROBLEM! Both inevitably end up causing errors rather than correcting them.

Assembly Demonstration

Notes About Demo
•  We used Velvet because it's the simplest to use.
•  Write a shell script to run the assembler, both to keep track of the parameters you used and to avoid writing out the command each time.
•  Almost all assemblers require you to specify the following:
   –  The value of k, whether the data is paired-end, the insert length, and the minimum contig length
   –  And sometimes, whether it is single-cell data.
•  How you specify the mate-pair information varies from assembler to assembler. You have to read the manual and write a (Perl) script to put the data in the correct format!
•  Assemblers can be challenging programs to run. All of them have intricacies, even in the installation of the program.
•  Therefore, running an assembler requires:
   1.  Some knowledge of Unix/Linux commands.
   2.  Access to a server with large amounts of memory (64 GB for small bacterial genomes, 512 GB for larger genomes).
•  Be aware that your assembler may not always produce decent results. Can you tell if you did? Yes.

Assembly Evaluation

What has been sequenced?
•  We've sequenced a number of genomes, but several genomes remain difficult.
•  Plant genomes are very hard because they are extremely long, contain huge repeat regions, and are polyploid.
•  Note: we do not distinguish between genotypes… that is a separate problem.

Organism                    Type            Genome Size    No. of predicted genes
Homo sapiens                Human           3.2 Gb         20,251
Takifugu rubripes           Puffer fish     390 Mb         22-29,000
Oryza sativa                Rice            420 Mb         32-50,000
Anopheles gambiae           Mosquito        278 Mb         13,700
Saccharomyces cerevisiae    Baker's yeast   12.1 Mb        6,200
Cucumis sativus             Cucumber        367 Mb         27,000

•  kb (= kbp) = kilo base pairs = 1,000 bp
•  Mb = mega base pairs = 1,000,000 bp
•  Gb = giga base pairs = 1,000,000,000 bp

Assembly Evaluation
•  How can we tell the difference between a good assembly and a bad assembly?
   –  Answer: the N50 statistic, which is a metric of the length of a set of sequences, with greater weight given to longer sequences.
   –  Given a set of sequences of varying lengths, the N50 length is defined as the length N for which half of all bases in the sequences are in a sequence of length L < N.
   –  The published definitions of the N50 value contradict one another in places.

Calculating N50
Alternative definition: the largest entity E such that at least half of the total size of the entities is contained in entities larger than E.
1.  Read the FASTA file and calculate each sequence length.
2.  Sort the lengths in descending order.
3.  Calculate the total size.
4.  Calculate N50.

Other Evaluations
•  Number of insertion, deletion, and substitution errors in an assembly
•  Misassembly of contigs (chimeric indels) >= 500 bp

Next Lecture

Detangling the de Bruijn Graph
Even using mate-pair information, the de Bruijn graph is highly tangled. There are the following options for detangling the de Bruijn graph:
1.  Error correction of reads
2.  Bulge and whirl removal...
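
The four steps listed under "Calculating N50" can be sketched in Python roughly as follows. This is only an illustration, not the code used in the lecture: the function names fasta_lengths and n50 are made up here, and the cutoff rule (return the first length at which the running total reaches at least half of all bases, walking from longest to shortest) is one common convention chosen as an assumption, since the notes point out that the definitions differ.

```python
from typing import Iterable, Iterator


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Step 1: yield the length of each sequence in FASTA-formatted lines."""
    length = None  # None until the first ">" header line has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # finish the previous record
            length = 0
        elif line and length is not None:
            length += len(line)  # sequences may span several lines
    if length is not None:
        yield length  # finish the last record


def n50(lengths: Iterable[int]) -> int:
    """Steps 2-4: sort descending, total the sizes, find the N50 length.

    Assumed convention: the length of the sequence at which the running
    sum, taken from longest to shortest, first covers at least half of
    the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)  # step 2
    total = sum(ordered)                     # step 3
    running = 0
    for length in ordered:                   # step 4
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequences given")
```

To try it on an assembler's output, pass an open file handle to fasta_lengths and feed the result to n50, for example n50(fasta_lengths(open("contigs.fa"))), where the file name is a placeholder for whatever your assembler writes. Swapping ">=" for ">" in the comparison gives a stricter variant of the definition, which is one way the published definitions end up disagreeing.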

This note was uploaded on 02/10/2014 for the course CS 680 taught by Professor Staff during the Fall '08 term at Colorado State.
