Practical Bioinformatics for Life Scientists
Week 15, Lecture 29
István Albert
Bioinformatics Consulting Center, Penn State

Metagenomics: data analysis
•  It is a new application domain
•  This has benefits but also brings frustration
•  Historically, grandiose statements about nascent technologies have usually been exaggerated

Two big packages
•  QIIME: Quantitative Insights Into Microbial Ecology (Nature Methods, 2010)
   –  It is a Python-based "glue" that connects a series of external tools
   –  Has a very long list of programs that need to be installed: BLAST, RDP, CD-HIT, UCLUST, FastTree, etc.
•  mothur (Appl. Environ. Microbiol., 2009), by Patrick Schloss
   –  A single binary file that contains all the functionality

QIIME: many authors; published in Nature Methods, 2010. Has a Google Group forum that seems very active.

mothur: by Dr. Patrick Schloss at the University of Michigan. Workshops on mothur are held twice a year in Detroit (highly recommended).

mothur is a unique and amazing piece of software
•  There is no other bioinformatics software like it → a design reminiscent of MATLAB or Mathematica
•  Written by a tiny team, yet it implements very advanced methods and techniques!
•  You can finish the entire analysis/paper by using only mothur (plot with R)
•  Excellent documentation (though some parts refer to previous mothur releases and so may be out of sync)

mothur wiki

OTU based approaches
•  Characterize the similarity and frequency distribution of the sequences to evaluate ecological features:
   –  Richness
   –  Diversity
   –  Similarity
Depends on our ability to compare sequences!

NAST: Nearest Alignment Space Termination algorithm
•  DeSantis et al., Nucleic Acids Research (2006)
Creates MSAs (multiple sequence alignments)
1.  The template sequences are pre-aligned relative to one another.
2.  Query (candidate) sequences are then also aligned to the MSA and are therefore comparable relative to one another.

Example of NAST compression of a BLAST pairwise alignment using a 38 character aligned template. DeSantis T Z et al.
Nucl. Acids Res. 2006;34:W394-W399. © The Author 2006. Published by Oxford University Press.

Distances in an MSA
•  We can compute the number of mismatches + gaps between any two sequences
•  Once we have a distance, we can cluster with this metric
•  Sequences that cluster at a certain level form an operational taxonomic unit (OTU)

Metagenomics "calculators"
Analogous to the game "Mastermind": one tries to deduce the truth with simple statements/guesses that reduce the possibilities after each step.
Example:
•  Are the abundances the same?
•  Yes → no difference between samples
•  No → is the dominance of the most common species the same?
   1.  Yes → so the differences all come from the less common species
   2.  No → is the dominance different because there are fewer species, more species, or the same species but with different abundances?

Calculators in mothur

Simpson's paradox
Most famous example: the Berkeley gender bias case
•  Total admission rates seem to statistically favor men
•  Admission rates within the largest departments seem to statistically favor women
It is difficult to recognize this class of problems, yet we face them all the time in metagenomics, where we constantly have to compare across groups, memberships and abundances.

Proper experimental design
•  Metagenomics studies critically depend on controls and experimental design, more so than any other type of study
•  The parameter space is huge:
   –  contaminants, chimeric reads and sequencing errors will all look like new bacterial species
   –  the data is sparse and noisy: lots of dead ends
•  Search for: SOP on the mothur webpage (SOP: standard operating procedure)
   The SOP advocates dedicating two barcodes to two different controls:
   a)  a mock (prebuilt) community
   b)  a known, realistic sample that you always re-sequence for every experiment

mothur in action
Get the data from the course webpage ...
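
Before running mothur, the idea from the "Distances in an MSA" and "OTU based approaches" slides can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not mothur's implementation: the sequence names, the tiny alignment, the 0.15 cutoff and all helper functions are made up for this example. Each gap is counted like a mismatch, single-linkage clustering stands in for whatever clustering method a real tool applies, and only two simple "calculators" (richness and the Shannon index) are shown.

```python
from itertools import combinations
from math import log


def pairwise_distance(a, b):
    """Fraction of aligned columns where a and b differ (mismatch or gap).

    Columns where both sequences have a gap are skipped (toy convention).
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def cluster_otus(seqs, cutoff):
    """Single-linkage clustering: join sequences whose distance is <= cutoff."""
    parent = {name: name for name in seqs}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for p, q in combinations(seqs, 2):
        if pairwise_distance(seqs[p], seqs[q]) <= cutoff:
            parent[find(p)] = find(q)

    otus = {}
    for name in seqs:
        otus.setdefault(find(name), []).append(name)
    # Largest OTU first.
    return sorted(otus.values(), key=len, reverse=True)


def richness(otus):
    """Number of OTUs observed."""
    return len(otus)


def shannon(otus):
    """Shannon diversity index, H = -sum(p_i * ln p_i)."""
    total = sum(len(otu) for otu in otus)
    return -sum((len(otu) / total) * log(len(otu) / total) for otu in otus)


# Hypothetical pre-aligned sequences (all the same length, '-' marks a gap).
aligned = {
    "seq1": "ACGT-ACGTA",
    "seq2": "ACGT-ACGTT",
    "seq3": "ACGTTACGTA",
    "seq4": "TTGTAACCAA",
}

otus = cluster_otus(aligned, cutoff=0.15)
print("OTUs:", otus)            # [['seq1', 'seq2', 'seq3'], ['seq4']]
print("richness:", richness(otus))
print("Shannon:", round(shannon(otus), 3))
```

With real data the alignment is much longer and the cutoff much smaller (0.03 is a common convention for OTUs); the large cutoff here only compensates for the ten-column toy alignment.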
