lecture-28 - Practical Bioinformatics for Life ...


Practical Bioinformatics for Life Scientists
Week 14, Lecture 28
István Albert
Bioinformatics Consulting Center, Penn State

Final project

A group of researchers are interested in studying protein binding locations in a yet unknown genome. They hypothesize that observed phenotypic variations will correlate with sequence variation of the bound locations in this genome. The researchers perform two experimental procedures.

1.  First, they sequence the whole genome of the wild-type (normal, wt) phenotype with a paired-end sequencing technology.
2.  Second, they perform a chromatin immunoprecipitation experiment that isolates DNA fragments corresponding to bound proteins. This second experiment is performed for both the wild-type (wt) phenotype and the mutant phenotype (m1).

Final project data (see course webpage)

1.  The paired-end whole-genome sequencing data is stored in datasets r1.fq and r2.fq
2.  ChIP-Seq data for the wild-type sample is stored in file p1.fq
3.  ChIP-Seq data for the mutant sample is stored in file p2.fq

Questions that need to be answered

•  What is the estimated size of your genome?
•  How many binding locations can you detect for each of the ChIP-Seq datasets? Does the number of binding locations vary?
•  Can you observe any genomic variation between phenotypes in the bound locations?
•  Include an IGV screenshot of the location of one binding site

Tips:
1.  There are many ways to solve the project.
2.  There are fewer than 10 binding sites.
3.  At the bare minimum, all questions above may be solved with three tools: velvet + bwa + samtools

Due date: next Thursday

•  Project due by next Thursday (Dec 8th)
•  Project-related office hours next Monday and Wednesday between 2 and 3pm
•  Turn in all homework you might have missed (partial credit will be given)

Classification of metagenomics data

•  One of the first questions any life scientist asks – what species is present in my data?
•  Yet it may be the right question – just putting a label on it makes the data look more precise than it might be
•  The current state of metagenomics does not lend itself to accurate species-level characterization
   –  too many unknown bacteria
   –  too many unknown systematic effects
   –  too many improperly designed experiments

We'll use and compare two classification approaches

•  BLAST → LCA (lowest common ancestor) → visualize with MEGAN (Metagenome Analyzer)
•  Probabilistic (Bayesian) classification for 16S rRNA data via the RDP multiclassifier
•  Download the data for lecture 28 from the course webpage

MEGAN – Metagenome Analyzer

Java based, with a neat graphical user interface

Find and download the 16S rRNA BLAST database

Run blast to generate the alignments

We have two samples, s1.fa and s2.fa, that correspond to two conditions.
You can limit the number of alignments and use more threads if your computer can handle that → speeds up the process considerably

Our blast files

The blast output will need to be recognized by MEGAN to connect it to a taxonomy (we need to use a gi to taxonomy mapper).
Neat feature: your original sequence names may also contain the taxonomy as a list, in which case MEGAN will parse that out.

> [0]Bacteria;[1]Bacteroidetes;[2]Bacteroidia;[3]Bacteroidales;something

In that case no additional taxonomic information is needed.

Result of the classification

You can also load up the s1.rma file directly

Classify at RDP

Note: this classifies the sequences directly! No alignment step needed!

Results of the classification

RDP multiclassifier from the command line

Find and download it from the classifier webpage.
It can combine the output for each file

Investigating a bit more

How do these methods really work?

MEGAN → lowest common ancestor (read the MEGAN manual for more details)
→ but operates solely on sequence similarity
- extract reads that map to a certain taxon and analyze the alignments

RDP → breaks each sequence into words and computes the likelihood that a word comes from a certain taxon (it only works for 16S rRNA data!) ...
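
The two ideas described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the actual MEGAN or RDP implementation: the function names (parse_lineage, lowest_common_ancestor, words, train, classify), the toy reference sequences, and the simple +0.5 smoothing are all made up for illustration. The real tools add score thresholds, word priors and bootstrap confidence estimates that are left out here. The default word length of 8 is an assumption that is not stated in the slides; the demo uses 4 so the toy sequences stay short.

```python
import math
import re
from collections import Counter


def parse_lineage(header):
    """Turn '> [0]Bacteria;[1]Bacteroidetes;...' into a list of taxon names."""
    parts = header.lstrip("> ").split(";")
    return [re.sub(r"^\[\d+\]", "", part) for part in parts]


def lowest_common_ancestor(lineages):
    """Return the longest root-to-leaf prefix shared by all lineages."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def words(seq, k=8):
    """Break a sequence into overlapping words of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=8):
    """For each taxon, count how many reference sequences contain each word."""
    model = {}
    for taxon, seqs in references.items():
        counts = Counter()
        for seq in seqs:
            counts.update(set(words(seq, k)))
        model[taxon] = (counts, len(seqs))
    return model


def classify(seq, model, k=8):
    """Pick the taxon with the highest summed log-likelihood of the query words."""
    query_words = set(words(seq, k))
    best, best_score = None, -math.inf
    for taxon, (counts, n) in model.items():
        # Simplified smoothing (an assumption) so unseen words never give log(0).
        score = sum(math.log((counts[w] + 0.5) / (n + 1)) for w in query_words)
        if score > best_score:
            best, best_score = taxon, score
    return best


if __name__ == "__main__":
    # LCA: a read with hits in two different classes is placed at their shared phylum.
    hits = [
        parse_lineage("> [0]Bacteria;[1]Bacteroidetes;[2]Bacteroidia"),
        parse_lineage("> [0]Bacteria;[1]Bacteroidetes;[2]Flavobacteriia"),
    ]
    print(lowest_common_ancestor(hits))

    # Word-based classification with made-up reference sequences.
    model = train({"taxonA": ["ACGTACGTAC"], "taxonB": ["TTTTGGGGCC"]}, k=4)
    print(classify("ACGTAC", model, k=4))
```

The contrast to notice is that the LCA step needs alignments first (it works on the taxa of the BLAST hits), while the word-based step scores the raw sequence directly against word counts, which is why no alignment step is needed for RDP.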

This note was uploaded on 02/29/2012 for the course BMMB 597D taught by Professor István Albert during the Fall '11 term at Penn State.
