lecture-28 - Practical Bioinformatics for Life ...

Practical Bioinformatics for Life Scientists
Week 14, Lecture 28
István Albert
Bioinformatics Consulting Center, Penn State

Final project

A group of researchers are interested in studying protein binding locations in an as yet unknown genome. They hypothesize that observed phenotypic variations will correlate with sequence variation of the bound locations in this genome. The researchers perform two experimental procedures.

1.  First, they sequence the whole genome of the wild-type (normal, wt) phenotype with a paired-end sequencing technology.
2.  Second, they perform a chromatin immunoprecipitation experiment that isolates DNA fragments corresponding to bound proteins. This second experiment is performed for both the wild-type (wt) phenotype and the mutant phenotype (m1).

Final project data (see course webpage)

1.  The paired-end whole-genome sequencing data is stored in files r1.fq and r2.fq
2.  ChIP-Seq data for the wild-type sample is stored in file p1.fq
3.  ChIP-Seq data for the mutant sample is stored in file p2.fq

Questions that need to be answered

•  What is the estimated size of your genome?
•  How many binding locations can you detect for each of the ChIP-Seq datasets? Does the number of binding locations vary?
•  Can you observe any genomic variation between phenotypes in the bound locations?
•  Include an IGV screenshot of the location of one binding site

Tips:

1.  There are many ways to solve the project.
2.  There are fewer than 10 binding sites.
3.  At the bare minimum, all questions above may be solved via three tools: velvet + bwa + samtools

Due date: next Thursday

•  Project due by next Thursday (Dec 8th)
•  Project-related office hours next Monday and Wednesday between 2 and 3pm
•  Turn in all homework you might have missed (partial credit will be given)

Classification of metagenomics data

•  One of the first questions any life scientist is asking – what species are present in my data?
•  Yet it may be the right question – just putting a label makes the data more precise than it might be
•  The current state of metagenomics does not lend itself to accurate species-level characterization
   –  too many unknown bacteria
   –  too many unknown systematic effects
   –  too many improperly designed experiments

We’ll use and compare two classification approaches

•  BLAST → LCA (lowest common ancestor) → visualize with Megan (Metagenome Analyzer)
•  Probabilistic (Bayesian) classification for 16S rRNA data via the RDP multiclassifier
•  Download the data for lecture 28 from the course webpage

Megan – Metagenome Analyzer

Java based with a neat graphical user interface

Find and download the 16S rRNA BLAST database

Run blast to generate the alignments

We have two samples, s1.fa and s2.fa, that correspond to two conditions
You can limit the number of alignments and use more threads if your computer can handle that → speeds up the process considerably

Our blast files

This will need to be recognized by Megan to connect it to a taxonomy (we need to use a gi to taxonomy mapper)
Neat feature: your original sequence names may also contain the taxonomy as a list, in which case Megan will parse that out.

> [0]Bacteria;[1]Bacteroidetes;[2]Bacteroidia;[3]Bacteroidales;something

In that case no additional taxonomical information is needed.

Result of the classification

You can also load up the s1.rma file directly

Classify at RDP

Note: this classifies the sequences directly! No alignment step needed!

Results of the classification

RDP multiclassifier from the command line

Find and download it from the classifier webpage.
It can generate and combine the output for each file

Investigating a bit more

How do these methods really work:

MEGAN → lowest common ancestor (read the Megan manual for more details) → but operates solely on sequence similarity - extract reads that map to a certain taxon and analyze the alignments
RDP → breaks each sequence into words, and computes the likelihood that a word comes from a certain taxon (it only works for 16S rRNA data!) ...
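The lowest-common-ancestor idea behind MEGAN can be illustrated with a minimal sketch. It assumes the BLAST hits of one read have already been turned into lineage strings in the bracketed format used for sequence names in this lecture. The function names `parse_lineage` and `lowest_common_ancestor` and the three sample lineages are hypothetical, and MEGAN's own implementation applies additional score filters described in its manual.

```python
import re

def parse_lineage(header):
    """Split '[0]Bacteria;[1]Bacteroidetes;...' into a list of rank names."""
    names = []
    for field in header.lstrip("> ").split(";"):
        match = re.fullmatch(r"\[(\d+)\](.+)", field.strip())
        if match is None:
            break  # stop at the first field without a [rank] prefix
        names.append(match.group(2))
    return names

def lowest_common_ancestor(lineages):
    """Return the longest rank prefix shared by every lineage."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the hits disagree at this rank, so the LCA sits one level up
        shared.append(ranks[0])
    return shared

# Hypothetical lineages for the hits of a single read (illustration only).
hits = [
    "[0]Bacteria;[1]Bacteroidetes;[2]Bacteroidia;[3]Bacteroidales",
    "[0]Bacteria;[1]Bacteroidetes;[2]Flavobacteriia;[3]Flavobacteriales",
    "[0]Bacteria;[1]Bacteroidetes;[2]Bacteroidia;[3]Bacteroidales",
]
lca = lowest_common_ancestor([parse_lineage(h) for h in hits])
print(";".join(lca))
```

Because the hits disagree at the class level, the read is assigned to the phylum they share rather than to any single species, which is why LCA assignments tend to be conservative.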
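The word-based approach described for RDP can be sketched as a small naive Bayes classifier. This is a simplified illustration, not the RDP implementation: the word length `K = 4`, the smoothing constants, the taxon labels, and the toy sequences are all assumptions chosen to keep the example short, and the real classifier relies on its own training set and reports confidence estimates.

```python
import math
from collections import Counter, defaultdict

K = 4  # word length for this toy example (an assumption, not RDP's setting)

def words(seq, k=K):
    """Return the set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(labelled):
    """Count, per taxon, how many training sequences contain each word."""
    counts = defaultdict(Counter)
    totals = Counter()
    for taxon, seq in labelled:
        totals[taxon] += 1
        counts[taxon].update(words(seq))
    return counts, totals

def classify(seq, counts, totals):
    """Pick the taxon with the highest summed log-likelihood of the query words."""
    scores = {}
    for taxon, n in totals.items():
        # Smoothing keeps a word never seen in a taxon from zeroing the score.
        scores[taxon] = sum(
            math.log((counts[taxon][w] + 0.5) / (n + 1)) for w in words(seq)
        )
    return max(scores, key=scores.get), scores

# Made-up training sequences for two hypothetical taxa.
training = [
    ("taxonA", "ACGTACGTACGTACGT"),
    ("taxonA", "ACGTACGTTCGTACGT"),
    ("taxonB", "GGCCTTAAGGCCTTAA"),
    ("taxonB", "GGCCTTAAGGCATTAA"),
]
counts, totals = train(training)
best, scores = classify("ACGTACGTACGA", counts, totals)
print(best)
```

No alignment is computed at any point: the query is scored purely by which words it contains, which matches the note that RDP classifies the sequences directly.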