ps22002

ps22002 - Harvard-MIT Division of Health Sciences and...

Problem Set 2

Please show your work and calculations, and state any assumptions you make in answering the following questions. Include the names of the people you worked with at the top of your problem set.

Problem 1: Genome sizes and data storage (35 points total)

The NIH's National Center for Biotechnology Information (NCBI) provides a huge repository and a multitude of databases for biological information. NCBI Entrez's Genome page (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome) is a good starting place for resources on genome projects. Many biology textbooks also discuss genome size and its biological basis.

1(a) Find the approximate size, in base pairs, of the West Nile viral genome, the microbial Escherichia coli K12 genome, the Caenorhabditis elegans haploid genome, the haploid human genome and the Amoeba dubia genome. (1 pt each, total of 5)

1(b) Find the estimated total number of genes in each of the above organisms. (1 pt each, total of 5) Is genome size proportional to the total number of genes? Give at least one reason why this is or is not the case. (4 points) Is it always true that the more complex the organism, the larger its genome? If your answer is no, give an example and explain why. (4 points)

1(c) What is the minimum number of bytes required to store each of the genomes listed above? To store the human genome in its diploid rather than haploid form? Show your calculations! (6 points)

1(d) What is the minimum number of bytes needed to store all human genomes? All such genomes can be represented as a single individual's genome plus the variations, or polymorphisms, seen in all other human genomes. Assume that the human population is ~6 billion (the population reached in October 1999), that polymorphic sites tend to be simple single nucleotide polymorphisms (SNPs), such as "A" in one genome and "C" in another, and that they occur about once every 3 kb. (4 pts)

1(e) How many double-sided DVDs would it take to store the genomes listed above, given your bit conversions above? How many 80 GB hard disks would it take to store all human genomes in the world, again given your calculations above? (4 pts)

1(f) Some nucleotide sequence data have to be stored at more than 2 bits/base. Can you think of a reason why this would be the case? (3 points)
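The storage arithmetic in 1(c) through 1(e) can be set up as a short Python sketch. It is only one possible framing, not the expected solution. It assumes 2 bits per base (four symbols), decimal gigabytes, and a deliberately simple encoding for 1(d) in which each additional person costs a fixed number of bits per SNP site and the table of SNP positions is ignored. All function and parameter names are hypothetical, and the genome length in the usage example is the 4.6 Mbp figure from Problem 2, used purely for illustration.

```python
BITS_PER_BASE = 2  # assumption: 4 symbols (A, C, G, T) -> log2(4) = 2 bits


def min_bytes(n_bases, bits_per_base=BITS_PER_BASE):
    """Smallest whole number of bytes that holds n_bases at the given density."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8  # integer ceiling division by 8


def min_bytes_population(genome_len, population, snp_spacing=3000,
                         bits_per_snp=BITS_PER_BASE):
    """One reference genome plus per-person SNP alleles.

    Assumption: every other individual stores bits_per_snp bits at each
    SNP site, and the cost of recording site positions is ignored.
    """
    reference_bits = genome_len * BITS_PER_BASE
    snp_sites = genome_len // snp_spacing  # about one site every snp_spacing bases
    variation_bits = (population - 1) * snp_sites * bits_per_snp
    return (reference_bits + variation_bits + 7) // 8


def units_needed(total_bytes, capacity_bytes):
    """Number of storage units (DVDs, disks) needed to hold total_bytes."""
    return -(-total_bytes // capacity_bytes)  # integer ceiling division


if __name__ == "__main__":
    HARD_DISK_BYTES = 80 * 10**9  # assumption: 80 GB in decimal gigabytes
    example_len = 4_600_000       # illustrative length (4.6 Mbp, as in Problem 2)
    size = min_bytes(example_len)
    print(size, "bytes;", units_needed(size, HARD_DISK_BYTES), "disk(s)")
```

For a diploid genome, one option is to pass twice the haploid length to `min_bytes`. A different SNP encoding (for example, 1 bit per biallelic site) changes only the `bits_per_snp` argument.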
Problem 2: Sequence occurrences (30 points total)

2(a) At how many sites would you expect “CG” to occur in a 4.6 Mbp (mega base pair) double-stranded genome? How about “CTAG”? And “GATTACA”? Assume all nucleotides have an equal probability of occurring. (2 pts each, total of 6 pts)

Hint: “CG” is a palindromic sequence. When it occurs on one strand, it also occurs on the complement.

5’-
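One way to organize the counting in 2(a) is sketched below in Python. It is a sketch under stated assumptions, not the official solution. It assumes a linear genome of N base pairs, independent and equiprobable bases, and that a "site" is a position on the duplex. Under those assumptions, a word that equals its own reverse complement (such as "CG") marks the same site on both strands, while any other word can be read from either strand, which doubles the expected count. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Sequence read 5'->3' on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def is_reverse_palindrome(word):
    """True when the word reads the same on both strands (e.g. CG, CTAG)."""
    return word == reverse_complement(word)


def expected_sites(word, genome_len):
    """Expected number of duplex sites matching word on either strand.

    Assumptions: linear genome, each base drawn independently with
    probability 1/4, so each start position matches with probability 4**-k.
    """
    k = len(word)
    positions = genome_len - k + 1           # start positions on one strand
    per_strand = positions * 0.25 ** k
    # A reverse-palindromic word hits the same site on both strands, so it
    # is counted once; otherwise both strands contribute distinct sites.
    strands = 1 if is_reverse_palindrome(word) else 2
    return strands * per_strand


if __name__ == "__main__":
    for w in ("CG", "CTAG", "GATTACA"):
        print(w, is_reverse_palindrome(w), expected_sites(w, 4_600_000))
```

For a circular genome, the number of start positions would be N rather than N - k + 1; at 4.6 Mbp the difference is negligible.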

This note was uploaded on 01/24/2010 for the course HST.508, taught by Professor George Church during the Fall '02 term at MIT.
