JHU 580.429 SB3 HW7: DNA information content

1. Assume that the length of the human genome is $3 \times 10^9$ base pairs, and that each of the 4 base pairs occurs with probability $1/4$.

(a) How long in base pairs does a motif have to be to occur approximately once per genome? A fractional result is fine.

(b) Suppose a motif occurs once on average in the genome. You can model this as a binomial distribution with $3 \times 10^9$ attempts and a success rate of $p$ per attempt. What is $p$?

(c) The binomial distribution reduces to a Poisson distribution. From the Poisson distribution for a motif with $\lambda$ occurrences on average per genome, what is the probability of exactly $k$ occurrences? This question is really just asking you to write down the Poisson distribution.

2. Coding length and information content. The Shannon entropy for a discrete random variable with $n$ states $i \in \{1, 2, \ldots, n\}$ is

$$-\sum_{i=1}^{n} p_i \log_2(p_i),$$

where $p_i$ is the probability of state $i$ and the entropy is in bits.
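The entropy definition above can be checked numerically with a short script. This is a minimal sketch rather than part of the assignment: the function name `shannon_entropy_bits` is hypothetical, and the code assumes the usual convention that a state with $p_i = 0$ contributes nothing to the sum. The uniform four-state distribution from problem 1 is used as the example input.

```python
import math


def shannon_entropy_bits(probs):
    """Return the Shannon entropy, in bits, of a discrete distribution.

    `probs` is a sequence of state probabilities p_i that should sum to 1.
    Assumption: states with p_i == 0 are skipped (0 * log2(0) treated as 0).
    """
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Direct translation of -sum_i p_i * log2(p_i).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Four states, each with probability 1/4, as assumed in problem 1.
uniform_bases = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy_bits(uniform_bases))  # 2.0
```

For the uniform case each term is $\tfrac{1}{4} \log_2 \tfrac{1}{4} = -\tfrac{1}{2}$, so the four terms sum to $-2$ and the entropy is 2 bits. Any non-uniform distribution over the same four states gives a smaller value.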

