6.874/6.807/7.90 Computational functional genomics, lectures 3 and 4 (Jaakkola)

Finding regulatory sequences in DNA: motif discovery

The purpose of motif discovery as discussed here is to find binding sites of DNA binding regulators. We assume that the binding sites are short segments of DNA, not necessarily contiguous, to which a specific transcription factor (or family of transcription factors) can bind. While such sites may appear in genic as well as inter-genic regions, we focus here on finding binding sites only in the promoter region of each gene. It's worthwhile to note that the existence of a binding site is no guarantee that the site (the corresponding TF) plays a role in regulating the gene in question; the site may not be accessible due to chromatin structure, may never be occupied due to resource constraints and higher-affinity sites elsewhere, and so on. Nevertheless, knowing which regulators could participate in regulating each gene, and finding where they bind at base-pair resolution, is a useful source of information.

There are several strategies we could follow to try to find such sites. For example, if we have genomes of two related species, simply aligning the promoter regions of orthologous genes (or aligning the whole genomes) would reveal interspersed segments of DNA that are highly conserved across the two species. The binding sites of DNA binding regulators are likely to fall within the conserved segments because of the evolutionary pressure to maintain the regulatory programs. The fraction of conserved segments that are interpretable as binding sites within, say, a promoter region, depends on the evolutionary distance between the species (the time that they have evolved independently). Sufficient time is required for inessential portions of the sequences to diverge. Considering more than two related species would help emphasize the signal. For more information about this approach, see, e.g., Kellis, 2003.
We will follow here another approach, searching for binding sites of regulators within a single genome. We start with a set of genes that are likely to share regulators (binding sites of regulators); a random subset of genes is unlikely to share any regulators of interest. Note that it is not necessary to know who the common regulators are, only that they are likely to exist. Of course, knowing something about the common regulators, e.g., the protein families, can be extremely useful.

Simple motifs, analysis

To set the problem a bit more formally, let S_1, ..., S_n be n promoter sequences of interest. For simplicity we assume also that the sequences are of the same length, L, typically something like 500-1000 bases (yeast). The simplest possible motif we could try to find is a w-mer, a sequence of length w. In other words, we are looking for a common w-mer (exact match) across the promoters S_1, ..., S_n. We'd like to understand first how L, w, and n...
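As a rough illustration of the exact-match setting described above, the search for a w-mer shared by all n promoters can be sketched in Python. This is only a minimal sketch, not the method developed in the lectures: the function names `wmers` and `common_wmers` and the toy sequences are made up for this example, and real promoters would be far longer (L of several hundred bases).

```python
def wmers(seq, w):
    """Return the set of all length-w substrings (w-mers) of seq."""
    # If w exceeds len(seq), the range is empty and so is the result.
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def common_wmers(promoters, w):
    """Return the w-mers that occur (exact match) in every promoter."""
    if not promoters:
        return set()
    shared = wmers(promoters[0], w)
    for seq in promoters[1:]:
        # Keep only w-mers also present in this promoter.
        shared &= wmers(seq, w)
    return shared


# Hypothetical toy promoters (n = 3, L = 10), invented for illustration.
promoters = [
    "ACGTTGACCA",
    "TTGACGGTAC",
    "GGTTGACTTA",
]

print(sorted(common_wmers(promoters, 5)))
```

Each promoter contributes L - w + 1 candidate w-mers, so the sketch makes it easy to experiment with how the number of shared w-mers changes as L, w, and n vary.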