Published online 13 June 2008
Nucleic Acids Research, 2008, Vol. 36, No. 12, 4137–4148
doi:10.1093/nar/gkn361

Extracting sequence features to predict protein–DNA interactions: a comparative study

Qing Zhou 1,* and Jun S. Liu 2
1 Department of Statistics, University of California, Los Angeles, CA 90095 and 2 Department of Statistics, Harvard University, Cambridge, MA 02138, USA

Received February 25, 2008; Revised May 16, 2008; Accepted May 21, 2008

ABSTRACT

Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present here a systematic study of predictive modeling approaches to the TF–DNA binding problem, which have been frequently shown to be more efficient than those methods only based on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework; and influential sequence features are identified by variable selection. We examine a few state-of-the-art learning methods including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to both simulated datasets and two whole-genome ChIP-chip datasets on the TFs Oct4 and Sox2, respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve the predictive power and identify more biologically interesting features, such as TF–TF interactions, than the PWM approach. In particular, BART and boosting show the best and the most robust overall performance among all the methods.
INTRODUCTION

Transcription factors (TFs) regulate the expression of target genes by binding in a sequence-specific manner to various binding sites located in the promoter regions of these genes. A widely used model for characterizing the common sequence pattern of a set of TF-binding sites (TFBSs), often referred to as a motif, is the position-specific weight matrix (PWM). It assumes that each position of a binding site is generated by a multinomial probability distribution independent of other positions. Since the 1980s, many computational approaches have been developed based on the PWM representation to ‘discover’ motifs and TFBSs from a set of DNA sequences (1–6). See refs. (7,8) for recent reviews. From a discriminant modeling perspective, a PWM implies a linear additive model for the TF–DNA interaction. Since non-negligible dependence among the positions of a binding site can be present (9,10), methods that simultaneously infer such dependence and predict novel binding sites have been developed (11–13). Approaches that make use of information in both positive (binding sites) and negative sequences (non-binding sites) have also been developed (14–16).
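To make the "linear additive model" reading of a PWM concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical four-position PWM (`TOY_PWM`), a uniform background distribution, and a log-odds score; the function names are illustrative only. Because positions are treated as independent, the score of a candidate site is simply a sum of per-position terms.

```python
import math

BASES = "ACGT"

# Hypothetical toy PWM (assumption, not from the paper): one multinomial
# distribution over A, C, G, T per position; consensus site is "ATGC".
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def log_odds_matrix(pwm, background=None):
    """Convert per-position probabilities into log-odds weights.

    The background defaults to a uniform distribution (an assumption).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    return [
        {b: math.log(column[b] / background[b]) for b in BASES}
        for column in pwm
    ]


def score_site(site, log_odds):
    """Score a site as a sum of independent per-position contributions."""
    if len(site) != len(log_odds):
        raise ValueError("site length must equal PWM width")
    return sum(log_odds[i][base] for i, base in enumerate(site))


def best_window(sequence, log_odds):
    """Scan a sequence and return (start, score) of the top-scoring window."""
    width = len(log_odds)
    scored = [
        (score_site(sequence[start:start + width], log_odds), start)
        for start in range(len(sequence) - width + 1)
    ]
    top_score, top_start = max(scored)
    return top_start, top_score


if __name__ == "__main__":
    weights = log_odds_matrix(TOY_PWM)
    start, score = best_window("CCATGCAA", weights)
    print(f"best window starts at {start} with log-odds score {score:.3f}")
```

Changing the base at one position shifts the score by a fixed amount regardless of the other positions, which is exactly the independence assumption that the dependence-aware methods cited above relax.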