feature selection random forest

Feature Selection using a Random Forests Classifier for the Integrated Analysis of Multiple Data Types

David M. Reif 1,2, Alison A. Motsinger 1, Brett A. McKinney 1,2,3, James E. Crowe Jr. 3, Jason H. Moore 2
{reif, motsinger, [email protected], [email protected], [email protected]

1 Center for Human Genetics Research, Dept. of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN, USA
2 Computational Genetics Laboratory, Dept. of Genetics, Dartmouth Medical School, Lebanon, NH, USA
3 Program in Vaccine Sciences, Dept. of Pediatrics, Vanderbilt University, Nashville, TN, USA

ABSTRACT
Complex clinical phenotypes arise from the concerted interactions among the myriad components of a biological system. Therefore, comprehensive models can only be developed through the integrated study of multiple types of experimental data gathered from the system in question. The Random Forests™ (RF) method is adept at identifying relevant features having only slight main effects in high-dimensional data. This method is well-suited to integrated analysis, as relevant attributes may be selected from categorical or continuous data, and there may be interactions across data types. RF is a natural approach for studying gene-gene, gene-protein, or protein-protein interactions because importance scores for particular attributes take interactions into account. Thus, Random Forests is a promising solution to the analysis challenge posed by high-dimensional datasets including interactions among attributes of different types. In this study, we characterize the performance of RF on a range of simulated genetic and/or proteomic datasets. We compare the performance of RF in identifying relevant attributes when given genetic data alone, proteomic data alone, or a combined dataset of genetic plus proteomic data. Our results indicate that utilizing multiple data types is beneficial when the disease model is complex and the phenotypic outcome-associated data type is unknown. The results of this study also show that RF is adept at identifying relevant features in high-dimensional data with small main effects and low heritability.

Keywords
Random Forests™, gene-gene interactions, feature selection, multiple data types, data integration.

1-4244-0623-4/06/$20.00 ©2006 IEEE.

1. INTRODUCTION
Adverse drug reactions are one of the leading causes of hospitalization in the United States. For example, in 1994 alone, adverse drug reactions accounted for more than 2.2 million serious hospitalizations [1]. Currently, there is no definitive way to determine how a person will respond to a medication, which limits pharmaceutical development to a "one size fits all" system. This system allows for the development of drugs to which the "typical" patient will respond, but one size does not necessarily fit all, sometimes with dire consequences. The need to screen patients a priori for biomarkers predictive of response, in order to prevent adverse reactions, has created a subspecialty within the field of human genetics known as pharmacogenomics.
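As a rough illustration of the idea stated in the abstract (ranking attributes of mixed type by a Random Forests importance score on a combined genetic plus proteomic dataset), the following is a minimal sketch in plain Python. It is not the authors' simulation or software. The function names (`simulate`, `grow_tree`, `rf_importance`), the toy disease model (case status driven by one SNP acting together with one protein level, with 10% label noise), the deliberately strong effect size, and the use of out-of-bag permutation importance are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, rows, depth, mtry, rng):
    """Grow one tree. A leaf is a class label; an internal node is
    a tuple (feature, threshold, left_subtree, right_subtree)."""
    labels = [y[i] for i in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    best = None
    # Consider only a random subset of mtry attributes at each node.
    for f in rng.sample(range(len(X[0])), mtry):
        values = sorted({X[i][f] for i in rows})
        cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
        # Try at most 10 random cut points to keep the sketch fast.
        for t in rng.sample(cuts, min(10, len(cuts))):
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # every sampled attribute was constant in this node
        return majority
    _, f, t, left, right = best
    return (f, t,
            grow_tree(X, y, left, depth - 1, mtry, rng),
            grow_tree(X, y, right, depth - 1, mtry, rng))

def predict(tree, x):
    """Follow the splits down to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def rf_importance(X, y, n_trees=30, depth=3, mtry=3, seed=0):
    """Mean decrease in out-of-bag accuracy when one attribute is permuted."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        tree = grow_tree(X, y, boot, depth, mtry, rng)
        base = sum(predict(tree, X[i]) == y[i] for i in oob)
        for f in range(p):
            shuffled = [X[i][f] for i in oob]
            rng.shuffle(shuffled)
            hits = 0
            for i, v in zip(oob, shuffled):
                x = list(X[i])
                x[f] = v  # replace attribute f with a permuted value
                hits += predict(tree, x) == y[i]
            importance[f] += (base - hits) / len(oob) / n_trees
    return importance

def simulate(n=300, n_snps=4, n_proteins=4, seed=1):
    """Hypothetical combined dataset: SNP genotypes (0/1/2) plus
    continuous protein levels. Only snp_0 and protein_0 affect status."""
    rng = random.Random(seed)
    names = ([f"snp_{j}" for j in range(n_snps)]
             + [f"protein_{j}" for j in range(n_proteins)])
    X, y = [], []
    for _ in range(n):
        # Genotype frequencies 0.25 / 0.5 / 0.25 (allele frequency 0.5).
        snps = [rng.choice([0, 1, 1, 2]) for _ in range(n_snps)]
        proteins = [rng.gauss(0.0, 1.0) for _ in range(n_proteins)]
        case = int(snps[0] >= 1 and proteins[0] > 0)
        if rng.random() < 0.1:  # 10% label noise
            case = 1 - case
        X.append(snps + proteins)
        y.append(case)
    return names, X, y

names, X, y = simulate()
scores = rf_importance(X, y)
ranking = sorted(zip(names, scores), key=lambda pair: -pair[1])
for name, score in ranking:
    print(f"{name:10s} {score:+.3f}")
```

Because the two causal attributes come from different data types, a forest given only the SNP columns or only the protein columns could recover at most one of them. The effect here is much stronger than the slight main effects and low heritability the study targets, so the sketch shows the mechanics of the ranking rather than the difficulty of the problem.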

This note was uploaded on 04/06/2010 for the course COMPUTER S COSC1520 taught by Professor Paul during the Spring '09 term at York University.
