Journal of Machine Learning Research 3 (2003) 1157-1182    Submitted 11/02; Published 3/03

An Introduction to Variable and Feature Selection

Isabelle Guyon    ISABELLE@CLOPINET.COM
Clopinet
955 Creston Road
Berkeley, CA 94708-1501, USA

Andre Elisseeff    ANDRE@TUEBINGEN.MPG.DE
Empirical Inference for Machine Learning and Perception Department
Max Planck Institute for Biological Cybernetics
Spemannstrasse 38
72076 Tubingen, Germany

Editor: Leslie Pack Kaelbling

Abstract

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.

Keywords: Variable selection, feature selection, space dimensionality reduction, pattern discovery, filters, wrappers, clustering, information theory, support vector machines, model selection, statistical testing, bioinformatics, computational biology, gene expression, microarray, genomics, proteomics, QSAR, text classification, information retrieval.

1 Introduction

As of 1997, when a special issue on relevance including several papers on variable and feature selection was published (Blum and Langley, 1997; Kohavi and John, 1997), few domains explored used more than 40 features.
The situation has changed considerably in the past few years and, in this special issue, most papers explore domains with hundreds to tens of thousands of variables or features.(1) New techniques are proposed to address these challenging tasks involving many irrelevant and redundant variables and often comparably few training examples. Two examples are typical of the new application domains and serve as illustrations throughout this introduction. One is gene selection from microarray data and the other is text categorization. In the gene selection problem, the variables are gene expression coefficients corresponding to the ...

(1) We call "variables" the raw input variables, and "features" variables constructed from the input variables. We use the terms variable and feature without distinction when there is no impact on the selection algorithms, e.g., when features resulting from a pre-processing of the input variables are explicitly computed. The distinction is necessary in the case of kernel methods, for which features are not explicitly computed (see section 5.3).
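The footnote's distinction can be made concrete with a minimal sketch. This is not code from the paper; the function names (log_scale, standardize) and the toy data are illustrative assumptions, showing only the idea that features are explicitly computed from raw input variables by pre-processing.

```python
import math

def log_scale(x):
    """A feature explicitly computed from one raw variable by pre-processing."""
    return math.log1p(x)  # log(1 + x), a common transform for expression-like data

def standardize(values):
    """Features computed jointly from a list of raw variable values: center and scale."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against zero variance
    return [(v - mean) / std for v in values]

# Raw input variables (toy values standing in for, e.g., expression coefficients).
raw = [0.0, 1.0, 3.0, 7.0]

# Features: explicitly computed, so variable selection and feature selection
# coincide here. With kernel methods the feature map is implicit, and the
# two problems must be distinguished (footnote above).
features = [log_scale(v) for v in raw]
```

Because these features are explicitly materialized, any selection algorithm can treat them exactly like raw variables, which is the sense in which the paper uses the two terms interchangeably.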