Machine Learning, Neural and Statistical Classification
Editors: D. Michie, D.J. Spiegelhalter, C.C. Taylor
February 17, 1994

Contents

1 Introduction
  1.1 INTRODUCTION
  1.2 CLASSIFICATION
  1.3 PERSPECTIVES ON CLASSIFICATION
    1.3.1 Statistical approaches
    1.3.2 Machine learning
    1.3.3 Neural networks
    1.3.4 Conclusions
  1.4 THE STATLOG PROJECT
    1.4.1 Quality control
    1.4.2 Caution in the interpretations of comparisons
  1.5 THE STRUCTURE OF THIS VOLUME

2 Classification
  2.1 DEFINITION OF CLASSIFICATION
    2.1.1 Rationale
    2.1.2 Issues
    2.1.3 Class definitions
    2.1.4 Accuracy
  2.2 EXAMPLES OF CLASSIFIERS
    2.2.1 Fisher's linear discriminants
    2.2.2 Decision tree and Rule-based methods
    2.2.3 k-Nearest-Neighbour
  2.3 CHOICE OF VARIABLES
    2.3.1 Transformations and combinations of variables
  2.4 CLASSIFICATION OF CLASSIFICATION PROCEDURES
    2.4.1 Extensions to linear discrimination
    2.4.2 Decision trees and Rule-based methods
    2.4.3 Density estimates
  2.5 A GENERAL STRUCTURE FOR CLASSIFICATION PROBLEMS
    2.5.1 Prior probabilities and the Default rule
    2.5.2 Separating classes
    2.5.3 Misclassification costs
  2.6 BAYES RULE GIVEN DATA
    2.6.1 Bayes rule in statistics
  2.7 REFERENCE TEXTS

3 Classical Statistical Methods
  3.1 INTRODUCTION
  3.2 LINEAR DISCRIMINANTS
    3.2.1 Linear discriminants by least squares
    3.2.2 Special case of two classes
    3.2.3 Linear discriminants by maximum likelihood
    3.2.4 More than two classes
  3.3 QUADRATIC DISCRIMINANT
    3.3.1 Quadratic discriminant - programming details
    3.3.2 Regularisation and smoothed estimates
    3.3.3 Choice of regularisation parameters
  3.4 LOGISTIC DISCRIMINANT
    3.4.1 Logistic discriminant - programming details
  3.5 BAYES' RULES
  3.6 EXAMPLE
    3.6.1 Linear discriminant
    3.6.2 Logistic discriminant
    3.6.3 Quadratic discriminant

4 Modern Statistical Techniques
  4.1 INTRODUCTION
  4.2 DENSITY ESTIMATION
    4.2.1 Example
  4.3 k-NEAREST NEIGHBOUR
    4.3.1 Example
  4.4 PROJECTION PURSUIT CLASSIFICATION
    4.4.1 Example
  4.5 NAIVE BAYES
  4.6 CAUSAL NETWORKS
    4.6.1 Example
  4.7 OTHER RECENT APPROACHES
    4.7.1 ACE
    4.7.2 MARS

5 Machine Learning of Rules and Trees
  5.1 RULES AND TREES FROM DATA: FIRST PRINCIPLES
    5.1.1 Data fit and mental fit of classifiers
    5.1.2 Specific-to-general: a paradigm for rule-learning
    5.1.3 Decision trees
    5.1.4 General-to-specific: top-down induction of trees
    5.1.5 Stopping rules and class probability trees
    5.1.6 Splitting criteria
    5.1.7 Getting a "right-sized tree"
  5.2 STATLOG'S ML ALGORITHMS
    5.2.1 Tree-learning: further features of C4.5
    5.2.2 NewID
    5.2.3
    5.2.4 Further features of CART
    5.2.5 Cal5
    5.2.6 Bayes tree
    5.2.7 Rule-learning algorithms: CN2
    5.2.8 ITrule
  5.3 BEYOND THE COMPLEXITY BARRIER
    5.3.1 Trees into rules
    5.3.2 Manufacturing new attributes
    5.3.3 Inherent limits of propositional-level learning
    5.3.4 A human-machine compromise: structured induction

6 Neural Networks
  6.1 INTRODUCTION
  6.2 SUPERVISED NETWORKS FOR CLASSIFICATION
    6.2.1 Perceptrons and Multi Layer Perceptrons
    6.2.2 Multi Layer Perceptron structure and functionality
    6.2.3 Radial Basis Function networks
    6.2.4 Improving the generalisation of Feed-Forward networks
  6.3 UNSUPERVISED LEARNING
    6.3.1 The K-means clustering algorithm
    6.3.2 Kohonen networks and Learning Vector Quantizers
    6.3.3 RAMnets
  6.4 DIPOL92
    6.4.1 Introduction
    6.4.2 Pairwise linear regression
    6.4.3 Learning procedure
    6.4.4 Clustering of classes
    6.4.5 Description of the classification procedure

7 Methods for Comparison
  7.1 ESTIMATION OF ERROR RATES IN CLASSIFICATION RULES
    7.1.1 Train-and-Test
    7.1.2 Cross-validation
    7.1.3 Bootstrap
    7.1.4 Optimisation of parameters
  7.2 ORGANISATION OF COMPARATIVE TRIALS
    7.2.1 Cross-validation
    7.2.2 Bootstrap
    7.2.3 Evaluation Assistant
  7.3 CHARACTERISATION OF DATASETS
    7.3.1 Simple measures
    7.3.2 Statistical measures
    7.3.3 Information theoretic measures
  7.4 PRE-PROCESSING
    7.4.1 Missing values
    7.4.2 Feature selection and extraction
    7.4.3 Large number of categories
    7.4.4 Bias in class proportions
    7.4.5 Hierarchical attributes
    7.4.6 Collection of datasets
    7.4.7 Preprocessing strategy in StatLog

8 Review of Previous Empirical Comparisons
  8.1 INTRODUCTION
  8.2 BASIC TOOLBOX OF ALGORITHMS
  8.3 DIFFICULTIES IN PREVIOUS STUDIES
  8.4 PREVIOUS EMPIRICAL COMPARISONS
  8.5 INDIVIDUAL RESULTS
  8.6 MACHINE LEARNING vs. NEURAL NETWORK
  8.7 STUDIES INVOLVING ML, k-NN AND STATISTICS
  8.8 SOME EMPIRICAL STUDIES RELATING TO CREDIT RISK
    8.8.1 Traditional and statistical approaches
    8.8.2 Machine Learning and Neural Networks

9 Dataset Descriptions and Results
  9.1 INTRODUCTION
  9.2 CREDIT DATASETS
    9.2.1 Credit management (Cred.Man)
    9.2.2 Australian credit (Cr.Aust)
  9.3 IMAGE DATASETS
    9.3.1 Handwritten digits (Dig44)
    9.3.2 Karhunen-Loeve digits (KL)
    9.3.3 Vehicle silhouettes (Vehicle)
    9.3.4 Letter recognition (Letter)
    9.3.5 Chromosomes (Chrom)
    9.3.6 Landsat satellite image (SatIm)
    9.3.7 Image segmentation (Segm)
    9.3.8 Cut
  9.4 DATASETS WITH COSTS
    9.4.1 Head injury (Head)
    9.4.2 Heart disease (Heart)
    9.4.3 German credit (Cr.Ger)
  9.5 OTHER DATASETS
    9.5.1 Shuttle control (Shuttle)
    9.5.2 Diabetes (Diab)
    9.5.3 DNA
    9.5.4 Technical (Tech)
    9.5.5 Belgian power (Belg)
    9.5.6 Belgian power II (BelgII)
    9.5.7 Machine faults (Faults)
    9.5.8 Tsetse fly distribution (Tsetse)
  9.6 STATISTICAL AND INFORMATION MEASURES
    9.6.1 KL-digits dataset
    9.6.2 Vehicle silhouettes
    9.6.3 Head injury
    9.6.4 Heart disease
    9.6.5 Satellite image dataset
    9.6.6 Shuttle control
    9.6.7 Technical
    9.6.8 Belgian power II

10 Analysis of Results
  10.1 INTRODUCTION
  10.2 RESULTS BY SUBJECT AREAS
    10.2.1 Credit datasets
    10.2.2 Image datasets
    10.2.3 Datasets with costs
    10.2.4 Other datasets
  10.3 TOP FIVE ALGORITHMS
    10.3.1 Dominators
  10.4 MULTIDIMENSIONAL SCALING
    10.4.1 Scaling of algorithms
    10.4.2 Hierarchical clustering of algorithms
    10.4.3 Scaling of datasets
    10.4.4 Best algorithms for datasets
    10.4.5 Clustering of datasets
  10.5 PERFORMANCE RELATED TO MEASURES: THEORETICAL
    10.5.1 Normal distributions
    10.5.2 Absolute performance: quadratic discriminants
    10.5.3 Relative performance: Logdisc vs. DIPOL92
    10.5.4 Pruning of decision trees
  10.6 RULE BASED ADVICE ON ALGORITHM APPLICATION
    10.6.1 Objectives
    10.6.2 Using test results in metalevel learning
    10.6.3 Characterizing predictive power
    10.6.4 Rules generated in metalevel learning
    10.6.5 Application Assistant
    10.6.6 Criticism of metalevel learning approach
    10.6.7 Criticism of measures
  10.7 PREDICTION OF PERFORMANCE
    10.7.1 ML on ML vs. regression

11 Conclusions
  11.1 INTRODUCTION
    11.1.1 User's guide to programs
  11.2 STATISTICAL ALGORITHMS
    11.2.1 Discriminants
    11.2.2 ALLOC80
    11.2.3 Nearest Neighbour
    11.2.4 SMART
    11.2.5 Naive Bayes
    11.2.6 CASTLE
  11.3 DECISION TREES
    11.3.1 and NewID
    11.3.2 C4.5
    11.3.3 CART and IndCART
    11.3.4 Cal5
    11.3.5 Bayes Tree
  11.4 RULE-BASED METHODS
    11.4.1 CN2
    11.4.2 ITrule
  11.5 NEURAL NETWORKS
    11.5.1 Backprop
    11.5.2 Kohonen and LVQ
    11.5.3 Radial basis function neural network
    11.5.4 DIPOL92
  11.6 MEMORY AND TIME
    11.6.1 Memory
    11.6.2 Time
  11.7 GENERAL ISSUES
    11.7.1 Cost matrices
    11.7.2 Interpretation of error rates
    11.7.3 Structuring the results
    11.7.4 Removal of irrelevant attributes
    11.7.5 Diagnostics and plotting
    11.7.6 Exploratory data
    11.7.7 Special features
    11.7.8 From classification to knowledge organisation and synthesis

12 Knowledge Representation
  12.1 INTRODUCTION
  12.2 LEARNING, MEASUREMENT AND REPRESENTATION
  12.3 PROTOTYPES
    12.3.1 Experiment 1
    12.3.2 Experiment 2
    12.3.3 Experiment 3
    12.3.4 Discussion
  12.4 FUNCTION APPROXIMATION
    12.4.1 Discussion
  12.5 GENETIC ALGORITHMS
  12.6 PROPOSITIONAL LEARNING SYSTEMS
    12.6.1 Discussion
  12.7 RELATIONS AND BACKGROUND KNOWLEDGE
    12.7.1 Discussion
  12.8 CONCLUSIONS

13 Learning to Control Dynamic Systems
  13.1 INTRODUCTION
  13.2 EXPERIMENTAL DOMAIN
  13.3 LEARNING TO CONTROL FROM SCRATCH: BOXES
    13.3.1 BOXES
    13.3.2 Refinements of BOXES
  13.4 LEARNING TO CONTROL FROM SCRATCH: GENETIC LEARNING
    13.4.1 Robustness and adaptation
  13.5 EXPLOITING PARTIAL EXPLICIT KNOWLEDGE
    13.5.1 BOXES with partial knowledge
    13.5.2 Exploiting domain knowledge in genetic learning of control
  13.6 EXPLOITING OPERATOR'S SKILL
    13.6.1 Learning to pilot a plane
    13.6.2 Learning to control container cranes
  13.7 CONCLUSIONS

A Dataset availability
B Software sources and details
C Contributors

1 Introduction
D. Michie (1), D. J. Spiegelhalter (2) and C. C. Taylor (3)
(1) University of Strathclyde, (2) MRC Biostatistics Unit, Cambridge and (3) University of Leeds

1.1 INTRODUCTION
The aim of this book is to provide an up-to-date review of different approaches to classification, compare their performance on a wide range of challenging data-sets, and draw conclusions on their applicability to realistic industrial problems. Before describing the contents, we first need to define what we mean by classification, give some background to the different perspectives on the task, and introduce the European Community StatLog project, whose results form the basis for this book.

1.2 CLASSIFICATION
The task of classification occurs in a wide range of human activity. At its broadest, the term could cover any context in which some decision or forecast is made on the basis of currently available information, and a classification procedure is then some formal method for repeatedly making such judgments in new situations. In this book we shall consider a more restricted interpretation.
We shall assume that the problem concerns the construction of a procedure that will be applied to a continuing sequence of cases, in which each new case must be assigned to one of a set of pre-defined classes on the basis of observed attributes or features. The construction of a classification procedure from a set of data for which the true classes are known has also been variously termed pattern recognition, discrimination, or supervised learning (in order to distinguish it from unsupervised learning or clustering, in which the classes are inferred from the data). Contexts in which a classification task is fundamental include, for example, mechanical procedures for sorting letters on the basis of machine-read postcodes, assigning individuals to credit status on the basis of financial and other personal information, and the preliminary diagnosis of a patient's disease in order to select immediate treatment while awaiting definitive test results. In fact, some of the most urgent problems arising in science, industry and commerce can be regarded as classification or decision problems using complex and often very extensive data. We note that many other topics come under the broad heading of classification. These include problems of control, which is briefly covered in Chapter 13.
(Address for correspondence: MRC Biostatistics Unit, Institute of Public Health, University Forvie Site, Robinson Way, Cambridge CB2 2SR, U.K.)

1.3 PERSPECTIVES ON CLASSIFICATION
As the book's title suggests, a wide variety of approaches has been taken towards this task. Three main historical strands of research can be identified: statistical, machine learning and neural network. These have largely involved different professional and academic groups, and emphasised different issues. All groups have, however, had some objectives in common.
They have all attempted to derive procedures that would be able:
- to equal, if not exceed, a human decision-maker's behaviour, but have the advantage of consistency and, to a variable extent, explicitness;
- to handle a wide variety of problems and, given enough data, to be extremely general;
- to be used in practical settings with proven success.

1.3.1 Statistical approaches
Two main phases of work on classification can be identified within the statistical community. The first, "classical" phase concentrated on derivatives of Fisher's early work on linear discrimination. The second, "modern" phase exploits more flexible classes of models, many of which attempt to provide an estimate of the joint distribution of the features within each class, which can in turn provide a classification rule. Statistical approaches are generally characterised by having an explicit underlying probability model, which provides a probability of being in each class rather than simply a classification. In addition, it is usually assumed that the techniques will be used by statisticians, and hence some human intervention is assumed with regard to variable selection and transformation, and overall structuring of the problem.

1.3.2 Machine learning
Machine Learning is generally taken to encompass automatic computing procedures based on logical or binary operations that learn a task from a series of examples. Here we are just concerned with classification, and it is arguable what should come under the Machine Learning umbrella. Attention has focussed on decision-tree approaches, in which classification results from a sequence of logical steps. These are capable of representing the most complex problem given sufficient data (but this may mean an enormous amount!).
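A decision tree of the kind just described can be written directly as a sequence of logical tests on the attributes. The sketch below is purely illustrative: the attribute names and the threshold are invented for this example and are not taken from the book or from any StatLog dataset.

```python
# A hand-written decision tree: classification results from a sequence of
# logical (binary or threshold) steps, as described above. The attributes
# ("has_account", "income", "arrears") and the 20000 threshold are invented.

def classify_credit(applicant):
    """Toy tree assigning a credit class from a few logical tests."""
    if applicant["has_account"]:
        if applicant["income"] >= 20000:
            return "accept"
        return "refer"          # known customer, but low income
    if applicant["arrears"]:
        return "reject"         # no account and a record of arrears
    return "refer"

print(classify_credit({"has_account": True, "income": 25000, "arrears": False}))  # accept
```

Each path from the root test to a leaf corresponds to one classifying rule; given enough such tests (and enough data to justify them), trees of this form can represent arbitrarily complex class boundaries over discrete attributes.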
Other techniques, such as genetic algorithms and inductive logic procedures (ILP), are currently under active development and in principle would allow us to deal with more general types of data, including cases where the number and type of attributes may vary, and where additional layers of learning are superimposed, with hierarchical structure of attributes and classes and so on. Machine Learning aims to generate classifying expressions simple enough to be understood easily by the human. They must mimic human reasoning sufficiently to provide insight into the decision process. As with statistical approaches, background knowledge may be exploited in development, but operation is assumed without human intervention.

1.3.3 Neural networks
The field of Neural Networks has arisen from diverse sources, ranging from the fascination of mankind with understanding and emulating the human brain, to broader issues of copying human abilities such as speech and the use of language, to the practical commercial, scientific, and engineering disciplines of pattern recognition, modelling, and prediction. The pursuit of technology is a strong driving force for researchers, both in academia and industry, in many fields of science and engineering. In neural networks, as in Machine Learning, the excitement of technological progress is supplemented by the challenge of reproducing intelligence itself. A broad class of techniques can come under this heading but, generally, neural networks consist of layers of interconnected nodes, each node producing a non-linear function of its input. The input to a node may come from other nodes or directly from the input data. Also, some nodes are identified with the output of the network. The complete network therefore represents a very complex set of interdependencies, which may incorporate any degree of nonlinearity, allowing very general functions to be modelled.
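The layered structure just described can be sketched as a forward pass through a tiny network. This is a hedged illustration only: the choice of the logistic sigmoid as the node function and the particular weight and bias values are arbitrary, not a trained network from the book.

```python
import math

# A minimal feed-forward network in the sense described above: layers of
# interconnected nodes, each producing a non-linear function (here the
# logistic sigmoid) of a weighted sum of its inputs. All numbers are
# arbitrary illustrative values.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each node: a non-linear function of a weighted sum of the inputs."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def forward(x):
    # Two input attributes -> two hidden nodes -> one output node.
    hidden = layer(x, weights=[[1.0, -1.0], [0.5, 0.5]], biases=[0.0, -0.5])
    output = layer(hidden, weights=[[2.0, -2.0]], biases=[0.0])
    return output[0]            # read as a score for one class, in (0, 1)

print(forward([0.2, 0.7]))
```

The "messages" propagated from layer to layer are just these intermediate node outputs; adding connections from output nodes back to earlier nodes would give the feedback behaviour mentioned below.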
In the simplest networks, the output from one node is fed into another node in such a way as to propagate "messages" through layers of interconnecting nodes. More complex behaviour may be modelled by networks in which the final output nodes are connected with earlier nodes, and then the system has the characteristics of a highly nonlinear system with feedback. It has been argued that neural networks mirror to a certain extent the behaviour of networks of neurons in the brain. Neural network approaches combine the complexity of some of the statistical techniques with the machine learning objective of imitating human intelligence; however, this is done at a more "unconscious" level, and hence there is no accompanying ability to make learned concepts transparent to the user.

1.3.4 Conclusions
The three broad approaches outlined above form the basis of the grouping of procedures used in this book. The correspondence between type of technique and professional background is inexact: for example, techniques that use decision trees have been developed in parallel both within the machine learning community, motivated by psychological research or knowledge acquisition for expert systems, and within the statistical profession as a response to the perceived limitations of classical discrimination techniques based on linear functions. Similarly, strong parallels may be drawn between advanced regression techniques developed in statistics and neural network models with a background in psychology, computer science and artificial intelligence. It is the aim of this book to put all methods to the test of experiment, and to give an objective assessment of their strengths and weaknesses. Techniques have been grouped according to the above categories. It is not always straightforward to select a group: for example, some procedures can be considered as a development from linear regression, but have strong affinity to neural networks.
When deciding on a group for a specific technique, we have attempted to ignore its professional pedigree and classify according to its essential nature.

1.4 THE STATLOG PROJECT
The fragmentation amongst different disciplines has almost certainly hindered communication and progress. The StatLog project was designed to break down these divisions by selecting classification procedures regardless of historical pedigree, testing them on large-scale and commercially important problems, and hence to determine to what extent the various techniques met the needs of industry. This depends critically on a clear understanding of:
1. the aims of each classification/decision procedure;
2. the class of problems for which it is most suited;
3. measures of performance or benchmarks to monitor the success of the method in a particular application.
About 20 procedures were considered for about 20 datasets, so that results were obtained from around 20 × 20 = 400 large-scale experiments. The set of methods to be considered was pruned after early experiments, using criteria developed for multi-input (problems), many treatments (algorithms) and multiple criteria experiments. A management hierarchy led by Daimler-Benz controlled the full project. The objectives of the Project were threefold:
1. to provide critical performance measurements on available classification procedures;
2. to indicate the nature and scope of further development which particular methods require to meet the expectations of industrial users;
3. to indicate the most promising avenues of development for the commercially immature approaches.

1.4.1 Quality control
The Project laid down strict guidelines for the testing procedure. First an agreed data format was established; algorithms were "deposited" at one site, with appropriate instructions, and this version would be used in the case of any future dispute.
Each dataset was then divided into a training set and a testing set, and any parameters in an algorithm could be "tuned" or estimated only by reference to the training set. Once a rule had been determined, it was then applied to the test data. This procedure was validated at another site by another (more naïve) user for each dataset in the first phase of the Project. This ensured that the guidelines for parameter selection were not violated, and also gave some information on the ease-of-use for a non-expert in the domain. Unfortunately, these guidelines were not followed for the radial basis function (RBF) algorithm, which for some datasets determined the number of centres and locations with reference to the test set, so these results should be viewed with some caution. However, it is thought that the conclusions will be unaffected.

1.4.2 Caution in the interpretations of comparisons
There are some strong caveats that must be made concerning comparisons between techniques in a project such as this. First, the exercise is necessarily somewhat contrived. In any real application, there should be an iterative process in which the constructor of the classifier interacts with the expert in the domain, gaining understanding of the problem and any limitations in the data, and receiving feedback as to the quality of preliminary investigations. In contrast, StatLog datasets were simply distributed and used as test cases for a wide variety of techniques, each applied in a somewhat automatic fashion. Second, the results obtained by applying a technique to a test problem depend on three factors:
1. the essential quality and appropriateness of the technique;
2. the actual implementation of the technique as a computer program;
3. the skill of the user in coaxing the best out of the technique.
In Appendix B we have described the implementations used for each technique, and the availability of more advanced versions if appropriate. However, it is extremely difficult to control adequately the variations in the background and ability of all the experimenters in StatLog (ESPRIT project 5170: comparative testing and evaluation of statistical and logical learning algorithms on large-scale applications to classification, prediction and control), particularly with regard to data analysis and facility in "tuning" procedures to give their best. Individual techniques may, therefore, have suffered from poor implementation and use, but we hope that there is no overall bias against whole classes of procedure.

1.5 THE STRUCTURE OF THIS VOLUME
The present text has been produced by a variety of authors, from widely differing backgrounds, but with the common aim of making the results of the StatLog project accessible to a wide range of workers in the fields of machine learning, statistics and neural networks, and to help the cross-fertilisation of ideas between these groups. After discussing the general classification problem in Chapter 2, the next 4 chapters detail the methods that have been investigated, divided up according to the broad headings of classical statistics, modern statistical techniques, decision trees and rules, and neural networks. The next part of the book concerns the evaluation experiments, and includes chapters on evaluation criteria, a survey of previous comparative studies, a description of the data-sets and the results for the different methods, and an analysis of the results which explores the characteristics of data-sets that make them suitable for particular approaches: we might call this "machine learning on machine learning". The conclusions concerning the experiments are summarised in Chapter 11. The final chapters of the book broaden the interpretation of the basic classification problem.
The fundamental theme of representing knowledge using different formalisms is discussed with relation to constructing classification techniques, followed by a summary of current approaches to dynamic control now arising from a rephrasing of the problem in terms of classification and learning.

2 Classification
R. J. Henery
University of Strathclyde

2.1 DEFINITION OF CLASSIFICATION
Classification has two distinct meanings. We may be given a set of observations with the aim of establishing the existence of classes or clusters in the data. Or we may know for certain that there are so many classes, and the aim is to establish a rule whereby we can classify a new observation into one of the existing classes. The former type is known as Unsupervised Learning (or Clustering), the latter as Supervised Learning. In this book when we use the term classification, we are talking of Supervised Learning. In the statistical literature, Supervised Learning is usually, but not always, referred to as discrimination, by which is meant the establishing of the classification rule from given correctly classified data. The existence of correctly classified data presupposes that someone (the Supervisor) is able to classify without error, so the question naturally arises: why is it necessary to replace this exact classification by some approximation?

2.1.1 Rationale
There are many reasons why we may wish to set up a classification procedure, and some of these are discussed later in relation to the actual datasets used in this book. Here we outline possible reasons for the examples in Section 1.2.
1. Mechanical classification procedures may be much faster: for example, postal code reading machines may be able to sort the majority of letters, leaving the difficult cases to human readers.
2. A mail order firm must take a decision on the granting of credit purely on the basis of information supplied in the application form: human operators may well have biases, i.e.
may make decisions on irrelevant information and may turn away good customers.
3. In the medical field, we may wish to avoid the surgery that would be the only sure way of making an exact diagnosis, so we ask whether a reliable diagnosis can be made on purely external symptoms.
4. The Supervisor (referred to above) may be the verdict of history, as in meteorology or stock-exchange transactions or investment and loan decisions. In this case the issue is one of forecasting.
(Address for correspondence: Department of Statistics and Modelling Science, University of Strathclyde, Glasgow G1 1XH, U.K.)

2.1.2 Issues
There are also many issues of concern to the would-be classifier. We list below a few of these.
- Accuracy. There is the reliability of the rule, usually represented by the proportion of correct classifications, although it may be that some errors are more serious than others, and it may be important to control the error rate for some key class.
- Speed. In some circumstances, the speed of the classifier is a major issue. A classifier that is 90% accurate may be preferred over one that is 95% accurate if it is 100 times faster in testing (and such differences in time-scales are not uncommon in neural networks, for example). Such considerations would be important for the automatic reading of postal codes, or automatic fault detection of items on a production line, for example.
- Comprehensibility. If it is a human operator that must apply the classification procedure, the procedure must be easily understood, or else mistakes will be made in applying the rule. It is important also that human operators believe the system. An oft-quoted example is the Three Mile Island case, where the automatic devices correctly recommended a shutdown, but this recommendation was not acted upon by the human operators, who did not believe that the recommendation was well founded. A similar story applies to the Chernobyl disaster.
- Time to Learn.
Especially in a rapidly changing environment, it may be necessary to learn a classification rule quickly, or to make adjustments to an existing rule in real time. "Quickly" might also imply that we need only a small number of observations to establish our rule.

At one extreme, consider the naïve 1-nearest neighbour rule, in which the training set is searched for the 'nearest' (in a defined sense) previous example, whose class is then assumed for the new case. This is very fast to learn (no time at all!), but is very slow in practice if all the data are used (although if you have a massively parallel computer you might speed up the method considerably). At the other extreme, there are cases where it is very useful to have a quick-and-dirty method, possibly for eyeball checking of data, or for providing a quick cross-check on the results of another procedure. For example, a bank manager might know that the simple rule-of-thumb "only give credit to applicants who already have a bank account" is a fairly reliable rule. If she notices that the new assistant (or the new automated procedure) is mostly giving credit to customers who do not have a bank account, she would probably wish to check that the new assistant (or new procedure) was operating correctly.

2.1.3 Class definitions

An important question, often imperfectly understood in many studies of classification, is the nature of the classes and the way that they are defined. We can distinguish three common cases, only the first of which leads to what statisticians would term classification:

1. Classes correspond to labels for different populations: membership of the various populations is not in question. For example, dogs and cats form quite separate classes or populations, and it is known, with certainty, whether an animal is a dog or a cat (or neither).
Membership of a class or population is determined by an independent authority (the Supervisor), the allocation to a class being determined independently of any particular attributes or variables.
2. Classes result from a prediction problem. Here class is essentially an outcome that must be predicted from a knowledge of the attributes. In statistical terms, the class is a random variable. A typical example is the prediction of interest rates: frequently the question is put, will interest rates rise (class=1) or not (class=0)?
3. Classes are pre-defined by a partition of the sample space, i.e. of the attributes themselves. We may say that class is a function of the attributes. Thus a manufactured item may be classed as faulty if some attributes are outside predetermined limits, and not faulty otherwise. There is a rule that has already classified the data from the attributes: the problem is to create a rule that mimics the actual rule as closely as possible. Many credit datasets are of this type.

In practice, datasets may be mixtures of these types, or may be somewhere in between.

2.1.4 Accuracy

On the question of accuracy, we should always bear in mind that accuracy as measured on the training set and accuracy as measured on unseen data (the test set) are often very different. Indeed it is not uncommon, especially in Machine Learning applications, for the training set to be fitted perfectly but for performance on the test set to be very disappointing. Usually, it is the accuracy on the unseen data, when the true classification is unknown, that is of practical importance. The generally accepted method for estimating this is to use the given data, in which we assume that all class memberships are known, as follows. Firstly, we use a substantial proportion (the training set) of the given data to train the procedure. This rule is then tested on the remaining data (the test set), and the results compared with the known classifications.
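The train-and-test procedure just described can be sketched in a few lines of Python. The helper `holdout_accuracy` and the toy majority-class rule below are illustrative names, not anything from the book:

```python
import random

def holdout_accuracy(data, labels, fit, train_frac=0.7, seed=0):
    """Estimate a rule's accuracy by a random train/test split.

    `fit` is a factory: given (train_data, train_labels) it returns a
    function mapping one observation to a predicted class.
    """
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    rule = fit([data[i] for i in train], [labels[i] for i in train])
    correct = sum(rule(data[i]) == labels[i] for i in test)
    return correct / len(test)

def majority_rule(xs, ys):
    """Trivial classifier: always predict the commonest training class."""
    winner = max(set(ys), key=ys.count)
    return lambda x: winner

data = [[float(i)] for i in range(10)]
labels = ["a"] * 7 + ["b"] * 3
acc = holdout_accuracy(data, labels, majority_rule)
```

Accuracy measured this way refers only to the held-out test cases, matching the procedure in the text.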
The proportion correct in the test set is an unbiased estimate of the accuracy of the rule, provided that the training set is randomly sampled from the given data.

2.2 EXAMPLES OF CLASSIFIERS

To illustrate the basic types of classifiers, we will use the well-known Iris dataset, which is given, in full, in Kendall & Stuart (1983). There are three varieties of Iris: Setosa, Versicolor and Virginica. The length and breadth of both petal and sepal were measured on 50 flowers of each variety. The original problem is to classify a new Iris flower into one of these three types on the basis of the four attributes (petal and sepal length and width). To keep this example simple, however, we will look for a classification rule by which the varieties can be distinguished purely on the basis of the two measurements of Petal Length and Petal Width. We have available fifty pairs of measurements of each variety from which to learn the classification rule.

2.2.1 Fisher's linear discriminants

This is one of the oldest classification procedures, and is the most commonly implemented in computer packages. The idea is to divide sample space by a series of lines in two dimensions, planes in 3-D and, more generally, hyperplanes in many dimensions. The line dividing two classes is drawn to bisect the line joining the centres of those classes; the direction of the line is determined by the shape of the clusters of points. For example, to differentiate between Versicolor and Virginica, a rule of the following form is applied:

If Petal Width < a + b × Petal Length, then Versicolor.
If Petal Width > a + b × Petal Length, then Virginica.

where the constants a and b are estimated from the data. Fisher's linear discriminants applied to the Iris data are shown in Figure 2.1. Six of the observations would be misclassified.
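A minimal two-class sketch of Fisher's discriminant direction follows, using synthetic 2-D clouds in place of the real Iris measurements; the function names and cluster centres are illustrative assumptions, not the actual fitted coefficients:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's discriminant: w solves Sw w = (m1 - m2), where Sw is the
    pooled within-class scatter; the threshold c is the projected midpoint."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False) * (len(X1) - 1)
    S2 = np.cov(X2, rowvar=False) * (len(X2) - 1)
    w = np.linalg.solve(S1 + S2, m1 - m2)
    c = w @ (m1 + m2) / 2.0
    return w, c

def classify(x, w, c, names=("class1", "class2")):
    # Points projecting above the midpoint go to the first class.
    return names[0] if w @ x > c else names[1]

rng = np.random.default_rng(0)
# Two synthetic 2-D clouds standing in for petal measurements.
A = rng.normal([4.5, 1.3], 0.3, size=(50, 2))
B = rng.normal([5.8, 2.1], 0.3, size=(50, 2))
w, c = fisher_direction(A, B)
```

The decision boundary w·x = c is the dividing line (in general, a hyperplane) described in the text.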
Fig. 2.1: Classification by linear discriminants: Iris data.

2.2.2 Decision tree and Rule-based methods

One class of classification procedures is based on recursive partitioning of the sample space. Space is divided into boxes, and at each stage in the procedure, each box is examined to see if it may be split into two boxes, the split usually being parallel to the coordinate axes. An example for the Iris data follows.

If Petal Length < 2.65 then Setosa.
If Petal Length > 4.95 then Virginica.
If 2.65 < Petal Length < 4.95 then:
  if Petal Width < 1.65 then Versicolor;
  if Petal Width > 1.65 then Virginica.

The resulting partition is shown in Figure 2.2. Note that this classification rule has three mis-classifications.

Fig. 2.2: Classification by decision tree: Iris data.

Weiss & Kapouleas (1989) give an alternative classification rule for the Iris data that is very directly related to Figure 2.2. Their rule can be obtained from Figure 2.2 by continuing the dotted line to the left, and can be stated thus:

If Petal Length < 2.65 then Setosa.
If Petal Length > 4.95 or Petal Width > 1.65 then Virginica.
Otherwise Versicolor.

Notice that this rule, while equivalent to the rule illustrated in Figure 2.2, is stated more concisely, and this formulation may be preferred for that reason. Notice also that the rule is ambiguous if Petal Length < 2.65 and Petal Width > 1.65.
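The decision-tree rule translates directly into code; a sketch, with the cut-points 2.65, 4.95 and 1.65 taken from the text and the tests applied in sequence so that every case gets exactly one class:

```python
def classify_iris(petal_length, petal_width):
    """Decision-tree rule for the Iris data with the quoted cut-points,
    applied in order."""
    if petal_length < 2.65:
        return "Setosa"
    if petal_length > 4.95:
        return "Virginica"
    # 2.65 <= Petal Length <= 4.95: split on Petal Width at 1.65.
    if petal_width < 1.65:
        return "Versicolor"
    return "Virginica"
```

For example, `classify_iris(1.4, 0.2)` returns "Setosa", a typical Setosa measurement.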
The quoted rules may be made unambiguous by applying them in the given order, and they are then just a re-statement of the previous decision tree. The rule discussed here is an instance of a rule-based method: such methods have very close links with decision trees.

2.2.3 k-Nearest-Neighbour

We illustrate this technique on the Iris data. Suppose a new Iris is to be classified. The idea is that it is most likely to be near to observations from its own proper population, so we look at the five (say) nearest observations from all previously recorded Irises and classify the observation according to the most frequent class among its neighbours. In Figure 2.3, the new observation is marked, and its nearest observations lie within the circle centred on it. The apparent elliptical shape of this neighbourhood is due to the differing horizontal and vertical scales, and the proper scaling of the observations is a major difficulty of this method. The marked observation would be classified as Virginica, since Virginica is the most frequent class among its five nearest neighbours.

Fig. 2.3: Classification by 5-Nearest-Neighbours: Iris data.

2.3 CHOICE OF VARIABLES

As we have just pointed out in relation to k-nearest-neighbour, it may be necessary to reduce the weight attached to some variables by suitable scaling. At one extreme, we might remove some variables altogether if they do not contribute usefully to the discrimination, although this is not always easy to decide.
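A k-nearest-neighbour sketch that makes the scaling issue explicit: each coordinate is divided by a scale factor before distances are computed. The helper name and the toy training set are illustrative, not the actual Iris data:

```python
from collections import Counter
import math

def knn_classify(x, train, k=5, scale=None):
    """Vote among the k nearest training points; `scale` divides each
    coordinate before computing distances, since variable scaling is
    the method's main practical difficulty."""
    if scale is None:
        scale = [1.0] * len(x)
    def dist(a, b):
        return math.sqrt(sum(((ai - bi) / s) ** 2
                             for ai, bi, s in zip(a, b, scale)))
    nearest = sorted(train, key=lambda p: dist(x, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: (measurements, class) pairs.
train = ([([1.4, 0.2], "Setosa")] * 5 +
         [([4.3, 1.3], "Versicolor")] * 5 +
         [([5.6, 2.2], "Virginica")] * 5)
```

Changing `scale` changes which neighbours fall inside the neighbourhood, which is exactly the elliptical-versus-circular effect noted in the text.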
There are established procedures (for example, forward stepwise selection) for removing unnecessary variables in linear discriminants, but, for large datasets, the performance of linear discriminants is not seriously affected by including such unnecessary variables. In contrast, the presence of irrelevant variables is always a problem with k-nearest-neighbour, regardless of dataset size.

2.3.1 Transformations and combinations of variables

Often problems can be simplified by a judicious transformation of variables. With statistical procedures, the aim is usually to transform the attributes so that their marginal density is approximately normal, usually by applying a monotonic transformation of the power-law type. Monotonic transformations do not affect the Machine Learning methods, but these methods can benefit from combining variables, for example by taking ratios or differences of key variables. Background knowledge of the problem is of help in determining what transformation or combination to use. For example, in the Iris data, the product of the variables Petal Length and Petal Width gives a single attribute which has the dimensions of area, and might be labelled as Petal Area. It so happens that a decision rule based on the single variable Petal Area is a good classifier, with only four errors:

If Petal Area < 2.0 then Setosa.
If 2.0 < Petal Area < 7.4 then Versicolor.
If Petal Area > 7.4 then Virginica.

This tree, while it has one more error than the decision tree quoted earlier, might be preferred on the grounds of conceptual simplicity, as it involves only one "concept", namely Petal Area. Also, one less arbitrary constant need be remembered (i.e. there is one less node or cut-point in the decision tree).

2.4 CLASSIFICATION OF CLASSIFICATION PROCEDURES

The above three procedures (linear discrimination, decision-tree and rule-based methods, k-nearest-neighbour) are prototypes for three types of classification procedure.
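The single-attribute Petal Area rule from Section 2.3.1 can be written as a one-variable classifier; a sketch, with the cut-points 2.0 and 7.4 taken from the text and the middle band read as Versicolor, consistent with a three-variety rule making four errors:

```python
def classify_by_area(petal_length, petal_width):
    """Single-attribute rule: Petal Area = Petal Length x Petal Width,
    with cut-points 2.0 and 7.4."""
    area = petal_length * petal_width
    if area < 2.0:
        return "Setosa"
    if area < 7.4:
        return "Versicolor"
    return "Virginica"
```

Only two cut-points need be remembered here, one fewer than in the earlier two-variable decision tree.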
Not surprisingly, they have been refined and extended, but they still represent the major strands in current classification practice and research. The 23 procedures investigated in this book can be directly linked to one or other of the above. However, within this book the methods have been grouped around the more traditional headings of classical statistics, modern statistical techniques, Machine Learning and neural networks. Chapters 3 to 6, respectively, are devoted to each of these. For some methods, the classification is rather arbitrary.

2.4.1 Extensions to linear discrimination

We can include in this group those procedures that start from linear combinations of the measurements, even if these combinations are subsequently subjected to some nonlinear transformation. There are 7 procedures of this type: linear discriminants; logistic discriminants; quadratic discriminants; multi-layer perceptron (backprop and cascade); DIPOL92; and projection pursuit. Note that this group consists of statistical and neural network (specifically multilayer perceptron) methods only.

2.4.2 Decision trees and Rule-based methods

This is the most numerous group in the book, with 9 procedures: NewID; AC2; Cal5; CN2; C4.5; CART; IndCART; Bayes Tree; and ITrule (see Chapter 5).

2.4.3 Density estimates

This group is a little less homogeneous, but its 7 members have this in common: the procedure is intimately linked with the estimation of the local probability density at each point in sample space. The density estimate group contains: k-nearest-neighbour; radial basis functions; Naive Bayes; Polytrees; Kohonen self-organising net; LVQ; and the kernel density method. This group also contains only statistical and neural net methods.

2.5 A GENERAL STRUCTURE FOR CLASSIFICATION PROBLEMS

There are three essential components to a classification problem.

1.
The relative frequency with which the classes occur in the population of interest, expressed formally as the prior probability distribution. ...
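The density-estimate family of Section 2.4.3, combined with prior class probabilities of the kind just introduced, can be sketched with a simple one-dimensional kernel density classifier; the function names, bandwidth, and toy samples below are all illustrative assumptions, not any specific StatLog procedure:

```python
import math

def kernel_density(x, sample, h=0.5):
    """Gaussian kernel estimate of the density at x from a 1-D sample."""
    norm = len(sample) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm

def density_classify(x, samples_by_class, priors=None, h=0.5):
    """Assign x to the class maximising prior times estimated density."""
    if priors is None:
        priors = {c: 1.0 for c in samples_by_class}
    return max(samples_by_class,
               key=lambda c: priors[c] * kernel_density(x, samples_by_class[c], h))

samples = {"low": [0.9, 1.0, 1.1, 1.2], "high": [2.9, 3.0, 3.1, 3.2]}
```

Adjusting `priors` shifts the decision towards the class that occurs more frequently in the population of interest.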