Contents

1 Introduction
   1.1 Machine Perception
   1.2 An Example
       1.2.1 Related fields
   1.3 The Sub-problems of Pattern Classification
       1.3.1 Feature Extraction
       1.3.2 Noise
       1.3.3 Overfitting
       1.3.4 Model Selection
       1.3.5 Prior Knowledge
       1.3.6 Missing Features
       1.3.7 Mereology
       1.3.8 Segmentation
       1.3.9 Context
       1.3.10 Invariances
       1.3.11 Evidence Pooling
       1.3.12 Costs and Risks
       1.3.13 Computational Complexity
   1.4 Learning and Adaptation
       1.4.1 Supervised Learning
       1.4.2 Unsupervised Learning
       1.4.3 Reinforcement Learning
   1.5 Conclusion
   Summary by Chapters
   Bibliographical and Historical Remarks
   Bibliography
   Index
Chapter 1: Introduction

The ease with which we recognize a face, understand spoken words, read handwritten characters, identify our car keys in our pocket by feel, and decide whether an apple is ripe by its smell belies the astoundingly complex processes that underlie these acts of pattern recognition. Pattern recognition -- the act of taking in raw data and taking an action based on the "category" of the pattern -- has been crucial for our survival, and over the past tens of millions of years we have evolved highly sophisticated neural and cognitive systems for such tasks.

1.1 Machine Perception

It is natural that we should seek to design and build machines that can recognize patterns. From automated speech recognition, fingerprint identification, optical character recognition, DNA sequence identification and much more, it is clear that reliable, accurate pattern recognition by machine would be immensely useful. Moreover, in solving the myriad problems required to build such systems, we gain deeper understanding and appreciation for pattern recognition systems in the natural world -- most particularly in humans. For some applications, such as speech and visual recognition, our design efforts may in fact be influenced by knowledge of how these problems are solved in nature, both in the algorithms we employ and in the design of special-purpose hardware.

1.2 An Example

To illustrate the complexity of some of the types of problems involved, let us consider the following imaginary and somewhat fanciful example. Suppose that a fish packing plant wants to automate the process of sorting incoming fish on a conveyor belt according to species. As a pilot project it is decided to try to separate sea bass from salmon using optical sensing.
We set up a camera, take some sample images and begin to note some physical differences between the two types of fish -- length, lightness, width, number and shape of fins, position of the mouth, and so on -- and these suggest features to explore for use in our classifier. We also notice noise or variations in the images -- variations in lighting, position of the fish on the conveyor, even "static" due to the electronics of the camera itself. Given that there truly are differences between the population of sea bass and that of salmon, we view them as having different models -- different descriptions, which are typically mathematical in form. The overarching goal and approach in pattern classification is to hypothesize the class of these models, process the sensed data to eliminate noise (not due to the models), and for any sensed pattern choose the model that corresponds best. Any techniques that further this aim should be in the conceptual toolbox of the designer of pattern recognition systems. Our prototype system to perform this very specific task might well have the form shown in Fig. 1.1. First the camera captures an image of the fish. Next, the camera's signals are preprocessed to simplify subsequent operations without losing relevant information. In particular, we might use a segmentation operation in which the images of different fish are somehow isolated from one another and from the background. The information from a single fish is then sent to a feature extractor, whose purpose is to reduce the data by measuring certain "features" or "properties." These features (or, more precisely, the values of these features) are then passed to a classifier that evaluates the evidence presented and makes a final decision as to the species.
The preprocessor might automatically adjust for average light level, or threshold the image to remove the background of the conveyor belt, and so forth. For the moment let us pass over how the images of the fish might be segmented and consider how the feature extractor and classifier might be designed. Suppose somebody at the fish plant tells us that a sea bass is generally longer than a salmon. These, then, give us our tentative models for the fish: sea bass have some typical length, and this is greater than that for salmon. Then length becomes an obvious feature, and we might attempt to classify the fish merely by seeing whether or not the length l of a fish exceeds some critical value l*. To choose l* we could obtain some design or training samples of the different types of fish, (somehow) make length measurements, and inspect the results. Suppose that we do this, and obtain the histograms shown in Fig. 1.2. These disappointing histograms bear out the statement that sea bass are somewhat longer than salmon, on average, but it is clear that this single criterion is quite poor: no matter how we choose l*, we cannot reliably separate sea bass from salmon by length alone. Discouraged, but undeterred by these unpromising results, we try another feature -- the average lightness of the fish scales. Now we are very careful to eliminate variations in illumination, since they can only obscure the models and corrupt our new classifier. The resulting histograms, shown in Fig. 1.3, are much more satisfactory -- the classes are much better separated. So far we have tacitly assumed that the consequences of our actions are equally costly: deciding the fish was a sea bass when in fact it was a salmon was just as undesirable as the converse. Such a symmetry in the cost is often, but not invariably, the case.
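The threshold-picking step just described can be sketched in a few lines: given labeled length measurements, scan the candidate thresholds and keep the l* with the fewest training errors. All the numbers below are invented for illustration; they do not come from the text's figures.

```python
# Minimal sketch: choose the length threshold l* that minimizes errors on the
# training samples, deciding "sea bass" whenever length > l*.
# All lengths here are hypothetical.

def best_threshold(salmon, sea_bass):
    """Return (l_star, errors): decide 'sea bass' when length > l_star."""
    best = None
    for l_star in sorted(salmon + sea_bass):
        # salmon misclassified when longer than l_star;
        # sea bass misclassified when not longer than l_star
        errors = (sum(l > l_star for l in salmon) +
                  sum(l <= l_star for l in sea_bass))
        if best is None or errors < best[1]:
            best = (l_star, errors)
    return best

salmon_lengths = [8, 10, 11, 12, 13, 15]     # hypothetical training samples
sea_bass_lengths = [11, 13, 14, 16, 17, 18]  # overlap => some errors unavoidable

print(best_threshold(salmon_lengths, sea_bass_lengths))
```

Because the two length distributions overlap, no threshold achieves zero training errors -- precisely the situation the histograms of Fig. 1.2 depict.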
For instance, as a fish packing company we may know that our customers easily accept occasional pieces of tasty salmon in their cans labeled "sea bass," but they object vigorously if a piece of sea bass appears in their cans labeled "salmon." If we want to stay in business, we should adjust our decision boundary to avoid antagonizing our customers, even if it means that more salmon makes its way into the cans of sea bass. In this case, then, we should move our decision boundary x* to smaller values of lightness, thereby reducing the number of sea bass that are classified as salmon (Fig. 1.3).

Figure 1.1: The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed, then the features extracted and finally the classification emitted (here either "salmon" or "sea bass"). Although the information flow is often chosen to be from the source to the classifier ("bottom-up"), some systems employ "top-down" flow as well, in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction.

The more our customers object to getting sea bass with their salmon -- i.e., the more costly this type of error -- the lower we should set the decision threshold x* in Fig. 1.3. Such considerations suggest that there is an overall single cost associated with our decision, and our true task is to make a decision rule (i.e., set a decision boundary) so as to minimize such a cost. This is the central task of decision theory, of which pattern classification is perhaps the most important subfield. Even if we know the costs associated with our decisions and choose the optimal decision boundary x*, we may be dissatisfied with the resulting performance. Our first impulse might be to seek yet a different feature on which to separate the fish.
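The effect of asymmetric costs on the threshold can be sketched concretely: instead of minimizing the raw error count, minimize total cost. The cost figures and lightness values below are hypothetical; only the decision rule ("sea bass" when lightness exceeds x*) follows the text.

```python
# Sketch of a cost-sensitive threshold on lightness: decide "sea bass" when
# lightness > x*, choosing x* to minimize total cost rather than error count.
# All costs and measurements are hypothetical.

COST_BASS_IN_SALMON_CAN = 10.0   # customers object strongly to this error
COST_SALMON_IN_BASS_CAN = 1.0    # customers barely mind this one

def best_cost_threshold(salmon, sea_bass,
                        cost_bass_as_salmon=COST_BASS_IN_SALMON_CAN,
                        cost_salmon_as_bass=COST_SALMON_IN_BASS_CAN):
    """Return (x_star, cost) minimizing total misclassification cost."""
    best = None
    for x_star in sorted(salmon + sea_bass):
        cost = (cost_salmon_as_bass * sum(v > x_star for v in salmon) +
                cost_bass_as_salmon * sum(v <= x_star for v in sea_bass))
        if best is None or cost < best[1]:
            best = (x_star, cost)
    return best

salmon_light = [2, 4, 5, 6]   # hypothetical lightness samples
bass_light = [3, 6, 7, 8]

print(best_cost_threshold(salmon_light, bass_light, 1.0, 1.0))  # symmetric costs
print(best_cost_threshold(salmon_light, bass_light))            # asymmetric costs
```

With symmetric costs the threshold settles where raw errors are fewest; making sea-bass-in-a-salmon-can ten times costlier drives x* down to a smaller lightness value -- exactly the shift in decision boundary the passage describes.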
Let us assume, though, that no other single visual feature yields better performance than that based on lightness. To improve recognition, then, we must resort to the use of more than one feature at a time.

Figure 1.2: Histograms for the length feature for the two categories. No single threshold value l* (decision boundary) will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value l* marked will lead to the smallest number of errors, on average.

Figure 1.3: Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average.

Figure 1.4: The two features of lightness and width for sea bass and salmon. The dark line might serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors.

In our search for other features, we might try to capitalize on the observation that sea bass are typically wider than salmon. Now we have two features for classifying fish -- the lightness x1 and the width x2. If we ignore how these features might be measured in practice, we realize that the feature extractor has thus reduced the image of each fish to a point or feature vector x = (x1, x2)^T in a two-dimensional feature space.
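A straight-line decision boundary in this two-dimensional feature space can be sketched as follows. The weights and example measurements are picked by eye for illustration and are not learned from any data in the text.

```python
# Sketch: each fish reduced to a feature vector x = (lightness, width), and a
# straight-line decision boundary w . x + b = 0 separating the two regions.
# The weights w and offset b are hypothetical, hand-picked for illustration.

def classify(x, w=(2.0, 1.0), b=-30.0):
    """Return a species label based on which side of the line x falls on."""
    score = w[0] * x[0] + w[1] * x[1] + b
    return "sea bass" if score > 0 else "salmon"

print(classify((3.0, 16.0)))   # a dark fish of typical width
print(classify((8.0, 17.0)))   # a light, wide fish
```

In practice such weights would be chosen (learned) from the training samples, but the structure -- a linear function of the feature vector compared against zero -- is the decision boundary drawn in Fig. 1.4.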
Our problem now is to partition the feature space into two regions, where for all patterns in one region we will call the fish a sea bass, and for all points in the other we will call it a salmon. Suppose that we measure the feature vectors for our samples and obtain the scattering of points shown in Fig. 1.4. This plot suggests the following rule for separating the fish: Classify the fish as sea bass if its feature vector falls above the decision boundary shown, and as salmon otherwise. This rule appears to do a good job of separating our samples and suggests that perhaps incorporating yet more features would be desirable. Besides the lightness and width of the fish, we might include some shape parameter, such as the vertex angle of the dorsal fin, or the placement of the eyes (as expressed as a proportion of the mouth-to-tail distance), and so on. How do we know beforehand which of these features will work best? Some features might be redundant: for instance, if the eye color of all fish correlated perfectly with width, then classification performance need not be improved if we also include eye color as a feature. Even if the difficulty or computational cost in attaining more features is of no concern, might we ever have too many features?

Suppose that other features are too expensive or too difficult to measure, or provide little improvement (or possibly even degrade the performance) in the approach described above, and that we are forced to make our decision based on the two features in Fig. 1.4. If our models were extremely complicated, our classifier would have a decision boundary more complex than the simple straight line. In that case all the training patterns would be separated perfectly, as shown in Fig. 1.5.

Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision boundary may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be misclassified as a sea bass.

With such a "solution," though, our satisfaction would be premature, because the central aim of designing a classifier is to suggest actions when presented with novel patterns, i.e., fish not yet seen. This is the issue of generalization. It is unlikely that the complex decision boundary in Fig. 1.5 would provide good generalization, since it seems to be "tuned" to the particular training samples, rather than to some underlying characteristics or true model of all the sea bass and salmon that will have to be separated. Naturally, one approach would be to get more training samples for obtaining a better estimate of the true underlying characteristics, for instance the probability distributions of the categories. In most pattern recognition problems, however, the amount of such data we can obtain easily is often quite limited. Even with a vast amount of training data in a continuous feature space, though, if we followed the approach in Fig. 1.5 our classifier would give a horrendously complicated decision boundary -- one that would be unlikely to do well on novel patterns. Rather, then, we might seek to "simplify" the recognizer, motivated by a belief that the underlying models will not require a decision boundary that is as complex as that in Fig. 1.5. Indeed, we might be satisfied with slightly poorer performance on the training samples if it means that our classifier will have better performance on novel patterns. But if designing a very complex recognizer is unlikely to give good generalization, precisely how should we quantify and favor simpler classifiers?
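The train-versus-test distinction above can be made concrete with a small sketch: a rule that simply memorizes the training samples (here, a one-nearest-neighbor lookup standing in for the overly complex boundary of Fig. 1.5) achieves zero training error, yet a crude fixed threshold generalizes better on held-out samples. The one-dimensional data are invented, with the two classes deliberately overlapping.

```python
# Sketch of overfitting: memorization fits the training set perfectly but
# loses to a simple model on novel patterns. All data are invented; the two
# classes "A" and "B" overlap near x = 3.

def nn_classify(x, train):
    # predict the label of the closest stored training sample (memorization)
    return min(train, key=lambda p: abs(p[0] - x))[1]

def threshold_classify(x):
    # simple model: class "B" for x above a fixed boundary
    return "B" if x > 3.0 else "A"

train = [(1.0, "A"), (2.0, "A"), (2.9, "A"), (3.2, "A"),
         (2.8, "B"), (3.1, "B"), (4.0, "B"), (5.0, "B")]
test  = [(2.7, "A"), (3.3, "B"), (2.95, "B"), (3.05, "A")]

def errors(rule, data):
    return sum(rule(x) != label for x, label in data)

print("train errors, 1-NN:", errors(lambda x: nn_classify(x, train), train))
print("train errors, threshold:", errors(threshold_classify, train))
print("test errors, 1-NN:", errors(lambda x: nn_classify(x, train), test))
print("test errors, threshold:", errors(threshold_classify, test))
```

The memorizing rule is perfect on the training samples (each point is its own nearest neighbor) yet errs on every held-out point in the overlap region, while the simple threshold accepts a couple of training errors and does better on the novel patterns -- the tradeoff the passage describes.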
How would our system automatically determine that the simple curve in Fig. 1.6 is preferable to the manifestly simpler straight line in Fig. 1.4 or the complicated boundary in Fig. 1.5? Assuming that we somehow manage to optimize this tradeoff, can we then predict how well our system will generalize to new patterns? These are some of the central problems in statistical pattern recognition. (The philosophical underpinnings of this approach derive from William of Occam (1284-1347?), who advocated favoring simpler explanations over those that are needlessly complicated -- Entia non sunt multiplicanda praeter necessitatem: "Entities are not to be multiplied without necessity." Decisions based on overly complex models often lead to lower accuracy of the classifier.)

Figure 1.6: The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier.

For the same incoming patterns, we might need to use a drastically different cost function, and this will lead to different actions altogether. We might, for instance, wish instead to separate the fish based on their sex -- all females (of either species) from all males -- if we wish to sell roe. Alternatively, we might wish to cull the damaged fish (to prepare them separately for cat food), and so on. Different decision tasks may require features and yield boundaries quite different from those useful for our original categorization problem. This makes it quite clear that our decisions are fundamentally task- or cost-specific, and that creating a single general-purpose artificial pattern recognition device -- i.e., one capable of acting accurately based on a wide variety of tasks -- is a profoundly difficult challenge. This, too, should give us added appreciation of the ability of humans to switch rapidly and fluidly between pattern recognition tasks.
Since classification is, at base, the task of recovering the model that generated the patterns, different classification techniques are useful depending on the type of candidate models themselves. In statistical pattern recognition we focus on the statistical properties of the patterns (generally expressed in probability densities), and this will command most of our attention in this book. Here the model for a pattern may be a single specific set of features, though the actual pattern sensed has been corrupted by some form of random noise. Occasionally it is claimed that neural pattern recognition (or neural network pattern classification) should be considered its own discipline, but despite its somewhat different intellectual pedigree, we will consider it a close descendant of statistical pattern recognition, for reasons that will become clear. If instead the model consists of some set of crisp logical rules, then we employ the methods of syntactic pattern recognition, where rules or grammars describe our decision. For example we might wish to classify an English sentence as grammatical or not, and here statistical descriptions (word frequencies, word correlations, etc.) are inappropriate.

It was necessary in our fish example to choose our features carefully, and hence achieve a representation (as in Fig. 1.6) that enabled reasonably successful pattern classification. A central aspect in virtually every pattern recognition problem is that of achieving such a "good" representation, one in which the structural relationships among the components are simply and naturally revealed, and one in which the true (unknown) model of the patterns can be expressed. In some cases patterns should be represented as vectors of real-valued numbers, in others ordered lists of attributes, in yet others descriptions of parts and their relations, and so forth. We seek a representation in which the patterns that lead to the same action are somehow "close" to one another, yet "far" from those that demand a different action. The extent to which we create or learn a proper representation and how we quantify near and far apart will determine the success of our pattern classifier. A number of additional characteristics are desirable for the representation. We might wish to favor a small number of features, which might lead to simpler decision regions and a classifier easier to train. We might also wish to have features that are robust, i.e., relatively insensitive to noise or other errors. In practical applications we may need the classifier to act quickly, or use few electronic components, memory or processing steps.

A central technique, when we have insufficient training data, is to incorporate knowledge of the problem domain. Indeed the less the training data, the more important is such knowledge, for instance how the patterns themselves were produced. One method that takes this notion to its logical extreme is that of analysis by synthesis, where in the ideal case one has a model of how each pattern is generated. Consider speech recognition. Amidst the manifest acoustic variability among the possible "dee"s that might be uttered by different people, one thing they have in common is that they were all produced by lowering the jaw slightly, opening the mouth, placing the tongue tip against the roof of the mouth after a certain delay, and so on. We might assume that "all" the acoustic variation is due to the happenstance of whether the talker is male or female, old or young, with different overall pitches, and so forth. At some deep level, such a "physiological" model (or so-called "motor" model) for production of the utterances is appropriate, and different (say) from that for "doo" and indeed all other utterances.
If this underlying model of production can be determined from the sound (and that is a very big if ), then we can classify the utterance by how it was produced. That is to say, the production representation may be the "best" representation for classification. Our pattern recognition systems should then analyze (and hence classify) the input pattern based on how one would have to synthesize that pattern. The trick is, of course, to recover the generating parameters from the sensed pattern. Consider the difficulty in making a recognizer of all types of chairs -- standard office chair, contemporary living room chair, beanbag chair, and so forth -- based on an image. Given the astounding variety in the number of legs, material, shape, and so on, we might despair of ever finding a representation that reveals the unity within the class of chair. Perhaps the only such unifying aspect of chairs is functional: a chair is a stable artifact that supports a human sitter, including back support. Thus we might try to deduce such functional properties from the image, and the property "can support a human sitter" is very indirectly related to the orientation of the larger surfaces, and would need to be answered in the affirmative even for a beanbag chair. Of course, this requires some reasoning about the properties and naturally touches upon computer vision rather than pattern recognition proper. Without going to such extremes, many real world pattern recognition systems seek to incorporate at least some knowledge about the method of production of the patterns or their functional use in order to insure a good representation, though of course the goal of the representation is classification, not reproduction. 
For instance, in optical character recognition (OCR) one might confidently assume that handwritten characters are written as a sequence of strokes, and first try to recover a stroke representation from the sensed image, and then deduce the character from the identified strokes.

1.2.1 Related fields

Pattern classification differs from classical statistical hypothesis testing, wherein the sensed data are used to decide whether or not to reject a null hypothesis in favor of some alternative hypothesis. Roughly speaking, if the probability of obtaining the data given some null hypothesis falls below a "significance" threshold, we reject the null hypothesis in favor of the alternative. For typical values of this criterion, there is a strong bias or predilection in favor of the null hypothesis; even though the alternative hypothesis may be more probable, we might not be able to reject the null hypothesis. Hypothesis testing is often used to determine whether a drug is effective, where the null hypothesis is that it has no effect. Hypothesis testing might be used to determine whether the fish on the conveyor belt belong to a single class (the null hypothesis) or to two classes (the alternative). In contrast, given some data, pattern classification seeks to find the most probable hypothesis from a set of hypotheses -- "this fish is probably a salmon."

Pattern classification differs, too, from image processing. In image processing, the input is an image and the output is an image. Image processing steps often include rotation, contrast enhancement, and other transformations which preserve all the original information. Feature extraction, such as finding the peaks and valleys of the intensity, loses information (but hopefully preserves everything relevant to the task at hand). As just described, feature extraction takes in a pattern and produces feature values.
The number of features is virtually always chosen to be fewer than the total necessary to describe the complete target of interest, and this leads to a loss in information. In acts of associative memory, the system takes in a pattern and emits another pattern which is representative of a general group of patterns. It thus reduces the information somewhat, but rarely to the extent that pattern classification does. In short, because of the crucial role of a decision in pattern recognition, it is fundamentally an information reduction process. The classification step represents an even more radical loss of information, reducing the original several thousand bits representing all the color of each of several thousand pixels down to just a few bits representing the chosen category (a single bit in our fish example).

1.3 The Sub-problems of Pattern Classification

We have alluded to some of the issues in pattern classification and we now turn to a more explicit list of them. In practice, these typically require the bulk of the research and development effort. Many are domain or problem specific, and their solution will depend upon the knowledge and insights of the designer. Nevertheless, a few are of sufficient generality, difficulty, and interest that they warrant explicit consideration.

1.3.1 Feature Extraction

The conceptual boundary between feature extraction and classification proper is somewhat arbitrary: an ideal feature extractor would yield a representation that makes the job of the classifier trivial; conversely, an omnipotent classifier would not need the help of a sophisticated feature extractor. The distinction is forced upon us for practical, rather than theoretical, reasons. Generally speaking, the task of feature extraction is much more problem- and domain-dependent than is classification proper, and thus requires knowledge of the domain. A good feature extractor for sorting fish would surely be of little use for identifying fingerprints, or classifying photomicrographs of blood cells. How do we know which features are most promising? Are there ways to automatically learn which features are best for the classifier? How many shall we use?

1.3.2 Noise

The lighting of the fish may vary, there could be shadows cast by neighboring equipment, the conveyor belt might shake -- all reducing the reliability of the feature values actually measured. We define noise in very general terms: any property of the sensed pattern due not to the true underlying model but instead to randomness in the world or the sensors. All non-trivial decision and pattern recognition problems involve noise in some form. In some cases it is due to the transduction in the signal, and we may consign to our preprocessor the role of cleaning up the signal, as for instance visual noise in our video camera viewing the fish. An important problem is knowing somehow whether the variation in some signal is noise or is instead due to complex underlying models of the fish. How then can we use this information to improve our classifier?

1.3.3 Overfitting

In going from Fig. 1.4 to Fig. 1.5 in our fish classification problem, we were, implicitly, using a more complex model of sea bass and of salmon. That is, we were adjusting the complexity of our classifier. While an overly complex model may allow perfect classification of the training samples, it is unlikely to give good classification of novel patterns -- a situation known as overfitting. One of the most important areas of research in statistical pattern classification is determining how to adjust the complexity of the model -- not so simple that it cannot explain the differences between the categories, yet not so complex as to give poor classification on novel patterns. Are there principled methods for finding the best (intermediate) complexity for a classifier?
1.3.4 Model Selection

We might have been unsatisfied with the performance of our fish classifier in Figs. 1.4 & 1.5, and thus jumped to an entirely different class of model, for instance one based on some function of the number and position of the fins, the color of the eyes, the weight, the shape of the mouth, and so on. How do we know when a hypothesized model differs significantly from the true model underlying our patterns, and thus that a new model is needed? In short, how are we to know to reject a class of models and try another one? Are we as designers reduced to random and tedious trial and error in model selection, never really knowing whether we can expect improved performance? Or might there be principled methods for knowing when to jettison one class of models and invoke another? Can we automate the process?

1.3.5 Prior Knowledge

In one limited sense, we have already seen how prior knowledge -- about the lightness of the different fish categories -- helped in the design of a classifier by suggesting a promising feature. Incorporating prior knowledge can be far more subtle and difficult. In some applications the knowledge ultimately derives from information about the production of the patterns, as we saw in analysis-by-synthesis. In others the knowledge may be about the form of the underlying categories, or specific attributes of the patterns, such as the fact that a face has two eyes, one nose, and so on.

1.3.6 Missing Features

Suppose that during classification, the value of one of the features cannot be determined, for example the width of the fish because of occlusion by another fish (i.e., the other fish is in the way). How should the categorizer compensate? Since our two-feature recognizer never had a single-variable threshold value x* determined in anticipation of the possible absence of a feature (cf. Fig. 1.3), how shall it make the best decision using only the feature present?
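One way to see the danger is with a toy two-feature classifier, comparing two responses to a missing width: simply dropping its contribution (marginalizing, in the naive-Bayes case) versus plugging in the population average. Everything below -- the Gaussian class models, their parameters, and the measurements -- is invented for illustration.

```python
# Sketch of why mean-substitution for a missing feature can mislead, using a
# toy Gaussian naive-Bayes classifier on (x1, x2) with equal priors.
# All parameters and measurements are hypothetical.
import math

def log_gauss(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

# per-class parameters: (mu, sd) for x1 (lightness) and for x2 (width)
params = {
    "salmon":   ((2.0, 1.0), (10.0, 3.0)),
    "sea bass": ((6.0, 1.0), (14.0, 0.5)),
}
GLOBAL_MEAN_X2 = 12.0   # mean width over all fish (hypothetical)

def decide(x1, x2=None, impute=False):
    scores = {}
    for cls, ((m1, s1), (m2, s2)) in params.items():
        score = log_gauss(x1, m1, s1)
        if x2 is not None:
            score += log_gauss(x2, m2, s2)
        elif impute:  # naive: pretend the missing width equals the global mean
            score += log_gauss(GLOBAL_MEAN_X2, m2, s2)
        # otherwise: marginalize, i.e. simply omit the x2 factor
        scores[cls] = score
    return max(scores, key=scores.get)

x1_observed = 4.2                         # width occluded by another fish
print(decide(x1_observed))                # marginalized decision
print(decide(x1_observed, impute=True))   # mean-imputed decision
```

In this construction the observed lightness alone favors "sea bass," but plugging in the average width -- a value highly implausible under the sharply peaked sea bass width model -- flips the decision to "salmon." Dropping the missing factor uses exactly the evidence that is present; substituting a fabricated value injects evidence that was never measured.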
The naive method, of merely assuming that the value of the missing feature is zero or the average of the values for the training patterns, is provably non-optimal. Likewise we occasionally have missing features during the creation or learning in our recognizer. How should we train a classifier or use one when some features are missing? 1.3.7 Mereology We effortlessly read a simple word such as BEATS. But consider this: Why didn't we read instead other words that are perfectly good subsets of the full pattern, such as BE, BEAT, EAT, AT, and EATS? Why don't they enter our minds, unless explicitly brought to our attention? Or when we saw the B why didn't we read a P or an I, which are "there" within the B? Conversely, how is it that we can read the two unsegmented words in POLOPONY -- without placing the entire input into a single word category? This is the problem of subsets and supersets -- formally part of mereology, the study of part/whole relationships. It is closely related to that of prior knowledge and segmentation. In short, how do we recognize or group together the "proper" number of elements -- neither too few nor too many? It appears as though the best classifiers try to incorporate as much of the input into the categorization as "makes sense," but not too much. How can this be done? 1.3.8 Segmentation In our fish example, we have tacitly assumed that the fish were isolated, separate on the conveyor belt. In practice, they would often be abutting or overlapping, and our system would have to determine where one fish ends and the next begins -- the individual patterns have to be segmented. If we have already recognized the fish then it would be easier to segment them. But how can we segment the images before they have been categorized or categorize them before they have been segmented? It seems we need a way to know when we have switched from one model to another, or to know when we just have background or "no category." How can this be done? 
Segmentation is one of the deepest problems in automated speech recognition. We might seek to recognize the individual sounds (e.g., phonemes, such as "ss," "k," ...), and then put them together to determine the word. But consider two nonsense words, "sklee" and "skloo." Speak them aloud and notice that for "skloo" you push your lips forward (so-called "rounding," in anticipation of the upcoming "oo") before you utter the "ss." Such rounding influences the sound of the "ss," lowering the frequency spectrum compared to the "ss" sound in "sklee" -- a phenomenon known as anticipatory coarticulation. Thus, the "oo" phoneme reveals its presence in the "ss" earlier than the "k" and "l" which nominally occur before the "oo" itself! How do we segment the "oo" phoneme from the others when they are so manifestly intermingled? Or should we even try? Perhaps we are focusing on groupings of the wrong size, and the most useful unit for recognition is somewhat larger, as we saw in subsets and supersets (Sect. 1.3.7). A related problem occurs in connected cursive handwritten character recognition: How do we know where one character "ends" and the next one "begins"?

1.3.9 Context

We might be able to use context -- input-dependent information other than from the target pattern itself -- to improve our recognizer. For instance, it might be known for our fish packing plant that if we are getting a sequence of salmon, it is highly likely that the next fish will also be a salmon (since it probably comes from a boat that just returned from a fishing area rich in salmon). Thus, if after a long series of salmon our recognizer detects an ambiguous pattern (i.e., one very close to the nominal decision boundary), it may nevertheless be best to categorize it too as a salmon. We shall see how such a simple correlation among patterns -- the most elementary form of context -- might be used to improve recognition.
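This elementary use of context can be sketched numerically. The following is an invented illustration, not the book's method: the Gaussian lightness densities, class means, and priors are all assumed. A run of salmon raises the prior P(salmon), which shifts the decision for an otherwise ambiguous reading.

```python
# Hypothetical sketch: adjusting the class prior from recent context
# before applying Bayes' rule. All numbers below are invented.
import math

def gaussian(x, mu, sigma):
    """Class-conditional density p(x | class), assumed Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, prior_salmon):
    """Decide 'salmon' or 'sea bass' from a lightness reading x."""
    p_salmon = gaussian(x, mu=2.0, sigma=1.0) * prior_salmon
    p_bass = gaussian(x, mu=4.0, sigma=1.0) * (1.0 - prior_salmon)
    return "salmon" if p_salmon > p_bass else "sea bass"

# An ambiguous reading halfway between the assumed class means:
x = 3.0
# With equal priors the pattern sits on the decision boundary;
# a long run of salmon shifts the prior and hence the decision.
print(classify(x, prior_salmon=0.5))
print(classify(x, prior_salmon=0.9))   # after a long series of salmon
```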
But how, precisely, should we incorporate such information? Context can be highly complex and abstract. The utterance "jeetyet?" may seem nonsensical, unless you hear it spoken by a friend in the context of the cafeteria at lunchtime -- "did you eat yet?" How can such a visual and temporal context influence your speech recognition?

1.3.10 Invariances

In seeking to achieve an optimal representation for a particular pattern classification task, we confront the problem of invariances. In our fish example, the absolute position on the conveyor belt is irrelevant to the category, and thus our representation should also be insensitive to the absolute position of the fish. Here we seek a representation that is invariant to the transformation of translation (in either the horizontal or vertical direction). Likewise, in a speech recognition problem, it might be required only that we be able to distinguish between utterances regardless of the particular moment they were uttered; here the "translation" invariance we must ensure is in time.

The "model parameters" describing the orientation of our fish on the conveyor belt are horrendously complicated -- due as they are to the sloshing of water, the bumping of neighboring fish, the shape of the fish net, etc. -- and thus we give up hope of ever trying to use them. These parameters are irrelevant to the model parameters that interest us anyway, i.e., the ones associated with the differences between the fish categories. Thus here we try to build a classifier that is invariant to transformations such as rotation. The orientation of the fish on the conveyor belt is irrelevant to its category. Here the transformation of concern is a two-dimensional rotation about the camera's line of sight. A more general invariance would be for rotations about an arbitrary line in three dimensions.
The image of even such a "simple" object as a coffee cup undergoes radical variation as the cup is rotated to an arbitrary angle -- the handle may become hidden, the bottom of the inside volume come into view, the circular lip appear oval or a straight line or even be obscured, and so forth. How might we ensure that our pattern recognizer is invariant to such complex changes?

The overall size of an image may be irrelevant for categorization. Such differences might be due to variation in the range to the object; alternatively, we may be genuinely unconcerned with differences between sizes -- a young, small salmon is still a salmon.

For patterns that have inherent temporal variation, we may want our recognizer to be insensitive to the rate at which the pattern evolves. Thus a slow hand wave and a fast hand wave may be considered equivalent. Rate variation is a deep problem in speech recognition, of course; not only do different individuals talk at different rates, but even a single talker may vary in rate, causing the speech signal to change in complex ways. Likewise, cursive handwriting varies in complex ways as the writer speeds up -- the placement of dots on the i's, and cross bars on the t's and f's, are the first casualties of rate increase, while the appearance of l's and e's is relatively inviolate. How can we make a recognizer that changes its representations for some categories differently than for others under such rate variation?

A large number of highly complex transformations arise in pattern recognition, and many are domain specific. We might wish to make our handwritten optical character recognizer insensitive to the overall thickness of the pen line, for instance. Far more severe are transformations such as the non-rigid deformations that arise in three-dimensional object recognition, such as the radical variation in the image of your hand as you grasp an object or snap your fingers.
Similarly, variations in illumination or the complex effects of cast shadows may need to be taken into account.

The symmetries just described are continuous -- the pattern can be translated, rotated, sped up, or deformed by an arbitrary amount. In some pattern recognition applications other -- discrete -- symmetries are relevant, such as flips left-to-right or top-to-bottom. In all of these invariances the problem arises: How do we determine whether an invariance is present? How do we efficiently incorporate such knowledge into our recognizer?

1.3.11 Evidence Pooling

In our fish example we saw how using multiple features could lead to improved recognition. We might imagine that we could do better still if we had several component classifiers. If these categorizers agree on a particular pattern, there is no difficulty. But suppose they disagree. How should a "super" classifier pool the evidence from the component recognizers to achieve the best decision? Imagine calling in ten experts to determine whether a particular fish is diseased or not. While nine agree that the fish is healthy, one expert does not. Who is right? It may be that the lone dissenter is the only one familiar with the particular very rare symptoms in the fish, and is in fact correct. How would the "super" categorizer know when to base a decision on a minority opinion, even from an expert in one small domain who is not well qualified to judge throughout a broad range of problems?

1.3.12 Costs and Risks

We should realize that a classifier rarely exists in a vacuum. Instead, it is generally to be used to recommend actions (put this fish in this bucket, put that fish in that bucket), each action having an associated cost or risk. Conceptually, the simplest such risk is the classification error: what percentage of new patterns are assigned to the wrong category. However, the notion of risk is far more general, as we shall see.
We often design our classifier to recommend actions that minimize some total expected cost or risk. Thus, in some sense, the notion of category itself derives from the cost or task. How do we incorporate knowledge about such risks, and how will they affect our classification decision?

Finally, can we estimate the total risk and thus tell whether our classifier is acceptable even before we field it? Can we estimate the lowest possible risk of any classifier, to see how close ours comes to this ideal, or whether the problem is simply too hard overall?

The most typical way to train a classifier is to present an input, compute its tentative category label, and use the known target category label to improve the classifier. For instance, in optical character recognition, the input might be an image of a character, the actual output of the classifier the category label "R," and the desired output a "B." In reinforcement learning, or learning with a critic, no desired category signal is given; instead, the only teaching feedback is that the tentative category is right or wrong. This is analogous to a critic who merely states that something is right or wrong, but does not say specifically how it is wrong. (Thus only binary feedback is given to the classifier; reinforcement learning also describes the case where a single scalar signal, say some number between 0 and 1, is given by the teacher.) In pattern classification, it is most common that such reinforcement is binary -- either the tentative decision is correct or it is not. (Of course, if our problem involves just two categories and equal costs for errors, then learning with a critic is equivalent to standard supervised learning.) How can the system learn which actions are important from such non-specific feedback?

1.5 Conclusion

At this point the reader may be overwhelmed by the number, complexity and magnitude of these sub-problems. Further, these sub-problems are rarely addressed in isolation; they are invariably interrelated.
Thus, for instance, in seeking to reduce the complexity of our classifier, we might affect its ability to deal with invariance. We point out, though, that the good news is at least three-fold: 1) there is an "existence proof" that many of these problems can indeed be solved -- as demonstrated by humans and other biological systems, 2) mathematical theories solving some of these problems have in fact been discovered, and finally 3) there remain many fascinating unsolved problems providing opportunities for progress.

Summary by Chapters

The overall organization of this book is to address first those cases where a great deal of information about the models is known (such as the probability densities, category labels, ...) and to move, chapter by chapter, toward problems where the form of the distributions is unknown and even the category membership of training patterns is unknown. We begin in Chap. ?? (Bayes decision theory) by considering the ideal case in which the probability structure underlying the categories is known perfectly. While this sort of situation rarely occurs in practice, it permits us to determine the optimal (Bayes) classifier against which we can compare all other methods. Moreover, in some problems it enables us to predict the error we will get when we generalize to novel patterns. In Chap. ?? (Maximum Likelihood and Bayesian Parameter Estimation) we address the case when the full probability structure underlying the categories is not known, but the general forms of their distributions are -- i.e., the models. Thus the uncertainty about a probability distribution is represented by the values of some unknown parameters, and we seek to determine these parameters to attain the best categorization. In Chap. ??
(Nonparametric techniques) we move yet further from the Bayesian ideal, and assume that we have no prior parameterized knowledge about the underlying probability structure; in essence our classification will be based on information provided by training samples alone. Classic techniques such as the nearest-neighbor algorithm and potential functions play an important role here. We then in Chap. ?? (Linear Discriminant Functions) return somewhat toward the general approach of parameter estimation. We shall assume that the so-called "discriminant functions" are of a very particular form -- viz., linear -- in order to derive a class of incremental training rules. Next, in Chap. ?? (Nonlinear Discriminants and Neural Networks) we see how some of the ideas from such linear discriminants can be extended to a class of very powerful algorithms, such as backpropagation and others for multilayer neural networks; these neural techniques have a range of useful properties that have made them a mainstay in contemporary pattern recognition research. In Chap. ?? (Stochastic Methods) we discuss simulated annealing by the Boltzmann learning algorithm and other stochastic methods. We explore the behavior of such algorithms with regard to the matter of local minima that can plague other neural methods. Chapter ?? (Non-metric Methods) moves beyond models that are statistical in nature to ones that can best be described by (logical) rules. Here we discuss tree-based algorithms such as CART (which can also be applied to statistical data) and syntactic methods, such as those based on grammars, which rely on crisp rules. Chapter ?? (Theory of Learning) is both the most important chapter and the most difficult one in the book. Some of the results described there, such as the notion of capacity, degrees of freedom, the relationship between expected error and training set size, and computational complexity, are subtle but nevertheless crucial both theoretically and practically.
In some sense, the other chapters can only be fully understood (or used) in light of the results presented here; you cannot expect to solve important pattern classification problems without using the material from this chapter. We conclude in Chap. ?? (Unsupervised Learning and Clustering) by addressing the case when input training patterns are not labeled, and our recognizer must determine the cluster structure. We also treat a related problem, that of learning with a critic, in which the teacher provides only a single bit of information during the presentation of a training pattern -- "yes," that the classification provided by the recognizer is correct, or "no," it isn't. Here algorithms for reinforcement learning will be presented.

Bibliographical and Historical Remarks

Classification is among the first crucial steps in making sense of the blooming buzzing confusion of sensory data that intelligent systems confront. In the western world, the foundations of pattern recognition can be traced to Plato [2], later extended by Aristotle [1], who distinguished an "essential property" (which would be shared by all members in a class or "natural kind," as he put it) from an "accidental property" (which could differ among members in the class). Pattern recognition can be cast as the problem of finding such essential properties of a category. It has been a central theme in the discipline of philosophical epistemology, the study of the nature of knowledge. A more modern treatment of some philosophical problems of pattern recognition, relating to the technical matter in the current book, can be found in [22, 4, 18]. In the eastern world, the first Zen patriarch, Bodhidharma, would point at things and demand that students answer "What is that?" as a way of confronting the deepest issues in mind, the identity of objects, and the nature of classification and decision.
A delightful and particularly insightful book on the foundations of artificial intelligence, including pattern recognition, is [9]. Early technical treatments by Minsky [14] and Rosenfeld [16] are still valuable, as are a number of overviews and reference books [5]. The modern literature on decision theory and pattern recognition is now overwhelming, and comprises dozens of journals, thousands of books and conference proceedings, and innumerable articles; it continues to grow rapidly. While some disciplines, such as statistics [7], machine learning [17] and neural networks [8], expand the foundations of pattern recognition, others, such as computer vision [6, 19] and speech recognition [15], rely on it heavily. Perceptual psychology, cognitive science [12], psychobiology [21] and neuroscience [10] analyze how pattern recognition is achieved in humans and other animals. The extreme view that everything in human cognition -- including rule-following and logic -- can be reduced to pattern recognition is presented in [13]. Pattern recognition techniques have been applied in virtually every scientific and technical discipline.

The expected loss associated with taking action α_i is

R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x). (11)

In decision-theoretic terminology, an expected loss is called a risk, and R(α_i|x) is called the conditional risk. Whenever we encounter a particular observation x, we can minimize our expected loss by selecting the action that minimizes the conditional risk. We shall now show that this Bayes decision procedure actually provides the optimal performance on an overall risk. Stated formally, our problem is to find a decision rule against P(ω_j) that minimizes the overall risk. A general decision rule is a function α(x) that tells us which action to take for every possible observation. To be more specific, for every x the decision function α(x) assumes one of the a values α_1, ..., α_a. The overall risk R is the expected loss associated with a given decision rule.
Since R(α_i|x) is the conditional risk associated with action α_i, and since the decision rule specifies the action, the overall risk is given by

R = ∫ R(α(x)|x) p(x) dx, (12)

where dx is our notation for a d-space volume element, and where the integral extends over the entire feature space. Clearly, if α(x) is chosen so that R(α(x)|x) is as small as possible for every x, then the overall risk will be minimized. This justifies the following statement of the Bayes decision rule: To minimize the overall risk, compute the conditional risk

R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x) (13)

for i = 1, ..., a and select the action α_i for which R(α_i|x) is minimum. The resulting minimum overall risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.

2.2.1 Two-Category Classification

Let us consider these results when applied to the special case of two-category classification problems. Here action α_1 corresponds to deciding that the true state of nature is ω_1, and action α_2 corresponds to deciding that it is ω_2. For notational simplicity, let λ_ij = λ(α_i|ω_j) be the loss incurred for deciding ω_i when the true state of nature is ω_j. If we write out the conditional risk given by Eq. 13, we obtain

R(α_1|x) = λ_11 P(ω_1|x) + λ_12 P(ω_2|x)  and
R(α_2|x) = λ_21 P(ω_1|x) + λ_22 P(ω_2|x). (14)

There are a variety of ways of expressing the minimum-risk decision rule, each having its own minor advantages. The fundamental rule is to decide ω_1 if R(α_1|x) < R(α_2|x). In terms of the posterior probabilities, we decide ω_1 if

(λ_21 - λ_11) P(ω_1|x) > (λ_12 - λ_22) P(ω_2|x). (15)

Note that if more than one action minimizes R(α|x), it does not matter which of these actions is taken, and any convenient tie-breaking rule can be used. Ordinarily, the loss incurred for making an error is greater than the loss incurred for being correct, and both of the factors λ_21 - λ_11 and λ_12 - λ_22 are positive.
Thus in practice, our decision is generally determined by the more likely state of nature, although we must scale the posterior probabilities by the loss differences. By employing Bayes' formula, we can replace the posterior probabilities by the prior probabilities and the conditional densities. This results in the equivalent rule: decide ω_1 if

(λ_21 - λ_11) p(x|ω_1) P(ω_1) > (λ_12 - λ_22) p(x|ω_2) P(ω_2), (16)

and ω_2 otherwise. Another alternative, which follows at once under the reasonable assumption that λ_21 > λ_11, is to decide ω_1 if

p(x|ω_1) / p(x|ω_2) > [(λ_12 - λ_22) / (λ_21 - λ_11)] · [P(ω_2) / P(ω_1)]. (17)

This form of the decision rule focuses on the x-dependence of the probability densities. We can consider p(x|ω_j) a function of ω_j (i.e., the likelihood function), and then form the likelihood ratio p(x|ω_1)/p(x|ω_2). Thus the Bayes decision rule can be interpreted as calling for deciding ω_1 if the likelihood ratio exceeds a threshold value that is independent of the observation x.

2.3 Minimum-Error-Rate Classification

In classification problems, each state of nature is usually associated with a different one of the c classes, and the action α_i is usually interpreted as the decision that the true state of nature is ω_i. If action α_i is taken and the true state of nature is ω_j, then the decision is correct if i = j, and in error if i ≠ j. If errors are to be avoided, it is natural to seek a decision rule that minimizes the probability of error, i.e., the error rate. The loss function of interest for this case is hence the so-called symmetrical or zero-one loss function,

λ(α_i|ω_j) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, ..., c. (18)

This loss function assigns no loss to a correct decision, and assigns a unit loss to any error; thus, all errors are equally costly.
The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk is

R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x) = Σ_{j≠i} P(ω_j|x) = 1 - P(ω_i|x), (19)

where P(ω_i|x) is the conditional probability that action α_i is correct. We note that other loss functions, such as quadratic and linear-difference, find greater use in regression tasks, where there is a natural ordering on the predictions and we can meaningfully penalize predictions that are "more wrong" than others.

The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk. Thus, to minimize the average probability of error, we should select the i that maximizes the posterior probability P(ω_i|x). In other words, for minimum error rate:

Decide ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i. (20)

This is the same rule as in Eq. 6. We saw in Fig. 2.2 some class-conditional probability densities and the posterior probabilities; Fig. 2.3 shows the likelihood ratio p(x|ω_1)/p(x|ω_2) for the same case. In general, this ratio can range between zero and infinity. The threshold value θ_a marked is from the same prior probabilities but with a zero-one loss function. Notice that this leads to the same decision boundaries as in Fig. 2.2, as it must. If we penalize mistakes in classifying ω_2 patterns as ω_1 more than the converse (i.e., λ_12 > λ_21), then Eq. 17 leads to the threshold θ_b marked. Note that the range of x values for which we classify a pattern as ω_1 gets smaller, as it should.

Figure 2.3: The likelihood ratio p(x|ω_1)/p(x|ω_2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θ_a. If our loss function penalizes miscategorizing ω_2 as ω_1 patterns more than the converse (i.e., λ_12 > λ_21), we get the larger threshold θ_b, and hence R_1 becomes smaller.
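The rules of Eqs. 13-17 can be sketched in a few lines; this is a minimal numeric illustration in which the loss matrix, priors, and Gaussian class-conditional densities are all invented, not taken from the text:

```python
# Sketch of the minimum-risk rule (Eq. 13) and its likelihood-ratio
# form (Eq. 17) for two categories. All numbers are invented.
import math

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# lam[i][j] = loss for deciding omega_{i+1} when the truth is omega_{j+1}
lam = [[0.0, 2.0],
       [1.0, 0.0]]
priors = [0.6, 0.4]                               # P(omega_1), P(omega_2)
likelihoods = [lambda x: norm_pdf(x, 0.0, 1.0),   # p(x|omega_1)
               lambda x: norm_pdf(x, 2.0, 1.0)]   # p(x|omega_2)

def posteriors(x):
    joint = [likelihoods[j](x) * priors[j] for j in range(2)]
    evidence = sum(joint)
    return [p / evidence for p in joint]

def conditional_risk(i, x):
    """R(alpha_i|x) = sum_j lam(alpha_i|omega_j) P(omega_j|x)  (Eq. 13)."""
    post = posteriors(x)
    return sum(lam[i][j] * post[j] for j in range(2))

def decide(x):
    """Choose the action (1 or 2) with minimum conditional risk."""
    risks = [conditional_risk(i, x) for i in range(2)]
    return 1 if risks[0] <= risks[1] else 2

# Equivalently (Eq. 17): decide omega_1 when the likelihood ratio exceeds
# the fixed threshold [(lam12-lam22)/(lam21-lam11)] * P(omega_2)/P(omega_1).
threshold = (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * priors[1] / priors[0]
```

Both forms give the same decision: a pattern near the ω_1 mean is assigned action α_1, and one near the ω_2 mean is assigned α_2.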
2.3.1 *Minimax Criterion

Sometimes we must design our classifier to perform well over a range of prior probabilities. For instance, in our fish categorization problem we can imagine that whereas the physical properties of lightness and width of each type of fish remain constant, the prior probabilities might vary widely and in an unpredictable way; or alternatively we might want to use the classifier in a different plant where we do not know the prior probabilities. A reasonable approach is then to design our classifier so that the worst overall risk for any value of the priors is as small as possible -- that is, minimize the maximum possible overall risk.

In order to understand this, we let R_1 denote the (as yet unknown) region in feature space where the classifier decides ω_1, and likewise for R_2 and ω_2, and then write our overall risk of Eq. 12 in terms of conditional risks:

R = ∫_{R_1} [λ_11 P(ω_1) p(x|ω_1) + λ_12 P(ω_2) p(x|ω_2)] dx
  + ∫_{R_2} [λ_21 P(ω_1) p(x|ω_1) + λ_22 P(ω_2) p(x|ω_2)] dx. (21)

We use the fact that P(ω_2) = 1 - P(ω_1) and that ∫_{R_1} p(x|ω_1) dx = 1 - ∫_{R_2} p(x|ω_1) dx to rewrite the risk as:

R(P(ω_1)) = λ_22 + (λ_12 - λ_22) ∫_{R_1} p(x|ω_2) dx
  + P(ω_1) [ (λ_11 - λ_22) + (λ_21 - λ_11) ∫_{R_2} p(x|ω_1) dx - (λ_12 - λ_22) ∫_{R_1} p(x|ω_2) dx ]. (22)

This equation shows that once the decision boundary is set (i.e., R_1 and R_2 determined), the overall risk is linear in P(ω_1). If we can find a boundary such that the constant of proportionality is 0, then the risk is independent of priors. This is the minimax solution, and the minimax risk, R_mm, can be read from Eq. 22:

R_mm = λ_22 + (λ_12 - λ_22) ∫_{R_1} p(x|ω_2) dx
     = λ_11 + (λ_21 - λ_11) ∫_{R_2} p(x|ω_1) dx. (23)

Figure 2.4 illustrates the approach. Briefly stated, we search for the prior for which the Bayes risk is maximum; the corresponding decision boundary gives the minimax solution. The value of the minimax risk, R_mm, is hence equal to the worst Bayes risk.
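The search just described can be sketched for the simplest case: two unit-variance Gaussian classes under a zero-one loss, where the Bayes error is available in closed form. The class means and the grid over the prior are invented for illustration:

```python
# Sketch of the minimax search of Fig. 2.4: sweep the prior P(omega_1),
# compute the Bayes error for each value, and keep the prior where it peaks.
# Two unit-variance Gaussian classes; the means are invented.
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayes_error(p1, mu1=0.0, mu2=2.0):
    """Minimum (Bayes) error for prior P(omega_1) = p1 under zero-one loss."""
    if p1 <= 0.0 or p1 >= 1.0:
        return 0.0  # one class never occurs, so it is never misclassified
    # Optimal boundary x* where p(x|w1) P(w1) = p(x|w2) P(w2):
    x_star = (mu1 + mu2) / 2.0 + math.log(p1 / (1.0 - p1)) / (mu2 - mu1)
    return p1 * (1.0 - phi(x_star - mu1)) + (1.0 - p1) * phi(x_star - mu2)

# The prior whose Bayes error is largest gives the minimax boundary:
grid = [i / 1000.0 for i in range(1, 1000)]
p_minimax = max(grid, key=bayes_error)
r_mm = bayes_error(p_minimax)
```

For this symmetric pair of classes the worst Bayes error occurs at equal priors, so the minimax boundary sits midway between the means.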
In practice, finding the decision boundary for minimax risk may be difficult, particularly when the distributions are complicated. Nevertheless, in some cases the boundary can be determined analytically (Problem 3). The minimax criterion finds greater use in game theory than it does in traditional pattern recognition. In game theory, you have a hostile opponent who can be expected to take an action maximally detrimental to you. Thus it makes great sense for you to take an action (e.g., make a classification) for which your costs -- due to your opponent's subsequent actions -- are minimized.

Figure 2.4: The curve at the bottom shows the minimum (Bayes) error as a function of the prior probability P(ω_1) in a two-category classification problem of fixed distributions. For each value of the priors (e.g., P(ω_1) = 0.25) there is a corresponding optimal decision boundary and associated Bayes error rate. For any such (fixed) boundary, if the priors are then changed, the probability of error will change as a linear function of P(ω_1) (shown by the dashed line). The maximum such error will occur at an extreme value of the prior, here at P(ω_1) = 1. To minimize the maximum of such error, we should design our decision boundary for the maximum Bayes error (here P(ω_1) = 0.6), and thus the error will not change as a function of the prior, as shown by the solid red horizontal line.

2.3.2 *Neyman-Pearson Criterion

In some problems, we may wish to minimize the overall risk subject to a constraint; for instance, we might wish to minimize the total risk subject to the constraint ∫ R(α_i|x) dx < constant for some particular i. Such a constraint might arise when there is a fixed resource that accompanies one particular action α_i, or when we must not misclassify patterns from a particular state of nature ω_i at more than some limited frequency.
For instance, in our fish example, there might be some government regulation that we must not misclassify more than 1% of salmon as sea bass. We might then seek a decision rule that minimizes the chance of classifying a sea bass as a salmon subject to this condition. We generally satisfy such a Neyman-Pearson criterion by adjusting decision boundaries numerically. However, for Gaussian and some other distributions, Neyman-Pearson solutions can be found analytically (Problems 5 & 6). We shall have cause to mention Neyman-Pearson criteria again in Sect. 2.8.3 on operating characteristics.

2.4 Classifiers, Discriminant Functions and Decision Surfaces

2.4.1 The Multi-Category Case

There are many different ways to represent pattern classifiers. One of the most useful is in terms of a set of discriminant functions g_i(x), i = 1, ..., c. The classifier is said to assign a feature vector x to class ω_i if

g_i(x) > g_j(x) for all j ≠ i. (24)

Thus, the classifier is viewed as a network or machine that computes c discriminant functions and selects the category corresponding to the largest discriminant. A network representation of a classifier is illustrated in Fig. 2.5.

Figure 2.5: The functional structure of a general statistical pattern classifier, which includes d inputs and c discriminant functions g_i(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident.

A Bayes classifier is easily and naturally represented in this way. Estimation of the discriminability (from an arbitrary x*) allows us to calculate the Bayes error rate -- the most important property of any
classifier. If the actual error rate differs from the Bayes rate inferred in this way, we should alter the threshold x* accordingly.

It is a simple matter to generalize the above discussion and apply it to two categories having arbitrary multidimensional distributions, Gaussian or not. Suppose we have two distributions p(x|ω_1) and p(x|ω_2) which overlap, and thus have non-zero Bayes classification error. Just as we saw above, any pattern actually from ω_2 could be properly classified as ω_2 (a "hit") or misclassified as ω_1 (a "false alarm"). Unlike the one-dimensional case above, however, there may be many decision boundaries that give a particular hit rate, each with a different false alarm rate. Clearly here we cannot determine a fundamental measure of discriminability without knowing more about the underlying decision rule than just the hit and false alarm rates.

In a rarely attainable ideal, we can imagine that our measured hit and false alarm rates are optimal, for example that of all the decision rules giving the measured hit rate, the rule actually used is the one having the minimum false alarm rate. If we constructed a multidimensional classifier -- regardless of the distributions used -- we might try to characterize the problem in this way, though it would probably require great computational resources to search for such optimal hit and false alarm rates. In practice, instead we eschew optimality, and simply vary a single parameter controlling the decision rule and plot the resulting hit and false alarm rates -- a curve called merely an operating characteristic. Such a control parameter might be the bias or nonlinearity in a discriminant function. It is traditional to choose a control parameter that can yield, at extreme values, either a vanishing false alarm rate or a vanishing hit rate, just as can be achieved with a very large or a very small x* in an ROC curve.
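Sweeping such a control parameter can be sketched directly. Here the single parameter is a scalar threshold x*, and the two Gaussian class-conditional densities are invented for illustration:

```python
# Sketch: sweep a decision threshold x* and record (false-alarm, hit) pairs,
# tracing an operating characteristic for two Gaussian classes.
# All parameters are invented.
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def roc_points(mu1=0.0, mu2=1.5, sigma=1.0, n=21):
    """Hit rate P(x > x* | omega_2) vs false-alarm rate P(x > x* | omega_1)."""
    points = []
    for k in range(n):
        x_star = -4.0 + 8.0 * k / (n - 1)        # threshold swept from -4 to 4
        false_alarm = 1.0 - phi((x_star - mu1) / sigma)
        hit = 1.0 - phi((x_star - mu2) / sigma)
        points.append((false_alarm, hit))
    return points

pts = roc_points()
# Extreme thresholds drive both rates toward 1 or toward 0, while in
# between the hit rate stays above the false-alarm rate.
```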
We should note that since the distributions can be arbitrary, the operating characteristic need not be symmetric (Fig. 2.21); in rare cases it need not even be concave down at all points.

Figure 2.21: In a general operating characteristic curve, the abscissa is the probability of false alarm, P(x ∈ R_2 | x ∈ ω_1), and the ordinate the probability of hit, P(x ∈ R_2 | x ∈ ω_2). As illustrated here, operating characteristic curves are generally not symmetric, as shown at the right.

Classifier operating curves are of value for problems where the loss matrix λ_ij might be changed. If the operating characteristic has been determined as a function of the control parameter ahead of time, it is a simple matter, when faced with a new loss function, to deduce the control parameter setting that will minimize the expected risk (Problem 38).

2.9 Bayes Decision Theory -- Discrete Features

Until now we have assumed that the feature vector x could be any point in a d-dimensional Euclidean space, R^d. However, in many practical applications the components of x are binary-, ternary-, or higher integer valued, so that x can assume only one of m discrete values v_1, ..., v_m. In such cases, the probability density function p(x|ω_j) becomes singular; integrals of the form

∫ p(x|ω_j) dx (77)

must then be replaced by corresponding sums, such as

Σ_x P(x|ω_j), (78)

where we understand that the summation is over all values of x in the discrete distribution. Bayes' formula then involves probabilities, rather than probability densities:

P(ω_j|x) = P(x|ω_j) P(ω_j) / P(x), (79)

where

P(x) = Σ_{j=1}^{c} P(x|ω_j) P(ω_j). (80)

The definition of the conditional risk R(α|x) is unchanged, and the fundamental Bayes decision rule remains the same: To minimize the overall risk, select the action α_i for which R(α_i|x) is minimum, or stated formally,

α* = arg min_i R(α_i|x). (81)
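Equations 79-80 can be sketched with, say, a two-bit feature vector; the probability tables below are invented for illustration:

```python
# Sketch of Bayes' rule with discrete features (Eqs. 79-80): probability
# tables replace densities, and sums replace integrals. Numbers invented.

# P(x | omega_j) for the four values of a two-bit feature vector x:
p_x_given = {
    "omega_1": {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1},
    "omega_2": {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4},
}
priors = {"omega_1": 0.5, "omega_2": 0.5}

def posterior(x):
    """P(omega_j|x) = P(x|omega_j)P(omega_j) / P(x), with P(x) a sum (Eq. 80)."""
    evidence = sum(p_x_given[w][x] * priors[w] for w in priors)
    return {w: p_x_given[w][x] * priors[w] / evidence for w in priors}

def decide(x):
    """Minimum-error-rate decision: pick the class of maximum posterior."""
    post = posterior(x)
    return max(post, key=post.get)
```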
The basic rule is that we must integrate (marginalize) the posterior probability over the bad features. Finally we use the Bayes decision rule on the resulting posterior probabilities, i.e., choose ωi if P(ωi|xg) > P(ωj|xg) for all j ≠ i. We shall consider the Expectation-Maximization (EM) algorithm in Chap. ??, which addresses a related problem involving missing features.

2.10.2 Noisy Features

It is a simple matter to generalize the results of Eq. 91 to the case where a particular feature has been corrupted by statistically independent noise. For instance, in our fish classification example, we might have a reliable measurement of the length, while variability of the light source might degrade the measurement of the lightness. We assume we have uncorrupted (good) features xg, as before, and a noise model, expressed as p(xb|xt). Here we let xt denote the true value of the observed xb features, i.e., without the noise present; that is, the xb are observed instead of the true xt. We assume that if xt were known, xb would be independent of ωi and xg. From such an assumption we get:

P(ωi|xg, xb) = ∫ p(ωi, xg, xb, xt) dxt / p(xg, xb).    (92)

Of course, to tell the classifier that a feature value is missing, the feature extractor must be designed to provide more than just a numerical value for each feature.

Now p(ωi, xg, xb, xt) = P(ωi|xg, xb, xt) p(xg, xb, xt), but by our independence assumption, if we know xt, then xb does not provide any additional information about ωi. Thus we have P(ωi|xg, xb, xt) = P(ωi|xg, xt). Similarly, we have p(xg, xb, xt) = p(xb|xg, xt) p(xg, xt), and p(xb|xg, xt) = p(xb|xt). We put these together and thereby obtain

P(ωi|xg, xb) = ∫ P(ωi|xg, xt) p(xg, xt) p(xb|xt) dxt / ∫ p(xg, xt) p(xb|xt) dxt
             = ∫ gi(x) p(x) p(xb|xt) dxt / ∫ p(x) p(xb|xt) dxt,    (93)

which we use as discriminant functions for classification in the manner dictated by Bayes.
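The missing-feature recipe can be sketched on a discrete grid. The joint-probability numbers below are hypothetical; the point is only the mechanics of summing the joint over the unobserved feature before applying the Bayes rule.

```python
# Sketch (hypothetical discrete joint): with good features x_g observed and a
# bad feature x_b unobserved, form P(omega_i | x_g) by summing the joint over
# all values of the missing feature, then pick the maximum posterior.
def posterior_good_only(joint):
    """joint[i][b] = p(omega_i, x_g, x_b = v_b); returns P(omega_i | x_g)."""
    marg = [sum(row) for row in joint]          # marginalize out x_b
    total = sum(marg)                           # p(x_g)
    return [m / total for m in marg]

joint = [[0.10, 0.25],    # class omega_1, two possible x_b values
         [0.05, 0.10]]    # class omega_2
post = posterior_good_only(joint)
assert post[0] > post[1]  # choose omega_1
```

A noisy (rather than missing) feature would replace the plain sum with a sum weighted by the noise model p(xb|xt), in the spirit of Eq. 93.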
Equation 93 differs from Eq. 91 solely in that the integral is weighted by the noise model. In the extreme case where p(xb|xt) is uniform over the entire space (and hence provides no predictive information for categorization), the equation reduces to the case of missing features -- a satisfying result.

2.11 Compound Bayesian Decision Theory and Context

Let us reconsider our introductory example of designing a classifier to sort two types of fish. Our original assumption was that the sequence of types of fish was so unpredictable that the state of nature looked like a random variable. Without abandoning this attitude, let us consider the possibility that the consecutive states of nature might not be statistically independent. We should be able to exploit such statistical dependence to gain improved performance. This is one example of the use of context to aid decision making.

The way in which we exploit such context information is somewhat different when we can wait for n fish to emerge and then make all n decisions jointly than when we must decide as each fish emerges. The first problem is a compound decision problem, and the second is a sequential compound decision problem. The former case is conceptually simpler, and is the one we shall examine here.

To state the general problem, let ω = (ω(1), ..., ω(n))t be a vector denoting the n states of nature, with ω(i) taking on one of the c values ω1, ..., ωc. Let P(ω) be the prior probability for the n states of nature. Let X = (x1, ..., xn) be a matrix giving the n observed feature vectors, with xi being the feature vector obtained when the state of nature was ω(i). Finally, let p(X|ω) be the conditional probability density function for X given the true set of states of nature ω. Using this notation we see that the posterior probability of ω is given by

P(ω|X) = p(X|ω) P(ω) / p(X) = p(X|ω) P(ω) / Σ_ω p(X|ω) P(ω).    (94)
In general, one can define a loss matrix for the compound decision problem and seek a decision rule that minimizes the compound risk. The development of this theory parallels our discussion for the simple decision problem, and concludes that the optimal procedure is to minimize the compound conditional risk. In particular, if there is no loss for being correct, and if all errors are equally costly, then the procedure reduces to computing P(ω|X) for all ω and selecting the ω for which this posterior probability is maximum.

While this provides the theoretical solution, in practice the computation of P(ω|X) can easily prove to be an enormous task. If each component ω(i) can have one of c values, there are c^n possible values of ω to consider. Some simplification can be obtained if the distribution of the feature vector xi depends only on the corresponding state of nature ω(i), not on the values of the other feature vectors or the other states of nature. In this case the joint density p(X|ω) is merely the product of the component densities p(xi|ω(i)):

p(X|ω) = Π_{i=1}^{n} p(xi|ω(i)).    (95)

While this simplifies the problem of computing p(X|ω), there is still the problem of computing the prior probabilities P(ω). This joint probability is central to the compound Bayes decision problem, since it reflects the interdependence of the states of nature. Thus it is unacceptable to simplify the problem of calculating P(ω) by assuming that the states of nature are independent. In addition, practical applications usually require some method of avoiding the computation of P(ω|X) for all c^n possible values of ω. We shall find some solutions to this problem in Chap. ??.

Summary

The basic ideas underlying Bayes decision theory are very simple. To minimize the overall risk, one should always choose the action that minimizes the conditional risk R(α|x).
In particular, to minimize the probability of error in a classification problem, one should always choose the state of nature that maximizes the posterior probability P(ωj|x). Bayes' formula allows us to calculate such probabilities from the prior probabilities P(ωj) and the conditional densities p(x|ωj). If there are different penalties for misclassifying patterns from ωi as if from ωj, the posteriors must first be weighted according to such penalties before taking action.

If the underlying distributions are multivariate Gaussian, the decision boundaries will be hyperquadrics, whose form and position depend upon the prior probabilities, means and covariances of the distributions in question. The true expected error can be bounded above by the Chernoff bound and the computationally simpler Bhattacharyya bound. If an input (test) pattern has missing or corrupted features, we should form the marginal distributions by integrating over such features, and then use the Bayes decision procedure on the resulting distributions. Receiver operating characteristic curves describe the inherent and unchangeable properties of a classifier and can be used, for example, to determine the Bayes rate.

For many pattern classification applications, the chief problem in applying these results is that the conditional densities p(x|ωj) are not known. In some cases we may know the form these densities assume, but may not know characterizing parameter values. The classic case occurs when the densities are known to be, or can be assumed to be, multivariate normal, but the values of the mean vectors and the covariance matrices are not known. More commonly even less is known about the conditional densities, and procedures that are less sensitive to specific assumptions about the densities must be used. Most of the remainder of this book will be devoted to various procedures that have been developed to attack such problems.
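The core rule of this summary can be sketched with made-up numbers: compute the posteriors via Bayes' formula, then choose the action of minimum conditional risk (with a zero-one loss this reduces to picking the maximum posterior).

```python
# Sketch (hypothetical likelihoods, priors, and loss matrix): the Bayes
# decision procedure for a two-category problem.
def posteriors(likelihoods, priors):
    """P(omega_j | x) from p(x | omega_j) and P(omega_j)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    px = sum(joint)                      # evidence
    return [j / px for j in joint]

def min_risk_action(post, loss):
    """Choose the action minimizing R(alpha_i | x) = sum_j loss[i][j] P(omega_j | x)."""
    risks = [sum(l_ij * p for l_ij, p in zip(row, post)) for row in loss]
    return min(range(len(risks)), key=lambda i: risks[i])

post = posteriors([0.6, 0.1], [0.5, 0.5])   # P(omega_1 | x) = 6/7
zero_one = [[0, 1], [1, 0]]                 # zero-one loss
assert min_risk_action(post, zero_one) == 0 # picks the larger posterior
```

Replacing `zero_one` with an asymmetric loss matrix shifts the decision exactly as the weighting of posteriors described above.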
Bibliographical and Historical Remarks

The power, coherence and elegance of Bayesian theory in pattern recognition make it among the most beautiful formalisms in science. Its foundations go back to Bayes himself, of course [3], but he stated his theorem (Eq. 1) for the case of uniform priors. It was Laplace [25] who first stated it for the more general (but discrete) case. There are several modern and clear descriptions of the ideas -- in pattern recognition and general decision theory -- that can be recommended [7, 6, 26, 15, 13, 20, 27]. Since Bayesian theory rests on an axiomatic foundation, it is guaranteed to have quantitative coherence; some other classification methods do not. Wald presents a non-Bayesian perspective on these topics that can be highly recommended [36], and the philosophical foundations of Bayesian and non-Bayesian methods are explored in [16]. Neyman and Pearson provided some of the most important pioneering work in hypothesis testing, and used the probability of error as the criterion [28]; Wald extended this work by introducing the notions of loss and risk [35].

Certain conceptual problems have always attended the use of loss functions and prior probabilities. In fact, the Bayesian approach is avoided by many statisticians, partly because there are problems for which a decision is made only once, and partly because there may be no reasonable way to determine the prior probabilities. Neither of these difficulties seems to present a serious drawback in typical pattern recognition applications: for nearly all critical pattern recognition problems we will have training data, and we will use our recognizer more than once. For these reasons, the Bayesian approach will continue to be of great use in pattern recognition.
The single most important drawback of the Bayesian approach is its assumption that the true probability distributions for the problem can be represented by the classifier, for instance that the true distributions are Gaussian and all that is unknown are the parameters describing these Gaussians. This is a strong assumption that is not always fulfilled, and we shall later consider other approaches that do not have this requirement.

Chow [10] was among the earliest to use Bayesian decision theory for pattern recognition, and he later established fundamental relations between error and reject rate [11]. Error rates for Gaussians have been explored by [18], and the Chernoff and Bhattacharyya bounds were first presented in [9] and [8], respectively, and are explored in a number of statistics texts, such as [17]. Computational approximations for bounding integrals for the Bayesian probability of error (the source for one of the homework problems) appear in [2]. Neyman and Pearson also worked on classification given constraints [28], and the analysis of minimax estimators for multivariate normals is presented in [5, 4, 14]. Signal detection theory and receiver operating characteristics are fully explored in [21]; a brief overview, targeting experimental psychologists, is [34]. Our discussion of the missing feature problem follows closely the work of [1], while the definitive book on missing features, including a great deal beyond our discussion here, is [30]. Entropy was the central concept in the foundation of information theory [31], and the relation of Gaussians to entropy is explored in [33]. Readers requiring a review of information theory [12], linear algebra [24, 23], calculus and continuous mathematics [38, 32], probability [29], or the calculus of variations and Lagrange multipliers [19] should consult these texts and those listed in our Appendix.

Problems

Section 2.1

1.
In the two-category case, under the Bayes decision rule the conditional error is given by Eq. 7. Even if the posterior densities are continuous, this form of the conditional error virtually always leads to a discontinuous integrand when calculating the full error by Eq. 5.

(a) Show that for arbitrary densities, we can replace Eq. 7 by P(error|x) = 2P(ω1|x)P(ω2|x) in the integral and get an upper bound on the full error.

(b) Show that if we use P(error|x) = αP(ω1|x)P(ω2|x) for α < 2, then we are not guaranteed that the integral gives an upper bound on the error.

(c) Analogously, show that we can use instead P(error|x) = P(ω1|x)P(ω2|x) and get a lower bound on the full error.

(d) Show that if we use P(error|x) = βP(ω1|x)P(ω2|x) for β > 1, then we are not guaranteed that the integral gives a lower bound on the error.

Section 2.2

2. …

… is the best, even among our model set. We shall return to the problem of choosing among candidate models in Chap. ??.

3.3 Bayesian Estimation

We now consider the Bayesian estimation or Bayesian learning approach to pattern classification problems. Although the answers we get by this method will generally be nearly identical to those obtained by maximum likelihood, there is a conceptual difference: whereas in maximum likelihood methods we view the true parameter vector we seek, θ, to be fixed, in Bayesian learning we consider θ to be a random variable, and training data allow us to convert a distribution on this variable into a posterior probability density.

CHAPTER 3. MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION

3.3.1 The Class-Conditional Densities

The computation of the posterior probabilities P(ωi|x) lies at the heart of Bayesian classification. Bayes' formula allows us to compute these probabilities from the prior probabilities P(ωi) and the class-conditional densities p(x|ωi), but how can we proceed when these quantities are unknown?
The general answer to this question is that the best we can do is to compute P(ωi|x) using all of the information at our disposal. Part of this information might be prior knowledge, such as knowledge of the functional forms for unknown densities and ranges for the values of unknown parameters. Part of this information might reside in a set of training samples. If we again let D denote the set of samples, then we can emphasize the role of the samples by saying that our goal is to compute the posterior probabilities P(ωi|x, D). From these probabilities we can obtain the Bayes classifier.

Given the sample D, Bayes' formula then becomes

P(ωi|x, D) = p(x|ωi, D) P(ωi|D) / Σ_{j=1}^{c} p(x|ωj, D) P(ωj|D).    (23)

As this equation suggests, we can use the information provided by the training samples to help determine both the class-conditional densities and the a priori probabilities. Although we could maintain this generality, we shall henceforth assume that the true values of the a priori probabilities are known or obtainable from a trivial calculation; thus we substitute P(ωi) = P(ωi|D). Furthermore, since we are treating the supervised case, we can separate the training samples by class into c subsets D1, ..., Dc, with the samples in Di belonging to ωi. As we mentioned when addressing maximum likelihood methods, in most cases of interest (and in all of the cases we shall consider), the samples in Di have no influence on p(x|ωj, D) if i ≠ j. This has two simplifying consequences. First, it allows us to work with each class separately, using only the samples in Di to determine p(x|ωi, D). Used in conjunction with our assumption that the prior probabilities are known, this allows us to write Eq. 23 as

P(ωi|x, D) = p(x|ωi, Di) P(ωi) / Σ_{j=1}^{c} p(x|ωj, Dj) P(ωj).    (24)

Second, because each class can be treated independently, we can dispense with needless class distinctions and simplify our notation.
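Eq. 24 can be sketched with hypothetical per-class density estimates: each class is handled separately (only Di informs p(x|ωi, Di)), and the posteriors are then assembled with the known priors.

```python
# Sketch of Eq. 24 (the density estimates below are made up): per-class
# densities learned from each class's own samples, combined with known priors.
def posterior(x, density_estimates, priors):
    """density_estimates[i] plays the role of p(x | omega_i, D_i)."""
    vals = [f(x) * p for f, p in zip(density_estimates, priors)]
    z = sum(vals)
    return [v / z for v in vals]

# Two made-up triangular density estimates peaked at 3 and 7:
est = [lambda x: max(0.0, 0.2 - 0.04 * abs(x - 3.0)),
       lambda x: max(0.0, 0.2 - 0.04 * abs(x - 7.0))]
post = posterior(4.0, est, [0.5, 0.5])
assert post[0] > post[1]          # x = 4 is closer to the class-1 peak
```

The structure mirrors the text: the training data enter only through the per-class density estimates, while the priors are taken as known.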
In essence, we have c separate problems of the following form: use a set D of samples drawn independently according to the fixed but unknown probability distribution p(x) to determine p(x|D). This is the central problem of Bayesian learning.

3.3.2 The Parameter Distribution

Although the desired probability density p(x) is unknown, we assume that it has a known parametric form. The only thing assumed unknown is the value of a parameter vector θ. We shall express the fact that p(x) is unknown but has known parametric form by saying that the function p(x|θ) is completely known. Any information we might have about θ prior to observing the samples is assumed to be contained in a known prior density p(θ). Observation of the samples converts this to a posterior density p(θ|D), which, we hope, is sharply peaked about the true value of θ. Note that we are changing our supervised learning problem into an unsupervised density estimation problem. To this end, our basic goal is to compute p(x|D) …

… To indicate explicitly the number of samples in a set for a given category, we shall write Dⁿ = {x1, ..., xn}. Then from Eq. 52, if n > 1,

p(Dⁿ|θ) = p(xn|θ) p(Dⁿ⁻¹|θ).    (53)

Substituting this in Eq. 51 and using Bayes' formula, we see that the posterior density satisfies the recursion relation

p(θ|Dⁿ) = p(xn|θ) p(θ|Dⁿ⁻¹) / ∫ p(xn|θ) p(θ|Dⁿ⁻¹) dθ.    (54)

With the understanding that p(θ|D⁰) = p(θ), repeated use of this equation produces the sequence of densities p(θ), p(θ|x1), p(θ|x1, x2), and so forth. (It should be obvious from Eq. 54 that p(θ|Dⁿ) depends only on the points in Dⁿ, not the sequence in which they were selected.) This is called the recursive Bayes approach to parameter estimation. This is also our first example of an incremental or on-line learning method, where learning goes on as the data is collected. When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have what is called Bayesian learning (Example 1).
We shall come across many other, non-incremental learning schemes, where all the training data must be present before learning can take place. In principle, Eq. 54 requires that we preserve all the training points in Dⁿ⁻¹ in order to calculate p(θ|Dⁿ), but for some distributions just a few parameters associated with p(θ|Dⁿ⁻¹) contain all the information needed. Such parameters are the sufficient statistics of those distributions, as we shall see in Sect. 3.6. Some authors reserve the term recursive learning to apply only to those cases where the sufficient statistics -- not the training data -- are retained when incorporating the information from a new training point. We could call this more restrictive usage true recursive Bayes learning.

Example 1: Recursive Bayes learning

Suppose we believe our one-dimensional samples come from a uniform distribution

p(x|θ) ~ U(0, θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise,

but initially we know only that our parameter is bounded. In particular we assume 0 < θ ≤ 10 (a non-informative or "flat" prior we shall discuss in Sect. 3.5.2). We will use recursive Bayes methods to estimate θ and the underlying densities from the data D = {4, 7, 2, 8}, which were selected randomly from the underlying distribution.

Before any data arrive, then, we have p(θ|D⁰) = p(θ) = U(0, 10). When our first data point x1 = 4 arrives, we use Eq. 54 to get an improved estimate:

p(θ|D¹) ∝ p(x|θ) p(θ|D⁰) = 1/θ for 4 ≤ θ ≤ 10, and 0 otherwise,

where throughout we will ignore the normalization. When the next data point x2 = 7 arrives, we have

p(θ|D²) ∝ p(x|θ) p(θ|D¹) = 1/θ² for 7 ≤ θ ≤ 10, and 0 otherwise,

and similarly for the remaining sample points. It should be clear that since each successive step introduces a factor of 1/θ into p(x|θ), and the distribution is nonzero only for θ values above the largest data point sampled, the general form of our solution is p(θ|Dⁿ) ∝ 1/θⁿ for max[Dⁿ] ≤ θ ≤ 10, as shown in the figure.
Figure: The posterior p(θ|Dⁿ) for the model and n points in the data set in this Example. The posterior begins p(θ) ∝ U(0, 10), and as more points are incorporated it becomes increasingly peaked at the value of the highest data point.

Given our full data set, the maximum likelihood solution here is clearly θ̂ = 8, and this implies a uniform p(x|D) ~ U(0, 8). According to our Bayesian methodology, which requires the integration in Eq. 50, the density is uniform up to x = 8 but has a tail at higher values -- an indication that the influence of our prior p(θ) has not yet been swamped by the information in the training data.

Figure: Given the full set of four points, the distribution based on the maximum likelihood solution is p(x|θ̂) ~ U(0, 8), whereas the distribution derived from Bayesian methods has a small tail above x = 8, reflecting the prior information that values of x near 10 are possible.

Whereas the maximum likelihood approach estimates a point in θ space, the Bayesian approach instead estimates a distribution. Technically speaking, then, we cannot directly compare these estimates. It is only when the second stage of inference is done -- that is, when we compute the distributions p(x|D), as shown in the above figure -- that the comparison is fair.

For most of the typically encountered probability densities p(x|θ), the sequence of posterior densities does indeed converge to a delta function. Roughly speaking, this implies that with a large number of samples there is only one value for θ that causes p(x|θ) to fit the data, i.e., that θ can be determined uniquely from p(x|θ). When this is the case, p(x|θ) is said to be identifiable. A rigorous proof of convergence under these conditions requires a precise statement of the properties required of p(x|θ) and p(θ) and considerable care, but presents no serious difficulties (Problem 21).
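The recursion of Example 1 can be run numerically on a discrete grid over θ. This is an illustrative sketch: the grid discretization is an assumption of the sketch, not part of the example.

```python
# Sketch of Example 1 on a theta grid: recursive Bayes updates (Eq. 54) for
# p(x|theta) = U(0, theta) with the flat prior U(0, 10) and data {4, 7, 2, 8}.
def update(posterior, thetas, x):
    """One step of Eq. 54: multiply by p(x|theta), then renormalize."""
    new = [p * (1.0 / t if 0.0 <= x <= t else 0.0)
           for p, t in zip(posterior, thetas)]
    z = sum(new)
    return [v / z for v in new]

thetas = [0.01 * k for k in range(1, 1001)]      # grid over (0, 10]
post = [1.0 / len(thetas)] * len(thetas)         # flat prior
for x in [4, 7, 2, 8]:
    post = update(post, thetas, x)

# The posterior vanishes below the largest sample and falls off as 1/theta^4
# above it, so it peaks right at theta = 8, matching the figure.
mode = thetas[max(range(len(thetas)), key=lambda i: post[i])]
assert abs(mode - 8.0) < 0.05
```

Note the order-independence remarked after Eq. 54: permuting the data list leaves the final posterior unchanged.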
There are occasions, however, when more than one value of θ may yield the same value for p(x|θ). In such cases, θ cannot be determined uniquely from p(x|θ), and p(θ|Dⁿ) will peak near all of the values of θ that explain the data. Fortunately, this ambiguity is erased by the integration in Eq. 26, since p(x|θ) is the same for all of these values of θ. Thus, p(x|Dⁿ) will typically converge to p(x) whether or not p(x|θ) is identifiable. While this might make the problem of identifiability appear to be moot, we shall see in Chap. ?? that identifiability presents a genuine problem in the case of unsupervised learning.

3.5.1 When do Maximum Likelihood and Bayes methods differ?

In virtually every case, maximum likelihood and Bayes solutions are equivalent in the asymptotic limit of infinite training data. However, since practical pattern recognition problems invariably have a limited set of training data, it is natural to ask when maximum likelihood and Bayes solutions may be expected to differ, and then which we should prefer. There are several criteria that will influence our choice.

One is computational complexity (Sec. 3.7.2), and here maximum likelihood methods are often to be preferred since they require merely differential calculus techniques or gradient search for θ̂, rather than the possibly complex multidimensional integration needed in Bayesian estimation. This leads to another consideration: interpretability. In many cases the maximum likelihood solution will be easier to interpret and understand since it returns the single best model from the set the designer provided (and presumably understands). In contrast, Bayesian methods give a weighted average of models (parameters), often leading to solutions more complicated and harder to understand than those provided by the designer. The Bayesian approach reflects the remaining uncertainty in the possible models.
Another consideration is our confidence in the prior information, such as in the form of the underlying distribution p(x|θ). A maximum likelihood solution p(x|θ̂) must of course be of the assumed parametric form; not so for the Bayesian solution. We saw this difference in Example 1, where the Bayes solution was not of the parametric form originally assumed, i.e., a uniform p(x|D). In general, through their use of the full p(θ|D) distribution, Bayesian methods use more of the information brought to the problem than do maximum likelihood methods. (For instance, in Example 1 the addition of the third training point did not change the maximum likelihood solution, but did refine the Bayesian estimate.) If such information is reliable, Bayes methods can be expected to give better results. Further, general Bayesian methods with a "flat" or uniform prior (i.e., where no prior information is explicitly imposed) are equivalent to maximum likelihood methods. If there is much data, leading to a strongly peaked p(θ|D), and the prior p(θ) is uniform or flat, then the MAP estimate is essentially the same as the maximum likelihood estimate. When p(θ|D) is broad, or asymmetric around θ̂, the methods are quite likely to yield p(x|D) distributions that differ from one another. Such a strong asymmetry (when not due to rare statistics …

… with respect to that invariance. It is tempting to assert that the use of non-informative priors is somehow "objective" and lets the data speak for themselves, but such a view is a bit naive. For example, we may seek a non-informative prior when estimating the standard deviation σ of a Gaussian. But this requirement might not lead to the non-informative prior for estimating the variance, σ². Which should we use? In fact, the greatest benefit of this approach is that it forces the designer to acknowledge and be clear about the assumed invariance -- the choice of which generally lies outside our methodology.
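The σ-versus-σ² point can be made concrete with a one-line change of variables: a prior that is flat in σ is not flat in the variance u = σ².

```latex
% If p_\sigma(\sigma) = c (uniform), then under the change of variables
% u = \sigma^2 the transformed prior is
p_u(u) \;=\; p_\sigma\!\left(\sqrt{u}\right)\left|\frac{d\sigma}{du}\right|
      \;=\; \frac{c}{2\sqrt{u}}, \qquad u = \sigma^2,
% which is not uniform in u: "non-informative" depends on the
% parameterization chosen.
```

Thus declaring a prior "non-informative" only has meaning relative to a chosen parameterization, which is exactly the invariance the designer must make explicit.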
It may be more difficult to accommodate such arbitrary transformations in a maximum a posteriori (MAP) estimator (Sec. 3.2.1), and hence considerations of invariance are of greatest use in Bayesian estimation, or when the posterior is very strongly peaked and the mode not influenced by transformations of the density (Problem 19).

3.6 *Sufficient Statistics

From a practical viewpoint, the formal solution provided by Eqs. 26, 51 & 52 is not computationally attractive. In pattern recognition applications it is not unusual to have dozens or hundreds of parameters and thousands of training samples, which makes the direct computation and tabulation of p(D|θ) or p(θ|D) quite out of the question. We shall see in Chap. ?? how neural network methods avoid many of the difficulties of setting such a large number of parameters in a classifier, but for now we note that the only hope for an analytic, computationally feasible maximum likelihood solution lies in being able to find a parametric form for p(x|θ) that on the one hand matches the characteristics of the problem and on the other hand allows a reasonably tractable solution.

Consider the simplification that occurred in the problem of learning the parameters of a multivariate Gaussian density. The basic data processing required was merely the computation of the sample mean and sample covariance. This easily computed and easily updated statistic contained all the information in the samples relevant to estimating the unknown population mean and covariance. One might suspect that this simplicity is just one more happy property of the normal distribution, and that such good fortune is not likely to occur in other cases. While this is largely true, there are distributions for which computationally feasible solutions can be obtained, and the key to their simplicity lies in the notion of a sufficient statistic. To begin with, any function of the samples is a statistic.
Roughly speaking, a sufficient statistic is a (possibly vector-valued) function s of the samples D that contains all of the information relevant to estimating some parameter θ. Intuitively, one might expect the definition of a sufficient statistic to involve the requirement that p(θ|s, D) = p(θ|s). However, this would require treating θ as a random variable, limiting the definition to a Bayesian domain. To avoid such a limitation, the conventional definition is as follows: A statistic s is said to be sufficient for θ if p(D|s, θ) is independent of θ. If we think of θ as a random variable, we can write

p(θ|s, D) = p(D|s, θ) p(θ|s) / p(D|s),    (56)

whereupon it becomes evident that p(θ|s, D) = p(θ|s) if s is sufficient for θ. Conversely, if s is a statistic for which p(θ|s, D) = p(θ|s), and if p(θ|s) ≠ 0, it is easy to show that p(D|s, θ) is independent of θ (Problem 27). Thus, the intuitive and the conventional definitions are basically equivalent. As one might expect, for a Gaussian distribution the sample mean and covariance, taken together, represent a sufficient statistic for the true mean and covariance; if these are known, all other statistics such as the mode, range, higher-order moments, number of data points, etc., are superfluous when estimating the true mean and covariance.

A fundamental theorem concerning sufficient statistics is the Factorization Theorem, which states that s is sufficient for θ if and only if p(D|θ) can be factored into the product p(D|θ) = g(s, θ) h(D) …

… ones for which the difference between the means is large relative to the standard deviations. However, no feature is useless if its means for the two classes differ. An obvious way to reduce the error rate further is to introduce new, independent features. Each new feature need not add much, but if r can be increased without limit, the probability of error can be made arbitrarily small.
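Returning to the Gaussian observation above, a minimal one-dimensional sketch shows the practical payoff of a sufficient statistic: the running sums (n, Σx, Σx²) can be updated sample by sample, and the sample mean and variance are recovered from them with no need to retain the individual samples. (The class and data are illustrative, not from the text.)

```python
# Sketch: for 1-D Gaussian data the triple (n, sum x, sum x^2) is a
# sufficient statistic -- the ML estimates of mean and variance depend on the
# data only through it.
class GaussianStats:
    def __init__(self):
        self.n, self.s1, self.s2 = 0, 0.0, 0.0
    def add(self, x):                 # incremental update of the statistic
        self.n += 1
        self.s1 += x
        self.s2 += x * x
    def mean(self):
        return self.s1 / self.n
    def var(self):                    # biased (ML) sample variance
        m = self.mean()
        return self.s2 / self.n - m * m

g = GaussianStats()
for x in [2.0, 4.0, 6.0]:
    g.add(x)
assert g.mean() == 4.0
```

This is exactly the "easily computed and easily updated" property noted for the sample mean and covariance, and the restricted sense in which some authors use "true recursive Bayes learning."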
In general, if the performance obtained with a given set of features is inadequate, it is natural to consider adding new features, particularly ones that will help separate the class pairs most frequently confused. Although increasing the number of features increases the cost and complexity of both the feature extractor and the classifier, it is often reasonable to believe that the performance will improve. After all, if the probabilistic structure of the problem were completely known, the Bayes risk could not possibly be increased by adding new features. At worst, the Bayes classifier would ignore the new features, but if the new features provide any additional information, the performance must improve (Fig. 3.3).

Unfortunately, it has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance. This apparent paradox presents a genuine and serious problem for classifier design.

Figure 3.3: Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace -- here, the two-dimensional x1-x2 subspace or a one-dimensional x1 subspace -- there can be greater overlap of the projected distributions, and hence greater Bayes errors.

The basic source of the difficulty can always be traced to the fact that we have the wrong model -- e.g., the Gaussian assumption or the conditional independence assumption is wrong -- or that the number of design or training samples is finite and thus the distributions are not estimated accurately. However, analysis of the problem is both challenging and subtle. Simple cases do not exhibit the experimentally observed phenomena, and more realistic cases are difficult to analyze. In an attempt to provide some rigor, we shall return to topics related to problems of dimensionality and sample size in Chap. ??.
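The point of Fig. 3.3 can be sketched with a toy data set (the coordinates below are made up): two classes perfectly separable in two dimensions can overlap completely after projection onto a single axis.

```python
# Sketch of the Fig. 3.3 idea with hypothetical 2-D points: classes separable
# by a threshold on x2, yet indistinguishable after projecting onto x1.
class_a = [(0.0, 1.0), (1.0, 1.2), (2.0, 0.9)]
class_b = [(0.0, -1.0), (1.0, -1.1), (2.0, -0.8)]

# In 2-D, the threshold x2 = 0 separates the classes perfectly:
assert all(p[1] > 0 for p in class_a) and all(p[1] < 0 for p in class_b)

# Projected onto x1 alone, the two samples coincide entirely:
proj_a = sorted(p[0] for p in class_a)
proj_b = sorted(p[0] for p in class_b)
assert proj_a == proj_b   # no threshold on x1 can separate them
```

Dropping a feature can therefore only increase (never decrease) the Bayes error, which is why, with the true distributions known, new informative features cannot hurt.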
3.7.2 Computational Complexity

We have mentioned that one consideration affecting our design methodology is that of computational difficulty, and here the technical notion of computational complexity can be useful. First, we will need to understand the notion of the order of a function f(x): we say that f(x) is "of the order of h(x)" -- written f(x) = O(h(x)) and generally read "big oh of h(x)" -- if there exist constants c0 and x0 such that |f(x)| ≤ c0 |h(x)| for all x > x0. This means simply that for sufficiently large x, an upper bound on the function grows no worse than h(x). For instance, suppose f(x) = a0 + a1 x + a2 x²; in that case we have f(x) = O(x²) because for sufficiently large x, the constant, linear and quadratic terms can be "overcome" by proper choice of c0 and x0. The generalization to functions of two or more variables is straightforward.

It should be clear that by the definition above, the big oh order of a function is not unique. For instance, we can describe our particular f(x) as being O(x²), O(x³), O(x⁴), or O(x² ln x). Because of the non-uniqueness of the big oh notation, we occasionally need to be more precise in describing the order of a function. We say that f(x) = Θ(h(x)), "big theta of h(x)," if there are constants x0, c1 and c2 such that for x > x0, f(x) always lies between c1 h(x) and c2 h(x). Thus our simple quadratic function above would obey f(x) = Θ(x²), but would not obey f(x) = Θ(x³). (A fuller explanation is provided in the Appendix.)

In describing the computational complexity of an algorithm we are generally interested in the number of basic mathematical operations, such as additions and multiplications … it is rarely necessary for us to determine these constants to find which of several implementations is the simplest. Nevertheless, big oh and big theta analyses, as just described, are generally the best way to describe the computational complexity of an algorithm.
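The big-oh definition can be checked numerically for the quadratic example. The particular coefficients and the constants c0, x0 below are one arbitrary valid choice (the definition only requires that some such constants exist).

```python
# Numeric illustration of the big-oh definition: for f(x) = a0 + a1 x + a2 x^2
# we exhibit constants c0 and x0 with |f(x)| <= c0 * x^2 for all x > x0.
def f(x):
    return 3.0 + 2.0 * x + 1.0 * x * x   # hypothetical coefficients

c0, x0 = 2.0, 4.0                        # one valid choice; not unique
for k in range(5, 1000):                 # check over a range of x > x0
    x = float(k)
    assert abs(f(x)) <= c0 * x * x
```

The same f would also pass with c0 chosen for x³ or x⁴ bounds, which is exactly the non-uniqueness of big-oh that motivates the tighter big-theta notation.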
Sometimes we stress space and time complexities, which are particularly relevant when contemplating parallel implementations. For instance, the sample mean of a category could be calculated with d separate processors, each adding n sample values. Thus we can describe this implementation as O(d) in space (i.e., the amount of memory or possibly the number of processors) and O(n) in time (i.e., the number of sequential steps). Of course for any particular algorithm there may be a number of time-space tradeoffs, for instance using a single processor many times, or using many processors in parallel for a shorter time. Such tradeoffs can be important in neural network implementations, as we shall see in Chap. ??.

A common qualitative distinction is made between polynomially complex and exponentially complex algorithms -- O(a^k) for some constant a and aspect or variable k of the problem. Exponential algorithms are generally so complex that for reasonable-size cases we avoid them altogether, and resign ourselves to approximate solutions that can be found by polynomially complex algorithms.

3.7.3 Overfitting

It frequently happens that the number of available samples is inadequate, and the question of how to proceed arises. One possibility is to reduce the dimensionality, either by redesigning the feature extractor, by selecting an appropriate subset of the existing features, or by combining the existing features in some way (Chap. ??). Another possibility is to assume that all c classes share the same covariance matrix, and to pool the available data. Yet another alternative is to look for a better estimate of the covariance Σ. If any reasonable a priori estimate Σ0 is available, a Bayesian or pseudo-Bayesian estimate of the form λΣ0 + (1 - λ)Σ̂ might be employed. If Σ0 is diagonal, this diminishes the troublesome effects of "accidental" correlations. Alternatively, one can remove chance correlations heuristically by thresholding the sample covariance matrix.
For example, one might assume that all covariances for which the magnitude of the correlation coefficient is not near unity are actually zero. An extreme of this approach is to assume statistical independence, thereby making all the off-diagonal elements zero regardless of empirical evidence to the contrary -- an O(nd) calculation. Even though such assumptions are almost surely incorrect, the resulting heuristic estimates sometimes provide better performance than the maximum likelihood estimate of the full set of parameters.

Here we have another apparent paradox. The classifier that results from assuming independence is almost certainly suboptimal. It is understandable that it will perform better if the features actually are independent, but how can it provide better performance when this assumption is untrue? The answer again involves the problem of insufficient data, and some insight into its nature can be gained from considering an analogous problem in curve fitting. Figure 3.4 shows a set of ten data points and two candidate curves for fitting them. The data points were obtained by adding zero-mean, independent noise to a parabola. Thus, of all the possible polynomials, presumably a parabola would provide the best fit, assuming that we are interested in fitting data obtained in the future as well as the points at hand. Even a straight line could fit the training data fairly well. The parabola provides a better fit, but one might wonder whether the data are adequate to fix the curve. The best parabola for a larger data set might be quite different, and over the interval shown the straight line could easily be superior. The tenth-degree polynomial fits the given data perfectly. However, we do not expect that a tenth-degree polynomial is required here. In general [...]

[Figure caption fragment from a later figure:] ... leads to an improved estimate, labelled by the iteration number i; here, after three iterations the algorithm has converged.
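The effect described above is easy to reproduce numerically. The following sketch (the sample size, dimension, and correlation threshold are illustrative choices, not from the text) compares the full sample covariance with the diagonal "independence" estimate when the true features really are independent; in this setting, zeroing the off-diagonal entries can only move the estimate closer to the true (diagonal) covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# n samples in d dimensions from a Gaussian with truly independent features;
# with n small relative to d, the sample covariance shows spurious correlations.
n, d = 20, 5
X = rng.standard_normal((n, d))

S = np.cov(X, rowvar=False)         # full sample-covariance estimate
S_diag = np.diag(np.diag(S))        # independence assumption: keep diagonal only

# Heuristic thresholding: zero covariances whose correlation coefficient is weak.
corr = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))
S_thresh = np.where(np.abs(corr) > 0.5, S, 0.0)
np.fill_diagonal(S_thresh, np.diag(S))

# The true covariance is the identity.  The diagonal estimate differs from the
# full one only by removing off-diagonal noise, so its error cannot be larger.
err_full = np.linalg.norm(S - np.eye(d))
err_diag = np.linalg.norm(S_diag - np.eye(d))
assert err_diag <= err_full
```

The guarantee here depends on the true covariance being diagonal; when features are genuinely correlated, the constrained estimate trades this noise reduction for bias, which is exactly the tension the text describes.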
We must be careful to note that the EM algorithm leads to the greatest log-likelihood of the good data, with the bad data marginalized. There may be particular values of the bad data that give a different solution and an even greater log-likelihood. For instance, in this Example, if the missing feature had value x41 = 2, so that x4 = (2, 4)^t, we would have a solution θ = (1.0, 2.0, 0.5, 2.0)^t and a log-likelihood for the full data (good plus bad) that is greater than for the good data alone. Such an optimization, however, is not the goal of the canonical EM algorithm. Note too that if no data is missing, the calculation of Q(θ; θ^i) is simple, since no integrals are involved.

Generalized Expectation-Maximization or GEM algorithms are a bit more lax than the EM algorithm, and require merely that an improved θ^(i+1) be set in the M step (line 5) of the algorithm -- not necessarily the optimal one. Naturally, convergence will not be as rapid as for a proper EM algorithm, but GEM algorithms afford greater freedom to choose computationally simpler steps. One version of GEM is to find the maximum likelihood value of the unknown features at each iteration step, then recalculate θ in light of these new values -- if indeed they lead to a greater likelihood.

In practice, the term Expectation-Maximization has come to mean loosely any iterative scheme in which the likelihood of some data increases with each step, even if such methods are not, technically speaking, the true EM algorithm as presented here.

3.9 Bayesian Belief Networks

The methods we have described up to now are fairly general -- all that we assumed, at base, was that we could parameterize the distributions by a vector θ. If we had prior information about the distribution of θ, this too could be used.
Sometimes our knowledge about a distribution is not directly of this type, but is instead about the statistical dependencies (or independencies) among the component features. Recall that for some multidimensional distribution p(x), if for two features we have p(xi, xj) = p(xi) p(xj), we say those variables are statistically independent (Fig. 3.6).

Figure 3.6: A three-dimensional distribution which obeys p(x1, x3) = p(x1) p(x3); thus here x1 and x3 are statistically independent, but the other feature pairs are not.

There are many cases where we know or can safely assume which variables are or are not independent, even without sampled data. Suppose for instance we are describing the state of an automobile -- the temperature of the engine, the pressures of the fluids and in the tires, the voltages in the wires, and so on. Our basic knowledge of cars includes the fact that the oil pressure in the engine and the air pressure in a tire are functionally unrelated, and hence can be safely assumed to be statistically independent. However the oil temperature and engine temperature are not independent (though they could be conditionally independent). Furthermore, we may know that several variables can influence another: the coolant temperature is affected by the engine temperature, the speed of the radiator fan (which blows air over the coolant-filled radiator), and so on.

We will represent these dependencies graphically, by means of Bayesian belief nets, also called causal networks, or simply belief nets. They take the topological form of a directed acyclic graph (DAG), where each link is directional and there are no loops. (More general networks permit such loops, however.) While such nets can represent continuous multidimensional distributions, they have enjoyed greatest application and success for discrete variables. For this reason, and because the formal properties are simpler, we shall concentrate on the discrete case.
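The independence relation p(x1, x3) = p(x1) p(x3) can be verified directly on a small discrete joint distribution. A minimal sketch, with an invented joint (the numbers are illustrative, not from the text) in which x1 and x3 are independent but x1 and x2 are not:

```python
import numpy as np

# Toy discrete joint p(x1, x2, x3) = p(x1) p(x2|x1) p(x3):
# x2 depends on x1; x3 depends on nothing, so x1 and x3 are independent.
p_x1 = np.array([0.3, 0.7])
p_x2_given_x1 = np.array([[0.9, 0.1],
                          [0.2, 0.8]])   # row i: distribution of x2 given x1 = i
p_x3 = np.array([0.4, 0.6])

p = p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_x3[None, None, :]

def marginal(p, keep):
    """Sum out all axes except those listed in `keep`."""
    axes = tuple(i for i in range(3) if i not in keep)
    return p.sum(axis=axes)

# x1 and x3 are statistically independent: p(x1, x3) = p(x1) p(x3)
p13 = marginal(p, (0, 2))
assert np.allclose(p13, np.outer(marginal(p, (0,)), marginal(p, (2,))))

# x1 and x2 are not: p(x1, x2) != p(x1) p(x2)
p12 = marginal(p, (0, 1))
assert not np.allclose(p12, np.outer(marginal(p, (0,)), marginal(p, (1,))))
```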
Figure 3.7: A belief network consists of nodes (labelled here A, B, C, D, E, F and G) connected by directional links carrying conditional probabilities such as P(c|a), P(c|d), P(e|c), P(f|e), P(g|e) and P(g|f); parentless nodes carry priors such as P(a) and P(b). [...]

... our entire belief net consisted of X, its parents and children, and we needed to update only the values on X. In the more general case, where the network is large, there may be many nodes whose values are unknown. In that case we may have to visit nodes randomly and update the probabilities until the entire configuration of probabilities is stable. It can be shown that under weak conditions, this process will converge to consistent values of the variables throughout the entire network (Problem 44).

Belief nets have found increasing use in complicated problems such as medical diagnosis. Here the uppermost nodes (ones without their own parents) represent a fundamental biological agent such as the presence of a virus or bacteria. Intermediate nodes then describe diseases, such as flu or emphysema, and the lowermost nodes the symptoms, such as high temperature or coughing. A physician enters measured values into the net and finds the most likely disease or cause. Such networks can be used in a somewhat more sophisticated way, automatically computing which unknown variable (node) should be measured to best reveal the identity of the disease. We will return in Chap. ?? to address the problem of learning in such belief net models.

3.10 Hidden Markov Models

While belief nets are a powerful method for representing the dependencies and independencies among variables, we turn now to the problem of representing a particular but extremely important class of dependencies. In problems that have an inherent temporality -- that is, that consist of a process unfolding in time -- we may have states at time t that are influenced directly by a state at t - 1. Hidden Markov models (HMMs) have found greatest use in such problems, for instance speech recognition or gesture recognition.
While the notation and description are unavoidably more complicated than for the simpler models considered up to this point, we stress that the same underlying ideas are exploited. Hidden Markov models have a number of parameters, whose values are set so as to best explain training patterns for the known category. Later, a test pattern is classified by the model that has the highest posterior probability, i.e., that best "explains" the test pattern.

3.10.1 First-order Markov models

We consider a sequence of states at successive times; the state at any time t is denoted ω(t). A particular sequence of length T is denoted by ω^T = {ω(1), ω(2), ..., ω(T)}; for instance we might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}. Note that the system can revisit a state at different steps, and that not every state need be visited.

Our model for the production of any sequence is described by transition probabilities P(ωj(t+1) | ωi(t)) = aij -- the time-independent probability of having state ωj at step t+1 given that the state at time t was ωi. There is no requirement that the transition probabilities be symmetric (aij ≠ aji, in general), and a particular state may be visited in succession (aii ≠ 0, in general), as illustrated in Fig. 3.9.

Figure 3.9: The discrete states, ωi, in a basic Markov model are represented by nodes, and the transition probabilities, aij, by links.

In a first-order discrete-time Markov model, at any step t the full system is in a particular state ω(t). The state at step t+1 is a random function that depends solely on the state at step t and the transition probabilities. Suppose we are given a particular model -- that is, the full set of aij -- as well as a particular sequence ω^T. In order to calculate the probability that the model generated the particular sequence, we simply multiply the successive probabilities.
For instance, to find the probability that a particular model generated the sequence described above, we would have P(ω^T | θ) = a14 a42 a22 a21 a14. If there is a prior probability on the first state, P(ω(1) = ωi), we could include such a factor as well; for simplicity, we will ignore that detail for now.

Up to here we have been discussing a Markov model, or technically speaking a first-order discrete-time Markov model, since the probability at t+1 depends only on the state at t. For instance, in a Markov model for the production of spoken words, we might have states representing phonemes. Such a Markov model for the word "cat" would have states for /k/, /a/ and /t/, with transitions from /k/ to /a/, transitions from /a/ to /t/, and transitions from /t/ to a final silent state. Note however that in speech recognition the perceiver does not have access to the states ω(t). Instead, we measure some properties of the emitted sound. Thus we will have to augment our Markov model to allow for visible states -- which are directly accessible to external measurement -- as separate from the ω states, which are not.

3.10.2 First-order hidden Markov models

We continue to assume that at every time step t the system is in a state ω(t), but now we also assume that it emits some (visible) symbol v(t). While sophisticated Markov models allow for the emission of continuous functions (e.g., spectra), we will restrict ourselves to the case where a discrete symbol is emitted. As with the states, we define a particular sequence of such visible states as V^T = {v(1), v(2), ..., v(T)}, and thus we might have V^6 = {v5, v1, v1, v5, v2, v3}. Our model is then that in any state ω(t) we have a probability of emitting a particular visible state vk(t). We denote this probability P(vk(t) | ωj(t)) = bjk.
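Returning to the (non-hidden) Markov model above: the probability of a state sequence is just the product of successive transition probabilities, which can be sketched in a few lines. The transition matrix below is an illustrative stand-in; the text does not give numerical aij:

```python
import numpy as np

# A[i, j] = a_ij, the probability of moving from state i to state j;
# each row sums to 1.  The values are invented for illustration.
A = np.array([[0.2, 0.3, 0.1, 0.4],
              [0.3, 0.2, 0.4, 0.1],
              [0.1, 0.1, 0.6, 0.2],
              [0.3, 0.3, 0.2, 0.2]])

def sequence_probability(A, states):
    """P(omega^T | theta): product of successive transition probabilities."""
    p = 1.0
    for s, s_next in zip(states, states[1:]):
        p *= A[s, s_next]
    return p

# The sequence omega^6 = {w1, w4, w2, w2, w1, w4} from the text,
# written with 0-indexed states:
seq = [0, 3, 1, 1, 0, 3]
p = sequence_probability(A, seq)
# equals a14 * a42 * a22 * a21 * a14 = 0.4 * 0.3 * 0.2 * 0.3 * 0.4
assert abs(p - 0.4 * 0.3 * 0.2 * 0.3 * 0.4) < 1e-12
```

As in the text, any prior on the first state is ignored here; including one would simply multiply the product by P(ω(1)).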
Because we have access only to the visible states, while the ωi are unobservable, such a full model is called a hidden Markov model (Fig. 3.10).

Figure 3.10: Three hidden units in an HMM and the transitions between them are shown in black, while the visible states and the emission probabilities of visible states are shown in red. This model shows all transitions as being possible; in other HMMs, some such candidate transitions are not allowed.

3.10.3 Hidden Markov Model Computation

Now we define some new terms and clarify our notation. In general, networks such as those in Fig. 3.10 are finite-state machines, and when they have associated transition probabilities, they are called Markov networks. They are strictly causal -- the probabilities depend only upon previous states. A Markov model is called ergodic if every one of the states has a non-zero probability of occurring given some starting state. A final or absorbing state ω0 is one which, if entered, is never left (i.e., a00 = 1).

As mentioned, we denote the transition probabilities aij among hidden states and the probability bjk of the emission of a visible state:

aij = P(ωj(t+1) | ωi(t))
bjk = P(vk(t) | ωj(t)).    (86)

We demand that some transition occur from step t to t+1 (even if it is to the same state), and that some visible symbol be emitted after every step. Thus we have the normalization conditions:

Σ_j aij = 1 for all i    and    Σ_k bjk = 1 for all j,    (87)

where the limits on the summations are over all hidden states and all visible symbols, respectively.

With these preliminaries behind us, we can now focus on the three central issues in hidden Markov models:

The Evaluation problem. Suppose we have an HMM, complete with transition probabilities aij and bjk.
Determine the probability that a particular sequence of visible states V^T was generated by that model.

The Decoding problem. Suppose we have an HMM as well as a set of observations V^T. Determine the most likely sequence of hidden states ω^T that led to those observations.

The Learning problem. Suppose we are given the coarse structure of a model (the number of hidden states and the number of visible states) but not the probabilities aij and bjk. Given a set of training observations of visible symbols, determine these parameters.

We consider each of these problems in turn.

3.10.4 Evaluation

The probability that the model produces a sequence V^T of visible states is:

P(V^T) = Σ_{r=1}^{rmax} P(V^T | ω_r^T) P(ω_r^T),    (88)

where each r indexes a particular sequence ω_r^T = {ω(1), ω(2), ..., ω(T)} of T hidden states. In the general case of c hidden states, there will be rmax = c^T possible terms in the sum of Eq. 88, corresponding to all possible sequences of length T. Thus, according to Eq. 88, in order to compute the probability that the model generated the particular sequence of T visible states V^T, we should take each conceivable sequence of hidden states, calculate the probability they produce V^T, and then add up these probabilities. The probability of a particular visible sequence is merely the product of the corresponding (hidden) transition probabilities aij and the (visible) output probabilities bjk at each step.

Because we are dealing here with a first-order Markov process, the second factor in Eq. 88, which describes the transition probability for the hidden states, can be rewritten as:

P(ω_r^T) = Π_{t=1}^{T} P(ω(t) | ω(t-1)),    (89)

that is, a product of the aij's according to the hidden sequence in question. In Eq. 89, ω(T) = ω0 is some final absorbing state, which uniquely emits the visible state v0. In speech recognition applications, ω0 typically represents a null state or lack of utterance, and v0 is some symbol representing silence.
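The "sum over all conceivable hidden sequences" reading of Eq. 88 can be written directly by enumerating all c^T paths; this is feasible only for tiny c and T, which motivates the Forward algorithm below. A sketch (the matrices are illustrative, and for simplicity it uses an explicit initial-state distribution rather than the text's absorbing-final-state convention):

```python
import itertools
import numpy as np

# Brute-force evaluation of P(V^T) as in Eq. 88: sum over every hidden
# sequence of (transition product) * (emission product).
A = np.array([[0.6, 0.4],
              [0.3, 0.7]])         # a_ij: hidden-state transition probabilities
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])         # b_jk: emission probabilities
pi = np.array([0.5, 0.5])          # initial-state probabilities (illustrative)

V = [0, 1, 1]                      # observed visible sequence
c, T = A.shape[0], len(V)

total = 0.0
for omega in itertools.product(range(c), repeat=T):  # all c**T hidden sequences
    p_hidden = pi[omega[0]]
    for t in range(1, T):
        p_hidden *= A[omega[t - 1], omega[t]]        # product of a_ij (Eq. 89)
    p_visible = 1.0
    for t in range(T):
        p_visible *= B[omega[t], V[t]]               # product of b_jk along the path
    total += p_hidden * p_visible                    # one term of Eq. 88

assert 0.0 < total < 1.0
```

With c = 2 and T = 3 this loop runs only 8 times; at c = 10, T = 20 it would be the 10^20-term sum the text warns against.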
Because of our assumption that the output probabilities depend only upon the hidden state, we can write the first factor in Eq. 88 as

P(V^T | ω_r^T) = Π_{t=1}^{T} P(v(t) | ω(t)),    (90)

that is, a product of bjk's according to the hidden state and the corresponding visible state. We can now use Eqs. 89 & 90 to express Eq. 88 as

P(V^T) = Σ_{r=1}^{rmax} Π_{t=1}^{T} P(v(t) | ω(t)) P(ω(t) | ω(t-1)).    (91)

Despite its formal complexity, Eq. 91 has a straightforward interpretation. The probability that we observe the particular sequence of T visible states V^T is equal to the sum, over all rmax possible sequences of hidden states, of the conditional probability that the system has made a particular transition multiplied by the probability that it then emitted the visible symbol in our target sequence. All these are captured in our parameters aij and bjk, and thus Eq. 91 can be evaluated directly. Alas, this is an O(c^T T) calculation, which is quite prohibitive in practice. For instance, if c = 10 and T = 20, we must perform on the order of 10^21 calculations.

A computationally simpler algorithm for the same goal is as follows. We can calculate P(V^T) recursively, since each term P(v(t)|ω(t)) P(ω(t)|ω(t-1)) involves only v(t), ω(t) and ω(t-1). We do this by defining

αj(t) = { 0                                  if t = 0 and j ≠ initial state
          1                                  if t = 0 and j = initial state
          [Σ_i αi(t-1) aij] bjk v(t)         otherwise,    (92)

where the notation bjk v(t) means the probability bjk selected by the visible state emitted at time t; thus the only non-zero contribution to the sum is for the index k which matches the visible state v(t). Thus αj(t) represents the probability that our HMM is in hidden state ωj at step t, having generated the first t elements of V^T.
This calculation is implemented in the Forward algorithm in the following way:

Algorithm 2 (HMM Forward)
1 begin initialize t ← 0, aij, bjk, visible sequence V^T, α(0) ← 1
2   for t ← t + 1
3     αj(t) ← [Σ_{i=1}^{c} αi(t-1) aij] bjk v(t)
4   until t = T
5   return P(V^T) ← α0(T)
6 end

where in line 5, α0 denotes the probability of the associated sequence ending in the known final state. The Forward algorithm thus has a computational complexity of O(c^2 T) -- far more efficient than the exhaustive enumeration of paths of Eq. 91 (Fig. 3.11). For the illustration of c = 10 and T = 20 above, we would need only on the order of 2000 calculations -- more than 17 orders of magnitude faster than examining each path individually.

We shall also have cause to use the Backward algorithm, which is the time-reversed version of the Forward algorithm:

Algorithm 3 (HMM Backward) [...]

... ω1 represents the phoneme /v/, ω2 represents /i/, ..., and ω0 a final silent state. Such a left-to-right model is more restrictive than the general HMM in Fig. 3.10, and precludes transitions "back" in time.

The Forward algorithm gives us P(V^T | θ). The prior probability of the model, P(θ), is given by some external source, such as a language model in the case of speech. This prior probability might depend upon the semantic context, or the previous words, or yet other information. In the absence of such information, it is traditional to assume a uniform density on P(θ), and hence ignore it in any classification problem. (This is an example of a "non-informative" prior.)

3.10.5 Decoding

Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states. While we might consider enumerating every possible path and calculating the probability of the visible sequence observed, this is an O(c^T T) calculation and prohibitive.
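A minimal sketch of the Forward recursion, cross-checked against brute-force enumeration. For simplicity it uses an initial-state distribution π rather than the text's designated initial state and absorbing final state, and the matrices are illustrative, not from the text:

```python
import itertools
import numpy as np

def forward(A, B, pi, V):
    """O(c^2 T) evaluation of P(V^T) by the Forward recursion."""
    c, T = A.shape[0], len(V)
    alpha = np.zeros((T, c))
    alpha[0] = pi * B[:, V[0]]                  # alpha_j at the first step
    for t in range(1, T):
        # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_{j, v(t)}
        alpha[t] = (alpha[t - 1] @ A) * B[:, V[t]]
    return alpha[-1].sum()                      # P(V^T)

A = np.array([[0.6, 0.4], [0.3, 0.7]])          # a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])          # b_jk
pi = np.array([0.5, 0.5])
V = [0, 1, 1]

p = forward(A, B, pi, V)

# Cross-check against the O(c^T T) sum over all hidden paths (Eq. 91).
total = sum(
    pi[w[0]]
    * np.prod([A[w[t - 1], w[t]] for t in range(1, len(V))])
    * np.prod([B[w[t], V[t]] for t in range(len(V))])
    for w in itertools.product(range(2), repeat=len(V))
)
assert abs(p - total) < 1e-12
```

The recursion touches each of the c states once per step, with a c-term sum each time, giving the O(c^2 T) cost quoted in the text.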
Instead, we use perhaps the simplest decoding algorithm:

Algorithm 4 (HMM decoding)
1  begin initialize Path ← {}, t ← 0
2    for t ← t + 1
3      k ← 0, α0 ← 0
4      for k ← k + 1
5        αk(t) ← [Σ_{i=1}^{c} αi(t-1) aik] bk v(t)
6      until k = c
7      j' ← arg max_j αj(t)
8      Append ωj' to Path
9    until t = T
10   return Path
11 end

A closely related algorithm uses logarithms of the probabilities and calculates total probabilities by addition of such logarithms; this method has complexity O(c^2 T) (Problem 48).

Figure 3.13: The decoding algorithm finds at each time step t the state that has the highest probability of having come from the previous step and generated the observed visible state vk. The full path is the sequence of such states. Because this is a local optimization (dependent only upon the single previous time step, not the full sequence), the algorithm does not guarantee that the path is indeed allowable. For instance, it might be that the maximum at t = 5 is ω1 and at t = 6 is ω2, and thus both would appear in the path -- even if a12 = P(ω2(t+1) | ω1(t)) = 0, precluding that transition.

The red line in Fig. 3.13 corresponds to Path, and connects the hidden states with the highest value of αi at each step t. There is a difficulty, however: there is no guarantee that the path is in fact a valid one -- it might not be consistent with the underlying model. For instance, it is possible that the path actually implies a transition that is forbidden by the model, as illustrated in Example 5.

Example 5: HMM decoding

We find the path for the data of Example 4 for the sequence {ω1, ω3, ω2, ω1, ω0}. Note especially that the transition from ω3 to ω2 is not allowed according to the transition probabilities aij given in Example 4. The path locally optimizes the probability through the trellis.
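The local scheme of Algorithm 4 can be sketched as follows (matrices and initial distribution are illustrative, not from the text). Note that, exactly as the text warns, the locally chosen path is not guaranteed to respect the transition structure:

```python
import numpy as np

def decode_local(A, B, pi, V):
    """At each step, pick the state with the largest alpha_j(t).
    This is the local optimization of Algorithm 4: the resulting path
    may contain transitions that the model actually forbids."""
    T = len(V)
    alpha = np.zeros((T, A.shape[0]))
    alpha[0] = pi * B[:, V[0]]
    path = [int(np.argmax(alpha[0]))]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, V[t]]
        path.append(int(np.argmax(alpha[t])))
    return path

A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
V = [0, 1, 1]

path = decode_local(A, B, pi, V)
assert path == [0, 1, 1]   # locally most probable state at each step
```

A global optimization over whole paths (keeping, for each state, the best predecessor rather than just the local maximum) avoids the forbidden-transition problem; the text notes the logarithmic variant with the same O(c^2 T) cost.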
[Figure: The locally optimal path through the HMM trellis of Example 4.]

HMMs address the problem of rate invariance in the following two ways. The first is that the transition probabilities themselves incorporate probabilistic structure of the durations. Moreover, using post-processing, we can delete repeated states and obtain a sequence that is somewhat independent of variations in rate. Thus in post-processing we can convert the sequence {ω1, ω1, ω3, ω2, ω2, ω2} to {ω1, ω3, ω2}, which would be appropriate for speech recognition, where the fundamental phonetic units are not repeated in natural speech.

3.10.6 Learning

The goal in HMM learning is to determine the model parameters -- the transition probabilities aij and bjk -- from an ensemble of training samples. There is no known method for obtaining the optimal or most likely set of parameters from the data, but we can nearly always determine a good solution by a straightforward technique.

The Forward-backward Algorithm

The Forward-backward algorithm is an instance of a generalized Expectation-Maximization algorithm. The general approach will be to iteratively update the weights in order to better explain the observed training sequences. Above, we defined αi(t) as the probability that the model is in state ωi(t) and has generated the target sequence up to step t. We can analogously define βi(t) to be the probability that the model is in state ωi(t) and will generate the remainder of the given target sequence, i.e., from t+1 to T. We express βi(t) as:

βi(t) = { 0                                   if ωi(t) ≠ sequence's final state and t = T
          1                                   if ωi(t) = sequence's final state and t = T
          Σ_j aij bjk v(t+1) βj(t+1)          otherwise.    (94)

To understand Eq. 94, imagine we knew βi(t) up to step T - 1, and we wanted to calculate the probability that the model would generate the remaining single visible
symbol. This probability, βi(T-1), is just the probability of making a transition to a state ωj(T) multiplied by the probability that this hidden state emitted the correct final visible symbol. By the definition of βj(T) in Eq. 94, this will be either 0 (if ωj(T) is not the final hidden state) or 1 (if it is). Thus it is clear that

βi(T-1) = Σ_j aij bjk v(T) βj(T).

Now that we have determined βi(T-1), we can repeat the process to determine βi(T-2), and so on, backward through the trellis of Fig. ??.

But the αi(t) and βi(t) we determined are merely estimates of their true values, since we don't know the actual values of the transition probabilities aij and bjk in Eq. 94. We can calculate an improved value by first defining γij(t) -- the probability of a transition between ωi(t-1) and ωj(t), given that the model generated the entire training sequence V^T by any path. We define γij(t) as follows:

γij(t) = [αi(t-1) aij bjk v(t) βj(t)] / P(V^T | θ),    (95)

where P(V^T | θ) is the probability that the model generated sequence V^T by any path. Thus γij(t) is the probability of a transition from state ωi(t-1) to ωj(t) given that the model generated the complete visible sequence V^T.

We can now calculate an improved estimate for aij. The expected number of transitions between ωi(t-1) and ωj(t) at step t is γij(t), whereas the expected number of such transitions at any time in the sequence is Σ_{t=1}^{T} γij(t). Thus âij (the estimate of the probability of a transition from ωi(t-1) to ωj(t)) can be found by taking the ratio between the expected number of transitions from ωi to ωj and the total expected number of any transitions from ωi. That is:

âij = [Σ_{t=1}^{T} γij(t)] / [Σ_{t=1}^{T} Σ_k γik(t)].    (96)

In the same way, we can obtain an improved estimate b̂jk by calculating the ratio between the frequency that any particular symbol vk is emitted and that for any symbol. Thus we have

b̂jk = [Σ_{t=1, v(t)=vk}^{T} Σ_i γij(t)] / [Σ_{t=1}^{T} Σ_i γij(t)].    (97)
In short, then, we start with rough or arbitrary estimates of aij and bjk, calculate improved estimates by Eqs. 96 & 97, and repeat until some convergence criterion is met (e.g., a sufficiently small change in the estimated values of the parameters on subsequent iterations). This is the Baum-Welch or Forward-backward algorithm -- an example of a generalized Expectation-Maximization algorithm (Sec. 3.8):

Algorithm 5 (Forward-backward)
1 begin initialize aij, bjk, training sequence V^T, convergence criterion θ, z ← 0
2   do z ← z + 1
3     compute â(z) from a(z-1) and b(z-1) by Eq. 96
4     compute b̂(z) from a(z-1) and b(z-1) by Eq. 97
5     aij(z) ← âij(z)
6     bjk(z) ← b̂jk(z)
7   until max_{i,j,k} [aij(z) - aij(z-1), bjk(z) - bjk(z-1)] < θ    (convergence achieved)
8   return aij ← aij(z); bjk ← bjk(z)
9 end

The stopping or convergence criterion in line 7 halts learning when no estimated transition probability changes more than a predetermined amount, θ. In typical speech recognition applications, convergence requires several presentations of each training sequence (fewer than five is common). Other popular stopping criteria are based on the overall probability that the learned model could have generated the full training data.

Summary

If we know a parametric form of the class-conditional probability densities, we can reduce our learning task from one of finding the distribution itself to that of finding the parameters (represented by a vector θi for each category ωi), and use the resulting distributions for classification. The maximum likelihood method seeks to find the parameter value that is best supported by the training data, i.e., that maximizes the probability of obtaining the samples actually observed. (In practice, for computational simplicity one typically uses the log-likelihood.)
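The full re-estimation loop can be sketched compactly. The following is an illustrative Baum-Welch implementation under simplifying assumptions not in the text (an explicit initial-state distribution, a single training sequence, invented numbers); the two re-estimation lines correspond to Eqs. 96 & 97:

```python
import numpy as np

def baum_welch(A, B, pi, V, n_iter=10):
    """Re-estimate a_ij and b_jk from one observed sequence V.
    Each EM iteration cannot decrease the likelihood of V."""
    c, T = A.shape[0], len(V)
    for _ in range(n_iter):
        # Forward and Backward passes.
        alpha = np.zeros((T, c)); beta = np.zeros((T, c))
        alpha[0] = pi * B[:, V[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, V[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, V[t + 1]] * beta[t + 1])
        pV = alpha[-1].sum()                      # P(V^T | theta)
        # xi[t, i, j]: probability of an i -> j transition at step t (cf. Eq. 95).
        xi = np.zeros((T - 1, c, c))
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * (B[:, V[t + 1]] * beta[t + 1])[None, :]
            xi[t] /= pV
        gamma = alpha * beta / pV                 # state-occupancy probabilities
        # Re-estimation, Eq. 96 (transitions) and Eq. 97 (emissions).
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(B.shape[1]):
            B[:, k] = gamma[np.array(V) == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return A, B

A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
V = [0, 1, 1, 0, 0, 1]

A2, B2 = baum_welch(A.copy(), B.copy(), pi, V)
# The re-estimated parameters remain valid probability tables.
assert np.allclose(A2.sum(axis=1), 1.0) and np.allclose(B2.sum(axis=1), 1.0)
```

A practical implementation would scale or take logarithms of α and β to avoid underflow on long sequences, and would pool the γ statistics over an ensemble of training sequences as the text describes.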
In Bayesian estimation the parameters are considered random variables having a known a priori density; the training data convert this to an a posteriori density. The recursive Bayes method updates the Bayesian parameter estimate incrementally, i.e., as each training point is sampled. While Bayesian estimation is, in principle, to be preferred, maximum likelihood methods are generally easier to implement, and in the limit of large training sets they give classifiers that are nearly as accurate.

A sufficient statistic s for θ is a function of the samples that contains all information needed to determine θ. Once we know the sufficient statistics for models of a given form (e.g., the exponential family), we need only estimate their values from data to create our classifier -- no other functions of the data are relevant.

Expectation-Maximization is an iterative scheme to maximize the model likelihood, even when some data are missing. Each iteration employs two steps: the expectation or E step, which requires marginalizing over the missing variables given the current model, and the maximization or M step, in which the optimum parameters of a new model are chosen. Generalized Expectation-Maximization algorithms demand merely that the parameters be improved -- not optimized -- on each iteration, and have been applied to the training of a large range of models.

Bayesian belief nets allow the designer to specify, by means of connection topology, the functional dependences and independencies among model variables. When any subset of variables is clamped to some known values, each node comes to a probability of its value through a Bayesian inference calculation. Parameters representing conditional dependences can be set by an expert.

Hidden Markov models consist of nodes representing hidden states, interconnected by links describing the conditional probabilities of a transition between the states. Each hidden state also has an associated set of probabilities of emitting a particular visible state.
HMMs can be useful in modelling sequences, particularly context-dependent ones, such as phonemes in speech. All the transition probabilities can be learned (estimated) iteratively from sample sequences by means of the Forward-backward or Baum-Welch algorithm, an example of a generalized EM algorithm. Classification proceeds by finding the single model among the candidates that is most likely to have produced a given observed sequence.

Bibliographical and Historical Remarks

Maximum likelihood and Bayes estimation have a long history. The Bayesian approach to learning in pattern recognition began with the suggestion that the proper way to use samples when the conditional densities are unknown is the calculation of P(ωi|x, D) [6]. Bayes himself appreciated the role of non-informative priors. An analysis of different priors from statistics appears in [21, 15], and [4] has an extensive list of references. The origins of Bayesian belief nets can be traced back to [33], and a thorough literature review can be found in [8]; excellent modern books such as [24, 16] and tutorials [7] can be recommended. An important dissertation on the theory of belief nets, with an application to medical diagnosis, is [14], and a summary of work on diagnosis of machine faults is [13]. While we have focussed on directed acyclic graphs, belief nets are of broader use, and even allow loops or arbitrary topologies -- a topic that would lead us far afield here, but which is treated in [16].

The Expectation-Maximization algorithm is due to Dempster et al. [11], and a thorough overview and history appears in [23]. On-line or incremental versions of EM are described in [17, 31]. The definitive compendium of work on missing data, including much beyond our discussion here, is [27]. Markov developed what later became called the Markov framework [22] in order to analyze the text of his fellow Russian Pushkin's masterpiece Eugene Onegin.
Hidden Markov models were introduced by Baum and collaborators [2, 3], and have had their greatest applications in speech recognition [25, 26], and to a lesser extent statistical language learning [9] and sequence identification, such as in DNA sequences [20, 1]. Hidden Markov methods have been extended to two dimensions and applied to recognizing characters in optical document images [19]. The decoding algorithm is related to pioneering work of Viterbi and followers [32, 12]. The relationship between hidden Markov models and graphical models such as Bayesian belief nets is explored in [29]. Knuth's classic [18] was the earliest compendium of the central results on computational complexity, the majority due to himself. The standard books [10], which inspired several homework problems below, are a bit more accessible for those without deep backgrounds in computer science. Finally, several other pattern recognition textbooks, such as [28, 5, 30], which take a somewhat different approach to the field, can be recommended.

Problems

Section 3.2

1. Let $x$ have an exponential density

$p(x|\theta) = \theta e^{-\theta x}$ for $x \ge 0$, and $0$ otherwise.

(a) Plot $p(x|\theta)$ versus $x$ for $\theta = 1$. Plot $p(x|\theta)$ versus $\theta$ $(0 \le \theta \le 5)$ for $x = 2$.

(b) Suppose that $n$ samples $x_1, \ldots, x_n$ are drawn independently according to $p(x|\theta)$. Show that the maximum likelihood estimate for $\theta$ is given by

$\hat{\theta} = \dfrac{1}{\frac{1}{n}\sum_{k=1}^{n} x_k}.$

(c) On your graph generated with $\theta = 1$ in part (a), mark the maximum likelihood estimate $\hat{\theta}$ for large $n$.

2. Let $x$ have a uniform density

$p(x|\theta) \sim U(0, \theta) = 1/\theta$ for $0 \le x \le \theta$, and $0$ otherwise.

(a) Suppose that $n$ samples $\mathcal{D} = \{x_1, \ldots, x_n\}$ are drawn independently according to $p(x|\theta)$. Show that the maximum likelihood estimate for $\theta$ is $\max[\mathcal{D}]$, i.e., the value of the maximum element in $\mathcal{D}$.

(b) Suppose that $n = 5$ points are drawn from the distribution, the maximum value of which happens to be $\max_k x_k = 0.6$. Plot the likelihood $p(\mathcal{D}|\theta)$ in the range $0 \le \theta \le 1$. Explain in words why you do not need to know the values of the other four points.
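As a numerical sanity check on Problem 1(b) (my own illustration, with a made-up sample, not part of the text), the exponential log-likelihood is maximized at the reciprocal of the sample mean:

```python
import math

# Log-likelihood of i.i.d. exponential samples:
# L(theta) = n*log(theta) - theta*sum(x).
def exp_loglik(theta, xs):
    return len(xs) * math.log(theta) - theta * sum(xs)

xs = [0.5, 1.2, 0.3, 2.0]                 # made-up sample
theta_hat = 1.0 / (sum(xs) / len(xs))     # MLE: inverse of the sample mean

# A coarse grid search should not beat the closed-form estimate.
grid = [0.05 * k for k in range(1, 200)]
best = max(grid, key=lambda t: exp_loglik(t, xs))
print(theta_hat, best)   # the best grid point lies next to theta_hat
```

The same pattern -- write the log-likelihood, differentiate, solve -- gives the estimates asked for in Problems 2 through 5 as well.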
3. Maximum likelihood methods apply to estimates of prior probabilities as well. Let samples be drawn by successive, independent selections of a state of nature $\omega_i$ with unknown probability $P(\omega_i)$. Let $z_{ik} = 1$ if the state of nature for the $k$th sample is $\omega_i$ and $z_{ik} = 0$ otherwise.

(a) Show that

$P(z_{i1}, \ldots, z_{in} \mid P(\omega_i)) = \prod_{k=1}^{n} P(\omega_i)^{z_{ik}} \left(1 - P(\omega_i)\right)^{1 - z_{ik}}.$

(b) Show that the maximum likelihood estimate for $P(\omega_i)$ is

$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} z_{ik}.$

Interpret your result in words.

4. Let $x$ be a $d$-dimensional binary (0 or 1) vector with a multivariate Bernoulli distribution

$P(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1 - \theta_i)^{1 - x_i},$

where $\theta = (\theta_1, \ldots, \theta_d)^t$ is an unknown parameter vector, $\theta_i$ being the probability that $x_i = 1$. Show that the maximum likelihood estimate for $\theta$ is

$\hat{\theta} = \frac{1}{n} \sum_{k=1}^{n} x_k.$

5. Let each component $x_i$ of $x$ be binary valued (0 or 1) in a two-category problem with $P(\omega_1) = P(\omega_2) = 0.5$. Suppose that the probability of obtaining a 1 in any component is

$p_{i1} = p, \qquad p_{i2} = 1 - p,$

and we assume for definiteness $p > 1/2$. The probability of error is known to approach zero as the dimensionality $d$ approaches infinity. This problem asks you to explore the behavior as we increase the number of features in a single sample -- a complementary situation.

(a) Suppose that a single sample $x = (x_1, \ldots, x_d)^t$ is drawn from category $\omega_1$. Show that the maximum likelihood estimate for $p$ is given by

$\hat{p} = \frac{1}{d} \sum_{i=1}^{d} x_i.$

(b) Describe the behavior of $\hat{p}$ as $d$ approaches infinity. Indicate why such behavior means that by letting the number of feat

ntinuous at $x$. The second condition, which only makes sense if $p(x) \neq 0$, assures us that the frequency ratio will converge (in probability) to the probability $P$. The third condition is clearly necessary if $p_n(x)$, given by Eq. 7, is to converge at all.
It also says that although a huge number of samples will eventually fall within the small region $R_n$, they will form a negligibly small fraction of the total number of samples. There are two common ways of obtaining sequences of regions that satisfy these conditions (Fig. 4.2). One is to shrink an initial region by specifying the volume $V_n$ as some function of $n$, such as $V_n = 1/\sqrt{n}$. It then must be shown that the random variables $k_n$ and $k_n/n$ behave properly, or more to the point, that $p_n(x)$ converges to $p(x)$. This is basically the Parzen-window method that will be examined in Sect. 4.3. The second method is to specify $k_n$ as some function of $n$, such as $k_n = \sqrt{n}$. Here the volume $V_n$ is grown until it encloses $k_n$ neighbors of $x$. This is the $k_n$-nearest-neighbor estimation method. Both of these methods do in fact converge, although it is difficult to make meaningful statements about their finite-sample behavior.

Figure 4.2: Two methods for estimating the density at a point $x$ (at the center of each square), shown for $n = 1, 2, 3$ and $10$ samples: shrinking the region volume as a prescribed function of $n$, or growing the region until it encloses $k_n$ neighbors of $x$.

4.3 Parzen Windows

The Parzen-window approach to estimating densities can be introduced by temporarily assuming that the region $R_n$ is a $d$-dimensional hypercube. If $h_n$ is the length of an edge of that hypercube, then its volume is given by

$V_n = h_n^d.$ (8)

We can obtain an analytic expression for $k_n$, the number of samples falling in the hypercube, by defining the following window function:

$\varphi(u) = 1$ if $|u_j| \le 1/2$, $j = 1, \ldots, d$, and $0$ otherwise. (9)

Thus, $\varphi(u)$ defines a unit hypercube centered at the origin. It follows that $\varphi((x - x_i)/h_n)$ is equal to unity if $x_i$ falls within the hypercube of volume $V_n$ centered at $x$, and is zero otherwise. The number of samples in this hypercube is therefore given by

$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right),$ (10)

and when we substitute this into Eq. 7 we obtain the estimate

$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right).$ (11)

This equation suggests a more general approach to estimating density functions.
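Equation 11 translates directly into code. Here is a minimal one-dimensional sketch (my own illustration, with made-up samples) using the hypercube -- in 1-D, an interval -- window of Eq. 9:

```python
# Parzen-window density estimate (Eq. 11) in one dimension,
# using the unit-interval window of Eq. 9: phi(u) = 1 if |u| <= 1/2.

def phi(u):
    return 1.0 if abs(u) <= 0.5 else 0.0

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h."""
    n = len(samples)
    return sum(phi((x - xi) / h) for xi in samples) / (n * h)

samples = [0.1, 0.2, 0.3, 0.9]
# Integrate the estimate numerically; a legitimate density should give ~1.
dx = 0.001
total = sum(parzen_estimate(k * dx, samples, h=0.2) * dx
            for k in range(-2000, 4000))
print(total)   # close to 1.0
```

Because the window is a density and $V_n = h$, the estimate automatically integrates to one, anticipating the conditions of Eqs. 12 and 13.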
Rather than limiting ourselves to the hypercube window function of Eq. 9, suppose we allow a more general class of window functions. In such a case, Eq. 11 expresses our estimate for $p(x)$ as an average of functions of $x$ and the samples $x_i$. In essence, the window function is being used for interpolation -- each sample contributing to the estimate in accordance with its distance from $x$. It is natural to ask that the estimate $p_n(x)$ be a legitimate density function, i.e., that it be nonnegative and integrate to one. This can be assured by requiring the window function itself to be a density function. To be more precise, if we require that

$\varphi(x) \ge 0$ (12)

and

$\int \varphi(u)\, du = 1,$ (13)

and if we maintain the relation $V_n = h_n^d$, then it follows at once that $p_n(x)$ also satisfies these conditions.

Let us examine the effect that the window width $h_n$ has on $p_n(x)$. If we define the function $\delta_n(x)$ by

$\delta_n(x) = \frac{1}{V_n}\, \varphi\!\left(\frac{x}{h_n}\right),$ (14)

then we can write $p_n(x)$ as the average

$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i).$ (15)

Since $V_n = h_n^d$, $h_n$ clearly affects both the amplitude and the width of $\delta_n(x)$ (Fig. 4.3). If $h_n$ is very large, the amplitude of $\delta_n$ is small, and $x$ must be far from $x_i$ before $\delta_n(x - x_i)$ changes much from $\delta_n(0)$. In this case, $p_n(x)$ is the superposition of $n$ broad, slowly changing functions and is a very smooth "out-of-focus" estimate of $p(x)$. On the other hand, if $h_n$ is very small, the peak value of $\delta_n(x - x_i)$ is large and occurs near $x = x_i$. In this case $p_n(x)$ is the superposition of $n$ sharp pulses centered at the samples -- an erratic, "noisy" estimate (Fig. 4.4). For any value of $h_n$, the distribution is normalized, i.e.,

$\int \delta_n(x - x_i)\, dx = \int \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \int \varphi(u)\, du = 1.$ (16)
Thus, as $h_n$ approaches zero, $\delta_n(x - x_i)$ approaches a Dirac delta function centered at $x_i$, and $p_n(x)$ approaches a superposition of delta functions centered at the samples.

Figure 4.7: Parzen-window estimates of a bimodal distribution using different window widths ($h_1 = 0.5$ and $h_1 = 0.2$) and numbers of samples ($n = 1, 16, 256, \infty$). Note particularly that the $n = \infty$ estimates are the same (and match the true generating distribution), regardless of the window width $h$.

4.3.4 Classification example

In classifiers based on Parzen-window estimation, we estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior. If there are multiple categories with unequal priors we can easily include these too (Problem 4). The decision regions for a Parzen-window classifier depend upon the choice of window function, of course, as illustrated in Fig. 4.8. In general, the training error -- the empirical error on the training points themselves -- can be made arbitrarily low by making the window width sufficiently small. However, the goal of creating a classifier is to classify novel patterns, and alas a low training error does not guarantee a small test error, as we shall explore in Chap. ??. Although a generic Gaussian window shape can be justified by considerations of noise, statistical independence and uncertainty, in the absence of other information about the underlying distributions there is little theoretical justification for one window width over another. These density estimation and classification examples illustrate some of the power and some of the limitations of nonparametric methods. Their power resides in their generality. Exactly the same procedure was used for the unimodal normal case and the bimodal mixture case, and we did not need to make any assumptions about the

We ignore cases in which the same feature vector has been assigned to multiple categories.
Figure 4.8: The decision boundaries in a two-dimensional Parzen-window dichotomizer depend on the window width $h$. At the left, a small $h$ leads to boundaries that are more complicated than for the large $h$ on the same data set, shown at the right. Apparently, for these data a small $h$ would be appropriate for the upper region, while a large $h$ would be appropriate for the lower region; no single window width is ideal overall.

distributions ahead of time. With enough samples, we are essentially assured of convergence to an arbitrarily complicated target density. On the other hand, the number of samples needed may be very large indeed -- much greater than would be required if we knew the form of the unknown density. Little or nothing in the way of data reduction is provided, which leads to severe requirements for computation time and storage. Moreover, the demand for a large number of samples grows exponentially with the dimensionality of the feature space. This limitation is related to the "curse of dimensionality," and severely restricts the practical application of such nonparametric procedures (Problem 11). The fundamental reason for the curse of dimensionality is that high-dimensional functions have the potential to be much more complicated than low-dimensional ones, and that those complications are harder to discern. The only way to beat the curse is to incorporate knowledge about the data that is correct.

4.3.5 Probabilistic Neural Networks (PNNs)

A hardware implementation of the Parzen-windows approach is found in Probabilistic Neural Networks (Fig. 4.9). Suppose we wish to form a Parzen estimate based on $n$ patterns, each of which is $d$-dimensional, randomly sampled from $c$ classes. The PNN for this case consists of $d$ input units comprising the input layer, each unit connected to each of the $n$ pattern units; each pattern unit is, in turn, connected to one and only one of the $c$ category units.
The connections from the input to pattern units represent modifiable weights, which will be trained. (While these we

n one dimension and the $k$-nearest-neighbor density estimates, for $k = 3$ and 5. Note especially that the discontinuities in the slopes of the estimates generally occur away from the positions of the points themselves.

Figure 4.11: The $k$-nearest-neighbor estimate of a two-dimensional density for $k = 5$. Notice how such a finite-$n$ estimate can be quite "jagged," and that discontinuities in the slopes generally occur along lines away from the positions of the points themselves.

Figure 4.12: Several $k$-nearest-neighbor estimates of two unidimensional densities: a Gaussian and a bimodal distribution ($n = 1, 16, 256, \infty$, with $k_n = 1, 4, 16, \infty$ respectively). Notice how the finite-$n$ estimates can be quite "spiky."

With $n = 1$ and $k_n = \sqrt{n} = 1$, the estimate becomes

$p_n(x) = \frac{1}{2|x - x_1|}.$ (32)

This is clearly a poor estimate of $p(x)$, with its integral embarrassing us by diverging to infinity. As shown in Fig. 4.12, the estimate becomes considerably better as $n$ gets larger, even though the integral of the estimate remains infinite. This unfortunate fact is compensated by the fact that $p_n(x)$ never plunges to zero just because no samples fall within some arbitrary cell or window. While this might seem to be a meager compensation, it can be of considerable value in higher-dimensional spaces. As with the Parzen-window approach, we could obtain a family of estimates by taking $k_n = k_1\sqrt{n}$ and choosing different values for $k_1$. However, in the absence of any additional information, one choice is as good as another, and we can be confident only that the results will be correct in the infinite-data case.
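The $k_n$-nearest-neighbor estimate is easy to sketch in one dimension (my own illustration, with made-up samples): the cell around $x$ is grown just far enough to reach the $k$th nearest sample, giving $p_n(x) = k/(nV)$ with $V$ twice the distance to that neighbor.

```python
# k-nearest-neighbor density estimate in one dimension:
# p_n(x) = k / (n * V), where V = 2 * (distance to the k-th nearest sample).

def knn_density(x, samples, k):
    dists = sorted(abs(x - xi) for xi in samples)
    v = 2.0 * dists[k - 1]          # interval just enclosing k neighbors
    return k / (len(samples) * v)

samples = [0.0, 1.0, 2.0, 4.0]
print(knn_density(1.5, samples, k=2))   # 2 / (4 * 2*0.5) = 0.5
```

Note how the estimate blows up as $x$ approaches a sample when $k = 1$, mirroring the divergent integral of Eq. 32.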
For classification, one popular method is to adjust the window width until the classifier has the lowest error on a separate set of samples, also drawn from the target distributions, a technique we shall explore in Chap. ??.

4.4.1 Estimation of a posteriori probabilities

The techniques discussed in the previous sections can be used to estimate the a posteriori probabilities $P(\omega_i|x)$ from a set of $n$ labelled samples by using the samples to estimate the densities involved. Suppose that we place a cell of volume $V$ around $x$ and capture $k$ samples, $k_i$ of which turn out to be labelled $\omega_i$. Then the obvious estimate for the joint probability $p(x, \omega_i)$ is

$p_n(x, \omega_i) = \frac{k_i/n}{V},$ (33)

and thus a reasonable estimate for $P(\omega_i|x)$ is

$P_n(\omega_i|x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}.$ (34)

That is, the estimate of the a posteriori probability that $\omega_i$ is the state of nature is merely the fraction of the samples within the cell that are labelled $\omega_i$. Consequently, for minimum error rate we select the category most frequently represented within the cell. If there are enough samples and if the cell is sufficiently small, it can be shown that this will yield performance approaching the best possible. When it comes to choosing the size of the cell, it is clear that we can use either the Parzen-window approach or the $k_n$-nearest-neighbor approach. In the first case, $V_n$ would be some specified function of $n$, such as $V_n = 1/\sqrt{n}$. In the second case, $V_n$ would be expanded until some specified number of samples were captured, such as $k = \sqrt{n}$. In either case, as $n$ goes to infinity an infinite number of samples will fall within the infinitely small cell. The fact that the cell volume could become arbitrarily small and yet contain an arbitrarily large number of samples would allow us to learn the unknown probabilities with virtual certainty and thus eventually obtain optimum performance.
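Equation 34 reduces posterior estimation to counting labels within the cell. A minimal sketch (my own illustration, using the $k$ nearest neighbors to define the cell, with made-up data):

```python
# Estimate P(omega_i | x) as the fraction k_i / k of the k samples
# nearest to x that carry label omega_i (Eq. 34).
from collections import Counter

def knn_posteriors(x, samples, labels, k):
    order = sorted(range(len(samples)), key=lambda i: abs(x - samples[i]))
    counts = Counter(labels[i] for i in order[:k])
    return {label: c / k for label, c in counts.items()}

samples = [0.0, 0.2, 0.4, 2.0, 2.2]
labels  = ['a', 'a', 'b', 'b', 'b']
post = knn_posteriors(0.1, samples, labels, k=3)
print(post)   # 'a' gets 2/3, 'b' gets 1/3
```

Choosing the label with the largest estimated posterior then implements the minimum-error-rate decision described above.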
Interestingly enough, we shall now see that we can obtain comparable performance if we base our decision solely on the label of the single nearest neighbor of $x$.

4.5 The Nearest-Neighbor Rule

While the $k$-nearest-neighbor algorithm was first proposed for arbitrary $k$, the crucial matter of determining the error bound was first solved for $k = 1$. This nearest-neighbor algorithm has conceptual and computational simplicity. We begin by letting $\mathcal{D}^n = \{x_1, \ldots, x_n\}$ denote a set of $n$ labelled prototypes, and $x' \in \mathcal{D}^n$ be the prototype nearest to a test point $x$. Then the nearest-neighbor rule for classifying $x$ is to assign it the label associated with $x'$. The nearest-neighbor rule is a sub-optimal procedure; its use will usually lead to an error rate greater than the minimum possible, the Bayes rate. We shall see, however, that with an unlimited number of prototypes the error rate is never worse than twice the Bayes rate.

Before we get immersed in details, let us try to gain a heuristic understanding of why the nearest-neighbor rule should work so well. To begin with, note that the label $\theta'$ associated with the nearest neighbor is a random variable, and the probability that $\theta' = \omega_i$ is merely the a posteriori probability $P(\omega_i|x')$. When the number of samples is very large, it is reasonable to assume that $x'$ is sufficiently close to $x$ that $P(\omega_i|x') \simeq P(\omega_i|x)$. Since this is exactly the probability that nature will be in state $\omega_i$, the nearest-neighbor rule is effectively matching probabilities with nature.

If we define $\omega_m(x)$ by

$P(\omega_m|x) = \max_i P(\omega_i|x),$ (35)

then the Bayes decision rule always selects $\omega_m$. This rule allows us to partition the feature space into cells consisting of all points closer to a given training point $x'$ than to any other training points. All points in such a cell are thus labelled by the category of the training point -- a so-called Voronoi tessellation of the space (Fig. 4.13).
Figure 4.13: In two dimensions, the nearest-neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labelled by the category of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal.

When $P(\omega_m|x)$ is close to unity, the nearest-neighbor selection is almost always the same as the Bayes selection. That is, when the minimum probability of error is small, the nearest-neighbor probability of error is also small. When $P(\omega_m|x)$ is close to $1/c$, so that all classes are essentially equally likely, the selections made by the nearest-neighbor rule and the Bayes decision rule are rarely the same, but the probability of error is approximately $1 - 1/c$ for both. While more careful analysis is clearly necessary, these observations should make the good performance of the nearest-neighbor rule less surprising.

Our analysis of the behavior of the nearest-neighbor rule will be directed at obtaining the infinite-sample conditional average probability of error $P(e|x)$, where the averaging is with respect to the training samples. The unconditional average probability of error will then be found by averaging $P(e|x)$ over all $x$:

$P(e) = \int P(e|x)\, p(x)\, dx.$ (36)

In passing we should recall that the Bayes decision rule minimizes $P(e)$ by minimizing $P(e|x)$ for every $x$. Recall from Chap. ?? that if we let $P^*(e|x)$ be the minimum possible value of $P(e|x)$, and $P^*$ be the minimum possible value of $P(e)$, then

$P^*(e|x) = 1 - P(\omega_m|x)$ (37)

and

$P^* = \int P^*(e|x)\, p(x)\, dx.$ (38)

4.5.1 Convergence of the Nearest Neighbor

We now wish to evaluate the average probability of error for the nearest-neighbor rule. In particular, if $P_n(e)$ is the $n$-sample error rate, and if

$P = \lim_{n \to \infty} P_n(e),$ (39)

then we want to show that

$P^* \le P \le P^*\!\left(2 - \frac{c}{c-1}\, P^*\right).$ (40)

We begin by observing that when the nearest-neighbor rule is used with a particular set of $n$ samples, the resulting error rate will depend on the accidental characteristics of the samples. In particular, if different sets of $n$ samples are used to classify $x$, different vectors $x'$ will be obtained for the nearest neighbor of $x$. Since the decision rule depends on this nearest neighbor, we have a conditional probability of error $P(e|x, x')$ that depends on both $x$ and $x'$. By averaging over $x'$, we obtain

$P(e|x) = \int P(e|x, x')\, p(x'|x)\, dx',$ (41)

where we understand that there is an implicit dependence upon the n

calculation is $O(d)$, and thus this search is $O(dn^2)$. An alternative but straightforward parallel implementation is shown in Fig. 4.17, which is $O(1)$ in time and $O(n)$ in space.

Figure 4.17: A parallel nearest-neighbor circuit can perform search in constant -- i.e., $O(1)$ -- time. The $d$-dimensional test pattern $x$ is presented to each box, which calculates which side of a cell's face $x$ lies on. If it is on the "close" side of every face of a cell, it lies in the Voronoi cell of the stored pattern, and receives its label.

There are three general algorithmic techniques for reducing the computational burden in nearest-neighbor searches: computing partial distances, prestructuring, and editing the stored prototypes. In partial distance, we calculate the distance using some subset $r$ of the full $d$ dimensions, and if this partial distance is too great we do not compute further. The partial distance based on $r$ selected dimensions is

$D_r(a, b) = \left(\sum_{k=1}^{r} (a_k - b_k)^2\right)^{1/2},$ (56)

where $r < d$. Intuitively speaking, partial distance methods assume that what we know about the distance in a subspace is indicative of the full space.
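Equation 56 suggests an economy (a sketch of mine, with made-up prototypes): accumulate squared coordinate differences dimension by dimension, abandoning a prototype as soon as the partial sum already exceeds the best full distance found so far.

```python
# Partial-distance nearest-neighbor search (Eq. 56): stop accumulating
# squared differences once the partial sum exceeds the best distance so far.

def nearest_partial_distance(x, prototypes):
    best_idx, best_sq = None, float('inf')
    for idx, p in enumerate(prototypes):
        sq = 0.0
        for a, b in zip(x, p):
            sq += (a - b) ** 2
            if sq > best_sq:        # partial distance already too large
                break
        else:
            best_idx, best_sq = idx, sq
    return best_idx, best_sq ** 0.5

protos = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (1.0, 0.0, 0.0)]
x = (0.9, 0.1, 0.0)
idx, dist = nearest_partial_distance(x, protos)
print(idx, dist)   # index 2 is nearest
```

Since the squared distance is monotone in the number of dimensions summed, this early termination never changes the answer; it only saves arithmetic.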
Of course, the partial distance is strictly nondecreasing as we add the contributions from more and more dimensions. Consequently, we can confidently terminate a distance calculation to any prototype once its partial distance is greater than the full $r = d$ Euclidean distance to the current closest prototype.

In prestructuring we create some form of search tree in which prototypes are selectively linked. During classification, we compute the distance of the test point to one or a few stored "entry" or "root" prototypes and then consider only the prototypes linked to it. Of these, we find the one that is closest to the test point, and recursively consider only subsequent linked prototypes. If the tree is properly structured, we will reduce the total number of prototypes that need to be searched.

Consider a trivial illustration of prestructuring in which we store a large number of prototypes that happen to be distributed uniformly in the unit square, i.e., $p(x) \sim U([0,1]^2)$. Imagine we prestructure this set using four entry or root prototypes -- at $(1/4, 1/4)$, $(1/4, 3/4)$, $(3/4, 1/4)$ and $(3/4, 3/4)$ -- each fully linked only to the points in its corresponding quadrant. When a test pattern $x$ is presented, the closest of these four prototypes is determined, and then the search is limited to the prototypes in the corresponding quadrant. In this way, 3/4 of the prototypes need never be queried. Note that in this method we are no longer guaranteed to find the closest prototype. For instance, suppose the test point is near a boundary of the quadrants, e.g., $x = (0.499, 0.499)^t$. In this particular case only prototypes in the first quadrant will be searched. Note however that the closest prototype might actually be in one of the other three quadrants, somewhere near $(0.5, 0.5)^t$. This illustrates a very general property in pattern recognition: the tradeoff of search complexity against accuracy.
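The quadrant example can be made concrete (a sketch of mine, with made-up prototypes), showing both the saved work and the occasional wrong answer near a quadrant boundary:

```python
# Prestructured search in the unit square: pick the nearest of four
# quadrant roots, then search only the prototypes stored in that quadrant.

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def quadrant(p):
    return (p[0] >= 0.5, p[1] >= 0.5)

def prestructured_nearest(x, prototypes):
    buckets = {}
    for p in prototypes:
        buckets.setdefault(quadrant(p), []).append(p)
    roots = {q: ((0.25, 0.75)[q[0]], (0.25, 0.75)[q[1]]) for q in buckets}
    q = min(roots, key=lambda q: sq_dist(x, roots[q]))
    return min(buckets[q], key=lambda p: sq_dist(x, p))

protos = [(0.1, 0.1), (0.45, 0.45), (0.52, 0.52), (0.9, 0.9)]
x = (0.499, 0.499)
approx = prestructured_nearest(x, protos)
exact = min(protos, key=lambda p: sq_dist(x, p))
print(approx, exact)   # approximate answer differs from the true nearest
```

Here only the first-quadrant bucket is searched, so the true nearest prototype just across the boundary is missed -- exactly the complexity-versus-accuracy tradeoff described above.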
More sophisticated search trees will have each stored prototype linked to a small number of others, and a full analysis of these methods would take us far afield. Nevertheless, here too, so long as we do not query all training prototypes, we are not guaranteed that the nearest prototype will be found.

The third method for reducing the complexity of nearest-neighbor search is to eliminate "useless" prototypes during training, a technique known variously as editing, pruning or condensing. A simple method to reduce the $O(n)$ space complexity is to eliminate prototypes that are surrounded by training points of the same category label. This leaves the decision boundaries -- and hence the error -- unchanged, while reducing recall times. A simple editing algorithm is as follows.

Algorithm 3 (Nearest-neighbor edit

nd $a_2$. The small red number in each image is the Euclidean distance between the tangent approximation and the image generated by the unapproximated transformations. Of course, this Euclidean distance is 0 for the prototype and for the cases $a_1 = 1, a_2 = 0$ and $a_1 = 0, a_2 = 1$. (The patterns generated with $a_1 + a_2 > 1$ have a gray background because of the automatic grayscale conversion of images with negative pixel values.)

The optimal $a$ can also be found by standard matrix methods, but these generally have higher computational complexities, as is explored in Problems 21 & 22. We note that the methods for editing and prestructuring data sets described in Sec. 4.5.5 can be applied to tangent distance classifiers too. Nearest-neighbor classifiers using tangent distance have been shown to be highly accurate, but they require the designer to know which invariances are relevant and to be able to perform them on each prototype. Some of the insights from the tangent approach can also be used for learning which invariances underlie the training data -- a topic we shall revisit in Chap. ??.
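The minimization behind the tangent distance -- the smallest Euclidean distance from $x$ to the affine tangent space $x' + Ta$ -- can be sketched with a least-squares solve (my own illustration with a made-up tangent matrix; the text notes that gradient descent and standard matrix methods both apply):

```python
# Tangent-distance sketch: D_tan(x', x) = min_a || (x' + T a) - x ||,
# solved here as the least-squares problem  T a ~ (x - x').
import numpy as np

def tangent_distance(x_prime, T, x):
    """T: d-by-r matrix whose columns are the tangent vectors at x_prime."""
    a, *_ = np.linalg.lstsq(T, x - x_prime, rcond=None)
    return np.linalg.norm(x_prime + T @ a - x)

x_prime = np.array([0.0, 0.0, 0.0])
T = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])          # tangent plane = the x-y plane
x = np.array([3.0, 4.0, 2.0])
print(tangent_distance(x_prime, T, x))   # distance to the plane = 2.0
```

Because the residual is a quadratic function of $a$ (the paraboloid in Fig. 4.22), the least-squares solution and gradient descent converge to the same minimizing $a$.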
Figure 4.22: A stored prototype $x'$, if transformed by combinations of two basic transformations, would fall somewhere on a complicated curved surface in the full $d$-dimensional space (gray). The tangent space at $x'$ is an $r$-dimensional Euclidean space, spanned by the tangent vectors (here $TV_1$ and $TV_2$). The tangent distance $D_{tan}(x', x)$ is the smallest Euclidean distance from $x$ to the tangent space of $x'$, shown by the solid red lines for two points, $x_1$ and $x_2$. Thus although the Euclidean distance from $x'$ to $x_1$ is less than to $x_2$, for the tangent distance the situation is reversed. The Euclidean distance from $x_2$ to the tangent space of $x'$ is a quadratic function of the parameter vector $a$, as shown by the pink paraboloid. Thus simple gradient descent methods can find the optimal vector $a$ and hence the tangent distance $D_{tan}(x', x_2)$.

4.7 Fuzzy Classification

Occasionally we may have informal knowledge about a problem domain in which we seek to build a classifier. For instance, we might feel, generally speaking, that an adult salmon is oblong and light in color, while a sea bass is stouter and dark. The approach taken in fuzzy classification is to create so-called "fuzzy category membership functions," which convert an objectively measurable parameter into a subjective "category membership," which is then used for classification. We must stress immediately that the term "categories" used by fuzzy practitioners refers not to the final classes as we have been discussing, but instead merely to overlapping ranges of feature values. For instance, if we consider the feature value of lightness, fuzzy practitioners might split this into five "categories" -- dark, medium-dark, medium, medium-light and light. In order to avoid misunderstandings, we shall use quotation marks when discussing such "categories." For example, we might have the lightness and shape of a fish be judged as in Fig. 4.23.
Next we need a way to convert an objective measurement of several features into a category decision about the fish, and for this we need a merging or conjunction rule -- a way to take the "category memberships" (e.g., of lightness and shape) and yield a number to be used for making the final decision.

Figure 4.23: "Category membership" functions, derived from the designer's prior knowledge, together with a conjunction rule, lead to discriminants. In this figure $x$ might represent an objectively measurable value such as the reflectivity of a fish's skin. The designer believes there are four relevant ranges, which might be called dark, medium-dark, medium-light and light. Note that the memberships are not in the true categories we wish to classify, but instead merely ranges of feature values.

Here fuzzy practitioners have at their disposal a large number of possible functions. Indeed, most functions can be used, and there are few principled criteria for preferring one over

ial way of obtaining polynomial discriminant functions. Before becoming too enthusiastic, however, we should note one of the problems with this approach. A key property of a useful window function is its tendency to peak at the origin and fade away elsewhere. Thus $\varphi((x - x_i)/h_n)$ should peak sharply at $x = x_i$, and contribute little to the approximation of $p_n(x)$ for $x$ far from $x_i$. Unfortunately, polynomials have the annoying property of becoming unbounded. Thus, in a polynomial expansion we might find the terms associated with an $x_i$ far from $x$ contributing most (rather than least) to the expansion. It is quite important, therefore, to be sure that the expansion of each window function is in fact accurate in the region of interest, and this may well require a large number of terms. There are many types of series expansions one might consider. Readers familiar with integral equations will naturally interpret Eq. 66 as an expansion of the kernel
$\varphi(x, x_i)$ in a series of eigenfunctions. (In analogy with eigenvectors and eigenvalues, eigenfunctions are solutions to certain differential equations with fixed real-number coefficients.) Rather than computing eigenfunctions, one might choose any reasonable set of functions orthogonal over the region of interest and obtain a least-squares fit to the window function. We shall take an even more straightforward approach and expand the window function in a Taylor series. For simplicity, we confine our attention to a one-dimensional example using a Gaussian window function:

$\varphi(u) = e^{-u^2} \simeq \sum_{j=0}^{m-1} (-1)^j \frac{u^{2j}}{j!}.$

This expansion is most accurate near $u = 0$, and is in error by less than $u^{2m}/m!$. If we substitute $u = (x - x_i)/h$, we obtain a polynomial of degree $2(m-1)$ in $x$ and $x_i$. For example, if $m = 2$ the window function can be approximated as

$\varphi\!\left(\frac{x - x_i}{h}\right) \simeq 1 - \left(\frac{x - x_i}{h}\right)^2 = 1 + \frac{2}{h^2}\, x\, x_i - \frac{1}{h^2}\, x^2 - \frac{1}{h^2}\, x_i^2,$

and thus

$p_n(x) = \frac{1}{nh} \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h}\right) \simeq b_0 + b_1 x + b_2 x^2,$ (70)

where the coefficients are

$b_0 = \frac{1}{h} - \frac{1}{h^3 n} \sum_{i=1}^{n} x_i^2, \qquad b_1 = \frac{2}{h^3 n} \sum_{i=1}^{n} x_i, \qquad b_2 = -\frac{1}{h^3}.$

This simple expansion condenses the information in $n$ samples into the values $b_0$, $b_1$, and $b_2$. It is accurate if the largest value of $|x - x_i|$ is not greater than $h$. Unfortunately, this restricts us to a very wide window that is not capable of much resolution. By taking more terms we can use a narrower window. If we let $r$ be the largest value of $|x - x_i|$ and use the fact that the error in the $m$-term expansion of $\varphi((x - x_i)/h)$ is less than $(r/h)^{2m}/m!$, then using Stirling's approximation for $m!$ we find that the error in approximating $p_n(x)$ is less than

$\frac{1}{h}\, \frac{(r/h)^{2m}}{m!} \simeq \frac{1}{h\sqrt{2\pi m}} \left(\frac{e}{m}\left(\frac{r}{h}\right)^2\right)^m.$ (71)

Thus, the error becomes small only when $m > e(r/h)^2$. This implies the need for many terms if the window size $h$ is small relative to the distance $r$ from $x$ to the most distant sample.
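The $m = 2$ case can be checked numerically (my own sketch, with made-up samples): the three coefficients of Eq. 70 reproduce the truncated-window Parzen estimate exactly, since the two expressions are the same polynomial rearranged.

```python
# Series-expansion Parzen estimate (Eq. 70): condense n samples into
# b0, b1, b2 and compare with the direct truncated-window sum.

def expansion_coeffs(samples, h):
    n = len(samples)
    b0 = 1.0 / h - sum(x * x for x in samples) / (h**3 * n)
    b1 = 2.0 * sum(samples) / (h**3 * n)
    b2 = -1.0 / h**3
    return b0, b1, b2

def truncated_estimate(x, samples, h):
    # (1/(n h)) * sum_i [1 - ((x - x_i)/h)^2], the m = 2 approximation.
    n = len(samples)
    return sum(1.0 - ((x - xi) / h) ** 2 for xi in samples) / (n * h)

samples = [0.1, 0.4, 0.5]
h = 2.0
b0, b1, b2 = expansion_coeffs(samples, h)
x = 0.3
print(b0 + b1 * x + b2 * x * x, truncated_estimate(x, samples, h))
```

Note that only three numbers summarize the whole sample set here -- the data-reduction benefit -- at the cost of the wide-window restriction discussed above.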
Although this example is rudimentary, similar considerations arise in the multidimensional case even when more sophisticated expansions are used, and the procedure is most attractive when the window size is relatively large.

4.10 Fisher Linear Discriminant

One of the recurring problems encountered in applying statistical techniques to pattern recognition has been called the "curse of dimensionality." Procedures that are analytically or computationally manageable in low-dimensional spaces can become completely impractical in a space of 50 or 100 dimensions. Pure fuzzy methods are particularly ill-suited to such high-dimensional problems, since it is implausible that the designer's linguistic intuition extends to such spaces. Thus, various techniques have been developed for reducing the dimensionality of the feature space in the hope of obtaining a more manageable problem.

We can reduce the dimensionality from $d$ dimensions to one dimension if we merely project the $d$-dimensional data onto a line. Of course, even if the samples formed well-separated, compact clusters in $d$-space, projection onto

unchanged. If we have very little data, we would tend to project to a subspace of low dimension, while if there is more data, we can use a higher dimension, as we shall explore in Chap. ??. Once we have projected the distributions onto the optimal subspace (defined as above), we can use the methods of Chap. ?? to create our full classifier. As in the two-class case, multiple discriminant analysis primarily provides a reasonable way of reducing the dimensionality of the problem. Parametric or nonparametric techniques that might not have been feasible in the original space may work well in the lower-dimensional space. In particular, it may be possible to estimate separate covariance matrices for each class and use the general multivariate normal assumption after the transformation where this could not be done with the original data.
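For the two-class case, the direction the Fisher criterion selects is $w = S_W^{-1}(m_1 - m_2)$, where $S_W$ is the within-class scatter matrix. A minimal sketch (my own illustration with made-up data, not an example from the text):

```python
# Fisher linear discriminant for two classes in 2-D:
# w = Sw^{-1} (m1 - m2), with Sw the within-class scatter matrix.
import numpy as np

def fisher_direction(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)    # scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)    # scatter of class 2
    return np.linalg.solve(S1 + S2, m1 - m2)

X1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])
X2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
w = fisher_direction(X1, X2)
# Projections of the two classes onto w should be well separated.
print(X1 @ w, X2 @ w)
```

After projecting onto $w$, any of the one-dimensional techniques of this chapter can be applied to the projected samples.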
In general, if the transformation causes some unnecessary overlapping of the data and increases the theoretically achievable error rate, then the problem of classifying the data still remains. However, there are other ways to reduce the dimensionality of data, and we shall encounter this subject again in Chap. ??. We note that there are also alternate methods of discriminant analysis -- such as the selection of features based on statistical significance -- some of which are given in the references for this chapter. Of these, Fisher's method remains a fundamental and widely used technique.

Summary

There are two overarching approaches to nonparametric estimation for pattern classification: in one the densities are estimated (and then used for classification); in the other the category is chosen directly. The former approach is exemplified by Parzen windows and their hardware implementation, probabilistic neural networks. The latter is exemplified by k-nearest-neighbor and several forms of relaxation networks. In the limit of infinite training data, the nearest-neighbor error rate is bounded from above by twice the Bayes error rate. The extremely high space complexity of the nominal nearest-neighbor method can be reduced by editing (e.g., removing those prototypes that are surrounded by prototypes of the same category), prestructuring the data set for efficient search, or partial distance calculations. Novel distance measures, such as the tangent distance, can be used in the nearest-neighbor algorithm for incorporating known transformation invariances. Fuzzy classification methods employ heuristic choices of "category membership" and heuristic conjunction rules to obtain discriminant functions. Any benefit of such techniques is limited to cases where there is very little (or no) training data, small numbers of features, and when the knowledge can be gleaned from the designer's prior knowledge.
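A minimal sketch of the partial-distance idea mentioned above (the prototype values are made up): the running squared distance to a prototype is abandoned as soon as it exceeds the best distance found so far, so most prototypes are rejected after examining only a few coordinates.

```python
import numpy as np

def nn_partial_distance(x, prototypes):
    """Nearest neighbor with partial-distance pruning: stop accumulating
    a squared distance as soon as it exceeds the best found so far."""
    best_i, best_d = -1, float("inf")
    for i, p in enumerate(prototypes):
        d = 0.0
        for xk, pk in zip(x, p):
            d += (xk - pk) ** 2
            if d >= best_d:        # prune: cannot beat the current best
                break
        else:                       # loop ran to completion: new best
            best_i, best_d = i, d
    return best_i

protos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(nn_partial_distance(np.array([0.9, 1.2]), protos))
```

Because distances are compared in squared form, no square roots are needed, and the pruning never changes the answer -- it only skips arithmetic that could not alter it.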
Relaxation methods such as potential functions create "basins of attraction" surrounding training prototypes; when a test pattern lies in such a basin, the corresponding prototype can be easily identified along with its category label. Reduced Coulomb energy networks are one member of this class of relaxation networks; the basins are adjusted to be as large as possible yet not include prototypes from other categories. The Fisher linear discriminant finds a good subspace in which categories are best separated; other techniques can then be applied in the subspace. Fisher's method can be extended to cases with multiple categories projected onto subspaces of higher dimension than a line.

Bibliographical and Historical Remarks

Parzen introduced his window method for estimating density functions [32], and its use in regression was pioneered by Nadaraya and Watson [?, ?]. Its natural application to classification problems stems from the work of Specht [39], including its PNN hardware implementation [40]. Nearest-neighbor methods were first introduced by [16, 17], but it was over fifteen years later that computer power had increased, thereby making the method practical and renewing interest in its theoretical foundations. Cover and Hart's foundational work on asymptotic bounds [10] was expanded somewhat

s are all large and roughly equal in R^d, and that neighborhoods that have even just a few points must have large radii.

(c) Find l_d(p), the length of a hypercube edge in d dimensions that contains the fraction p of points (0 ≤ p ≤ 1). To better appreciate the implications of your result, calculate: l_5(0.01), l_5(0.1), l_20(0.01), and l_20(0.1).

(d) Show that nearly all points are close to an edge of the full space (e.g., the unit hypercube in d dimensions). Do this by calculating the L∞ distance from one point to the closest other point. This shows that nearly all points are closer to an edge than to another training point.
(Argue that the L∞ distance is more favorable than the L2 distance here, even though it is easier to calculate.) The result shows that most points are on or near the convex hull of the training samples and that nearly every point is an "outlier" with respect to all the others.

12. Show how the "curse of dimensionality" (Problem 11) can be "overcome" by choosing or assuming that your model is of a particular sort. Suppose that we are estimating a function of the form y = f(x) + ε, where ε ~ N(0, σ^2).

(a) Suppose the true function is linear, f(x) = Σ_{j=1}^d a_j x_j, and that the approximation is f̂(x) = Σ_{j=1}^d â_j x_j. Of course, the fit coefficients are

â_j = arg min_{â_j} Σ_{i=1}^n (y_i - Σ_{j=1}^d â_j x_{ij})^2,

for j = 1, ..., d. Prove that E[(f(x) - f̂(x))^2] = dσ^2/n, i.e., that the error increases linearly with d, and not exponentially as the curse of dimensionality might otherwise suggest.

(b) Generalize your result from part (a) to the case where the function is expressed in a different basis set, i.e., f(x) = Σ_{i=1}^d a_i B_i(x) for some well-behaved basis set B_i(x), and hence show that the result does not depend on the fact that we have used a linear basis.

13. Consider classifiers based on samples from the distributions

p(x|ω_1) = 2x for 0 ≤ x ≤ 1, 0 otherwise,

and

p(x|ω_2) = 2 - 2x for 0 ≤ x ≤ 1, 0 otherwise.

(a) What is the Bayes decision rule and the Bayes classification error?

(b) Suppose we randomly select a single point from ω_1 and a single point from ω_2, and create a nearest-neighbor classifier. Suppose too we select a test point from one of the categories (ω_1 for definiteness). Integrate to find the expected error rate P_1(e).

(c) Repeat with two training samples from each category and a single test point in order to find P_2(e).

(d) Generalize to find the arbitrary P_n(e).

(e) Compare lim_{n→∞} P_n(e) with the Bayes error.

14. Repeat Problem 13 but with

p(x|ω_1) = 3/2 for 0 ≤ x ≤ 2/3, 0 otherwise,

and

p(x|ω_2) = 3/2 for 1/3 ≤ x ≤ 1, 0 otherwise.

15.
Expand Algorithm 3 in greater detail and add a conditional branch that will speed it. Assuming the data points come from c categories and there are, on average, k Voronoi neighbors of any point x, on average how much faster will your improved algorithm be?

16. Consider the simple nearest-neighbor editing algorithm (Algorithm 3).

(a) Show by counterexample that this algorithm does not yield the minimum set of points. (Hint: consider a problem where the points from each of two categories are constrained to lie on the intersections of a two-dimensional Cartesian grid.)

(b) Create a sequential editing algorithm, in which each point is considered in turn, and retained or rejected before the next point is considered. Prove that your algorithm does or does not depend upon the sequence in which the points are considered.

17. Consider a classification problem where each of the c categories possesses the same distribution as well as prior P(ω_i) = 1/c. Prove that the upper bound in Eq. 53, i.e.,

P ≤ P*(2 - (c/(c - 1)) P*),

is achieved in this "zero-information" case.

18. Derive Eq. 55.

Section 4.6

19. Consider the Euclidean metric in d dimensions:

D(a, b) = (Σ_{k=1}^d (a_k - b_k)^2)^{1/2}.

Suppose we rescale each axis by a fixed factor, i.e., let x'_k = α_k x_k for real, non-zero constants α_k, k = 1, 2, ..., d. Prove that the resulting space is a metric space. Discuss the import of this fact for standard nearest-neighbor classification methods.

20. Prove that the Minkowski me

of the properties of density estimation in the following way.

(Table of three-dimensional points sampled from three categories; only a fragment of the x_3 column -- 0.14, -0.38, 0.69, 1.31, 0.87, 1.35, 0.92, 0.97, 0.99, 0.88 -- survives in this excerpt.)

(a) Write a program to generate points according to a uniform distribution in a unit cube, -1/2 ≤ x_i ≤ 1/2 for i = 1, 2, 3. Generate 10^4 such points.

(b) Write a program to estimate the density at the origin based on your 10^4 points as a function of the size of a cubical window function of size h. Plot your estimate as a function of h, for 0 < h ≤ 1.
(c) Evaluate the density at the origin using n of your points and the volume of a cube window which just encloses those n points. Plot your estimate as a function of n = 1, ..., 10^4.

(d) Write a program to generate 10^4 points from a spherical Gaussian density (with Σ = I) centered on the origin. Repeat (b) & (c) with your Gaussian data.

(e) Discuss any qualitative differences between the functional dependencies of your estimation results for the uniform and Gaussian densities.

Section 4.3

2. Consider Parzen-window estimates and classifiers for the points in the table above. Let your window function be a spherical Gaussian, i.e.,

φ((x - x_i)/h) ∝ exp[-(x - x_i)^t (x - x_i)/(2h^2)].

(a) Write a program to classify an arbitrary test point x based on the Parzen-window estimates. Train your classifier using the three-dimensional data from your three categories in the table above. Set h = 1 and classify the following three points: (0.50, 1.0, 0.0)^t, (0.31, 1.51, -0.50)^t and (-0.3, 0.44, -0.1)^t.

(b) Repeat with h = 0.1.

Section 4.4

3. Consider k-nearest-neighbor density estimations in different numbers of dimensions.

(a) Write a program to find the k-nearest-neighbor density estimate for n (unordered) points in one dimension. Use your program to plot such a density estimate for the x_1 values in category ω_3 in the table above, for k = 1, 3 and 5.

(b) Write a program to find the k-nearest-neighbor density estimate for n points in two dimensions. Use your program to plot such a density estimate for the x_1-x_2 values in ω_2, for k = 1, 3 and 5.

(c) Write a program to form a k-nearest-neighbor classifier for the three-dimensional data from the three categories in the table above. Use your program with k = 1, 3 and 5 to estimate the relative densities at the following points: (-0.41, 0.82, 0.88)^t, (0.14, 0.72, 4.1)^t and (-0.81, 0.61, -0.38)^t.

Section 4.5

4. Write a program to create a Voronoi tessellation in two dimensions as follows.
(a) First derive analytically the equation of a line separating two arbitrary points.

(b) Given the full data set D of prototypes and a particular point x ∈ D, write a program to create a list of the line segments comprising the Voronoi cell of x.

(c) Use your program to form the Voronoi tessellation of the x_1-x_2 features from the data of ω_1 and ω_3 in the table above. Plot your Voronoi diagram.

(d) Write a program to find the category decision boundary based on this full set D.

(e) Implement a version of the pruning method described in Algorithm 3. Prune your data set from (c) to form a condensed set.

(f) Apply your programs from (c) & (d) to form the Voronoi tessellation and boundary for your condensed data set. Compare the decision boundaries you found for the full and the condensed sets.

5. Explore the tradeoff between computational complexity (as it relates to partial distance calculations) and search accuracy in nearest-neighbor classifiers in the following exercise.

(a) Write a program to generate n prototypes from a uniform distribution in a 6-dimensional hypercube centered on the origin. Use your program to generate 10^6 points for category ω_1, 10^6 different points for category ω_2, and likewise for ω_3 and ω_4. Denote this full set D.

(b) Use your program to generate a test set D_t of n = 100 points, also uniformly distributed in the 6-dimensional hypercube.

(c) Write a program to implement the nearest-neighbor algorithm. Use this program to label each of your points in D_t by the category of its nearest neighbor in D.
From now on we will assume that the labels you find are

Contents

5 Linear Discriminant Functions
5.1 Introduction
5.2 Linear Discriminant Functions and Decision Surfaces
5.2.1 The Two-Category Case
5.2.2 The Multicategory Case
5.3 Generalized Linear Discriminant Functions
5.4 The Two-Category Linearly-Separable Case
5.4.1 Geometry and Terminology
5.4.2 Gradient Descent Procedures
Algorithm 1: Gradient descent
Algorithm 2: Newton descent
5.5 Minimizing the Perceptron Criterion Function
5.5.1 The Perceptron Criterion Function
Algorithm 3: Batch Perceptron
5.5.2 Convergence Proof for Single-Sample Correction
Algorithm 4: Fixed increment descent
5.5.3 Some Direct Generalizations
Algorithm 5: Fixed increment descent
Algorithm 6: Batch variable increment Perceptron
Algorithm 7: Balanced Winnow algorithm
5.6 Relaxation Procedures
5.6.1 The Descent Algorithm
Algorithm 8: Relaxation training with margin
Algorithm 9: Relaxation rule
5.6.2 Convergence Proof
5.7 Nonseparable Behavior
5.8 Minimum Squared Error Procedures
5.8.1 Minimum Squared Error and the Pseudoinverse
Example 1: Constructing a linear classifier by matrix pseudoinverse
5.8.2 Relation to Fisher's Linear Discriminant
5.8.3 Asymptotic Approximation to an Optimal Discriminant
5.8.4 The Widrow-Hoff Procedure
Algorithm 10: LMS algorithm
5.8.5 Stochastic Approximation Methods
5.9 *The Ho-Kashyap Procedures
5.9.1 The Descent Procedure
Algorithm 11: Ho-Kashyap
5.9.2 Convergence Proof
5.9.3 Nonseparable Behavior
5.9.4 Some Related Procedures
Algorithm 12: Modified Ho-Kashyap
5.10 *Linear Programming Algorithms
5.10.1 Linear Programming
5.10.2 The Linearly Separable Case
5.10.3 Minimizing the Perceptron Criterion Function
5.11 *Support Vector Machines
5.11.1 SVM training
Example 2: SVM for the XOR problem
5.12 Multicategory Generalizations
5.12.1 Kesler's Construction
5.12.2 Convergence of the Fixed-Increment Rule
5.12.3 Generalizations for MSE Procedures
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Index

Chapter 5

Linear Discriminant Functions

5.1 Introduction

In Chap. ?? we assumed that the forms for the underlying probability densities were known, and used the training samples to estimate the values of their parameters. In this chapter we shall instead assume we know the proper forms for the discriminant functions, and use the samples to estimate the values of the parameters of the classifier. We shall examine various procedures for determining discriminant functions, some of which are statistical and some of which are not. None of them, however, requires knowledge of the forms of the underlying probability distributions, and in this limited sense they can be said to be nonparametric.
Throughout this chapter we shall be concerned with discriminant functions that are either linear in the components of x, or linear in some given set of functions of x. Linear discriminant functions have a variety of pleasant analytical properties. As we have seen in Chap. ??, they can be optimal if the underlying distributions are cooperative, such as Gaussians having equal covariance, as might be obtained through an intelligent choice of feature detectors. Even when they are not optimal, we might be willing to sacrifice some performance in order to gain the advantage of their simplicity. Linear discriminant functions are relatively easy to compute, and in the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers. They also illustrate a number of very important principles which will be used more fully in neural networks (Chap. ??). The problem of finding a linear discriminant function will be formulated as a problem of minimizing a criterion function. The obvious criterion function for classification purposes is the sample risk, or training error -- the average loss incurred in classifying the set of training samples. We must emphasize right away, however, that despite the attractiveness of this criterion, it is fraught with problems. While our goal will be to classify novel test patterns, a small training error does not guarantee a small test error -- a fascinating and subtle problem that will command our attention in Chap. ??. As we shall see here, it is difficult to derive the minimum-risk linear discriminant anyway, and for that reason we investigate several related criterion functions that are analytically more tractable. Much of our attention will be devoted to studying the convergence properties and computational complexities of various gradient descent procedures for minimizing criterion functions.
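The generic descent loop studied throughout the chapter can be sketched as follows; the quadratic criterion, learning rate, and stopping threshold below are placeholder choices, not values from the text.

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=1000):
    """Basic gradient descent: step opposite the gradient of the
    criterion J until the update falls below the threshold theta."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        a = a - step
        if np.linalg.norm(step) < theta:
            break
    return a

# Toy quadratic criterion J(a) = ||a - a_star||^2 with known minimum.
a_star = np.array([1.0, -2.0])
a_min = gradient_descent(lambda a: 2.0 * (a - a_star), [0.0, 0.0])
print(a_min)
```

The procedures of this chapter differ mainly in the criterion J being minimized and in how the learning rate and update set are chosen, not in this outer loop.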
The similarities between many of the procedures sometimes make it difficult to keep the differences between them clear, and for this reason we have included a summary of the principal results in Table 5.1 at the end of Sect. 5.10.

5.2 Linear Discriminant Functions and Decision Surfaces

5.2.1 The Two-Category Case

A discriminant function that is a linear combination of the components of x can be written as

g(x) = w^t x + w_0,    (1)

where w is the weight vector and w_0 the bias or threshold weight. A two-category linear classifier implements the following decision rule: decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0. Thus, x is assigned to ω_1 if the inner product w^t x exceeds the threshold -w_0, and to ω_2 otherwise. If g(x) = 0, x can ordinarily be assigned to either class, but in this chapter we shall leave the assignment undefined. Figure 5.1 shows a typical implementation, a clear example of the general structure of a pattern recognition system we saw in Chap. ??.

Figure 5.1: A simple linear classifier having d input units, each corresponding to the values of the components of an input vector. Each input feature value x_i is multiplied by its corresponding weight w_i; the output unit sums all these products and emits a +1 if w^t x + w_0 > 0 or a -1 otherwise.

The equation g(x) = 0 defines the decision surface that separates points assigned to ω_1 from points assigned to ω_2. When g(x) is linear, this decision surface is a hyperplane. If x_1 and x_2 are both on the decision surface, then

w^t x_1 + w_0 = w^t x_2 + w_0, or w^t (x_1 - x_2) = 0,

and this shows that w is normal to any vector lying in the hyperplane. In general, the hyperplane H divides the feature space into two half-spaces: decision region R_1 for ω_1 and region R_2 for ω_2. Since g(x) > 0 if x is in R_1, it follows that the normal vector w points into R_1.
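The two-category rule just described can be sketched directly; the weight values below are arbitrary illustrations.

```python
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (hypothetical values)
w0 = -0.5                   # bias, or threshold weight

def g(x):
    """Linear discriminant g(x) = w^t x + w_0 (Eq. 1)."""
    return w @ x + w0

def decide(x):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0; ties undefined."""
    value = g(x)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None   # on the decision surface: leave unassigned

x = np.array([1.0, 0.5])
# g(x)/||w|| is the signed distance r from x to the hyperplane H.
print(decide(x), g(x) / np.linalg.norm(w))
```

Note that scaling w and w_0 by a common positive factor leaves the decision rule unchanged; only the orientation and location of the hyperplane matter.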
It is sometimes said that any x in R_1 is on the positive side of H, and any x in R_2 is on the negative side. The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane. Perhaps the easiest way to see this is to express x as

x = x_p + r w/||w||,

where x_p is the normal projection of x onto H, and r is the desired algebraic distance -- positive if x is on the positive side and negative if x is on the negative side. Then, since g(x_p) = 0,

g(x) = w^t x + w_0 = r ||w||, or r = g(x)/||w||.

In particular, the distance from the origin to H is given by w_0/||w||. If w_0 > 0 the origin is on the positive side of H, and if w_0 < 0 it is on the negative side. If w_0 = 0, then g(x) has the homogeneous form w^t x, and the hyperplane passes through the origin. A geometric illustration of these algebraic results is given in Fig. 5.2.

Figure 5.2: The linear decision boundary H, where g(x) = w^t x + w_0 = 0, separates the feature space into two half-spaces R_1 (where g(x) > 0) and R_2 (where g(x) < 0).

To summarize, a linear discriminant function divides the feature space by a hyperplane decision surface. The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w_0. The discriminant function g(x) is proportional to the signed distance from x to the hyperplane, with g(x) > 0 when x is on the positive side, and g(x) < 0 when x is on the negative side.

5.2.2 The Multicategory Case

There is more than one way to devise multicategory classifiers employing linear discriminant functions. For example, we might reduce the problem to c - 1 two-class problems, where the ith problem is solved by a linear discriminant function that separates points assigned to ω_i from those not assigned to ω_i. A more extravagant approach would be to use c(c - 1)/2 linear discriminants, one for every pair of classes. As illustrated in Fig.
5.3, both of these approaches can lead to regions in which the classification is undefined. We shall avoid this problem by adopting the approach taken in Chap. ??, defining c linear discriminant functions

g_i(x) = w_i^t x + w_{i0},    i = 1, ..., c,    (2)

and assigning x to ω_i if g_i(x) > g_j(x) for all j ≠ i; in case of ties, the classification is left undefined. The resulting classifier is called a linear machine. A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant if x is in R_i.

The mapping from x to y reduces the problem to one of finding a homogeneous linear discriminant function. Some of the advantages and disadvantages of this approach can be clarified by considering a simple example. Let the quadratic discriminant function be

g(x) = a_1 + a_2 x + a_3 x^2,    (7)

so that the three-dimensional vector y is given by

y = (1, x, x^2)^t.    (8)

The mapping from x to y is illustrated in Fig. 5.5. The data remain inherently one-dimensional, since varying x causes y to trace out a curve in three dimensions. Thus, one thing to notice immediately is that if x is governed by a probability law p(x), the induced density p̂(y) will be degenerate, being zero everywhere except on the curve, where it is infinite. This is a common problem whenever d̂ > d, and the mapping takes points from a lower-dimensional space to a higher-dimensional space.

The plane Ĥ defined by â^t y = 0 divides the y-space into two decision regions R̂_1 and R̂_2. Figure ?? shows the separating plane corresponding to a = (-1, 1, 2)^t, the decision regions R̂_1 and R̂_2, and their corresponding decision regions R_1 and R_2 in the original x-space. The quadratic discriminant function g(x) = -1 + x + 2x^2 is positive if x < -1 or if x > 0.5, and thus R_1 is multiply connected.

Figure 5.5: The mapping y = (1, x, x^2)^t takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting y-space into regions corresponding to two categories, and this in turn gives a non-simply connected decision region in the one-dimensional x-space.

Thus although the decision regions in y-space are convex, this is by no means the case in x-space. More generally speaking, even with relatively simple functions y_i(x), decision surfaces induced in an x-space can be fairly complex (Fig. 5.6). Unfortunately, the curse of dimensionality often makes it hard to capitalize on this flexibility in practice. A complete quadratic discriminant function involves d̂ = (d + 1)(d + 2)/2 terms. If d is modestly large, say d = 50, this requires the computation of a great many terms; inclusion of cubic and higher orders leads to O(d^3) terms. Furthermore, the d̂ components of the weight vector a must be determined from training samples. If we think of d̂ as specifying the number of degrees of freedom for the discriminant function, it is natural to require that the number of samples be not less than the number of degrees of freedom (cf., Chap. ??). Clearly, a general series expansion of g(x) can easily lead to completely unrealistic requirements for computation and data. We shall see in Sect. ?? that this drawback can be accommodated by imposing a constraint of large margins, or bands between the training patterns, however. In this case, we are not technically speaking fitting all the free parameters; instead, we are relying on the assumption that the mapping to a high-dimensional space does not impose any spurious structure or relationships among the training points. Alternatively, multilayer neural networks approach this problem by employing multiple copies of a single nonlinear function of the input features, as we shall see in Chap. ??.
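The quadratic example above, with a = (-1, 1, 2)^t, is easy to verify numerically; a minimal sketch:

```python
import numpy as np

# The quadratic discriminant of the example: a = (-1, 1, 2)^t acting on
# y = (1, x, x^2)^t, i.e. g(x) = -1 + x + 2 x^2.
a = np.array([-1.0, 1.0, 2.0])

def y_of(x):
    """Map the scalar x to the three-dimensional vector y (Eq. 8)."""
    return np.array([1.0, x, x * x])

def g(x):
    """Homogeneous form g(x) = a^t y."""
    return a @ y_of(x)

# R_1 (g > 0) is multiply connected in x-space: x < -1 or x > 0.5.
for x in (-2.0, -1.0, 0.0, 0.5, 1.0):
    print(x, g(x))
```

The sign pattern of the printed values shows the two disjoint pieces of R_1 on either side of the interval [-1, 0.5], with the boundary points x = -1 and x = 0.5 lying exactly on the separating plane in y-space.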
While it may be hard to realize the potential benefits of a generalized linear discriminant function, we can at least exploit the convenience of being able to write g(x) in the homogeneous form a^t y.

Figure 5.6: The two-dimensional input space x is mapped through a polynomial function f to y. Here the mapping is y_1 = x_1, y_2 = x_2 and y_3 = x_1 x_2. A linear discriminant in this transformed space is a hyperplane, which cuts the surface. Points on the positive side of the hyperplane Ĥ correspond to category ω_1, and those beneath it to ω_2. Here, in terms of the x-space, R_1 is not simply connected.

In the particular case of the linear discriminant function

g(x) = w_0 + Σ_{i=1}^d w_i x_i = Σ_{i=0}^d w_i x_i,    (9)

where we set x_0 = 1, we can write

y = (1, x_1, ..., x_d)^t = (1, x^t)^t,    (10)

and y is sometimes called an augmented feature vector. Likewise, an augmented weight vector can be written as

a = (w_0, w_1, ..., w_d)^t = (w_0, w^t)^t.    (11)

This mapping from d-dimensional x-space to (d + 1)-dimensional y-space is mathematically trivial but nonetheless quite convenient. The addition of a constant component to x preserves all distance relationships among samples. The resulting y vectors all lie in a d-dimensional subspace, which is the x-space itself. The hyperplane decision surface Ĥ defined by a^t y = 0 passes through the origin in y-space, even though the corresponding hyperplane H can be in any position in x-space. The distance from y to Ĥ is given by |a^t y|/||a||, or |g(x)|/||a||. Since ||a|| ≥ ||w||, this distance is less than, or at most equal to, the distance from x to H. By using this mapping we reduce the problem of finding a weight vector w and a threshold weight w_0 to the problem of finding a single weight vector a (Fig. 5.7).
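A sketch of the augmentation (sample and weight values are arbitrary illustrations): a^t y reproduces w^t x + w_0 for every sample, so the bias disappears into the weight vector.

```python
import numpy as np

def augment(X):
    """Map each d-dim sample x to the (d+1)-dim y = (1, x)^t (Eq. 10)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[1.0, 2.0],
              [0.5, -1.0]])
Y = augment(X)

a = np.array([-0.5, 2.0, -1.0])   # a = (w0, w)^t, hypothetical values
w0, w = a[0], a[1:]

# g(x) = w^t x + w0 equals a^t y for every sample.
print(Y @ a)
print(X @ w + w0)
```

With the augmentation in place, the classifier has only the single parameter vector a to learn, and the decision hyperplane in y-space always passes through the origin.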
Figure 5.7: A three-dimensional augmented feature space y and augmented weight vector a (at the origin). The set of points for which a^t y = 0 is a plane (or more generally, a hyperplane) perpendicular to a and passing through the origin of y-space, as indicated by the red disk. Such a plane need not pass through the origin of the two-dimensional x-space at the top, of course, as shown by the dashed line. Thus there exists an augmented weight vector a that will lead to any straight decision line in x-space.

5.4 The Two-Category Linearly-Separable Case

5.4.1 Geometry and Terminology

Suppose now that we have a set of n samples y_1, ..., y_n, some labelled ω_1 and some labelled ω_2. We want to use these samples to determine the weights a in a linear discriminant function g(x) = a^t y. Suppose we have reason to believe that there exists a solution for which the probability of error is very low. Then a reasonable approach is to look for a weight vector that classifies all of the samples correctly. If such a weight vector exists, the samples are said to be linearly separable. A sample y_i is classified correctly if a^t y_i > 0 and y_i is labelled ω_1, or if a^t y_i < 0 and y_i is labelled ω_2. This suggests a "normalization" that simplifies the treatment of the two-category case, viz., the replacement of all samples labelled ω_2 by their negatives. With this "normalization" we can forget the labels and look for a weight vector a such that a^t y_i > 0 for all of the samples. Such a weight vector is called a separating vector or, more generally, a solution vector.

The weight vector a can be thought of as specifying a point in weight space. Each sample y_i places a constraint on the possible location of a solution vector. The equation a^t y_i = 0 defines a hyperplane through the origin of weight space having y_i as a normal vector. The solution vector -- if it exists -- must be on the positive side of every hyperplane. Thus, a solution vector must lie in the intersection of n half-spaces; indeed, any vector in this region is a solution vector. The corresponding region is called the solution region, and should not be confused with the decision region in feature space corresponding to any particular category. A two-dimensional example illustrating the solution region for both the normalized and the unnormalized case is shown in Fig. 5.8.

Figure 5.8: Four training samples (black for ω_1, red for ω_2) and the solution region in feature space. The figure on the left shows the raw data; the solution vector leads to a plane that separates the patterns from the two categories. In the figure on the right, the red points have been "normalized" -- i.e., changed in sign. Now the solution vector leads to a plane through the origin with all of the patterns on its positive side.

Thus, the batch Perceptron algorithm for finding a solution vector can be stated very simply: the next weight vector is obtained by adding some multiple of the sum of the misclassified samples to the present weight vector. We use the term "batch" to refer to the fact that (in general) a large group of samples is used when computing each weight update. (We shall soon see alternate methods based on single samples.) Figure 5.12 shows how this algorithm yields a solution vector for a simple two-dimensional example with a(1) = 0 and η(k) = 1. We shall now show that it will yield a solution for any linearly separable problem.

Figure 5.12: The Perceptron criterion J_p is plotted as a function of the weights a_1 and a_2 for a three-pattern problem. The weight vector begins at 0, and the algorithm sequentially adds to it vectors equal to the "normalized" misclassified patterns themselves.
In the example shown, this sequence is y₂, y₃, y₁, y₃, at which time the vector lies in the solution region and iteration terminates. Note that the second update (by y₃) takes the candidate vector farther from the solution region than after the first update (cf. Theorem 5.1). (In an alternate, batch method, all the misclassified points are added at each iteration step, leading to a smoother trajectory in weight space.)

5.5.2 Convergence Proof for Single-Sample Correction

We shall begin our examination of convergence properties of the Perceptron algorithm with a variant that is easier to analyze. Rather than testing a(k) on all of the samples and basing our correction on the set Yₖ of misclassified training samples, we shall consider the samples in a sequence and shall modify the weight vector whenever it misclassifies a single sample. For the purposes of the convergence proof, the detailed nature of the sequence is unimportant as long as every sample appears in the sequence infinitely often. The simplest way to assure this is to repeat the samples cyclically, though from a practical point of view random selection is often to be preferred (Sec. 5.8.5). Clearly neither the batch nor this single-sample version of the Perceptron algorithm is on-line, since we must store and potentially revisit all of the training patterns.

Two further simplifications help to clarify the exposition. First, we shall temporarily restrict our attention to the case in which η(k) is constant -- the so-called fixed-increment case. It is clear from Eq. 18 that if η(k) is constant it merely serves to scale the samples; thus, in the fixed-increment case we can take η(k) = 1 with no loss in generality. The second simplification merely involves notation. When the samples are considered sequentially, some will be misclassified.
Since we shall only change the weight vector when there is an error, we really need only pay attention to the misclassified samples. Thus we shall denote the sequence of samples using superscripts, i.e., by y¹, y², ..., yᵏ, ..., where each yᵏ is one of the n samples y₁, ..., yₙ, and where each yᵏ is misclassified. For example, if the samples y₁, y₂, and y₃ are considered cyclically, and if the marked samples in the sequence

y₁, y₂, y₃, y₁, y₂, y₃, y₁, y₂, ...    (19)

are misclassified -- say the first, third, fourth, fifth, and eighth -- then the sequence y¹, y², y³, y⁴, y⁵, ... denotes the sequence y₁, y₃, y₁, y₂, y₂, ... With this understanding, the fixed-increment rule for generating a sequence of weight vectors can be written as

a(1) arbitrary
a(k+1) = a(k) + yᵏ,  k ≥ 1,    (20)

where aᵗ(k)yᵏ ≤ 0 for all k. If we let n denote the total number of patterns, the algorithm is:

Algorithm 4 (Fixed-increment single-sample Perceptron)
1 begin initialize a, k ← 0
2   do k ← (k + 1) mod n
3     if yᵏ is misclassified by a then a ← a + yᵏ
4   until all patterns properly classified
5   return a
6 end

The fixed-increment Perceptron rule is the simplest of many algorithms that have been proposed for solving systems of linear inequalities. Geometrically, its interpretation in weight space is particularly clear. Since a(k) misclassifies yᵏ, a(k) is not on the positive side of the yᵏ hyperplane aᵗyᵏ = 0. The addition of yᵏ to a(k) moves the weight vector directly toward and perhaps across this hyperplane. Whether the hyperplane is crossed or not, the new inner product aᵗ(k+1)yᵏ is larger than the old inner product aᵗ(k)yᵏ by the amount ‖yᵏ‖², and the correction is clearly moving the weight vector in a good direction (Fig. 5.13).

Figure 5.13: Samples from two categories, ω₁ (black) and ω₂ (red), are shown in augmented feature space, along with an augmented weight vector a.
At each step in a fixed-increment rule, one of the misclassified patterns, yᵏ, is shown by the large dot. A correction Δa (proportional to the pattern vector yᵏ) is added to the weight vector -- towards an ω₁ point or away from an ω₂ point. This changes the decision boundary from the dashed position (from the previous update) to the solid position. The sequence of resulting a vectors is shown, where later values are shown darker. In this example, by step 9 a solution vector has been found and the categories successfully separated by the decision boundary shown.

Clearly this algorithm can only terminate if the samples are linearly separable; we now prove that it indeed terminates so long as the samples are linearly separable.

Theorem 5.1 (Perceptron Convergence) If training samples are linearly separable, then the sequence of weight vectors given by Algorithm 4 will terminate at a solution vector.

Proof: In seeking a proof, it is natural to try to show that each correction brings the weight vector closer to the solution region. That is, one might try to show that if â is any solution vector, then ‖a(k+1) − â‖ is smaller than ‖a(k) − â‖. While this turns out not to be true in general (cf. steps 6 & 7 in Fig. 5.13), we shall see that it is true for solution vectors that are sufficiently long.

Let â be any solution vector, so that âᵗyᵢ is strictly positive for all i, and let α be a positive scale factor. From Eq. 20,

a(k+1) − αâ = (a(k) − αâ) + yᵏ,

and hence

‖a(k+1) − αâ‖² = ‖a(k) − αâ‖² + 2(a(k) − αâ)ᵗyᵏ + ‖yᵏ‖².

Since yᵏ was misclassified, aᵗ(k)yᵏ ≤ 0, and thus

‖a(k+1) − αâ‖² ≤ ‖a(k) − αâ‖² − 2αâᵗyᵏ + ‖yᵏ‖².

Because âᵗyᵏ is strictly positive, the second term will dominate the third if α is sufficiently large.
In particular, if we let β be the maximum length of a pattern vector,

β² = maxᵢ ‖yᵢ‖²,    (21)

and γ be the smallest inner product of the solution vector with any pattern vector, i.e.,

γ = minᵢ âᵗyᵢ > 0,    (22)

then we have the inequality

‖a(k+1) − αâ‖² ≤ ‖a(k) − αâ‖² − 2αγ + β².

If we choose

α = β²/γ,    (23)

we obtain

‖a(k+1) − αâ‖² ≤ ‖a(k) − αâ‖² − β².

Thus, the squared distance from a(k) to αâ is reduced by at least β² at each correction, and after k corrections

‖a(k+1) − αâ‖² ≤ ‖a(1) − αâ‖² − kβ².    (24)

Since the squared distance cannot become negative, it follows that the sequence of corrections must terminate after no more than k₀ corrections, where

k₀ = ‖a(1) − αâ‖²/β².    (25)

Since a correction occurs whenever a sample is misclassified, and since each sample appears infinitely often in the sequence, it follows that when corrections cease the resulting weight vector must classify all of the samples correctly.

The number k₀ gives us a bound on the number of corrections. If a(1) = 0, we get the following particularly simple expression for k₀:

k₀ = α²‖â‖²/β² = β²‖â‖²/γ² = maxᵢ ‖yᵢ‖² ‖â‖² / minᵢ [yᵢᵗâ]².    (26)

The denominator in Eq. 26 shows that the difficulty of the problem is essentially determined by the samples most nearly orthogonal to the solution vector. Unfortunately, it provides no help when we face an unsolved problem, since the bound is expressed in terms of a solution vector which is unknown.

It is not outside either, since each correction causes the weight vector to move η times its distance from the boundary plane, thereby preventing the vector from being bounded away from the boundary forever. Hence the limit point must be on the boundary.

5.7 Nonseparable Behavior

The Perceptron and relaxation procedures give us a number of simple methods for finding a separating vector when the samples are linearly separable.
All of these methods are called error-correcting procedures, because they call for a modification of the weight vector when and only when an error is encountered. Their success on separable problems is largely due to this relentless search for an error-free solution. In practice, one would only consider the use of these methods if there was reason to believe that the error rate for the optimal linear discriminant function is low.

Of course, even if a separating vector is found for the training samples, it does not follow that the resulting classifier will perform well on independent test data. A moment's reflection will show that any set of fewer than 2d̂ samples is likely to be linearly separable -- a matter we shall return to in Chap. ??. Thus, one should use several times that many design samples to overdetermine the classifier, thereby ensuring that the performance on training and test data will be similar. Unfortunately, sufficiently large design sets are almost certainly not linearly separable. This makes it important to know how the error-correction procedures will behave when the samples are nonseparable.

Since no weight vector can correctly classify every sample in a nonseparable set (by definition), it is clear that the corrections in an error-correction procedure can never cease. Each algorithm produces an infinite sequence of weight vectors, any member of which may or may not yield a useful "solution." The exact nonseparable behavior of these rules has been studied thoroughly in a few special cases. It is known, for example, that the length of the weight vectors produced by the fixed-increment rule is bounded. Empirical rules for terminating the correction procedure are often based on this tendency for the length of the weight vector to fluctuate near some limiting value. From a theoretical viewpoint, if the components of the samples are integer-valued, the fixed-increment procedure yields a finite-state process.
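Both behaviors are easy to see in code. The following minimal pure-Python sketch of the fixed-increment rule (Algorithm 4) finds a solution vector on a separable toy set, while on a nonseparable set its corrections never cease and its weight vector stays bounded. The helper names and toy data are invented for illustration.

```python
# Fixed-increment single-sample Perceptron (Algorithm 4) on "normalized"
# samples (omega_2 samples already negated), plus a demonstration of the
# bounded, never-terminating behavior on a nonseparable set.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def fixed_increment(samples, max_updates=1000):
    """Return (a, corrections, separated)."""
    a = [0.0] * len(samples[0])
    n, k, streak, corrections = len(samples), 0, 0, 0
    for _ in range(max_updates):
        if dot(a, samples[k]) <= 0:              # y^k misclassified
            a = [ai + yi for ai, yi in zip(a, samples[k])]
            corrections += 1
            streak = 0
        else:
            streak += 1
            if streak == n:                      # a full error-free pass
                return a, corrections, True
        k = (k + 1) % n                          # consider samples cyclically
    return a, corrections, False

# A separable toy problem: the rule terminates at a solution vector.
separable = [(1.0, 2.0), (1.0, 1.5), (-1.0, 0.5), (-1.0, -0.3)]
a, corr, ok = fixed_increment(separable)
assert ok and all(dot(a, y) > 0 for y in separable)

# A nonseparable set: it contains a sample and its negative.
nonsep = [(1.0, 0.5), (-1.0, -0.5)]
a2, corr2, ok2 = fixed_increment(nonsep)
assert not ok2                         # corrections never cease
assert dot(a2, a2) < 4.0               # yet the weight vector stays bounded
```

On the nonseparable pair the weight vector simply cycles between two values, illustrating the fluctuation near a limiting length described above.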
If the correction process is terminated at some arbitrary point, the weight vector may or may not be in a good state. By averaging the weight vectors produced by the correction rule, one can reduce the risk of obtaining a bad solution by accidentally choosing an unfortunate termination time.

A number of similar heuristic modifications to the error-correction rules have been suggested and studied empirically. The goal of these modifications is to obtain acceptable performance on nonseparable problems while preserving the ability to find a separating vector on separable problems. A common suggestion is the use of a variable increment η(k), with η(k) approaching zero as k approaches infinity. The rate at which η(k) approaches zero is quite important. If it is too slow, the results will still be sensitive to those training samples that render the set nonseparable. If it is too fast, the weight vector may converge prematurely with less than optimal results. One way to choose η(k) is to make it a function of recent performance, decreasing it as performance improves. Another way is to program η(k) by a choice such as η(k) = η(1)/k. When we examine stochastic approximation techniques, we shall see that this latter choice is the theoretical solution to an analogous problem. Before we take up this topic, however, we shall consider an approach that sacrifices the ability to obtain a separating vector for good compromise performance on both separable and nonseparable problems.

5.8 Minimum Squared Error Procedures

5.8.1 Minimum Squared Error and the Pseudoinverse

The criterion functions we have considered guarantee that the sequence of weight vectors tends to converge to the desired solution. Instead of pursuing this topic further, we shall turn to a very similar rule that arises from a stochastic descent procedure. We note, however, that the solution need not give a separating vector, even if one exists, as shown in Fig.
5.17 (Computer exercise 10).

Figure 5.17: The LMS algorithm need not converge to a separating hyperplane, even if one exists. Since the LMS solution minimizes the sum of the squares of the distances of the training points to the hyperplane, for this example the plane is rotated clockwise compared to a separating hyperplane.

5.8.5 Stochastic Approximation Methods

All of the iterative descent procedures we have considered thus far have been described in deterministic terms. We are given a particular set of samples, and we generate a particular sequence of weight vectors. In this section we digress briefly to consider an MSE procedure in which the samples are drawn randomly, resulting in a random sequence of weight vectors. We will return in Chap. ?? to the theory of stochastic approximation, though here some of the main ideas will be presented without proof.

Suppose that samples are drawn independently by selecting a state of nature with probability P(ωᵢ) and then selecting an x according to the probability law p(x|ωᵢ). For each x we let θ be its label, with θ = +1 if x is labelled ω₁ and θ = −1 if x is labelled ω₂. Then the data consist of an infinite sequence of independent pairs (x₁, θ₁), (x₂, θ₂), ..., (xₖ, θₖ), .... Even though the label variable θ is binary-valued, it can be thought of as a noisy version of the Bayes discriminant function g₀(x). This follows from the observation that

P(θ = 1|x) = P(ω₁|x)  and  P(θ = −1|x) = P(ω₂|x),

so that the conditional mean of θ is given by

E[θ|x] = Σ_θ θ P(θ|x) = P(ω₁|x) − P(ω₂|x) = g₀(x).    (62)

Suppose that we wish to approximate g₀(x) by the finite series expansion

g(x) = aᵗy = Σᵢ₌₁^d̂ aᵢyᵢ(x),

where both the basis functions yᵢ(x) and the number of terms d̂ are known. Then we can seek a weight vector â that minimizes the mean-squared approximation error

ε² = E[(aᵗy − g₀(x))²].    (63)
Minimization of ε² would appear to require knowledge of the Bayes discriminant g₀(x). However, as one might have guessed from the analogous situation in Sect. 5.8.3, it can be shown that the weight vector â that minimizes ε² also minimizes the criterion function

Jm(a) = E[(aᵗy − θ)²].    (64)

This should also be plausible from the fact that θ is essentially a noisy version of g₀(x) (Fig. ??). Since the gradient is

∇Jm = 2E[(aᵗy − θ)y],    (65)

we can obtain the closed-form solution

â = E[yyᵗ]⁻¹E[θy].    (66)

Thus, one way to use the samples is to estimate E[yyᵗ] and E[θy], and use Eq. 66 to obtain the MSE optimal linear discriminant. An alternative is to minimize Jm(a) by a gradient descent procedure. Suppose that in place of the true gradient we substitute the noisy version 2(aᵗ(k)yₖ − θₖ)yₖ. This leads to the update rule

a(k+1) = a(k) + η(k)(θₖ − aᵗ(k)yₖ)yₖ,    (67)

which is basically just the Widrow-Hoff rule. It can be shown (Problem ??) that if E[yyᵗ] is nonsingular and if the coefficients η(k) satisfy

lim_{m→∞} Σₖ₌₁^m η(k) = +∞    (68)

and

lim_{m→∞} Σₖ₌₁^m η²(k) < ∞,    (69)

then a(k) converges to â in mean square:

lim_{k→∞} E[‖a(k) − â‖²] = 0.    (70)

The reasons we need these conditions on η(k) are simple. The first condition keeps the weight vector from converging so fast that a systematic error will remain forever uncorrected. The second condition ensures that random fluctuations are eventually suppressed. Both conditions are satisfied by the conventional choice η(k) = 1/k. Unfortunately, this kind of programmed decrease of η(k), independent of the problem at hand, often leads to very slow convergence.

Of course, this is neither the only nor the best descent algorithm for minimizing Jm. For example, if we note that the matrix of second partial derivatives for Jm is given by

D = 2E[yyᵗ],

we see that Newton's rule for minimizing Jm (Eq. 15) is

a(k+1) = a(k) + E[yyᵗ]⁻¹E[(θ − aᵗy)y].
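As an aside, the first-order rule of Eq. 67 with the conventional choice η(k) = 1/k is easy to simulate. In the following sketch the sampling distribution, the seed, and the tolerance checks are invented toy choices, so the numbers are purely illustrative:

```python
# Stochastic descent per Eq. 67 (essentially the Widrow-Hoff rule) with
# eta(k) = 1/k, which satisfies the conditions of Eqs. 68 and 69.
# The sampling distribution below is an invented toy example.
import random

random.seed(0)

def draw():
    # x uniform on [-1, 1]; label theta = +1 for omega_1 (x > 0), else -1.
    x = random.uniform(-1.0, 1.0)
    return (1.0, x), (1.0 if x > 0 else -1.0)   # augmented y = (1, x)

a = [0.0, 0.0]
for k in range(1, 20001):
    y, theta = draw()
    err = theta - (a[0] * y[0] + a[1] * y[1])   # theta_k - a^t(k) y_k
    a = [ai + (1.0 / k) * err * yi for ai, yi in zip(a, y)]

# For this toy distribution E[yy^t] = diag(1, 1/3) and E[theta*y] = (0, 1/2),
# so Eq. 66 gives the MSE-optimal weights a = (0, 3/2). The 1/k schedule
# converges slowly, so we only check signs and rough magnitudes.
assert abs(a[0]) < 0.5 and a[1] > 0.5
```

The loose final check reflects the remark above: the programmed 1/k decrease is theoretically sound but often very slow in practice.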
A stochastic analog of Newton's rule is

a(k+1) = a(k) + R_{k+1}(θₖ − aᵗ(k)yₖ)yₖ,    (71)

with

R⁻¹_{k+1} = R⁻¹_k + yₖyₖᵗ,    (72)

or, equivalently,

R_{k+1} = Rₖ − Rₖyₖ(Rₖyₖ)ᵗ / (1 + yₖᵗRₖyₖ).    (73)

This rule also produces a sequence of weight vectors that converges to the optimal solution in mean square. Its convergence is faster, but it requires more computation per step (Computer exercise 8). (This recursive formula for computing Rₖ, which is roughly (1/k)E[yyᵗ]⁻¹, cannot be used if Rₖ is singular. The equivalence of Eq. 72 and Eq. 73 follows from Problem ?? of Chap. ??.)

These gradient procedures can be viewed as methods for minimizing a criterion function, or finding the zero of its gradient, in the presence of noise. In the statistical literature, functions such as Jm that have the form E[f(a, x)] are called regression functions, and the iterative algorithms are called stochastic approximation procedures. Two well-known ones are the Kiefer-Wolfowitz procedure for minimizing a regression function, and the Robbins-Monro procedure for finding a root of a regression function. Often the easiest way to obtain a convergence proof for a particular descent or approximation procedure is to show that it satisfies the convergence conditions for these more general procedures. Unfortunately, an exposition of these methods in their full generality would lead us rather far afield, and we must close this digression by referring the interested reader to the literature.

5.9 The Ho-Kashyap Procedures

5.9.1 The Descent Procedure

The procedures we have considered thus far differ in several ways. The Perceptron and relaxation procedures find separating vectors if the samples are linearly separable, but do not converge on nonseparable problems. The MSE procedures yield a weight vector whether the samples are linearly separable or not, but there is no guarantee
LINEAR DISCRIMINANT FUNCTIONS that this vector is a separating vector in the separable case (Fig. 5.17). If the margin vector b is chosen arbitrarily, all we can say is that the MSE procedures minimize Ya - b 2 . Now if the training samples happen to be linearly separable, then there ^ ^ exists an a and a b such that Y^ = b > 0, a ^ ^ ^ where by b > 0, we mean that every component of b is positive. Clearly, were we ^ to take b = b and apply the MSE procedure, we would obtain a separating vector. ^ Of course, we usually do not know b beforehand. However, we shall now see how the MSE procedure can be modified to obtain both a separating vector a and a margin vector b. The underlying idea comes from the observation that if the samples are separable, and if both a and b in the criterion function Js (a, b) = Ya - b 2 (74) are allowed to vary (subject to the constraint b > 0), then the minimum value of Js is zero, and the a that achieves that minimum is a separating vector. To minimize Js , we shall use a modified gradient descent procedure. The gradient of Js with respect to a is given by a Js = 2Yt (Ya - b), and the gradient of Js with respect to b is given by b Js = -2(Ya - b). For any value of b, we can always take a = Y b, (77) (76) (75) thereby obtaining a Js = 0 and minimizing Js with respect to a in one step. We are not so free to modify b, however, since we must respect the constraint b > 0, and we must avoid a descent procedure that converges to b = 0. One way to prevent b from converging to zero is to start with b > 0 and to refuse to reduce any of its components. We can do this and still try to follow the negative gradient if we first set all positive components of b Js to zero. Thus, if we let |v| denote the vector whose components are the magnitudes of the corresponding components of v, we are led to consider an update rule for the margin of the form 1 b(k + 1) = b(k) - [b Js - |b Js |]. 2 (78) Using Eqs. 
76 & 77, and being a bit more specific, we can express the procedure in terms of the error vector e(k) = Ya(k) − b(k) and its positive part e⁺(k) = ½(e(k) + |e(k)|). In the resulting expansion of ‖e(k+1)‖², the second term becomes

eᵗ(k)(YY† − I)e⁺(k) = −eᵗ(k)e⁺(k) = −‖e⁺(k)‖²,

the nonzero components of e⁺(k) being the positive components of e(k). Since YY† is symmetric and is equal to (YY†)ᵗ(YY†), the third term simplifies to

η²‖(YY† − I)e⁺(k)‖² = η²e⁺ᵗ(k)(YY† − I)ᵗ(YY† − I)e⁺(k) = η²‖e⁺(k)‖² − η²e⁺ᵗ(k)YY†e⁺(k),

and thus we have

(1/4)(‖e(k)‖² − ‖e(k+1)‖²) = η(1 − η)‖e⁺(k)‖² + η²e⁺ᵗ(k)YY†e⁺(k).    (85)

Since e⁺(k) is nonzero by assumption, and since YY† is positive semidefinite, ‖e(k)‖² > ‖e(k+1)‖² if 0 < η < 1. Thus the sequence ‖e(1)‖², ‖e(2)‖², ... is monotonically decreasing and must converge to some limiting value ‖ē‖². But for convergence to take place, e⁺(k) must converge to zero, so that all the positive components of e(k) must converge to zero. Since eᵗ(k)b̂ = 0 for all k, it follows that all of the components of e(k) must converge to zero. Thus, if 0 < η < 1 and if the samples are linearly separable, a(k) will converge to a solution vector as k goes to infinity.

If we test the signs of the components of Ya(k) at each step and terminate the algorithm when they are all positive, we will in fact obtain a separating vector in a finite number of steps. This follows from the fact that Ya(k) = b(k) + e(k), and that the components of b(k) never decrease. Thus, if b_min is the smallest component of b(1) and if e(k) converges to zero, then e(k) must enter the hypersphere ‖e(k)‖ = b_min after a finite number of steps, at which point Ya(k) > 0. Although we ignored terminating conditions to simplify the proof, such a terminating condition would always be used in practice.

5.9.3 Nonseparable Behavior

If the convergence proof just given is examined to see how the assumption of separability was employed, it will be seen that it was needed twice.
First, the fact that eᵗ(k)b̂ = 0 was used to show that either e(k) = 0 for some finite k, or e⁺(k) is never zero and corrections go on forever. Second, this same constraint was used to show that if e⁺(k) converges to zero, e(k) must also converge to zero. If the samples are not linearly separable, it no longer follows that if e⁺(k) is zero then e(k) must be zero. Indeed, on a nonseparable problem one may well obtain a nonzero error vector having no positive components. If this occurs, the algorithm automatically terminates and we have proof that the samples are not separable.

What happens if the patterns are not separable, but e⁺(k) is never zero? In this case it still follows that

e(k+1) = e(k) + 2η(YY† − I)e⁺(k)    (86)

and

(1/4)(‖e(k)‖² − ‖e(k+1)‖²) = η(1 − η)‖e⁺(k)‖² + η²e⁺ᵗ(k)YY†e⁺(k).    (87)

Thus, the sequence ‖e(1)‖², ‖e(2)‖², ... must still converge, though the limiting value ‖ē‖² cannot be zero. Since convergence requires that e⁺(k) converge to zero, either e⁺(k) = 0 for some finite k, or e⁺(k) converges to zero while e(k) is bounded away from zero. Thus, the Ho-Kashyap algorithm provides us with a separating vector in the separable case, and with evidence of nonseparability in the nonseparable case. However, there is no bound on the number of steps needed to disclose nonseparability.

5.9.4 Some Related Procedures

If we write Y† = (YᵗY)⁻¹Yᵗ and make use of the fact that Yᵗe(k) = 0, we can modify the Ho-Kashyap rule as follows:

b(1) > 0 but otherwise arbitrary
a(1) = Y†b(1)    (88)
b(k+1) = b(k) + (e(k) + |e(k)|)
a(k+1) = a(k) + ηY†|e(k)|,    (89)

where, as usual, e(k) = Ya(k) − b(k).
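The modified rule of Eqs. 88 & 89 can be sketched directly. The following pure-Python illustration fixes d̂ = 2 so the pseudoinverse can be computed by hand from the normal equations; the toy (already "normalized") samples, the choice η = 1, and the stopping tolerances are invented for illustration:

```python
# Sketch of the modified Ho-Kashyap rule (Eqs. 88 & 89) for d_hat = 2,
# with the pseudoinverse Ydag = (Yt Y)^-1 Yt computed by hand.
# Toy data, eta = 1, and tolerances are illustrative inventions.

def ho_kashyap(Y, eta=1.0, b_min=1e-8, k_max=10000):
    n = len(Y)
    # Normal-equation pseudoinverse for 2 columns: S = Yt Y, Ydag = S^-1 Yt.
    s00 = sum(y[0] * y[0] for y in Y)
    s01 = sum(y[0] * y[1] for y in Y)
    s11 = sum(y[1] * y[1] for y in Y)
    det = s00 * s11 - s01 * s01
    inv = ((s11 / det, -s01 / det), (-s01 / det, s00 / det))
    ydag = [[inv[i][0] * y[0] + inv[i][1] * y[1] for y in Y] for i in range(2)]

    b = [1.0] * n                                            # b(1) > 0
    a = [sum(ydag[i][k] * b[k] for k in range(n)) for i in range(2)]
    for _ in range(k_max):
        e = [a[0] * y[0] + a[1] * y[1] - bk for y, bk in zip(Y, b)]
        if max(abs(ek) for ek in e) < b_min:                 # e ~ 0: converged
            break
        b = [bk + (ek + abs(ek)) for bk, ek in zip(b, e)]    # b <- b + e + |e|
        a = [ai + eta * sum(ydag[i][k] * abs(e[k]) for k in range(n))
             for i, ai in enumerate(a)]                      # a <- a + eta*Ydag|e|
    return a, b

Y = [(1.0, 2.0), (1.0, 1.5), (-1.0, 0.5), (-1.0, -0.3)]      # separable toy set
a, b = ho_kashyap(Y)
assert all(a[0] * y[0] + a[1] * y[1] > 0 for y in Y)         # Ya > 0: separating
```

Since the components of b never decrease from b(1) = 1, the final check Ya > 0 is exactly the practical terminating condition discussed above.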
This then gives the algorithm for a fixed learning rate:

Algorithm 12 (Modified Ho-Kashyap)
1 begin initialize a, b, η < 1, criterion b_min, k_max
2   do k ← k + 1
3     e ← Ya − b
4     e⁺ ← 1/2(e + Abs[e])
5     b ← b + 2η(k)e⁺
6     a ← Y†b
7     if Abs[e] ≤ b_min then return a, b and exit
8   until k = k_max
9   print NO SOLUTION FOUND
10 end

This algorithm differs from the Perceptron and relaxation algorithms for solving linear inequalities in at least three ways: (1) it varies both the weight vector a and the margin vector b, (2) it provides evidence of nonseparability, but (3) it requires the computation of the pseudoinverse of Y. Even though this last computation need be done only once, it can be time consuming, and it requires special treatment if YᵗY is singular.

An interesting alternative algorithm that resembles Eq. 88 but avoids the need for computing Y† is

b(1) > 0 but otherwise arbitrary
a(1) arbitrary
b(k+1) = b(k) + (e(k) + |e(k)|)
a(k+1) = a(k) + ηRYᵗ|e(k)|,    (90)

where R is an arbitrary, constant, positive-definite d̂-by-d̂ matrix. We shall show that if η is properly chosen, this algorithm also yields a solution vector in a finite number of steps, provided that a solution exists. Furthermore, if no solution exists, the vector Yᵗ|e(k)| either vanishes, exposing the nonseparability, or converges to zero.

The proof is fairly straightforward. Whether the samples are linearly separable or not, Eqs. 89 & 90 show that

e(k+1) = Ya(k+1) − b(k+1) = (ηYRYᵗ − I)|e(k)|.

We can find, then, that the squared magnitude is

‖e(k+1)‖² = |e(k)|ᵗ(η²YRYᵗYRYᵗ − 2ηYRYᵗ + I)|e(k)|,

and furthermore

‖e(k)‖² − ‖e(k+1)‖² = (Yᵗ|e(k)|)ᵗA(Yᵗ|e(k)|),    (91)

where

A = 2ηR − η²RYᵗYR.    (92)

Clearly, if η is positive but sufficiently small, A will be approximately 2ηR and hence positive definite. Thus, if Yᵗ|e(k)| ≠ 0 we will have ‖e(k)‖² > ‖e(k+1)‖². At this point we must distinguish between the separable and the nonseparable case.
In the separable case there exist an â and a b̂ > 0 satisfying Yâ = b̂. Thus, if |e(k)| ≠ 0,

|e(k)|ᵗYâ = |e(k)|ᵗb̂ > 0,

so that Yᵗ|e(k)| cannot be zero unless e(k) is zero. Thus, the sequence ‖e(1)‖², ‖e(2)‖², ... is monotonically decreasing and must converge to some limiting value ‖ē‖². But for convergence to take place, Yᵗ|e(k)| must converge to zero, which implies that |e(k)| and hence e(k) must converge to zero. Since b(k) starts out positive and never decreases, it follows that a(k) must converge to a separating vector. Moreover, by the same argument used before, a solution must actually be obtained after a finite number of steps.

In the nonseparable case, e(k) can neither be zero nor converge to zero. It may happen that Yᵗ|e(k)| = 0 at some step, which would provide proof of nonseparability. However, it is also possible for the sequence of corrections to go on forever. In this case, it again follows that the sequence ‖e(1)‖², ‖e(2)‖², ... must converge to a limiting value ‖ē‖² ≠ 0, and that Yᵗ|e(k)| must converge to zero. Thus, we again obtain evidence of nonseparability in the nonseparable case.

Before closing this discussion, let us look briefly at the question of choosing η and R. The simplest choice for R is the identity matrix, in which case A = 2ηI − η²YᵗY. This matrix will be positive definite, thereby assuring convergence, if 0 < η < 2/λ_max, where λ_max is the largest eigenvalue of YᵗY. Since the trace of YᵗY is both the sum of the eigenvalues of YᵗY and the sum of the squares of the elements of Y, one can use the pessimistic bound λ_max ≤ Σᵢ ‖yᵢ‖² in selecting a value for η.

A more interesting approach is to change η at each step, selecting that value that maximizes ‖e(k)‖² − ‖e(k+1)‖². Equations 91 & 92 give

‖e(k)‖² − ‖e(k+1)‖² = |e(k)|ᵗY(2ηR − η²RYᵗYR)Yᵗ|e(k)|.    (93)

By differentiating with respect to η, we obtain the optimal value
η(k) = |e(k)|ᵗYRYᵗ|e(k)| / |e(k)|ᵗYRYᵗYRYᵗ|e(k)|,    (94)

which, for R = I, simplifies to

η(k) = ‖Yᵗ|e(k)|‖² / ‖YYᵗ|e(k)|‖².    (95)

This same approach can also be used to select the matrix R. By replacing R in Eq. 93 by the symmetric matrix R + ΔR and neglecting second-order terms, we obtain

Δ(‖e(k)‖² − ‖e(k+1)‖²) = η|e(k)|ᵗY[ΔRᵗ(I − ηYᵗYR) + (I − ηRYᵗY)ΔR]Yᵗ|e(k)|.

Thus, the decrease in the squared error vector is maximized by the choice

R = (1/η)(YᵗY)⁻¹,    (96)

and since ηRYᵗ = Y†, the descent algorithm becomes virtually identical with the original Ho-Kashyap algorithm.

5.10 Linear Programming Algorithms

5.10.1 Linear Programming

The Perceptron, relaxation and Ho-Kashyap procedures are basically gradient descent procedures for solving simultaneous linear inequalities. Linear programming techniques are procedures for maximizing or minimizing linear functions subject to linear equality or inequality constraints. This at once suggests that one might be able to solve linear inequalities by using them as the constraints in a suitable linear programming problem. In this section we shall consider two of several ways that this can be done. The reader need have no knowledge of linear programming to understand these formulations, though such knowledge would certainly be useful in applying the techniques.

A classical linear programming problem can be stated as follows: Find a vector u = (u₁, ..., u_m)ᵗ that minimizes the linear (scalar) objective function

z = αᵗu    (97)

subject to the constraint

Au ≥ β,    (98)

where α is an m-by-1 cost vector, β is an l-by-1 vector, and A is an l-by-m matrix. The simplex algorithm is the classical iterative procedure for solving this problem (Fig. 5.18). For technical reasons, it requires the imposition of one more constraint, viz., u ≥ 0.
If we think of u as being the weight vector a, this constraint is unacceptable, since in most cases the solution vector will have both positive and negative components.

Figure 5.18: Surfaces of constant z = αᵗu are shown in gray, while constraints of the form Au ≥ β are shown in red. The simplex algorithm finds an extremum of z given the constraints, i.e., where the gray plane intersects the red at a single point.

However, suppose that we write

a = a⁺ − a⁻,    (99)

where

a⁺ = 1/2(|a| + a)    (100)

and

a⁻ = 1/2(|a| − a).    (101)

Then both a⁺ and a⁻ are nonnegative, and by identifying the components of u with the components of a⁺ and a⁻, for example, we can accept the constraint u ≥ 0.

5.10.2 The Linearly Separable Case

Suppose that we have a set of n samples y₁, ..., yₙ and we want a weight vector a that satisfies aᵗyᵢ ≥ bᵢ > 0 for all i. How can we formulate this as a linear programming problem? One approach is to introduce what is called an artificial variable τ ≥ 0 by writing

aᵗyᵢ + τ ≥ bᵢ.

If τ is sufficiently large, there is no problem in satisfying these constraints; for example, they are satisfied if a = 0 and τ = maxᵢ bᵢ. However, this hardly solves our original problem. What we want is a solution with τ = 0, which is the smallest value τ can have and still satisfy τ ≥ 0. Thus, we are led to consider the following problem: Minimize τ over all values of τ and a that satisfy the conditions aᵗyᵢ + τ ≥ bᵢ and τ ≥ 0. In the terminology of linear programming, any solution satisfying the constraints is called a feasible solution. A feasible solution for which the number of nonzero variables does not exceed the number of constraints (not counting the simplex requirement for nonnegative variables) is called a basic feasible solution. Thus, the solution a = 0 and τ = maxᵢ bᵢ is a basic feasible solution. Possession of such a solution simplifies the application of the simplex algorithm.
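The variable-splitting trick of Eqs. 99-101 is a one-liner in code. In this sketch the helper name and sample vector are invented for illustration:

```python
# The variable-splitting identity of Eqs. 99-101: any weight vector a with
# mixed-sign components becomes a pair of nonnegative vectors, as the
# simplex algorithm's u >= 0 constraint requires.

def split(a):
    a_plus = [0.5 * (abs(ai) + ai) for ai in a]    # Eq. 100
    a_minus = [0.5 * (abs(ai) - ai) for ai in a]   # Eq. 101
    return a_plus, a_minus

a = [-1.0, 0.0, 2.5]
a_plus, a_minus = split(a)
assert all(c >= 0 for c in a_plus + a_minus)           # u >= 0 is acceptable
assert [p - m for p, m in zip(a_plus, a_minus)] == a   # a = a+ - a-  (Eq. 99)
```

Stacking a⁺, a⁻, and τ into one vector u is exactly how the formulation of the next passage reaches m = 2d̂ + 1 nonnegative variables.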
If the answer is zero, the samples are linearly separable, and we have a solution. If the answer is positive, there is no separating vector, but we have proof that the samples are nonseparable.

Formally, our problem is to find a vector u that minimizes the objective function z = αᵗu subject to the constraints Au ≥ β and u ≥ 0, where

A = [ y₁ᵗ  −y₁ᵗ  1
      y₂ᵗ  −y₂ᵗ  1
      ...
      yₙᵗ  −yₙᵗ  1 ],   u = [ a⁺
                              a⁻
                              τ ],   α = [ 0
                                           0
                                           1 ],   β = [ b₁
                                                        b₂
                                                        ...
                                                        bₙ ].

Thus, the linear programming problem involves m = 2d̂ + 1 variables and l = n constraints, plus the simplex algorithm constraints u ≥ 0. The simplex algorithm will find the minimum value of the objective function z = αᵗu = τ in a finite number of steps, and will exhibit a vector û yielding that value. If the samples are linearly separable, the minimum value of τ will be zero, and a solution vector a can be obtained from û. If the samples are not separable, the minimum value of τ will be positive. The resulting û is usually not very useful as an approximate solution.

We shall return to this equation in Chap. ??, but for now we can understand this informally by means of the leave-one-out bound. Suppose we have n points in the training set, and train a Support Vector Machine on n − 1 of them, and test on the single remaining point. If that remaining point happens to be a support vector for the full n-sample case, then there will be an error; otherwise, there will not. Note that if we can find a transformation φ() that well separates the data -- so the expected number of support vectors is small -- then Eq. 107 shows that the expected error rate will be lower.

Figure 5.19: Training a Support Vector Machine consists of finding the optimal hyperplane, i.e., the one with the maximum distance from the nearest training patterns.
The support vectors are those (nearest) patterns, a distance b from the hyperplane. The three support vectors are shown in solid dots. 5.11.1 SVM training We now turn to the problem of training an SVM. The first step is, of course, to choose the nonlinear -functions that map the input to a higher dimensional space. Often this choice will be informed by the designer's knowledge of the problem domain. In the absense of such information, one might choose to use polynomials, Gaussians or yet other basis functions. The dimensionality of the mapped space can be arbitrarily high (though in practice it may be limited by computational resources). We begin by recasting the problem of minimizing the magnitude of the weight vector constrained by the separation into an unconstrained problem by the method of Lagrange undetermined multipliers. Thus from Eq. 106 and our goal of minimizing ||a||, we construct the functional L(a, ) = 1 ||a||2 - 2 n k [zk at yk - 1]. k=1 (108) and seek to minimize L() with respect to the weight vector a, and maximize it with respect to the undetermined multipliers k 0. The last term in Eq. 108 expresses the goal of classifying the points correctly. It can be shown using the so-called KuhnTucker construction (Problem 30) (also associated with Karush whose 1939 thesis addressed the same problem) that this optimization can be reformulated as maximizing n L() = k=1 i - 1 2 n t k j zk zj yj yk , k,j (109) subject to the constraints 5.11. *SUPPORT VECTOR MACHINES 51 n zk k = 0 k=1 k 0, k = 1, ..., n, (110) given the training data. While these equations can be solved using quadratic programming, a number of alternate schemes have been devised (cf. Bibliography). Example 2: SVM for the XOR problem The exclusive-OR is the simplest problem that cannot be solved using a linear discriminant operating directly on the features. 
The points k = 1, 3 at x = (1, 1)t and (-1, -1)t are in category 1 (red in the figure), while k = 2, 4 at x = (1, -1)t and (-1, 1)t are in 2 (black in the figure). Following the approach of Support Vector Machines, we preprocess the features to map them to a higher dimension space where they can be linearly separated. While many -functions could be used, here we use the simplest expansion up to second order: 1, 2x1 , 2x2 , 2x1 x2 , x2 and x2 , where 1 2 the 2 is convenient for normalization. We seek to maximize Eq. 109, 4 k - k=1 1 2 n t k j zk zj yj yk k,j subject to the constraints (Eq. 110) 1 - 2 + 3 - 4 = 0 k = 1, 2, 3, 4. 0 k It is clear from the symmetry of the problem that 1 = 3 and that 2 = 4 at the solution. While we could use iterative gradient descent as described in Sect. 5.9, for this small problem we can use analytic techniques instead. The solution is a = 1/8, k for k = 1, 2, 3, 4, and from the last term in Eq. 108 this implies that all four training patterns are support vectors -- an unusual case due to the highly symmetric nature of the XOR problem. The final discriminant function is g(x) = g(x1 , x2 ) = x1 x2 , and the decision hyperplane is defined by g = 0, which properly classifies all training patterns. The margth the classic paper by Ronald A. Fisher [4]. The application of linear discriminant function to pattern classification was well described in [7], which posed the problem of optimal (minimum-risk) linear discriminant, and proposed plausible gradient descient procedures to determine a solution from samples. Unfortunately, little can be said about such procedures without knowing the underlying distributions, and even then the situation is analytically complex. The design of multicategory classifiers using two-category procedures stems from [12]. Minsky and Papert's Perceptrons [11] was influential in pointing out the weaknesses of linear classifiers -- weaknesses that were overcome by the methods we shall study in Chap. ??. 
The Winnow algorithms [8] in the error-free case and [9, 6] and subsequent work in the general case have been useful in the computational learning community, as they allow one to derive convergence bounds. While this work was statistically oriented, many of the pattern recognition papers that appeared in the late 1950s and early 1960s adopted other viewpoints. One viewpoint was that of neural networks, in which individual neurons were modelled as threshold elements, two-category linear machines -- work that had its origins in the famous paper by McCulloch and Pitts [10]. As linear machines have been applied to larger and larger data sets in higher and higher dimensions, the computational burden of linear programming [2] has made this approach less popular. Stochastic approximations, e.g, [15], An early paper on the key ideas in Support Vector Machines is [1]. A more extensive treatment, including complexity control, can be found in [14] -- material we shall visit in Chap. ??. A readable presentation of the method is [3], which provided the inspiration behind our Example 2. The Kuhn-Tucker construction, used in the SVM training method described in the text and explored in Problem 30, is from [5] and used in [13]. The fundamental result is that exactly one of the following three cases holds. 1) The original (primal) conditions have an optimal solution; in that case the dual cases do too, and their objective values are equal, or 2) the primal conditions are infeasible; in that case the dual is either unbounded or itself infeasible, or 3) the primal conditions are unbounded; in that case the dual is infeasible. Problems Section 5.2 58 CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS 1. Consider a linear machine with discriminant functions gi (x) = wt x + wi0 , i = 1, ..., c. Show that the decision regions are convex by showing that if x1 Ri and x2 Ri then x1 + (1 - )x2 Ri if 0 1. 2. 
Figure 5.3 illustrates the two most popular methods for designing a c-category c classifier from linear boundary segments. Another method is to save the full 2 linear i /j boundaries, and classify any point by taking a vote based on all these boundaries. Prove whether the resulting decision regions must be convex. If they need not be convex, construct a non-pathological example yielding at least one non-convex decision region. 3. Consider the hyperplane used for discriminant functions. (a) Show that the distance from the hyperplane g(x) = wt x + w0 = 0 to the point xa is |g(xa )|/ w by minimizing x - xa 2 subject to the constraint g(x) = 0. (b) Show that the projection of xa onto the hyperplane is given by xp = xa - g(xa ) w. w 2 4. Consider the three-category linear machine with discriminant functions gi (x) = t wi x + wi0 , i = 1, 2, 3. (a) For the special case where x is two-dimensional and the threshold weights wi0 are zero, sketch the weight vectors with their tails at the origin, the three lines joining their heads, and the decision boundaries. (b) How does this sketch change when a constant vector c is added to each of the three weight vectors? 5. In the multicategory case, a set of samples is said to be linearly separable if there exists a linear machine that can classify them all correctly. If any samples labelled i can be separated from all others by a single hyperplane, we shall say the samples total are totally lce? Section 5.5 2. Write a program to implement the Perceptron algorithm. (a) Starting with a = 0, apply your program to the training data from 1 and 2 . Note the number of iterations required for convergence. (b) Apply your program to 3 and 2 . Again, note the number of iterations required for convergence. (c) Explain the difference between the iterations required in the two cases. 3. The Pocket algorithm uses the criterion of longest sequence of correctly classified points, and can be used in conjunction a number of basic learning algorithms. 
For instance, one use the Pocket algorithm in conjunction with the Perceptron algorithm in a sort of ratchet scheme as follows. There are two sets of weights, one for the normal Pocket algorithm the following table. 3 4 x1 x2 x1 x2 -3.0 -2.9 -2.0 -8.4 0.5 8.7 -8.9 0.2 2.9 2.1 -4.2 -7.7 -0.1 5.2 -8.5 -3.2 -4.0 2.2 -6.7 -4.0 -1.3 3.7 -0.5 -9.2 -3.4 6.2 -5.3 -6.7 -4.1 3.4 -8.7 -6.4 -5.1 1.6 -7.1 -9.7 1.9 5.1 -8.0 -6.3 66 CHAPTER 5. LINEAR DISCRIMINANT FUNCTIONS Perceptron algorithm, and a separate one (not directly used for training) which is kept "in your pocket." Both are randomly chosen at the start. The "pocket" weights are tested on the full data set to find the longest run of patterns properly classified. (At the beginning, this run will be short.) The Perceptron weights are trained as usual, but after every weight update (or after some finite number of such weight updates), the Perceptron weight is tested on data points, randomly selected, to determine the longest run of properly classified points. If this length is greater than the pocket weights, the Perceptron weights replace the pocket weights, and perceptron training continues. In this way, the poscket weights continually improve, classifying longer and longer runs of randomly selected points. (a) Write a pocket algorithm to be employed with Perceptron algorithm. (b) Apply it to the data from 1 and 3 . How often are the pocket weights updated? 4. Start with a randomly chosen a, Calculate 2 (Eq. 21 At the end of training calculate (Eq. 22). Verify k0 (Eq. 25). 5. Show that the first xx points of categories x and x xx. Construct by hand a nonlinear mapping of the feature space to make them linearly separable. Train a Perceptron classifier on them. 6. Consider a version of the Balanced Winnow training algorithm (Algorithm 7). Classification of test data is given by line 2. 
Compare the converge rate of Balanced Winnow with the fixed-increment, single-sample Perceptron (Algorithm 4) on a problem with large number of redundant features, as follows. (a) Generate a training set of 2000 100-dimensional patterns (1000 from each of two categories) in which only the first ten features are informative, in the following way. For patterns in category 1 , each of the first ten features are chosen randomly and uniformly from the range +1 xi 2, for i = 1, ..., 10. Conversely, for patterns in 2 , each of the first ten features are chosen randomly and uniformly from the range -2 xi -1. All other features from both categories are chosen from the range -2 xi +2. (b) Construct by hand the obvious separating hyperplane. (c) Adjust the learning rates so that your two algorithms have roughly the same convergence rate on the full training set when only the first ten features are considered. That is, assume each of the 2000 training patterns consists of just the first ten features. (d) Now apply your two algorithms to 2000 50-dimensional patterns, in which the first ten features are informative and the remaining 40 are not. Plot the total number of errors versus iteration. (e) Now apply your two algorithms to the full training set of 2000 100-dimensional patterns. (f) Summarize your answers to parts (c) - (e). Section 5.6 7. Consider relaxation methods. 5.12. COMPUTER EXERCISES 67 (a) Implement batch relaxation with margin (Algorithm 8), set b = 0.1 and a(1) = 0 and apply it to the data in 1 and 3 . Plot the criterion function as a function of the number of passes through the training set. (b) Repeat for b = 0.5 and a(1) = 0. Explainning? . . . . . . . 6.8.14 Stopped training . . . . . . . . . . . . . . . . . . . 1 3 3 4 8 8 10 11 15 15 16 16 17 17 19 19 20 21 23 24 25 26 28 29 29 30 31 32 32 32 33 34 34 36 36 37 37 38 39 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 6.8.15 How many hidden layers? . . . . . . . 6.8.16 Criterion function . . . . . . . . . . . 6.9 *Second-order methods . . . . . . . . . . . . 6.9.1 Hessian matrix . . . . . . . . . . . . . 6.9.2 Newton's method . . . . . . . . . . . . 6.9.3 Quickprop . . . . . . . . . . . . . . . . 6.9.4 Conjugate gradient descent . . . . . . Example 1: Conjugate gradient descent . . . . 6.10 *Additional networks and training methods . 6.10.1 Radial basis function networks (RBF) 6.10.2 Special bases . . . . . . . . . . . . . . 6.10.3 Time delay neural networks (TDNN) . 6.10.4 Recurrent networks . . . . . . . . . . . 6.10.5 Counterpropagation . . . . . . . . . . 6.10.6 Cascade-Correlation . . . . . . . . . . Algorithm 4: Cascade-correlation . . . . . . . 6.10.7 Neocognitron . . . . . . . . . . . . . . 6.11 Regularization and complexity adjustment . . 6.11.1 Complexity measurement . . . . . . . 6.11.2 Wald statistics . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . Bibliographical and Historical Remarks . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . Computer exercises . . . . . . . . . . . . . . . . . . Bibliography . . . . . . . . . . . . . . . . . . . . . Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CONTENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 40 41 41 42 42 43 44 46 46 47 47 48 49 50 50 51 51 53 53 54 56 58 64 67 74 Chapter 6 Multilayer Neural Networks 6.1 Introduction modifiable weights to output units. I sisting ofininput units connected abypowerful gradient descent method forThe LMS algorithm, particular, provided reducing the error, even when the patterns are not linearly separable. Unfortunately, the class of solutions that can be obtained from such networks -- hyperplane discriminants -- while surprisingly good on a range or real-world problems, is simply not general enough in demanding applications: there are many problems for which linear discriminants are insufficient for minimum error. With a clever choice of nonlinear functions, however, we can obtain arbitrary decisions, in particular the one leading to minimum error. The central difficulty is, naturally, choosing the appropriate nonlinear functions. One brute force approach might be to choose a complete basis set (all polynomials, say) but this will not work; such a classifier would have too many free parameters to be determined from a limited number of training patterns (Chap. ??). Alternatively, we may have prior knowledge relevant to the classification problem and this might guide our choice of nonlinearity. In the absence of such information, up to now we have seen no principled or automatic method for finding the nonlinearities. What we seek, then, is a way to learn the nonlinearity at the same time as the linear discriminant. 
This is the approach of multilayer neural networks (also called multilayer Perceptrons): the parameters governing the nonlinear mapping are learned at the same time as those governing the linear discriminant. We shall revisit the limitations of the two-layer networks of the previous chapter, and see how three-layer (and four-layer...) nets overcome those drawbacks -- indeed how such multilayer networks can, at least in principle, provide the optimal solution to an arbitrary classification problem. There is nothing particularly magical about multilayer neural networks; at base they implement linear discriminants, but in a space where the inputs have been mapped nonlinearly. The key power provided by such networks is that they admit fairly simple algorithms where the form of the nonlinearity n the previous chapter we saw a number of methods for training classifiers con- Some authors describe such networks as single layer networks because they have only one layer of modifiable weights, but we shall instead refer to them based on the number of layers of units. 3 4 CHAPTER 6. MULTILAYER NEURAL NETWORKS can be learned from training data. The models are thus extremely powerful, have nice theoretical properties, and apply well to a vast array of real-world applications. One of the most popular methods for training such multilayer networks is based backpropagation on gradient descent in error -- the backpropagation algorithm (or generalized delta rule), a natural extension of the LMS algorithm. We shall study backpropagation in depth, first of all because it is powerful, useful and relatively easy to understand, but also because many other training methods can be seen as modifications of it. The backpropagation training method is simple even for complex models (networks) having hundreds or thousands of parameters. 
In part because of the intuitive graphical representation and the simplicity of design of these models, practitioners can test different models quickly and easily; neural networks are thus a sort of "poor person's" technique for doing statistical pattern recognition with complicated models. The conceptual and algorithmic simplicity of backpropagation, along with its manifest success on many real-world problems, help to explain why it is a mainstay in adaptive pattern recognition. While the basic theory of backpropagation is simple, a number of tricks -- some a bit subtle -- are often used to improve performance and increase training speed. Choices involving the scaling of input values and initial weights, desired output values, and more can be made based on an analysis of networks and their function. We shall also discuss alternate training schemes, for instance ones that are faster, or adjust their complexity automatically in response to training data. Network architecture or topology plays an important role for neural net classification, and the optimal topology will depend upon the problem at hand. It is here that another great benefit of networks becomes apparent: often knowledge of the problem domain which might be of an informal or heuristic nature can be easily incorporated into network architectures through choices in the number of hidden layers, units, feedback connections, and so on. Thus setting the topology of the network is heuristic model selection. The practical ease in selecting models (network topologies) and estimating parameters (training via backpropagation) enable classifier designers to try out alternate models fairly simply. regularA deep problem in the use of neural network techniques involves regularization, ization complexity adjustment, or model selection, that is, selecting (or adjusting) the complexity of the network. 
Whereas the number of inputs and outputs is given by the feature space and number of categories, the total number of weights or parameters in the network is not -- or at least not directly. If too many free parameters are used, generalization will be poor; conversely if too few parameters are used, the training data cannot be learned adequately. How shall we adjust the complexity to achieve the best generalization? We shall explore a number of methods for complexity adjustment, and return in Chap. ?? to their theoretical foundations. It is crucial to remember that neural networks do not exempt designers from intimate knowledge of the data and problem domain. Networks provide a powerful and speedy tool for building classifiers, and as with any tool or technique one gains intuition and expertise through analysis and repeated experimentation over a broad range of problems. 6.2 hidden layer Feedforward operation and classification Figure 6.1 shows a simple three-layer neural network. This one consists of an input layer (having two input units), a hidden layer with (two hidden units) and an output 6.2. FEEDFORWARD OPERATION AND CLASSIFICATION 5 layer (a single unit), interconnected by modifiable weights, represented by links between layers. There is, furthermore, a single bias unit that is connected to each unit other than the input units. The function of units is loosely based on properties of biological neurons, and hence they are sometimes called "neurons." We are interested in the use of such networks for pattern recognition, where the input units represent the components of a feature vector (to be learned or to be classified) and signals emitted by output units will be discriminant functions used for classification. bias neuron We call any units that are neither input nor output units "hidden" because their activations are not directly "seen" by the external environment, i.e., the input or output. 6 recall CHAPTER 6. 
MULTILAYER NEURAL NETWORKS net activation We can clarify our notation and describe the feedforward (or classification or recall) operation of such a network on what is perhaps the simplest nonlinear problem: the exclusive-OR (XOR) problem (Fig. 6.1); a three-layer network can indeed solve this problem whereas a linear machine operating directly on the features cannot. Each two-dimensional input vector is presented to the input layer, and the output of each input unit equals the corresponding component in the vector. Each hidden unit performs the weighted sum of its inputs to form its (scalar) net activation or simply net. That is, the net activation is the inner product of the inputs with the weights at the hidden unit. For simplicity, we augment both the input vector (i.e., append a feature value x0 = 1) and the weight vector (i.e., append a value w0 ), and can then write d d netj = i=1 xi wji + wj0 = i=0 t xi wji wj x, (1) synapse where the subscript i indexes units on the input layer, j for the hidden; wji denotes the input-to-hidden layer weights at the hidden unit j. In analogy with neurobiology, such weights or connections are sometimes called "synapses" and the value of the connection the "synaptic weights." Each hidden unit emits an output that is a nonlinear function of its activation, f (net), i.e., yj = f (netj ). The example shows a simple threshold or sign (read "signum") function, f (net) = Sgn(net) 1 -1 if net 0 if net < 0, (3) (2) transfer function but as we shall see, other functions have more desirable properties and are hence more commonly used. This f () is sometimes called the transfer function or merely "nonlinearity" of a unit, and serves as a function discussed in Chap. ??. We have assumed the same nonlinearity is used at the various hidden and output units, though this is not crucial. 
Each output unit similarly computes its net activation based on the hidden unit signals as nH nH netk = j=1 yj wkj + wk0 = j=0 t yj wkj = wk y, (4) where the subscript k indexes units in the output layer (one, in the figure) and nH denotes the number of hidden units (two, in the figure). We have mathematically treated the bias unit as equivalent to one of the hidden units whose output is always y0 = 1. Each output unit then computes the nonlinear function of its net, emitting zk = f (netk ). (5) where in the figure we assume that this nonlinearity is also a sign function. It is these final output signals that represent the different discriminant functions. We would typically have c such output units and the classification decision is to label the input pattern with the label corresponding to the maximum yk = gk (x). In a two-category case such as XOR, it is traditional to use a single output unit and label a pattern by the sign of the output z. 6.2. FEEDFORWARD OPERATION AND CLASSIFICATION 7 z 1 0 -1 0 -1 0 1 -1 1 zk output k y1 1 0 -1 0 -1 0 1 -1 1 y2 -1 .7 -1.5 -.4 wkj hidden j 1 1 wji input i 1 0 -1 0 -1 0 1 -1 1 bias .5 1 1 x1 x2 x2 z= -1 R2 z=1 R1 x1 R2 z= -1 Figure 6.1: The two-bit parity or exclusive-OR problem can be solved by a three-layer network. At the bottom is the two-dimensional feature space x1 - x2 , and the four patterns to be classified. The three-layer network is shown in the middle. The input units are linear and merely distribute their (feature) values through multiplicative weights to the hidden units. The hidden and output units here are linear threshold units, each of which forms the linear sum of its inputs times their associated weight, and emits a +1 if this sum is greater than or equal to 0, and -1 otherwise, as shown by the graphs. Positive ("excitatory") weights are denoted by solid lines, negative ("inhibitory") weights by dashed lines; the weight magnitude is indicated by the relative thickness, and is labeled. 
The single output unit sums the weighted signals from the hidden units (and bias) and emits a +1 if that sum is greater than or equal to 0 and a -1 otherwise. Within each unit we show a graph of its input-output or transfer function -- f (net) vs. net. This function is linear for the input units, a constant for the bias, and a step or sign function elsewhere. We say that this network has a 2-2-1 fully connected topology, describing the number of units (other than the bias) in successive layers. 8 CHAPTER 6. MULTILAYER NEURAL NETWORKS It is easy to verify that the three-layer network with the weight values listed indeed solves the XOR problem. The hidden unit computing y1 acts like a Perceptron, and computes the boundary x1 + x2 + 0.5 = 0; input vectors for which x1 + x2 + 0.5 0 lead to y1 = 1, all other inputs lead to y1 = -1. Likewise the other hidden unit computes the boundary x1 + x2 - 1.5 = 0. The final output unit emits z1 = +1 if and only if both y1 and y2 have value +1. This gives to the appropriate nonlinear decision region shown in the figure -- the XOR problem is solved. 6.2.1 General feedforward operation expressive power From the above example, it should be clear that nonlinear multilayer networks (i.e., ones with input units, hidden units and output units) have greater computational or expressive power than similar networks that otherwise lack hidden units; that is, they can implement more functions. Indeed, we shall see in Sect. 6.2.2 that given sufficient number of hidden units of a general type any function can be so represented. Clearly, we can generalize the above discussion to more inputs, other nonlinearities, and arbitrary number of output units. For classification, we will have c output units, one for each of the categories, and the signal from each output unit is the discriminant function gk (x). We gather the results from Eqs. 1, 2, 4, & 5, to express such discriminant functions as: gk (x) zk = f nH d wkj f j=1 i=1 wji xi + wj0 + wk0 . 
(6) This, then, is the class of functions that can be implemented by a three-layer neural network. An even broader generalization would allow transfer functions at the output layer to differ from those in the hidden layer, or indeed even different functions at each individual unit. We will have cause to use such networks later, but the attendant notational complexities would cloud our presentation of the key ideas in learning in networks. 6.2.2 Expressive power of multilayer networks It is natural to ask if every decision can be implemented by such a three-layer network (Eq. 6). The answer, due ulpractical because for most problems we know ahead of time neither the number of hidden units required, nor the proper weight values. Even if there were a constructive proof, it would be of little use in pattern recognition since we do not know the desired function anyway -- it is related to the training patterns in a very complicated way. All in all, then, these results on the expressive power of networks give us confidence we are on the right track, but shed little practical light on the problems of designing and training neural networks -- their main benefit for pattern recognition (Fig. 6.3). x2 Two layer fl R1 R2 x1 x2 x2 x1 Three layer R1 R2 ... R2 R1 x1 x2 x1 Figure 6.3: Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex, nor simply connected. 6.3 Backpropagation algorithm We have just seen that any function from input to output can be implemented as a three-layer neural network. We now turn to the crucial problem of setting the weights based on training patterns and desired output. 6.3. 
BACKPROPAGATION ALGORITHM 11 Backpropagation is one of the simplest and most general methods for supervised training of multilayer neural networks -- it is the natural extension of the LMS algorithm for linear systems we saw in Chap. ??. Other methods may be faster or have other desirable properties, but few are more instructive. The LMS algorithm worked for two-layer systems because we had an error (proportional to the square of the difference between the actual output and the desired output) evaluated at the output unit. Similarly, in a three-layer net it is a straightforward matter to find how the output (and thus error) depends on the hidden-to-output layer weights. In fact this dependency is the same as in the analogous two-layer case, and thus the learning rule is the same. But how should the input-to-hidden weights be learned, the ones governing the nonlinear transformation of the input vectors? If the "proper" outputs for a hidden unit were known for any pattern, the input-to-hidden weights could be adjusted to approximate it. However, there is no explicit teacher to state what the hidden unit's output should be. This is called the credit assignment problem. The power of backpropagation is that it allows us to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights. Networks have two primary modes of operation: feedforward and learning. Feedforward operation, such as illustrated in our XOR example above, consists of presenting a pattern to the input units and passing the signals through the network in order to yield outputs from the output units. Supervised learning consists of presenting an input pattern as well as a desired, teaching or target pattern to the output layer and changing the network parameters (e.g., weights) in order to make the actual output more similar to the target one. Figure 6.4 shows a three-layer network and the notation we shall use. 
credit assignment target pattern 6.3.1 Network learning The basic approach in learning is to start with an untrained network, present an input training pattern and determine the output. The error or criterion function is some scalar function of the weights that is minimized when the network outputs match the desired outputs. The weights are adjusted to reduce this measure of error. Here we present the learning rule on a per pattern basis, and return to other protocols later. We consider the training error on a pattern to be the sum over output units of the squared difference between the desired output tk (given by a teacher) and the actual output zk , much as we had in the LMS algorithm for two-layer nets: c training error J(w) 1/2 k=1 (tk - zk )2 = 1/2(t - z)2 , (8) where t and z are the target and the network output vectors of length c; w represents all the weights in the network. cognition accuracy (Computer exercise ??). We describe the overall amount of pattern presentations by epoch -- the number of presentations of the full training set. For other variables being constant, the number of epochs is an indication of the relative amount of learning. The basic stochastic and batch protocols of backpropagation for n patterns are shown in the procedures below. Algorithm 1 (Stochastic backpropagation) 1 2 3 4 5 6 7 epoch begin initialize network topology (# hidden units), w, criterion , , m 0 do m m + 1 xm randomly chosen pattern wij wij + j xi ; wjk wjk + k yj until J(w) < return w end In the on-line version of backpropagation, line 3 of Algorithm 1 is replaced by sequential selection of training patterns (Problem 9). Line 5 makes the algorithm end when the change in the criterion function J(w) is smaller than some pre-set value . 
While this is perhaps the simplest meaningful stopping criterion, others generally lead to better performance, as we shall discuss in Sect. 6.8.14.

Some on-line training algorithms are considered models of biological learning, where the organism is exposed to the environment and cannot store all input patterns for multiple "presentations." The notion of epoch does not apply to on-line training, where instead the number of pattern presentations is a more appropriate measure.

CHAPTER 6. MULTILAYER NEURAL NETWORKS

In the batch version, all the training patterns are presented first and their corresponding weight updates summed; only then are the actual weights in the network updated. This process is iterated until some stopping criterion is met. So far we have considered the error on a single pattern, but in fact we want to consider an error defined over the entirety of patterns in the training set. With minor infelicities in notation we can write this total training error as the sum over the errors on n individual patterns:

J = Σ_{p=1}^{n} J_p.   (21)

In stochastic training, a weight update may reduce the error on the single pattern being presented, yet increase the error on the full training set. Given a large number of such individual updates, however, the total error as given in Eq. 21 decreases.

Algorithm 2 (Batch backpropagation)

1 begin initialize network topology (# hidden units), w, criterion θ, η, r ← 0
2   do r ← r + 1 (increment epoch)
3     m ← 0; Δw_ij ← 0; Δw_jk ← 0
4     do m ← m + 1
5       x^m ← select pattern
6       Δw_ij ← Δw_ij + η δ_j x_i;  Δw_jk ← Δw_jk + η δ_k y_j
7     until m = n
8     w_ij ← w_ij + Δw_ij;  w_jk ← w_jk + Δw_jk
9   until ∇J(w) < θ
10  return w
11 end

In batch backpropagation, we need not select patterns randomly, since the weights are updated only after all patterns have been presented once. We shall consider the merits and drawbacks of each protocol in Sect. 6.8.
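The accumulate-then-update structure of Algorithm 2 can be sketched as follows. To keep the sketch self-contained, a simple LMS-style per-pattern gradient for a linear unit stands in for the backpropagated updates; the data and learning rate are assumed.

```python
import numpy as np

# Toy data: three patterns with scalar targets, fit by a linear unit w.x
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])

def pattern_update(w, x, tp, eta):
    # LMS-style per-pattern update, standing in for eta * delta * input
    return eta * (tp - w @ x) * x

w = np.zeros(2)
eta = 0.1
for epoch in range(300):                  # outer loop: lines 2-9
    dw = np.zeros_like(w)                 # line 3: zero the accumulators
    for x, tp in zip(X, t):               # lines 4-7: sweep all n patterns
        dw += pattern_update(w, x, tp, eta)
    w = w + dw                            # line 8: one update per epoch
```

For these patterns the least-squares solution is w = (1, 2), and the single summed update per epoch converges there, illustrating that batch training follows the gradient of the total error of Eq. 21 rather than any single J_p.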
6.3.3 Learning curves

Because the weights are initialized with random values, error on the training set is large; through learning the error becomes lower, as shown in a learning curve (Fig. 6.6). The (per pattern) training error ultimately reaches an asymptotic value which depends upon the Bayes error, the amount of training data and the expressive power (e.g., the number of weights) of the network -- the higher the Bayes error and the fewer the weights, the higher this asymptotic value is likely to be (Chap. ??). Since batch backpropagation performs gradient descent in the criterion function, the training error decreases monotonically. The average error on an independent test set is virtually always higher than on the training set, and while it generally decreases, it can increase or oscillate.

Figure 6.6 also shows the average error on a validation set -- patterns not used directly for gradient descent training, and thus indirectly representative of novel patterns yet to be classified. The validation set can be used in a stopping criterion in both batch and stochastic protocols; gradient descent training on the training set is stopped when a minimum is reached in the validation error (e.g., near epoch 5 in the figure). We shall return in Chap. ?? to understand in greater depth why this version of the cross-validation stopping criterion often leads to networks having improved recognition accuracy.

Figure 6.6: A learning curve shows the criterion function as a function of the amount of training, typically indicated by the number of epochs or presentations of the full training set. We plot the average error per pattern, i.e., (1/n) Σ_{p=1}^{n} J_p. The validation error and the test (or generalization) error per pattern are virtually always higher than the training error. In some protocols, training is stopped at the minimum of the validation set error.
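The validation-based stopping rule can be sketched by scanning a learning curve for its minimum. The error values below are hypothetical, chosen only to mimic a U-shaped validation curve like that of Fig. 6.6.

```python
# Hypothetical per-epoch validation errors (U-shaped, as in Fig. 6.6)
val_errors = [0.90, 0.55, 0.38, 0.30, 0.27, 0.29, 0.33, 0.31, 0.36, 0.40]

def stop_epoch(errors, patience=3):
    # Return the epoch of minimum validation error, abandoning the scan
    # once the error has failed to improve for `patience` straight epochs.
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best

epoch, err = stop_epoch(val_errors)   # the minimum at the fifth epoch
```

The patience window is a practical detail (assumed here, not from the text): it keeps a brief oscillation in the validation error from being mistaken for the true minimum.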
6.4 Error surfaces

Since backpropagation is based on gradient descent in a criterion function, we can gain understanding and intuition about the algorithm by studying error surfaces themselves -- the function J(w). Of course, such an error surface depends upon the training and classification task; nevertheless there are some general properties of error surfaces that seem to hold over a broad range of real-world pattern recognition problems. One of the issues that concerns us is local minima; if many local minima plague the error landscape, then it is unlikely that the network will find the global minimum. Does this necessarily lead to poor performance? Another issue is the presence of plateaus -- regions where the error varies only slightly as a function of the weights. If such plateaus are plentiful, we can expect training according to Algorithms 1 & 2 to be slow. Since training typically begins with small weights, the error surface in the neighborhood of w ≈ 0 will determine the general direction of descent. What can we say about the error in this region? Most interesting real-world problems are of high dimensionality. Are there any general properties of high-dimensional error functions? We now explore these issues in some illustrative systems.

6.4.1 Some small networks

Consider the simplest three-layer nonlinear network, here solving a two-category problem in one dimension; this 1-1-1 sigmoidal network (with bias) is shown in Fig. 6.7. The data shown are linearly separable, and the optimal decision boundary (a point somewhat below x1 = 0) separates the two categories. During learning, the weights descend to the global minimum, and the problem is solved.
Figure 6.7: Six one-dimensional patterns (three in each of two classes) are to be learned by a 1-1-1 network with sigmoidal hidden and output units (and bias). The error surface as a function of w1 and w2 is also shown (for the case where the bias weights have their final values). The network starts with random weights, and through (stochastic) training descends to the global minimum in error, as shown by the trajectory. Note especially that a low-error solution exists, and that it leads to a decision boundary separating the training points into their two categories.

Here the error surface has a single (global) minimum, which yields the decision point separating the patterns of the two categories. Different plateaus in the surface correspond roughly to different numbers of patterns properly classified; the maximum number of such misclassified patterns is three in this example. The plateau regions, where weight change does not lead to a change in error, here correspond to sets of weights that lead to roughly the same decision point in the input space. Thus as w1 increases and w2 becomes more negative, the surface shows that the error does not change, a result that can be informally confirmed by looking at the network itself.

Now consider the same network applied to another, harder, one-dimensional problem -- one that is not linearly separable (Fig. 6.8). First, note that overall the error surface is slightly higher than in Fig. 6.7, because even the best solution attainable with this network leaves one pattern misclassified. As before, the different plateaus in error correspond to different numbers of training patterns properly learned. However, one must not confuse the (squared) error measure with classification error (cf. Chap. ??, Fig. ??).
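The qualitative features just described -- a plateau at small weights and a low-error basin at separating weights -- can be probed numerically. The sketch below uses hypothetical one-dimensional sample points and antisymmetric sigmoids, not the actual data of Fig. 6.7, and holds the bias weights at zero to examine a two-weight slice of the surface.

```python
import math

# Six hypothetical 1-D patterns, three per category (t = -1 or +1)
X = [-2.0, -1.5, -1.0, 0.5, 1.0, 2.0]
T = [-1, -1, -1, 1, 1, 1]

def J(w1, w2):
    # Squared-error criterion of a 1-1-1 net with antisymmetric sigmoids,
    # bias weights fixed at zero for this two-weight slice of the surface
    err = 0.0
    for x, t in zip(X, T):
        y = 2.0 / (1.0 + math.exp(-w1 * x)) - 1.0   # hidden-unit output
        z = 2.0 / (1.0 + math.exp(-w2 * y)) - 1.0   # network output
        err += 0.5 * (t - z) ** 2
    return err

# Near w = 0 the net outputs z = 0 for every pattern, so the surface sits
# on a plateau at J = 3; a separating solution such as (w1, w2) = (5, 5)
# lies far lower.
```

Evaluating J on a grid of (w1, w2) values in this way reproduces the plateau-and-basin structure shown in the figure.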
In Fig. 6.8, for instance, there are two general ways to misclassify exactly two patterns, and these give different errors. Incidentally, a 1-3-1 network (but not a 1-2-1 network) can solve this problem (Computer exercise 3).

From these very simple examples, where the correspondences among weight values, decision boundary and error are manifest, we can see that the error of the global minimum is lower when the problem can be solved, and that there are plateaus corresponding to sets of weights that lead to nearly the same decision boundary. Furthermore, the surface near w ≈ 0 (the traditional region for starting learning) has high error and happens in this case to have a large slope; even if the starting point had differed somewhat, the network would descend to the same final weight values.

Figure 6.8: As in Fig. 6.7, except that here the patterns are not linearly separable; the error surface is slightly higher than in that figure.

6.4.2 XOR

A somewhat more complicated problem is the XOR problem we have already considered. Figure ?? shows several two-dimensional slices through the nine-dimensional weight space of the 2-2-1 sigmoidal network (with bias). The slices shown include a global minimum in the error. Notice first that the error varies a bit more gradually as a function of a single weight than does the error in the networks solving the problems in Figs. 6.7 & 6.8. This is because in a large network any single weight has on average a smaller relative contribution to the output. Ridges, valleys and a variety of other shapes can all be seen in the surface. Several local minima exist in the high-dimensional weight space; here they correspond to solutions that classify three (but not four) patterns correctly. Although it is hard to show graphically, the error surface is invariant with respect to certain discrete permutations.
For instance, if the labels on the two hidden units are exchanged (and the weight values changed appropriately), the shape of the error surface is unaffected (Problem ??).

6.4.3 Larger networks

Alas, the intuition we gain from considering error surfaces for small networks gives only hints of what is going on in large networks, and at times can be quite misleading. Figure 6.10 shows a network with many weights solving a complicated high-dimensional two-category pattern classification problem. Here the error varies quite gradually as a single weight is changed, though we can get troughs, valleys, canyons, and a host of other shapes.

Figure 6.9: Two-dimensional slices through the nine-dimensional error surface after extensive training for a 2-2-1 network solving the XOR problem.

Whereas in low-dimensional spaces local minima can be plentiful, in high dimensions the problem of local minima is different: the high-dimensional space may afford more ways (dimensions) for the system to "get around" a barrier or local maximum during learning. In networks with many superfluous weights (i.e., more than are needed to learn the training set), one is less likely to get caught in local minima. However, networks with an unnecessarily large number of weights are undesirable because of the dangers of overfitting, as we shall see in Sect. 6.11.

6.4.4 How important are multiple minima?

The possibility of the presence of multiple local minima is one reason that we resort to iterative gradient descent -- analytic methods are highly unlikely to find a single global minimum, especially in high-dimensional weight spaces.
In computational practice, we do not want our network to be caught in a local minimum having high training error, since this usually indicates that key features of the problem have not been learned by the network. In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the net (Sect. 6.8). In many problems, convergence to a non-global minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network be converging toward the global minimum for acceptable performance (Sect. 6.8.14).

Figure 6.10: A network with xxx weights trained on data from a complicated pattern recognition problem xxx.

6.5 Backpropagation as feature mapping

Since the hidden-to-output layer leads to a linear discriminant, the novel computational power provided by multilayer neural nets can be attributed to the nonlinear warping of the input to the representation at the hidden units. Let us consider this transformation, again with the help of the XOR problem. Figure 6.11 shows a three-layer net addressing the XOR problem. For any input pattern in the x1-x2 space, we can show the corresponding outputs of the two hidden units in the y1-y2 space. With small initial weights, the net activation of each hidden unit is small, and thus the linear portion of their transfer function is used. Such a linear transformation from x to y leaves the patterns linearly inseparable (Problem 1). However, as learning progresses and the input-to-hidden weights increase in magnitude, the nonlinearities of the hidden units warp and distort the mapping from input to the hidden-unit space.
The linear decision boundary at the end of learning, found by the hidden-to-output weights, is shown by the straight dashed line; the problem that was nonlinearly separable at the inputs is transformed into one that is linearly separable at the hidden units.

We can illustrate such distortion with the three-bit parity problem, where the output should be +1 if the number of 1s in the input is odd, and -1 otherwise -- a generalization of the XOR or two-bit parity problem (Fig. 6.12). As before, early in learning the hidden units operate in their linear range and thus the representation after the hidden units remains linearly inseparable -- the patterns from the two categories lie at alternating vertexes of a cube. After learning, when the weights have become larger, the nonlinearities of the hidden units are expressed, and the patterns have been moved so that they can be linearly separated, as shown.

Figure 6.13 shows a two-dimensional two-category problem and the pattern representations in a 2-2-1 and in a 2-3-1 network of sigmoidal hidden units.

Figure 6.11: A 2-2-1 backpropagation network (with bias) and the four patterns of the XOR problem are shown at the top. The middle figure shows the outputs of the hidden units for each of the four patterns; these outputs move across the y1-y2 space as the full network learns. In this space, early in training (epoch 1) the two categories are not linearly separable. As the input-to-hidden weights learn, the categories become linearly separable. Also shown (by the dashed line) is the linear decision boundary determined by the hidden-to-output weights at the end of learning -- indeed the patterns of the two classes are separated by this boundary.
The bottom graph shows the learning curves -- the error on individual patterns and the total error as a function of epoch. While the error on each individual pattern does not decrease monotonically, the total training error does decrease monotonically.

Figure 6.12: A 3-3-1 backpropagation network (plus bias) can indeed solve the three-bit parity problem. Shown are the representation of the eight patterns at the hidden units (the y1-y2-y3 space) as the system learns, and the (planar) decision boundary found by the hidden-to-output weights at the end of learning. The patterns of the two classes are separated by this plane. The learning curve shows the error on individual patterns and the total error as a function of epoch.

Note that in the two-hidden-unit net, the categories are separated somewhat, but not enough for error-free classification; the expressive power of the net is not sufficiently high. In contrast, the three-hidden-unit net can separate the patterns. In general, given sufficiently many hidden units in a sigmoidal network, any set of distinct patterns can be learned in this way.

6.5.1 Representations at the hidden layer -- weights

In addition to focusing on the transformation of patterns, we can also consider the representation of the learned weights themselves. Since the hidden-to-output weights merely lead to a linear discriminant, it is instead the input-to-hidden weights that are most instructive. In particular, such weights at a single hidden unit describe the input pattern that leads to maximum activation of that hidden unit, analogous to a "matched filter." Because the hidden unit transfer functions are nonlinear, the correspondence with classical methods such as matched filters (and principal components, Sect. ??)
is not exact; nevertheless it is often convenient to think of the hidden units as finding feature groupings useful for the linear classifier implemented by the hidden-to-output layer weights.

Figure 6.14 shows the input-to-hidden weights (displayed as patterns) for a simple task of character recognition. Note that one hidden unit seems "tuned" for a pair of horizontal bars, while the other is tuned to a single lower bar. Both of these feature groupings are useful building blocks for the patterns presented. In complex, high-dimensional problems, however, the pattern of learned weights may not appear to be simply related to the features we suspect are appropriate for the task. This could be because we may be mistaken about which are the true, relevant feature groupings; interactions between features may be significant in a problem (and such interactions are not manifest in the patterns of weights at a single hidden unit); or the network may have too many weights (degrees of freedom), and thus the feature selectivity is low.

Figure 6.13: Seven patterns from a two-dimensional two-category nonlinearly separable classification problem are shown at the bottom. The figure at the top left shows the hidden unit representations of the patterns in a 2-2-1 sigmoidal network (with bias) fully trained to the global error minimum; the linear boundary implemented by the hidden-to-output weights is also shown. Note that the categories are almost linearly separable in this y1-y2 space, but one training point is misclassified. At the top right is the analogous hidden unit representation for a fully trained 2-3-1 network (with bias). Because of the higher dimension of the hidden layer representation, the categories are now linearly separable; indeed the learned hidden-to-output weights implement a plane that separates the categories.
It is generally much harder to represent the hidden-to-output layer weights in terms of the input features. Not only do the hidden units themselves already encode a somewhat abstract pattern; there is, moreover, no natural ordering of the hidden units. Together with the fact that the outputs of the hidden units are nonlinearly related to the inputs, this makes analyzing hidden-to-output weights somewhat problematic. Often the best we can do is list the patterns of input

wed in a single learning step. Thus, for rapid and uniform learning, we should calculate the second derivative of the criterion function with respect to each weight and set the optimal learning rate separately for each weight. We shall return in Sect. ?? to calculate second derivatives in networks, and to alternative descent and training methods such as Quickprop that give fast, uniform learning. For typical problems addressed with sigmoidal networks and the parameters discussed throughout this section, it is found that a learning rate of η ≈ 0.1 is often adequate as a first choice; it is lowered if the criterion function diverges, or raised if learning seems unduly slow.

6.8.10 Momentum

Error surfaces often have plateaus -- regions in which the slope dJ(w)/dw is very small -- for instance because of "too many" weights. Momentum -- loosely based on the notion from physics that moving objects tend to keep moving unless acted upon by outside forces -- allows the network to learn more quickly when plateaus in the error surface exist. The approach is to alter the learning rule in stochastic backpropagation to include some fraction α of the previous weight update:

w(m + 1) = w(m) + Δw(m) + α Δw(m - 1).   (36)

Of course, α must be less than 1.0 for stability; a typical value is α ≈ 0.9. It must be stressed that momentum rarely changes the final solution, but merely allows it to be found more rapidly.
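A minimal numerical sketch of the momentum rule (the quadratic criterion, the rates and the iteration count are all assumed): on a surface with a nearly flat direction, momentum reaches the minimum far sooner than plain gradient descent at the same learning rate.

```python
import numpy as np

H = np.diag([0.01, 1.0])       # quadratic J(w) = 1/2 w^T H w; the first
                               # coordinate is a long, nearly flat plateau
eta, alpha = 0.5, 0.9          # learning rate and momentum (assumed values)

w_plain = np.array([10.0, 10.0])
w_mom = np.array([10.0, 10.0])
prev = np.zeros(2)
for m in range(300):
    w_plain = w_plain - eta * (H @ w_plain)      # plain gradient descent
    dw = -eta * (H @ w_mom) + alpha * prev       # Eq. 36: add a fraction
    w_mom, prev = w_mom + dw, dw                 # of the previous update

# After the same number of steps, the momentum iterate is essentially at
# the minimum w = 0, while the plain iterate is still on the plateau.
```

The flat direction (curvature 0.01) is what makes plain descent crawl; the accumulated momentum lets successive small gradients add up along it.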
Momentum provides another benefit: it effectively "averages out" stochastic variations in the weight updates during stochastic learning, and thereby speeds learning even far from error plateaus (Fig. 6.20).

Figure 6.20: The incorporation of momentum into stochastic gradient descent by Eq. 36 (white arrows) reduces the variation in overall gradient directions and speeds learning, especially over plateaus in the error surface.

Algorithm 3 shows one way to incorporate momentum into gradient descent.

Algorithm 3 (Stochastic backpropagation with momentum)

1 begin initialize topology (# hidden units), w, criterion θ, α (< 1), η, m ← 0, b_ji ← 0, b_kj ← 0
2   do m ← m + 1
3     x^m ← randomly chosen pattern
4     b_ji ← η δ_j x_i + α b_ji;  b_kj ← η δ_k y_j + α b_kj
5     w_ji ← w_ji + b_ji;  w_kj ← w_kj + b_kj
6   until ∇J(w) < θ
7   return w
8 end

6.8.11 Weight decay

One method of simplifying a network and avoiding overfitting is to impose a heuristic that the weights should be small. There is no principled reason why such a method of "weight decay" should always lead to improved network performance (indeed there are occasional cases where it leads to degraded performance), but it is found to help in most cases. The basic approach is to start with a network with "too many" weights (or hidden units) and "decay" all weights during training. Small weights favor models that are more nearly linear (Problems 1 & 41). One of the reasons weight decay is so popular is its simplicity. After each weight update, every weight is simply "decayed" or shrunk according to:

w_new = w_old (1 - ε),   (37)

where 0 < ε < 1. In this way, weights that are not needed for reducing the criterion function become smaller and smaller, possibly to such a small value that they can be eliminated altogether. Those weights that are needed to solve the problem cannot decay indefinitely. In weight decay, then, the system achieves a balance between pattern error (Eq.
60) and some measure of the overall weight magnitudes. It can be shown (Problem 43) that weight decay is equivalent to gradient descent in a new effective error or criterion function:

J_ef = J(w) + (ε / 2η) w^t w.   (38)

The second term on the right-hand side of Eq. 38 preferentially penalizes a single large weight. Another version of weight decay includes a decay parameter that depends upon the value of the weight itself, and this tends to distribute the penalty throughout the network:

ε_mr = (λ / 2) / (1 + w_mr^2)^2.   (39)

We shall discuss principled methods for setting ε and λ, and see how weight decay is an instance of a more general regularization procedure, in Chap. ??.

6.8.12 Hints

Often we have insufficient training data for adequate classification accuracy, and we would like to add information or constraints to improve the network. The approach of learning with hints is to add output units for addressing an ancillary problem, one related to the classification problem at hand. The expanded network is trained on the classification problem of interest and the ancillary one, possibly simultaneously. For instance, suppose we seek to train a network to classify c phonemes based on some acoustic input. In a standard neural network we would have c output units. In learning with hints, we might add two ancillary output units, one representing vowels and the other consonants. During training, the target vector must be lengthened to include components for the hint outputs. During classification the hint units are not used; they and their hidden-to-output weights can be discarded (Fig. 6.21).

Figure 6.21: In learning with hints, the output layer of a standard network having c output units (discriminant functions) is augmented with hint units. During training, the target vectors are also augmented with signals for the hint units. In this way the input-to-hidden weights learn improved feature groupings.
During classification the hint units are not used, and thus they and their hidden-to-output weights are removed from the trained network.

The benefit provided by hints is improved feature selection. So long as the hints are related to the classification problem at hand, the feature groupings useful for the hint task are likely to aid category learning. For instance, the feature groupings useful for distinguishing vowel sounds from consonants in general are likely to be useful for distinguishing the /b/ from /oo/ or the /g/ from /ii/ categories in particular. Alternatively, one can train just the hint units in order to develop improved hidden unit representations (Computer exercise 16). Learning with hints illustrates another benefit of neural networks: hints are more easily incorporated into neural networks than into classifiers based on other algorithms, such as the nearest-neighbor or MARS.

6.8.13 On-line, stochastic or batch training?

Each of the three leading training protocols described in Sect. 6.3.2 has strengths and drawbacks. On-line learning is used when the amount of training data is so large, or memory costs so high, that storing the data is prohibitive. Most practical neural network classification problems are addressed instead with batch or stochastic protocols.

Batch learning is typically slower than stochastic learning. To see this, imagine a training set of 50 patterns that consists of 10 copies each of five patterns (x1, x2, ..., x5). In batch learning, the presentations of the duplicates of x1 provide no more information than a single presentation of x1 in the stochastic case. Suppose, for example, that in the batch case the learning rate is set optimally; the same weight change can then be achieved with just a single batch presentation of each of the five different patterns (with the learning rate correspondingly greater).
Of course, true problems do not have exact duplicates of individual patterns; nevertheless, true data sets are generally highly redundant, and the above analysis still holds. For most applications -- especially ones employing large redundant training sets -- stochastic training is hence to be preferred. Batch training admits some second-order techniques that cannot be easily incorporated into stochastic learning protocols, and in some problems it should be preferred, as we shall see in Sect. ??.

6.8.14 Stopped training

In three-layer networks having many weights, excessive training can lead to poor generalization, as the net implements a complex decision boundary "tuned" to the specific training data rather than to the general properties of the underlying distributions. In training the two-layer networks of Chap. ??, we could train as long as we liked without fear of degrading final recognition accuracy, because the complexity of the decision boundary does not change -- it is always simply a hyperplane. (This example shows that the general phenomenon should be called "overfitting," and not "overtraining.") Because the network weights are initialized with small values, the units operate in their linear range and the full network implements linear discriminants. As training progresses, the nonlinearities of the units are expressed and the decision boundary warps. Qualitatively speaking, then, stopping the training before gradient descent is complete can help avoid overfitting.

In practice, the elementary criterion of stopping when the error function decreases by less than some preset value (e.g., line ?? in Algorithm ??) does not lead reliably to accurate classifiers, as it is hard to know beforehand what an appropriate threshold should be. A far more effective method is to stop training when the error on a separate validation set reaches a minimum (Fig. ??).
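Stopping early also leaves the network with weights of smaller magnitude than extensive training would. A one-weight sketch (the quadratic criterion, learning rate and decay constant are all assumed) compares early stopping, full training, and the weight decay of Eq. 37:

```python
eta, eps = 0.1, 0.05            # learning rate and decay constant (assumed)

def train(steps, decay=0.0):
    # Gradient descent on J(w) = 1/2 (1 - w)^2 from w = 0 (small init),
    # optionally shrinking w by (1 - decay) after each step (Eq. 37)
    w = 0.0
    for _ in range(steps):
        w += eta * (1.0 - w)
        w *= (1.0 - decay)
    return w

w_stopped = train(5)            # stopped training: weight still small
w_full = train(500)             # extensive training: w near the minimum w* = 1
w_decayed = train(500, eps)     # weight decay: balances below w* = 1
```

Both stopping and decay leave |w| below its fully trained value, which is the sense in which the two procedures act alike when weights start small.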
We shall explore the theory underlying this version of cross validation in Chap. ??. We note in passing that weight decay is equivalent to a form of stopped training (Fig. 6.22).

Figure 6.22: When weights are initialized with small magnitudes, stopped training is equivalent to a form of weight decay, since the final weights are smaller than they would be after extensive training.

6.8.15 How many hidden layers?

The backpropagation algorithm applies equally well to networks with three, four, or more layers, so long as the units in those layers have differentiable transfer functions. Since, as we have seen, three layers suffice to implement any arbitrary function, we would need special problem conditions or requirements to recommend the use of more than three layers. One such requirement is translation, rotation or other distortion invariance. If the input layer represents the pixel image in an optical character recognition problem, we generally want such a recognizer to be invariant with respect to such transformations. It is easier for a three-layer net to accept small translations than large ones. In practice, then, networks with several hidden layers distribute the invariance task throughout the net. Naturally, the weight initialization, learning rate and data preprocessing arguments apply to these networks too. The Neocognitron network architecture (Sect. 6.10.7) has many layers for just this reason (though it is trained by a method somewhat different from backpropagation). It has been found empirically that networks with multiple hidden layers are more prone to getting caught in undesirable local minima. In the absence of a problem-specific reason for multiple hidden layers, it is simplest to proceed using just a single hidden layer.

6.8.16 Criterion function

The squared error criterion of Eq.
8 is the most common training criterion because it is simple to compute, non-negative, and simplifies the proofs of some theorems. Nevertheless, other training criteria occasionally have benefits. One popular alternative is the cross entropy, which for n patterns is of the form:

J_ce(w) = Σ_{m=1}^{n} Σ_{k=1}^{c} t_mk ln(t_mk / z_mk),   (40)

where t_mk and z_mk are the target and the actual output of unit k for pattern m. Of course, this criterion function requires both the teaching and the output values to lie in the range (0, 1).

Regularization and overfitting avoidance are generally achieved by penalizing the complexity of models or networks (Chap. ??). In regularization, the training error and the complexity penalty should be of related functional forms. Thus if the pattern error is a sum of squares, then a reasonable network penalty would be the squared length of the total weight vector (Eq. 38). Likewise, if the model penalty is some description length (measured in bits), then a pattern error based on cross entropy would be appropriate (Eq. 40).

the rule:

Δw(m + 1) = ( dJ/dw |_m / ( dJ/dw |_{m-1} - dJ/dw |_m ) ) Δw(m).   (51)

If the third- and higher-order terms in the error are non-negligible, or if the assumption of weight independence does not hold, then the computed error minimum will not equal the true minimum, and further weight updates will be needed. When a number of obvious heuristics are imposed -- to reduce the effects of estimation error when the surface is nearly flat, or when the step actually increases the error -- the method can be significantly faster than standard backpropagation. Another benefit is that each weight has, in effect, its own learning rate, and thus weights tend to converge at roughly the same time, thereby reducing problems due to nonuniform learning.

Figure 6.23: The quickprop weight update takes the error derivatives at two points separated by a known amount and, by Eq. 51, computes its next weight value.
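For a criterion that is exactly quadratic in a single weight, one application of Eq. 51 jumps straight to the minimum. A sketch (the quadratic and the two starting weight values are assumed):

```python
def dJ(w):
    # derivative of the assumed quadratic criterion J(w) = 3/2 (w - 2)^2
    return 3.0 * (w - 2.0)

w_prev, w = 0.0, 0.5                    # two successive weight values
dw = w - w_prev                         # previous update, Delta-w(m)
g_prev, g = dJ(w_prev), dJ(w)           # derivatives at the two points
dw_next = g / (g_prev - g) * dw         # quickprop update, Eq. 51
w_next = w + dw_next
```

Here g_prev = -6 and g = -4.5, so the update is 3 times the previous step and lands exactly on the minimum at w = 2; for non-quadratic criteria the jump is only approximate, hence the heuristics mentioned above.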
If the error can be fully expressed as a second-order function, then the weight update leads to the weight w* giving minimum error.

6.9.4 Conjugate gradient descent

Another fast learning method is conjugate gradient descent, which employs a series of line searches in weight or parameter space. One picks the first descent direction (for instance, determined by the gradient) and moves along that direction until the minimum in error is reached. The second descent direction is then computed: this direction -- the "conjugate direction" -- is the one along which the gradient does not change its direction, but merely its magnitude, during the next descent. Descent along this direction will not "spoil" the contribution from the previous descent iterations (Fig. 6.24).

Figure 6.24: Conjugate gradient descent in weight space employs a sequence of line searches. If Δw(1) is the first descent direction, the second direction obeys Δw^t(1) H Δw(2) = 0. Note especially that along this second descent, the gradient changes only in magnitude, not direction; as such the second descent does not "spoil" the contribution of the previous line search. In the case where the Hessian is proportional to the identity matrix (right), the directions of the line searches are orthogonal.

More specifically, let Δw(m-1) represent the direction of a line search on step m-1. (Note especially that this denotes a direction, not an overall magnitude of change; the magnitude is determined by the line search.) We demand that the subsequent direction, Δw(m), obey

Δw^t(m-1) H Δw(m) = 0,    (52)

where H is the Hessian matrix. Pairs of descent directions that obey Eq. 52 are called "conjugate."
If the Hessian is proportional to the identity matrix, then such directions are orthogonal in weight space. Conjugate gradient descent requires batch training, since the Hessian matrix is defined over the full training set. The descent direction on iteration m is in the direction of the gradient plus a component along the previous descent direction:

Δw(m) = -∇J(w(m)) + β_m Δw(m-1),    (53)

where the relative proportions of these contributions are governed by β_m. This proportion can be derived by ensuring that the descent direction on iteration m does not spoil that from direction m-1, and indeed all earlier directions. It is generally calculated in one of two ways. The first formula (Fletcher-Reeves) is

β_m = [∇J(w(m))]^t ∇J(w(m)) / ([∇J(w(m-1))]^t ∇J(w(m-1))).    (54)

A slightly preferable formula (Polak-Ribiere), more robust for non-quadratic error functions, is

β_m = [∇J(w(m))]^t [∇J(w(m)) - ∇J(w(m-1))] / ([∇J(w(m-1))]^t ∇J(w(m-1))).    (55)

Equations 53 & 36 show that the conjugate gradient descent algorithm is analogous to calculating a "smart" momentum, where β_m plays the role of a momentum. If the error function is quadratic, then convergence of conjugate gradient descent is guaranteed when the number of iterations equals the total number of weights.

Example 1: Conjugate gradient descent

Consider finding the minimum of a simple quadratic criterion function centered on the origin of weight space, J(w) = 0.2 w1^2 + w2^2 = w^t H w, where by simple differentiation the Hessian is found to be H = [0.2 0; 0 1]. We start the descent at a randomly selected position, which happens to be w(0) = (-8, -4)^t, as shown in the figure. The first descent direction is determined by the simple gradient, which is easily found to be -∇J(w(0)) = -(0.4 w1(0), 2 w2(0))^t = (3.2, 8)^t. In typical complex problems in high dimensions, the minimum along this direction would be found using a line search; in this simple case it can be found by calculus.
We let s represent the distance along the first descent direction, and find its value at the minimum of J(w) according to:

d/ds [ (w(0) + s (3.2, 8)^t)^t H (w(0) + s (3.2, 8)^t) ] = 0,

which has solution s = 0.562. Therefore the minimum along this direction is

w(1) = w(0) + 0.562 (-∇J(w(0))) = (-8, -4)^t + 0.562 (3.2, 8)^t = (-6.202, 0.496)^t.

Now we turn to the use of conjugate gradients for the next descent. The simple gradient evaluated at w(1) is

-∇J(w(1)) = -(0.4 w1(1), 2 w2(1))^t = (2.48, -0.99)^t.

(It is easy to verify that this direction, shown as a black arrow in the figure, does not point toward the global minimum at w = (0, 0)^t.) We use the Fletcher-Reeves formula (Eq. 54) to construct the conjugate gradient direction:

β_1 = [∇J(w(1))]^t ∇J(w(1)) / ([∇J(w(0))]^t ∇J(w(0))) = 7.13 / 74.24 = 0.096.

Incidentally, for this quadratic error surface the Polak-Ribiere formula (Eq. 55) would give the same value. Thus the conjugate descent direction is

Δw(1) = -∇J(w(1)) + β_1 (3.2, 8)^t = (2.788, -0.223)^t.

Conjugate gradient descent in a quadratic error landscape, shown in a contour plot, starts at a random point w(0) and descends by a sequence of line searches. The first direction is given by the standard gradient and terminates at a minimum of the error -- the point w(1). Standard gradient descent from w(1) would be along the black vector, "spoiling" some of the gains made by the first descent; it would, furthermore, miss the global minimum. Instead, the conjugate gradient direction (red vector) does not spoil the gains from the first descent, and properly passes through the global error minimum at w = (0, 0)^t.
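The arithmetic above, and the second line search that completes the example, can be verified with a short computation. The sketch below (our illustration, not code from the text) runs Fletcher-Reeves conjugate gradient descent with exact line searches on the example's criterion J(w) = 0.2 w1^2 + w2^2.

```python
# Fletcher-Reeves conjugate gradient with exact line searches on the quadratic
# criterion of Example 1, J(w) = 0.2*w1**2 + w2**2.  Pure-Python illustration.

def grad(w):                              # gradient of J: (0.4*w1, 2*w2)
    return [0.4 * w[0], 2.0 * w[1]]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def curv(d):                              # d^t H' d with H' = second derivative matrix diag(0.4, 2)
    return 0.4 * d[0] ** 2 + 2.0 * d[1] ** 2

w = [-8.0, -4.0]                          # w(0), the starting point of Example 1
g = grad(w)
d = [-g[0], -g[1]]                        # first direction: the negative gradient
for step in range(2):                     # two steps suffice in two dimensions
    s = -dot(g, d) / curv(d)              # exact line-search distance for a quadratic J
    w = [w[0] + s * d[0], w[1] + s * d[1]]
    g_new = grad(w)
    beta = dot(g_new, g_new) / dot(g, g)  # Fletcher-Reeves formula (Eq. 54)
    d = [-g_new[0] + beta * d[0], -g_new[1] + beta * d[1]]  # conjugate direction (Eq. 53)
    g = g_new
    print(step, round(s, 3), [round(x, 3) for x in w])
    # step 0 reproduces s = 0.562 and w(1) = (-6.202, 0.496); step 1 reaches the origin
```

The first iteration reproduces the numbers computed by calculus above, and the second lands on the global minimum, up to floating-point rounding.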
As above, rather than perform a traditional line search, we use calculus to find the error minimum along this second descent direction:

d/ds [ (w(1) + s Δw(1))^t H (w(1) + s Δw(1)) ] = 0,

which has solution s = 2.231. This yields the next minimum:

w(2) = w(1) + s Δw(1) = (-6.202, 0.496)^t + 2.231 (2.788, -0.223)^t = (0, 0)^t.

Indeed, the conjugate gradient search finds the global minimum of this quadratic error function in two search steps -- the number of dimensions of the space.

6.10 *Additional networks and training methods

The elementary method of gradient descent used by backpropagation can be slow, even with straightforward improvements. We now consider some alternative networks and training methods.

6.10.1 Radial basis function networks (RBF)

We have already considered several classifiers, such as Parzen windows, that employ densities estimated by localized basis functions such as Gaussians. In light of our discussion of gradient descent, and backpropagation in particular, we now turn to a different method for training such networks. A radial basis function network with a linear output unit implements

z_k(x) = Σ_{j=0}^{n_H} w_kj φ_j(x),    (56)

where we have included a j = 0 bias unit. If we define a vector φ whose components are the hidden unit outputs, and a matrix W whose entries are the hidden-to-output weights, then Eq. 56 can be rewritten as z(x) = Wφ. Minimizing the criterion function

J(w) = (1/2) Σ_{m=1}^{n} (y(x^m; w) - t^m)^2    (57)

is formally equivalent to the linear problem we saw in Chap. ??. [...] translation constraint is also imposed between the hidden and output layer units.

6.10.4 Recurrent networks

Up to now we have considered only networks with feedforward flow of information during classification; the only feedback flow was of error signals during training. Now we turn to feedback or recurrent networks.
In their most general form, these have found greatest use in time series prediction, but we consider here just one specific type of recurrent net that has had some success in static classification tasks. Figure 6.26 illustrates such an architecture, one in which the output unit values are fed back and duplicated as auxiliary inputs, augmenting the traditional feature values. During classification, a static pattern x is presented to the input units, the feedforward flow computed, and the outputs fed back as auxiliary inputs. This, in turn, leads to a different set of hidden unit activations, new output activations, and so on. Ultimately, the activations stabilize, and the final output values are used for classification. As such, this recurrent architecture, if "unfolded" in time, is equivalent to the static network shown at the right of the figure, where it must be emphasized that many sets of weights are constrained to be the same (weight sharing), as indicated. This unfolded representation shows that recurrent networks can be trained via standard backpropagation, but with the weight sharing constraint imposed, as in TDNNs.

Figure 6.26: The form of recurrent network most useful for static classification has the architecture shown at the bottom, with the recurrent connections in red. It is functionally equivalent to a static network with many hidden layers and extensive weight sharing, as shown above. Note that the input is replicated.

6.10.5 Counterpropagation

Occasionally one wants a rapid prototype of a network, yet one with expressive power greater than that of a mere two-layer network. Figure 6.27 shows a three-layer net, which consists of familiar input, hidden, and output layers. When one is learning the weights for a pattern in category i, [...] In this way, the hidden units create a Voronoi tessellation (cf. Chap.
??), and the hidden-to-output weights pool information from the centers of such Voronoi cells. The processing at the hidden units is competitive learning (Chap. ??). The speedup in counterpropagation comes because only the weights from the single most active hidden unit are adjusted during a pattern presentation. While this can yield suboptimal recognition accuracy, counterpropagation can be orders of magnitude faster than full backpropagation. As such, it can be useful during preliminary data exploration. Finally, the learned weights often provide an excellent starting point for refinement by subsequent full training via backpropagation. The name "counterpropagation" comes from an earlier implementation that employed five layers, with signals that passed bottom-up as well as top-down.

Figure 6.27: The simplest version of a counterpropagation network consists of three layers. During training, an input is presented and the most active hidden unit is determined. The only weights that are modified are the input-to-hidden weights leading to this most active hidden unit and the single hidden-to-output weight leading to the proper category. Weights can be trained using an LMS criterion.

6.10.6 Cascade-Correlation

The central notion underlying the training of networks by cascade-correlation is quite simple. We begin with a two-layer network and train to a minimum of an LMS error. If the resulting training error is low enough, training is stopped. In the more common case in which the error is not low enough, we fix the weights but add a single hidden unit, fully connected from the inputs and to the output units. These new weights are then trained using an LMS criterion. If the resulting error is not sufficiently low, yet another hidden unit is added, and so on. [...] Cascade-correlation and counterpropagation are generally faster than backpropagation.
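A single counterpropagation training step might be sketched as follows. The data, learning rates, and helper names below are invented for the illustration; only the two updates -- move the winning hidden unit's prototype toward the pattern (competitive learning) and nudge that unit's output weight toward the target (LMS) -- follow the description above.

```python
import random

def train_step(prototypes, out_w, x, target, eta_in=0.1, eta_out=0.1):
    """One counterpropagation update: only the most active (winning) hidden unit learns."""
    # find the most active hidden unit, here the nearest prototype
    j = min(range(len(prototypes)),
            key=lambda k: sum((p - xi) ** 2 for p, xi in zip(prototypes[k], x)))
    # move its input-to-hidden weights toward the pattern (competitive learning)
    prototypes[j] = [p + eta_in * (xi - p) for p, xi in zip(prototypes[j], x)]
    # LMS update of its single hidden-to-output weight toward the target
    out_w[j] += eta_out * (target - out_w[j])
    return j

random.seed(0)
prototypes = [[random.random(), random.random()] for _ in range(3)]
out_w = [0.0, 0.0, 0.0]
for _ in range(100):                 # two invented clusters with labels -1 and +1
    x, t = random.choice([([0.0, 0.0], -1.0), ([1.0, 1.0], +1.0)])
    train_step(prototypes, out_w, x, t)
print([round(w, 2) for w in out_w])  # the winning units' output weights approach the targets
```

Because only the winner is updated per presentation, each step costs a nearest-prototype search plus two small updates, which is the source of the speedup noted in the text.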
Complexity adjustment: weight decay; the Wald statistic, which for networks takes the form of optimal brain damage and optimal brain surgeon, methods that use a second-order approximation to the true saliency as a pruning criterion.

Bibliographical and Historical Remarks

McCulloch and Pitts provided the first principled mathematical and logical treatment of the behavior of networks of simple neurons [49]. This pioneering work addressed non-recurrent as well as recurrent nets (those possessing "circles," in their terminology), but not learning. Its concentration on the all-or-none or threshold function of neurons indirectly delayed the consideration of the continuous-valued neurons that would later dominate the field. These authors later wrote an extremely important paper on featural mapping (cf. Chap. ??), invariances, and learning in nervous systems, and thereby advanced the conceptual development of pattern recognition significantly [56]. Rosenblatt's work on the (two-layer) Perceptron (cf. Chap. ??) [61, 62] was some of the earliest to address learning, and was the first to include rigorous proofs about convergence. A number of stochastic methods, including Pandemonium [66, 67], were developed for training networks with several layers of processors, though in keeping with the preoccupation with threshold functions, such processors generally computed logical functions (AND or OR), rather than the continuous functions favored in later neural network research. The limitations of networks implementing linear discriminants -- linear machines -- were well known in the 1950s and 1960s and discussed by both their promoters [62, cf. Chapter xx, "Summary of Three-Layer Series-Coupled Systems: Capabilities and Deficiencies"] and their detractors [51, cf. Chapter 5, "CONNECTED: A Geometric Property with Unbounded Order"].
A popular early method was to design by hand three-layer networks with fixed input-to-hidden weights, and then train the hidden-to-output weights [80, for a review]. Much of the difficulty in finding learning algorithms for all layers of a multilayer neural network came from the prevalent use of linear threshold units. Since these do not have useful derivatives throughout their entire range, the now-standard approach of applying the chain rule for derivatives and the resulting "backpropagation of errors" did not gain adherents earlier. The development of backpropagation was gradual, with several steps, not all of which were appreciated or used at the time. The earliest application of adaptive methods that would ultimately become backpropagation came from the field of control. Kalman filtering from electrical engineering [38, 28] used an analog error (the difference between predicted and measured output) for adjusting gain parameters in predictors. Bryson, Denham and Dreyfus showed how Lagrangian methods could train multilayer networks for control, as described in [6]. We saw in the last chapter the work of Widrow, Hoff and their colleagues [81, 82] in using analog signals and the LMS training criterion applied to pattern recognition in two-layer networks. Werbos [77][78, Chapter 2], too, discussed a method for calculating the derivatives of a function based on a sequence of samples (as in a time series), which, if interpreted carefully, carried the key ideas of backpropagation. Parker's early "Learning logic" [53, 54], developed independently, showed how layers of linear units could be trained given a sufficient number of input-output pairs. This work lacked simulations on representative or challenging problems (such as XOR) and was not adequately appreciated.
Le Cun independently developed a learning algorithm for three-layer networks [9, in French] in which target values are propagated rather than derivatives; the resulting learning algorithm is equivalent to standard backpropagation, as was pointed out shortly thereafter [10]. Without question, the paper by Rumelhart, Hinton and Williams [64], later expanded into a full and readable chapter [65], brought the backpropagation method to the attention of the widest audience. These authors clearly appreciated the power of the method, demonstrated it on key tasks (such as the exclusive-OR), and applied it to pattern recognition more generally. An enormous number of papers and books of applications -- from speech production and perception, optical character recognition, data mining, finance, game playing and much more -- continues unabated. One novel class of applications for such networks includes generalization for production [20, 21]. One view of the history of backpropagation appears in [78]; two collections of key papers in the history of neural processing more generally, including many in pattern recognition, are [3, 2]. Clear elementary papers on neural networks can be found in [46, 36], and several good textbooks, which differ from the current one in their emphasis on neural networks over other pattern recognition techniques, can be recommended [4, 60, 29, 27]. An extensive treatment of the mathematical aspects of networks, much of which is beyond that needed for mastering the use of networks for pattern classification, can be found in [19]. There is continued exploration of the strong links between networks and more standard statistical methods; White presents an overview [79], and books such as [8, 68] explore a number of close relationships.
The important relation of multilayer Perceptrons to Bayesian methods and probability estimation can be found in [23, 59, 43, 5, 13, 63, 52]. Original papers on projection pursuit and MARS can be found in [15] and [34], respectively, and a good overview in [60]. Shortly after its wide dissemination, the backpropagation algorithm was criticized for its lack of biological plausibility; in particular, Grossberg [22] discussed the non-local nature of the algorithm, i.e., that synaptic weight values were transported without physical means. Somewhat later, Stork devised a local implementation of backpropagation [71, 45], and pointed out that it was nevertheless highly implausible as a biological model. The discussions and debates over the relevance of Kolmogorov's Theorem [39] to neural networks, e.g., [18, 40, 41, 33, 37, 12, 42], have centered on expressive power. The proof of the universal expressive power of three-layer nets based on bumps and Fourier ideas appears in [31]. The expressive power of networks having non-traditional transfer functions was explored in [72, 73] and elsewhere. The fact that three-layer networks can have local minima in the criterion function was explored in [50], and some of the properties of error surfaces are illustrated in [35]. The Levenberg-Marquardt approximation and deeper analysis of second-order methods can be found in [44, 48, 58, 24]. Three-layer networks trained via cascade-correlation have been shown to perform well compared to standard three-layer nets trained via backpropagation [14]. Our presentation of counterpropagation networks focused on just three of the five layers in a full such network [30]. Although little that was new from a learning-theory standpoint was presented in Fukushima's Neocognitron [16, 17], its use of many layers and its mixture of hand-crafted feature detectors and learned groupings showed how networks could address shift, rotation and scale invariance.
The simple method of weight decay was introduced in [32], and gained greater acceptance due to the work of Weigend and others [76]. The method of hints was introduced in [1]. While the Wald test [74, 75] has been used in traditional statistical research [69], its application to multilayer network pruning began with Le Cun et al.'s Optimal Brain Damage method [11], later extended to include non-diagonal Hessian matrices [24, 25, 26], including some speedup methods [70]. A good review of the computation and use of second-order derivatives in networks can be found in [7], and of pruning algorithms in [58].

Problems

Section 6.2

1. Show that if the transfer function of the hidden units is linear, a three-layer network is equivalent to a two-layer one. Explain why, therefore, a three-layer network with linear hidden units cannot solve a non-linearly separable problem such as XOR or n-bit parity.

2. Fourier's Theorem can be used to show that a three-layer neural net with sigmoidal hidden units can approximate to arbitrary accuracy any posterior function. Consider two-dimensional input and a single output, z(x1, x2). Recall that Fourier's Theorem states that, given weak restrictions, any such function can be written as a possibly infinite sum of cosine functions, as

z(x1, x2) ≈ Σ_{f1} Σ_{f2} A_{f1 f2} cos(f1 x1) cos(f2 x2),

with coefficients A_{f1 f2}.

(a) Use the trigonometric identity

cos(α) cos(β) = (1/2) cos(α + β) + (1/2) cos(α - β)

to write z(x1, x2) as a linear combination of terms cos(f1 x1 + f2 x2) and cos(f1 x1 - f2 x2).

(b) Show that cos(x), or indeed any continuous function f(x), can be approximated to any accuracy by a linear combination of sign functions as

f(x) ≈ f(x0) + Σ_{i=0}^{N} [f(x_{i+1}) - f(x_i)] (1 + Sgn(x - x_i))/2,

where the x_i are sequential values of x; the smaller x_{i+1} - x_i, the better the approximation.
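To see part (b) in action, the following sketch (ours, not part of the problem set) approximates f(x) = cos(x) on [0, π] by the staircase of sign functions given above; the worst-case error is bounded by the largest single step |f(x_{i+1}) - f(x_i)|.

```python
import math

def sgn(v):
    return (v > 0) - (v < 0)

def staircase(f, x, xs):
    """f(x) ≈ f(x0) + sum_i [f(x_{i+1}) - f(x_i)] * (1 + Sgn(x - x_i)) / 2."""
    total = f(xs[0])
    for i in range(len(xs) - 1):
        total += (f(xs[i + 1]) - f(xs[i])) * (1 + sgn(x - xs[i])) / 2
    return total

xs = [i * math.pi / 200 for i in range(201)]      # grid of sequential x_i on [0, pi]
err = max(abs(staircase(math.cos, x, xs) - math.cos(x))
          for x in [0.01 + 0.005 * k for k in range(600)])
print(err)   # bounded by the largest step |cos(x_{i+1}) - cos(x_i)|, about pi/200 here
```

The sum telescopes to the value of f at the nearest grid point to the left, so halving the grid spacing roughly halves the worst-case error.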
(c) Put your results together to show that z(x1, x2) can be expressed as a linear combination of step functions or sign functions whose arguments are themselves linear combinations of the input variables x1 and x2. Explain, in turn, why this implies that a three-layer network with sigmoidal hidden units and a linear output unit can implement any function that can be expressed by a Fourier series.

(d) Does your construction guarantee that the derivative df(x)/dx can be well approximated too?

Section 6.3

3. Consider a d - n_H - c network trained with n patterns for m_e epochs.

(a) What is the space complexity of this problem? (Consider both the storage of network parameters and the storage of patterns, but not the program itself.)

(b) Suppose the network is trained in stochastic mode. What is the time complexity? Since this is dominated by the number of multiply-accumulations, use that as a measure of the time complexity.

(c) Suppose the network is trained in batch mode. What is the time complexity?

4. Prove that the formula for the sensitivity of a hidden unit in a three-layer net (Eq. 20) generalizes to a hidden unit in a four- (or higher-) layer network, where the sensitivity is the weighted sum of sensitivities of units in the next higher layer.

5. Explain in words why the backpropagation rule for training input-to-hidden weights makes intuitive sense, by considering the dependence upon each of the terms in Eq. 20.

6. One might reason that the dependence of the backpropagation learning rules (Eq. ??) should be roughly inversely related to f'(net); i.e., that the weight change should be large where the output does not vary. In fact, of course, the learning rule is linear in f'(net). What, therefore, is wrong with the above view?

7. Show that the learning rule described in Eqs. 16 & 20 works for bias, where x0 = y0 = 1 is treated as another input and hidden unit.

8.
Consider a standard three-layer backpropagation net with d input units, n_H hidden units, c output units, and bias.

(a) How many weights are in the net?

(b) Consider the symmetry in the values of the weights. In particular, show that if the sign is flipped on every weight into and out of a given hidden unit, the network function is unaltered.

(c) Consider now the hidden unit exchange symmetry. There are no labels on the hidden units, and thus they can be exchanged (along with corresponding weights) and leave the network function unaffected. Prove that the number of such equivalent labellings -- the exchange symmetry factor -- is thus n_H! 2^{n_H}. Evaluate this factor for the case n_H = 10.

9. Using the style of the algorithms in the text, write the procedure for the on-line version of backpropagation training, being careful to distinguish it from the stochastic and batch procedures.

10. Express the derivative of a sigmoid in terms of the sigmoid itself, in the following two cases (for positive constants a and b). [...]
Chapter 7

Stochastic Methods

7.1 Introduction

Learning plays a central role in the construction of pattern classifiers. As we have seen, the general approach is to specify a model having one or more parameters and then estimate their values from training data. When the models are fairly simple and of low dimension, we can use analytic methods such as computing derivatives and performing gradient descent to find optimal model parameters. If the models are somewhat more complicated, we may calculate local derivatives and use gradient methods, as in neural networks and some maximum-likelihood problems. In most high-dimensional and complicated models, there are multiple maxima and we must use a variety of tricks -- such as performing the search multiple times from different starting conditions -- to have any confidence that an acceptable local maximum has been found. These methods become increasingly unsatisfactory as the models become more complex. A naive approach -- exhaustive search through solution space -- rapidly gets out of hand and is completely impractical for real-world problems. The more complicated the model, the less the prior knowledge, and the less the training data, the more we must rely on sophisticated search for finding acceptable model parameters.

In this chapter we consider stochastic methods for finding parameters, where randomness plays a crucial role in search and learning. The general approach is to bias the search toward regions where we expect the solution to be and allow randomness -- somehow -- to help find good parameters, even in very complicated models. We shall consider two general classes of such methods. The first, exemplified by Boltzmann learning, is based on concepts and techniques from physics, specifically statistical mechanics. The second, exemplified by genetic algorithms, is based on concepts from biology, specifically the mathematical theory of evolution.
The former class has a highly developed and rigorous theory and many successes in pattern recognition; hence it will command most of our effort. The latter class is more heuristic, yet affords flexibility and can be attractive when adequate computational resources are available. We shall generally illustrate these techniques in cases that are simple, and which might also be addressed with standard gradient procedures; nevertheless we emphasize that these stochastic methods may be preferable in complex problems. The methods have a high computational burden, and would be of little use without computers.

7.2 Stochastic search

We begin by discussing an important and general quadratic optimization problem. Analytic approaches do not scale well to large problems, however, and thus we focus here on methods of search through different candidate solutions. We then consider a form of stochastic search that finds use in learning for pattern recognition. Suppose we have a large number of variables s_i, i = 1, ..., N, where each variable can take one of two discrete values, for simplicity s_i = ±1. [...] energetically unfavorable, and the full system explores configurations that have high energy. Annealing proceeds by gradually lowering the temperature of the system -- ultimately toward zero, and thus no randomness -- so as to allow the system to relax into a low-energy configuration. Such annealing is effective because even at moderately high temperatures, the system slightly favors regions in the configuration space that are overall lower in energy, and hence are more likely to contain the global minimum. As the temperature is lowered, the system has an increased probability of finding the optimum configuration. This method is successful for a wide range of energy functions or energy "landscapes," though there are pathological cases such as the "golf course" landscape in Fig.
7.2, where it is unlikely to succeed. Fortunately, the problems in learning we shall consider rarely involve such pathological functions.

7.2.2 The Boltzmann factor

The statistical properties of a large number of interacting physical components at a temperature T, such as molecules in a gas or magnetic atoms in a solid, have been thoroughly analyzed. A key result, which relies on a few very natural assumptions, is the following: the probability the system is in a (discrete) configuration indexed by γ, having energy E_γ, is given by

P(γ) = e^{-E_γ/T} / Z(T),    (2)

where Z is a normalization constant. The numerator is the Boltzmann factor and the denominator the partition function, the sum over all possible configurations

Z(T) = Σ_γ e^{-E_γ/T},    (3)

which guarantees that Eq. 2 represents a true probability. The number of configurations is very high, 2^N, and in physical systems Z can be calculated only in simple cases. Fortunately, we need not calculate the partition function, as we shall see.

Figure 7.2: The energy function or energy "landscape" on the left is meant to suggest the types of optimization problems addressed by simulated annealing. The method uses randomness, governed by a control parameter or "temperature" T, to avoid getting stuck in local energy minima and thus to find the global minimum, like a small ball rolling in the landscape as it is shaken. The pathological "golf course" landscape at the right is, generally speaking, not amenable to solution via simulated annealing because the region of lowest energy is so small and is surrounded by energetically unfavorable configurations. The configuration spaces of the problems we shall address are discrete, so the continuous x1-x2 space shown here is a bit misleading.

Because of the fundamental importance of the Boltzmann factor in our discussions, it pays to take a slight detour to understand it, at least in an informal way.
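As part of that detour, Eqs. 2 and 3 can be made concrete by brute-force enumeration of a tiny system, for which Z actually can be computed. The random coupling weights below are invented for the illustration.

```python
import math, itertools, random

random.seed(1)
N = 8                                  # small enough to enumerate all 2^N configurations
w = {(i, j): random.uniform(-1, 1) for i in range(N) for j in range(i + 1, N)}

def energy(s):                         # E = -sum_{i<j} w_ij s_i s_j
    return -sum(wij * s[i] * s[j] for (i, j), wij in w.items())

def boltzmann(T):
    """P(gamma) = exp(-E_gamma/T)/Z(T) (Eq. 2), with Z by explicit enumeration (Eq. 3)."""
    states = list(itertools.product([-1, 1], repeat=N))
    factors = [math.exp(-energy(s) / T) for s in states]
    Z = sum(factors)                   # the partition function
    return [f / Z for f in factors]

maxima = []
for T in [10.0, 0.1]:
    probs = boltzmann(T)
    maxima.append(max(probs))
    print(T, max(probs))   # high T: near-uniform; low T: mass concentrates on the lowest energies
```

At high T the distribution is roughly uniform over all 2^N configurations, while at low T the probability mass concentrates on the lowest-energy configurations, exactly the qualitative behavior described below.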
Consider a different, but nonetheless related system: one consisting of a large number of non-interacting magnets, that is, without interconnecting weights, in a uniform external magnetic field. If a magnet is pointing up, s_i = +1 (in the same direction as the field), it contributes a small positive energy to the total system; if the magnet is pointing down, a small negative energy. The total energy of the collection is thus proportional to the total number of magnets pointing up. The probability the system has a particular total energy is related to the number of configurations that have that energy. Consider the highest-energy configuration, with all magnets pointing up: there is only (N choose 0) = 1 configuration with this energy. The next-to-highest energy comes with just a single magnet pointing down; there are (N choose 1) = N such configurations. The next lower energy configurations have two magnets pointing down; there are (N choose 2) = N(N-1)/2 of these configurations, and so on. The number of states declines exponentially with increasing energy. Because of the statistical independence of the magnets, for large N the probability of finding the state in energy E also decays exponentially (Problem 7). In sum, then, the exponential form of the Boltzmann factor in Eq. 2 is due to the exponential decrease in the number of accessible configurations with increasing energy. Further, at high temperature there is, roughly speaking, more energy available and thus an increased probability of higher-energy states. This describes qualitatively the dependence of the probability upon T in the Boltzmann factor: at high T, the probability is distributed roughly evenly among all configurations, while at low T, it is concentrated at the lowest-energy configurations. (In the Boltzmann factor for physical systems there is a "Boltzmann constant" which converts a temperature into an energy; we can ignore this factor by scaling the temperature in our simulations.)
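The counts in this argument are binomial coefficients, which a short check makes explicit (an aside of ours, not from the text):

```python
from math import comb

N = 100                    # non-interacting magnets; k = number pointing down
# the highest energies (small k) have the fewest configurations
for k in range(5):
    print(k, comb(N, k))   # 1, 100, 4950, 161700, 3921225
```

Moving one step down in energy multiplies the count by roughly N, which is the combinatorial source of the exponential form of the Boltzmann factor.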
If we move from the collection of independent magnets to the case of magnets interconnected by weights, the situation is a bit more complicated. Now the energy associated with a magnet pointing up or down depends upon the state of the others. Nonetheless, in the case of large N, the number of configurations decays exponentially with the energy of the configuration, as described by the Boltzmann factor of Eq. 2.

Simulated annealing algorithm

The above discussion and the physical analogy suggest the following simulated annealing method for finding the optimum configuration to our general optimization problem. Start with randomized states throughout the network, s_i(1), and select a high initial "temperature" T(1). (Of course in the simulation T is merely a control parameter which governs the randomness; it is not a true physical temperature.) Next, choose a node i randomly. Suppose its state is s_i = +1. Calculate the system energy in this configuration, E_a; next recalculate the energy, E_b, for a candidate new state s_i = -1. If this candidate state has a lower energy, accept the change in state. If however the energy is higher, accept the change with a probability equal to

    e^{-ΔE_ab/T},    (4)

where ΔE_ab = E_b - E_a. This occasional acceptance of a state that is energetically less favorable is crucial to the success of simulated annealing, and is in marked distinction to naive gradient descent and the greedy approach mentioned above. The key benefit is that it allows the system to jump out of unacceptable local energy minima. For example, at very high temperatures, every configuration has a Boltzmann factor e^{-E/T} ≈ e^0 = 1, roughly equal. After normalization by the partition function, then, every configuration is roughly equally likely. This implies every node is equally likely to be in either of its two states (Problem 6). The algorithm continues polling (selecting and testing) the nodes randomly several times, setting their states in this way.
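The acceptance rule of Eq. 4 amounts to a few lines of code. Here is a minimal sketch; the function name and its `rng` hook are our own conveniences, not the text's.

```python
import math
import random

def accept(E_a, E_b, T, rng=random.random):
    """Accept or reject a candidate state change (Eq. 4): an
    energy-lowering move is always accepted; an energy-raising one is
    accepted with probability exp(-(E_b - E_a)/T)."""
    dE = E_b - E_a
    if dE <= 0:
        return True
    return math.exp(-dE / T) > rng()
```

Passing a deterministic `rng` (as in the tests below) makes the rule easy to check: with ΔE = 1 and T = 1 the acceptance probability is e^{-1} ≈ 0.37.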
Next lower the temperature and repeat the polling. Now, according to Eq. 4, there will be a slightly smaller probability that a candidate higher-energy state will be accepted. The algorithm again polls all the nodes until each node has been visited several times; then the temperature is lowered further, the polling repeated, and so forth. At very low temperatures the probability that an energetically less favorable state will be accepted is small, and thus the search becomes more like a greedy algorithm. Simulated annealing terminates when the temperature is very low (near zero). If this cooling has been sufficiently slow, the system then has a high probability of being in a low-energy state -- hopefully the global energy minimum.

Because it is the difference in energies between the two states that determines the acceptance probabilities, we need only consider nodes connected to the one being polled -- all the units not connected to the polled unit are in the same state in both configurations and contribute the same total amount to the full energy. We let N_i denote the set of nodes connected with non-zero weights to node i; in a fully connected net, N_i would include the complete set of N - 1 remaining nodes. Further, we let Rand[0, 1) denote a randomly selected positive real number less than 1. With this notation, then, the randomized or stochastic simulated annealing algorithm is:

Algorithm 1 (Stochastic simulated annealing)

1  begin initialize T(k), k_max, s_i(1), w_ij for i, j = 1, ..., N
2    k ← 0
3    do k ← k + 1
4      do select node i randomly; suppose its state is s_i
5        E_a ← -1/2 Σ_{j ∈ N_i} w_ij s_i s_j
6        E_b ← -E_a
7        if E_b < E_a
8          then s_i ← -s_i
9        else if e^{-(E_b - E_a)/T(k)} > Rand[0, 1)
10         then s_i ← -s_i
11     until all nodes polled several times
12   until k = k_max or stopping criterion met
13   return E, s_i, for i = 1, ..., N
14 end

Because units are polled one at a time, the algorithm is occasionally called sequential simulated annealing.
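Algorithm 1 can be sketched in a few lines of Python. The weight matrix, the multiplicative schedule, and all parameter values below are illustrative choices, not prescriptions from the text; note that for E = -1/2 Σ w_ij s_i s_j, flipping s_i changes the energy by exactly 2 s_i Σ_j w_ij s_j, so only the neighbors of the polled node matter.

```python
import math
import random

def simulated_anneal(w, T1=10.0, c=0.9, kmax=50, sweeps=5, seed=0):
    """Sketch of Algorithm 1 for symmetric weights w and states +/-1,
    with E = -1/2 sum_ij w_ij s_i s_j.  Flipping s_i changes the energy
    by dE = 2 s_i sum_j w_ij s_j, so only the neighbors of i matter."""
    rng = random.Random(seed)
    N = len(w)
    s = [rng.choice([-1, 1]) for _ in range(N)]    # random initial states
    T = T1
    for _ in range(kmax):
        for _ in range(sweeps * N):                # poll nodes randomly
            i = rng.randrange(N)
            dE = 2 * s[i] * sum(w[i][j] * s[j] for j in range(N) if j != i)
            if dE < 0 or math.exp(-dE / T) > rng.random():
                s[i] = -s[i]                       # accept the flip (Eq. 4)
        T *= c                                     # annealing schedule
    E = -0.5 * sum(w[i][j] * s[i] * s[j]
                   for i in range(N) for j in range(N) if i != j)
    return s, E

# A two-node "ferromagnet": the global minima are the two aligned states.
s, E = simulated_anneal([[0, 1], [1, 0]])
```

For this tiny net the anneal ends in one of the two aligned ground states, each with energy -1.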
Note that in line 5 we define E_a based only on those units connected to the polled one -- a slightly different convention than in Eq. 1. Changing the usage in this way has no effect, since in line 9 it is the difference in energies that determines the transition probabilities. There are several aspects of the algorithm that must be considered carefully, in particular the starting temperature, the ending temperature, the rate at which the temperature is decreased, and the stopping criterion. This function T(k), where k is an iteration index, is called the cooling schedule or, more frequently, the annealing schedule. We require T(1) to be sufficiently high that all configurations have roughly equal probability; this demands the temperature be larger than the maximum difference in energy between any configurations. Such a high temperature allows the system to move to any configuration, which may be needed since the random initial configuration may be far from the optimal one. The decrease in temperature must be both gradual and slow enough that the system can move to any part of the state space before being trapped in an unacceptable local minimum, points we shall consider below. At the very least, annealing must allow N/2 transitions, since a global optimum never differs from any configuration by more than this number of steps. (In practice, annealing can require polling several orders of magnitude more times than this number.) The final temperature must be low enough (or equivalently k_max must be large enough, or the stopping criterion good enough) that there is a negligible probability that the system, once in a global minimum, will move out of it. Figure 7.3 shows that early in the annealing process, when the temperature is high, the system explores a wide range of configurations; later, as the temperature is lowered, only states "close" to the global minimum are tested.
Throughout the process, each transition corresponds to the change in state of a single unit. A typical choice of annealing schedule is T(k + 1) = cT(k) with 0 < c < 1. If computational resources are of no concern, a high initial temperature, a value of c close to 1, and a large k_max are most desirable. Values in the range 0.8 < c < 0.99 have been found to work well in many real-world problems. In practice the algorithm is slow, requiring many iterations and many passes through all the nodes, though for all but the smallest problems it is still faster than exhaustive search (Problem 5). We shall revisit the issue of parameter setting in the context of learning in Sect. 7.3.4. While Fig. 7.3 displayed a single trajectory through the configuration space, a more relevant property is the probability of being in a configuration as the system is annealed gradually. Figure 7.4 shows such probability distributions at four temperatures. Note especially that at the final, low temperature the probability is concentrated at the global minima, as desired. While this figure shows that for positive temperature all states have a non-zero probability of being visited, we must recognize that only a small fraction of configurations are in fact visited in any anneal. In short, in the vast majority of large problems, annealing does not require that all configurations be explored, and hence it is more efficient than exhaustive search.

7.2.3 Deterministic simulated annealing

Stochastic simulated annealing is slow, in part because of the discrete nature of the search through the space of all configurations.

It is worthwhile explaining this learning carefully. Figure 7.9 illustrates in greater detail the learning of the single training pattern in Fig. 7.8. Because s1 and s2 are clamped throughout, E_Q[s1 s2] (inputs and outputs clamped) = 1 = E[s1 s2] (inputs clamped), and thus the weight w12 is not changed, as indeed given by Eq. 14. Consider a more general case, involving s1 and s7.
During the learning phase both units are clamped at +1, and thus the correlation is E_Q[s1 s7] = +1. During the unlearning phase, the output s7 is free to vary and the correlation is lower; in fact it happens to be negative. Thus the learning rule seeks to increase the magnitude of w17 so that the input s1 = +1 leads to s7 = +1, as can be seen in the matrix on the right. Because hidden units are only weakly correlated (or anticorrelated), the weights linking hidden units are changed only slightly. In learning a training set of many patterns, each pattern is presented in turn and the weights updated as just described. Learning ends when the actual output matches the desired output for all patterns (cf. Sect. 7.3.4).

Figure 7.8: The fully connected seven-unit network at the left is being trained via the Boltzmann learning algorithm with the input pattern s1 = +1, s2 = +1, and the output values s6 = -1 and s7 = +1, representing categories ω1 and ω2, respectively. All 2^5 = 32 configurations with s1 = +1, s2 = +1 are shown at the right, along with their energy (Eq. 1). The black curve shows the energy before training; the red curve shows the energy after training. Note particularly that after training, all configurations that represent the full training pattern have been lowered in energy, i.e., have become more probable. Consequently, configurations that do not represent the training pattern become less probable after training.
Thus, after training, if the input pattern s1 = +1, s2 = +1 is presented and the remaining network annealed, there is an increased chance of yielding s6 = -1, s7 = +1, as desired.

7.3.2 Missing features and category constraints

A key benefit of Boltzmann training (including its preferred implementation, described in Sect. 7.3.3, below) is its ability to deal with missing features, both during training and during classification. If a deficient binary pattern is used for training, the input units corresponding to missing features are allowed to vary -- they are temporarily treated as (unclamped) hidden units rather than clamped input units. As a result, during annealing such units assume the values most consistent with the rest of the input pattern and the current state of the network (Problem 14). Likewise, when a deficient pattern is to be classified, any units corresponding to missing input features are not clamped, and are allowed to assume any value.

Subsidiary knowledge or constraints can also be incorporated into a Boltzmann network during classification. Suppose in a five-category problem it is somehow known that a test pattern is neither in category ω1 nor ω4. (Such constraints could come from context or from stages subsequent to the classifier itself.) During classification, then, the output units corresponding to ω1 and ω4 are clamped at s_i = -1 during the anneal, and the final category read as usual. Of course, in this example the possible categories are then limited to the unclamped output units, for ω2, ω3 and ω5. Such constraint imposition may lead to an improved classification rate (Problem 15).
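At zero temperature, "letting the unclamped units assume the values most consistent with the clamped ones" reduces to finding the lowest-energy completion. The brute-force sketch below (our own helper, feasible only for tiny networks, standing in for a true anneal) illustrates clamping a known input and a constrained output while a missing-feature unit is left free.

```python
import itertools

def most_probable_completion(w, clamped):
    """Hold the clamped units (dict: unit -> value) fixed and return the
    full state whose free units take their lowest-energy -- hence most
    probable -- values.  Brute-force enumeration stands in for the
    annealing of the text; feasible only for tiny networks."""
    N = len(w)
    free = [i for i in range(N) if i not in clamped]
    best, best_E = None, None
    for vals in itertools.product([-1, 1], repeat=len(free)):
        s = dict(clamped)
        s.update(zip(free, vals))
        E = -0.5 * sum(w[i][j] * s[i] * s[j]
                       for i in range(N) for j in range(N) if i != j)
        if best_E is None or E < best_E:
            best, best_E = s, E
    return [best[i] for i in range(N)]

# Three units, with a positive coupling making units 0 and 2 agree.
w = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
# Clamp unit 0 (a known input, +1) and unit 1 (a forbidden output, -1);
# unit 2 is left unclamped, like a missing feature, and completes to +1.
completed = most_probable_completion(w, {0: +1, 1: -1})
```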
Figure 7.9: Boltzmann learning of a single pattern is illustrated for the seven-node network of Fig. 7.8. The (symmetric) matrix on the left shows the correlation of units for the learning component, E_Q[s_i s_j] with inputs and outputs clamped: the input units are clamped to s1 = +1, s2 = +1, and the outputs to s6 = -1, s7 = +1. The middle matrix shows the unlearning component, E[s_i s_j] with the inputs clamped but the outputs free to vary. The difference between these matrices is shown on the right, and is proportional to the weight update Δw (Eq. 14). Notice, for instance, that because the correlation between s1 and s2 is large in both the learning and unlearning components (because those variables are clamped), there is no associated weight change, i.e., Δw12 = 0. However, a strong correlation between s1 and s7 in the learning but not in the unlearning component implies that the weight w17 should be increased, as can be seen in the weight update matrix.

Pattern completion

The problem of pattern completion is to estimate the full pattern given just a part of that pattern; as such, it is related to the problem of classification with missing features. Pattern completion is naturally addressed in Boltzmann networks. A fully interconnected network, with or without hidden units, is trained with a set of representative patterns; as before, the visible units correspond to the feature components. When a deficient pattern is presented, a subset of the visible units is clamped to the components of the partial pattern, and the network annealed. The estimate of the unknown features appears on the remaining visible units, as illustrated in Fig. 7.10 (Computer exercise 3).
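The clamped and unclamped correlations behind the weight update of Eq. 14 can be computed exactly for a tiny network by summing over all configurations. The sketch below (weights and learning rate are arbitrary choices of ours) mirrors the learning and unlearning components of Fig. 7.9.

```python
import itertools
import math

def correlations(w, clamped, T=1.0):
    """E[s_i s_j] under the Boltzmann distribution, with some units
    clamped (dict: unit -> value) and the others free to vary."""
    N = len(w)
    free = [i for i in range(N) if i not in clamped]
    corr = [[0.0] * N for _ in range(N)]
    Z = 0.0
    for vals in itertools.product([-1, 1], repeat=len(free)):
        s = dict(clamped)
        s.update(zip(free, vals))
        E = -0.5 * sum(w[i][j] * s[i] * s[j]
                       for i in range(N) for j in range(N) if i != j)
        p = math.exp(-E / T)       # unnormalized Boltzmann weight
        Z += p
        for i in range(N):
            for j in range(N):
                corr[i][j] += p * s[i] * s[j]
    return [[c / Z for c in row] for row in corr]

# Tiny net: unit 0 = input, unit 1 = hidden, unit 2 = output.
w = [[0, 0.5, 0.2], [0.5, 0, -0.3], [0.2, -0.3, 0]]
eta = 0.1   # learning rate (arbitrary)

learn = correlations(w, {0: +1, 2: +1})   # learning: input & output clamped
unlearn = correlations(w, {0: +1})        # unlearning: input clamped only
dw02 = eta * (learn[0][2] - unlearn[0][2])  # update for w_02, cf. Eq. 14
```

As in the figure, a unit pair that is clamped in both phases (here units 0 and 2 in the learning phase) has correlation exactly +1, and the update is the difference of the two phases.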
Such pattern completion in Boltzmann networks can be made more accurate when known category information is imposed at the output units.

Boltzmann networks without hidden or category units are related to so-called Hopfield networks or Hopfield auto-association networks (Problem 12). Such networks store patterns but not their category labels. The learning rule for such networks does not require the full Boltzmann learning of Eq. 14. Instead, weights are set to be proportional to the correlation of the feature vectors, averaged over the training set,

    w_ij ∝ E_Q[s_i s_j],    (15)

with w_ii = 0; further, there is no need to consider temperature. Such learning is of course much faster than true Boltzmann learning using annealing. If a network fully trained by Eq. 15 is nevertheless annealed, as in full Boltzmann learning, there is no guarantee that the equilibrium correlations in the learning and unlearning phases are equal, i.e., that Δw_ij = 0 (Problem 13).

Figure 7.10: A Boltzmann network can be used for pattern completion, i.e., filling in unknown features of a deficient pattern. Here, a twelve-unit network with five hidden units has been trained with the 10 numeral patterns of a seven-segment digital display. The diagram at the lower left shows the correspondence between the display segments and the nodes of the network; a black segment is represented by a +1 and a light gray segment by a -1. Consider the deficient pattern consisting of s2 = -1, s5 = +1. If these units are clamped and the full network annealed, the remaining five visible units will assume the values most probable given the clamped ones, as shown at the right.
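The correlation learning of Eq. 15, together with a simple zero-temperature update for recall, can be sketched as follows; the stored pattern and the corrupted probe are arbitrary illustrations of ours.

```python
def hopfield_train(patterns):
    """Weights proportional to the correlation of features averaged over
    the training set (Eq. 15), with w_ii = 0 and no annealing."""
    N = len(patterns[0])
    w = [[0.0] * N for _ in range(N)]
    for s in patterns:
        for i in range(N):
            for j in range(N):
                if i != j:
                    w[i][j] += s[i] * s[j] / len(patterns)
    return w

def hopfield_recall(w, s, steps=20):
    """Deterministic update: each unit takes the sign of its net input,
    repeated until stable -- a simple form of pattern completion."""
    s = list(s)
    for _ in range(steps):
        for i in range(len(s)):
            net = sum(w[i][j] * s[j] for j in range(len(s)))
            if net != 0:
                s[i] = 1 if net > 0 else -1
    return s

pattern = [1, 1, -1, -1, 1, -1]
w = hopfield_train([pattern])
noisy = [1, -1, -1, -1, 1, -1]          # one feature corrupted
completed = hopfield_recall(w, noisy)   # recovers the stored pattern
```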
The successes of such Hopfield networks in true pattern recognition have been modest, partly because the basic Hopfield network does not have a natural output representation for categorization problems. Occasionally, though, they can be used in simple low-dimensional pattern completion or auto-association problems. One of their primary drawbacks is their limited capacity, analogous to the fact that a two-layer network cannot implement arbitrary decision boundaries as can a three-layer network.

negative to all other output units. The resulting internal representation is closely related to that in the probabilistic neural network implementation of Parzen windows (Chap. ??). Naturally, this representation is undesirable, as the number of weights grows exponentially with the number of patterns; training becomes slow, and furthermore generalization tends to be poor. Since the states of the hidden units are binary valued, and since it takes log2 n bits to specify n different items, there must be at least ⌈log2 n⌉ hidden units if there is to be a distinct hidden configuration for each of the n patterns. Nevertheless, this lower bound need not be tight, as there may be no set of weights insuring a unique representation (Problem 16). Aside from these bounds, it is hard to make firm statements about the number of hidden units needed -- this number depends upon the inherent difficulty of the classification problem. It is traditional, then, to start with a somewhat large net and use weight decay.

Much as we saw in backpropagation (Chap. ??), a Boltzmann network with "too many" hidden units and weights can be improved by means of weight decay. During training, a small increment is added to w_ij when s_i and s_j are both positive or both negative during the learning phase, but subtracted in the unlearning phase. It is traditional to decrease this increment throughout training.
Such a version of weight decay tends to reduce the effects on the weights of spurious random correlations between units and to eliminate unneeded weights, thereby improving generalization. One of the benefits of Boltzmann networks over backpropagation networks is that "too many" hidden units in a backpropagation network tend to degrade performance more than "too many" in a Boltzmann network. This is because during learning there is stochastic averaging over states in a Boltzmann network, which tends to smooth decision boundaries; backpropagation networks have no equivalent averaging. Of course, this averaging comes at a higher computational burden for Boltzmann networks.

The next matter to consider is weight initialization. Initializing all weights to zero is acceptable, but leads to unnecessarily slow learning. In the absence of information otherwise, we can expect that roughly half the weights will be positive and half negative. In a network with fully interconnected hidden units there is nothing to differentiate the individual hidden units; thus we can arbitrarily initialize roughly half of the weights to positive values and the rest to negative values. Learning speed is increased if weights are initialized with random values within a proper range. Assume a fully interconnected network having N units (and thus N - 1 ≈ N connections to each unit). Assume further that at any instant each unit has an equal chance of being in state s_i = +1 or s_i = -1. We seek initial weights that will make the net force on each unit a random variable with variance 1.0, roughly the useful range shown in Fig. 7.5. This implies the weights should be initialized randomly in the range

    -√(3/N) < w_ij < +√(3/N)

(Problem 17). As mentioned, annealing schedules of the form T(k + 1) = cT(k) for 0 < c < 1 are generally used, with 0.8 < c < 0.99. If a very large number of iterations -- several thousand -- are needed, even c = 0.99 may be too small.
In that case we can write c = e^{-1/k0}, and thus

    T(k) = T(1) e^{-k/k0},

where k0 can be interpreted as a decay constant. The initial temperature T(1) should be set high enough that virtually all candidate state transitions are accepted. While this condition can be insured by choosing T(1) extremely high, in order to reduce training time we seek the lowest adequate value of T(1). A lower bound on the acceptable initial temperature depends upon the problem, but can be set empirically by monitoring state transitions in short simulations at candidate temperatures. Let m1 be the number of energy-decreasing transitions

Correlations are learned by the weights linking the hidden units, here labeled E. It is somewhat more difficult to train linked Hidden Markov Models to learn structure at different time scales.

implementation on massively parallel computers.

In broad overview, such methods proceed as follows. First, we create several classifiers -- a population -- each varying somewhat from the others. Next, we judge or score each classifier on a representative version of the classification task, such as accuracy on a set of labeled examples. In keeping with the analogy to biological evolution, the resulting (scalar) score is sometimes called the fitness. Then we rank these classifiers according to their scores and retain the best classifiers -- some portion of the total population. Again in keeping with biological terminology, this is called survival of the fittest. We then stochastically alter the classifiers to produce the next generation -- the children or offspring. Some offspring classifiers will have higher scores than their parents in the previous generation, some will have lower scores. The overall process is repeated for subsequent generations: the classifiers are scored, the best ones retained, randomly altered to give yet another generation, and so on.
In part because of the ranking, each generation has, on average, a slightly higher score than the previous one. The process is halted when the single best classifier in a generation has a score that exceeds a desired criterion value. The method employs stochastic variation, which in turn depends upon the fundamental representation of each classifier. There are two primary representations we shall consider: a string of binary bits (in basic genetic algorithms), and snippets of computer code (in genetic programming). In both cases, a key property is that occasionally very large changes in a classifier are introduced. The presence of such large changes and random variations implies that evolutionary methods can find good classifiers even in extremely complex discontinuous spaces or "fitness landscapes" that are hard to address by techniques such as gradient descent.

7.5.1 Genetic Algorithms

In basic genetic algorithms, the fundamental representation of each classifier is a binary string, called a chromosome. The mapping from the chromosome to the features and other aspects of the classifier depends upon the problem domain, and the designer has great latitude in specifying this mapping. In pattern classification, the score is usually chosen to be some monotonic function of the accuracy on a data set, possibly with a penalty term to avoid overfitting. We use a desired fitness, θ, as the stopping criterion. Before we discuss these points in more depth, we first consider more specifically the structure of the basic genetic algorithm, and then turn to the key notion of genetic operators used in the algorithm.

Algorithm 4 (Basic genetic algorithm)

1  begin initialize θ, P_co, P_mut, L N-bit chromosomes
2    do determine the fitness of each chromosome, f_i, i = 1, ..., L
3      rank the chromosomes
4      do select the two chromosomes with highest score
5        if Rand[0, 1) < P_co then crossover the pair at a randomly chosen bit
6        else change each bit with probability P_mut
7        remove the parent chromosomes
8      until N offspring have been created
9    until any chromosome's score f exceeds θ
10   return highest-fitness chromosome (best classifier)
11 end

Figure 7.13 shows schematically the evolution of a population of classifiers given by Algorithm 4.

Genetic operators

There are three primary genetic operators that govern reproduction, i.e., the production of offspring for the next generation described in lines 5 & 6 of Algorithm 4. The last two of these introduce variation into the chromosomes (Fig. 7.14):

Replication: A chromosome is merely reproduced, unchanged.

Crossover: Crossover involves the mixing -- "mating" -- of two chromosomes. A split point is chosen randomly along the length of either chromosome. The first part of chromosome A is spliced to the last part of chromosome B, and vice versa, thereby yielding two new chromosomes. The probability a given pair of chromosomes will undergo crossover is given by P_co in Algorithm 4.

Mutation: Each bit in a single chromosome is given a small chance, P_mut, of being changed from a 1 to a 0 or vice versa.
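Algorithm 4 can be sketched in a few dozen lines of Python. The "one-max" fitness (the count of 1 bits), the parameter values, and the generation cap below are our own illustrative choices, not prescriptions from the text.

```python
import random

def genetic_algorithm(fitness, N=8, L=20, Pco=0.7, Pmut=0.05,
                      theta=None, generations=500, seed=0):
    """Sketch of Algorithm 4: rank the L chromosomes of N bits, breed
    offspring from the two highest-scoring parents by crossover (with
    probability Pco) or bitwise mutation (probability Pmut per bit),
    and stop once some score reaches the desired fitness theta."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N)] for _ in range(L)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # rank by score
        if theta is not None and fitness(pop[0]) >= theta:
            break                                    # stopping criterion
        offspring = []
        while len(offspring) < L:
            a, b = [list(p) for p in pop[:2]]        # two best parents
            if rng.random() < Pco:                   # crossover
                x = rng.randrange(1, N)
                a, b = a[:x] + b[x:], b[:x] + a[x:]
            else:                                    # mutation
                for ch in (a, b):
                    for i in range(N):
                        if rng.random() < Pmut:
                            ch[i] ^= 1
            offspring += [a, b]                      # parents removed
        pop = offspring[:L]
    return max(pop, key=fitness)

# "One-max" toy problem: fitness is the number of 1 bits, optimum all ones.
best = genetic_algorithm(fitness=sum, theta=8)
```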
Figure 7.13: A basic genetic algorithm is a stochastic iterative search method. Each of the L classifiers in the population in generation k is represented by a string of bits of length N, called a chromosome (on the left). Each classifier is judged or scored according to its performance on a classification task, giving L scalar values f_i. The chromosomes are then ranked according to these scores, considered in descending order of score, and operated upon by the genetic operators of replication, crossover and mutation to form the next generation of chromosomes -- the offspring. The cycle repeats until some classifier exceeds the criterion score θ.

Other genetic operators may be employed, for instance inversion -- where the chromosome is reversed front to back. This operator is used only rarely, since inverting a chromosome with a high score nearly always leads to one with a very low score. Below we shall briefly consider another operator, insertion.

Representation

When designing a classifier by means of genetic algorithms, we must specify the mapping from a chromosome to properties of the classifier itself. Such a mapping will depend upon the form of classifier and the problem domain, of course.
One of the earliest and simplest approaches is to let the bits specify features (such as pixels in a character recognition problem) in a two-layer Perceptron with fixed weights (Chap. ??). The primary benefit of this particular mapping is that different segments of the chromosome, which generally remain undisturbed under the crossover operator, may evolve to recognize different portions of the input space, such as the descender (lower) or ascender (upper) portions of typed characters. As a result, occasionally the crossover operation will append a good segment for the ascender region in one chromosome to a good segment for the descender region in another, thereby yielding an excellent overall classifier. Another mapping is to let different segments of the chromosome represent the weights in a multilayer neural net with a fixed topology. Likewise, a chromosome could represent a network topology itself, the presence of an individual bit implying that two particular neurons are interconnected. One of the most natural representations is for the bits to specify properties of a decision tree classifier (Chap. ??), as shown in Fig. 7.15.

Figure 7.14: Three basic genetic operations are used to transform a population of chromosomes at one generation to form a new generation. In replication, the chromosome is unchanged.

    P(i) = e^{f_i/T} / E[e^{f_i/T}],    (24)

where the expectation is over the current generation and T is a control parameter loosely referred to as a temperature.
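The selection rule of Eq. 24 is essentially a softmax over the fitness values. A small sketch (the fitness scores are arbitrary) shows how the temperature controls the spread of the selection probabilities.

```python
import math

def selection_probs(fitnesses, T):
    """Selection probabilities in the spirit of Eq. 24,
    P(i) = e^{f_i/T} / E[e^{f_i/T}], normalized here by the population
    size L so the probabilities over the generation sum to one."""
    L = len(fitnesses)
    factors = [math.exp(f / T) for f in fitnesses]
    mean = sum(factors) / L        # E[e^{f_i/T}] over the generation
    return [f / (L * mean) for f in factors]

f = [15, 11, 29, 36, 54]             # arbitrary fitness scores
hot = selection_probs(f, T=1000.0)   # early: nearly uniform selection
cold = selection_probs(f, T=1.0)     # late: concentrates on the best
```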
Early in the evolution the temperature is set high, giving all chromosomes roughly equal probability of being selected. Late in the evolution the temperature is set lower, so as to find the chromosomes in the region of the optimal classifier. We can express such search by analogy to biology: early in the search the population remains diverse and explores the fitness landscape in search of promising areas; later, the population exploits the specific fitness opportunities in a small region of the space of possible classifiers.

7.5.2 Further heuristics

There are many additional heuristics that can occasionally be of use. One concerns the adaptation of the crossover and mutation rates, P_co and P_mut. If these rates are too low, the average improvement from one generation to the next will be small, and the search unacceptably long. Conversely, if these rates are too high, the evolution is undirected and similar to a highly inefficient random search. We can monitor the average improvement in fitness of each generation and retain the current mutation and crossover rates so long as such improvement is rapid. In practice, this adaptation is often done by encoding the rates in the chromosomes themselves and allowing the genetic algorithm to select the proper values.

Another heuristic is to use ternary or, more generally, n-ary chromosomes rather than the traditional binary ones. These representations provide little or no benefit at the algorithmic level, but may make the mapping to the classifier itself more natural and easier to compute. For instance, a ternary chromosome might be most appropriate if the classifier is a decision tree with three-way splits.

Occasionally the mapping to the classifier will work for chromosomes of different lengths. For example, if the bits in the chromosome specify weights in a neural network, then longer chromosomes would describe networks with a larger number of hidden units.
In such a case we allow the insertion operator, which with small probability inserts bits into the chromosome at a randomly chosen position. This so-called "messy" genetic algorithm method has a more appropriate counterpart in genetic programming, as we shall see in Sect. 7.6.

7.5.3 Why do they work?

Because there are many heuristics to choose and parameters to set, it is hard to make firm theoretical statements about building classifiers by means of evolutionary methods. The performance and search time depend upon the number of bits, the size of the population, the mutation and crossover rates, the choice of features and of the mapping from chromosomes to the classifier itself, the inherent difficulty of the problem, and possibly parameters associated with other heuristics. A genetic algorithm restricted to mere replication and mutation is, at base, a version of stochastic random search. The incorporation of the crossover operator, which mates two chromosomes, provides a qualitatively different search, one that has no counterpart in stochastic grammars (Chap. ??). Crossover works by finding, rewarding and recombining "good" segments of chromosomes, and the more faithfully the segments of the chromosomes represent such functional building blocks, the better we can expect genetic algorithms to perform. The only way to insure this is with prior knowledge of the problem domain and the desired form of classifier.

7.6 *Genetic Programming

Genetic programming shares the same algorithmic structure as basic genetic algorithms, but differs in the representation of each classifier. Instead of chromosomes consisting of strings of bits, genetic programming uses snippets of computer programs made up of mathematical operators and variables. As a result, the genetic operators are somewhat different; moreover, a new operator plays a significant role in genetic programming. The four principal operators in genetic programming are (Fig. 7.16):

Replication: A snippet is merely reproduced, unchanged.

Crossover: Crossover involves the mixing -- "mating" -- of two snippets. A split point is chosen from the allowable locations in snippet A, as well as one in snippet B. The first part of snippet A is spliced to the back part of snippet B, and vice versa, thereby yielding two new snippets.

Mutation: Each element in a single snippet is given a small chance of being changed to a different value. Such a change must be compatible with the syntax of the total snippet. For instance, a number can be replaced by another number; a mathematical operator that takes a single argument can be replaced by another such operator, and so forth.

Insertion: Insertion consists in replacing a single element of the snippet with another (short) snippet randomly chosen from a set.

In the c-category problem, it is simplest to form c dichotomizers, just as in genetic algorithms. If the output of dichotomizer i is positive, the test pattern belongs to category ω_i; if negative, it is NOT in ω_i.

Representation

A program must be expressed in some language, and the choice affects the complexity of the procedure. Syntactically rich languages such as C or C++ are complex and somewhat difficult to work with. Here the syntactic simplicity of a language such as Lisp is advantageous. Many Lisp expressions can be written in the form (<operator> <operand> <operand>), where an <operand> can be a constant, a variable or another parenthesized expression. For example, (+ X 2) and (* 3 (+ Y 5)) are valid Lisp expressions for the arithmetic expressions x + 2 and 3(y + 5), respectively. These expressions are easily represented by a binary tree, with the operator specified at the node and the operands appearing as the children (Fig. 7.17). Whatever language is used, genetic programming operators used for mutation should replace variables and constants with variables and constants, and operators with functionally compatible operators.
They should also be required to produce syntactically valid results. Nevertheless, occasionally an ungrammatical code snippet may be produced. For that reason, it is traditional to employ a wrapper -- a routine that decides whether a snippet is meaningful, and eliminates it if not.

Figure 7.16: Four basic genetic operations are used to transform a population of snippets of code at one generation to form a new generation. In replication, the snippet is unchanged. Crossover involves the mixing or "mating" of two snippets to yield two new snippets. A position along snippet A is randomly chosen from the allowable locations (red vertical line); likewise one is chosen for snippet B. Then the front portion of A is spliced to the back portion of B and vice versa. In mutation, each element is given a small chance of being changed. There are several different types of elements, and replacements must be of the same type. For instance, only a number can replace another number; only a numerical operator that takes a single argument can replace a similar operator, and so on. In insertion, a randomly selected element is replaced by a compatible snippet, keeping the entire snippet grammatically well formed and meaningful.
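A wrapper of the sort just described can be as simple as a recursive arity check over the nested-list representation of a snippet. The sketch below assumes a small hand-made operator table (`ARITY`), not any particular genetic-programming system.

```python
# Hypothetical arity table for the kinds of operators seen in Figs. 7.16
# and 7.17; a real system would enumerate its full operator set here.
ARITY = {'+': 2, '-': 2, '*': 2, '/': 2, 'NOT': 1, 'AND': 2, 'OR': 2}

def well_formed(expr):
    """Minimal 'wrapper' test: every operator node must carry exactly the
    right number of arguments, and every argument must itself be valid."""
    if isinstance(expr, list):
        if not expr or expr[0] not in ARITY:
            return False
        if len(expr) - 1 != ARITY[expr[0]]:
            return False
        return all(well_formed(arg) for arg in expr[1:])
    return isinstance(expr, (int, float, str))  # terminals: constants, variables

well_formed(['OR', ['AND', ['NOT', 'X0'], ['NOT', 'X1']], ['AND', 'X0', 'X1']])  # True
well_formed(['NOT', 'X0', 'X1'])  # False: NOT takes a single argument
```

A snippet failing this check would be discarded (or repaired) before it enters the next generation.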
It is nearly impossible to make sound theoretical statements about genetic programming, and even the rules of thumb learned from simulations in one domain, such as control or function optimization, are of little value in another domain, such as classification. Of course, the method works best in problems that are well matched by the classifier representation, for instance ones solvable with simple operations such as multiplication, division, square roots, logical NOT, and so on. Nevertheless, we can state that as computation continues to decrease in cost, more of the burden of solving classification problems will be assumed by computation rather than careful analysis, and here evolutionary techniques will be of use in classification research.

Summary

When a pattern recognition problem involves a model that is discrete or of such high complexity that analytic or gradient descent methods are unlikely to work, we may employ stochastic techniques -- ones that at some level rely on randomness to find model parameters. Simulated annealing, based on the physical annealing of metals, consists in randomly perturbing the system and gradually decreasing the randomness to a low final level, in order to find an optimal solution. Boltzmann learning trains the weights in a network so that the probability of a desired final output is increased. Such learning is based on gradient descent in the Kullback-Leibler divergence between two distributions of visible states at the output units: one distribution describes these units when clamped at the known category information, and the other when they are free to assume values based on the activations throughout the network. Some graphical models, such as hidden Markov models and Bayes belief networks, have counterparts in structured Boltzmann networks, and this leads to new applications of Boltzmann learning.

Figure 7.17: Unlike the decision trees of Fig. 7.15 and Chap. ??, the trees shown here are merely a representation, using the syntax of Lisp, that implements a single function. For instance, the upper-right (parent) tree implements x2 x4 / (x3 (x4 / x1)). Such functions are used with an implied threshold or sign function when used for classification. Thus the function will operate on the features of a test pattern and emit category ω_i if the function is positive, and not ω_i otherwise.

Search methods based on evolution -- genetic algorithms and genetic programming -- perform highly parallel stochastic searches in a space set by the designer. The fundamental representation used in genetic algorithms is a string of bits, or chromosome; the representation in genetic programming is a snippet of computer code. Variation is introduced by means of crossover, mutation and insertion. As with all classification methods, the better the features, the better the solution. There are many heuristics that can be employed and parameters that must be set. As the cost of computation continues to decline, computationally intensive methods, such as Boltzmann networks and evolutionary methods, should become increasingly popular.

Bibliographical and Historical Remarks

The general problem of search is of central interest in computer science and artificial intelligence, and is far too expansive to treat here.
Nevertheless, techniques such as depth-first search, breadth-first search, branch-and-bound and A* [19] occasionally find use in fields touching upon pattern recognition, and practitioners should have at least a passing knowledge of them. Good overviews can be found in [33] and in a number of textbooks on artificial intelligence, such as [46, 67, 55]. For rigor and completeness, Knuth's book on the subject is without peer [32]. The infinite monkey theorem, attributed to Sir Arthur Eddington, states that if there is a sufficiently large number of monkeys typing at typewriters, eventually one will bang out the script to Hamlet. It reflects one extreme of the tradeoff between prior knowledge about the location of a solution on the one hand and the effort of search required to find it on the other; such extreme cases have relatively little bearing on pattern recognition. There are several collections of papers on evolutionary techniques in pattern recognition, including [48]. An intriguing effect due to the interaction of learning and evolution is the Baldwin effect, where learning can influence the rate of evolution [22]; it has been shown that too much learning (as well as too little) leads to slower evolution [28]. Evolutionary methods can lead to "non-optimal" or inelegant solutions, and there is computational evidence that this occurs in nature [61, 62].

Problems

Section 7.1

1. One version of the infinite monkey theorem states that a single (immortal) monkey typing randomly will ultimately reproduce the script of Hamlet. Estimate the time needed for this, assuming the monkey can type two characters per second, that the play has 50 pages, each containing roughly 80 lines, and 40 characters per line. Assume there are 30 possible characters (a through z, space, period, exclamation point and carriage return). Compare this time to the estimated age of the universe, 10^10 years.

Section 7.2

2. Prove that for any optimization problem of the form of Eq.
1 having a nonsymmetric connection matrix, there is an equivalent optimization problem in which the matrix is replaced by its symmetric part.

3. The complicated energy landscape in the left of Fig. 7.2 is misleading for a number of reasons. (a) Discuss the difference between the continuous space shown in that figure and the discrete space of the true optimization problem. (b) The figure shows a local minimum near the middle of the space. Given the nature of the discrete space, are any states closer to any "middle"? (c) Suppose the axes referred to continuous variables s_i (as in mean-field annealing). If each s_i obeyed a sigmoid (Fig. 7.5), could the energy landscape be nonmonotonic, as is shown in Fig. 7.2?

4. Consider exhaustive search for the minimum of the energy given in Eq. 1 for binary units and arbitrary connections w_ij. Suppose that on a uniprocessor it takes 10^-8 seconds to calculate the energy for each configuration. How long will it take to search the space exhaustively for N = 100 units? How long for N = 1000 units?

5. Suppose it takes a uniprocessor 10^-10 seconds to perform a single multiply-accumulate, w_ij s_i s_j, in the calculation of the energy E = -(1/2) Σ_ij w_ij s_i s_j given in Eq. 1. (a) Make some simplifying assumptions and write a formula for the total time required to search exhaustively for the minimum energy in a fully connected network of N nodes. (b) Plot your function using a log-log scale for N = 1, . . . , 10^5. (c) What size network, N, could be searched exhaustively in a day? A year? A century?

6. Make and justify any necessary mathematical assumptions and show analytically that at high temperature every configuration in a network of N units interconnected by weights is equally likely (cf. Fig. 7.1).

7. Derive the exponential form of the Boltzmann factor in the following way. Consider an isolated set of M + N independent magnets, each of which can be in an s_i = +1 or s_i = -1 state.
There is a uniform magnetic field applied, and this means that the s_i = +1 state has some energy, which we can arbitrarily set to 1; the s_i = -1 state has energy -1. The total energy of the system is therefore the number pointing up, k_u, minus the number pointing down, k_d; that is, E_T = k_u - k_d. (Of course, k_u + k_d = M + N regardless of the total energy.) The fundamental statistical assumptions describing this system are that the magnets are independent, and that the probability that a subsystem (viz., the N magnets) has a particular energy is proportional to the number of configurations that have this energy. (a) Consider the subsystem of N magnets, which has energy E_N. Write an expression for the number of configurations K(N, E_N) that have energy E_N. (b) As in part (a), write a general expression for the number of configurations of the subsystem of M magnets at energy E_M, i.e., K(M, E_M). (c) Since the two subsystems consist of . . .

Chapter 8

Non-metric Methods

8.1 Introduction

We have considered pattern recognition based on feature vectors of real-valued and discrete-valued numbers, and in all cases there has been a natural measure of distance between such vectors. For instance, in the nearest-neighbor classifier the notion of distance figures conspicuously -- indeed it is the core of the technique -- while for neural networks the notion of similarity appears when two sufficiently "close" input vectors lead to similar outputs. Most practical pattern recognition methods address problems of this sort, where feature vectors are real-valued and there exists some notion of metric.
But suppose a classification problem involves nominal data -- for instance, descriptions that are discrete and without any natural notion of similarity or even ordering. Consider the use of information about teeth in the classification of fish and sea mammals. Some teeth are small and fine (as in baleen whales), for straining tiny prey from the sea. Others (as in sharks) come in multiple rows. Some sea creatures, such as walruses, have tusks. Yet others, such as squid, lack teeth altogether. There is no clear notion of similarity (or metric) for this information about teeth: it is meaningless to consider the teeth of a baleen whale any more similar to, or different from, the tusks of a walrus than the distinctive rows of teeth in a shark are to their absence in a squid, for example.

Thus in this chapter our attention turns away from describing patterns by vectors of real numbers and toward using lists of attributes. A common approach is to specify the values of a fixed number of properties by a property d-tuple. For example, consider describing a piece of fruit by the four properties of color, texture, taste and size. Then a particular piece of fruit might be described by the 4-tuple {red, shiny, sweet, small}, which is a shorthand for color = red, texture = shiny, taste = sweet and size = small. Another common approach is to describe the pattern by a variable-length string of nominal attributes, such as a sequence of base pairs in a segment of DNA, e.g., "AGCTTCAGATTCCA." (We often put strings between quotation marks, particularly if this will help to avoid ambiguities.) Such lists or strings might themselves be the output of other component classifiers of the type we have seen elsewhere. For instance, we might train a neural network to recognize the different component brush
strokes used in Chinese and Japanese characters (roughly a dozen basic forms); a classifier would then accept as inputs a list of these nominal attributes and make the final, full character classification. How can we best use such nominal data for classification? Most importantly, how can we efficiently learn categories using such non-metric data? If there is structure in strings, how can it be represented? In considering such problems, we move beyond the notion of continuous probability distributions and metrics, toward discrete problems that are addressed by rule-based or syntactic pattern recognition methods.

8.2 Decision trees

It is natural and intuitive to classify a pattern through a sequence of questions, in which the next question asked depends on the answer to the current question. This "20-questions" approach is particularly useful for non-metric data, since all of the questions can be asked in a "yes/no" or "true/false" or "value(property) ∈ set of values" style that does not require any notion of metric. Such a sequence of questions is displayed in a directed decision tree or simply tree, where by convention the first or root node is displayed at the top, connected by successive (directional) links or branches to other nodes. These are similarly connected until we reach terminal or leaf nodes, which have no further links (Fig. 8.1). Sections 8.3 & 8.4 describe some generic methods for creating such trees, but let us first understand how they are used for classification.

The classification of a particular pattern begins at the root node, which asks for the value of a particular property of the pattern. The different links from the root node correspond to the different possible values. Based on the answer we follow the appropriate link to a subsequent or descendent node.
In the trees we shall discuss, the links must be mutually distinct and exhaustive, i.e., one and only one link will be followed. The next step is to make the decision at the appropriate subsequent node, which can be considered the root of a sub-tree. We continue this way until we reach a leaf node, which has no further question. Each leaf node bears a category label, and the test pattern is assigned the category of the leaf node reached.

The simple decision tree in Fig. 8.1 illustrates one benefit of trees over many other classifiers such as neural networks: interpretability. It is a straightforward matter to render the information in such a tree as logical expressions. Such interpretability has two manifestations. First, we can easily interpret the decision for any particular test pattern as the conjunction of decisions along the path to its corresponding leaf node. Thus if the properties are {taste, color, shape, size}, the pattern x = {sweet, yellow, thin, medium} is classified as Banana because it is (color = yellow) AND (shape = thin). Second, we can occasionally get clear interpretations of the categories themselves, by creating logical descriptions using conjunctions and disjunctions (Problem 8). For instance, the tree shows Apple = (green AND medium) OR (red AND medium). Rules derived from trees -- especially large trees -- are often quite complicated and must be reduced to aid interpretation. For our example, one simple rule describes Apple = (medium AND NOT yellow).

(We retain our convention of representing patterns in boldface even though they need not be true vectors, i.e., they might contain nominal data that cannot be added or multiplied the way vector components can. For this reason we use the term "attribute" to represent both nominal data and real-valued data, and reserve "feature" for real-valued data.)

Figure 8.1: Classification in a basic decision tree proceeds from top to bottom. The questions asked at each node concern a particular property of the pattern, and the downward links correspond to the possible values. Successive nodes are visited until a terminal or leaf node is reached, where the category label is read. Note that the same question, Size?, appears in different places in the tree, and that different questions can have different numbers of branches. Moreover, different leaf nodes, shown in pink, can be labeled by the same category (e.g., Apple).

Another benefit of trees is that they lead to rapid classification, employing a sequence of typically simple queries. Finally, we note that trees provide a natural way to incorporate prior knowledge from human experts. In practice, though, such expert knowledge is of greatest use when the classification problem is fairly simple and the training set is small.

8.3 CART

Now we turn to the matter of using training data to create or "grow" a decision tree. We assume that we have a set D of labeled training data and that we have decided on a set of properties that can be used to discriminate patterns, but do not know how to organize the tests into a tree. Clearly, any decision tree will progressively split the set of training examples into smaller and smaller subsets. It would be ideal if all the samples in each subset had the same category label. In that case we would say that each subset was pure, and could terminate that portion of the tree. Usually, however, there is a mixture of labels in each subset, and thus for each branch we will have to decide either to stop splitting and accept an imperfect decision, or instead select another property and grow the tree further.
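The root-to-leaf classification procedure described above can be sketched directly. The nested-dictionary encoding below is our own, and the tree is an abridged version of the fruit example of Fig. 8.1.

```python
# Each internal node stores the property it queries and one branch per
# possible value; leaves are plain category labels.
tree = {'property': 'color',
        'branches': {
            'green':  {'property': 'size',
                       'branches': {'big': 'Watermelon',
                                    'medium': 'Apple',
                                    'small': 'Grape'}},
            'yellow': {'property': 'shape',
                       'branches': {'thin': 'Banana',
                                    'round': 'Lemon'}},   # abridged branch
            'red':    {'property': 'size',
                       'branches': {'medium': 'Apple',
                                    'small': 'Cherry'}}}}

def classify(node, pattern):
    """Follow one and only one link at each node until a leaf is reached."""
    while isinstance(node, dict):
        node = node['branches'][pattern[node['property']]]
    return node

x = {'color': 'yellow', 'shape': 'thin', 'size': 'medium'}
classify(tree, x)  # -> 'Banana', via (color = yellow) AND (shape = thin)
```

Note that the leaf reached is exactly the conjunction of the decisions along the path, which is what makes the classifier easy to interpret.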
This suggests an obvious recursive tree-growing process: given the data represented at a node, either declare that node to be a leaf (and state what category to assign to it), or find another property to use to split the data into subsets. However, this is only one example of a more generic tree-growing methodology known as CART (Classification and Regression Trees). CART provides a general framework that can be instantiated in various ways to produce different decision trees. In the CART approach, six general kinds of questions arise:

1. Should the properties be restricted to binary-valued, or allowed to be multi-valued? That is, how many decision outcomes or splits will there be at a node?

2. Which property should be tested at a node?

3. When should a node be declared a leaf?

4. If the tree becomes "too large," how can it be made smaller and simpler, i.e., pruned?

5. If a leaf node is impure, how should the category label be assigned?

6. How should missing data be handled?

We consider each of these questions in turn.

8.3.1 Number of splits

Each decision outcome at a node is called a split, since it corresponds to splitting a subset of the training data. The root node splits the full training set; each successive decision splits a proper subset of the data. The number of splits at a node is closely related to question 2, specifying which particular split will be made at a node. In general, the number of splits is set by the designer, and could vary throughout the tree, as we saw in Fig. 8.1. The number of links descending from a node is sometimes called the node's branching factor or branching ratio, denoted B. However, every decision (and hence every tree) can be represented using just binary decisions (Problem 2).
Thus the root node querying fruit color (B = 3) in our example could be replaced by two nodes: the first would ask fruit = green?, and at the end of its "no" branch, another node would ask fruit = yellow?. Because of the universal expressive power of binary trees and their comparative simplicity in training, we shall concentrate on such trees (Fig. 8.2).

Figure 8.2: A tree with arbitrary branching factor at different nodes can always be represented by a functionally equivalent binary tree, i.e., one having branching factor B = 2 throughout. By convention the "yes" branch is on the left, the "no" branch on the right. This binary tree contains the same information and implements the same classification as that in Fig. 8.1.

8.3.2 Test selection and node impurity

Much of the work in designing trees focuses on deciding which property test or query should be performed at each node. With non-numeric data, there is no geometrical interpretation of how the test at a node splits the data. However, for numerical data there is a simple way to visualize the decision boundaries produced by decision trees. For example, suppose that the test at each node has the form "is x_i ≤ x_is?" This leads to hyperplane decision boundaries that are perpendicular to the coordinate axes, and to decision regions of the form illustrated in Fig. 8.3.

Figure 8.3: Monothetic decision trees create decision boundaries with portions perpendicular to the feature axes. The decision regions are marked R1 and R2 in these two-dimensional and three-dimensional two-category examples. With a sufficiently large tree, any decision boundary can be approximated arbitrarily well.

The fundamental principle underlying tree creation is that of simplicity: we prefer decisions that lead to a simple, compact tree with few nodes. This is a version of Occam's razor, that the simplest model that explains the data is the one to be preferred (Chap. ??). To this end, we seek a property test T at each node N that makes the data reaching the immediate descendent nodes as "pure" as possible. In formalizing this notion, it turns out to be more convenient to define the impurity, rather than the purity, of a node.

(The problem is further complicated by the fact that there is no reason why the test at a node has to involve only one property. One might well consider logical combinations of properties, such as using (size = medium) AND (NOT (color = yellow))? as a test. Trees in which each test is based on a single property are called monothetic; if the query at any of the nodes involves two or more properties, the tree is called polythetic. For simplicity, we generally restrict our treatment to monothetic trees. In all cases, the key requirement is that the decision at a node be well-defined and unambiguous, so that the response leads down one and only one branch.)

Several different mathematical measures of impurity have been proposed, all of which have basically the same behavior. Let i(N) denote the impurity of a node N. In all cases, we want i(N) to be 0 if all of the patterns that reach the node bear the same category label, and to be large if the categories are equally represented. The most popular measure is the entropy impurity (or, occasionally, information impurity):

    i(N) = - Σ_j P(ω_j) log2 P(ω_j),    (1)

where P(ω_j) is the fraction of patterns at node N that are in category ω_j. By the well-known properties of entropy, if all the patterns are of the same category the impurity is 0; otherwise it is positive, with the greatest value occurring when the different classes are equally likely.

Another definition of impurity is particularly useful in the two-category case. Given the desire to have zero impurity when the node represents only patterns of a single category, the simplest polynomial form is:

    i(N) = P(ω_1) P(ω_2).    (2)

This can be interpreted as a variance impurity, since under reasonable assumptions it is related to the variance of a distribution associated with the two categories (Problem 10).

(Here we are a bit sloppy with notation, since we normally reserve P for probability and P̂ for frequency ratios. We could be even more precise by writing P̂(x ∈ ω_j | N) -- i.e., the fraction of training patterns x at node N that are in category ω_j, given that they have survived all the previous decisions that led to N -- but for the sake of simplicity we will avoid such notational overhead.)

A generalization of the variance impurity, applicable to two or more categories, is the Gini impurity:

    i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = 1 - Σ_j P²(ω_j).    (3)

This is just the expected error rate at node N if the category label is selected randomly from the class distribution present at N. This criterion is more strongly peaked at equal probabilities than is the entropy impurity (Fig. 8.4). The misclassification impurity can be written as

    i(N) = 1 - max_j P(ω_j),    (4)

and measures the minimum probability that a training pattern would be misclassified at N. Of the impurity measures typically considered, this measure is the most strongly peaked at equal probabilities. It has a discontinuous derivative, though, and this can present problems when searching for an optimal decision over a continuous parameter space. Figure 8.4 shows these impurity functions for a two-category case, as a function of the probability of one of the categories.
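The impurity measures of Eqs. 1, 3 and 4 are each one line of code. The sketch below (our own helper names; class probabilities passed as a list) confirms that every measure vanishes on a pure node and peaks when the categories are equally likely.

```python
from math import log2

def entropy_impurity(ps):
    """Eq. 1: i(N) = -sum_j P(w_j) log2 P(w_j)."""
    return -sum(p * log2(p) for p in ps if p > 0)

def gini_impurity(ps):
    """Eq. 3: i(N) = 1 - sum_j P(w_j)^2."""
    return 1 - sum(p * p for p in ps)

def misclassification_impurity(ps):
    """Eq. 4: i(N) = 1 - max_j P(w_j)."""
    return 1 - max(ps)

# Pure node: all three impurities are zero.
for f in (entropy_impurity, gini_impurity, misclassification_impurity):
    assert f([1.0, 0.0]) == 0

# Equally likely categories: the peak values are 1 bit, 0.5 and 0.5.
peak = [0.5, 0.5]
```

Plotting these three functions against P(ω_1) reproduces the qualitative shapes referred to in Fig. 8.4: the entropy curve is broadest, while the misclassification impurity is the most sharply peaked at P = 0.5.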
We now come to the key question -- given a partial tree down to node N, what value s should we choose for the property test T? An obvious heuristic is to choose the test that decreases the impurity as much as possible. The drop in impurity is defined by

    Δi(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R),    (5)

where N_L and N_R are the left and right descendent nodes and P_L is the fraction of patterns at N that go to N_L under test T. Consider, for instance, a node containing 90 ω_1 and 10 ω_2 patterns, so that its misclassification impurity is 0.1; for any split in which ω_1 remains the majority on both branches, the misclassification impurity remains at 0.1 for all splits. Now consider a split which sends 70 ω_1 patterns to the right along with 0 ω_2 patterns, and sends 20 ω_1 and 10 ω_2 to the left. This is an attractive split, but the misclassification impurity is still 0.1. On the other hand, the Gini impurity for this split is less than the Gini impurity of the parent node. In short, the Gini impurity shows that this is a good split while the misclassification rate does not.

In multiclass binary tree creation, the twoing criterion may be useful. The overall goal is to find the split that best separates groups of the c categories, i.e., a candidate "supercategory" C_1 consisting of all patterns in some subset of the categories, and a candidate "supercategory" C_2 consisting of all remaining patterns. Let the class of categories be C = {ω_1, ω_2, . . . , ω_c}. At each node, the decision splits the categories into C_1 = {ω_i1, ω_i2, . . . , ω_ik} and C_2 = C - C_1. For every candidate split s we compute a change in impurity Δi(s, C_1) as though it corresponded to a standard two-class problem. That is, we find the split s*(C_1) that maximizes the change in impurity. Finally, we find the supercategory C_1* that maximizes Δi(s*(C_1), C_1). The benefit of this criterion is that it is strategic -- it may learn the largest-scale structure of the overall problem (Problem 4). (The twoing criterion is not a true impurity measure.)

It may be surprising, but the particular choice of impurity function rarely seems to affect the final classifier and its accuracy. The entropy impurity is frequently used because of its computational simplicity and basis in information theory, though the Gini impurity has received significant attention as well.
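The split example above can be checked numerically. The child counts are those given in the text (70 ω_1 and 0 ω_2 to the right, 20 ω_1 and 10 ω_2 to the left), which imply a parent node with 90 ω_1 and 10 ω_2 patterns; the helper names are ours.

```python
def gini(counts):
    """Eq. 3, computed from raw class counts at a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def misclass(counts):
    """Eq. 4, computed from raw class counts at a node."""
    n = sum(counts)
    return 1 - max(counts) / n

parent, left, right = [90, 10], [20, 10], [70, 0]
p_left = sum(left) / sum(parent)   # P_L = 0.3

drop_gini = gini(parent) - p_left * gini(left) - (1 - p_left) * gini(right)
drop_mis  = misclass(parent) - p_left * misclass(left) - (1 - p_left) * misclass(right)
# drop_gini ≈ 0.047 > 0, while drop_mis is zero (up to rounding):
# the Gini impurity rewards this split; the misclassification rate does not.
```

The right child is pure, so its Gini impurity is 0, and this is exactly what drives the positive Δi under the Gini criterion even though the overall error rate is unchanged.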
In practice, the stopping criterion and the pruning method -- when to stop splitting nodes, and how to merge leaf nodes -- are more important than the impurity function itself in determining final classifier accuracy, as we shall see.

Multi-way splits

Although we shall concentrate on binary trees, we briefly mention the matter of allowing the branching ratio at each node to be set during training, a technique we will return to in the discussion of the ID3 algorithm (Sect. 8.4.1). In such a case, it is tempting to use a multi-branch generalization of Eq. 5 of the form

    Δi(s) = i(N) - Σ_{k=1}^{B} P_k i(N_k),    (6)

where P_k is the fraction of training patterns sent down the link to node N_k, and Σ_{k=1}^{B} P_k = 1. However, the drawback with Eq. 6 is that decisions with large B are inherently favored over those with small B, whether or not the large-B splits in fact represent meaningful structure in the data. For instance, even in random data a high-B split will reduce the impurity more than will a low-B split. To avoid this drawback, the candidate change in impurity of Eq. 6 must be scaled, according to

    Δi_B(s) = Δi(s) / ( - Σ_{k=1}^{B} P_k log2 P_k ),    (7)

a method based on the gain ratio impurity (Problem 17). Just as before, the optimal split s* is the one maximizing Δi_B(s).

8.3.3 When to stop splitting

Consider now the problem of deciding when to stop splitting during the training of a binary tree. If we continue to grow the tree fully, until each leaf node corresponds to the lowest impurity, then the data have typically been overfit (Chap. ??). In the extreme but rare case, each leaf corresponds to a single training point and the full tree is merely a convenient implementation of a lookup table; it thus cannot be expected to generalize well in (noisy) problems having high Bayes error.
Conversely, if splitting is stopped too early, then the error on the training data is not sufficiently low and hence performance may suffer. How shall we decide when to stop splitting? One traditional approach is to use the techniques of Chap. ??, in particular cross-validation. That is, the tree is trained using a subset of the data (for instance 90%), with the remainder (10%) kept as a validation set. We continue splitting nodes in successive layers until the error on the validation data is minimized. Another method is to set a (small) threshold in the reduction of impurity: splitting is stopped if the best candidate split at a node reduces the impurity by less than that preset amount. Yet another approach is to test the statistical significance of the candidate split, stopping when the split is not significant at a chosen significance or confidence level, such as .01 or .05. The critical values of the confidence depend upon the number of degrees of freedom, which in the case just described is 1, since for a given probability P the single value n_1L specifies all other values (n_1R, n_2L and n_2R). If the "most significant" split at a node does not yield a χ² exceeding the chosen confidence level threshold, splitting is stopped.

8.3.4 Pruning

Occasionally, stopped splitting suffers from a lack of sufficient look-ahead, a phenomenon called the horizon effect. The determination of the optimal split at a node N is not influenced by decisions at N's descendent nodes, i.e., those at subsequent levels. In stopped splitting, node N might be declared a leaf, cutting off the possibility of beneficial splits in subsequent nodes; as such, a stopping condition may be met "too early" for overall optimal recognition accuracy. Informally speaking, stopped splitting biases the learning algorithm toward trees in which the greatest impurity reduction is near the root node.

The principal alternative to stopped splitting is pruning. In pruning, a tree is grown fully, that is, until leaf nodes have minimum impurity -- beyond any putative "horizon." Then, all pairs of neighboring leaf nodes (i.e., ones linked to a common antecedent node, one level above) are considered for elimination.
Any pair whose elimination yields a satisfactory (small) increase in impurity is eliminated, and the common antecedent node declared a leaf. (This antecedent, in turn, could itself be pruned.) Clearly, such merging or joining of the two leaf nodes is the inverse of splitting. It is not unusual that after such pruning, the leaf nodes lie in a wide range of levels and the tree is unbalanced. Although it is most common to prune starting at the leaf nodes, this is not necessary: cost-complexity pruning can replace a complex subtree with a leaf directly. Further, C4.5 (Sect. 8.4.2) can eliminate an arbitrary test node, thereby replacing a subtree by one of its branches. The benefits of pruning are that it avoids the horizon effect and that, since there is no training data held out for cross-validation, it directly uses all the information in the training set. Naturally, this comes at a greater computational expense than stopped splitting, and for problems with large training sets, the expense can be prohibitive (Computer exercise ??). For small problems, though, these computational costs are low and pruning is generally to be preferred over stopped splitting. Incidentally, what we have been calling stopped splitting and pruning are sometimes called pre-pruning and post-pruning, respectively.

A conceptually different pruning method is based on rules. Each leaf has an associated rule -- the conjunction of the individual decisions from the root node, through the tree, to the particular leaf. Thus the full tree can be described by a large list of rules, one for each leaf. Occasionally, some of these rules can be simplified if a series of decisions is redundant. Eliminating such irrelevant preconditions simplifies the description, but has no influence on the classifier function, including its generalization ability. The predominant reason to prune, however, is to improve generalization.
In this case we therefore eliminate rules so as to improve accuracy on a validation set (Computer exercise 6). This technique may even allow the elimination of a rule corresponding to a node near the root. One of the benefits of rule pruning is that it allows us to distinguish between the contexts in which any particular node N is used. For instance, for some test pattern x1 the decision rule at node N is necessary; for another test pattern x2 that rule is irrelevant and thus N could be pruned. In traditional node pruning, we must either keep N or prune it away. In rule pruning, however, we can eliminate it where it is not necessary (i.e., for patterns such as x2) and retain it for others (such as x1). A final benefit is that the reduced rule set may give improved interpretability. Although rule pruning was not part of the original CART approach, such pruning can be easily applied to CART trees. We shall consider an example of rule pruning in Sect. 8.4.2.

8.3.5 Assignment of leaf node labels

Assigning category labels to the leaf nodes is the simplest step in tree construction. If successive nodes are split as far as possible, and each leaf node corresponds to patterns in a single category (zero impurity), then of course this category label is assigned to the leaf. In the more typical case, where either stopped splitting or pruning is used and the leaf nodes have positive impurity, each leaf should be labeled by the category that has the most points represented. An extremely small impurity is not necessarily desirable, since it may be an indication that the tree is overfitting the training data. Example 1 illustrates some of these steps.

Example 1: A simple tree classifier

Consider the following n = 16 points in two dimensions for training a binary CART tree (B = 2) using the entropy impurity (Eq. 1).
    ω1 (black):  x1: .15  .09  .29  .38  .52  .57  .73  .47
                 x2: .83  .55  .35  .70  .48  .73  .75  .06

    ω2 (red):    x1: .10  .08  .23  .70  .62  .91  .65  .75
                 x2: .29  .15  .16  .19  .47  .27  .90  .36* (.32†)

[Figure here: the training data with decision regions R1 and R2, and the associated unpruned trees. The top tree has root split x1 < 0.6 (impurity 1.0), with descendant splits including x2 < 0.32, x2 < 0.61, x1 < 0.35 and x1 < 0.69; the bottom tree, for the altered point, includes splits x2 < 0.33, x2 < 0.09, x1 < 0.6 and x1 < 0.69.]

Training data and associated (unpruned) tree are shown at the top. The entropy impurity at non-terminal nodes is shown in red and the impurity at each leaf is 0. If the single training point marked * were instead slightly lower (marked †), the resulting tree and decision regions would differ significantly, as shown at the bottom.

The impurity of the root node is

    i(N_root) = − Σ_{i=1}^{2} P(ω_i) log2 P(ω_i) = −[.5 log2 .5 + .5 log2 .5] = 1.0.

For simplicity we consider candidate splits parallel to the feature axes, i.e., of the form "is x_i < x_is?". By exhaustive search of the n − 1 positions for the x1 feature and n − 1 positions for the x2 feature we find by Eq. 5 that the greatest reduction in the impurity occurs near x1s = 0.6, and hence this becomes the decision criterion at the root node. We continue for each sub-tree until each final node represents a single category (and thus has the lowest impurity, 0), as shown in the figure. If pruning were invoked, the pair of leaf nodes at the left would be the first to be deleted (gray shading) since there the impurity is increased the least. In this example, stopped splitting with the proper threshold would also give the same final tree. In general, however, with large trees and many pruning steps, pruning and stopped splitting need not lead to the same final tree. This particular training set shows how trees can be sensitive to details of the training points. If the ω2 point marked * in the top figure is moved slightly (marked †), the tree and decision regions differ significantly, as shown at the bottom.
Such instability is due in large part to the discrete nature of the decisions early in tree learning. Example 1 illustrates the informal notion of instability, or sensitivity to training points. Of course, if we train any common classifier with a slightly different training set the final classification decisions will differ somewhat. If we train a CART classifier, however, the alteration of even a single training point can lead to radically different decisions overall. This is a consequence of the discrete and inherently greedy nature of such tree creation. Instability often indicates that incremental and off-line versions of the method will yield significantly different classifiers, even when trained on the same data.

8.3.6 Computational complexity

Suppose we have n training patterns in d dimensions in a two-category problem, and wish to construct a binary tree based on splits parallel to the feature axes using an entropy impurity. What are the time and the space complexities? At the root node (level 0) we must first sort the training data, O(n log n) for each of the d features or dimensions. The entropy calculation is O(n) + (n − 1)O(d) since we examine n − 1 possible splitting points. Thus for the root node the time complexity is O(dn log n). Consider an average case, where roughly half the training points are sent to each of the two branches. The above analysis implies that splitting each node in level 1 has complexity O(d(n/2) log(n/2)); since there are two such nodes at that level, the total complexity is O(dn log(n/2)). Similarly, for level 2 we have O(dn log(n/4)), and so on. The total number of levels is O(log n). We sum the terms for the levels and find that the total average time complexity is O(dn (log n)^2). The time complexity for recall is just the depth of the tree, i.e., the total number of levels, O(log n).
The space complexity is simply the number of nodes, which, given some simplifying assumptions (such as a single training point per leaf node), is 1 + 2 + 4 + ... + n/2 ≈ n, that is, O(n) (Problem 9). We stress that these assumptions (for instance equal splits at each node) rarely hold exactly; moreover, heuristics can be used to speed the search for splits during training. Nevertheless, the result that for fixed dimension d the training is O(dn^2 log n) and classification O(log n) is a good rule of thumb; it illustrates how training is far more computationally expensive than classification, and that on average this discrepancy grows as the problem gets larger.

There are several techniques for reducing the complexity during the training of trees based on real-valued data. One of the simplest heuristics is to begin the search for splits x_is at the "middle" of the range of the training set, moving alternately to progressively higher and lower values. Optimal splits always occur for decision thresholds between adjacent points from different categories, and thus one should test only such ranges. These and related techniques generally provide only moderate reductions in computation (Computer exercise ??). When the patterns consist of nominal data, candidate splits could be over every subset of attributes, or just a single entry, and the computational burden is best lowered using insight into features (Problem 3).

8.3.7 Feature choice

As with most pattern recognition techniques, CART and other tree-based methods work best if the "proper" features are used (Fig. 8.5). For real-valued vector data, most standard preprocessing techniques can be used before creating a tree. Preprocessing by principal components (Chap. ??) can be effective, since it finds the "important" axes, and this generally leads to simple decisions at the nodes.
If, however, the principal axes in one region differ significantly from those in another region, then no single choice of axes overall will suffice. In that case we may need to employ the techniques of Sect. 8.3.8, for instance allowing splits at arbitrary orientations, often giving smaller and more compact trees.

8.3.8 Multivariate decision trees

If the "natural" splits of real-valued data do not fall parallel to the feature axes, or the full training data set differs significantly from simple or accommodating distributions, then the above methods may be rather inefficient and lead to poor generalization (Fig. 8.6); even pruning may be insufficient to give a good classifier. The simplest solution is to allow splits that are not parallel to the feature axes, such as a general linear classifier trained via gradient descent on a classification or sum-squared-error criterion (Chap. ??). While such training may be slow for the nodes near the root if the training set is large, training will be faster at nodes closer to the leaves since less training data is used. Recall can remain quite fast since the linear functions at each node can be computed rapidly.

8.3.9 Priors and costs

Up to now we have tacitly assumed that a category ω_i is represented with the same frequency in both the training and the test data. If this is not the case, we need a method for controlling tree creation so as to have lower error on the actual final classification task when the frequencies are different. The most direct method is to "weight" samples to correct for the prior frequencies (Problem 16). Furthermore, we may seek to minimize a general cost, rather than a strict misclassification or 0-1
cost. As in Chap. ??, we represent such information in a cost matrix λ_ij -- the cost of classifying a pattern as ω_i when it is actually ω_j. Cost information is easily incorporated into a Gini impurity, giving the following weighted Gini impurity,

    i(N) = Σ_{ij} λ_ij P(ω_i) P(ω_j),    (10)

which should be used during training. Costs can be incorporated into other impurity measures as well (Problem 11).

[Figure 8.5 here: top, an axis-parallel tree with many nodes (splits such as x1 < 0.27, x2 < 0.32, x2 < 0.6, x1 < 0.07, x1 < 0.55, x2 < 0.86 and x1 < 0.81) and its complicated decision regions; bottom, a simple tree using the single linear decision 1 − 1.2 x1 + x2 < 0.1.]

Figure 8.5: If the class of node decisions does not match the form of the training data, a very complicated decision tree will result, as shown at the top. Here decisions are parallel to the axes while in fact the data is better split by boundaries along another direction. If however "proper" decision forms are used (here, linear combinations of the features), the tree can be quite simple, as shown at the bottom.

[Figure 8.6 here: top, an axis-parallel tree (splits x2 < 0.5, x1 < 0.95, x2 < 0.56 and x2 < 0.54) and its decision regions R1 and R2; bottom, a multivariate tree with the linear decisions 0.04 x1 + 0.16 x2 < 0.11, 0.27 x1 − 0.44 x2 < −0.02, 0.96 x1 − 1.77 x2 < −0.45 and 5.43 x1 − 13.33 x2 < −6.03.]

Figure 8.6: One form of multivariate tree employs general linear decisions at each node, giving splits along arbitrary directions in the feature space. In virtually all interesting cases the training data is not linearly separable, and thus the LMS algorithm is more useful than methods that require the data to be linearly separable, even though the LMS need not yield a minimum in classification error (Chap. ??). The tree at the bottom can be simplified by methods outlined in Sect. 8.4.2.

8.3.10 Missing attributes

Classification problems might have missing attributes during training, during classification, or both.
Consider first training a tree classifier despite the fact that some training patterns are missing attributes. A naive approach would be to delete from consideration any such deficient patterns; however, this is quite wasteful and should be employed only if there are many complete patterns. A better technique is to proceed as otherwise described above (Sect. 8.3.2), but instead calculate impurities at a node N using only the attribute information present. Suppose there are n training points at N and that each has three attributes, except one pattern that is missing attribute x3. To find the best split at N, we calculate possible splits using all n points for attribute x1, then all n points for attribute x2, then the n − 1 non-deficient points for attribute x3. Each such split has an associated reduction in impurity, calculated as before, though here with different numbers of patterns. As always, the desired split is the one which gives the greatest decrease in impurity. The generalization of this procedure to more features, to multiple patterns with missing attributes, and even to patterns with several missing attributes is straightforward, as is its use in classifying non-deficient patterns (Problem 14).

Now consider how to create and use trees that can classify a deficient pattern. The trees described above cannot directly handle test patterns lacking attributes (but see Sect. 8.4.2), and thus if we suspect that such deficient test patterns will occur, we must modify the training procedure discussed in Sect. 8.3.2. The basic approach during classification is to use the traditional ("primary") decision at a node whenever possible (i.e., when the query involves a feature that is present in the deficient test pattern) but to use alternate queries whenever the test pattern is missing that feature.
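The training-time computation just described can be sketched as follows (an illustrative implementation, not from the text; missing values are marked None, the entropy impurity is assumed, and candidate thresholds are taken midway between adjacent present values):

```python
from math import log2

def entropy(labels):
    """Entropy impurity of a list of category labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split_on_feature(patterns, labels, f):
    """Best threshold and impurity drop for feature f, computed from
    only those points whose value of f is present (not None)."""
    present = [(p[f], c) for p, c in zip(patterns, labels)
               if p[f] is not None]
    base = entropy([c for _, c in present])
    best_t, best_drop = None, -1.0
    vals = sorted(v for v, _ in present)
    for lo, hi in zip(vals, vals[1:]):
        if lo == hi:
            continue
        t = (lo + hi) / 2
        left = [c for v, c in present if v < t]
        right = [c for v, c in present if v >= t]
        p_left = len(left) / len(present)
        drop = base - p_left * entropy(left) - (1 - p_left) * entropy(right)
        if drop > best_drop:
            best_t, best_drop = t, drop
    return best_t, best_drop

# Four points with three attributes; one point is missing x3, so the
# x3 candidates are computed from the remaining three points.
pts = [(1, 5, 10), (2, 6, None), (3, 7, 30), (4, 8, 40)]
labels = ['a', 'a', 'b', 'b']
print(best_split_on_feature(pts, labels, 0))
print(best_split_on_feature(pts, labels, 2))
```

The impurity drops for different features are thus computed over different numbers of patterns, exactly as the text notes.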
During training, then, in addition to the primary split, each non-terminal node N is given an ordered set of surrogate splits, consisting of an attribute label and a rule. The first such surrogate split maximizes the "predictive association" with the primary split. A simple measure of the predictive association of two splits s1 and s2 is merely the numerical count of the patterns sent to the "left" by both s1 and s2, plus the count of the patterns sent to the "right" by both splits. The second surrogate split is defined similarly, being the one which uses another feature and best approximates the primary split in this way. Of course, during classification of a deficient test pattern, we use the first surrogate split that does not involve the test pattern's missing attributes. This missing value strategy corresponds to a linear model replacing the pattern's missing value by the value of the non-missing attribute most strongly correlated with it (Problem ??). This strategy uses to maximum advantage the (local) associations among the attributes to decide the split when attribute values are missing. A method closely related to surrogate splits is that of virtual values, in which the missing attribute is assigned its most likely value.

Example 2: Surrogate splits and missing attributes

Consider the creation of a monothetic tree using an entropy impurity and the following ten training points, each having the three features x1, x2 and x3. Since the tree will be used to classify test patterns with missing features, we will give each node surrogate splits.

    ω1:  x1 = (0, 7, 8)^t,  x2 = (1, 8, 9)^t,  x3 = (2, 9, 0)^t,  x4 = (4, 1, 1)^t,  x5 = (5, 2, 2)^t
    ω2:  y1 = (3, 3, 3)^t,  y2 = (6, 0, 4)^t,  y3 = (7, 4, 5)^t,  y4 = (8, 5, 6)^t,  y5 = (9, 6, 7)^t

Through exhaustive search along all three features, we find the primary split at the root node should be "x1 < 5.5?", which sends {x1, x2, x3, x4, x5, y1} to the left and {y2, y3, y4, y5} to the right, as shown in the figure.
We now seek the first surrogate split at the root node; such a split must be based on either the x2 or the x3 feature. Through exhaustive search we find that the split "x3 < 3.5?" has the highest predictive association with the primary split -- a value of 8, since 8 patterns are sent in matching directions by the two rules, as shown in the figure. The second surrogate split must be along the only remaining feature, x2. We find that for this feature the rule "x2 < 3.5?" has the highest predictive association with the primary split, a value of 6. (This, incidentally, is not the optimal x2 split for impurity reduction -- we use it because it best approximates the preferred, primary split.) While the above describes the training of the root node, training of other nodes is conceptually the same, though computationally less complex because fewer points need be considered.

[Figure here: the primary split "x1 < 5.5?" sends x1, x2, x3, x4, x5, y1 left and y2, y3, y4, y5 right; the first surrogate split "x3 < 3.5?" sends x3, x4, x5, y1 left and x1, x2, y2, y3, y4, y5 right (predictive association with the primary split = 8); the second surrogate split "x2 < 3.5?" sends x4, x5, y1, y2 left and x1, x2, x3, y3, y4, y5 right (predictive association with the primary split = 6).]

Of all possible splits based on a single feature, the primary split, "x1 < 5.5?", minimizes the entropy impurity of the full training set. The first surrogate split at the root node must use a feature other than x1; its threshold is set in order to best approximate the action of the primary split. In this case "x3 < 3.5?" is the first surrogate split. Likewise, here the second surrogate split must use the x2 feature; its threshold is chosen to best approximate the action of the primary split. In this case "x2 < 3.5?" is the second surrogate split. The pink shaded band marks those patterns sent in the matching direction as the primary split. The number of patterns in the shading is thus the predictive association with the primary split.
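The count underlying the figure can be computed directly (an illustrative implementation, not from the text; the Example 2 points are written as (x1, x2, x3) triples and each split as a (feature index, threshold) pair):

```python
def goes_left(pattern, feature, threshold):
    """True if the split sends this pattern down the left branch."""
    return pattern[feature] < threshold

def predictive_association(patterns, split_a, split_b):
    """Number of patterns sent in the same direction by both splits."""
    return sum(goes_left(p, *split_a) == goes_left(p, *split_b)
               for p in patterns)

# The ten points of Example 2: x1..x5 (category 1), then y1..y5.
points = [(0, 7, 8), (1, 8, 9), (2, 9, 0), (4, 1, 1), (5, 2, 2),
          (3, 3, 3), (6, 0, 4), (7, 4, 5), (8, 5, 6), (9, 6, 7)]
primary = (0, 5.5)                                   # "x1 < 5.5?"
print(predictive_association(points, primary, (2, 3.5)))  # first surrogate
print(predictive_association(points, primary, (1, 3.5)))  # second surrogate
```

The two counts reproduce the values 8 and 6 reported in the example.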
During classification, any test pattern containing feature x1 would be queried using the primary split, "x1 < 5.5?" Consider though the deficient test pattern (*, 2, 4)^t, where * is the missing x1 feature. Since the primary split cannot be used, we turn instead to the first surrogate split, "x3 < 3.5?", which sends this point to the right. Likewise, the test pattern (*, 2, *)^t would be queried by the second surrogate split, "x2 < 3.5?", and sent to the left.

Sometimes the fact that an attribute is missing can be informative. For instance, in medical diagnosis, the fact that an attribute (such as blood sugar level) is missing might imply that the physician had some reason not to measure it. As such, a missing attribute could be represented as a new feature, and used in classification.

8.4 Other tree methods

Virtually all tree-based classification techniques can incorporate the fundamental techniques described above. In fact, that discussion expanded beyond the core ideas in the earliest presentations of CART. While most tree-growing algorithms use an entropy impurity, there are many choices for stopping rules, for pruning methods and for the treatment of missing attributes. Here we discuss just two other popular tree algorithms.

8.4.1 ID3

ID3 received its name because it was the third in a series of identification or "ID" procedures. It is intended for use with nominal (unordered) inputs only. If the problem involves real-valued variables, they are first binned into intervals, each interval being treated as an unordered nominal attribute. Every split has a branching factor Bj, where Bj is the number of discrete attribute bins of the variable j chosen for splitting. In practice these are seldom binary, and thus a gain ratio impurity should be used (Sect. 8.3.2). Such trees have their number of levels equal to the number of input variables. The algorithm continues until all nodes are pure or there are no more variables to split on.
While there is thus no pruning in standard presentations of the ID3 algorithm, it is straightforward to incorporate pruning along the ideas presented above (Computer exercise 4).

8.4.2 C4.5

The C4.5 algorithm, the successor and refinement of ID3, is the most popular in a series of "classification" tree methods. In it, real-valued variables are treated the same as in CART. Multi-way (B > 2) splits are used with nominal data, as in ID3, with a gain ratio impurity based on Eq. 7. The algorithm uses heuristics for pruning based on the statistical significance of splits. A clear difference between C4.5 and CART involves classifying patterns with missing features. During training there are no special accommodations for subsequent classification of deficient patterns in C4.5; in particular, there are no surrogate splits precomputed. Instead, if node N with branching factor B queries the missing feature in a deficient test pattern, C4.5 follows all B possible answers to the descendent nodes and ultimately to B leaf nodes. The final classification is based on the labels of the B leaf nodes, weighted by the decision probabilities at N. (These probabilities are simply those of decisions at N on the training data.) Each of N's immediate descendent nodes can be considered the root of a sub-tree implementing part of the full classification model. This missing-attribute scheme corresponds to weighting these sub-models by the probability that any training pattern at N would go to the corresponding outcome of the decision. This method does not exploit statistical correlations between different features of the training points, whereas the method of surrogate splits in CART does. Since C4.5 does not compute surrogate splits and hence does not need to store them, this algorithm may be preferred over CART if space complexity (storage) is a major concern.

The C4.5 algorithm has the provision for pruning based on the rules derived from the learned tree.
Each leaf node has an associated rule -- the conjunction of the decisions leading from the root node, through the tree, to that leaf. A technique called C4.5Rules deletes redundant antecedents in such rules. To understand this, consider the left-most leaf in the tree at the bottom of Fig. 8.6, which corresponds to the rule

    IF   (0.04 x1 + 0.16 x2 < 0.11)
    AND  (0.27 x1 − 0.44 x2 < −0.02)
    AND  (0.96 x1 − 1.77 x2 < −0.45)
    AND  (5.43 x1 − 13.33 x2 < −6.03)
    THEN x ∈ ω1.

This rule can be simplified to give

    IF   (0.04 x1 + 0.16 x2 < 0.11)
    AND  (5.43 x1 − 13.33 x2 < −6.03)
    THEN x ∈ ω1,

as should be evident in that figure. Note especially that information corresponding to nodes near the root can be pruned by C4.5Rules. This is more general than impurity-based pruning methods, which instead merge leaf nodes.
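The C4.5 treatment of deficient test patterns described above can be sketched as follows (an illustrative implementation with binary splits; the tree encoding and names are ours, not C4.5's own):

```python
def classify(node, x):
    """Descend the tree; when the queried feature is missing (None),
    follow every branch and weight the resulting leaf distributions by
    the training-data probabilities of each outcome at the node.

    A leaf is a dict of category probabilities; an internal node is a
    (feature, threshold, left, right, p_left) tuple, where p_left is
    the fraction of training patterns that went left at that node."""
    if isinstance(node, dict):
        return node
    feature, threshold, left, right, p_left = node
    if x[feature] is None:
        dl, dr = classify(left, x), classify(right, x)
        return {c: p_left * dl.get(c, 0.0) + (1 - p_left) * dr.get(c, 0.0)
                for c in set(dl) | set(dr)}
    return classify(left if x[feature] < threshold else right, x)

# A stump querying feature 0 against 2.0; 60% of its training
# patterns went left.
tree = (0, 2.0, {'a': 1.0}, {'b': 1.0}, 0.6)
print(classify(tree, (1.0,)))   # feature present: follow one branch
print(classify(tree, (None,)))  # feature missing: blended distribution
```

As the text notes, this scheme weights the sub-models by the decision probabilities at the node, but, unlike surrogate splits, exploits no correlations among features.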
In general, pruning is to be preferred over stopped training and cross-validation, since it takes advantage of more of the information in the training set; however, pruning large training sets can be computationally expensive. The pruning of rules is less useful for problems that have high noise and are at base statistical in nature, but such pruning can often simplify classifiers for problems where the data were generated by rules themselves. Likewise, decision trees are poor at inferring simple concepts, for instance whether more than half of the binary (discrete) attributes have value +1. As with most classification methods, one gains expertise and insight through experimentation on a wide range of problems. No single tree algorithm dominates or is dominated by others. It has been found that trees yield classifiers with accuracy comparable to other methods we have discussed, such as neural networks and nearest-neighbor classifiers, especially when specific prior information about the appropriate form of classifier is lacking. Tree-based classifiers are particularly useful with non-metric data and as such they are an important tool in pattern recognition research. 8.5 *Recognition with strings character word Suppose the patterns are represented as ordered sequences or strings of discrete items, as in a sequence of letters in an English word or in DNA bases in a gene sequence, such as "AGCTTCGAATC." (The letters A, G, C and T stand for the nucleic acids adenine, guanine, cytosine and thymine.) Pattern classification based on such strings of discrete symbols differs in a number of ways from the more commonly used techniques we have addressed up to here. Because the string elements -- called characters, letters or symbols -- are nominal, there is no obvious notion of distance between strings. There is a further difficulty arising from the fact that strings need not be of the same length. 
While such strings are surely not vectors, we nevertheless broaden our familiar boldface notation to apply to strings as well, e.g., x = "AGCTTC," though we will often refer to them as patterns, strings, templates or general words. (Of course, there is no requirement that these be meaningful words in a natural language such as English or French.) A particularly long string is denoted text. Any contiguous string that is part of x is called a substring, segment, or more frequently a factor of x. For example, "GCT" is a factor of "AGCTTC." There is a large number of problems in computations on strings. The ones that are of greatest importance in pattern recognition are:

String matching: Given x and text, test whether x is a factor of text, and if so, where it appears.

Edit distance: Given two strings x and y, compute the minimum number of basic operations -- character insertions, deletions and exchanges -- needed to transform x into y.

String matching with errors: Given x and text, find the locations in text where the "cost" or "distance" of x to any factor of text is minimal.

String matching with the "don't care" symbol: This is the same as basic string matching, but with a special symbol, /, the don't-care symbol, which can match any other symbol.

We should begin by understanding the several ways in which these string operations are used in pattern classification. Basic string matching can be viewed as an extreme case of template matching, as in finding a particular English word within a large electronic corpus such as a novel or digital repository. Alternatively, suppose we have a large text such as Herman Melville's Moby Dick, and we want to classify it as either most relevant to the topic of fish or to the topic of hunting. Test strings or keywords for the fish topic might include "salmon," "whale," "fishing," "ocean," while those for hunting might include "gun," "bullet," "shoot," and so on.
String matching would determine the number of occurrences of such keywords in the text. A simple count of the keyword occurrences could then be used to classify the text according to topic. (Other, more sophisticated methods for this latter stage would generally be preferable.) The problem of string matching with the don't-care symbol is closely related to standard string matching, even though the best algorithms for the two types of problems differ, as we shall see. Suppose, for instance, that in DNA sequence analysis we have a segment of DNA, such as x = "AGCCG/////GACTG," where the first and last sections (called motifs) are important for coding a protein while the middle section, which consists of five characters, is nevertheless known to be inert and to have no function. If we are given an extremely long DNA sequence (the text), string matching with the don't-care symbol using the pattern x containing / symbols would determine whether text is in the class of sequences that could yield the particular protein.

The string operation that finds greatest use in pattern classification is based on edit distance, and is best understood in terms of the nearest-neighbor algorithm (Chap. ??). Recall that in that algorithm each training pattern or prototype is stored along with its category label; an unknown test pattern is then classified by its nearest prototype. Suppose now that the prototypes are strings and we seek to classify a novel test string by its "nearest" stored string. For instance, an acoustic speech recognizer might label every 10-ms interval with the most likely phoneme present in an utterance, giving a string of discrete phoneme labels such as "tttoooonn." Edit distance would then be used to find the "nearest" stored training pattern, so that its category label can be read. The difficulty in this approach is that there is no obvious notion of metric or distance between strings.
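The edit distance named in the problem list above can be computed with a standard dynamic program; a minimal sketch (unit costs for insertions, deletions and exchanges are assumed, and the stored prototype "toon" is a hypothetical example of ours):

```python
def edit_distance(x, y):
    """Minimum number of character insertions, deletions and exchanges
    needed to transform string x into string y."""
    m, n = len(x), len(y)
    # D[i][j] holds the distance between x[:i] and y[:j].
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        D[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            exchange = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,             # deletion
                          D[i][j - 1] + 1,             # insertion
                          D[i - 1][j - 1] + exchange)  # exchange (or match)
    return D[m][n]

# The phoneme string above against the hypothetical prototype:
print(edit_distance("tttoooonn", "toon"))
```

Five deletions transform "tttoooonn" into "toon", so the distance here is 5; a nearest-neighbor classifier over strings would compare such distances across all stored prototypes.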
In order to proceed, then, we must introduce some measure of distance between the strings. The resulting edit distance is the minimum number of fundamental operations needed to transform the test string into a prototype string, as we shall see. The string-matching-with-errors problem contains aspects of both the basic string matching and the edit distance problems. The goal is to find all locations in text where x is "close" to a substring or factor of text. This measure of closeness is chosen to be an edit distance.

In the Boyer-Moore string-matching algorithm, two heuristics each propose an amount by which the current shift s can be safely increased without missing a valid shift; the larger of these proposed shifts is selected and s is increased accordingly. The bad-character heuristic utilizes the rightmost character in text that does not match the aligned character in x. Because character comparisons proceed right-to-left, this "bad character" is found as efficiently as possible. Since the current shift s is invalid, no more character comparisons are needed and a shift increment can be made. The bad-character heuristic proposes incrementing the shift by an amount to align the rightmost occurrence of the bad character in x with the bad character identified in text. This guarantees that no valid shifts have been skipped (Fig. 8.8).

[Figure 8.8 here: the pattern x = "estimates" aligned beneath the text "probabilities_for_estimates" at an invalid shift s, then at the shifts s + 3 (proposed by the bad-character heuristic) and s + 7 (proposed by the good-suffix heuristic).]

Figure 8.8: String matching by the Boyer-Moore algorithm takes advantage of information obtained at one shift s to propose the next shift; the algorithm is generally much less computationally expensive than naive string matching, which always increments shifts by a single character. The top figure shows the alignment of text and pattern x for an invalid shift s.
Character comparisons proceed right to left, and the first two such comparisons are a match -- the good suffix is "es." The first (rightmost) mismatched character in text, here "i," is called the bad character. The bad-character heuristic proposes incrementing the shift to align the rightmost "i" in x with the bad character "i" in text -- a shift increment of 3, as shown in the middle figure. The bottom figure shows the effect of the good-suffix heuristic, which proposes incrementing the shift the least amount that will align the good suffix "es" in x with that in text -- here an increment of 7. Lines 11 & 12 of the Boyer-Moore algorithm select the larger of the two proposed shift increments, i.e., 7 in this case. Although not shown in this figure, after the mismatch is detected at shift s + 7, both the bad-character and the good-suffix heuristics propose an increment of yet another 7 characters, thereby finding a valid shift.

Now consider the good-suffix heuristic, which operates in parallel with the bad-character heuristic and also proposes a safe shift increment. A general suffix of x is a factor or substring of x that contains the final character in x. (Likewise, a prefix contains the initial character in x.) At shift s the rightmost contiguous characters in text that match those in x are called the good suffix, or "matching suffix." As before, because character comparisons are made right to left, the good suffix is found with the minimum number of comparisons. Once a character mismatch has been found, the good-suffix heuristic proposes to increment the shift so as to align the next occurrence of the good suffix in x with that identified in text. This ensures that no valid shift has been skipped. Given the two shift increments proposed by the two heuristics, line 12 of the Boyer-Moore algorithm chooses the larger.

These heuristics rely on the functions F and G.
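The bad-character heuristic alone already yields a working matcher. The sketch below builds the last-occurrence table for a pattern and uses only the bad-character shift (essentially the Boyer-Moore-Horspool simplification, not the full Boyer-Moore algorithm with its good-suffix function); note that the rightmost "t" in "estimates" is at position 7:

```python
def last_occurrence(x):
    """F(x): rightmost 1-based position of each character of x;
    characters absent from x are implicitly 0."""
    F = {}
    for pos, ch in enumerate(x, start=1):
        F[ch] = pos        # later occurrences overwrite earlier ones
    return F

def bad_character_match(text, x):
    """Return all valid shifts of x in text, advancing the shift s
    by the bad-character heuristic only."""
    F = last_occurrence(x)
    m, shifts, s = len(x), [], 0
    while s <= len(text) - m:
        j = m - 1
        while j >= 0 and x[j] == text[s + j]:   # compare right to left
            j -= 1
        if j < 0:
            shifts.append(s)                    # valid shift found
            s += 1
        else:
            bad = text[s + j]                   # mismatched text character
            # j + 1 is the 1-based pattern position of the mismatch;
            # align the rightmost occurrence of the bad character in x
            # (0 if absent), always advancing by at least one
            s += max(1, j + 1 - F.get(bad, 0))
    return shifts

print(last_occurrence("estimates"))
print(bad_character_match("probabilities_for_estimates", "estimates"))  # [18]
```

On the example of Fig. 8.8 the matcher examines only four shifts (s, s + 5, s + 14, s + 18) before finding the single valid one.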
The last-occurrence function, F(x), is merely a table containing every letter in the alphabet and the position of its rightmost occurrence in x. For the pattern in Fig. 8.8, the table would contain: a, 6; e, 8; i, 4; m, 5; s, 9; and t, 7. All 20 other letters in the English alphabet are assigned the value 0, signifying that they do not appear in x. The construction of this table is simple (Problem 22) and need be done just once; it does not significantly affect the computational cost of the Boyer-Moore algorithm. The good-suffix function, G(x), creates a ta[...]

[...]11 is the location.

Two minor heuristics for reducing computational effort are relevant to the string-matching-with-errors problem. The first is that, except in highly unusual cases, the lengths of the candidate factors of text that need be considered are roughly equal to length[x]. Second, for each candidate shift, the edit-distance calculation can be terminated if it already exceeds the current minimum. In practice, this latter heuristic can reduce the computational burden significantly. Otherwise, the algorithm for string matching with errors is virtually the same as that for edit distance (Computer exercise 10).

8.5.5 String matching with the "don't-care" symbol

String matching with the "don't-care" symbol, /, is formally the same as basic string matching, but the / in either x or text is said to match any character (Fig. 8.11).

[Figure: the text "rch_pa/tter//s_in_long/str/ngs" with the pattern x = "pat/rs" shown at its only valid shift.]

Figure 8.11: String matching with the don't-care symbol is the same as basic string matching except that the / symbol -- in either text or x -- matches any character. The figure shows the only valid shift.

An obvious approach to string matching with the don't-care symbol is to modify the naive string-matching algorithm to include a condition for matching the don't-care symbol. Such an approach, however, retains the computational inefficiencies of naive string matching (Problem 29).
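The modification just mentioned is a one-line change to the character test. A minimal sketch of the naive approach (O(length[x]) work at every shift, which is exactly the inefficiency noted above), applied to the DNA pattern from the start of this section with a made-up surrounding text:

```python
DONT_CARE = "/"

def chars_match(a, b):
    """A don't-care symbol in either the text or the pattern matches anything."""
    return a == b or a == DONT_CARE or b == DONT_CARE

def naive_dont_care_match(text, x):
    """Naive string matching with the don't-care symbol: test every shift."""
    m = len(x)
    return [s for s in range(len(text) - m + 1)
            if all(chars_match(text[s + j], x[j]) for j in range(m))]

# The motif pattern of the DNA example; the surrounding text is invented.
print(naive_dont_care_match("TTAGCCGAAAAAGACTGTT", "AGCCG/////GACTG"))  # [2]
```

Any five characters may occupy the inert middle section, so the pattern matches at shift 2 regardless of what fills that region.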
Further, extending the Boyer-Moore algorithm to include / is somewhat difficult and inefficient. The most effective methods are based on fundamental methods in computer arithmetic and, while fascinating, would take us away from our central concerns of pattern recognition (cf. Bibliography). The use of this technique in pattern recognition is the same as string matching, with a particular type of "tolerance."

While learning is a general and fundamental technique throughout pattern recognition, it has found limited use in recognition with basic string matching. This is because the designer typically knows precisely which strings are being sought -- they do not need to be learned. Learning can, of course, be based on the outputs of a string-matching algorithm, as part of a larger pattern recognition system.

8.6 Grammatical methods

Up to here, we have not been concerned with any detailed models that might underlie the generation of the sequence of characters in a string. We now turn to the case where rules of a particular sort are used to generate the strings and thus where their structure is fundamental. Often this structure is hierarchical, where at the highest or most abstract level a sequence is very simple, but at subsequent levels there is greater and greater complexity. For instance, at its most abstract level, the string "The history book clearly describes several wars" is merely a sentence. At a somewhat more detailed level it can be described as a noun phrase followed by a verb phrase. The noun phrase can be expanded at yet a subsequent level, as can the verb phrase. The expansion ends when we reach the words "The," "history," and so forth -- items that are considered the "characters," atomic and without further structure. Consider, too, strings representing valid telephone numbers -- local, national and international.
Such numbers conform to a strict structure: either a country code is present or it is not; if not, then the domestic national code may or may not be present; if a country code is present, then there is a set of permissible city codes, and for each city there is a set of permissible area codes and individual local numbers, and so on. As we shall see, such structure is easily specified in a grammar, and when such structure is present the use of a grammar for recognition can improve accuracy.

For instance, grammatical methods can be used to provide constraints for a full system that uses a statistical recognizer as a component. Consider an optical character recognition system that recognizes and interprets mathematical equations based on a scanned pixel image. Mathematical symbols often have specific "slots" that can be filled with only certain other symbols, and this can be specified by a grammar. Thus an integral sign has two slots, for the upper and lower limits, and these can be filled by only a limited set of symbols. (Indeed, a grammar is used in many mathematical typesetting programs in order to prevent authors from creating meaningless "equations.") A full system that recognizes the integral sign could use a grammar to limit the number of candidate categories for a particular slot, and this increases the accuracy of the full system. Similarly, consider the problem of recognizing phone numbers within acoustic speech in an automatic dialing application. A statistical or hidden-Markov-model acoustic recognizer might perform word spotting and pick out number words such as "eight" and "hundred." A subsequent stage based on a formal grammar would then exploit the fact that telephone numbers are highly constrained, as mentioned.

We shall study the case when crisp rules specify how the representation at one level leads to a more expanded and complicated representation at the next level.
We sometimes call a string generated by a set of rules a sentence; the rules are specified by a grammar, denoted G. (Naturally, there is no requirement that these be related in any way to sentences in natural language such as English.) In pattern recognition, we are given a sentence and a grammar, and seek to determine whether the sentence was generated by G.

8.6.1 Grammars

The notion of a grammar is very general and powerful. Formally, a grammar G consists of four components:

symbols: Every sentence consists of a string of characters (which are also called primitive symbols, terminal symbols or letters), taken from an alphabet A. For bookkeeping, it is also convenient to include the null or empty string, denoted ε, which has length zero; if ε is appended to any string x, the result is again x.

variables: These are also called non-terminal symbols, intermediate symbols or occasionally internal symbols, and are taken from a set I.

root symbol: The root symbol or starting symbol is a special internal symbol, the source from which all sequences are derived. The root symbol is taken from a set S.

productions: The set of production rules, rewrite rules, or simply rules, denoted P, specifies how to transform a set of variables and symbols into other variables and symbols. These rules determine the core structures that can be produced by the grammar. For instance, if A is an internal symbol and c a terminal symbol, the rewrite rule cA → cc means that any time the segment cA appears in a string, it can be replaced by cc.

Thus we denote a general grammar by its alphabet, its variables, its particular root symbol, and its rewrite rules: G = (A, I, S, P). The language generated by the grammar, denoted L(G), is the set of all strings (possibly infinite in number) that can be generated by G. Consider two examples; the first is quite simple and abstract.
Let A = {a, b, c}, I = {A, B, C}, S = S, and

P = {
  p1: S → aSBA OR aBA
  p2: AB → BA
  p3: bB → bb
  p4: bA → bc
  p5: cA → cc
  p6: aB → ab
}

(In order to make the list of rewrite rules more compact, we shall condense rules having the same left-hand side by means of the OR on the right-hand side. Thus rule p1 is a condensation of the two rules S → aSBA and S → aBA.) If we start with S and apply the rewrite rules in the following orders, we have the following two cases:

  S --p1--> aBA --p6--> abA --p4--> abc

  S --p1--> aSBA --p1--> aaBABA --p6--> aabABA --p2--> aabBAA --p3--> aabbAA --p4--> aabbcA --p5--> aabbcc

After the rewrite rules have been applied in these sequences, no more symbols match the left-hand side of any rewrite rule, and the process is complete. Such a transformation from the root symbol to a final string is called a production. These two productions show that abc and aabbcc are in the language generated by G. In fact, it can be shown (Problem 38) that this grammar generates the language L(G) = {aⁿbⁿcⁿ | n ≥ 1}.

A much more complicated grammar underlies the English language, of course. The alphabet consists of all English words, A = {the, history, book, sold, over, 1000, copies, ...}, and the intermediate symbols are the parts of speech: I = {⟨noun⟩, ⟨verb⟩, ⟨noun phrase⟩, ⟨verb phrase⟩, ⟨adjective⟩, ⟨adverb⟩, ⟨adverbial phrase⟩}. The root symbol here is S = ⟨sentence⟩. A restricted set of the production rules in English includes:

P = {
  ⟨sentence⟩ → ⟨noun phrase⟩ ⟨verb phrase⟩
  ⟨noun phrase⟩ → ⟨adjective⟩ ⟨noun phrase⟩
  ⟨verb phrase⟩ → ⟨verb phrase⟩ ⟨adverbial phrase⟩
  ⟨noun⟩ → book OR theorem OR ...
  ⟨verb⟩ → describes OR buys OR holds OR ...
  ⟨adverb⟩ → over OR ...
}

This subset of the rules of English grammar does not prevent the generation of meaningless sentences, of course. For instance, the nonsense sentence "Squishy green dreams hop heuristically" can be derived in this subset of English grammar.
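The two derivations above can be replayed mechanically. A minimal sketch (the rule names and helper function are ours): each step replaces the leftmost occurrence of a rule's left-hand side, which reproduces the derivations exactly:

```python
rules = {                      # productions of G, one alternative per entry
    "p1a": ("S", "aSBA"), "p1b": ("S", "aBA"),
    "p2": ("AB", "BA"), "p3": ("bB", "bb"),
    "p4": ("bA", "bc"), "p5": ("cA", "cc"), "p6": ("aB", "ab"),
}

def derive(start, rule_sequence):
    """Apply rewrite rules in order, each time replacing the leftmost
    occurrence of the rule's left-hand side in the current string."""
    s = start
    for name in rule_sequence:
        lhs, rhs = rules[name]
        assert lhs in s, f"rule {name} does not apply to {s}"
        s = s.replace(lhs, rhs, 1)   # replace only the leftmost occurrence
    return s

print(derive("S", ["p1b", "p6", "p4"]))                           # abc
print(derive("S", ["p1a", "p1b", "p6", "p2", "p3", "p4", "p5"]))  # aabbcc
```

Running other rule sequences (or none that apply) makes it easy to convince oneself that every string this grammar can complete has the form aⁿbⁿcⁿ.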
Figure 8.12 shows the steps of a production in a derivation tree, where the root symbol is displayed at the top and the terminal symbols at the bottom.

[Figure: a derivation tree rooted at ⟨sentence⟩, expanding through ⟨noun phrase⟩, ⟨verb phrase⟩, ⟨adjective⟩, ⟨noun⟩, ⟨verb⟩, ⟨preposition⟩ and ⟨adverbial phrase⟩ down to the terminal words "The history book sold over 1000 copies."]

Figure 8.12: This derivation tree illustrates how a portion of English grammar can transform the root symbol, here ⟨sentence⟩, into a particular sentence or string of elements, here English words, which are read from left to right.

8.6.2 Types of string grammars

There are four main types of grammar, arising from different types of structure in the productions. As we have seen, a rewrite rule is of the form α → β, where α and β are strings made up of intermediate and terminal symbols.

Type 0: Free or unrestricted. Free grammars have no restrictions on the rewrite rules, and thus they provide no constraints or structure on the strings they can produce. While in principle they can express an arbitrary set of rules, this generality comes at the tremendous expense of possibly unbounded learning time. Knowing that a string is derived from a type 0 grammar provides no information, and as such type 0 grammars in general have little use in pattern recognition.

Type 1: Context-sensitive. A grammar is called context-sensitive if every rewrite rule is of the form

  αIβ → αxβ,

where α and β are any strings made up of intermediate and terminal symbols, I is an intermediate symbol and x is an intermediate or terminal symbol (other than ε). We say that "I can be rewritten as x in the context of α on the left and β on the right."

Type 2: Context-free. A grammar is called context-free if every production is of the form

  I → x,

where I is an intermediate symbol and x an intermediate or terminal symbol (other than ε).
Clearly, unlike a type 1 grammar, here there is no need for a "context" for the rewriting of I by x.

Type 3: Finite State or Regular. A grammar is called regular if every rewrite rule is of the form

  α → zβ OR α → z,

where α and β are made up of intermediate symbols and z is a terminal symbol (other than ε). Such grammars are also called finite state because they can be generated by a finite state machine, which we shall see in Fig. 8.16.

A language generated by a grammar of type i is called a type i language. It can be shown that the class of grammars of type i includes all grammars of type i + 1; thus there is a strict hierarchy in grammars.

Any context-free grammar can be converted into one in Chomsky normal form (CNF). Such a grammar has all rules of the form

  A → BC and A → z,

where A, B and C are intermediate symbols (that is, they are in I) and z is a terminal symbol. For every context-free grammar G, there is another grammar G′ in Chomsky normal form such that L(G) = L(G′) (Problem 36).

Example 3: A grammar for pronouncing numbers

In order to understand these issues better, consider a grammar that yields the pronunciation of any number between 1 and 999,999. The alphabet has 29 basic terminal symbols, i.e., the spoken words

  A = {one, two, ..., ten, eleven, ..., twenty, thirty, ..., ninety, hundred, thousand}.

There are six non-terminal symbols, corresponding to general six-digit, three-digit and two-digit numbers, the numbers between ten and nineteen, and so forth, as will be clear below:

  I = {digits6, digits3, digits2, digit1, teens, tys}.

The root node corresponds to a general number up to six digits in length: S = digits6. The set of rewrite rules is based on a knowledge of English:

P = {
  digits6 → digits3 thousand digits3 OR digits3 thousand OR digits3
  digits3 → digit1 hundred digits2 OR digit1 hundred OR digits2
  digits2 → teens OR tys OR tys digit1 OR digit1
  digit1 → one OR two OR ... OR nine
  teens → ten OR eleven OR ...
  ... OR nineteen
  tys → twenty OR thirty OR ... OR ninety
}

The grammar takes digits6 and applies the productions until the elements in the final alphabet are produced, as shown in the figure. Because it contains rewrite rules such as digits6 → digits3 thousand, this grammar cannot be type 3. It is easy to confirm that it is type 2.

[Figure: two derivation trees rooted at digits6, one yielding "six hundred thirty nine thousand fourteen" for 639,014 and one yielding "two thousand nine hundred fifty three" for 2,953.]

These two derivation trees show how the grammar G yields the pronunciation of 639,014 and 2,953. The final string of terminal symbols is read from left to right.

8.6.3 Recognition using grammars

Recognition using grammars is formally very similar to the general approaches used throughout pattern recognition. Suppose we suspect that a test sentence was generated by one of c different grammars, G1, G2, ..., Gc, which can be considered as different models or classes. A test sentence x is classified according to which grammar could have produced it, or equivalently, the language L(Gi) of which x is a member.

Up to now we have worked forward -- forming a derivation from a root node to a final sentence. For recognition, though, we must employ the inverse process: that is, given a particular x, find a derivation in G that leads to x. This process, called parsing, is virtually always much more difficult than forming a derivation. We now discuss one general approach to parsing, and briefly mention two others.

Bottom-up parsing

Bottom-up parsing starts with the test sentence x and seeks to simplify it, so as to represent it as the root symbol. The basic approach is to use candidate productions from P "backwards," i.e., find rewrite rules whose right-hand side matches part of the current string, and replace that part with a segment that could have produced it.
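The pronunciation grammar of Example 3 is deterministic enough to implement directly. A sketch in which the function names mirror the non-terminals (the recursive structure of the functions is exactly the structure of the derivation trees above):

```python
ones = "one two three four five six seven eight nine".split()
teen_words = ("ten eleven twelve thirteen fourteen fifteen "
              "sixteen seventeen eighteen nineteen").split()
ty_words = "twenty thirty forty fifty sixty seventy eighty ninety".split()

def digits2(n):                      # 1 <= n <= 99
    if n < 10:
        return ones[n - 1]           # digits2 -> digit1
    if n < 20:
        return teen_words[n - 10]    # digits2 -> teens
    tys = ty_words[n // 10 - 2]      # digits2 -> tys OR tys digit1
    return tys if n % 10 == 0 else tys + " " + ones[n % 10 - 1]

def digits3(n):                      # 1 <= n <= 999
    if n < 100:
        return digits2(n)
    head = ones[n // 100 - 1] + " hundred"
    return head if n % 100 == 0 else head + " " + digits2(n % 100)

def digits6(n):                      # 1 <= n <= 999999; the root symbol
    if n < 1000:
        return digits3(n)
    head = digits3(n // 1000) + " thousand"
    return head if n % 1000 == 0 else head + " " + digits3(n % 1000)

print(digits6(639014))   # six hundred thirty nine thousand fourteen
print(digits6(2953))     # two thousand nine hundred fifty three
```

Each function returns the terminal words generated by the corresponding non-terminal, read left to right, so the outputs match the two derivation trees in the figure.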
This is the general method in the Cocke-Younger-Kasami algorithm, which fills a parse table from the "bottom up." The grammar must be expressed in Chomsky normal form, and thus the productions P must all be of the form A → BC or A → z -- a broad but not all-inclusive category of grammars. Entries in the parse table are candidate strings in a portion of a valid derivation. If the table contains the source symbol S, then indeed we can work forward from S and derive the test sentence, and hence x ∈ L(G). In the following, xi (for i = 1, ..., n) represents the individual terminal characters in the string to be parsed.

Algorithm 4 (Bottom-up parsing)

 1 begin initialize G = (A, I, S, P), x = x1 x2 ... xn
 2   i ← 0
 3   do i ← i + 1
 4     Vi1 ← {A | A → xi ∈ P}
 5   until i = n
 6   j ← 1
 7   do j ← j + 1
 8     i ← 0
 9     do i ← i + 1
10       Vij ← ∅
11       k ← 0
12       do k ← k + 1
13         Vij ← Vij ∪ {A | A → BC ∈ P, B ∈ Vik and C ∈ V(i+k)(j−k)}
14       until k = j − 1
15     until i = n − j + 1
16   until j = n
17   if S ∈ V1n then print "parse of" x "successful in G"
18 return
19 end

Consider the operation of Algorithm 4 in the following simple abstract example. Let the grammar G have two terminal and three intermediate symbols: A = {a, b} and I = {A, B, C}. The root symbol is S, and there are just four production rules: S → AB OR BC, [...]

[...]on, as illustrated by the pink lines in Fig. 8.13.

[Figure: a derivation tree rooted at S whose leaves, read left to right, spell "babaa".]

Figure 8.15: This valid derivation of "babaa" in G can be read from the pink lines in the parse table of Fig. 8.13 generated by the bottom-up parse algorithm.

The bottom-up and top-down parsers just described are quite general, and there are a number of parsing algorithms which differ in space and time complexities. Many parsing methods depend upon the model underlying the grammar. One popular such model is the finite state machine. Such a machine consists of nodes and transition links; each node can emit a symbol, as shown in Fig. 8.16.

8.7 Grammatical inference

In many applications, the grammar is designed by hand.
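Algorithm 4 translates almost line for line into code. The full list of four production rules is truncated in this excerpt, so the sketch below assumes the classic CNF example grammar S → AB OR BC, A → BA OR a, B → CC OR b, C → AB OR a, which is consistent with the derivation of "babaa" shown in Fig. 8.15; the table V is indexed as in the algorithm:

```python
def cyk_parse(x, unit_rules, pair_rules, root="S"):
    """Fill the CYK parse table bottom up.  V[(i, j)] holds the variables
    that can derive the substring of length j starting at position i."""
    n = len(x)
    V = {}
    for i in range(1, n + 1):                       # substrings of length 1
        V[(i, 1)] = {A for A, z in unit_rules if z == x[i - 1]}
    for j in range(2, n + 1):                       # increasing lengths
        for i in range(1, n - j + 2):
            V[(i, j)] = set()
            for k in range(1, j):                   # split point
                for A, (B, C) in pair_rules:
                    if B in V[(i, k)] and C in V[(i + k, j - k)]:
                        V[(i, j)].add(A)
    return root in V[(1, n)]                        # parse successful?

# Assumed CNF grammar, consistent with the derivation of "babaa" in Fig. 8.15
unit_rules = [("A", "a"), ("B", "b"), ("C", "a")]
pair_rules = [("S", ("A", "B")), ("S", ("B", "C")),
              ("A", ("B", "A")), ("B", ("C", "C")), ("C", ("A", "B"))]

print(cyk_parse("babaa", unit_rules, pair_rules))   # True
```

The triple loop makes the O(n³) cost of parsing a length-n string with a context-free grammar explicit.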
Nevertheless, learning plays an extremely important role in pattern recognition research, and it is natural that we attempt to learn a grammar from example sentences it generates. When seeking to follow that general approach we are immediately struck by differences between the areas addressed by grammatical methods and those that can be described as statistical. First, for most languages there are many -- often an infinite number of -- grammars that can produce them. If two grammars G1 and G2 generate the same language (and no other sentences), then this ambiguity is of no consequence; recognition will be the same. However, since training is always based on a finite set of samples, the problem is underspecified. There are an infinite number of grammars consistent with the training data, and thus we cannot recover the source grammar uniquely.

[Figure: a finite state machine whose nodes S, A, B, C, D, E, F, G are linked by transitions emitting the terminal symbols "the," "mouse," "cow," "was," "found," "seen," "by," "under," "the," "barn," "farmer."]

Figure 8.16: One type of finite state machine consists of nodes that can emit terminal symbols ("the," "mouse," etc.) and transition to another node. Such operation can be described by a grammar. For instance, the rewrite rules for this finite state machine include S → the A, A → mouse B OR cow B, and so on. Clearly these rules imply that this finite state machine implements a type 3 grammar. The final internal node (shaded) would lead to the null symbol ε.

There are two main techniques used to make the problem of inferring a grammar from instances tractable. The first is to use both positive and negative instances. That is, we use a set D+ of sentences known to be derivable in the grammar; we also use a set D− of sentences known not to be derivable in the grammar. In a multicategory case, it is common to take the positive instances for Gi and use them as negative examples for Gj, j ≠ i. Even with both positive and negative instances, a finite training set rarely specifies the grammar uniquely.
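A finite state machine like that of Fig. 8.16 is straightforward to simulate. The exact transition structure of the figure is partly garbled in this excerpt, so the transitions below are an assumption consistent with the words shown; accepting a sentence is equivalent to deriving it in the corresponding type 3 grammar:

```python
# Each state maps an emitted terminal word to the next state; "G" is the
# final (shaded) state, which leads to the null symbol.
transitions = {
    "S": {"the": "A"},
    "A": {"mouse": "B", "cow": "B"},
    "B": {"was": "C"},
    "C": {"found": "D", "seen": "D"},
    "D": {"by": "E", "under": "E"},
    "E": {"the": "F"},
    "F": {"barn": "G", "farmer": "G"},
}

def accepts(sentence, start="S", final="G"):
    """True iff the word sequence is generated by the machine, i.e.,
    is in the language of the corresponding type 3 grammar."""
    state = start
    for word in sentence.split():
        if word not in transitions.get(state, {}):
            return False          # no transition emits this word here
        state = transitions[state][word]
    return state == final

print(accepts("the mouse was found by the farmer"))   # True
print(accepts("the barn was mouse"))                  # False
```

Note the machine reads one word per transition, never backtracks, and so recognizes a sentence in time linear in its length -- the hallmark of type 3 languages.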
Thus our second technique is to impose conditions and constraints. A trivial illustration is that we demand that the alphabet of the candidate grammar contain only those symbols that appear in the training sentences. Moreover, we demand that every production rule in the grammar be used. We seek the "simplest" grammar that explains the training instances, where "simple" generally refers to the total number of rewrite rules, or the sum of their lengths, or some other natural criterion. These are versions of Occam's razor, that the simplest explanation of the data is to be preferred (Chap. ??).

In broad overview, learning proceeds as follows. An initial grammar G0 is guessed. Often it is useful to specify the type of grammar (1, 2 or 3), and thus place constraints on the forms of the candidate rewrite rules. In the absence of other prior information, it is traditional to make G0 as simple as possible and gradually expand the set of productions as needed. Positive training sentences xi+ are selected from D+ one by one. If xi+ cannot be parsed by the grammar, then new rewrite rules are proposed for P. A new rule is accepted if and only if it is used for a successful parse of xi+ and does not allow any negative samples to be parsed. In greater detail, then, an algorithm for inferring the grammar is:

Algorithm 5 (Grammatical inference (overview))

 1 begin initialize D+, D− [...]

[...]set of rules that "cover" the training data. After such training it is traditional to simplify the resulting logical rule by means of standard logical methods. The designer must specify the predicates and functions, based on prior knowledge of the problem domain. The algorithm begins by considering the most general rules using these predicates and functions, and finds the "best" simple rule. Here, "best" means that the rule describes the largest number of training examples.
Then, the algorithm searches among all refinements of the best rule, choosing the refinement that is itself "best." This process is iterated until no more refinements can be added, or the number of items described is maximum. In this way a single, though possibly complex, if-then rule has been learned (Fig. 8.18). The sequential covering algorithm iterates this process and returns a set of rules. Because of its greedy nature, the algorithm need not learn the smallest set of rules.

[Figure: a tree of candidate rules, from the empty rule IF THEN Fish(x)=T, through single-predicate rules such as IF Swims(x) THEN Fish(x)=T and IF HasHair(x) THEN Fish(x)=F, to compound rules such as IF Swims(x) AND HasScales(x) AND HasGills(x) THEN Fish(x)=T.]

Figure 8.18: In sequential covering, candidate rules are searched through successive refinements. First, the "best" rule having a single conditional predicate is found, i.e., the one explaining most training data. Next, other candidate predicates are added, the best compound rule selected, and so forth.

A general approach is to search first through all rules having a single attribute; next, consider rules having a conjunction of two predicates, then longer conjunctions, and so on. Note that this greedy algorithm need not be optimal -- that is, it need not yield the most compact rule.

Summary

Non-metric data consist of lists of nominal attributes; such lists might be unordered or ordered (strings). Tree-based methods such as CART, ID3 and C4.5 rely on answers to a series of questions (typically binary) for classification.
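The greedy search just described can be sketched in a few lines. The toy data and predicate names below are ours, not from the text; each rule is a conjunction of predicates, grown greedily until it covers no negative examples, after which the positives it covers are removed:

```python
def covers(rule, example):
    """A rule is a conjunction of predicate names; all must hold."""
    return all(example[p] for p in rule)

def learn_one_rule(data, predicates):
    """Greedily add the predicate that best separates positives from
    negatives until the rule covers no negative examples."""
    rule, candidates = [], list(predicates)
    while candidates and any(covers(rule, ex) for ex, label in data if not label):
        best = max(candidates,
                   key=lambda p: sum(1 if label else -1
                                     for ex, label in data
                                     if covers(rule + [p], ex)))
        rule.append(best)
        candidates.remove(best)
    return rule

def sequential_covering(data, predicates):
    """Iterate rule learning, removing covered positives each time."""
    rules = []
    positives = [(ex, lab) for ex, lab in data if lab]
    negatives = [(ex, lab) for ex, lab in data if not lab]
    while positives:
        rule = learn_one_rule(positives + negatives, predicates)
        rules.append(rule)
        positives = [(ex, lab) for ex, lab in positives if not covers(rule, ex)]
    return rules

# Toy training data: is x a fish?
data = [({"swims": True,  "has_scales": True},  True),
        ({"swims": True,  "has_scales": True},  True),
        ({"swims": True,  "has_scales": False}, False),   # e.g., a whale
        ({"swims": False, "has_scales": False}, False)]
print(sequential_covering(data, ["swims", "has_scales"]))  # [['has_scales']]
```

As the text warns, the greedy choice at each step need not yield the smallest rule set, only one that covers the training data.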
The designer selects the form of question, and the tree is grown, starting at the root node, by finding splits of data that make the representation more "pure." There are several acceptable impurity measures, such as misclassification, variance and Gini; the entropy impurity, however, has found greatest use. To avoid overfitting and to improve generalization, one must either employ stopped splitting (declaring a node with nonzero impurity to be a leaf) or instead prune a tree trained to minimum-impurity leaves. Tree classifiers are very flexible and can be used in a wide range of applications, including those with data that is metric, non-metric or a combination.

When comparing patterns that consist of strings of non-numeric symbols, we use edit distance -- a measure of the number of fundamental operations (insertions, deletions, exchanges) that must be performed to transform one string into another. While the general edit distance is not a true metric, edit distance can nevertheless be used for nearest-neighbor classification. String matching is finding whether a test string appears in a longer text. The requirement of a perfect match in basic string matching can be relaxed, as in string matching with errors, or with the don't-care symbol. While these basic string and pattern recognition ideas are simple and straightforward, addressing them in large problems requires algorithms that are computationally efficient.

Grammatical methods assume the strings are generated from certain classes of rules, which can be described by an underlying grammar. A grammar G consists of an alphabet, intermediate symbols, a starting or root symbol and, most importantly, a set of rewrite rules, or productions. The four different types of grammars -- free, context-sensitive, context-free and regular -- make different assumptions about the nature of the transformations.
Parsing is the task of accepting a string x and determining whether it is a member of the language generated by G, and if so, finding a derivation. Grammatical methods find greatest use in highly structured environments, particularly where structure lies at many levels. Grammatical inference generally uses positive and negative example strings (i.e., ones in the language generated by G and ones not in that language) to infer a set of productions.

Rule-based systems use either propositional logic (variable-free) or first-order logic to describe a category. In broad overview, rules can be learned by sequentially "covering" elements in the training set by successively more complex compound rules.

Bibliographical and Historical Remarks

Most work on decision trees addresses problems with continuous features, though a key property of the method is that it applies to nominal data too. Some of the foundations of tree-based classifiers stem from the Concept Learning System described in [42], but the important book on CART [10] provided a strong statistical foundation and revived interest in the approach. Quinlan has been a leading exponent of tree classifiers, introducing ID3 [66] and C4.5 [69], as well as the application of minimum description length for pruning [71, 56]. A good overview is [61], and a comparison of multivariate decision tree methods is given in [11]. Splitting and pruning criteria based on probabilities are explored in [53], and the use of an interesting information metric for this end is described in [52]. The Gini index was first used in the analysis of variance in categorical data [47]. Incremental or on-line learning in decision trees is explored in [85]. The missing variable problem in trees is addressed in [10, 67], which describe methods more general than those presented here. An unusual parallel "neural" search through trees was presented in [78].
The use of edit distance began in the 1970s [64]; a key paper by Wagner and Fischer proposed the fundamental Algorithm 3 and showed that it was optimal [88]. The explosion of digital information, especially natural language text, has motivated work on string matching and related operations. An excellent survey is [5], and two thorough books are [23, 82]. The computational complexity of string algorithms is presented in [21, Chapter 34]. The fast string-matching method of Algorithm 2 was introduced in [9]; its complexity, speedups and improvements were discussed in [18, 35, 24, 4, 40, 83]. String edit distance that permits block-level transpositions is discussed in [48]. Some sophisticated string operations -- two-dimensional string matching, longest common subsequence and graph matching -- have found only occasional use in pattern recognition. Statistical methods applied to strings are discussed in [26]. Finite-state automata have been applied to several problems in string matching [23, Chapter 7], as well as to time series prediction and switching, for instance converting from an alphanumeric representation to a binary representation [43]. String matching has been applied to the recognition of DNA sequences and text, and is essential in most pattern recognition and template matching involving large databases of text [14]. There is a growing literature on special-purpose hardware for string operations, of which the Splash-2 system [12] is a leading example.

The foundations of a formal study of grammar, including the classification of grammars, began with the landmark book by Chomsky [16]. An early exposition of grammatical inference [39, Chapter 6] was the source for much of the discussion here. Recognition based on parsing (Latin pars orationis, "part of speech") has been fundamental in automatic language recognition.
Some of the earliest work on three-dimensional object recognition relied on complex grammars which described the relationships of corners and edges in block structures such as arches and towers. It was found that such systems were very brittle; they failed whenever there were errors in feature extraction due to occlusion, or even minor misspecifications of the model. For the most part, then, grammatical methods have been abandoned for object recognition and scene analysis [60, 25]. Grammatical methods have been applied to the recognition of some simple, highly structured diagrams, such as electrical circuits, simple maps and even Chinese/Japanese characters. For useful surveys of the basic ideas in syntactic pattern recognition see [33, 34, 32, 13, 62, 14]; for parsing see [28, 3]; for grammatical inference see [59]. The complexity of parsing a type 3 language is linear in the length of the string, that of type 2 is low-order polynomial, and that of type 1 is exponential; pointers to the relevant literature appear in [76]. There has been a great deal of work on parsing natural language and speech, and a good textbook on artificial intelligence addressing this topic and much more is [75]. There is much work on inferring grammars from instances, such as the Crespi-Reghizzi algorithm (context-free) [22]. If queries can be presented interactively, the learning of a grammar can be sped up [81]. The methods described in this chapter have been expanded to allow for stochastic grammars, where there are probabilities associated with rules [20]. A grammar can be considered a specification of a prior probability for a class; for instance, a uniform prior over all (legal) strings in the language L. Error-correcting parsers have been used when random variations arise in an underlying stochastic grammar [50, 84]. One can also apply probability measures to languages [8].
Rule-based methods have formed the foundation of expert systems and have been applied extensively throughout many branches of artificial intelligence, such as planning, navigation and prediction; their use in pattern recognition has been modest, however. Early influential systems include DENDRAL, for inferring chemical structure from mass spectra [29], PROSPECTOR, for finding mineral deposits [38], and MYCIN, for medical diagnosis [79]. Early uses of rule induction for pattern recognition include those of Michalski [57, 58]. Figure 8.17 was inspired by Winston's influential work on learning simple geometrical structures and relationships [91]. Learning rules in this way is often called inductive logic programming; Clark and Niblett have made a number of contributions to the field, particularly their CN2 induction algorithm [17]. Quinlan, who has contributed much to the theory and application of tree-based classifiers, describes his FOIL algorithm, which uses a minimum description length criterion to stop the learning of first-order rules [68]. Texts on inductive logic programming include [46, 63], and on general machine learning, including rule inference, [44, 61].

Problems

Section 8.2

1. When a test pattern is classified by a decision tree, that pattern is subjected to a sequence of queries, corresponding to the nodes along a path from root to leaf. Prove that for any decision tree, there is a functionally equivalent tree in which every such path consists of distinct queries. That is, given an arbitrary tree, prove that it is always possible to construct an equivalent tree in which no test pattern is ever subjected to the same query twice.

Section 8.3

2. Consider classification trees that are non-binary.

(a) Prove that for any arbitrary tree, with possibly unequal branching ratios throughout, there exists a binary tree that implements the same classification function.

(b) Consider a tree with just two levels -- a root node connected to B leaf nodes (B >= 2).
What are the upper and the lower limits on the number of levels in a functionally equivalent binary tree, as a function of B?

(c) As in part (b), what are the upper and lower limits on the number of nodes in a functionally equivalent binary tree?

3. Compare the computational complexities of a monothetic and a polythetic tree classifier trained on the same data as follows. Suppose there are n/2 training patterns in each of two categories. Every pattern has d attributes, each of which can take on k discrete values. Assume that the best split evenly divides the set of patterns.

(a) How many levels will there be in the monothetic tree? The polythetic tree?

(b) In terms of the variables given, what is the complexity of finding the optimal split at the root of a monothetic tree? A polythetic tree?

(c) Compare the total complexities for training the two trees fully.

4. The task here is to find the computational complexity of training a tree classifier using the twoing impurity, where candidate splits are based on a single feature. Suppose there are c classes, w1, w2, ..., wc, each with n/c patterns that are d-dimensional. Proceed as follows:

(a) How many possible non-trivial divisions into two supercategories are there at the root node?

(b) For any one of these candidate supercategory divisions, what is the computational complexity of finding the split that minimizes the entropy impurity?

(c) Use your results from parts (a) & (b) to find the computational complexity of finding the split at the root node.

(d) Suppose for simplicity that each split divides the patterns into equal subsets, and furthermore that each leaf node corresponds to a single pattern. In terms of the variables given, what will be the expected number of levels of the tree?
(e) Naturally, the number of classes represented at any particular node will depend upon the level in the tree; at the root all c categories must be considered, while at the level just above the leaves, only 2 categories must be considered. (The pairs of particular classes represented will depend upon the particular node.) State some natural simplifying assumptions, and determine the number of candidate classes at any node as a function of level. (You may need to use the floor or ceiling notation, floor(x) or ceil(x), in your answer, as described in the Appendix.)

(f) Use your results from part (e) and the number of patterns to find the computational complexity at an arbitrary level L.

(g) Use all your results to find the computational complexity of training the full tree classifier.

(h) Suppose there are n = 2^10 patterns, each of which is d = 6 dimensional, evenly divided among c = 16 categories. Suppose that on a uniprocessor a fundamental computation requires roughly 10^-10 seconds. Roughly how long will it take to train your classifier using the twoing criterion? How long will it take to classify a single test pattern?

5. Consider training a binary tree using the entropy impurity, and refer to Eqs. 1 & 5.

(a) Prove that the decrease in entropy impurity provided by a single yes/no query can never be greater than one bit.

(b) For the two trees in Example 1, verify that each split reduces the impurity and that this reduction is never greater than 1 bit. Explain nevertheless why the impurity at a node can be lower than at its descendant, as occurs in that Example.

(c) Generalize your result from part (a) to the case with arbitrary branching ratio B >= 2.

48 CHAPTER 8. NON-METRIC METHODS

6. Let P(w1), ..., P(wc) denote the probabilities of the c classes at node N of a binary classification tree, with sum_{j=1}^{c} P(wj) = 1. Suppose the impurity i(P(w1), ..., P(wc)) at N is a strictly concave function of the probabilities. That is, for any two sets of probabilities {P^a(wj)} and {P^b(wj)}, if

i^a = i(P^a(w1), ..., P^a(wc)),
i^b = i(P^b(w1), ..., P^b(wc)), and
i* = i(lambda P^a(w1) + (1 - lambda) P^b(w1), ..., lambda P^a(wc) + (1 - lambda) P^b(wc)),

then for 0 <= lambda <= 1 we have lambda i^a + (1 - lambda) i^b <= i*.

(a) Prove that for any split we have delta-i(s) >= 0, with equality if and only if P(wj|TL) = P(wj|TR) = P(wj|T), for j = 1, ..., c. In other words, for a concave impurity function, splitting never increases the impurity.

(b) Prove that entropy impurity (Eq. 1) is a concave function.

(c) Prove that Gini impurity (Eq. 3) is a concave function.

7. Show that the surrogate split method described in the text corresponds to the assumption that the missing feature (attribute) is the most informative one.

8. Con05 level.

16. Consider the following patterns, each having four binary-valued attributes:

w1: 1100, 0000, 1010, 0011
w2: 1100, 1111, 1110, 0111

Note especially that the first patterns in the two categories are the same.

(a) Create by hand a binary classification tree for this data. Train your tree so that the leaf nodes have the lowest impurity possible.

(b) Suppose it is known that during testing the prior probabilities of the two categories will not be equal, but instead P(w1) = 2P(w2). Modify your training method and use the above data to form a tree for this case.

Section 8.4

17. Consider training a binary decision tree to classify two-component patterns from two categories. The first component is binary, 0 or 1, while the second component has six possible values, A through F:

w1: 1A, 0E, 0B, 1B, 1F, 0D
w2: 0A, 0C, 1C, 0F, 0B, 1D

Compare splitting the root node based on the first feature with splitting it on the second feature in the following way.

(a) Use an entropy impurity with a two-way split (i.e., B = 2) on the first feature and a six-way split on the second feature.

(b) Repeat (a), but using a gain ratio impurity.

(c) In light of your above answers, discuss the value of gain ratio impurity in cases where splits have different branching ratios.

Section 8.5

18.
Consider strings x and text, of length m and n, respectively, from an alphabet A consisting of d characters. Assume that the naive string-matching algorithm (Algorithm 1) exits the implied loop in line 4 as soon as a mismatch occurs. Prove that the number of character-to-character comparisons made on average for random strings is

(n - m + 1) (1 - d^-m)/(1 - d^-1) <= 2(n - m + 1).

19. Consider string matching using the Boyer-Moore algorithm (Algorithm 2) based on the trinary alphabet A = {a, b, c}. Apply the good-suffix function G and the last-occurrence function F to each of the following strings:

(a) "acaccacbac"
(b) "abababcbcbaaabcbaa"
(c) "cccaaababaccc"
(d) "abbabbabbcbbabbcbba"

20. Consider the string-matching problem illustrated in the top of Fig. 8.8. Assume the text begins at the first character of "probabilities."

(a) How many basic character comparisons are required by the naive string-matching algorithm (Algorithm 1) to find a valid shift?

(b) How many basic character comparisons are required by the Boyer-Moore string-matching algorithm (Algorithm 2)?

21. For each of the texts below, determine the number of fundamental character comparisons needed to find all valid shifts for the test string x = "abcca" using the naive string-matching algorithm (Algorithm 1) and the Boyer-Moore algorithm (Algorithm 2).

(a) "abcccdabacabbca"
(b) "dadadadadadadad"
(c) "abcbcabcabcabc"
(d) "accabcababacca"
(e) "bbccacbccabbcca"

22. Write pseudocode for an efficient construction of the last-occurrence function F used in the Boyer-Moore algorithm (Algorithm 2). Let d be the number of elements in the alphabet A, and m the length of string x.

(a) What is the time complexity of your algorithm in the worst case?

(b) What is the space complexity of your algorithm in the worst case?

(c) How many fundamental operations are required to compute F for the 26-letter English alphabet for x = "bonbon"? For x = "marmalade"? For x = "abcdabdabcaabcda"?

23.
Consider the training data from the trinary alphabet A = {a, b, c} in the table:

w1: aabbc, ababcc, babbcc
w2: bccba, bbbca, cbbaaaa
w3: caaaa, cbcaab, baaca

Use the simple edit distance to classify each of the strings below. If there are ambiguities in the classification, state which two (or all three) categories are candidates.

(a) "abacc"
(b) "abca"
(c) "ccbba"
(d) "bbaaac"

24. Repeat Problem 23 using its training data but the following test data:

(a) "ccab"
(b) "abdca"
(c) "abc"
(d) "bacaca"

25. Repeat Problem 23, but assume that the costs of the different string transformations are not equal. In particular, assume that an interchange is twice as costly as an insertion or a deletion.

26. Consider edit distance with positive but otherwise arbitrary costs associated with each of the fundamental operations of insertion, d

Contents

9 Algorithm-independent machine learning
  9.1 Introduction
  9.2 Lack of inherent superiority of any classifier
    9.2.1 No Free Lunch Theorem
    Example 1: No Free Lunch for binary data
    9.2.2 *Ugly Duckling Theorem
    9.2.3 Minimum description length (MDL)
    9.2.4 Minimum description length principle
    9.2.5 Overfitting avoidance and Occam's razor
  9.3 Bias and variance
    9.3.1 Bias and variance for regression
    9.3.2 Bias and variance for classification
  9.4 *Resampling for estimating statistics
    9.4.1 Jackknife
    Example 2: Jackknife estimate of bias and variance of the mode
    9.4.2 Bootstrap
  9.5 Resampling for classifier design
    9.5.1 Bagging
    9.5.2 Boosting
    Algorithm 1: AdaBoost
    9.5.3 Learning with queries
    9.5.4 Arcing, learning with queries, bias and variance
  9.6 Estimating and comparing classifiers
    9.6.1 Parametric models
    9.6.2 Cross validation
    9.6.3 Jackknife and bootstrap estimation of classification accuracy
    9.6.4 Maximum-likelihood model comparison
    9.6.5 Bayesian model comparison
    9.6.6 The problem-average error rate
    9.6.7 Predicting final performance from learning curves
    9.6.8 The capacity of a separating plane
  9.7 Combining classifiers
    9.7.1 Component classifiers with discriminant functions
    9.7.2 Component classifiers without discriminant functions
  Summary
  Bibliographical and Historical Remarks
  Problems
  Computer exercises
  Bibliography
  Index

Chapter 9

Algorithm-independent machine learning

9.1 Introduction

In the previous chapters we have seen many learning algorithms and techniques for pattern recognition. When confronting such a range of algorithms, every reader has wondered at one time or another which one is "best." Of course, some algorithms may be preferred because of their lower computational complexity; others may be preferred because they take into account some prior knowledge of the form of the data (e.g., discrete, continuous, unordered list, string, ...). Nevertheless, there are classification problems for which such issues are of little or no concern, or in which we wish to compare algorithms that are equivalent in regard to them. In these cases we are left with the question: Are there any reasons to favor one algorithm over another? For instance, given two classifiers that perform equally well on the training set, it is frequently asserted that the simpler classifier can be expected to perform better on a test set. But is this version of Occam's razor really so evident? Likewise, we frequently prefer or impose smoothness on a classifier's decision functions. Do simpler or "smoother" classifiers generalize better, and if so, why? In this chapter we address these and related questions concerning the foundations and philosophical underpinnings of statistical pattern classification. Now that the reader has intuition and experience with individual algorithms, these issues in the theory of learning may be better understood.

In some fields there are strict conservation laws and constraint laws -- such as the conservation of energy, charge and momentum in physics, or the second law of thermodynamics, which states that the entropy of an isolated system can never decrease. These hold regardless of the number and configuration of the forces at play. Given the usefulness of such laws, we naturally ask: Are there analogous results in pattern recognition, ones that do not depend upon the particular choice of classifier or learning method? Are there any fundamental results that hold regardless of the cleverness of the designer, the number and distribution of the patterns, and the nature of the classification task?

Of course it is very valuable to know that there exists a constraint on classifier accuracy, the Bayes limit, and it is sometimes useful to compare performance to this theoretical limit. Alas, in practice we rarely if ever know the Bayes error rate. Even if we did know this error rate, it would not help us much in designing a classifier; thus the Bayes error is generally of theoretical interest. What other fundamental principles and properties might be of greater use in designing classifiers?

Before we address such problems, we should clarify the meaning of the title of this chapter. "Algorithm-independent" here refers, first, to those mathematical foundations that do not depend upon the particular classifier or learning algorithm used. Our upcoming discussion of bias and variance is just as valid for methods based on neural networks as for nearest-neighbor methods or for model-dependent maximum likelihood. Second, we mean techniques that can be used in conjunction with different learning algorithms, or that provide guidance in their use. For example, cross validation and resampling techniques can be used with any of a large number of training methods. Of course, by the very general notion of an algorithm these too are algorithms, technically speaking, but we discuss them in this chapter because of their breadth of applicability and independence from the details of the learning techniques encountered up to here.
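To illustrate how cross validation is agnostic to the training method, here is a minimal k-fold sketch in Python; the `train` and `error` callables, and the toy majority-label "classifier" used below, are our own placeholders rather than anything defined in the text:

```python
def k_fold_cross_validation(data, k, train, error):
    """Estimate generalization error by averaging, over k folds, the error
    of a classifier trained on the other k-1 folds. The routine knows
    nothing about the classifier: train(samples) -> classifier and
    error(classifier, samples) -> float are supplied by the user."""
    folds = [data[i::k] for i in range(k)]  # simple interleaved split
    total = 0.0
    for i in range(k):
        held_out = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        total += error(train(training), held_out)
    return total / k

# Toy usage: a "classifier" that predicts the majority label of its training set.
data = [(0, 'a'), (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'b')]
train = lambda s: max(set(lbl for _, lbl in s), key=[lbl for _, lbl in s].count)
error = lambda c, s: sum(lbl != c for _, lbl in s) / len(s)
print(k_fold_cross_validation(data, 3, train, error))  # prints 0.1666...
```

Any learning method with the same two-function interface, from a decision tree to a neural network, can be dropped in without changing the cross-validation routine.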
In this chapter we shall see, first, that no pattern classification method is inherently superior to any other, or even to random guessing; it is the type of problem, the prior distribution and other information that determine which form of classifier should provide the best performance. We shall then explore several ways to quantify and adjust the "match" between a learning algorithm and the problem it addresses. In any particular problem there are differences between classifiers, of course, and thus we show that with certain assumptions we can estimate their accuracy (even, for instance, before the candidate classifier is fully trained) and compare different classifiers. Finally, we shall see methods for integrating component or "expert" classifiers, which themselves might implement any of a number of algorithms. We shall present the results that are most important for pattern recognition practitioners, occasionally skipping over mathematical details that can be found in the original research referenced in the Bibliographical and Historical Remarks section.

9.2 Lack of inherent superiority of any classifier

We now turn to the central question posed above: If we wish to compare learning algorithms overall, we must average over all possible target functions consistent with the training data. Part 2 of Theorem 9.1 states that, averaged over all possible target functions, there is no difference in off-training-set errors between the two algorithms. For each of the 2^5 = 32 distinct target functions consistent with the n = 3 patterns in D, there is exactly one other target function whose output is inverted for each of the patterns outside the training set, and this ensures that the performances of algorithms 1 and 2 will also be inverted, so that the contributions to the formula in Part 2 cancel. Thus indeed Part 2 of the Theorem, as well as Eq. 4, are obeyed. Figure 9.1 illustrates a result derivable from Part 1 of Theorem 9.1.
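The cancellation argument above can be checked by brute force. The sketch below is our own illustration (the two algorithms and the training labels are assumptions, not the book's example): with three binary features there are 8 patterns, fixing labels on a 3-pattern training set leaves 2^5 = 32 consistent target functions, and any two deterministic algorithms attain the same off-training-set error averaged over those targets:

```python
from itertools import product

# The 8 patterns of three binary features; the first 3 form the training set D.
patterns = list(product([0, 1], repeat=3))
D = patterns[:3]
train_labels = {x: 0 for x in D}     # arbitrary but fixed training labels
off_training = patterns[3:]          # the 5 off-training-set patterns

# Two deterministic "learning algorithms" (i.e., the hypotheses they output):
h1 = lambda x: 0                              # always predict 0
h2 = lambda x: 0 if x in train_labels else 1  # agree on D, predict 1 elsewhere

def mean_ots_error(h):
    """Average off-training-set error over all 2^5 target functions
    consistent with the training labels."""
    wrong = 0
    targets = list(product([0, 1], repeat=len(off_training)))
    for labels in targets:
        target = dict(zip(off_training, labels))
        wrong += sum(h(x) != target[x] for x in off_training)
    return wrong / (len(targets) * len(off_training))

print(mean_ots_error(h1), mean_ots_error(h2))  # both print 0.5
```

Whatever fixed predictions an algorithm makes off the training set, exactly half of the consistent targets contradict each prediction, so every algorithm averages to chance performance, as the Theorem requires.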
Each of the six squares represents the set of all possible classification problems; note that this is not the standard feature space. If a learning system performs well -- with higher than average generalization accuracy -- over some set of problems, then it must perform worse than average elsewhere, as shown in a). No system can perform well throughout the full set of functions, d); to do so would violate the No Free Lunch Theorem.

In sum, all statements of the form "learning/recognition algorithm 1 is better than algorithm 2" are ultimately statements about the relevant target functions. There is, hence, a "conservation theorem" in generalization: for every possible learning algorithm for binary classification, the sum of performance over all possible target functions is exactly zero. Thus we cannot achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems. While we may hope that we never have to apply any particular algorithm to certain problems, all we can do is trade performance on problems we do not expect to encounter for performance on those we do expect to encounter. This, and the other results from the No Free Lunch Theorem, stress that it is the assumptions about the learning domains that are relevant. Another practical import of the Theorem is that even popular and theoretically grounded algorithms will perform poorly on some problems, namely those in which the learning algorithm and the posterior happen not to be "matched," as governed by Eq. 1. Practitioners must be aware of this possibility, which arises in real-world applications. Expertise limited to a small range of methods, even powerful ones such as neural networks, will not suffice for all classification problems.

Figure 9.1: The No Free Lunch Theorem shows the generalization performance on off-training-set data that can be achieved (top row, "possible learning systems") and the performance that cannot be achieved (bottom row, "impossible learning systems"). Each square represents all possible classification problems consistent with the training data -- this is not the familiar feature space. A + indicates that the classification algorithm has generalization higher than average, a - indicates lower than average, and a 0 indicates average performance; the size of a symbol indicates the amount by which the performance differs from the average. For instance, a) shows that it is possible for an algorithm to have high accuracy on a small set of problems so long as it has mildly poor performance on all other problems. Likewise, b) shows that it is possible to have excellent performance throughout a large range of problems, but this will be balanced by very poor performance on a large range of other problems. It is impossible, however, to have good performance throughout the full range of problems, shown in d). It is also impossible to have higher than average performance on some problems, and average performance everywhere else, shown in e).

Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.

9.2.2 *Ugly Duckling Theorem

While the No Free Lunch Theorem shows that in the absence of assumptions we should not prefer any learning or classification algorithm over another, an analogous theorem addresses features and patterns. Roughly speaking, the Ugly Duckling Theorem states that in the absence of assumptions there is no privileged or "best" feature representation, and that even the notion of similarity between patterns depends implicitly on assumptions which may or may not be correct.
Since we are using discrete representations, we can use logical expressions or "predicates" to describe a pattern, much as in Chap. ??. If we denote a binary feature attribute by fi, then a particular pattern might be described by the predicate "f1 AND f2," another pattern might be described as "NOT f2," and so on. Likewise we could have a predicate involving the patterns themselves, such as x1 OR x2. Figure 9.2 shows how patterns can be represented in a Venn diagram.

Figure 9.2: Patterns xi, represented as d-tuples of binary features fi, can be placed in a Venn diagram (here d = 3); the diagram itself depends upon the classification problem and its constraints. For instance, suppose f1 is the binary feature attribute has legs, f2 is has right arm and f3 the attribute has right hand. Thus in part a) pattern x1 denotes a person who has legs but neither arm nor hand; x2 a person who has legs and an arm, but no hand; and so on. Notice that the Venn diagram expresses the biological constraints associated with real people: it is impossible for someone to have a right hand but no right arm. Part c) expresses different constraints, such as the biological constraint of mutually exclusive eye colors. Thus attributes f1, f2 and f3 might denote brown, green and blue respectively, and a pattern xi describes a real person, whom we can assume cannot have eyes that differ in color.

Below we shall need to count predicates, and for clarity it helps to consider a particular Venn diagram, such as that in Fig. 9.3. This is the most general Venn diagram based on two features, since for every configuration of f1 and f2 there is indeed a pattern. Here predicates can be as simple as "x1," or more complicated, such as "x1 OR x2 OR x4," and so on.
Figure 9.3: The Venn diagram for a problem with no constraints on two features. Thus all four binary attribute vectors can occur.

The rank r of a predicate is the number of the simplest or indivisible elements it contains. The table below shows the predicates of rank 1, 2 and 3 associated with the Venn diagram of Fig. 9.3. Not shown is the fact that there is but one predicate of rank r = 4, the disjunction of x1, ..., x4, which has the logical value True. If we let n be the total number of regions in the Venn diagram (i.e., the number of distinct possible patterns), then there are C(n, r) predicates of rank r, as shown at the bottom of each column of the table.

rank r = 1:
x1 = f1 AND f2
x2 = f1 AND NOT f2
x3 = f2 AND NOT f1
x4 = NOT (f1 OR f2)
C(4, 1) = 4

rank r = 2:
x1 OR x2 = f1
x1 OR x3 = f2
x1 OR x4 = NOT (f1 XOR f2)
x2 OR x3 = f1 XOR f2
x2 OR x4 = NOT f2
x3 OR x4 = NOT f1
C(4, 2) = 6

rank r = 3:
x1 OR x2 OR x3 = f1 OR f2
x1 OR x2 OR x4 = f1 OR NOT f2
x1 OR x3 OR x4 = f2 OR NOT f1
x2 OR x3 OR x4 = NOT (f1 AND f2)
C(4, 3) = 4

(Technically speaking, we should use set operations rather than logical operations when discussing the Venn diagram, writing x1 ∪ x2 instead of x1 OR x2. Nevertheless we use logical operations here for consistency with the rest of the text.)

The total number of predicates in the absence of constraints is

sum_{r=0}^{n} C(n, r) = (1 + 1)^n = 2^n,    (5)

and thus for the d = 4 case of Fig. 9.3 there are 2^4 = 16 possible predicates (Problem 9). Note that Eq. 5 applies only to the case where there are no constraints; for Venn diagrams that do incorporate constraints, such as those in Fig. 9.2, the formula does not hold (Problem 10). Now we turn to our central question: In the absence of prior information, is there a principled reason to judge any two distinct patterns as more or less similar than two other distinct patterns?
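The count in Eq. 5 can be checked by enumerating predicates as disjunctions of Venn-diagram regions; this short Python sketch (our illustration) confirms the per-rank counts C(n, r) and the total of 2^n:

```python
from itertools import combinations
from math import comb

n = 4                                  # regions x1..x4 of Fig. 9.3 (no constraints)
regions = ['x1', 'x2', 'x3', 'x4']

# A predicate of rank r is a disjunction of r distinct regions.
counts = []
for r in range(n + 1):
    rank_r = list(combinations(regions, r))
    counts.append(len(rank_r))

assert counts == [comb(n, r) for r in range(n + 1)]  # binomial counts, per the table
print(counts)       # prints [1, 4, 6, 4, 1]
print(sum(counts))  # prints 16, i.e. 2**4 as in Eq. 5
```

The list includes the rank-0 (always-false) and rank-4 (always-true) predicates, which is why the total is the full 2^n of Eq. 5 rather than the 4 + 6 + 4 entries shown in the table.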
A natural and familiar measure of similarity is the number of features or attributes shared by two patterns, but even such an obvious measure presents conceptual difficulties. To appreciate these difficulties, consider first a simple example. Suppose attributes f1 and f2 represent blind in right eye and blind in left eye, respectively. If we base similarity on shared features, person x1 = {1, 0} (blind only in the right eye) is maximally different from person x2 = {0, 1} (blind only in the left eye). In particular, in this scheme x1 is more similar to a totally blind person and to a normally sighted person than he is to x2. But this result may prove unsatisfactory; we can easily envision many circumstances where we would consider a person blind in just the right eye to be "similar" to one blind in just the left eye. Such people might be permitted to drive automobiles, for instance. Further, a person blind in just one eye would differ significantly from a totally blind person, who would not be able to drive.

A second, related point is that there are always multiple ways to represent vectors (or tuples) of attributes. For instance, in the above example, we might use alternative features f1' and f2' to represent blind in right eye and same in both eyes, respectively; then the four types of people would be represented as follows:

f1  f2  |  f1'  f2'
 0   0  |   0    1
 0   1  |   0    0
 1   0  |   1    0
 1   1  |   1    1

Of course there are other representations, each more or less appropriate to the particular problem at hand. In the absence of prior information, though, there is no principled reason to prefer one of these representations over another.

We must then still confront the problem of finding a principled measure of the similarity between two patterns, given some representation. The only plausible candidate measure in this circumstance would be the number of predicates (rather than the number of features) that the patterns share.
Consider two distinct patterns (in some representation) xi and xj, with i != j. Regardless of the constraints in the problem (i.e., the Venn diagram), there are, of course, no predicates of rank r = 1 that are shared by the two patterns. There is but one predicate of rank r = 2, namely xi OR xj. A predicate of rank r = 3 must contain three patterns, two of which are xi and xj; since there are d patterns total, there are C(d - 2, 1) = d - 2 predicates of rank 3 shared by xi and xj. Likewise, for an arbitrary rank r, there are C(d - 2, r - 2) predicates shared by the two patterns, where 2 <= r <= d. The total number of predicates shared by the two patterns is thus the sum

sum_{r=2}^{d} C(d - 2, r - 2) = (1 + 1)^{d-2} = 2^{d-2}.    (6)

Note the key result: Eq. 6 is independent of the choice of xi and xj (so long as they are distinct). Thus we conclude that the number of predicates shared by two distinct patterns is constant, and independent of the patterns themselves (Problem 11). If we judge similarity based on the number of predicates that patterns share, then, any two distinct patterns are "equally similar." This is stated formally as:

Theorem 9.2 (Ugly Duckling) Given that we use a finite set of predicates that enables us to distinguish any two patterns under consideration, the number of predicates shared by any two such patterns is constant and independent of the choice of those patterns. Furthermore, if pattern similarity is based on the total number of predicates shared by two patterns, then any two patterns are "equally similar."

In summary, then, the Ugly Duckling Theorem states something quite simple yet important: there is no problem-independent or privileged or "best" set of features or feature attributes. Moreover, while the above was derived using d-tuples of binary values, it applies to continuous feature spaces too, if such a space is discretized (at any resolution).
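The invariance in Eq. 6 -- any two distinct patterns share exactly 2^(d-2) predicates -- can likewise be verified exhaustively. This sketch (our illustration, for the unconstrained case) checks every pair of patterns:

```python
from itertools import combinations

def shared_predicates(d, i, j):
    """Count predicates (disjunctions of 1..d distinct patterns out of d total)
    that contain both pattern i and pattern j."""
    count = 0
    for r in range(1, d + 1):
        for pred in combinations(range(d), r):
            if i in pred and j in pred:
                count += 1
    return count

d = 6
values = {shared_predicates(d, i, j) for i in range(d) for j in range(i + 1, d)}
print(values)  # prints {16}: every pair shares 2**(d-2) predicates
```

The set comprehension collapses to a single value, confirming that the count does not depend on which two distinct patterns are chosen.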
The Theorem forces us to acknowledge that even the apparently simple notion of similarity between patterns is fundamentally based on implicit assumptions about the problem domain (Problem 12). (The Theorem gets its fanciful name from the following counter-intuitive statement: assuming similarity is based on the number of shared predicates, an ugly duckling A is as similar to a beautiful swan B as another beautiful swan C is to B, given that these items differ from one another.)

9.2.3 Minimum description length (MDL)

It is sometimes claimed that the minimum description length principle provides justification for preferring one type of classifier over another -- specifically "simpler" classifiers over "complex" ones. Briefly stated, the approach purports to find some irreducible, smallest representation of all members of a category (much like a "signal"); all variation among the individual patterns is then "noise." The principle argues that by simplifying recognizers appropriately, the signal can be retained while the noise is ignored. Because the principle is so often invoked, it is important to understand what properly derives from it, what does not, and how it relates to the No Free Lunch Theorem. To do so, however, we must first understand the notion of algorithmic complexity.

Algorithmic complexity

Algorithmic complexity -- also known as Kolmogorov complexity, Kolmogorov-Chaitin complexity, descriptional complexity, shortest program length or algorithmic entropy -- seeks to quantify an inherent complexity of a binary string. (We shall assume both classifiers and patterns are described by such strings.) Algorithmic complexity can be explained by analogy to communication, the earliest application of information theory (App. ??). If a sender and receiver agree upon a specification method L, such as an encoding or compression technique, then message x can be transmitted as y, denoted L(y) = x.
The cost of transmission of x is the length of the transmitted message y, that is, |y|. The least such cost is hence the minimum length of such a message, denoted min_{y : L(y)=x} |y|; this minimal length is the entropy of x under the specification or transmission method L. Algorithmic complexity is defined by analogy to entropy, where instead of a specification method L, we consider programs running on an abstract computer, i.e., one whose functions (memory, processing, etc.) are described operationally and without regard to hardware implementation. Consider an abstract computer that takes as a program a binary string y, outputs a string x, and halts. In such a case we say that y is an abstract encoding or description of x. A universal description should be independent of the specification (up to some additive constant), so that we can compare the complexities of different binary strings. Such a method would provide a measure of the inherent information content, the amount of data which must be transmitted in the absence of any other prior knowledge. The Kolmogorov complexity of a binary string x, denoted K(x), is defined as the size of the shortest program y, measured in bits, that without additional data computes the string x and halts. Formally, we write

    K(x) = min_{y : U(y)=x} |y|,   (7)

where U represents an abstract universal Turing machine or Turing computer. For our purposes it suffices to state that a Turing machine is "universal" in that it can implement any algorithm and compute any computable function. Kolmogorov complexity is a measure of the incompressibility of x, and is analogous to minimal sufficient statistics, the optimally compressed representation of certain properties of a distribution (Chap. ??). Consider the following examples. Suppose x consists solely of n 1s. This string is actually quite "simple."
If we use some fixed number of bits k to specify a general program containing a loop for printing a string of 1s, we need merely log2 n more bits to specify the iteration number n, the condition for halting. Thus the Kolmogorov complexity of a string of n 1s is K(x) = O(log2 n). Next consider the transcendental number π, whose infinite sequence of seemingly random binary digits, 11.00100100001111110110101010001...₂, actually contains only a few bits of information: the size of the shortest program that can produce any arbitrarily large number of consecutive digits of π. Informally we say the algorithmic complexity of π is a constant; formally we write K(π) = O(1), which means K(π) does not grow with the number of desired bits. Another example is a "truly" random binary string, which cannot be expressed as a shorter string; its algorithmic complexity is within a constant factor of its length. For such a string we write K(x) = O(|x|), which means that K(x) grows as fast as the length of x (Problem 13).

9.2.4 Minimum description length principle

We now turn to a simple, "naive" version of the minimum description length principle and its application to pattern recognition. Given that all members of a category share some properties, yet differ in others, the recognizer should seek to learn the common or essential characteristics while ignoring the accidental or random ones. Kolmogorov complexity seeks to provide an objective measure of simplicity, and thus of the description of the "essential" characteristics. Suppose we seek to design a classifier using a training set D. The minimum description length (MDL) principle states that we should minimize the sum of the model's algorithmic complexity and the description of the training data with respect to that model, i.e.,

    K(h, D) = K(h) + K(D using h).   (8)

Thus we seek the model h* that obeys h* = arg min_h K(h, D) (Problem 14).
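In practice K(·) in Eq. 8 is uncomputable, but any compressor gives a computable upper bound, and that bound is enough to see the contrast between the string of n 1s and a "truly" random string discussed above. A small sketch (zlib here is merely an illustrative stand-in for the shortest program, not anything the text prescribes):

```python
import os
import zlib

def complexity_upper_bound(s: bytes) -> int:
    """Compressed length of s in bytes -- a crude, computable upper
    bound on (a scaled version of) the Kolmogorov complexity K(s)."""
    return len(zlib.compress(s, 9))

simple = b"1" * 100_000          # analogous to the string of n 1s
random_ = os.urandom(100_000)    # "truly" random: incompressible

print(complexity_upper_bound(simple))   # a few hundred bytes at most
print(complexity_upper_bound(random_))  # close to the full 100,000 bytes
```

The gap between the two printed values mirrors the gap between K(x) = O(log2 n) and K(x) = O(|x|).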
(Variations on the naive minimum description length principle use a weighted sum of the terms in Eq. 8.) In practice, determining the algorithmic complexity of a classifier depends upon a chosen class of abstract computers, and this means the complexity can be specified only up to an additive constant. A particularly clear application of the minimum description length principle is in the design of decision tree classifiers (Chap. ??). In this case, a model h specifies the tree and the decisions at the nodes; thus the algorithmic complexity of the model is proportional to the number of nodes. The complexity of the data given the model could be expressed in terms of the entropy (in bits) of the data D, the weighted sum of the entropies of the data at the leaf nodes. Thus if the tree is pruned based on an entropy criterion, there is an implicit global cost criterion that is equivalent to minimizing a measure of the general form in Eq. 8 (Computer exercise 1). It can be shown theoretically that classifiers designed with a minimum description length principle are guaranteed to converge to the ideal or true model in the limit of more and more data. This is surely a very desirable property. However, such derivations cannot prove that the principle leads to superior performance in the finite data case; to do so would violate the No Free Lunch Theorems. Moreover, in practice it is often difficult to compute the minimum description length, since we may not be clever enough to find the "best" representation (Problem 17). Assume there is some correspondence between a particular classifier and an abstract computer; in such a case it may be quite simple to determine the length of the string y necessary to create the classifier. But since finding the algorithmic complexity demands we find the shortest such string, we must perform a very difficult search through possible programs that could generate the classifier. 
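For the decision-tree application just described, the two terms of Eq. 8 can be made concrete: charge a fixed number of bits per node for the model, and the weighted sum of leaf entropies for the data given the model. A sketch under stated assumptions (the 8 bits per node is an arbitrary illustrative cost, and the leaf counts below are made up):

```python
from math import log2

def leaf_entropy(counts):
    """Entropy in bits per sample of the class counts at one leaf."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def tree_description_length(n_nodes, leaf_counts, bits_per_node=8.0):
    """K(h) + K(D using h) in the spirit of Eq. 8: model bits grow with
    tree size; data bits are the weighted sum of leaf entropies."""
    model_bits = bits_per_node * n_nodes
    data_bits = sum(sum(c) * leaf_entropy(c) for c in leaf_counts)
    return model_bits + data_bits

# A small tree with impure leaves vs. a larger tree with purer leaves;
# candidate prunings can be compared on this single criterion.
small = tree_description_length(3, [(40, 5), (5, 50)])
large = tree_description_length(7, [(38, 0), (2, 5), (3, 0), (2, 50)])
print(small, large)
```

Pruning a subtree lowers the model term while (usually) raising the data term; the MDL principle says to keep whichever tree makes the total smaller.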
The minimum description length principle can be viewed from a Bayesian perspective. Using our current terminology, Bayes' formula states

    P(h|D) = P(h)P(D|h) / P(D)   (9)

for discrete hypotheses and data. The optimal hypothesis h* is the one yielding the highest posterior probability, i.e.,

    h* = arg max_h [P(h)P(D|h)]
       = arg max_h [log2 P(h) + log2 P(D|h)],   (10)

much as we saw in Chap. ??. We note that a string x can be communicated or represented at a cost bounded below by −log2 P(x), as stated in Shannon's optimal coding theorem. Shannon's theorem thus provides a link between the minimum description length (Eq. 8) and the Bayesian approaches (Eq. 10). The minimum description length principle states that simple models (small K(h)) are to be preferred, and thus amounts to a bias toward "simplicity." It is often easier in practice to specify such a prior in terms of a description length than by using functions of distributions (Problem 16). We shall revisit the issue of the tradeoff between simplifying the model and fitting the data in the bias-variance dilemma in Sec. 9.3.

It is found empirically that classifiers designed using the minimum description length principle work well in many problems. As mentioned, the principle is effectively a method for biasing priors over models toward "simple" models. The reasons for the many empirical successes of the principle are not trivial, as we shall see in Sect. 9.2.5. One of the greatest benefits of the principle is that it provides a computationally clear approach to balancing model complexity and the fit of the data. In somewhat more heuristic methods, such as pruning neural networks, it is difficult to compare the algorithmic complexity of the network (e.g., number of units or weights) with the entropy of the data with respect to that model.
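The equivalence between Eq. 10 and minimizing description lengths can be checked on a toy example: maximizing P(h)P(D|h) is the same as minimizing the total Shannon code length −log2 P(h) − log2 P(D|h). The priors and likelihoods below are invented numbers for illustration only:

```python
from math import log2

# Hypothetical priors and likelihoods for three candidate hypotheses
prior = {"h1": 0.60, "h2": 0.30, "h3": 0.10}
likelihood = {"h1": 0.020, "h2": 0.050, "h3": 0.200}

def total_bits(h):
    """-log2 P(h) - log2 P(D|h): model bits plus data-given-model bits."""
    return -log2(prior[h]) - log2(likelihood[h])

map_h = max(prior, key=lambda h: prior[h] * likelihood[h])
mdl_h = min(prior, key=total_bits)
assert map_h == mdl_h    # the two criteria pick the same hypothesis
print(map_h)             # -> h3
```

Note that the winning hypothesis here has the smallest prior: a short enough description of the data given the model can outweigh a longer model description, exactly as Eq. 8 allows.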
9.2.5 Overfitting avoidance and Occam's razor

Throughout our discussions of pattern classifiers, we have mentioned the need to avoid overfitting by means of regularization, pruning, inclusion of penalty terms, minimizing a description length, and so on. The No Free Lunch results throw such techniques into question. If there are no problem-independent reasons to prefer one algorithm over another, why is overfitting avoidance nearly universally advocated? For a given training error, why do we generally advocate simple classifiers with fewer features and parameters?

In fact, techniques for avoiding overfitting or minimizing description length are not inherently beneficial; instead, such techniques amount to a preference, or "bias," over the forms or parameters of classifiers. They are beneficial only if they happen to address problems for which they work. It is the match of the learning algorithm to the problem -- not the imposition of overfitting avoidance -- that determines the empirical success. There are problems for which overfitting avoidance actually leads to worse performance. The effects of overfitting avoidance depend upon the choice of representation too; if the feature space is mapped to a new, formally equivalent one, overfitting avoidance has different effects (Computer exercise ??).

In light of the negative results from the No Free Lunch theorems, we might probe more deeply into the frequent empirical "successes" of the minimum description length principle and the more general philosophical principle of Occam's razor. In its original form, Occam's razor stated merely that "entities" (or explanations) should not be multiplied beyond necessity, but it has come to be interpreted in pattern recognition as counselling that one should not use classifiers that are more complicated than necessary, where "necessary" is determined by the quality of fit to the training data. Given the respective requisite assumptions, the No Free Lunch theorem proves that there is no benefit in "simple" classifiers (or "complex" ones, for that matter) -- simple classifiers claim neither unique nor universal validity.

The frequent empirical "successes" of Occam's razor imply that the classes of problems addressed so far have certain properties. What might be the reason we explore problems that tend to favor simpler classifiers? A reasonable hypothesis is that through evolution, we have had strong selection pressure on our pattern recognition apparatuses to be computationally simple -- to require fewer neurons, less time, and so forth -- and in general such classifiers tend to be "simple." We are more likely to ignore problems for which Occam's razor does not hold. Analogously, researchers naturally develop simple algorithms before more complex ones, as for instance in the progression from the Perceptron, to multilayer neural networks, to networks with pruning, to networks with topology learning, to hybrid neural net/rule-based methods, and so on -- each more complex than its predecessor. Each method is found to work on some problems, but not on ones that are "too complex." For instance, the basic Perceptron is inadequate for optical character recognition; a simple three-layer neural network is inadequate for speaker-independent speech recognition. Hence our design methodology itself imposes a bias toward "simple" classifiers; we generally stop searching for a design when the classifier is "good enough." This principle of satisficing -- creating an adequate though possibly non-optimal solution -- underlies much of practical pattern recognition as well as human cognition.

Another "justification" for Occam's razor derives from a property we might strongly desire or expect in a learning algorithm. If we assume that adding more training data does not, on average, degrade the generalization accuracy of a classifier, then a version of Occam's razor can in fact be derived.
Note, however, that such a desired property amounts to a non-uniform prior over learning algorithms -- while this property is surely desirable, it is a premise and cannot be "proven." Finally, the No Free Lunch theorem implies that we cannot use training data to create a scheme by which we can with some assurance distinguish new problems for which the classifier will generalize well from new problems for which the classifier will generalize poorly (Problem 8).

9.3 Bias and variance

Given that there is no general best classifier unless the probability over the class of problems is restricted, practitioners must be prepared to explore a number of methods or models when solving any given classification problem. Below we will define two ways to measure the "match" or "alignment" of the learning algorithm to the classification problem: the bias and the variance. The bias measures the accuracy or quality of the match: a high bias implies a poor match. The variance measures the precision or specificity of the match: a high variance implies a weak match. Designers can adjust the bias and variance of classifiers, but the important bias-variance relation shows that the two terms are not independent; in fact, for a given mean-square error, they obey a form of "conservation law." Naturally, though, with prior information or even mere luck, classifiers can be created that have a different mean-square error.

9.3.1 Bias and variance for regression

Bias and variance are most easily understood in the context of regression or curve fitting. Suppose there is a true (but unknown) function F(x) with continuous valued output and noise, and we seek to estimate it based on n samples in a set D generated by F(x). The regression function estimated is denoted g(x; D), and we are interested in the dependence of this approximation on the training set D.
Due to random variations in data selection, for some data sets of finite size this approximation will be excellent, while for other data sets of the same size the approximation will be poor. The natural measure of the effectiveness of the estimator is its mean-square deviation from the desired optimal value. Thus we average over all training sets D of fixed size n and find (Problem 18)

    E_D[(g(x; D) − F(x))²] = (E_D[g(x; D) − F(x)])² + E_D[(g(x; D) − E_D[g(x; D)])²],   (11)

where the first term on the right-hand side is the bias (squared) -- the difference between the expected value and the true (but generally unknown) value -- and the second term is the variance. Thus a low bias means that on average we accurately estimate F from D, and a low variance means that the estimate of F does not change much as the training set varies. Even if an estimator is unbiased (i.e., the bias is zero and its expected value is equal to the true value), there can nevertheless be a large mean-square error arising from a large variance term. Equation 11 shows that the mean-square error can be expressed as the sum of a bias term and a variance term. The bias-variance dilemma or bias-variance trade-off is a general phenomenon: procedures with increased flexibility to adapt to the training data (e.g., those with more free parameters) tend to have lower bias but higher variance. Different classes of regression functions g(x; D) -- linear, quadratic, sum of Gaussians, etc. -- will have different overall errors; nevertheless, Eq. 11 will be obeyed.

Suppose for example that the true, target function F(x) is a cubic polynomial of one variable, with noise, as illustrated in Fig. 9.4. We seek to estimate this function based on a sampled training set D. Column a), at the left, shows a very poor "estimate" g(x) -- a fixed linear function, independent of the training data. For different training sets sampled from F(x) with noise, g(x) is unchanged.
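The decomposition in Eq. 11 can be verified numerically before turning to the figure. The sketch below (assuming a cubic target F(x) = x³ − x and Gaussian noise, both invented for illustration) averages polynomial fits over many size-n training sets and checks that the mean-square error at a test point splits exactly into bias² plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):
    return x**3 - x      # assumed true target; noise is added below

def bias_variance(degree, n_sets=2000, n=6, sigma=0.25, x0=0.5):
    """Estimate the terms of Eq. 11 at test point x0 by averaging
    polynomial fits g(x; D) over many training sets D of size n."""
    preds = np.empty(n_sets)
    for k in range(n_sets):
        x = rng.uniform(-1, 1, n)
        y = F(x) + rng.normal(0.0, sigma, n)
        preds[k] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - F(x0))**2
    variance = preds.var()
    mse = ((preds - F(x0))**2).mean()
    return bias2, variance, mse

for degree in (1, 3):                    # a linear and a cubic model
    b2, v, mse = bias_variance(degree)
    assert abs(mse - (b2 + v)) < 1e-10   # Eq. 11 holds term by term
```

Running this also exhibits the dilemma itself: the linear model shows the larger bias², while the cubic model (matched to the true form) shows low bias but a larger variance from its extra free parameters.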
The histogram of this mean-square error of Eq. 11, shown at the bottom, reveals a spike at a fairly high error; because this estimate is so poor, it has a high bias. Further, the variance of this fixed model g(x) is zero. The model in column b) is also fixed, but happens to be a better estimate of F(x). It too has zero variance, but a lower bias than the poor model in a). Presumably the designer imposed some prior knowledge about F(x) in order to get this improved estimate. The model in column c) is a cubic with trainable coefficients; it would learn F(x) exactly if D contained infinitely many training points. Notice that the fit found for every training set is quite good. Thus the bias is low, as shown in the histogram at the bottom. The model in d) is linear in x, but its slope and intercept are determined from the training data. As such, the model in d) has a lower bias than the models in a) and b).

In sum, for a given target function F(x), if a model has many parameters (generally low bias), it will fit the data well but yield high variance. Conversely, if the model has few parameters (generally high bias), it may not fit the data particularly well, but this fit will not change much for different data sets (low variance). The best way to get low bias and low variance is to have prior information about the target function. We can virtually never get zero bias and zero variance; to do so would mean there is only one learning problem to be solved, in which case the answer is already known.

[Figure 9.4 appears here: columns a)-d) show the models g(x) fixed, g(x) fixed, g(x) = a0 + a1x + a2x² + a3x³ learned, and g(x) = a0 + a1x learned, each fit to data sets D1, D2, D3, with histograms of the error E, marked bias and variance, below each column.] Figure 9.4: The bias-variance dilemma can be illustrated in the domain of regression.
Each column represents a different model, and each row a different set of n = 6 training points, Di, randomly sampled from the true function F(x) with noise. Histograms of the mean-square error E ≡ E_D[(g(x) − F(x))²] of Eq. 11 are shown at the bottom. Column a) shows a very poor model: a linear g(x) whose parameters are held fixed, independent of the training data. This model has high bias and zero variance. Column b) shows a somewhat better model, though it too is held fixed, independent of the training data. It has a lower bias than in a) and the same zero variance. Column c) shows a cubic model, where the parameters are trained to best fit the training samples in a mean-square error sense. This model has low bias and a moderate variance. Column d) shows a linear model that is adjusted to fit each training set; this model has intermediate bias and intermediate variance. [...]

A classifier or learning algorithm is informally called unstable if small changes in the training data lead to significantly different classifiers and relatively "large" changes in accuracy. As we saw in Chap. ??, decision tree classifiers trained by a greedy algorithm can be unstable -- a slight change in the position of a single training point can lead to a radically different tree. In general, bagging improves recognition for unstable classifiers, since it effectively averages over such discontinuities. There are no convincing theoretical derivations or simulation studies showing that bagging will help all stable classifiers, however.

Bagging is our first encounter with multiclassifier systems, where a final overall classifier is based on the outputs of a number of component classifiers. The global decision rule in bagging -- a simple vote among the component classifiers -- is the most elementary method of pooling or integrating the outputs of the component classifiers. We shall consider multiclassifier systems again in Sect. 9.7, with particular attention to forming a single decision rule from the outputs of the component classifiers.
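A minimal bagging sketch in the spirit of this section -- bootstrap replicates of the training set, one component classifier per replicate, and a simple vote. The decision-stump component and the toy Gaussian data are illustrative assumptions, not the book's design:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_stump(X, y):
    """Best single-feature threshold classifier (labels in {0, 1})."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = (sign * (X[:, f] - t) > 0).astype(int)
                acc = (pred == y).mean()
                if best is None or acc > best[0]:
                    best = (acc, (f, t, sign))
    return best[1]

def predict_stump(model, X):
    f, t, sign = model
    return (sign * (X[:, f] - t) > 0).astype(int)

def bagging_predict(X, y, X_test, n_components=25):
    """Train components on bootstrap replicates of (X, y); pool by vote."""
    n = len(X)
    votes = []
    for _ in range(n_components):
        idx = rng.integers(0, n, n)   # bootstrap: draw n with replacement
        votes.append(predict_stump(train_stump(X[idx], y[idx]), X_test))
    return (np.mean(votes, axis=0) > 0.5).astype(int)   # majority vote

# Two Gaussian blobs as a toy two-category problem
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.repeat([0, 1], 40)
print((bagging_predict(X, y, X) == y).mean())  # training accuracy of the vote
```

Because each bootstrap replicate perturbs the data, the vote averages over the discontinuities of the individual components, which is exactly where the text says bagging helps.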
9.5.2 Boosting

The goal of boosting is to improve the accuracy of any given learning algorithm. In boosting we first create a classifier with accuracy on the training set greater than average, and then add new component classifiers to form an ensemble whose joint decision rule has arbitrarily high accuracy on the training set. In such a case we say the classification performance has been "boosted." In overview, the technique trains successive component classifiers with a subset of the training data that is "most informative" given the current set of component classifiers. Classification of a test point x is based on the outputs of the component classifiers, as we shall see.

(In Sect. 9.7 we shall come across other names for component classifiers. For the present purposes we simply note that these are not classifiers of component features, but are instead members of an ensemble of classifiers whose outputs are pooled so as to implement a single classification rule.)

For definiteness, consider creating three component classifiers for a two-category problem through boosting. First we randomly select a set of n1 < n patterns from the full training set D (without replacement); call this set D1. Then we train the first classifier, C1, with D1. Classifier C1 need only be a weak learner, i.e., have accuracy only slightly better than chance. (Of course, this is the minimum requirement; a weak learner could have high accuracy on the training set, in which case the benefit of boosting will be small.) Now we seek a second training set, D2, that is the "most informative" given component classifier C1. Specifically, half of the patterns in D2 should be correctly classified by C1, and half incorrectly classified by C1 (Problem 29). Such an informative set D2 is created as follows: we flip a fair coin. If the coin is heads, we select remaining samples from D and present them, one by one, to C1 until C1 misclassifies a pattern.
We add this misclassified pattern to D2. Next we flip the coin again. If heads, we continue through D to find another pattern misclassified by C1 and add it to D2 as just described; if tails, we find a pattern which C1 classifies correctly. We continue until no more patterns can be added in this manner. Thus half of the patterns in D2 are correctly classified by C1 and half are not; D2 thereby provides information complementary to that represented in C1. Now we train a second component classifier C2 with D2.

Next we seek a third data set, D3, which is not well classified by the combined system C1 and C2. We randomly select a training pattern from those remaining in D, and classify that pattern with C1 and with C2. If C1 and C2 disagree, we add this pattern to the third training set D3; otherwise we ignore the pattern. We continue adding informative patterns to D3 in this way; thus D3 contains those patterns not well represented by the combined decisions of C1 and C2. Finally, we train the last component classifier, C3, with the patterns in D3.

Now consider the use of the ensemble of three trained component classifiers for classifying a test pattern x. Classification is based on the votes of the component classifiers. Specifically, if C1 and C2 agree on the category label of x, we use that label; if they disagree, we use the label given by C3 (Fig. 9.6).

We skipped over a practical detail in the boosting algorithm: how to choose the number of patterns n1 with which to train the first component classifier. We would like the final system to be trained with all patterns in D, of course; moreover, because the final decision is a simple vote among the component classifiers, we would like roughly equal numbers of patterns in each set (i.e., n1 ≈ n2 ≈ n3 ≈ n/3). A reasonable first guess is to set n1 ≈ n/3 and create the three component classifiers.
If the classification problem is very simple, however, component classifier C1 will explain most of the data, and thus n2 (and n3) will be much less than n1, so that not all of the patterns in the training set D will be used. Conversely, if the problem is extremely difficult, then C1 will explain but little of the data, and nearly all the patterns will be informative with respect to C1; thus n2 will be unacceptably large. In practice, then, we may need to run the overall boosting procedure a few times, adjusting n1 in order to use the full training set and, if possible, get roughly equal partitions of the training set. A number of simple heuristics can be used to improve the partitioning of the training set as well (Computer exercise ??). The above boosting procedure can be applied recursively to the component classifiers themselves, giving a 9-component or even 27-component full classifier. In this way, a very low training error can be achieved -- even a vanishing training error if the problem is separable.

AdaBoost

There are a number of variations on basic boosting. The most popular, AdaBoost -- from "adaptive" boosting -- allows the designer to continue adding weak learners until some desired low training error has been achieved. In AdaBoost each training pattern receives a weight which determines its probability of being selected for a training set for an individual component classifier. If a training pattern is accurately classified, then its chance of being used again in a subsequent component classifier is reduced; conversely, if the pattern is not accurately classified, then its chance of being used again is raised. In this way, AdaBoost "focuses in" on the informative or "difficult" patterns. Specifically, we initialize the weights across the training set to be uniform. On each iteration k, we draw a training set at random according to these weights, and train component classifier Ck on the patterns selected. Next we increase the weights of training patterns misclassified by Ck and decrease the weights of the patterns correctly classified by Ck. Patterns chosen according to this new distribution are used to train the next classifier, Ck+1, and the process is iterated. We let the patterns and their labels in D be denoted xi and yi, respectively, and let Wk(i) be the kth (discrete) distribution over all these training samples. The AdaBoost [...]

Figure 9.6: A two-dimensional two-category classification task is shown at the top (n = 27). The middle row shows three component (linear) classifiers Ck trained by the LMS algorithm (Chap. ??), where their training patterns were chosen through the basic boosting procedure (n1 = 15, n2 = 8, n3 = 4). The final classification is given by the voting of the three component classifiers, and yields a nonlinear decision boundary, as shown at the bottom. Given that the component classifiers are weak learners (i.e., each can learn a training set better than chance), the ensemble classifier will have a lower training error on the full training set D than does any single component classifier.

[...] the training error will decrease, as given by Eq. 37. It is often found that the test error decreases in boosted systems as well. In some applications, however, the patterns are unlabeled. We shall return in Chap. ?? to the problem of learning when no labels are available, but here we assume there exists some (possibly costly) way of labeling any pattern. Our current challenge is thus to determine which unlabeled patterns would be most informative (i.e., improve the classifier the most) if they were labeled and used as training patterns.
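The weight-update scheme just described can be sketched compactly. This is a deterministic variant (reweighting rather than resampling) with decision stumps as the weak learners and labels in {−1, +1}; the exponential update is the standard AdaBoost form, and the toy data are invented for illustration:

```python
import numpy as np

def best_stump(X, y, W):
    """Weak learner: single-feature threshold minimizing weighted error."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, f] - t) > 0, 1, -1)
                err = W[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, (f, t, sign), pred)
    return best

def adaboost(X, y, n_rounds=30):
    """Maintain a distribution W_k(i) over training patterns; raise the
    weights of misclassified patterns after each round."""
    n = len(y)
    W = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        err, stump, pred = best_stump(X, y, W)
        if err >= 0.5:                      # no usable weak learner left
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        ensemble.append((alpha, stump))
        W *= np.exp(-alpha * y * pred)      # misclassified patterns get heavier
        W /= W.sum()
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(s * (X[:, f] - t) > 0, 1, -1)
                for a, (f, t, s) in ensemble)
    return np.where(score > 0, 1, -1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.repeat([-1, 1], 40)
ens = adaboost(X, y)
print((predict(ens, X) == y).mean())  # training accuracy of the weighted vote
```

The ensemble decision is a weighted vote, with each component weighted by its alpha, so later components trained on reweighted "difficult" patterns correct the mistakes of earlier ones.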
These are the patterns we will present as a query to an oracle -- a teacher who can label, without error, any pattern. This approach is called variously learning with queries, active learning, or interactive learning, and is a special case of a resampling technique. Learning with queries might be appropriate, for example, when we want to design a classifier for handwritten numerals using unlabeled pixel images scanned from documents from a corpus too large for us to label every pattern. We could start by randomly selecting some patterns, presenting them to an oracle, and then training the classifier with the returned labels. We then use learning with queries to select unlabeled patterns from our set to present to a human (the oracle) for labeling. Informally, we would expect the most valuable patterns to be near the decision boundaries.

More generally, we begin with a preliminary, weak classifier that has been developed with a small set of labeled samples. There are then two related methods for selecting an informative pattern, i.e., a pattern for which the current classifier is least certain. In confidence based query selection the classifier computes discriminant functions gi(x) for the c categories, i = 1, ..., c. An informative pattern x is one for which the two largest discriminant functions have nearly the same value; such patterns lie near the current decision boundaries. Several search heuristics can be used to find such points efficiently (Problem 30). The second method, voting based or committee based query selection, is similar to the previous method but is applicable to multiclassifier systems, that is, ones comprising several component classifiers (Sect. 9.7). Each unlabeled pattern is presented to each of the k component classifiers; the pattern that yields the greatest disagreement among the k resulting category labels is considered the most informative pattern, and is thus presented as a query to the oracle. Voting based query selection can be used even if the component classifiers do not provide analog discriminant functions -- for instance decision trees, rule-based classifiers, or simple k-nearest neighbor classifiers.

In both confidence based and voting based methods, the pattern labeled by the oracle is then used for training the classifier in the traditional way. (We shall return in Sect. 9.7 to training an ensemble of classifiers.) Clearly such learning with queries does not directly exploit information about the prior distribution of the patterns. In particular, in most problems the distribution of query patterns will be concentrated near the final decision boundaries (where patterns are informative) rather than at the regions of highest prior probability (where they are typically less informative), as illustrated in Fig. 9.8. One benefit of learning with queries is that we need not guess the form of the underlying distribution, but can instead use non-parametric techniques, such as nearest-neighbor classification, that allow the decision boundary to be found directly.

If there is not a large set of unlabeled samples available for queries, we can nevertheless exploit learning with queries if there is a way to generate query patterns. Suppose we have only a small set of labeled handwritten characters. Suppose too that we have image processing algorithms for altering these images to generate new, surrogate patterns for queries to an oracle. For instance, the pixel images might be rotated, scaled, sheared, subjected to random pixel noise, or have their lines thinned. Further, we might be able to generate new patterns "in between" two labeled patterns by interpolating or somehow mixing them in a domain-specific way.
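Confidence based query selection is straightforward to sketch: score each unlabeled pattern by the gap between its two largest discriminant values gi(x), and query the pattern with the smallest gap. The linear discriminants and the pool of points below are invented for illustration:

```python
import numpy as np

def most_informative(X_unlabeled, discriminants):
    """Confidence based query selection: return the index of the
    unlabeled pattern whose two largest discriminant values are
    closest -- i.e., the pattern nearest the current boundary."""
    G = np.array([[g(x) for g in discriminants] for x in X_unlabeled])
    G.sort(axis=1)
    margins = G[:, -1] - G[:, -2]   # gap between the top two categories
    return int(np.argmin(margins))

# Toy two-category example with linear discriminants g_i(x) = w_i.x + b_i
g1 = lambda x: 1.0 * x[0] + 0.0
g2 = lambda x: -1.0 * x[0] + 0.2
pool = np.array([[2.0, 0.0], [0.1, 0.0], [-1.5, 0.0]])
print(most_informative(pool, [g1, g2]))  # -> 1, the point nearest g1 = g2
```

The same scaffold gives voting based selection by replacing the margin with a count of disagreements among the component classifiers' labels.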
With such generated query patterns the classifier can explore regions of the feature space about which it is least confident (Fig. 9.8).

9.5.4 Arcing, learning with queries, bias and variance

In Chap. ?? and many other places, we have stressed the need for training a classifier on samples drawn from the distribution on which it will be tested. Resampling in general, and learning with queries in particular, seem to violate this recommendation. Why can a classifier trained on a strongly weighted distribution of data be expected to do as well -- or better! -- than one trained on the i.i.d. sample? Why doesn't resampling lead to worse performance, to the extent that the resampled distribution differs from the i.i.d. one? Indeed, if we were to take a model of the true distribution and train it with a highly skewed distribution obtained by learning with queries, the final classifier accuracy might be unacceptably low.

Consider, however, two interrelated points about resampling methods and altered distributions. The first is that resampling methods are generally used with techniques that do not attempt to model or fit the full category distributions. Thus even if we suspect the prior distributions for two categories are Gaussian, we might use a non-parametric method such as nearest neighbor, radial basis function, or RCE classifiers when using learning with queries. Thus in learning with queries we are not fitting parameters in a model, as described in Chap. ??, but instead are seeking decision boundaries more directly. The second point is that as the number of component classifiers is increased, techniques such as general boosting and AdaBoost effectively broaden the class of implementable functions, as illustrated in Fig. 9.6. While the final classifier might indeed be characterized as parametric, it is in an expanded space of parameters, one larger than that of the first component classifier.

Figure 9.8: Active learning can be used to create classifiers that are more accurate than ones using i.i.d. sampling. The figure at the top shows a two-dimensional problem with two equal circular Gaussian priors; the Bayes decision boundary is a straight line and the Bayes error is EB = 0.02275. The bottom figure on the left shows a nearest-neighbor classifier trained with n = 30 labeled points sampled i.i.d. from the true distributions; note that most of these points are far from the decision boundary. The figure at the right illustrates active learning: the first four points were sampled near the extremes of the feature space, and subsequent query points were chosen midway between two points already used by the classifier, one randomly selected from each of the two categories. In this way, successive queries to the oracle "focused in" on the true decision boundary. The final generalization error of this classifier (0.02422) is lower than that of the one trained using i.i.d. samples (0.05001).

In broad overview, resampling, boosting, and related procedures are heuristic methods for adjusting the class of implementable decision functions. As such, they allow the designer to try to "match" the final classifier to the problem by indirectly adjusting the bias and variance. The power of these methods is that they can be used with an arbitrary classification technique, such as the Perceptron, which would otherwise prove extremely difficult to adjust to the complexity of an arbitrary problem.

9.6 Estimating and comparing classifiers

There are at least two reasons for wanting to know the generalization rate of a classifier on a given problem. One is to see if the classifier performs well enough to be useful; another is to compare its performance with that of a competing design.
Estimating the final generalization performance invariably requires making assumptions about the classifier or the problem or both, and can fail if the assumptions are not valid. We should stress, then, that all the following methods are heuristic. Indeed, if there were a foolproof method for choosing which of two classifiers would generalize better on an arbitrary new problem, we could incorporate such a method into the learning and violate the No Free Lunch Theorem. Occasionally our assumptions are explicit (as in parametric models), but more often than not they are implicit and difficult to identify or relate to the final estimation (as in empirical methods).

9.6.1 Parametric models

One approach to estimating the generalization rate is to compute it from the assumed parametric model. For example, in the two-class multivariate normal case, we might estimate the probability of error using the Bhattacharyya or Chernoff bounds (Chap. ??), substituting estimates of the means and the covariance matrix for the unknown parameters. However, there are three problems with this approach. First, such an error estimate is often overly optimistic; characteristics that make the training samples peculiar or unrepresentative will not be revealed. Second, we should always suspect the validity of an assumed parametric model; a performance evaluation based on the same model cannot be believed unless the evaluation is unfavorable. Finally, in more general situations where the distributions are not simple, it is very difficult to compute the error rate exactly, even if the probabilistic structure is known completely.

9.6.2 Cross validation

In cross validation we randomly split the set of labeled training samples D into two parts: one is used as the traditional training set for adjusting model parameters in the classifier. The other set -- the validation set -- is used to estimate the generalization error.
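A minimal sketch of the split itself follows; the 90/10 ratio is one common default rather than a requirement, and the shuffling step is an assumption made for illustration:

```python
import random

# Holdout form of cross validation: D is split into a training part and a
# validation part, and the validation error stands in for generalization error.
def holdout_split(data, validation_fraction=0.1, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]                     # copy; never split in file order
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    # (training set, validation set) -- disjoint by construction.
    return shuffled[n_val:], shuffled[:n_val]

train, val = holdout_split(list(range(100)))
# Disjointness matters: evaluating on training points is the methodological
# error known as "testing on the training set."
assert not set(train) & set(val)
```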
Since our ultimate goal is low generalization error, we train the classifier until we reach a minimum of this validation error, as sketched in Fig. 9.9. It is essential that the validation (or the test) set not include points used for training the parameters in the classifier -- a methodological error known as "testing on the training set."

Figure 9.9: In cross validation, the data set D is split into two parts. The first (e.g., 90% of the patterns) is used as a standard training set for setting free parameters in the classifier model; the other (e.g., 10%) is the validation set and is meant to represent the full generalization task. For most problems, the training error decreases monotonically during training, as shown in black. Typically, the error on the validation set decreases, but then increases, an indication that the classifier may be overfitting the training data. In cross validation, training or parameter adjustment is stopped at the first minimum of the validation error.

A related but less obvious problem arises when a classifier undergoes a long series of refinements guided by the results of repeated testing on the same test data. This form of "training on the test data" often escapes attention until new test samples are obtained.

Cross validation can be applied to virtually every classification method, where the specific form of learning or parameter adjustment depends upon the general training method. For example, in neural networks of a fixed topology (Chap. ??), the amount of training is the number of epochs or presentations of the training set. Alternatively, the number of hidden units can be set via cross validation. Likewise, the width of the Gaussian window in Parzen windows (Chap.
??), and an optimal value of k in the k-nearest-neighbor classifier (Chap. ??) can be set by cross validation. Cross validation is heuristic and need not (indeed cannot) give improved classifiers in every case. Nevertheless, it is extremely simple and for many real-world problems is found to improve generalization accuracy.

There are several heuristics for choosing the portion γ of D to be used as a validation set (0 < γ < 1). Nearly always, a smaller portion of the data should be used as the validation set (γ < 0.5), because the validation set is used merely to set a single global property of the classifier (i.e., when to stop adjusting parameters) rather than the large number of classifier parameters learned using the training set. If a classifier has a large number of free parameters or degrees of freedom, then a larger portion of D should be used as a training set, i.e., γ should be reduced. A traditional default is to split the data with γ = 0.1, which has proven effective in many applications. Finally, when the number of degrees of freedom in the classifier is small compared to the number of training points, the predicted generalization error is relatively insensitive to the choice of γ.

A simple generalization of the above method is m-fold cross validation. Here the training set is randomly divided into m disjoint sets of equal size n/m, where n is again the total number of patterns in D. The classifier is trained m times, each time with a different set held out as a validation set. The estimated performance is the mean of these m errors. In the limit where m = n, the method is in effect the leave-one-out approach to be discussed in Sect. 9.6.3.

We emphasize that cross validation is a heuristic and need not work on every problem. Indeed, there are problems for which anti-cross validation is effective -- halting the adjustment of parameters when the validation error reaches its first local maximum.
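The m-fold procedure can be sketched as follows; `train_fn` and `error_fn` are hypothetical stand-ins for whatever training and evaluation steps the chosen classifier requires, and the toy "classifier" at the bottom is an illustration only:

```python
def m_fold_estimate(data, m, train_fn, error_fn):
    """Split data into m disjoint folds; train m times, each time holding one
    fold out as the validation set; return the mean of the m validation errors.
    With m == n this becomes the leave-one-out (jackknife) estimate of Sect. 9.6.3."""
    folds = [data[i::m] for i in range(m)]          # m disjoint subsets
    errors = []
    for i in range(m):
        held_out = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train_fn(training)
        errors.append(error_fn(model, held_out))
    return sum(errors) / m

# Toy use: the "classifier" simply predicts the rounded mean label of its
# training points; patterns are (feature, label) pairs.
data = [(x, int(x > 5)) for x in range(10)]
train_fn = lambda pts: round(sum(y for _, y in pts) / len(pts))
error_fn = lambda model, pts: sum(model != y for _, y in pts) / len(pts)
err = m_fold_estimate(data, m=5, train_fn=train_fn, error_fn=error_fn)
```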
As such, in any particular problem designers must be prepared to explore different values of γ, and possibly abandon the use of cross validation altogether if performance cannot be improved (Computer exercise 5).

Cross validation is, at base, an empirical approach that tests the classifier experimentally. Once we train a classifier using cross validation, the validation error gives an estimate of the accuracy of the final classifier on the unknown test set. If the true but unknown error rate of the classifier is p, and if k of the n' independent, randomly drawn test samples are misclassified, then k has the binomial distribution

P(k) = C(n', k) p^k (1 - p)^(n' - k).    (38)

Thus, the fraction of test samples misclassified is exactly the maximum-likelihood estimate for p (Problem 39):

p̂ = k/n'.    (39)

The properties of this estimate for the parameter p of a binomial distribution are well known. In particular, Fig. 9.10 shows 95% confidence intervals as a function of p̂ and n'. For a given value of p̂, the probability is 0.95 that the true value of p lies in the interval between the lower and upper curves marked by the number n' of test samples (Problem 36). These curves show that unless n' is fairly large, the maximum-likelihood estimate must be interpreted with caution. For example, if no errors are made on 50 test samples, with probability 0.95 the true error rate is between zero and 8%. The classifier would have to make no errors on more than 250 test samples to be reasonably sure that the true error rate is below 2%.

Figure 9.10: The 95% confidence intervals for a given estimated error probability p̂ can be derived from the binomial distribution of Eq. 38. For each value of p̂, the true probability has a 95% chance of lying between the curves marked by the number of test samples n'.
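Such bounds can be computed directly from the binomial distribution of Eq. 38. The sketch below inverts the binomial tail numerically; note that it computes a one-sided 95% upper bound, a simplification of the two-sided intervals plotted in the figure, so its numbers are slightly tighter than the figure's:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(K <= k) for K ~ Binomial(n, p), as in Eq. 38."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def upper_bound(k, n, alpha=0.05, tol=1e-6):
    """Largest error rate p still consistent (at level alpha) with having
    observed only k misclassifications among n test samples; found by
    bisection, since the tail probability decreases monotonically in p."""
    lo, hi = k / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) >= alpha:
            lo = mid        # mid is still plausible; push the bound up
        else:
            hi = mid
    return lo

# No errors on 50 test samples still admits a true error rate of several percent;
# only with roughly 250 error-free samples does the bound drop below 2%.
p_max = upper_bound(0, 50)
```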
The larger the number of test samples, the more precise the estimate of the true probability, and hence the smaller the 95% confidence interval.

9.6.3 Jackknife and bootstrap estimation of classification accuracy

A method for comparing classifiers closely related to cross validation is to use the jackknife or bootstrap estimation procedures (Sects. 9.4.1 & 9.4.2). The application of the jackknife approach to classification is straightforward. We estimate the accuracy of a given algorithm by training the classifier n separate times, each time using the training set D from which a different single training point has been deleted. This is merely the m = n limit of m-fold cross validation. Each resulting classifier is tested on the single deleted point, and the jackknife estimate of the accuracy is then simply the mean of these leave-one-out accuracies. Here the computational complexity may be very high, especially for large n (Problem 28). The jackknife generally gives good estimates, since each of the n classifiers is quite similar to the classifier being tested (differing solely by a single training point). Likewise, the jackknife estimate of the variance of this estimate is given by a simple generalization of Eq. 32.

A particular benefit of the jackknife approach is that it can provide measures of confidence or statistical significance in the comparison between two classifier designs. Suppose trained classifier C1 has an accuracy of 80% while C2 has an accuracy of 85%, as estimated by the jackknife procedure. Is C2 really better than C1? To answer this, we calculate the jackknife estimate of the variance of the classification accuracies and use traditional hypothesis testing to see if C2's apparent superiority is statistically significant (Fig. 9.11).

Figure 9.11: Jackknife estimation can be used to compare the accuracies of classifiers. The jackknife estimates of the accuracies of classifiers C1 and C2 are 80% and 85%, and the full widths (twice the square root of the jackknife estimate of the variance) are 12% and 15%, as shown by the bars at the bottom. In this case, traditional hypothesis testing could show that the difference is not statistically significant at some confidence level.

There are several ways to generalize the bootstrap method to the problem of estimating the accuracy of a classifier. One of the simplest approaches is to train B classifiers, each with a different bootstrap data set, and test on other bootstrap data sets. The bootstrap estimate of the classifier accuracy is simply the mean of these bootstrap accuracies. In practice, the high computational complexity of bootstrap estimation of classifier accuracy is rarely worth possible improvements in that estimate. In Sect. 9.5.1 we discussed bagging, a useful modification of bootstrap estimation.

9.6.4 Maximum-likelihood model comparison

Recall first the maximum-likelihood parameter estimation methods discussed in Chap. ??. Given a model with unknown parameter vector θ, we find the value θ̂ which maximizes the probability of the training data, i.e., p(D|θ̂). Maximum-likelihood model comparison or maximum-likelihood model selection -- sometimes called ML-II -- is a direct generalization of those techniques. The goal here is to choose the model that best explains the training data, in a way that will become clear below. We again let hi ∈ H represent a candidate hypothesis or model (assumed discrete for simplicity), and D the training data. The posterior probability of any given model is given by Bayes' rule:

P(hi|D) = P(D|hi)P(hi)/p(D) ∝ P(D|hi)P(hi),    (40)

where we will rarely need the normalizing factor p(D).
The data-dependent term, P(D|hi), is the evidence for hi; the second term, P(hi), is our subjective prior over the space of hypotheses -- it rates our confidence in different models even before the data arrive. In practice, the data-dependent term dominates in Eq. 40, and hence the priors P(hi) are often neglected in the computation. In maximum-likelihood model comparison, we find the maximum-likelihood parameters for each of the candidate models, calculate the resulting likelihoods, and select the model with the largest such likelihood in Eq. 40 (Fig. 9.12).

Figure 9.12: The evidence (i.e., the probability of generating different data sets given a model) is shown for three models of different expressive power or complexity. Model h1 is the most expressive, since with different values of its parameters the model can fit a wide range of data sets. Model h3 is the most restrictive of the three. If the actual data observed is D0, then maximum-likelihood model selection states that we should choose h2, which has the highest evidence. Model h2 "matches" this particular data set better than do the other two models, and should be selected.

9.6.5 Bayesian model comparison

Bayesian model comparison uses the full information over priors when computing posterior probabilities in Eq. 40. In particular, the evidence for a particular hypothesis is an integral,

P(D|hi) = ∫ p(D|θ, hi) p(θ|hi) dθ,    (41)

where as before θ describes the parameters in the candidate model. It is common for the posterior p(θ|D, hi) to be peaked at θ̂, and thus the evidence integral can often be approximated as

P(D|hi) ≈ P(D|θ̂, hi) × p(θ̂|hi) Δθ,    (42)

where the first factor is the best-fit likelihood and the second is the Occam factor. Before the data arrive, model hi has some broad range of model parameters, denoted Δθ⁰ and shown in Fig. 9.13. After the data arrive, a smaller range is commensurate or compatible with D, denoted Δθ. The Occam factor in Eq.
42,

Occam factor = p(θ̂|hi) Δθ = Δθ/Δθ⁰ = (param. vol. commensurate with D)/(param. vol. commensurate with any data),    (43)

is the ratio of two volumes in parameter space: 1) the volume that can account for the data D and 2) the prior volume, accessible to the model without regard to D. The Occam factor has magnitude less than 1.0; it is simply the factor by which the hypothesis space collapses by the presence of data. The more the training data, the smaller the range of parameters that are commensurate with it, and thus the greater this collapse in the parameter space and the smaller the Occam factor (Fig. 9.13).

Figure 9.13: In the absence of training data, a particular model h has available a large range of possible values of its parameters, denoted Δθ⁰. In the presence of a particular training set D, a smaller range Δθ is available. The Occam factor, Δθ/Δθ⁰, measures the fractional decrease in the volume of the model's parameter space due to the presence of training data D. In practice, the Occam factor can be calculated fairly easily if the evidence is approximated as a k-dimensional Gaussian, centered on the maximum-likelihood value θ̂.

Naturally, once the posteriors for different models have been calculated by Eqs. 42 & 40, we select the single one having the highest such posterior. (Ironically, the Bayesian model selection procedure is itself not truly Bayesian, since a Bayesian procedure would average over all possible models when making a decision.) The evidence for hi, i.e., P(D|hi), was ignored in the maximum-likelihood setting of the parameters θ̂; nevertheless it is the central term in our comparison of models. As mentioned, in practice the evidence term in Eq. 40 dominates the prior term, and it is traditional to ignore such priors, which are often highly subjective or problematic anyway (Problem 38, Computer exercise 7).
This procedure represents an inherent bias towards simple models (small Δθ⁰); models that are overly complex (large Δθ⁰) are automatically self-penalizing, where "overly complex" is a data-dependent concept. In the general case, the full integral of Eq. 41 is too difficult to calculate analytically or even numerically. Nevertheless, if θ is k-dimensional and the posterior can be assumed to be a Gaussian, then the Occam factor can be calculated directly (Problem 37), yielding

P(D|hi) ≈ P(D|θ̂, hi) × p(θ̂|hi) (2π)^(k/2) |H|^(-1/2),    (44)

where again the first factor is the best-fit likelihood and the second is the Occam factor, and

H = -∂² ln p(θ|D, hi)/∂θ²    (45)

is a Hessian matrix -- a matrix of second-order derivatives -- that measures how "peaked" the posterior is around the value θ̂. Note that this Gaussian approximation does not rely on whether or not the underlying model of the distribution of the data in feature space is Gaussian. Rather, it is based on the assumption that the evidence distribution arises from a large number of independent uncorrelated processes and is governed by the Law of Large Numbers. The integration inherent in Bayesian methods is simplified using this Gaussian approximation to the evidence. Since calculating the needed Hessian via differentiation is nearly always simpler than a high-dimensional numerical integration, the Bayesian method of model selection is not at a severe computational disadvantage relative to its maximum-likelihood counterpart.

There may be a problem due to degeneracies in a model -- several parameters could be relabeled and leave the classification rule (and hence the likelihood) unchanged. The resulting degeneracy leads, in essence, to an "overcounting" which alters the effective volume in parameter space. Degeneracies are especially common in neural network models, where the parameterization comprises many equivalent weights (Chap. ??). For such cases, we must multiply the right-hand side of Eq.
42 by the degeneracy of θ̂ in order to scale the Occam factor, and thereby obtain the proper estimate of the evidence (Problem 42).

Bayesian model selection and the No Free Lunch Theorem

There seems to be a fundamental contradiction between two of the deepest ideas in the foundations of statistical pattern recognition. On the one hand, the No Free Lunch Theorem states that in the absence of prior information about the problem, there is no reason to prefer one classification algorithm over another. On the other hand, Bayesian model selection is theoretically well founded and seems to show how to reliably choose the better of two algorithms. Consider two "composite" algorithms -- algorithm A and algorithm B -- each of which employs two others (algorithm 1 and algorithm 2). For any problem, algorithm A uses Bayesian model selection and applies the "better" of algorithm 1 and algorithm 2. Algorithm B uses anti-Bayesian model selection and applies the "worse" of algorithm 1 and algorithm 2. It appears that algorithm A will reliably outperform algorithm B throughout the full class of problems -- in contradiction with Part 1 of the No Free Lunch Theorem. What is the resolution of this apparent contradiction?

In Bayesian model selection we ignore the prior over the space of models, H, effectively assuming it is uniform. This assumption therefore does not take into account how those models correspond to underlying target functions, i.e., mappings from input to category labels. Accordingly, Bayesian model selection usually corresponds to a non-uniform prior over target functions. Moreover, depending on the arbitrary choice of model, the precise non-uniform prior will vary. In fact, this arbitrariness is well known in statistics, and good practitioners rarely apply the principle of indifference -- assuming a uniform prior over models -- as Bayesian model selection requires.
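The flavor of evidence-based model comparison can be seen in a toy case where the integral of Eq. 41 is exact rather than requiring the Gaussian approximation. The coin-flip models below are an assumption made for illustration and do not appear in the text:

```python
from math import comb, factorial

# Compare a zero-parameter model h1 (fair coin) with a one-parameter model h2
# (unknown bias p, uniform prior over [0, 1]) given k heads in n flips.
def evidence_fair(k, n):
    """P(D|h1): no parameters, so no integral -- just the likelihood at p = 0.5."""
    return comb(n, k) * 0.5**n

def evidence_free_bias(k, n):
    """P(D|h2) = C(n,k) * Integral_0^1 p^k (1-p)^(n-k) dp, a Beta integral
    with the closed form C(n,k) * k!(n-k)!/(n+1)!."""
    return comb(n, k) * factorial(k) * factorial(n - k) / factorial(n + 1)

# Near-balanced data: the simpler model has the higher evidence -- the flexible
# model's Occam factor penalizes its unused parameter range.
balanced = evidence_fair(5, 10) > evidence_free_bias(5, 10)
# Strongly skewed data: the flexible model wins despite its Occam penalty.
skewed = evidence_free_bias(9, 10) > evidence_fair(9, 10)
```

This mirrors the discussion above: averaging the likelihood over the full parameter range (rather than taking its maximum) automatically penalizes models whose flexibility the data do not warrant.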
Indeed, there are many "paradoxes" described in the statistics literature that arise from not being careful to have the prior over models be tailored to the choice of models (Problem 38). The No Free Lunch Theorem allows that for some particular non-uniform prior there may be a learning algorithm that gives better-than-chance -- or even optimal -- results. Apparently Bayesian model selection corresponds to non-uniform priors that seem to match many important real-world problems.

9.6.6 The problem-average error rate

The examples we have given thus far suggest that the problem with having only a small number of samples is that the resulting classifier will not perform well on new data -- it will not generalize well. Thus, we expect the error rate to be a function of the number n of training samples, typically decreasing to some minimum value as n grows.

To find the most promising model quickly and efficiently, we need then only train this model fully. One method is to use a classifier's performance on a relatively small training set to predict its performance on the ultimate large training set. Such performance is revealed in a type of learning curve in which the test error is plotted versus the size of the training set. Figure 9.15 shows the error rate on an independent test set after the classifier has been fully trained on n' ≤ n points in the training set. (Note that in this form of learning curve the training error decreases monotonically and does not show the "overtraining" evident in curves such as Fig. 9.9.) For many real-world problems, such learning curves decay monotonically and can be adequately described by a power-law function of the form

Etest = a + b/n'^α,    (49)
Figure 9.15: The test error for three classifiers, each fully trained on the given number n' of training patterns, decreases in a typical monotonic power-law fashion. Notice that the rank order of the classifiers trained on n' = 500 points differs from that for n' = 10000 points and the asymptotic case.

where a, b and α ≥ 1 depend upon the task and the classifier. In the limit of very large n', the training error equals the test error, since both the training and test sets represent the full problem space. Thus we also model the training error as a power-law function having the same asymptotic error,

Etrain = a - c/n'^β.    (50)

If the classifier is sufficiently powerful, this asymptotic error, a, is equal to the Bayes error. Furthermore, such a powerful classifier can learn small training sets perfectly, and thus the training error (measured on the n' points) will vanish at small n', as shown in Fig. 9.16.

Figure 9.16: Test and training error of a classifier fully trained on data subsets of different size n' selected randomly from the full set D. At low n', the classifier can learn the category labels of the points perfectly, and thus the training error vanishes there. In the limit n' → ∞, both training and test errors approach the same asymptotic value, a. If the classifier is sufficiently powerful and the training data are sampled i.i.d., then a is the Bayes error rate, EB.

Now we seek to estimate the asymptotic error, a, from the training and test errors on small and intermediate size training sets. From Eqs. 49 & 50 we find:

Etest + Etrain = 2a + b/n'^α - c/n'^β
Etest - Etrain = b/n'^α + c/n'^β.    (51)

If we make the assumptions α = β and b = c, then Eq. 51 reduces to

Etest + Etrain = 2a
Etest - Etrain = 2b/n'^α.    (52)

Given this assumption, it is a simple matter to measure the training and test errors for small and intermediate values of n', plot them on a log-log scale, and estimate a, as shown in Fig. 9.17. Even if the approximations α = β and b = c do not hold in practice, the difference Etest - Etrain nevertheless still forms a straight line on a log-log plot, and the sum s = b + c can be found from the height of the log[Etest + Etrain] curve. The weighted sum cEtest + bEtrain will be a straight line for some empirically set values of b and c, constrained to obey b + c = s, enabling a to be estimated (Problem 41). Once a has been estimated for each classifier in the set of candidates, the one with the lowest a is chosen and then trained on the full training set D.

Figure 9.17: If the test and training errors versus training set size obey the power-law functions of Eqs. 49 & 50, then the log of the sum and the log of the difference of these errors are straight lines on a log-log plot. The estimate of the asymptotic error rate a is then simply related to the height of the log[Etest + Etrain] line, as shown.

9.6.8 The capacity of a separating plane

Consider the partitioning of a d-dimensional feature space

For instance, we might have four component classifiers -- a k-nearest-neighbor classifier, a decision tree, a neural network, and a rule-based system -- all addressing the same problem. While a neural network would provide analog values for each of the c categories, the rule-based system would give only a single category label (i.e., a one-of-c representation), and the k-nearest-neighbor classifier would give only the rank order of the categories. In order to integrate the information from the component classifiers we must convert their outputs into discriminant values obeying the constraint of Eq. 55 so we can use the framework of Fig. 9.19.
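Two such conversions can be sketched directly; the `softmax` and `from_rank_order` routines below are illustrative stand-ins, not code from the text:

```python
from math import exp

# Convert heterogeneous component-classifier outputs into normalized
# discriminant values (non-negative, summing to 1) so they can be pooled.
def softmax(analog):
    """Analog outputs -> discriminants via the softmax transformation."""
    exps = [exp(g) for g in analog]
    total = sum(exps)
    return [e / total for e in exps]

def from_rank_order(ranks):
    """Rank-order list (1 = best of c categories) -> discriminants linearly
    proportional to rank, normalized to sum to 1."""
    c = len(ranks)
    scores = [c + 1 - r for r in ranks]     # best rank gets the largest score
    total = sum(scores)                     # = c(c+1)/2, e.g. 21 for c = 6
    return [s / total for s in scores]

g = softmax([0.4, 0.6, 0.9, 0.3, 0.2, 0.1])
r = from_rank_order([3, 6, 5, 1, 2, 4])    # yields 4/21, 1/21, 2/21, 6/21, ...
```

Both routines produce discriminants that sum to 1, so outputs of very different kinds become directly comparable before pooling.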
The simplest heuristics to this end are the following:

Analog. If the outputs of a component classifier are analog values g̃i, we can use the softmax transformation,

gi = e^(g̃i) / Σ_{j=1}^{c} e^(g̃j),    (60)

to convert them to values gi.

Rank order. If the output is a rank order list, we assume the discriminant function is linearly proportional to the rank order of the item on the list. Of course, the resulting gi should then be properly normalized, and thus sum to 1.0.

One-of-c. If the output is a one-of-c representation, in which a single category is identified, we let gj = 1 for the j corresponding to the chosen category, and 0 otherwise.

The table gives a simple illustration of these heuristics.

          Analog     |      Rank order       |   One-of-c
    g̃i        gi     |   g̃i       gi         |  g̃i     gi
    0.4      0.158   |   3rd     4/21 = 0.190 |  0      0
    0.6      0.193   |   6th     1/21 = 0.048 |  1      1.0
    0.9      0.260   |   5th     2/21 = 0.095 |  0      0
    0.3      0.143   |   1st     6/21 = 0.286 |  0      0
    0.2      0.129   |   2nd     5/21 = 0.238 |  0      0
    0.1      0.117   |   4th     3/21 = 0.143 |  0      0

Once the outputs of the component classifiers have been converted to effective discriminant functions in this way, the component classifiers are themselves held fixed, but the gating network is trained as described in Eq. 59. This method is particularly useful when several highly trained component classifiers are pooled to form a single decision.

Summary

The No Free Lunch Theorem states that in the absence of prior information about the problem there is no reason to prefer one learning algorithm or classifier model over another. Given that a finite set of feature values is used to distinguish the patterns under consideration, the Ugly Duckling Theorem states that the number of predicates shared by any two different patterns is constant, and does not depend upon the choice of the two objects.
Together, these theorems highlight the need for insight into proper features and for matching the algorithm to the data distribution -- there is no problem-independent "best" learning or pattern recognition system nor feature representation. In short, formal theory and algorithms taken alone are not enough; pattern classification is an empirical subject.

Two ways to describe the match between classifier and problem are the bias and the variance. The bias measures the accuracy or quality of the match (high bias implies a poor match) and the variance measures the precision or specificity of the match (a high variance implies a weak match). The bias-variance dilemma states that learning procedures with increased flexibility to adapt to the training data (e.g., with more free parameters) tend to have lower bias but higher variance. In classification there is a non-linear relationship between bias and variance, and low variance tends to be more important for classification than low bias. If classifier models can be expressed as binary strings, the minimum description length principle states that the best model is the one with the minimum sum of the lengths of such a model description and of the training data encoded with respect to that model. This general principle can be extended to cover model-specific heuristics such as weight decay and pruning in neural networks, regularization in specific models, and so on.

The basic insight underlying resampling techniques -- such as the bootstrap, jackknife, boosting, and bagging -- is that multiple data sets selected from a given data set enable the values and ranges of arbitrary statistics to be computed.

Bibliographical and Historical Remarks

for an adequate but not necessarily the optimal solution [87]. An empirical study showing that simple classifiers often work well can be found in [45]. The basic bias-variance decomposition and bias-variance dilemma [37] in regression appear in many statistics books [41, 16]. Geman et al.
give a very clear presentation in the context of neural networks, but their discussion of classification is only indirectly related to their mathematical derivations for regression [35]. Our presentation for classification (zero-one loss) is based on Friedman's important paper [32]; the bias-variance decomposition has been explored for other non-quadratic cost functions as well [42]. Quenouille introduced the term jackknife in 1956 [76]. The theoretical foundations of resampling techniques are presented in Efron's clear book [28], and practical guides to their use include [36, 25]. Papers on bootstrap techniques for error estimation include [48]. Breiman has been particularly active in introducing and exploring resampling methods for estimation and classifier design, such as bagging [11] and general arcing [13]. AdaBoost [31] builds upon Schapire's analysis of the strength of weak learnability [82] and Freund's early work in the theory of learning [30]. Boosting in multicategory problems is a bit more subtle than in the two-category problems we discussed [83]. Angluin's early work on queries for concept learning [3] was generalized to active learning by Cohn and many others [18, 20] and is fundamental to some efforts in collecting large databases [93, 95, 94, 99]. Cross validation was introduced by Cover [23], and has been used extensively in conjunction with classification methods such as neural networks. Estimates of error under different conditions include [34, 110, 103], and an excellent paper which derives the size of the test set needed for accurate estimation of classification accuracy is [39]. Bowyer and Phillips's book covers empirical evaluation techniques in computer vision [10], many of which apply to more general classification domains. The roots of maximum-likelihood model selection stem from Bayes himself, but one of the earlier technical presentations is [38].
Interest in Bayesian model selection was revived in a series of papers by MacKay, whose primary interest was in applying the method to neural networks and interpolation [66, 69, 68, 67]. These model selection methods have subtle relationships to minimum description length (MDL) [78] and so-called maximum entropy approaches -- topics that would take us a bit beyond our central concerns. Cortes and her colleagues pioneered the analysis of learning curves for estimating the final quality of a classifier [22, 21]. No rate-of-convergence results can be given in the arbitrary case for finding the Bayes error, however [6]. Hughes [46] first carried out the required computations and obtained the results shown in Fig. 9.14. Extensive books on techniques for combining general classifiers include [55, 56], and for combining neural nets in particular, [86, 9]. Perrone and Cooper described the benefits that arise when expert classifiers disagree [73]. Dasarathy's book [24] has a nice mixture of theory (focusing more on sensor fusion than multiclassifier systems per se) and a collection of important original papers, including [43, 61, 96]. The simple heuristics for converting 1-of-c and rank order outputs to numerical values enabling integration were discussed in [63]. The hierarchical mixture of experts architecture and learning algorithm was first described in [51, 52]. A specific hierarchical multiclassifier technique is stacked generalization [107, 88, 89, 12], where for instance Gaussian kernel estimates at one level are pooled by yet other Gaussian kernels at a higher level. We have skipped over a great deal of work from the formal field of computational learning theory. Such work is generally preoccupied with convergence properties, asymptotics, and computational complexity, and usually relies on simplified or general models. Anthony and Biggs' short, clear and elegant book is an excellent introduction to the field [5]; broader texts include [49, 70, 53].
Perhaps the work from the field most useful for pattern recognition practitioners comes from weak learnability and boosting, mentioned above. The probably approximately correct (PAC) framework, introduced by Valiant [98], has been very influential in computational learning theory, but has had only minor influence on the development of practical pattern recognition systems. A somewhat broader formulation, probably almost Bayes (PAB), is described in [4]. The work by Vapnik and Chervonenkis on structural risk minimization [102], and later Vapnik-Chervonenkis (VC) theory [100, 101], derives (among other things) expected error bounds; it too has proven influential to the theory community. Alas, the bounds derived are somewhat loose in practice [19, 106].

Problems

Section 9.2

1. One of the "conservation laws" for generalization states that the positive generalization performance of an algorithm in some learning situations must be offset by negative performance elsewhere. Consider a very simple learning algorithm that seems to contradict this law: for each test pattern, the prediction of the majority learning algorithm is merely the category most prevalent in the training data.

(a) Show that, averaged over all two-category problems with a given number of features, the off-training-set error is 0.5.

(b) Repeat (a) but for the minority learning algorithm, which always predicts the category label of the category least prevalent in the training data.

(c) Use your answers from (a) & (b) to illustrate Part 2 of the No Free Lunch Theorem (Theorem 9.1).

2. Prove Part 1 of Theorem 9.1, i.e., that uniformly averaged over all target functions F, E1(E|F, n) − E2(E|F, n) = 0. Summarize and interpret this result in words.

3. Prove Part 2 of Theorem 9.1, i.e., that for any fixed training set D, uniformly averaged over F, E1(E|F, D) − E2(E|F, D) = 0. Summarize and interpret this result in words.

4.
Prove Part 3 of Theorem 9.1, i.e., that uniformly averaged over all priors P(F), E1(E|n) − E2(E|n) = 0. Summarize and interpret this result in words.

5. Prove Part 4 of Theorem 9.1, i.e., that for any fixed training set D, uniformly averaged over P(F), E1(E|D) − E2(E|D) = 0. Summarize and interpret this result in words.

6. Suppose you call an algorithm "better" if it performs slightly better than average over most problems, but very poorly on a small number of problems. Explain why the No Free Lunch Theorem does not preclude the existence of algorithms "better" in this way.

7. Show by simple counterexamples that the averaging in the different Parts of the No Free Lunch Theorem (Theorem 9.1) must be done "uniformly." For instance, imagine that the sampling distribution is a Dirac delta distribution centered on a single target function, and that algorithm 1 guesses the target function exactly while algorithm 2 disagrees with algorithm 1 on every prediction.

(a) Part 1
(b) Part 2
(c) Part 3
(d) Part 4

8. State how the No Free Lunch theorems imply that you cannot use training data to distinguish between new problems for which you generalize well and those for which you generalize poorly. Argue by reductio ad absurdum: if you could distinguish such problems, then the No Free Lunch Theorem would be violated.

9. Prove the relation

$$\sum_{r=0}^{n} \binom{n}{r} = (1+1)^n = 2^n$$

of Eq. 5 in two ways:

(a) State the polynomial expansion of $(x+y)^n$ as a summation of coefficients and powers of x and y. Then make a simple substitution for x and y.

(b) Prove the relation by induction. Let $K(n) = \sum_{r=0}^{n} \binom{n}{r}$. First confirm that the relation is valid for n = 1, i.e., that $K(1) = 2^1$. Now prove that $K(n+1) = 2K(n)$ for arbitrary n.

10. Consider the number of different Venn diagrams for k binary features f1, ..., fk. (Figure 9.2 shows several of these configurations for the k = 3 case.)

(a) How many functionally different Venn diagrams exist for the k = 2 case? Sketch all of them.
(b) For each case, state how many different regions exist.

Contents (Chapter 10)

10.4.4 *Fuzzy k-means clustering
Algorithm 2: Fuzzy k-means
10.5 Unsupervised Bayesian Learning
10.5.1 The Bayes Classifier
10.5.2 Learning the Parameter Vector
Example 2: Unsupervised learning of Gaussian data
10.5.3 Decision-Directed Approximation
10.6 *Data Description and Clustering
10.6.1 Similarity Measures
10.7 Criterion Functions for Clustering
10.7.1 The Sum-of-Squared-Error Criterion
10.7.2 Related Minimum Variance Criteria
10.7.3 Scattering Criteria
Example 3: Clustering criteria
10.8 *Iterative Optimization
Algorithm 3: Basic minimum-squared-error
10.9 Hierarchical Clustering
10.9.1 Definitions
10.9.2 Agglomerative Hierarchical Clustering
Algorithm 4: Agglomerative hierarchical
10.9.3 Stepwise-Optimal Hierarchical Clustering
Algorithm 5: Stepwise optimal hierarchical clustering
10.9.4 Hierarchical Clustering and Induced Metrics
10.10 *The Problem of Validity
10.11 Competitive Learning
Algorithm 6: Competitive learning
10.11.1 Unknown number of clusters
Algorithm 7: Leader-follower
10.11.2 Adaptive Resonance
10.12 *Graph Theoretic Methods
10.13 Component analysis
10.13.1 Principal component analysis (PCA)
10.13.2 Non-linear component analysis
10.13.3 *Independent component analysis (ICA)
10.14 Low-Dimensional Representations and Multidimensional Scaling (MDS)
10.14.1 Self-organizing feature maps
10.14.2 Clustering and Dimensionality Reduction
Algorithm 8: Hierarchical dimensionality reduction
Summary
Bibliographical and Historical Remarks
Problems
Computer exercises
Bibliography
Index

Chapter 10

Unsupervised Learning and Clustering

10.1 Introduction

Until now we have assumed that the training samples used to design a classifier were labeled by their category membership. Procedures that use unlabeled samples are said to be unsupervised.
Now we shall investigate a number of procedures that use unlabeled samples. That is, we shall see what can be done when all one has is a collection of samples without being told their categories. One might wonder why anyone is interested in such an unpromising problem, and whether or not it is possible even in principle to learn anything of value from unlabeled samples. There are at least five basic reasons for interest in unsupervised procedures.

First, collecting and labeling a large set of sample patterns can be surprisingly costly. For instance, recorded speech is virtually free, but accurately labeling the speech -- marking what word or phoneme is being uttered at each instant -- can be very expensive and time consuming. If a classifier can be crudely designed on a small set of labeled samples, and then "tuned up" by allowing it to run without supervision on a large, unlabeled set, much time and trouble can be saved.

Second, one might wish to proceed in the reverse direction: train with large amounts of (less expensive) unlabeled data, and only then use supervision to label the groupings found. This may be appropriate for large "data mining" applications where the contents of a large database are not known beforehand.

Third, in many applications the characteristics of the patterns can change slowly with time, for example in automated food classification as the seasons change. If these changes can be tracked by a classifier running in an unsupervised mode, improved performance can be achieved.

Fourth, we can use unsupervised methods to find features that will then be useful for categorization; such methods represent a form of data-dependent "smart preprocessing" or "smart feature extraction."

Lastly, in the early stages of an investigation it may be valuable to gain some insight into the nature or structure of the data.
The discovery of distinct subclasses, of similarities among patterns, or of major departures from expected characteristics may suggest that we significantly alter our approach to designing the classifier.

The answer to the question of whether or not it is possible in principle to learn anything from unlabeled data depends upon the assumptions one is willing to accept -- theorems cannot be proved without premises. We shall begin with the very restrictive assumption that the functional forms for the underlying probability densities are known, and that the only thing that must be learned is the value of an unknown parameter vector. Interestingly enough, the formal solution to this problem will turn out to be almost identical to the solution for the problem of supervised learning given in Chap. ??. Unfortunately, in the unsupervised case the solution suffers from the usual problems associated with parametric assumptions without providing any of the benefits of computational simplicity. This will lead us to various attempts to reformulate the problem as one of partitioning the data into subgroups or clusters. While some of the resulting clustering procedures have no known significant theoretical properties, they are still among the more useful tools for pattern recognition problems.

10.2 Mixture Densities and Identifiability

We begin by assuming that we know the complete probability structure for the problem with the sole exception of the values of some parameters. To be more specific, we make the following assumptions:

1. The samples come from a known number c of classes.

2. The prior probabilities $P(\omega_j)$ for each class are known, $j = 1, \ldots, c$.

3. The forms of the class-conditional probability densities $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$ are known, $j = 1, \ldots, c$.

4. The values of the c parameter vectors $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c$ are unknown.

5.
The category labels are unknown.

Samples are assumed to be obtained by selecting a state of nature $\omega_j$ with probability $P(\omega_j)$ and then selecting an $\mathbf{x}$ according to the probability law $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$. Thus, the probability density function for the samples is given by

$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)\,P(\omega_j), \qquad (1)$$

where $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c)$. For obvious reasons, a density function of this form is called a mixture density. The conditional densities $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$ are called the component densities, and the prior probabilities $P(\omega_j)$ are called the mixing parameters. The mixing parameters can also be included among the unknown parameters, but for the moment we shall assume that only $\boldsymbol{\theta}$ is unknown.

Our basic goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector $\boldsymbol{\theta}$. Once we know $\boldsymbol{\theta}$ we can decompose the mixture into its components and use a Bayesian classifier on the derived densities, if indeed classification is our final goal. Before seeking explicit solutions to this problem, however, let us ask whether or not it is possible in principle to recover $\boldsymbol{\theta}$ from the mixture. Suppose that we had an unlimited number of samples, and that we used one of the nonparametric methods of Chap. ?? to determine the value of $p(\mathbf{x}|\boldsymbol{\theta})$ for every $\mathbf{x}$. If there is only one value of $\boldsymbol{\theta}$ that will produce the observed values for $p(\mathbf{x}|\boldsymbol{\theta})$, then a solution is at least possible in principle. However, if several different values of $\boldsymbol{\theta}$ can produce the same values for $p(\mathbf{x}|\boldsymbol{\theta})$, then there is no hope of obtaining a unique solution. These considerations lead us to the following definition: a density $p(\mathbf{x}|\boldsymbol{\theta})$ is said to be identifiable if $\boldsymbol{\theta} \neq \boldsymbol{\theta}'$ implies that there exists an $\mathbf{x}$ such that $p(\mathbf{x}|\boldsymbol{\theta}) \neq p(\mathbf{x}|\boldsymbol{\theta}')$. Or put another way, a density $p(\mathbf{x}|\boldsymbol{\theta})$ is not identifiable if we cannot recover a unique $\boldsymbol{\theta}$, even from an infinite amount of data.
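The two-step sampling process described above -- first select a state of nature $\omega_j$ with probability $P(\omega_j)$, then draw $\mathbf{x}$ from the corresponding component density -- can be sketched in a few lines. The Gaussian components, their means, and the mixing parameters below are illustrative assumptions, not values from the text:

```python
import random

def sample_mixture(priors, means, sigmas, n, seed=0):
    """Draw n samples from a 1-D Gaussian mixture by first selecting a
    state of nature omega_j with probability P(omega_j), then drawing
    x from the component density p(x|omega_j, theta_j)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        j = rng.choices(range(len(priors)), weights=priors)[0]
        samples.append(rng.gauss(means[j], sigmas[j]))
    return samples

# hypothetical two-component mixture: P = (0.3, 0.7), means -2 and +2
xs = sample_mixture(priors=[0.3, 0.7], means=[-2.0, 2.0],
                    sigmas=[1.0, 1.0], n=1000)
print(len(xs))   # 1000 unlabeled samples from the mixture of Eq. 1
```

The category label j is discarded after sampling, which is exactly what makes the data "unlabeled": the learner sees only the pooled draws from Eq. 1.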
In the discouraging situation where we cannot infer any of the individual parameters (i.e., components of $\boldsymbol{\theta}$), the density is completely unidentifiable. Note that the identifiability of $\boldsymbol{\theta}$ is a property of the model, irrespective of any procedure we might use to determine its value. As one might expect, the study of unsupervised learning is greatly simplified if we restrict ourselves to identifiable mixtures. Fortunately, most mixtures of commonly encountered density functions are identifiable, as are most complex or high-dimensional density functions encountered in real-world problems.

Mixtures of discrete distributions are not always so obliging. As a simple example, consider the case where $x$ is binary and $P(x|\boldsymbol{\theta})$ is the mixture

$$P(x|\boldsymbol{\theta}) = \frac{1}{2}\theta_1^x(1-\theta_1)^{1-x} + \frac{1}{2}\theta_2^x(1-\theta_2)^{1-x}
= \begin{cases} \frac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1\\[2pt] 1 - \frac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0. \end{cases}$$

Suppose, for example, that we know for our data that $P(x=1|\boldsymbol{\theta}) = 0.6$, and hence that $P(x=0|\boldsymbol{\theta}) = 0.4$. Then we know the function $P(x|\boldsymbol{\theta})$, but we cannot determine $\boldsymbol{\theta}$, and hence cannot extract the component distributions. The most we can say is that $\theta_1 + \theta_2 = 1.2$. Thus, here we have a case in which the mixture distribution is completely unidentifiable, and hence a case for which unsupervised learning is impossible in principle. Related situations may permit us to determine one or some parameters, but not all (Problem 3).

This kind of problem commonly occurs with discrete distributions. If there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can be a serious problem. For the continuous case the problems are less severe, although certain minor difficulties can arise due to the possibility of special cases.
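The complete unidentifiability of the binary example is easy to verify numerically: any two parameter vectors with the same sum $\theta_1 + \theta_2$ yield exactly the same mixture distribution. (The particular pairs 0.8, 0.4 and 0.6, 0.6 below are arbitrary illustrative choices with equal sums.)

```python
def mixture_pmf(theta1, theta2, x):
    """Equal-weight mixture of two Bernoulli components:
    P(x|theta) = 0.5*theta1^x(1-theta1)^(1-x) + 0.5*theta2^x(1-theta2)^(1-x)."""
    return (0.5 * theta1**x * (1 - theta1)**(1 - x)
            + 0.5 * theta2**x * (1 - theta2)**(1 - x))

# two different parameter vectors with the same sum theta1 + theta2 = 1.2
for x in (0, 1):
    p_a = mixture_pmf(0.8, 0.4, x)
    p_b = mixture_pmf(0.6, 0.6, x)
    # equal up to floating-point rounding: the data cannot distinguish them
    print(x, p_a, p_b)
```

Since every possible observation is assigned the same probability by both parameter vectors, no amount of data can separate them.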
Thus, while it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

$$p(x|\boldsymbol{\theta}) = \frac{P(\omega_1)}{\sqrt{2\pi}}\exp\!\left[-\frac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}}\exp\!\left[-\frac{1}{2}(x-\theta_2)^2\right] \qquad (2)$$

cannot be uniquely identified if $P(\omega_1) = P(\omega_2)$, for then $\theta_1$ and $\theta_2$ can be interchanged without affecting $p(x|\boldsymbol{\theta})$. To avoid such irritations, we shall acknowledge that identifiability can be a problem, but shall henceforth assume that the mixture densities we are working with are identifiable. (Technically speaking, a distribution is not identifiable if we cannot determine the parameters without bias. We might guess their correct values, but such a guess would have to be biased in some way.)

10.3 Maximum-Likelihood Estimates

Suppose now that we [...], and we obtain

$$P(\omega_i|\mathbf{x}, D) = \frac{p(\mathbf{x}|\omega_i, D)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j, D)\,P(\omega_j)}. \qquad (32)$$

Central to the Bayesian approach is the introduction of the unknown parameter vector $\boldsymbol{\theta}$ via

$$p(\mathbf{x}|\omega_i, D) = \int p(\mathbf{x}, \boldsymbol{\theta}|\omega_i, D)\,d\boldsymbol{\theta} = \int p(\mathbf{x}|\boldsymbol{\theta}, \omega_i, D)\,p(\boldsymbol{\theta}|\omega_i, D)\,d\boldsymbol{\theta}. \qquad (33)$$

Since the selection of $\mathbf{x}$ is independent of the samples, we have $p(\mathbf{x}|\boldsymbol{\theta}, \omega_i, D) = p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i)$. Similarly, since knowledge of the state of nature when $\mathbf{x}$ is selected tells us nothing about the distribution of $\boldsymbol{\theta}$, we have $p(\boldsymbol{\theta}|\omega_i, D) = p(\boldsymbol{\theta}|D)$, and thus

$$p(\mathbf{x}|\omega_i, D) = \int p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i)\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}. \qquad (34)$$

That is, our best estimate of $p(\mathbf{x}|\omega_i)$ is obtained by averaging $p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i)$ over $\boldsymbol{\theta}_i$. Whether or not this is a good estimate depends on the nature of $p(\boldsymbol{\theta}|D)$, and thus our attention turns at last to that density.

10.5.2 Learning the Parameter Vector

We can use Bayes' formula to write

$$p(\boldsymbol{\theta}|D) = \frac{p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}{\int p(D|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}}, \qquad (35)$$

where the independence of the samples yields the likelihood

$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta}). \qquad (36)$$

Alternatively, letting $D^n$ denote the set of n samples, we can write Eq. 35 in the recursive form

$$p(\boldsymbol{\theta}|D^n) = \frac{p(\mathbf{x}_n|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D^{n-1})}{\int p(\mathbf{x}_n|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D^{n-1})\,d\boldsymbol{\theta}}. \qquad (37)$$

These are the basic equations for unsupervised Bayesian learning.
Equation 35 emphasizes the relation between the Bayesian and the maximum-likelihood solutions. If $p(\boldsymbol{\theta})$ is essentially uniform over the region where $p(D|\boldsymbol{\theta})$ peaks, then $p(\boldsymbol{\theta}|D)$ peaks at the same place. If the only significant peak occurs at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$, and if the peak is very sharp, then Eqs. 32 & 34 yield

$$p(\mathbf{x}|\omega_i, D) \simeq p(\mathbf{x}|\omega_i, \hat{\boldsymbol{\theta}}_i) \qquad (38)$$

and

$$P(\omega_i|\mathbf{x}, D) \simeq \frac{p(\mathbf{x}|\omega_i, \hat{\boldsymbol{\theta}}_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \hat{\boldsymbol{\theta}}_j)\,P(\omega_j)}. \qquad (39)$$

That is, these conditions justify the use of the maximum-likelihood estimate as if it were the true value of $\boldsymbol{\theta}$ in designing the Bayes classifier.

As we saw in Sect. ??.??, in the limit of large amounts of data, maximum-likelihood and Bayes methods will agree (or nearly agree). While in many small sample size problems they will agree, there exist small sample size problems where the approximations are poor (Fig. 10.4).

Figure 10.4: In a highly skewed or multiple-peak posterior distribution such as illustrated here, the maximum-likelihood solution $\hat{\boldsymbol{\theta}}$ will yield a density very different from a Bayesian solution, which requires the integration over the full range of parameter space $\boldsymbol{\theta}$.

As we saw in the analogous case in supervised learning, whether one chooses to use the maximum-likelihood or the Bayes method depends not only on how confident one is of the prior distributions, but also on computational considerations; maximum-likelihood techniques are often easier to implement than Bayesian ones. Of course, if $p(\boldsymbol{\theta})$ has been obtained by supervised learning using a large set of labeled samples, it will be far from uniform, and it will have a dominant influence on $p(\boldsymbol{\theta}|D^n)$ when n is small. Equation 37 shows how the observation of an additional unlabeled sample modifies our opinion about the true value of $\boldsymbol{\theta}$, and emphasizes the ideas of updating and learning.
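Equation 37 lends itself to a simple numerical sketch. The grid approximation below assumes a hypothetical two-component Gaussian mixture in which only the mean θ of the first component is unknown; the second component's mean (3.0), the unit variances, and the equal mixing parameters are all illustrative assumptions. Each unlabeled sample multiplies the current posterior by the mixture likelihood and renormalizes:

```python
import math
import random

def normal_pdf(x, mu):
    # unit-variance normal density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def mixture_pdf(x, theta):
    # p(x|theta) = 0.5 N(theta, 1) + 0.5 N(3, 1); only theta is unknown
    return 0.5 * normal_pdf(x, theta) + 0.5 * normal_pdf(x, 3.0)

grid = [i * 0.05 for i in range(-80, 81)]     # candidate theta values in [-4, 4]
posterior = [1.0 / len(grid)] * len(grid)     # flat prior p(theta)

rng = random.Random(1)
true_theta = -1.0
for _ in range(200):
    # draw an unlabeled sample from the true mixture
    x = rng.gauss(true_theta, 1.0) if rng.random() < 0.5 else rng.gauss(3.0, 1.0)
    # recursive update (Eq. 37): posterior ∝ p(x_n|theta) * previous posterior
    weights = [p * mixture_pdf(x, th) for th, p in zip(grid, posterior)]
    z = sum(weights)
    posterior = [w / z for w in weights]

theta_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
print(round(theta_map, 2))   # posterior mode; sharpens toward the true value
```

Because this mixture is identifiable, the posterior concentrates around the true θ as more unlabeled samples arrive, even though no sample carries a category label.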
If the mixture density $p(\mathbf{x}|\boldsymbol{\theta})$ is identifiable, then each additional sample tends to sharpen $p(\boldsymbol{\theta}|D^n)$, and under fairly general conditions $p(\boldsymbol{\theta}|D^n)$ can be shown to converge (in probability) to a Dirac delta function centered at the true value of $\boldsymbol{\theta}$ (Problem 8). Thus, even though we do not know the categories of the samples, identifiability assures us that we can learn the unknown parameter vector $\boldsymbol{\theta}$, and thereby learn the component densities $p(\mathbf{x}|\omega_i, \boldsymbol{\theta})$. This, then, is the formal Bayesian solution to the problem of unsupervised learning.

In retrospect, the fact that unsupervised learning of the parameters of a mixture density is so similar to supervised learning of the parameters of a component density is not at all surprising. Indeed, if the component density is itself a mixture, there would appear to be no essential difference between the two problems. There are, however, some significant differences between supervised and unsupervised learning. One of the major differences concerns the issue of identifiability. With supervised learning, the lack of identifiabilit[...]

[...] analytically simple results. Exact solutions for even the simplest nontrivial examples lead to computational requirements that grow exponentially with the number of samples (Problem ??). The problem of unsupervised learning is too important to abandon just because exact solutions are hard to find, however, and numerous procedures for obtaining approximate solutions have been suggested.

Since the important difference between supervised and unsupervised learning is the presence or absence of labels for the samples, an obvious approach to unsupervised learning is to use the prior information to design a classifier and to use the decisions of this classifier to label the samples. This is called the decision-directed approach to unsupervised learning, and it is subject to many variations. It can be applied sequentially on-line by updating the classifier each time an unlabeled sample is classified.
Alternatively, it can be applied in parallel (batch mode) by waiting until all n samples are classified before updating the classifier. If desired, this process can be repeated until no changes occur in the way the samples are labeled. Various heuristics can be introduced to make the extent of any corrections depend upon the confidence of the classification decision.

There are some obvious dangers associated with the decision-directed approach. If the initial classifier is not reasonably good, or if an unfortunate sequence of samples is encountered, the errors in classifying the unlabeled samples can drive the classifier the wrong way, resulting in a solution corresponding roughly to one of the lesser peaks of the likelihood function. Even if the initial classifier is optimal, in general the resulting labeling will not be the same as the true class membership; the act of classification will exclude samples from the tails of the desired distribution, and will include samples from the tails of the other distributions. Thus, if there is significant overlap between the component densities, one can expect biased estimates and less than optimal results.

Despite these drawbacks, the simplicity of decision-directed procedures makes the Bayesian approach computationally feasible, and a flawed solution is often better than none. If conditions are favorable, performance that is nearly optimal can be achieved at far less computational expense. In practice it is found that most of these procedures work well if the parametric assumptions are valid, if there is little overlap between the component densities, and if the initial classifier design is at least roughly correct (Computer exercise 7).

10.6 *Data Description and Clustering

Let us reconsider our original problem of learning something of use from a set of unlabeled samples. Viewed geometrically, these samples may form clouds of points in a d-dimensional space.
Suppose that we knew that these points came from a single normal distribution. Then the most we could learn from the data would be contained in the sufficient statistics -- the sample mean and the sample covariance matrix. In essence, these statistics constitute a compact description of the data. The sample mean locates the center of gravity of the cloud; it can be thought of as the single point $\mathbf{m}$ that best represents all of the data in the sense of minimizing the sum of squared distances from $\mathbf{m}$ to the samples. The sample covariance matrix describes the amount the data scatters along various directions. If the data points are actually normally distributed, then the cloud has a simple hyperellipsoidal shape, and the sample mean tends to fall in the region where the samples are most densely concentrated. Of course, if the samples are not normally distributed, these statistics can give a very misleading description of the data. Figure 10.5 shows four different data sets that all have the same mean and covariance matrix. Obviously, second-order statistics are incapable of revealing all of the structure in an arbitrary set of data.

Figure 10.5: These four data sets have identical s[...]

[...]all directions. Clusters defined by Euclidean distance will be invariant to translations or rotations in feature space -- rigid-body motions of the data points. However, they will not be invariant to linear transformations in general, or to other transformations that distort the distance relationships. Thus, as Fig. 10.7 illustrates, a simple scaling of the coordinate axes can result in a different grouping of the data into clusters. Of course, this is of no concern for problems in which arbitrary rescaling is an unnatural

Figure 10.6: The distance threshold affects the number and size of clusters.
Lines are drawn between points closer than a distance d0 apart, for three different values of d0 -- the smaller the value of d0, the smaller and more numerous the clusters.

or meaningless transformation. However, if clusters are to mean anything, they should be invariant to transformations natural to the problem. One way to achieve invariance is to normalize the data prior to clustering. For example, to obtain invariance to displacement and scale changes, one might translate and scale the axes so that all of the features have zero mean and unit variance -- standardize the data. To obtain invariance to rotation, one might rotate the axes so that they coincide with the eigenvectors of the sample covariance matrix. This transformation to principal components (Sect. 10.13.1) can be preceded and/or followed by normalization for scale.

However, we should not conclude that this kind of normalization is necessarily desirable. Consider, for example, the matter of translating and whitening -- scaling the axes so that each feature has zero mean and unit variance. The rationale usually given for this normalization is that it prevents certain features from dominating distance calculations merely because they have large numerical values, much as we saw in networks trained with backpropagation (Sect. ??.??). Subtracting the mean and dividing by the standard deviation is an appropriate normalization if the spread of values is due to normal random variation; however, it can be quite inappropriate if the spread is due to the presence of subclasses (Fig. ??). Thus, this routine normalization may be less than helpful in the cases of greatest interest. Section ?? describes other ways to obtain invariance to scaling.

Instead of scaling axes, we can change the metric in interesting ways.
For instance, one broad class of distance metrics is of the form

$$d(\mathbf{x}, \mathbf{x}') = \left(\sum_{k=1}^{d} |x_k - x'_k|^q\right)^{1/q}, \qquad (44)$$

where $q \geq 1$ is a selectable parameter -- the general Minkowski metric we considered in Chap. ??. Setting q = 2 gives the familiar Euclidean metric, while setting q = 1 gives the Manhattan or city block metric -- the sum of the absolute distances along each of the d coordinate axes. Note that only q = 2 is invariant to an arbitrary rotation or translation in feature space. Another alternative is to use some kind of metric based on the data itself, such as the Mahalanobis distance. (In backpropagation, one of the goals for such preprocessing and scaling of data was to increase learning speed; in contrast, such preprocessing does not significantly affect the speed of these clustering algorithms.)

Figure 10.7: Scaling axes affects the clusters in a minimum distance cluster method. The original data and minimum-distance clusters are shown in the upper left -- points in one cluster are shown in red, the others in gray. When the vertical axis is expanded by a factor of 2.0 and the horizontal axis shrunk by a factor of 0.5, the clustering is altered (as shown at the right). Alternatively, if the vertical axis is shrunk by a factor of 0.5 and the horizontal axis expanded by a factor of 2.0, smaller and more numerous clusters result (shown at the bottom). In both these scaled cases, the clusters differ from the original.

More generally, one can abandon the use of distance altogether and introduce a nonmetric similarity function $s(\mathbf{x}, \mathbf{x}')$ to compare two vectors $\mathbf{x}$ and $\mathbf{x}'$. Conventionally, this is a symmetric function whose value is large when $\mathbf{x}$ and $\mathbf{x}'$ are somehow "similar."
For example, when the angle between two vectors is a meaningful measure of their similarity, then the normalized inner product

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t\mathbf{x}'}{\|\mathbf{x}\|\,\|\mathbf{x}'\|} \qquad (45)$$

may be an appropriate similarity function. This measure, which is the cosine of the angle between $\mathbf{x}$ and $\mathbf{x}'$, is invariant to rotation and dilation, though it is not invariant to translation and general linear transformations.

Figure 10.8: If the data fall into well-separated clusters (left), normalization by a whitening transform for the full data may reduce the separation, and hence be undesirable (right). Such a whitening normalization may be appropriate if the full data set arises from a single fundamental process (with noise), but inappropriate if there are several different processes, as shown here.

When the features are binary valued (0 or 1), this similarity function has a simple non-geometrical interpretation in terms of shared features or shared attributes. Let us say that a sample $\mathbf{x}$ possesses the ith attribute if $x_i = 1$. Then $\mathbf{x}^t\mathbf{x}'$ is merely the number of attributes possessed by both $\mathbf{x}$ and $\mathbf{x}'$, and $\|\mathbf{x}\|\,\|\mathbf{x}'\| = (\mathbf{x}^t\mathbf{x}\;\mathbf{x}'^t\mathbf{x}')^{1/2}$ is the geometric mean of the number of attributes possessed by $\mathbf{x}$ and the number possessed by $\mathbf{x}'$. Thus, $s(\mathbf{x}, \mathbf{x}')$ is a measure of the relative possession of common attributes. Some simple variations are

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t\mathbf{x}'}{d}, \qquad (46)$$

the fraction of attributes shared, and

$$s(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x}^t\mathbf{x}'}{\mathbf{x}^t\mathbf{x} + \mathbf{x}'^t\mathbf{x}' - \mathbf{x}^t\mathbf{x}'}, \qquad (47)$$

the ratio of the number of shared attributes to the number possessed by $\mathbf{x}$ or $\mathbf{x}'$. This latter measure (sometimes known as the Tanimoto coefficient or Tanimoto distance) is frequently encountered in the fields of information retrieval and biological taxonomy. Related measures of similarity arise in other applications, the variety of measures testifying to the diversity of problem domains (Computer exercise ??).
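The Minkowski metric of Eq. 44 and the similarity functions of Eqs. 45 and 47 are each only a few lines of code; the vectors below are arbitrary illustrative values:

```python
import math

def minkowski(x, y, q):
    # general Minkowski metric (Eq. 44); q = 2 is Euclidean, q = 1 city block
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def cosine_similarity(x, y):
    # normalized inner product (Eq. 45): cosine of the angle between x and y
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def tanimoto(x, y):
    # Tanimoto coefficient (Eq. 47): shared attributes over attributes in x or y
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

print(minkowski((0, 0), (3, 4), 2))      # 5.0  (Euclidean)
print(minkowski((0, 0), (3, 4), 1))      # 7.0  (city block)

x, y = (1, 1, 0, 1, 0), (1, 0, 0, 1, 1)  # binary attribute vectors
print(cosine_similarity(x, y))           # 2/3: two shared attributes, three apiece
print(tanimoto(x, y))                    # 0.5: two shared of four distinct attributes
```

Note how the same (3, 4) displacement yields different distances under q = 2 and q = 1, the point made in the text about invariance to rotation.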
Fundamental issues in measurement theory are involved in the use of any distance or similarity function. The calculation of the similarity between two vectors always involves combining the values of their components. Yet in many pattern recognition applications the components of the feature vector measure seemingly noncomparable quantities, such as meters and kilograms. Recall our example of classifying fish: how can one compare the lightness of the skin to the length or weight of the fish? Should the comparison depend on whether the length is measured in meters or inches? How does one treat vectors whose components have a mixture of nominal, ordinal, interval and ratio scales? Ultimately, there are rarely clear methodological answers to these questions. When a user selects a particular similarity function or normalizes the data in a particular way, information is introduced that gives the procedure meaning. We have given examples of some alternatives that have proved to be useful. (Competitive learning, discussed in Sect. 10.11, is a popular decision-directed clustering algorithm.) Beyond that we can do little more than alert the unwary to these pitfalls of clustering.

Amidst all this discussion of clustering, we must not lose sight of the fact that often the clusters found will later be labeled (e.g., by resorting to a teacher or a small number of labeled samples), and that the clusters can then be used for classification. In that case, the same similarity (or metric) should be used for classification as was used for forming the clusters (Computer exercise 8).

10.7 Criterion Functions for Clustering

We have just consi ... favored under conditions where there may be unknown or irrelevant linear transformations of the data.

Invariant Criteria

It is not particularly hard to show that the eigenvalues \lambda_1, ..., \lambda_d of S_W^{-1} S_B are invariant under nonsingular linear transformations of the data (Problem ??).
Indeed, these eigenvalues are the basic linear invariants of the scatter matrices. Their numerical values measure the ratio of between-cluster to within-cluster scatter in the direction of the eigenvectors, and partitions that yield large values are usually desirable. Of course, as we pointed out in Sect. ??, the fact that the rank of S_B cannot exceed c - 1 means that no more than c - 1 of these eigenvalues can be nonzero. Nevertheless, good partitions are ones for which the nonzero eigenvalues are large.

One can invent a great variety of invariant clustering criteria by composing appropriate functions of these eigenvalues. Some of these follow naturally from standard matrix operations. For example, since the trace of a matrix is the sum of its eigenvalues, one might elect to maximize the criterion function

tr S_W^{-1} S_B = \sum_{i=1}^{d} \lambda_i.    (64)

By using the relation S_T = S_W + S_B, one can derive the following invariant relatives of tr S_W and |S_W| (Problem 25):

J_f = tr S_T^{-1} S_W = \sum_{i=1}^{d} \frac{1}{1 + \lambda_i}    (65)

and

\frac{|S_W|}{|S_T|} = \prod_{i=1}^{d} \frac{1}{1 + \lambda_i}.    (66)

Since all of these criterion functions are invariant to linear transformations, the same is true of the partitions that extremize them. In the special case of two clusters, only one eigenvalue is nonzero, and all of these criteria yield the same clustering. However, when the samples are partitioned into more than two clusters, the optimal partitions, though often similar, need not be the same, as shown in Example 3.

Example 3: Clustering criteria

We can gain some intuition by considering these criteria applied to the following data set.
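These invariant criteria can be computed directly from a labeled partition. The sketch below (our own illustration in NumPy, assuming S_W is nonsingular) obtains the eigenvalues \lambda_i of S_W^{-1} S_B and evaluates Eqs. 64-66:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster (S_W) and between-cluster (S_B) scatter matrices."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    m = X.mean(axis=0)                     # total sample mean
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)               # cluster mean
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - m, mc - m)
    return S_W, S_B

def invariant_criteria(X, labels):
    """Eigenvalue-based criteria of Eqs. 64-66 (assumes S_W nonsingular)."""
    S_W, S_B = scatter_matrices(X, labels)
    lam = np.linalg.eigvals(np.linalg.solve(S_W, S_B)).real  # the lambda_i
    return {
        "tr_SWinv_SB": lam.sum(),                 # Eq. 64
        "J_f": (1.0 / (1.0 + lam)).sum(),         # Eq. 65
        "det_ratio": (1.0 / (1.0 + lam)).prod(),  # Eq. 66
    }
```

Since S_T = S_W + S_B, the values returned for Eqs. 65 and 66 can be checked against the direct computations tr(S_T^{-1} S_W) and |S_W|/|S_T|.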
sample    x1      x2        sample    x1      x2
   1    -1.82    0.24         11     0.41    0.91
   2    -0.38   -0.39         12     1.70    0.48
   3    -0.13    0.16         13     0.92   -0.49
   4    -1.17    0.44         14     2.41    0.32
   5    -0.92    0.16         15     1.48   -0.23
   6    -1.69   -0.01         16    -0.34    1.88
   7     0.33   -0.17         17     0.83    0.23
   8    -0.71   -0.21         18     0.62    0.81
   9     1.27   -0.39         19    -1.42   -0.51
  10    -0.16   -0.23         20     0.67   -0.55

Figure caption: The clusters found by minimizing a criterion depend upon the criterion function as well as the assumed number of clusters. The sum-of-squared-error criterion J_e (Eq. 49), the determinant criterion J_d (Eq. 63) and the more subtle trace criterion J_f (Eq. 65) were applied to the 20 points in the table with the assumption of c = 2 and c = 3 clusters (one panel for each combination of criterion and c). Each point in the table is shown, with bounding boxes defined by -1.8 < x < 2.5 and -0.6 < y < 1.9.

All of the clusterings seem reasonable, and there is no strong argument to favor one over the others. For the case c = 2, the clusters minimizing J_e indeed tend to have roughly equal numbers of points, as illustrated in Fig. 10.9; in contrast, J_d favors one large and one fairly small cluster. Since the full data set happens to be spread horizontally more than vertically, the eigenvalue in the horizontal direction is greater than that in the vertical direction. As such, the clusters are "stretched" horizontally somewhat. In general, the differences between the cluster criteria become less pronounced for large numbers of clusters. For the c = 3 case, for instance, the clusters depend only mildly upon the cluster criterion -- indeed, two of the clusterings are identical.

With regard to the criterion function involving S_T, note that S_T does not depend on how the samples are partitioned into clusters. Thus, the clusterings that minimize |S_W|/|S_T| are exactly the same as the ones that minimize |S_W|.
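The sum-of-squared-error criterion J_e compared in this example can be evaluated for any candidate partition (a minimal sketch; the two-cluster split used in the usage note is illustrative, not one of the partitions shown in the figure):

```python
import numpy as np

def J_e(X, labels):
    # sum over clusters of the squared distances to each cluster mean
    X, labels = np.asarray(X, float), np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        total += ((Xc - Xc.mean(axis=0)) ** 2).sum()
    return total
```

Note that any split into two clusters can only lower J_e relative to the single-cluster partition, since the total scatter decomposes into within- and between-cluster parts.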
If we rotate and scale the axes so that S_T becomes the identity matrix, we see that minimizing tr[S_T^{-1} S_W] is equivalent to minimizing the sum-of-squared-error criterion tr S_W after performing this normalization. Clearly, this criterion suffers from the very defects that we warned about in Sect. ??, and it is probably the least desirable of these criteria.

One final warning about invariant criteria is in order. If different apparent clusters can be obtained by scaling the axes or by applying any other linear transformation, then all of these groupings will be exposed by invariant procedures. Thus, invariant criterion functions are more likely to possess multiple local extrema, and are correspondingly more difficult to optimize.

The variety of the criterion functions we have discussed and the somewhat subtle differences between them should not be allowed to obscure their essential similarity. In every case the underlying model is that the samples form c fairly well separated clouds of points. The within-cluster scatter matrix S_W is used to measure the compactness of these clouds, and the basic goal is to find the most compact grouping. While this approach has proved useful for many problems, it is not universally applicable. For example, it will not extract a very dense cluster embedded in the center of a diffuse cluster, or separate intertwined line-like clusters. For such cases one must devise other criterion functions that are better matched to the structure present or being sought.

10.8 *Iterative Optimization

Once a criterion function has been selected, clustering becomes a well-defined problem in discrete optimization: find those partitions of the set of samples that extremize the criterion function. Since the sample set is finite, there are only a finite number of possible partitions. Thus, in theory the clustering problem can always be solved by exhaustive enumeration.
However, the computational complexity renders such an approach unthinkable for all but the simplest problems; there are approximately c^n / c! ways of partitioning a set of n elements into c subsets, and this exponential growth with n is overwhelming (Problem 17). For example, an exhaustive search for the best set of 5 clusters in 100 samples would require considering more than 10^67 partitionings. Simply put, in most applications an exhaustive search is completely infeasible.

The approach most frequently used in seeking optimal partitions is iterative optimization. The basic idea is to find some reasonable initial partition and to "move" samples from one group to another if such a move will improve the value of the criterion function. Like hill-climbing procedures in general, these approaches guarantee local but not global optimization. Different starting points can lead to different solutions, and one never knows whether or not the best solution has been found. Despite these limitations, the fact that the computational requirements are bearable makes this approach attractive.

Let us consider the use of iterative improvement to minimize the sum-of-squared-error criterion J_e, written as

J_e = \sum_{i=1}^{c} J_i,    (67)

where an effective error per cluster is defined to be

J_i = \sum_{x \in D_i} \|x - m_i\|^2    (68)

and the mean of each cluster is, as before,

m_i = \frac{1}{n_i} \sum_{x \in D_i} x.    (48)

Suppose that a sample \hat{x} currently in cluster D_i is tentatively moved to D_j. Then m_j changes to

m_j^* = m_j + \frac{\hat{x} - m_j}{n_j + 1}    (69)

and J_j increases to

J_j^* = \sum_{x \in D_j} \|x - m_j^*\|^2 + \|\hat{x} - m_j^*\|^2
      = \sum_{x \in D_j} \left\| x - m_j - \frac{\hat{x} - m_j}{n_j + 1} \right\|^2 + \left\| \frac{n_j (\hat{x} - m_j)}{n_j + 1} \right\|^2
      = J_j + \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2.    (70)

Under the assumption that n_i \neq 1 (singleton clusters should not be destroyed), a similar calculation (Problem 29) shows that m_i changes to

m_i^* = m_i - \frac{\hat{x} - m_i}{n_i - 1}    (71)

and J_i decreases to

J_i^* = J_i - \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2.    (72)

These equations greatly simplify the computation of the change in the criterion function. The transfer of \hat{x} from D_i to D_j is advantageous if the decrease in J_i is greater than the increase in J_j. This is the case if

\frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2 > \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2,    (73)

which typically happens whenever \hat{x} is closer to m_j than to m_i. If reassignment is profitable, the greatest decrease in sum of squared error is obtained by selecting the cluster for which n_j/(n_j + 1) \|\hat{x} - m_j\|^2 is minimum. This leads to the following clustering procedure:

Algorithm 3 (Basic iterative minimum-squared-error clustering)

1  begin initialize n, c, m_1, m_2, ..., m_c
2    do randomly select a sample \hat{x}
3      i <- arg min_{i'} \|m_{i'} - \hat{x}\|    (classify \hat{x})
4      if n_i \neq 1 then compute
5        \rho_j = n_j/(n_j + 1) \|\hat{x} - m_j\|^2  if j \neq i;  n_i/(n_i - 1) \|\hat{x} - m_i\|^2  if j = i
6        if \rho_k \leq \rho_j for all j then transfer \hat{x} to D_k
7        recompute J_e, m_i, m_k
8    until no change in J_e in n attempts
9    return m_1, m_2, ..., m_c
10 end

A moment's consideration will show that this procedure is essentially a sequential version of the k-means procedure (Algorithm 1) described in Sect. 10.4.3. Where the k-means procedure waits until all n samples have been reclassified before updating, the basic iterative minimum-squared-error procedure updates after each sample is reclassified. It has been experimentally observed that this procedure is more susceptible to being trapped in local minima, and it has the further disadvantage of making the results depend on the order in which the candidates are selected. However, it is at least a stepwise optimal procedure, and it can easily be modified to apply to problems in which samples are acquired sequentially and clustering must be done on-line.

One question that plagues all hill-climbing procedures is the choice of the starting point. Unfortunately, there is no simple, universally good solution to this problem.
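A minimal Python sketch of Algorithm 3 follows, using the incremental mean updates of Eqs. 69 and 71 and the \rho_j test of line 5 (the random initial assignment and the termination bookkeeping are our assumptions; the text leaves initialization open):

```python
import numpy as np

def sequential_mse_cluster(X, c, seed=0):
    """Sketch of Algorithm 3: transfer a randomly chosen sample whenever
    the move decreases Je, stopping after n consecutive no-change draws."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    n = len(X)
    labels = rng.integers(0, c, size=n)
    labels[:c] = np.arange(c)            # ensure every cluster is non-empty
    means = np.array([X[labels == j].mean(axis=0) for j in range(c)])
    counts = np.bincount(labels, minlength=c)
    unchanged = 0
    while unchanged < n:                 # "no change in Je in n attempts"
        idx = rng.integers(n)
        x_hat, i = X[idx], labels[idx]
        if counts[i] == 1:               # never destroy a singleton cluster
            unchanged += 1
            continue
        d2 = ((means - x_hat) ** 2).sum(axis=1)
        rho = np.where(np.arange(c) == i,
                       counts / np.maximum(counts - 1, 1) * d2,  # Eq. 72
                       counts / (counts + 1) * d2)               # Eq. 70
        k = int(np.argmin(rho))
        if k != i:
            # transfer x_hat from D_i to D_k, updating means incrementally
            means[i] -= (x_hat - means[i]) / (counts[i] - 1)     # Eq. 71
            means[k] += (x_hat - means[k]) / (counts[k] + 1)     # Eq. 69
            counts[i] -= 1
            counts[k] += 1
            labels[idx] = k
            unchanged = 0
        else:
            unchanged += 1
    return labels, means
```

Each accepted transfer strictly decreases J_e, and since there are only finitely many partitions the loop must terminate.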
One approach is to select c samples randomly for the initial cluster centers, using them to partition the data on a minimum-distance basis. Repetition with different random selections can give some indication of the sensitivity of the solution to the starting point. Yet another approach is to find the c-cluster starting point from the solution to the (c - 1)-cluster problem. The solution for the one-cluster problem is the total sample mean; the starting point for the c-cluster problem can be the final means for the (c - 1)-cluster problem plus the sample that is farthest from the nearest cluster center. This approach leads us directly to the so-called hierarchical clustering procedures, which are simple methods that can provide very good starting points for iterative optimization.

10.9 Hierarchical Clustering

Up to now, our methods have formed disjoint clusters -- in computer science terminology, we would say that the data description is "flat." However, there are many times when clusters have subclusters, these have sub-subclusters, and so on. In biological taxonomy, for instance, kingdoms are split into phyla, which are split into subphyla, which are split into orders, suborders, families, subfamilies, genera and species, and so on, all the way to a particular individual organism. Thus we might have kingdom = animal, phylum = Chordata, subphylum = Vertebrata, class = Osteichthyes, subclass = Actinopterygii, order = Salmoniformes, family = Salmonidae, genus = Oncorhynchus, species = Oncorhynchus kisutch, and individual = the particular Coho salmon caught in my net. Organisms that lie in the animal kingdom -- such as a salmon and a moose -- share important attributes that are not present in organisms in the plant kingdom, such as redwood trees. In fact, this kind of hierarchical clustering permeates classificatory activities in the sciences.
Thus we now turn to clustering methods that will lead to representations that are "hierarchical," rather than flat.

10.9.1 Definitions

Let us consider a sequence of partitions of the n samples into c clusters. The first of these is a partition into n clusters, each cluster containing exactly one sample. The next is a partition into n - 1 clusters, the next a partition into n - 2, and so on until the nth, in which all the samples form one cluster. We shall say that we are at level k in the sequence when c = n - k + 1. Thus, level one corresponds to n clusters and level n to one cluster. Given any two samples x and x', at some level they will be grouped together in the same cluster. If the sequence has the property that whenever two samples are in the same cluster at level k they remain together at all higher levels, then the sequence is said to be a hierarchical clustering.

The most natural representation of hierarchical clustering is a corresponding tree, called a dendrogram, which shows how the samples are grouped. Figure 10.10 shows a dendrogram for a simple problem involving eight samples. Level 1 shows the eight samples as singleton clusters. At level 2, samples x6 and x7 have been grouped to form a cluster, and they stay together at all subsequent levels. If it is possible to measure the similarity between clusters, then the dendrogram is usually drawn to scale to show the similarity between the clusters that are grouped. In Fig. 10.10, for example, the similarity between the two groups of samples that are merged at level 5 has a value of roughly 60. We shall see shortly how such similarity values can be obtained, but first note that the similarity values can be used to help determine whether groupings are natural or forced. If the similarity values for the levels are roughly evenly distributed throughout
the range of possible values, then there is no principled argument that any particular number of clusters is better or "more natural" than another. Conversely, suppose that there is an unusually large gap between the similarity values for the levels corresponding to c = 3 and to c = 4 clusters. In such a case, one can argue that c = 3 is the most natural number of clusters (Problem 35).

Figure 10.10: A dendrogram can represent the results of hierarchical clustering algorithms. The vertical axis shows a generalized measure of similarity among clusters (here a similarity scale running from 0 to 100), with samples x1, ..., x8 along the horizontal axis and levels 1 through 8 marked. Here, at level 1 all eight points lie in singleton clusters; each point in a cluster is highly similar to itself, of course. Points x6 and x7 happen to be the most similar, and are merged at level 2, and so forth.

Another representation for hierarchical clustering is based on sets, in which each level of cluster may contain sets that are subclusters, as shown in Fig. 10.11. Yet another, textual, representation uses brackets, such as: {{x1, {x2, x3}}, {{{x4, x5}, {x6, x7}}, x8}}. While such representations may reveal the hierarchical structure of the data, they do not naturally represent the similarities quantitatively. For this reason dendrograms are generally preferred.

Figure 10.11: A set or Venn diagram representation of two-dimensional data (which was used in the dendrogram of Fig. 10.10) reveals the hierarchical structure but not the quantitative distances between clusters. The levels are numbered in red.

Because of their conceptual simplicity, hierarchical clustering procedures are among the best-known of unsupervised methods. The procedures themselves can be divided according to two distinct approaches -- agglomerative and divisive.
Agglomerative (bottom-up, clumping) procedures start with n singleton clusters and form the sequence by successively merging clusters. Divisive (top-down, splitting) procedures start with all of the samples in one cluster and form the sequence by successively splitting clusters. The computation needed to go from one level to another is usually simpler for the agglomerative procedures. However, when there are many samples and one is interested in only a small number of clusters, this computation will have to be repeated many times. For simplicity, we shall concentrate on agglomerative procedures, and merely touch on some divisive methods in Sect. 10.12.

10.9.2 Agglomerative Hierarchical Clustering

The major steps in agglomerative clustering are contained in the following procedure, where c is the desired number of final clusters:

Algorithm 4 (Agglomerative hierarchical clustering)

1  begin initialize c, \hat{c} <- n, D_i <- {x_i}, i = 1, ..., n
2    do \hat{c} <- \hat{c} - 1
3      Find nearest clusters, say, D_i and D_j
4      Merge D_i and D_j
5    until c = \hat{c}
6    return c clusters
7  end

As described, this procedure terminates when the specified number of clusters has been obtained and returns the clusters, described as sets of points (rather than as mean or representative vectors). If we continue until c = 1 we can produce a dendrogram like that in Fig. 10.10. At any level the "distance" between nearest clusters can provide the dissimilarity value for that level. Note that we have not said how to measure the distance between two clusters, and hence how to find the "nearest" clusters, required by line 3 of the algorithm. The considerations here are much like those involved in selecting a general clustering criterion function.
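A minimal Python sketch of Algorithm 4 follows (an unoptimized, brute-force illustration of our own; any cluster-to-cluster distance can be supplied for line 3, such as the nearest-pair distance d_min defined in Eq. 74 below):

```python
import numpy as np

def agglomerative(X, c, cluster_dist):
    """Sketch of Algorithm 4: start from n singleton clusters and
    repeatedly merge the nearest pair until c clusters remain."""
    X = np.asarray(X, float)
    clusters = [X[i:i + 1] for i in range(len(X))]   # n singletons
    while len(clusters) > c:
        best = None                                  # nearest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [cl for k, cl in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

def d_min(A, B):
    # nearest-pair (single-linkage) distance between two clusters
    return min(np.linalg.norm(a - b) for a in A for b in B)
```

The clusters are returned as arrays of points, matching the algorithm's description; recording the merge distance at each step would yield the dissimilarity values for a dendrogram.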
For simplicity, we shall generally restrict our attention to the following distance measures:

d_min(D_i, D_j) = \min_{x \in D_i, x' \in D_j} \|x - x'\|    (74)

d_max(D_i, D_j) = \max_{x \in D_i, x' \in D_j} \|x - x'\|    (75)

d_avg(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \|x - x'\|    (76)

d_mean(D_i, D_j) = \|m_i - m_j\|.    (77)

All of these measures have a minimum-variance flavor, and they usually yield the same results if the clusters are compact and well separated. However, if the clusters are close to one another, or if their shapes are not basically hyperspherical, quite different results can be obtained. Below we shall illustrate some of the differences.

But first let us consider the computational complexity of a particularly simple agglomerative clustering algorithm. Suppose we have n patterns in d-dimensional space, and we seek to form c clusters using d_min(D_i, D_j) defined in Eq. 74. We will, once and for all, need to calculate n(n - 1) inter-point distances -- each of which is an O(d^2) calculation -- and place the results in an inter-point distance table. The space complexity is, then, O(n^2). Finding the minimum distance pair (for the first merging) requires that we step through the complete list, keeping the index of the smallest distance. Thus for the first agglomerative step, the complexity is O(n(n - 1)(d^2 + 1)) = O(n^2 d^2). For an arbitrary agglomeration step (i.e., from \hat{c} to \hat{c} - 1), we need merely step through the n(n - 1) - \hat{c} "unused" distances in the list and find the smallest for which x and x' lie in different clusters. This is, again, O(n(n - 1) - \hat{c}). The full time complexity is thus O(cn^2 d^2), and in typical conditions n >> c.

The Nearest-Neighbor Algorithm

When d_min is used to measure the distance between clusters (Eq.
74), the algorithm is sometimes called the nearest-neighbor cluster algorithm, or minimum algorithm. Moreover, if it is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm. Suppose that we think of the data points as being nodes of a graph, with edges forming a path between the nodes in the same subset D_i. When d_min is used to measure the distance between subsets, the nearest neighbor nodes determine the nearest subsets. The merging of D_i and D_j corresponds to adding an edge between the nearest pair of nodes in D_i and D_j. Since edges linking clusters always go between distinct clusters, the resulting graph never has any closed loops or circuits; in the terminology of graph theory, this procedure generates a tree. If it is allowed to continue until all of the subsets are linked, the result is a spanning tree -- a tree with a path from any node to any other node. Moreover, it can be shown that the sum of the edge lengths of the resulting tree will not exceed the sum of the edge lengths for any other spanning tree for that set of samples (Problem 37). Thus, with the use of d_min as the distance measure, the agglomerative clustering procedure becomes an algorithm for generating a minimal spanning tree.

Figure 10.12 shows the results of applying this procedure to Gaussian data. In both cases the procedure was stopped giving two large clusters (plus three singleton outliers); a minimal spanning tree can be obtained by adding the shortest possible edge between the two clusters. In the first case where the clusters are fairly well separated, the obvious clusters are found. In the second case, the presence of a point located so as to produce a bridge between the clusters results in a rather unexpected grouping into one large, elongated cluster, and one small, compact cluster. This behavior is often called the "chaining effect," and is sometimes considered to be a defect of this distance measure.
To the extent that the results are very sensitive to noise or to slight changes in position of the data points, this is certainly a valid criticism.

(There are methods for sorting or arranging the entries in the inter-point distance table so as to easily avoid inspection of points in the same cluster, but these typically do not improve the complexity results significantly.)

Figure 10.12: Two Gaussians were used to generate two-dimensional samples, shown in pink and black. The nearest-neighbor clustering algorithm gives two clusters that well approximate the generating Gaussians (left). If, however, another particular sample is generated (red point at the right) and the procedure re-started, the clusters do not well approximate the Gaussians. This illustrates how the algorithm is sensitive to the details of the samples.

The Farthest-Neighbor Algorithm

When d_max (Eq. 75) is used to measure the distance between subsets, the algorithm is sometimes called the farthest-neighbor clustering algorithm, or maximum algorithm. If it is terminated when the distance between nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm. The farthest-neighbor algorithm discourages the growth of elongated clusters. Application of the procedure can be thought of as producing a graph in which edges connect all of the nodes in a cluster. In the terminology of graph theory, every cluster constitutes a complete subgraph. The distance between two clusters is determined by the most distant nodes in the two clusters. When the nearest clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters. If we define the diameter of a partition as the largest diameter for clusters in the partition, then each iteration increases the diameter of the partition as little as possible. As Fig.
10.13 illustrates, this is advantageous when the true clusters are compact and roughly equal in size. Nevertheless, when this is not the case -- as happens with the two elongated clusters -- the resulting groupings can be meaningless. This is another example of imposing structure on data rather than finding structure in it.

Figure 10.13: The farthest-neighbor clustering algorithm uses the separation between the most distant points as a criterion for cluster membership. If this distance is set very large, then all points lie in the same cluster. In the case shown at the left, a fairly large d_max leads to three clusters; a smaller d_max gives four clusters.

Compromises

The minimum and maximum measures represent two extremes in measuring the distance between clusters. Like all procedures that involve minima or maxima, they tend to be overly sensitive to "outliers" or "wildshots." The use of averaging is an obvious way to ameliorate these problems, and d_avg and d_mean (Eqs. 76 & 77) are natural compromises between d_min and d_max. Computationally, d_mean is the simplest of all of these measures, since the others require computing all n_i n_j pairs of distances \|x - x'\|. However, a measure such as d_avg can be used when the distances \|x - x'\| are replaced by similarity measures, where the similarity between mean vectors may be difficult or impossible to define.

10.9.3 Stepwise-Optimal Hierarchical Clustering

We observed earlier that if clusters are grown by merging the nearest pair of clusters, then the results have a minimum variance flavor. However, when the measure ...

... amount proportional to net_j, as shown by the red arrows in Fig. 10.14. It is this competition between cluster units, and the resulting suppression of activity in all but the one with the largest net, that gives the algorithm its name. Learning is confined to the weights at the most active unit.
The weight vector at this unit is updated to be more like the pattern:

w(t + 1) = w(t) + \eta x,    (87)

where \eta is a learning rate. The weights are then normalized to ensure \sum_{i=0}^{d} w_i^2 = 1. This normalization is needed to keep the classification and clustering based on the position in feature space rather than the overall magnitude of w. Without such weight normalization, a single weight, say w_j, could grow in magnitude and forever give the greatest value net_j, and through competition thereby prevent other clusters from learning. Figure 10.15 shows the trajectories of three cluster centers in response to a sequence of patterns chosen randomly from the set shown.

Algorithm 6 (Competitive learning)

1  begin initialize \eta, n, c, w_1, w_2, ..., w_c
2    x_i <- {1, x_i}, i = 1, ..., n    (augment all patterns)
3    x_i <- x_i / \|x_i\|, i = 1, ..., n    (normalize all patterns)
4    do randomly select a pattern x
5      j <- arg max_{j'} w_{j'}^t x    (classify x)
6      w_j <- w_j + \eta x    (weight update)
7      w_j <- w_j / \|w_j\|    (weight normalization)
8    until no significant change in w in n attempts
9    return w_1, w_2, ..., w_c
10 end

Figure 10.15: All of the three-dimensional patterns have been normalized (\sum_{i=1}^{3} x_i^2 = 1), and hence lie on a two-dimensional sphere. Likewise, the weights of the three cluster centers have been normalized. The red curves show the trajectory of the weight vectors; at the end of learning, each lies near the center of a cluster.

A drawback of Algorithm 6 is that there is no guarantee that it will terminate, even for a finite, non-pathological data set -- the condition in line 8 may never be satisfied and thus the weights may vary forever. A simple heuristic is to decay the learning rate in line 6, for instance by \eta(t) = \eta(0)\alpha^t for \alpha < 1, where t is an iteration number.
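A minimal Python sketch of Algorithm 6 with this decaying learning rate follows (initializing the c weight vectors from the first c patterns, and running a fixed number of sweeps, are illustrative assumptions of ours):

```python
import numpy as np

def competitive_learning(X, c, eta0=0.5, alpha=0.99, sweeps=100, seed=0):
    """Sketch of Algorithm 6 with eta(t) = eta(0) * alpha**t decay."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    # augment each pattern with a leading 1, then normalize (lines 2-3)
    Xa = np.hstack([np.ones((len(X), 1)), X])
    Xa /= np.linalg.norm(Xa, axis=1, keepdims=True)
    w = Xa[:c].copy()                     # initial cluster-center weights
    t = 0
    for _ in range(sweeps):
        for i in rng.permutation(len(Xa)):
            x = Xa[i]
            j = int(np.argmax(w @ x))     # most active (winning) unit
            w[j] += eta0 * alpha**t * x   # move the winner toward the pattern
            w[j] /= np.linalg.norm(w[j])  # renormalize (line 7)
            t += 1
    return w

def classify(w, x):
    # label a new pattern by its most active unit
    xa = np.concatenate([[1.0], np.asarray(x, float)])
    return int(np.argmax(w @ (xa / np.linalg.norm(xa))))
```

Because each update is followed by renormalization, the weight vectors stay on the unit sphere, exactly as in the trajectories of Fig. 10.15.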
If the initial cluster centers are representative of the full data set, and the rate of decay is set so that the full data set is presented at least several times before the learning rate is reduced to very small values, then good results can be expected. However, if a novel pattern is then added, it cannot be learned, since \eta is too small. Likewise, such a learning decay scheme is inappropriate if we seek to track gradual changes in the data. In a non-stationary environment, we may want a clustering algorithm to be stable to prevent ceaseless recoding, and yet plastic, or changeable, in response to a new pattern. (Freezing cluster centers would prevent recoding, but would not permit learning of new patterns.) This tradeoff has been called the stability-plasticity dilemma, and we shall see in Sect. 10.11.2 how it can be addressed. First, however, we turn to the problem of an unknown number of clusters.

10.11.1 Unknown number of clusters

We have mentioned the problem of an unknown number of cluster centers. When the number is unknown, we can proceed in one of two general ways. In the first, we compare some cluster criterion as a function of the number of clusters. If there is a large gap in the criterion values, it suggests a "natural" number of clusters. A second approach is to state a threshold for the creation of a new cluster. This is useful in on-line cases. The drawback is that it depends more strongly on the order of data presentation. Whereas clustering algorithms such as k-means and hierarchical clustering typically have all data present before clustering begins (i.e., are off-line), there are occasionally situations in which clustering must be performed on-line as the data streams in, for instance when there is inadequate memory to store all the patterns themselves, or in a time-critical situation where the clusters need to be used even before the full data is present. Our graph theoretic met ...
Very quickly, a stable configuration of output and input units occurs, called a "resonance" (though this has nothing to do with the type of resonance in a driven oscillator). ART networks detect novelty by means of the orienting subsystem. The details need not concern us here, but in broad overview, the orienting subsystem has two inputs: the total number of active input features and the total number of features that are active in the input layer. (Note that these two numbers need not be the same, since the top-down feedback affects the activation of the input units, but not the number of active inputs themselves.) If an input pattern is "too different" from any current cluster center, then the orienting subsystem sends a reset wave signal that renders the active output unit quiet. This allows a new cluster center to be found, or if all have been explored, then a new cluster center is created. The criterion for "too different" is a single number, set by the user, called the vigilance \rho, with 0 \leq \rho \leq 1. Denoting the number of active input features as |I| and the number active in the input layer during a resonance as |R|, there will be a reset if

\frac{|R|}{|I|} < \rho,    (88)

where \rho is the user-set vigilance parameter. A low vigilance parameter means that there can be a poor "match" between the input and the learned cluster and the network will accept it. (Thus this ratio of numbers of features used by ART, while motivated by proportional considerations, is just one of an infinite number of possible closeness criteria related to \rho.) For the same data set, a low vigilance leads to a small number of large coarse clusters being formed, while a high vigilance leads to a large number of fine clusters (Fig. 10.19). We have presented the basic approach and issues with ART1, but these return (though in a more subtle way) in analog versions of ART in the literature.
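The vigilance test of Eq. 88 is simple enough to state directly (a sketch of our own; in a full ART1 network the two feature counts would come from the input and resonance layers):

```python
def art_reset(num_input_active, num_resonance_active, rho):
    """Eq. 88: the orienting subsystem sends a reset wave when the match
    ratio |R|/|I| falls below the vigilance parameter rho (0 <= rho <= 1)."""
    return num_resonance_active / num_input_active < rho
```

With |I| = 10 active input features and |R| = 5 active during resonance, for example, a vigilance of 0.6 forces a reset, while a vigilance of 0.4 accepts the match.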
Figure 10.19: The results of ART1 applied to a sequence of binary figures. a) ρ = xx. b) ρ = 0.xx.

10.12 *Graph Theoretic Methods

Where the mathematics of normal mixtures and minimum-variance partitions leads us to picture clusters as isolated clumps, the language and concepts of graph theory lead us to consider much more intricate structures. Unfortunately, there is no uniform way of posing clustering problems as problems in graph theory; thus, the effective use of these ideas is still largely an art, and the reader who wants to explore the possibilities should be prepared to be creative.

We begin our brief look into graph-theoretic methods by reconsidering the simple procedures that produce the graphs shown in Fig. 10.6. Here a threshold distance d0 was selected, and two points were placed in the same cluster if the distance between them was less than d0. This procedure can easily be generalized to apply to arbitrary similarity measures. Suppose that we pick a threshold value s0 and say that xi is similar to xj if s(xi, xj) > s0. This defines an n-by-n similarity matrix S = [sij] with binary components

sij = 1 if s(xi, xj) > s0, and sij = 0 otherwise.    (89)

Furthermore, this matrix induces a similarity graph, dual to S, in which nodes correspond to points and an edge joins node i and node j if and only if sij = 1. The clusterings produced by the single-linkage algorithm and by a modified version of the complete-linkage algorithm are readily described in terms of this graph. With the single-linkage algorithm, two samples x and x′ are in the same cluster if and only if there exists a chain x, x1, x2, ..., xk, x′ such that x is similar to x1, x1 is similar to x2, and so on for the whole chain. Thus, this clustering corresponds to the connected components of the similarity graph. With the complete-linkage algorithm, all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster.
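Equation 89 and the connected components of the induced similarity graph can be realized directly. This sketch (illustrative data and similarity function) recovers exactly the single-linkage clusters:

```python
import numpy as np
from itertools import combinations

def similarity_graph_clusters(X, s, s0):
    """Threshold the similarity measure s into the binary matrix S of Eq. 89,
    then return the connected components of the induced similarity graph,
    which are the clusters of the single-linkage algorithm."""
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for i, j in combinations(range(n), 2):
        S[i, j] = S[j, i] = int(s(X[i], X[j]) > s0)
    # connected components by depth-first search
    labels, visited, c = [-1] * n, set(), 0
    for start in range(n):
        if start in visited:
            continue
        stack = [start]
        while stack:
            i = stack.pop()
            if i in visited:
                continue
            visited.add(i)
            labels[i] = c
            stack.extend(j for j in range(n) if S[i, j] and j not in visited)
        c += 1
    return labels

# negative Euclidean distance as similarity: points within distance 1 are "similar"
pts = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.4, 5.0]])
labels = similarity_graph_clusters(pts, lambda a, b: -np.linalg.norm(a - b), -1.0)
# -> labels [0, 0, 1, 1]: two chains, hence two clusters
```

The maximal complete subgraphs required by the (modified) complete-linkage algorithm would instead be cliques of this same graph, a computationally much harder object to enumerate.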
If we drop this second requirement, then this clustering corresponds to the maximal complete subgraphs of the similarity graph -- the largest subgraphs in which every pair of nodes is joined by an edge.

10.13. COMPONENT ANALYSIS

Here a denotes the basis vectors of A, and thus p̂(y; a) is an estimate of p(y). The difference between the source distribution and the estimate can be quantified by the Kullback-Leibler divergence:

D(p(y)‖p̂(y; a)) = ∫ p(y) log [p(y)/p̂(y; a)] dy = −H(y) − ∫ p(y) log p̂(y; a) dy.    (94)

The log-likelihood is

l(a) = (1/n) Σ_{i=1}^{n} log p̂(xi; a),    (95)

and using the law of large numbers, the log-likelihood can be written in terms of the Kullback-Leibler divergence as

l(a) ≈ ∫ p(y) log p̂(y; a) dy = ∫ p(y) log p(y) dy − D(p(y)‖p̂(y; a)) = −H(y) − D(p(y)‖p̂(y; a)),    (96)

where the entropy H(y) is independent of W. Thus we maximize the log-likelihood by minimizing the Kullback-Leibler divergence with respect to the estimated density p̂(y; a):

∂l(a)/∂W = −∂D(p(y)‖p̂(y; a))/∂W.    (97)

Because A is an invertible matrix, and because the Kullback-Leibler divergence is invariant under invertible transformations (Problem 47), we have

∂l(a)/∂W = −∂D(p(x)‖p̂(z))/∂W.    (98)

The entropy gradient is

∂H(z)/∂W = [W^t]^{−1} − φ(z) x^t,    (99)

where φ(z) is the score function, the gradient vector of the log-likelihood:

φ(z) = −(1/p(z)) ∂p(z)/∂z = −( p′(z1)/p(z1), ..., p′(zq)/p(zq) )^t.    (100)

Thus the learning rule is

ΔW ∝ ∂H(z)/∂W = [W^t]^{−1} − φ(z) x^t.    (101)

A simpler form comes if we merely scale by W^t W, following the natural gradient:

ΔW = η [I − φ(z) z^t] W.    (102)

This, then, is the learning algorithm. An assumption is that at most one of the sources is Gaussian distributed (Problem 46). Indeed this method is most successful if the distributions are highly skewed or otherwise deviate markedly from Gaussian.

We can understand the difference between PCA and ICA in the following way. Imagine that there were two correlated sources, producing large correlated signals in a particular direction.
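A minimal sketch of the natural-gradient update of Eq. 102. The true score function depends on the unknown source densities; here tanh is substituted as a common surrogate for super-Gaussian (e.g., Laplacian) sources. That choice, and the toy mixing matrix and data, are assumptions beyond the text:

```python
import numpy as np

def ica_step(W, X, eta=0.01):
    """One natural-gradient update W <- W + eta * [I - phi(z) z^t] W (Eq. 102),
    averaged over the sample, with the surrogate score phi(z) = tanh(z)."""
    Z = W @ X                      # recovered signals, one column per sample
    n = X.shape[1]
    phi = np.tanh(Z)               # score-function estimate (assumed form)
    grad = np.eye(W.shape[0]) - (phi @ Z.T) / n
    return W + eta * grad @ W

# toy blind source separation: two Laplacian sources, fixed mixing matrix
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 2000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = A @ S
W = np.eye(2)
for _ in range(500):
    W = ica_step(W, X)
```

If learning succeeds, W @ A approaches a scaled permutation matrix, i.e. the recovered signals match the sources up to order and scale, which is the best any blind method can do.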
PCA would find that direction, and indeed would reduce the sum-squared error. Such components are not independent, however, and would not be useful for separating the sources; as such, they would not be found by ICA. Instead, ICA would find those directions that are best for separating the sources -- even if those directions correspond to small eigenvalues. Generally speaking, when used as preprocessing for classification, independent component analysis has several characteristics that make it more desirable than linear or nonlinear principal component analysis. As we saw in Fig. 10.23, principal components need not be effective in separating classes. Recall that the sensed input consists of a signal (due to the true categories) plus noise. If the noise is much larger than the signal, principal components will depend more upon the noise than on the signal. Since the different categories are, we assume, independent, independent component analysis is likely to extract those features that are useful in distinguishing the classes.

10.14 Low-Dimensional Representations and Multidimensional Scaling (MDS)

Part of the problem of deciding whether or not a given clustering means anything stems from our inability to visualize the structure of multidimensional data. This problem is further aggravated when similarity or dissimilarity measures are used that lack the familiar properties of distance. One way to attack this problem is to try to represent the data points as points in some lower-dimensional space in such a way that the distances between points in that space correspond to the dissimilarities between points in the original space. If acceptably accurate representations can be found in two or perhaps three dimensions, this can be an extremely valuable way to gain insight into the structure of the data.
The general process of finding a configuration of points whose interpoint distances correspond to similarities or dissimilarities is often called multidimensional scaling. Let us begin with the simpler case where it is meaningful to talk about the distances between the n samples x1, ..., xn. Let yi be the lower-dimensional image of xi, δij be the distance between xi and xj, and dij be the distance between yi and yj (Fig. 10.25). Then we are looking for a configuration of image points y1, ..., yn for which the n(n − 1)/2 distances dij between image points are as close as possible to the corresponding original distances δij. Since it will usually not be possible to find a configuration for which dij = δij for all i and j, we need some criterion for deciding whether or not one configuration is better than another. The following sum-of-squared-error functions are all reasonable candidates:

Jee = Σ_{i<j} (dij − δij)² / Σ_{i<j} δij²    (103)

Jff = Σ_{i<j} [ (dij − δij) / δij ]²    (104)

Jef = (1 / Σ_{i<j} δij) Σ_{i<j} (dij − δij)² / δij.    (105)

Since these criterion functions involve only the distances between points, they are invariant to rigid-body motions of the configurations.

Figure 10.25: The distances between points in the original space are δij, while in the projected space they are dij. In practice, the source space is typically of very high dimension, and the mapped space of just two or three dimensions, to aid visualization. (In order to illustrate the correspondence between points in the two spaces, the size and color of each point xi matches that of its image yi.)

Moreover, the criteria have all been normalized so that their minimum values are invariant to dilations of the sample points. While Jee emphasizes the largest errors (regardless of whether the distances δij are large or small), Jff emphasizes the largest fractional errors (regardless of whether the errors |dij − δij| are large or small).
A useful compromise is Jef, which emphasizes the largest product of error and fractional error. Once a criterion function has been selected, an optimal configuration y1, ..., yn is defined as one that minimizes that criterion function. An optimal configuration can be sought by a standard gradient-descent procedure, starting with some initial configuration and changing the yi's in the direction of the greatest rate of decrease in the criterion function. Since dij = ‖yi − yj‖, the gradient of dij with respect to yi is merely a unit vector in the direction of yi − yj. Thus, the gradients of the criterion functions are easy to compute:

∇_{yk} Jee = (2 / Σ_{i<j} δij²) Σ_{j≠k} (d_kj − δ_kj) (yk − yj)/d_kj

∇_{yk} Jff = 2 Σ_{j≠k} [ (d_kj − δ_kj)/δ_kj² ] (yk − yj)/d_kj

∇_{yk} Jef = (2 / Σ_{i<j} δij) Σ_{j≠k} [ (d_kj − δ_kj)/δ_kj ] (yk − yj)/d_kj.

The starting configuration can be chosen randomly, or in any convenient way that spreads the image points about. If the image points lie in a d̂-dimensional space, then a simple and effective starting configuration can be found by selecting those d̂ coordinates of the samples that have the largest variance.

The following example illustrates the kind of results that can be obtained by these techniques. The data consist of thirty points spaced at unit intervals along a spiral in three dimensions:

x1(k) = cos(k/√2),  x2(k) = sin(k/√2),  x3(k) = k/√2,  for k = 0, 1, ..., 29.

Figure 10.26 shows the three-dimensional data. When the Jef criterion was used, twenty iterations of a gradient-descent procedure produced the two-dimensional configuration shown at the right. Of course, translations, rotations, and reflections of this configuration would be equally good solutions.

Figure 10.26: Thirty points of the form (cos(k/√2), sin(k/√2), k/√2)^t for k = 0, 1, ..., 29 are shown at the left. Multidimensional scaling using the Jef criterion (Eq. 105) and a two-dimensional target space leads to the image points shown at the right.
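The gradient-descent procedure for the Jef criterion can be sketched directly from Eq. 105 and its gradient. The fixed step size and iteration count here are illustrative choices (the text's example used only twenty iterations, presumably with a more careful step-size rule):

```python
import numpy as np

def jef(Y, delta, norm):
    """The J_ef criterion (Eq. 105) for image points Y."""
    d = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)
    iu = np.triu_indices(len(Y), 1)
    return ((d[iu] - delta[iu]) ** 2 / delta[iu]).sum() / norm

def mds_jef(X, dim=2, n_iter=1000, eta=1.0):
    """Gradient descent on J_ef, started from the `dim` highest-variance
    coordinates of the samples, as suggested in the text."""
    n = len(X)
    delta = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # original distances
    norm = delta[np.triu_indices(n, 1)].sum()
    Y = X[:, np.argsort(X.var(axis=0))[::-1][:dim]].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)
        np.fill_diagonal(d, 1.0)                      # dummy; diagonal is unused
        coef = (d - delta) / ((delta + np.eye(n)) * d)
        np.fill_diagonal(coef, 0.0)
        grad = (2.0 / norm) * (coef[:, :, None] * (Y[:, None] - Y[None, :])).sum(axis=1)
        Y = Y - eta * grad
    return Y, delta, norm

# the spiral example: (cos(k/sqrt 2), sin(k/sqrt 2), k/sqrt 2), k = 0, ..., 29
k = np.arange(30)
X = np.stack([np.cos(k / np.sqrt(2)), np.sin(k / np.sqrt(2)), k / np.sqrt(2)], axis=1)
Y, delta, norm = mds_jef(X)
```

The criterion value after descent should be lower than at the starting configuration; plotting the two columns of `Y` reproduces the kind of unrolled-spiral picture of Fig. 10.26.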
This lower-dimensional representation shows clearly the fundamental sequential nature of the points in the original, source space.

In non-metric multidimensional scaling problems, the quantities δij are dissimilarities whose numerical values are not as important as their rank order. An ideal configuration would be one for which the rank order of the distances dij is the same as the rank order of the dissimilarities δij.

Figure 10.27: A self-organizing map from the (two-dimensional) disk source space to the (one-dimensional) line of the target space can be learned as follows. For each point in the target line, there exists a corresponding point in the source space that, if sensed, would lead to that target point being most active. For clarity, then, we can link these points in the source; it is as if the image line is placed in the source space. At the state shown, the particular sensed point leads to y* being most active. The learning rule (Eq. 109) makes its source point move toward the sensed point, as shown by the small arrow. Because of the window function Λ(|y − y*|), points adjacent to y* are also moved toward the sensed point, though not as much. If such learning is repeated many times as the arm randomly senses the whole source space, a topologically correct map is learned.

When a pattern x is presented, each node in the target space computes its net activation, net_k = Σ_i w_ki x_i. One of the units is most activated; call it y*. The weights to this unit and those in its immediate neighborhood are updated according to

w_ki(t + 1) = w_ki(t) + η(t) Λ(|y − y*|) x_i,    (109)

where η(t) is a learning rate that depends upon the iteration number t. Next, every weight vector is normalized such that ‖w‖ = 1. (Naturally, only those weight vectors that have been altered during the learning trial need be re-normalized.) The function Λ(|y − y*|) is called the "window function," and has value 1.0 for y = y* and smaller values for large |y − y*|.
The window function is vital to the success of the algorithm: it insures that neighboring points in the target space have similar weights, and thus correspond to neighboring points in the source space, thereby insuring topological neighborhoods (Fig. 10.28). The learning rate η(t) decreases slowly as a function of iteration number (i.e., as patterns are presented) to insure that learning will ultimately stop.

Equation 109 has a particularly straightforward interpretation. For each pattern presentation, the "winning" unit in the target space (y*) is adjusted so that its weights are more like the particular pattern. Others in the neighborhood of y* are also adjusted so that their weights more nearly match those of the input pattern (though not quite as much as for y*, according to the window function). In this way, neighboring points in the input space lead to neighboring points being active.

Figure 10.28: Typical window functions for self-organizing maps for target spaces in one dimension (left) and two dimensions (right). In each case, the weights at the maximally active unit, y*, in the target space get the largest weight update, while more distant units get smaller updates.

After a large number of pattern presentations, learning according to Eq. 109 insures that neighboring points in the source space lead to neighboring points in the target space. Informally speaking, it is as if the target-space line has been placed on the source space, and learning pulls and stretches the line to fill the source space. Figure 10.29 shows the development of the map: after 150,000 training presentations, a topological map has been learned.

Figure 10.29: If a large number of pattern presentations are made using the setup of Fig. 10.27, a topologically ordered map develops; the panels show the map after 0, 20, 100, 1000, 10000, 25000, 50000, 75000, 100000, and 150000 pattern presentations.
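A minimal sketch of this learning for a line of units mapping a one-dimensional source interval. Several choices here are illustrative assumptions: the Gaussian window, the particular decay schedules for η(t) and the window width, and the use of the standard Kohonen form w ← w + η Λ (x − w) in place of the text's add-then-normalize variant of Eq. 109. The winner is selected by minimum Euclidean distance, which coincides with maximum net activation when weight vectors are kept normalized:

```python
import numpy as np

def train_som_line(data, n_units=10, n_iter=5000, seed=0):
    """Self-organizing map onto a 1-D line of units."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.4, 0.6, size=(n_units, data.shape[1]))  # small random start
    idx = np.arange(n_units)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        winner = np.argmin(np.linalg.norm(W - x, axis=1))     # most active unit y*
        frac = 1.0 - t / n_iter
        eta = 0.02 + 0.5 * frac                               # decaying learning rate
        width = max(0.5, (n_units / 2.0) * frac)              # shrinking window
        Lam = np.exp(-((idx - winner) ** 2) / (2.0 * width ** 2))  # window function
        W += eta * Lam[:, None] * (x - W)
    return W

data = np.random.default_rng(1).uniform(0.0, 1.0, size=(2000, 1))
W = train_som_line(data)
# after training, the unit weights vary (nearly) monotonically along the line,
# i.e. the map is topologically ordered
```

The initially wide, slowly shrinking window is what produces the pulling and stretching described above: early updates drag whole neighborhoods into rough order, and late, narrow updates refine individual units.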
The learning of such self-organizing maps is very general, and can be applied to virtually any source space, target space, and continuous nonlinear mapping. Figure 10.30 shows the development of a self-organizing map from a square source space to a square (grid) target space. There are generally inherent ambiguities in the maps learned by this algorithm. For instance, a mapping from a square to a square could have eight possible orientations, corresponding to the four rotation and two flip symmetries. Such ambiguity is generally irrelevant for such uses.

Let us consider a simple modification of hierarchical clustering to reduce dimensionality. In place of an n-by-n matrix of distances between samples, we consider a d-by-d correlation matrix R = [ρij], where the correlation coefficient ρij is related to the covariances (or sample covariances) by

ρij = σij / √(σii σjj).    (110)

Since 0 ≤ ρij² ≤ 1, with ρij² = 0 for uncorrelated features and ρij² = 1 for completely correlated features, ρij² plays the role of a similarity function for features. Two features for which ρij² is large are clearly good candidates to be merged into one feature, thereby reducing the dimensionality by one. Repetition of this process leads to the following hierarchical procedure:

Algorithm 8 (Hierarchical dimensionality reduction)

1 begin initialize d′, Di ← {xi}, i = 1, ..., d
2   d̂ ← d + 1
3   do d̂ ← d̂ − 1
4     compute R by Eq. 110
5     find the most correlated distinct clusters, say Di and Dj
6     Di ← Di ∪ Dj (merge)
7     delete Dj
8   until d̂ = d′
9   return d̂ clusters
10 end

Probably the simplest way to merge two groups of features is just to average them. (This tacitly assumes that the features have been scaled so that their numerical ranges are comparable.) With this definition of a new feature, there is no problem in defining the correlation matrix for groups of features.
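Algorithm 8 can be sketched directly, merging by averaging as suggested. The data and names are illustrative, and the features are assumed to be comparably scaled:

```python
import numpy as np

def hierarchical_dim_reduction(X, d_target):
    """Repeatedly merge (by averaging) the pair of feature clusters with the
    largest squared correlation rho_ij^2 (Eq. 110) until d_target remain."""
    features = [X[:, i].astype(float) for i in range(X.shape[1])]
    groups = [[i] for i in range(X.shape[1])]
    while len(features) > d_target:
        F = np.stack(features, axis=1)
        R = np.corrcoef(F, rowvar=False) ** 2     # squared correlation matrix
        np.fill_diagonal(R, -1.0)                 # exclude self-correlation
        i, j = np.unravel_index(np.argmax(R), R.shape)
        i, j = min(i, j), max(i, j)
        features[i] = (features[i] + features[j]) / 2.0   # merge by averaging
        groups[i] = groups[i] + groups[j]
        del features[j], groups[j]
    return np.stack(features, axis=1), groups

# three features, two of which are nearly copies of each other
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.stack([a, a + 0.01 * rng.normal(size=100), rng.normal(size=100)], axis=1)
Xr, groups = hierarchical_dim_reduction(X, 2)
# -> groups [[0, 1], [2]]: the two near-duplicate features are merged
```

Recording `groups` at every level would yield the full dendrogram over features, exactly analogous to the dendrogram over samples in ordinary agglomerative clustering.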
It is not hard to think of variations on this general theme, but we shall not pursue this topic further. For the purposes of pattern classification, the most serious criticism of all of the approaches to dimensionality reduction we have mentioned is that they are overly concerned with faithful representation of the data. Greatest emphasis is usually placed on those features or groups of features that have the greatest variability. But for classification, we are interested in discrimination -- not representation. While it is a truism that the ideal representation is the one that makes classification easy, it is not always so clear that clustering without explicitly incorporating classification criteria will find such a representation. Roughly speaking, the most interesting features are the ones for which the difference in the class means is large relative to the standard deviations, not the ones for which the standard deviations are merely large. In short, we are interested in something more like the method of multiple discriminant analysis described in Sect. ??.

There is a large body of theory on methods of dimensionality reduction for pattern classification. Some of these methods seek to form new features out of linear combinations of old ones. Others seek merely a smaller subset of the original features. A major problem confronting this theory is that the division of pattern recognition into feature extraction followed by classification is theoretically artificial: a completely optimal feature extractor can never be anything but an optimal classifier. It is only when constraints are placed on the classifier or limitations are placed on the size of the set of samples that one can formulate nontrivial (or very complicated) problems. Various ways of circumventing this problem that may be useful under the proper circumstances can be found in the literature.
When it is possible to exploit knowledge of the problem domain to obtain more informative features, that is usually the most profitable course of action.

Summary

Unsupervised learning and clustering seek to extract information from unlabeled samples. If the underlying distribution comes from a mixture of component densities described by a set of unknown parameters θ, then θ can be estimated by Bayesian or maximum-likelihood methods. A more general approach is to define some measure of similarity between two clusters, as well as a global criterion such as a sum-of-squared-error or trace of a scatter matrix. Since there are only occasionally analytic methods for computing the clustering that optimizes the criterion, a number of greedy (locally step-wise optimal) iterative algorithms, such as k-means and fuzzy k-means clustering, can be used.

If we seek to reveal structure in the data at many levels -- i.e., clusters with subclusters and sub-subclusters -- then hierarchical methods are needed. Agglomerative or bottom-up methods start with each sample as a singleton cluster and iteratively merge the clusters that are "most similar" according to some chosen similarity or distance measure. Conversely, divisive or top-down methods start with a single cluster representing the full data set and iteratively split it into smaller clusters, each time seeking the subclusters that are most dissimilar. The resulting hierarchical structure is revealed in a dendrogram. A large disparity in the similarity measure for successive cluster levels in a dendrogram usually indicates the "natural" number of clusters. Alternatively, the problem of cluster validity -- knowing the proper number of clusters -- can also be addressed by hypothesis testing. In that case the null hypothesis is that there are some number c of clusters; we then determine whether the reduction of the cluster criterion due to an additional cluster is statistically significant.
Competitive learning is an on-line neural-network clustering algorithm in which the cluster center most similar to an input pattern is modified to become more like that pattern. In order to guarantee that learning stops for an arbitrary data set, the learning rate must decay. Competitive learning can be modified to allow for the creation of new cluster centers if no center is sufficiently similar to a particular input pattern, as in leader-follower clustering and Adaptive Resonance. While these methods have many advantages, such as computational ease and the ability to track gradual variations in the data, they rarely optimize an easily specified global criterion such as sum-of-squared error.

Graph-theoretic methods in clustering treat the data as points to be linked according to a number of heuristics and distance measures. The clusters produced by these methods can exhibit chaining or other intricate structures, and rarely optimize an easily specified global cost function. Graph methods are, moreover, generally more sensitive to details of the data.

Component analysis seeks to find directions or axes in feature space that provide an improved, lower-dimensional representation of the full data space. In (linear) principal component analysis, such directions are merely the eigenvectors of the covariance matrix of the full data having the largest eigenvalues; this optimizes a sum-squared-error criterion. Nonlinear principal components, for instance as learned in an internal layer of an autoencoder neural network, yield curved surfaces embedded in the full d-dimensional feature space, onto which an arbitrary pattern x is projected. The goal in independent component analysis -- which uses gradient descent on an entropy criterion -- is to determine the directions in feature space that are statistically most independent. Such directions may reveal the true sources (assumed independent) and can be used for segmentation and blind source separation.
Two general methods for dimensionality reduction are self-organizing feature maps and multidimensional scaling. Self-organizing feature maps can be highly nonlinear, and represent points close in the source space by points close in the lower-dimensional target space. Because they preserve neighborhoods in this way, such maps are also called "topologically correct." The source and target spaces can be of very general shapes, and the mapping will depend upon the distribution of samples within the source space. Multidimensional scaling similarly learns a nonlinear mapping that seeks to preserve neighborhoods, and is often used for data visualization. Because the basic method requires all the inter-point distances for minimizing a global criterion function, its space complexity limits the usefulness of multidimensional scaling to problems of moderate size.

Bibliographical and Historical Remarks

Historically, t...