BME6360 Sp11 III P64-72

III.64

A BRIEF INTRODUCTION TO NEURAL NETWORKS

This introduction to neural networks takes a different approach than that chosen in the overwhelming fraction of the large number of books and articles written on the subject. In those articles a token amount is written to suggest that neural networks do something which is really similar to what the brain does. Whereas one often gets the impression that some authors believe that this similarity means that their algorithms are automatically infused with the intelligence of a person, a more accurate description is that the algorithms are "neurally inspired", meaning that the originators were thinking about the nervous system when they were trying to deal with difficult problems. When that is the case, neural networks are an example of one direction of the flow of ideas in cybernetics, or, roughly, what happens whenever biologists and engineers trade ideas. This is the side where engineers look for ways to solve problems and talk to biologists among others, taking a variety of ideas but always putting the solution of engineering problems as paramount. The other direction in which information flows is that biologists take engineering concepts and apply them to biological problems. Most of what is done is at the level of the conceptual organizing principle. For instance, feedback is a decidedly engineering concept -- it was formalized in order to explain the behavior of amplifiers in the 1920's -- which has been adopted by biologists, social scientists, and many others in order to explain behavior and even as a descriptor for good business practices. The concept of the phase-locked loop -- again, an engineering idea -- is currently in vogue, albeit with another name, in explanations of how the brain works.

The approach to neural networks which is taken here is to begin by exploring them, and in particular the simple backpropagation network, as a means of computing the pattern recognition problem above. We know of good statistical techniques for this problem, and we know their limitations. It will be seen that the neural net often solves the problems associated with peculiar statistical distributions without us having to know what they are beforehand. In essence, the neural net can substitute for a lot of graduate training in statistics by making you capable of solving difficult problems without knowing very much.

How does this relate to biology and biological modeling? In the same way in which all mathematical modeling relates to biology: we used well-understood control systems models to explore our understanding of the pupillary control system. Now we should use well-understood neural net models -- which are, after all, just pattern recognizers of a new sort -- to model computation in the brain. We'll present a model application of a region of the brain where associations are made which appear consistent with a neural net model. The modeling here will be no better and no worse than modeling in any other biological problem with any other mathematical technique: it is the quality of the insights gained which determines the effectiveness of the modeling work.

Perceptron Learning

Probably the first implemented neural net examples were those of Frank Rosenblatt and coworkers at Cornell in the 1950's and 1960's. They built physical electrical and electronic (tubes!) implementations. The first was called the Perceptron, from which the label Perceptron Learning is derived.
The device took in lots of data vectors x which were the light intensities on a 5 by 7 grid. The outputs were coded values for letters of the alphabet. The algorithm works by taking the weighted sum of the inputs in a manner identical to taking a dot product, but also suggestive of how a neuron sums its many inputs, some of which are close to the soma, some of which are far away. Initially the weights are chosen randomly, but the algorithm's job is to change the weights with a rule so that they come to generate the outputs which are desired -- "learning". To do so you have to go through a training phase in which you compare the outputs to the desired outputs and make corrections to the weights so that the next time the results are better. Also, there are large numbers of sets of input data and desired outputs, not all of which are consistent. Thus it is quite likely that it is not possible for the algorithm to be perfect.

III.65

Here is an algorithm which maintains the essential flavor of Rosenblatt's Perceptron:

Given: lots of data vectors x(k), each associated with either class A (+1) or class B (-1).

Initial conditions: random wi's such that ||w|| = 1. (Think of this as a unit vector pointing somewhere, which continually asks the question "How much does the data look like me?")

Iterate for each vector k, then repeat the whole data set many (1000's of) times (= "multiple presentations"):

- Compute y = w^T x: "how much does x look like w?"
- If y > 0 and the vector is in class A -- the output is high and it is supposed to be high -- increment each weight element, wi = wi + xi: "reward w" by making it look more like x.
- If y < 0 and the vector is in class B -- the output is low and it is supposed to be low -- decrement each weight element, wi = wi - xi: "reward w" by making it look more like -x.
- Renormalize the weights so ||w|| = 1: keep w as a unit vector.

Result: As best as possible, the Perceptron will learn to use the projection w^T x to distinguish A from B. The Perceptron has to find the optimal angle for the w axis. As you might have guessed, the solutions to this problem are well known in the linear statistics literature. In fact there are much better classifiers than the Perceptron classifier, as was noted above. You should compare this diagram with the cluster diagrams for the action potentials described earlier in the course.

[Figure: the 2-D case -- classes A and B in the (x1, x2) plane, separated by the projection onto w.]

A variety of weight adjustment rules will work. Since that time we have become fairly sophisticated in proposing new kinds of "rewards" and "penalties" to these functions to make them work faster. Don't extrapolate from the anthropomorphic descriptions the idea that there is something especially biological in them. The same words are used in "penalty Lagrangians", a non-linear optimization technique popular before the recent craze for neural nets. Also, any iterative numerical technique uses penalties and rewards, although most do not use that terminology.
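Here is a minimal sketch of this Perceptron learning rule in Python with NumPy. It is an illustration, not Rosenblatt's implementation: the function name, the data layout, and the fixed number of passes are assumptions made for the example.

    import numpy as np

    def train_perceptron(X, labels, n_passes=1000, rng=None):
        """X: (n_samples, n_features) data vectors; labels: +1 for class A, -1 for class B."""
        rng = np.random.default_rng() if rng is None else rng
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)               # initial condition: a random unit vector
        for _ in range(n_passes):            # "multiple presentations" of the data set
            for x, label in zip(X, labels):
                y = w @ x                    # y = w^T x: "how much does x look like w?"
                if y > 0 and label == +1:    # high and supposed to be high:
                    w = w + x                #   reward w by making it look more like x
                elif y < 0 and label == -1:  # low and supposed to be low:
                    w = w - x                #   reward w by making it look more like -x
                w /= np.linalg.norm(w)       # renormalize: keep w a unit vector
        return w

A new vector x is then classified by the sign of the projection: class A if w @ x > 0, class B otherwise.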
III.66

Sigmoidal Output Functions

[Figure: three sigmoidal output functions plotted for -5 <= x <= 5, with outputs bounded between -1 (or 0) and 1:
    y = 1/(1 + e^(-x))        the logistic function
    y = tanh(x)               hyperbolic tangent
    y = min(1, max(-1, x))    linear limiter]

Above are three sigmoidal output functions (sometimes called "squashing functions") which are popularly used in neural networks. The essential feature is that the output is bounded for both large negative and positive inputs. There is nothing magic about whether negative inputs are or are not used, or whether the curves are centered at x=0 as shown or at x=0.5 as is often the case. These input/output value ranges are critical to a computational algorithm, of course, but not to the overall strategy in using the sigmoidal functions to limit outputs. Use of the sigmoidal function is usually described as:

    y = σ(Σi wi xi)

Why use a squashing function? There are many problems for which it is useful to have limited outputs which report the idea that "the input is big" with little discrimination among degrees of bigness, which probably originates from inputs beyond the range expected in the modeling problem. The squashing function is pervasive in nature: neurally, it corresponds to the idea that, no matter how much input a neuron receives, it cannot fire faster than some amount, perhaps several hundred APs per second, nor can it fire fewer than zero APs per second. In photography, the squashing function is extremely well characterized as it relates the density of silver grains to the input intensity of light to the film -- the film sensitivity relates to the input value at the midpoint of the curve, while contrast is essentially the slope of the curve at the midpoint.

The sigmoidal function does not affect the boundaries in the one layer Perceptron described above. This results from the fact that the sigmoidal function is monotonic: if the test is whether or not σ(Σ wi xi) is greater than a threshold, the equivalent test is whether or not Σ wi xi is greater than a possibly different threshold. Thus the decision boundary is still a straight line. This is not true, however, for two and three layer Perceptrons: curved boundaries can be established, as we'll see below. This is another reason to use the sigmoidal function.

Bias

The typical "neuron" in a neural network computes the following function, where σ is the sigmoidal or squashing function, and b is a bias or offset:

    y = σ(Σi wi xi + b)

The bias is useful in establishing the "operating point" on the sigmoidal curve. In the Perceptron model above, the lines being drawn are all constrained to go through the origin. If you wanted a decision boundary which did not go through the origin, you would have to add a bias constant to the formula (just as in the formula for a straight line, y = mx + b). The bias can be handled as just another input when the learning algorithm is applied, by thinking of b as the weight for an input which is always equal to one. This trick is extremely useful in writing vectorized programs for doing neural network analysis, as the sketch below illustrates.

    y = σ( Σ(i=1..N) wi xi + b ) = σ( Σ(i=1..N+1) wi xi ),  where x(N+1) = 1 and w(N+1) = b
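A brief sketch of the bias trick in NumPy -- again an illustration, with invented example numbers. Appending a constant 1 to each input vector lets a single dot product handle the weights and the bias together:

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))    # the logistic squashing function

    w = np.array([0.2, -0.5, 0.1])         # weights w1..wN (example values)
    b = 0.7                                # bias
    x = np.array([1.5, 0.3, -2.0])         # an input vector (example values)

    y_explicit = sigmoid(w @ x + b)        # y = σ(Σ wi xi + b)

    w_aug = np.append(w, b)                # w(N+1) = b
    x_aug = np.append(x, 1.0)              # x(N+1) = 1
    y_trick = sigmoid(w_aug @ x_aug)       # identical result

    assert np.isclose(y_explicit, y_trick)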
III.67

Generalized One Layer Perceptron

Perceptrons generalize in a variety of ways, including this model where you try to force the system to have one output which is large, while all the rest are small. The implications for pattern recognition should be obvious. Referring back to the EEG analysis with the AutoRegressive coefficients, you should note that the goal is to have the AR coefficients as input and to have an algorithm indicate which one of the EEG states is consistent with the EEG signal. This is precisely the application made by some neural network researchers. The schematic diagram of an n dimensional input and m class problem is made, along with a case where there are two inputs and three classes. The one layer Perceptron is illustrated schematically in the diagram to the left. If it were trained on the action potentials from the data illustrated earlier, it would generate decision boundaries for classes 1, 2, 3, and 4 as illustrated to the right. In each of the "quadrants" one of the outputs would be maximal, indicating that the input pattern best matched the given template.

[Figure: one layer Perceptron schematic (left) and decision boundaries for classes 1-4 drawn over the action potential cluster data (right). Hand drawn lines! No accuracy implied.]

III.68

Two Layer Perceptron with Back Propagation

Counting the Number of Layers: The schematic diagram of a 2 Layer Perceptron is given below. It includes a set of inputs, which many people refer to as an input layer, an output layer, and a middle layer called a "hidden layer". The input "layer" has no "neural" processing: it merely reports the values of the inputs to the next layer. Many people call this a 3 layer perceptron, although there is a movement toward a nomenclature that counts only the number of layers of "neurons". You will have to tolerate the confusion in terminology.

Decision Boundaries with Two-Layer Perceptrons: The two layer perceptron has proven to be very useful; it is much more useful than the one layer perceptron, which can only "draw straight lines" to make classifications. The two layer perceptron can use the hidden layer to "draw straight lines", then apply the sigmoidal function, and let the output layer combine the lines to perform AND, OR, and XOR functions with boundaries which are not quite linear. The figure to the right is an example of how the network can divide up the feature space for purposes of classification. The non-linearity and the second layer permit the creation of convex decision regions which may match the underlying probability distribution much more exactly. If a third layer is added, you can have entirely arbitrarily shaped decision regions, which could be of an advantage when you have very strange distributions. You should refer to the article by Lippmann for a discussion of how the backpropagation network divides up the feature space.

The downside to this ability to create unusual decision boundaries is that it is necessary to get lots of examples in order to create them. For a high dimensional data set, this could be thousands or millions of examples. For many problems, it is impossible to get this many examples; researchers will often expand their set of examples by adding Gaussian distributed errors to the measurements. This works for purposes of training, but has the result that the distributions are simply Gaussian and the decision boundaries can be determined faster using the covariance matrix as described above.

[Figure: decision regions formed by a two layer perceptron over the action potential feature space. Hand drawn lines. No accuracy assumed.]
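To see how hidden units that "draw straight lines" can be combined into an XOR, here is a sketch with hand-picked weights. A trained network would find its own weights; the steepness factor of 20 is an arbitrary choice that makes the sigmoids nearly hard limiters.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def two_layer(x):
        # Hidden layer: each unit draws the straight line x1 + x2 = c, then squashes.
        h1 = sigmoid(20.0 * (x[0] + x[1] - 0.5))   # ~1 when x1 + x2 > 0.5 (an OR line)
        h2 = sigmoid(20.0 * (x[0] + x[1] - 1.5))   # ~1 when x1 + x2 > 1.5 (an AND line)
        # Output layer: on between the two lines, off outside them -- XOR.
        return sigmoid(20.0 * (h1 - h2 - 0.5))

    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, round(float(two_layer(np.array(x, dtype=float)))))   # prints 0, 1, 1, 0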
III.69

Backpropagation Algorithm: Forward propagation refers to the normal computation, in which the inputs are used to compute the outputs of the hidden layer, whose outputs are used to compute the outputs of the next hidden layer, until the outputs of the output layer are computed. Backward propagation is the use of the outputs of the output layer and the desired outputs to make corrections to the weights of the output layer, after which one uses those weights and the error signals to compute the equivalent error at the hidden layer, which is used to correct the weights of that layer, until the weights of the first layer are changed. The algorithm alternately computes the forward prediction and then backprojects the error to correct the weights. This is repeated thousands of times for the entire data set until the output error is acceptable.

Forward Formulae:

    y = σ( Wxy^T x + θ )
    z = σ( Wyz^T y + φ )

The Wxy's and Wyz's are weights; σ is the sigmoidal function; the θ's and φ's are bias offsets. The x's are the inputs; the y's are the hidden layer outputs; the z's are the output layer outputs.

Procedure:

- Initialize the weights randomly.
- For each input vector, repeat the following many times until the results are satisfactory:
- Compute z for a given x.
- Compute e = z - d: compare z to d, the desired output.
- Correct the Wyz weights. Here is the formula for the jk-th weight between the y and z layers:

    Δw(yj,zk) = η δk yj ;  δk = zk (1 - zk)(dk - zk)

  The error in the output is (dk - zk); η is a gain constant.
- Correct the Wxy weights:

    Δw(xi,yj) = η δj xi ;  δj = yj (1 - yj) Σk δk w(yj,zk)

  where Σk δk w(yj,zk) is the output error propagated back to the hidden layer. This is part of the notion of backpropagation. (If there were more layers, this process would continue until the input is reached.)
- Stop when either the sum of squared errors stops changing very much, or when the algorithm classifies everything correctly.

Equivalence to Partial Derivatives: The critical step of correcting each weight is identical to correcting in proportion to the partial derivative of the sum of squared errors with respect to the weight. In the case of the logistic function, this is exact; in the case of the others, it is approximate. Thus, effectively, back propagation is the same as:

    Let e^2 = ||d - z||^2.
    Compute the partial derivative of the error with respect to any particular weight w: ∂e^2/∂w.
    Update the value of the weight to w_previous - (constant) ∂e^2/∂w.

Result: The backpropagation network can learn to reduce the error between the desired and computed output. When averaged over all the sets of data, the network will usually find an "optimal" set of weights. These are "optimal" in the sense that, empirically, they are usually very good and most often equal or exceed the performance of the simpler classical techniques. However, it is extremely difficult to prove rigorously any assertion about their performance, in contrast to classical techniques for which such error bounds often exist. There are many times, however, when a classical technique will equal the performance of the neural net, and some times when it will be better. The neural net approach is almost always easier to program than a more sophisticated classical algorithm. The net effect is that this Backpropagation Neural Network functions as a Pattern Recognizer.
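Here is a minimal sketch of the whole procedure for one hidden layer, in Python with NumPy. The function name, layer sizes, gain η, and number of presentations are invented for illustration; the biases are folded in with the extra-input trick from above.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def train_backprop(X, D, n_hidden=4, eta=0.5, n_passes=5000, rng=None):
        """X: (n_samples, n_in) inputs; D: (n_samples, n_out) desired outputs in (0, 1)."""
        rng = np.random.default_rng() if rng is None else rng
        n_in, n_out = X.shape[1], D.shape[1]
        Wxy = rng.uniform(-0.5, 0.5, (n_hidden, n_in + 1))    # extra column holds the bias
        Wyz = rng.uniform(-0.5, 0.5, (n_out, n_hidden + 1))
        for _ in range(n_passes):
            for x, d in zip(X, D):
                # Forward propagation.
                y = sigmoid(Wxy @ np.append(x, 1.0))          # hidden layer outputs
                z = sigmoid(Wyz @ np.append(y, 1.0))          # output layer outputs
                # Backward propagation of the error.
                dk = z * (1 - z) * (d - z)                    # δk at the output layer
                dj = y * (1 - y) * (Wyz[:, :-1].T @ dk)       # δj: error backprojected
                Wyz += eta * np.outer(dk, np.append(y, 1.0))  # Δw = η δk yj
                Wxy += eta * np.outer(dj, np.append(x, 1.0))  # Δw = η δj xi
        return Wxy, Wyz

    # For instance, the XOR problem from the decision boundary discussion:
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    D = np.array([[0.01], [0.99], [0.99], [0.01]])   # targets kept off the sigmoid limits
    Wxy, Wyz = train_backprop(X, D)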
III.70

Unsupervised Learning in a Perceptron

The Perceptron and the Backpropagation Network are supervised networks. This means that there is a "teacher" which uses knowledge of the correct outputs to change the weight vectors so that, when the "test" comes, the network will provide answers which are correct as often as possible. For the One-Layer Perceptron, with or without the sigmoid function, this essentially means making the weight vector look more like the mean of the input class for which you want large outputs. For the multilayer perceptron it is more difficult to describe what happens.

What does it mean to have "unsupervised learning"? This is the case where either you don't know the correct answer or you don't use the information. What is it that the network is supposed to learn? The answer is that the network will come to "learn" the statistical distribution of the inputs. For instance, if all the inputs are the same except for a modest amount of noise, then the weight vector will come to look like the mean of the distribution. If there are two outputs and there are two modes in the distribution, each output's weight vector will come to look like one of the modes. This generalizes to many modes and many outputs. When there are either too few or too many outputs for the number of modes in the input distribution, the network reaches some compromise which is ill-defined. The algorithm below is essentially that of the Perceptron above.

Select maximum: "Winner takes all"

Given: lots of data vectors x(k) whose association with a class is unknown.

Initial conditions: random wi's such that ||wi|| = 1 for each output i. (Think of each as a unit vector pointing somewhere, which continually asks the question "How much does the data look like me?")

Iterate for each vector k, then repeat the whole data set many (1000's of) times (= "multiple presentations"):

- For each output i compute yi = wi^T x: "how much does x look like wi?"
- Find the largest output yj and increment its weights, wj = wj + x(k): "reward wj" by making it look more like x(k).
- Renormalize the weights so ||wj|| = 1: keep wj as a unit vector.

Result: As best as possible, one of the outputs of the Perceptron will learn the "direction" of the mode of the distribution of inputs because, if it happens to point in the "right" direction initially, the reward algorithm will make it look more and more like the mean. A second output, which initially points in another direction, closer to a second mode of the distribution, will be rewarded only when the inputs are close to that mode and will come to "learn" that mode. The result of a case where the inputs are two dimensional and there are three outputs is illustrated below.

[Figure: example with two inputs and three outputs, showing the region of the (x1, x2) plane where each output yi wins.]
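A minimal sketch of this winner-take-all rule in NumPy (the function name and defaults are invented for illustration):

    import numpy as np

    def winner_take_all(X, n_outputs=3, n_passes=1000, rng=None):
        """Unsupervised learning: X is (n_samples, n_features); class labels are unknown."""
        rng = np.random.default_rng() if rng is None else rng
        W = rng.standard_normal((n_outputs, X.shape[1]))
        W /= np.linalg.norm(W, axis=1, keepdims=True)    # each wi starts as a unit vector
        for _ in range(n_passes):                        # multiple presentations
            for x in X:
                y = W @ x                      # yi = wi^T x for every output i
                j = np.argmax(y)               # the largest output wins...
                W[j] += x                      # ...and is rewarded: wj looks more like x
                W[j] /= np.linalg.norm(W[j])   # renormalize: keep wj a unit vector
        return W    # row i points toward the mode of the inputs that output i "learned"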
III.71

A NEUROBIOLOGICAL MODELING EXAMPLE

Finally, we get to look at a case where a multilayer Backpropagation neural network is used as a modeling tool to understand biology. As you read the paper, keep in mind the observations above that Backpropagation is a tool for recognizing patterns. Further, the patterns of the weights of the hidden layer are not well understood statistically. Although they clearly play a critical role in the algorithm, there is no unique set of weights which solves the problem. Not discussed above is the other major use of the supervised network: it can be used for interpolation of mathematical functions whose analytic expression is either unknown or too complex, but for which example values are known.

The paper is: "A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons", D. Zipser and R. A. Andersen, Nature 1988, 331:679-684.

Zipser did the analysis of data which Andersen had recorded some time earlier. The recordings are from single units (neurons) in a monkey's posterior parietal cortex, in an area which is given the label "area 7a." This area is in the anatomical and physiological path from primary visual cortex to the control of eye position. Thus, this is part of the circuitry which impacts the ocular position response, including both the optokinetic system, which was studied earlier, and other systems, such as the smooth pursuit system. From past experimental work, this region is believed to use knowledge of head position and the position of an object on the retina in order to compute the object's position in head centered coordinates.

Andersen had found three classes of neurons:
a) retinotopic: 21% respond to visual stimuli only
b) eye position: 15% respond to changes in the position of the eye only
c) both: 57% respond to either changes in the eye position or visual stimuli

Andersen's hypothesis about the computation performed in area 7a is summarized by this diagram, from which a neural network can be created.

[Diagram: hypothesis -- retinal position and eye position feed the cells responding to both, which perform a neural net transformation to yield head centered position.]

Retinotopic Inputs: From experimental recordings, Andersen knows the functional dependence of the retinotopic cells under controlled experimental conditions. Essentially, the monkey is trained to focus on a specific point on a screen. Then points of light are shown at positions off-center while the recordings are made. The experimenter can then reconstruct a function of firing rate versus retinal position. In the actual experiments, 17 measurements were made -- one at the center, and four each on circles which were 10, 20, 30, and 40 degrees (visual angle) away from the center. The general tendency is for one neuron to fire maximally to a light stimulus at a particular location, and for its response to fall off as the light is moved away. Bimodal responses -- e.g. to two locations -- are rare. The neural network model has 64 neurons, each of which has a best location, equally spaced over the field of view. Each has the same experimentally derived response function centered around its best location. (See Figure X)

Eye Position Inputs: From experimental recordings, Andersen knows the functional dependence of these cells on changes in gaze location. The monkey is trained to focus on spots on a screen, and the absolute angles (up/down; left/right) of the eye with respect to the head are measured along with the firing rate of the neurons. It is found that there are four classes of cells which respond to changes in eye position. One class has a higher firing rate the further to the left the eye is positioned, but has a zero response to rightward, upward, or downward positions. The other three classes favor the other three directions.

III.72

The neural network model has 32 inputs to represent these data, 8 each to represent left, right, up, and down inputs.
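A hedged sketch of how such an input vector might be assembled. The real model uses Andersen's experimentally derived response functions; the Gaussian falloff, the gain values, and the scaling below are invented stand-ins for illustration only.

    import numpy as np

    def encode_inputs(stim_xy, gaze_xy):
        """Build a 96-element input vector: 64 retinotopic units + 32 eye-position units."""
        # 64 retinotopic units with best locations on an 8 x 8 grid over the field of view.
        grid = np.linspace(-40, 40, 8)                  # degrees of visual angle
        best = np.array([(u, v) for u in grid for v in grid])
        dist = np.linalg.norm(best - np.asarray(stim_xy), axis=1)
        retinal = np.exp(-(dist / 15.0) ** 2)           # response falls off with distance
        # 32 eye-position units: 8 each favoring left, right, up, and down.
        # Each fires more the further the gaze lies in its direction, and is zero otherwise.
        gx, gy = gaze_xy
        gains = np.linspace(0.5, 2.0, 8)                # assumed spread of gains
        eye = np.concatenate([gains * max(-gx, 0.0), gains * max(gx, 0.0),
                              gains * max(gy, 0.0), gains * max(-gy, 0.0)])
        return np.concatenate([retinal, eye / 100.0])   # crude scaling toward (0, 1)

    x = encode_inputs(stim_xy=(10.0, -5.0), gaze_xy=(-20.0, 15.0))   # 96 input values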
Head Centered Position Outputs: There are no direct neural data which support this part of the hypothesis. It is known that the brain is wonderful and that you know where things are in terms of head centered coordinates. Otherwise you would not be able to reach out and touch someone without great difficulties (or AT&T). Generally, there must be some computation of this kind in order to correct head position so as to maintain fixed position of an object on the retina. However, there are no recorded signals of neurons which have the appropriate firing pattern. Thus, Zipser made two assumptions about the form of the output, one of which assumed that head centered position was coded similarly to the known form of the retinotopic neurons, the other that it was similar to that used for eye position.

Hidden Layer Neurons: Although Andersen has data which he thinks represent the outputs of the "hidden layer neurons" -- the cells which respond to both retinal stimuli and eye position -- Zipser does not use these data directly. Instead, Zipser trains the neural network to make the computation (retinal position) + (eye position) = (position in head centered coordinates). Whatever linearities and non-linearities are necessary to perform this transformation are learned by the sets of weights between the input and hidden layers, and between the hidden and output layers. After the neural network has learned the association, the responses of the neural network hidden layer and the brain's "hidden layer" are compared. Zipser claims that there are striking similarities in the responses and that this strongly supports the idea that the brain is acting as a neural network and that area 7a is performing this particular transformation.

Classical Analogies: The neural network is performing something very close to the pattern recognition described above. If there were only one output active at a single time, it would have been exact in the sense that the neural network would have been asked to determine which head centered position coordinate most nearly matched the inputs coming from the retina and the eye position sensors. The second analogy is that the neural network is being asked to do functional approximation, which has not been described in this course. Given a variety of pairs of inputs and outputs, can the network learn to interpolate between multiple outputs when the inputs do not exactly match the training inputs? In either case, there is a first observation: it is not at all surprising that the neural net could achieve this transformation, because it is a simple, straightforward linear transformation, whether it be thought of as a pattern recognition problem or as a functional interpolation problem. There are many algorithms which will produce the same results.

Hidden Layer Response Insights: Zipser's main argument is that the form of the hidden layer response is similar to that of the real neurons and that this supports the hypothesis that they are involved in the computation. There are several problems with the argument, one of which has been mentioned in the paragraphs immediately above: the transformation is not unique, nor does it have any a priori special properties (e.g. linearity or fidelity to underlying physiology) which recommend it. The second problem is that there is no measure of the similarity of the responses. Certainly, they visually appear to have some similarities. To be more rigorous one would have to establish measures to quantify features of the responses, such as the multimodal nature which Zipser cites. Since this feature morphology is an entirely new aspect of modeling, it will require many investigators showing how such features help in identifying models before such measures are routinely accepted. Thus, although intuitively appealing, there is no scientific basis for the argument.
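As a concrete (and invented) example of the kind of measure that would be needed, one could sample each model hidden unit and each recorded neuron over the same grid of stimulus and gaze conditions and correlate the response maps. This is one candidate measure sketched for illustration, not anything Zipser and Andersen computed:

    import numpy as np

    def best_match_correlation(model_maps, neuron_map):
        """model_maps: (n_hidden, n_conditions) responses; neuron_map: (n_conditions,)."""
        scores = [np.corrcoef(m, neuron_map)[0, 1] for m in model_maps]
        j = int(np.argmax(scores))
        return j, scores[j]   # which hidden unit best resembles the neuron, and how closely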
Modeling Insights: To be balanced, one should remember that many criticisms can be leveled at linear models. Linear systems models are usually not tied to the underlying physiology, and few biological systems are very linear. When one does find a system that is linear, the finding is of more significance, as at least the system is relatively predictable. The purpose of the modeling exercise is rarely to predict, but rather to gain insight into the biology. The question here is whether or not one has learned something about area 7a as a result of the neural net model. With 20/20 hindsight it is easy to say that nothing new was added; however, beforehand it may not have been so obvious that there was a strong likelihood that the "hidden layer neurons" were participating in the computation of head centered coordinates.

The next question is what further hypotheses Zipser and Andersen can generate as a result of the modeling exercise. ...