A BRIEF INTRODUCTION TO NEURAL NETWORKS

This introduction to Neural Networks takes a different approach than that chosen in the
overwhelming fraction of the large number of books and articles written on the subject. In those articles
a token amount is written to suggest that the neural networks do something which is really similar to
what the brain does. Whereas one often gets the impression that some authors believe that this similarity
means that their algorithms are automatically infused with the intelligence of a person, a more accurate
description is that the algorithms are “neurally inspired”, meaning that the originators were thinking
about the nervous system when they were trying to deal with difficult problems. When that is the case,
neural networks are an example of one direction of the flow of ideas in cybernetics, or, roughly, what
happens whenever biologists and engineers trade ideas. This is the side where engineers look for ways
to solve problems and talk to biologists among others, taking a variety of ideas but always putting as paramount the solution of engineering problems.
The other direction in which information flows is that biologists take engineering concepts and apply them to biological problems. Most of what is done is at the level of the conceptual organizing
principle. For instance, feedback is a decidedly engineering concept (it was formalized in order to
explain the behavior of amplifiers in the 1920’s) which has been adopted by biologists, social
scientists, and many others in order to explain behavior and even as a descriptor for good business
practices. The concept of the phase-locked loop (again, an engineering idea) is currently in vogue,
albeit with another name, in explanations of how the brain works.

The approach to neural networks which is taken here is to begin by exploring them, and in
particular the simple backpropagation network, as a means of solving the pattern recognition problem
above. We know of good statistical techniques for this problem, and we know its limitations. It will be
seen that the neural net often solves the problems associated with peculiar statistical distributions without
us having to know what they are beforehand. In essence, the neural net can substitute for a lot of
graduate training in statistics by making you capable of solving difficult problems without knowing very much.
How does this relate to biology and biological modeling? In the same way in which all mathematical modeling relates to biology: we used well-understood control systems models to explore
our understanding of the pupillary control system. Now we should use well-understood neural net
models (which are, after all, just pattern recognizers of a new sort) to model computation in the brain.
We’ll present a model application of a region of the brain where associations are made which appear
consistent with a neural net model. The modeling here will be no better nor worse than modeling in any
other biological problem with any other mathematical technique: it is the quality of the insights gained
which determines the effectiveness of the modeling work.

Perceptron Learning
Probably the first implemented neural net examples were those of Frank Rosenblatt and coworkers at Cornell in the 1950’s and 1960’s. They built physical electrical and electronic (tubes!)
implementations. The first was called the Perceptron from which the label Perceptron Learning is
derived. The device took in lots of data vectors x which were the light intensities on a 5 by 7 grid. The outputs were coded values for letters of the alphabet.
The algorithm works by taking the weighted sum of the inputs in a manner identical to taking a dot product but also suggestive of how a neuron sums its many inputs, some of which are close to the
soma, some of which are far away. Initially the weights are chosen randomly, but the algorithm’s job is
to change the weights with a rule so that they come to generate the desired outputs: this is “learning”.
To do so you have to go through a training phase in which you compare the outputs to the desired
outputs and make corrections to the weights so that the next time the results are better. Also, there are
large numbers of sets of input data and desired outputs, not all of which are consistent. Thus it is quite likely that it is not possible for the algorithm to be perfect.

Here is an algorithm which maintains the essential flavor of Rosenblatt’s Perceptron:

Given: lots of data vectors $x(k)$ associated with either class A (+1) or class B (-1)

Initial conditions: random $w_i$’s such that $\|w\| = 1$. (Think of this as a unit vector pointing somewhere, which continually asks the question “How much does the data look like me?”)

Iterate for each vector k, then repeat the whole data set many (1000’s) times. (= “multiple presentations”)
• Compute $y = w^T x$ (“how much does x look like w?”)
• If $y > 0$ and class A (the output is high and it is supposed to be high), increment each weight element: $w_i = w_i + x_i \cdot \delta$ (“reward w” by making it look more like x)
• If $y < 0$ and class B (the output is low and it is supposed to be low), decrement each weight element: $w_i = w_i - x_i \cdot \delta$ (“reward w” by making it look more like -x)
• Renormalize the weights so $\|w\| = 1$ (keep w as a unit vector)

Result: As best as possible, the Perceptron will learn to distinguish A from B by using the projection $w^T x$. The Perceptron has to find the optimal angle for the w axis. As you might
have guessed, the solutions to this problem are well known in the linear statistics literature. In fact there
are much better classifiers than the Perceptron classifier as was noted above. You should compare this
diagram with the cluster diagrams for the action potentials described earlier in the course.

[Figure: 2-D Case]

A variety of weight adjustment rules will work. Since that time we have become fairly sophisticated in
proposing new kinds of “rewards” and “penalties” to these functions to make them work faster. Don’t
extrapolate from the anthropomorphic descriptions the idea that there is something especially biological
in them. The same words are used in “penalty Lagrangians”, a non-linear optimization technique
popular before the recent craze for neural nets. Also, any iterative numerical technique uses penalties
and rewards, although most do not use that terminology.
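The reward rule listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the step size `delta`, the number of presentations, and the optional starting vector `w0` are hypothetical and do not come from Rosenblatt's implementation. Note that, exactly as the rule is written, nothing is changed when a vector is classified wrongly, so a poor starting direction may never be corrected.

```python
import math
import random

def normalize(w):
    """Rescale w to unit length (assumes w is not the zero vector)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

def train_perceptron(data, labels, delta=0.1, presentations=100, w0=None, seed=0):
    """Reward-only rule from the text; labels are +1 (class A) or -1 (class B).

    delta, presentations, w0 and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    if w0 is None:
        w0 = [rng.uniform(-1.0, 1.0) for _ in data[0]]  # random initial weights
    w = normalize(w0)
    for _ in range(presentations):          # "multiple presentations"
        for x, label in zip(data, labels):
            y = sum(wi * xi for wi, xi in zip(w, x))   # y = w^T x
            if y > 0 and label == +1:       # high and supposed to be high
                w = [wi + delta * xi for wi, xi in zip(w, x)]
            elif y < 0 and label == -1:     # low and supposed to be low
                w = [wi - delta * xi for wi, xi in zip(w, x)]
            w = normalize(w)                # keep w a unit vector
    return w
```

Calling `train_perceptron(data, labels)` with a list of input tuples and a matching list of +1/-1 labels returns the learned unit vector; the sign of its dot product with a new vector is the class decision.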
Sigmoidal Output Functions

[Figure: three plots. The logistic function, $y = 1/(1 + e^{-x})$; Hyperbolic Tangent, $y = \tanh(x)$; Linear Limiter, $y = \max(-1, \min(1, x))$]

Above are three sigmoidal output functions (sometimes called “squashing functions”) which are
popularly used in neural networks. The essential feature is that the output is bounded for both large
negative and positive inputs. There is nothing magic about whether negative inputs are or are not used,
or whether the curves are centered at x=0 as shown or at x=0.5 as is often the case. These input/output
value ranges are critical to a computational algorithm, of course, but not to the overall strategy in using
the sigmoidal functions to limit outputs. Use of the sigmoidal function is usually described as:

$y = \sigma\left(\sum_i w_i x_i\right)$

Why use a squashing function? There are many problems for which it is useful to have limited
outputs which report the idea that “the input is big” with little discrimination among degrees of bigness,
which probably originate from inputs beyond the range expected in the modeling problem. The
squashing function is pervasive in nature: neurally, it corresponds to the idea that, no matter how
much input a neuron receives, it cannot fire faster than some amount, perhaps several hundred AP’s per
second, nor can it fire fewer than zero AP’s per second. In photography, the squashing function is
extremely well characterized as it relates the density of silver grains versus the input intensity of light to
the film: the film sensitivity relates to the input value at the midpoint of the curve, while contrast is essentially the slope of the curve at the midpoint.
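The three squashing functions named above are short enough to write out directly. The sketch below is illustrative only; the function names are hypothetical, and the linear limiter is assumed to clip at -1 and +1.

```python
import math

def logistic(x):
    """Logistic function: output bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    """Hyperbolic tangent: output bounded between -1 and +1."""
    return math.tanh(x)

def linear_limiter(x):
    """Linear in the middle, clipped to [-1, +1] outside (assumed limits)."""
    return max(-1.0, min(1.0, x))
```

All three are monotonic and bounded, which is the only property the discussion above relies on.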
The sigmoidal function does not affect the boundaries in the one layer Perceptron described
above. This results from the fact that the sigmoidal function is monotonic: if the test is whether or not $\sigma\left(\sum_i w_i x_i\right)$ is greater than a threshold, the equivalent test is whether or not $\sum_i w_i x_i$ is greater than a possibly different threshold. Thus the decision boundary is still a straight line. This is not true, however, for two
and three layer Perceptrons: curved boundaries can be established, as we’ll see below. This is another reason to use the sigmoidal function.

Bias
The typical “neuron” in a neural network computes the following function, where σ is the
sigmoidal or squashing function, and b is a bias or offset.

$y = \sigma\left(\sum_i w_i x_i + b\right)$

The bias is useful in establishing the “operating point” on the sigmoidal curve. In the Perceptron
model above, the lines being drawn are all constrained to go through the origin. If you wanted a
decision boundary which did not go through the origin, you would have to add a bias constant to the
formula (just as in the formula for a straight line, y = mx + b). The bias can be handled as just another
input when the learning algorithm is applied, by thinking of b as the weight for an input which is always
equal to one. This trick is extremely useful in writing vectorized programs for doing neural network analysis.

$y = \sigma\left(\sum_{i=1}^{N} w_i x_i + b\right) = \sigma\left(\sum_{i=1}^{N+1} w_i x_i\right)$, where $x_{N+1} = 1$ and $w_{N+1} = b$
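The bias trick can be checked numerically. The sketch below assumes the logistic function as the squashing function; the function names are hypothetical. It computes the neuron output once with an explicit bias and once with the bias folded in as the weight of an extra input fixed at one.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def neuron_with_bias(w, x, b):
    """y = sigma(sum_i w_i x_i + b), with the bias kept separate."""
    return logistic(sum(wi * xi for wi, xi in zip(w, x)) + b)

def neuron_augmented(w_aug, x):
    """Same neuron with b stored as the last weight and a constant input of 1 appended."""
    x_aug = list(x) + [1.0]
    return logistic(sum(wi * xi for wi, xi in zip(w_aug, x_aug)))
```

With the augmented form, a learning rule that adjusts the weights adjusts the bias too, with no special case.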
Generalized One Layer Perceptron

Perceptrons generalize in a variety of ways, including this model where you try to force the
system to have one output which is large, while all the rest are small. The implications for pattern
recognition should be obvious. Referring back to the EEG analysis with the AutoRegressive
coefficients, you should note that the goal is to have the AR coefficients as input and to have an
algorithm indicate which one of the EEG states is consistent with the EEG signal. This is precisely the
application made by some neural network researchers. The schematic diagram of an n dimensional
input and m class problem is shown, along with a case where there are two inputs and three classes.

The one layer Perceptron is illustrated schematically in the diagram to the left. If it were trained
on the action potentials from the data illustrated earlier, it would generate decision boundaries for
classes 1, 2, 3, and 4 as illustrated to the right. In each of the “quadrants” one of the outputs would be
maximal, indicating that the input pattern best matched the given template.
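The idea that one output is maximal for the best-matching template can be sketched as follows. The four template vectors are made up for illustration (they are not the trained weights from the action potential data), and the squashing function is omitted because, being monotonic, it would not change which output is largest.

```python
def classify(templates, x):
    """Return the index of the output with the largest weighted sum, plus all outputs."""
    outputs = [sum(wi * xi for wi, xi in zip(w, x)) for w in templates]
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return winner, outputs

# Hypothetical unit-vector templates, one per class, for a two-input problem.
TEMPLATES = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
```

For example, `classify(TEMPLATES, (0.9, 0.2))` picks the first template, since that is the direction the input most resembles.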
[Figure: schematic of the one layer Perceptron (left) and decision boundaries for classes 1 to 4 (right). Hand drawn lines! No accuracy implied]

Two Layer Perceptron with Back Propagation

Counting the Number of Layers:
The schematic diagram of a 2 Layer Perceptron is given below. It includes a set of inputs, which many people refer to as an input layer, an output layer, and a middle layer called a “hidden layer”. The
input “layer” has no “neural” processing: it merely reports the values of the inputs to the next layer.
Many people call this a 3 layer perceptron, although there is a movement toward a nomenclature that
counts only the number of layers of “neurons”. You will have to tolerate the confusion in terminology.
Decision Boundaries with Two Layer Perceptrons

The two layer perceptron has proven to be very useful; it is much more useful than the one layer
perceptron which can only “draw straight lines” to make classifications. The two layer perceptron can
use the hidden layer to “draw straight lines”, then apply the sigmoidal function, and let the output layer
combine the lines to perform AND, OR, and XOR functions with boundaries which are not quite linear.
The figure to the right is an example of how the network can divide up the feature space for purposes of
classification. The non-linearity and the second layer permit the creation of convex decision regions
which may match the underlying probability distribution much more exactly. If a third layer is added,
you can have entirely arbitrarily shaped decision regions, which could be an advantage when you
have very strange distributions. You should refer to the article by Lippmann for a discussion of how the backpropagation network divides up the feature space.

The downside to this ability to create unusual decision boundaries is that it is necessary to get
lots of examples in order to create them. For a high dimensional data set, this could be thousands or
millions of examples. For many problems, it is impossible to get this many examples; researchers will
often expand their set of examples by adding Gaussian distributed errors to the measurements. This
works for purposes of training, but has the result that the distributions are simply Gaussian and the
decision boundaries can be determined faster using the covariance matrix as described above.
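The earlier statement that the hidden layer “draws straight lines” and the output layer combines them into functions such as XOR can be illustrated with hand-chosen weights. Everything numeric in this sketch is an assumption made for illustration (the gain of 10 and the thresholds 0.5 and 1.5); the weights are set by hand, not learned.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

GAIN = 10.0  # hypothetical steepness, chosen so the outputs sit close to 0 or 1

def two_layer_xor(x1, x2):
    """Two hidden units draw the lines x1 + x2 = 0.5 and x1 + x2 = 1.5."""
    h_or = logistic(GAIN * (x1 + x2 - 0.5))    # high above the first line (OR)
    h_and = logistic(GAIN * (x1 + x2 - 1.5))   # high above the second line (AND)
    # The output unit is high only between the two lines: OR and not AND.
    return logistic(GAIN * (h_or - h_and - 0.5))
```

A one layer Perceptron cannot produce this result, because no single straight line separates the inputs (0,1) and (1,0) from (0,0) and (1,1).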
[Figure: example of how the two layer network divides up the feature space. Hand drawn lines. No accuracy assumed]

Backpropagation Algorithm:

Forward propagation refers to the normal computation, in which the inputs are used to compute
the outputs of the hidden layer, whose outputs are used to compute the outputs of the next hidden layer, until the outputs of the output layer are computed. Backward propagation is the use of the outputs of
the output layer and the desired outputs to make corrections to the weights of the output layer, after
which one uses those weights and the error signals to compute the equivalent error at the hidden layer,
which is used to correct the weights of that layer, until the weights of the first layer are changed. The
algorithm alternately computes the forward prediction and then backprojects the error to correct the
weights. This is repeated thousands of times for the entire data set until the output error is acceptable.

Forward Formulae

$y = \sigma\left(W_{xy}^T x + \theta\right)$ ; $z = \sigma\left(W_{yz}^T y + \phi\right)$

$w_{xy}$’s, $w_{yz}$’s are weights; σ = sigmoidal function; the θ’s, φ’s are bias offsets
x’s are the inputs; y’s are the hidden layer outputs; z’s are the output layer outputs.

Procedure:
• Initialize weights randomly
• For each input vector, repeat the following many times until results are satisfactory
  • Compute z for a given x
  • Compute $e = d - z$ (compare z to d, the desired output)
  • Correct the $W_{yz}$ weights. Here’s the formula for the jk-th weight between the y and z layers:
    $\Delta w_{y_j z_k} = \eta\, \delta_k\, y_j$ ; $\delta_k = z_k (1 - z_k)(d_k - z_k)$
    The error in the output is $(d_k - z_k)$; η is a gain constant
  • Correct the $W_{xy}$ weights:
    $\Delta w_{x_i y_j} = \eta\, \delta_j\, x_i$ ; $\delta_j = y_j (1 - y_j) \sum_k \delta_k\, w_{y_j z_k}$
    where $\sum_k \delta_k\, w_{y_j z_k}$ is the output error propagated back to the hidden layer. This is part of the notion of backpropagation. (If there were more layers, this process would continue until the input is reached.)
• Stop when either the sum of squared errors stops changing very much, or when the algorithm classifies everything correctly.

Equivalence to Partial Derivatives:

The critical step of correcting each weight is identical to correcting in proportion to the partial
derivative of the sum of squared errors with respect to the weight. In the case of the logistic function,
this is exact; in the case of the others, it is approximate. Thus, effectively, back propagation is the same as:
• Let $e^2 = \|d - z\|^2$
• Compute $\partial e^2 / \partial w$, the partial derivative of the error with respect to any particular weight w
• Update the value of the weight to $w_{previous} - (\text{constant}) \cdot \partial e^2 / \partial w$

Result: The backpropagation network can learn to reduce the error between the desired and computed
output. When averaged over all the sets of data, the network will usually find an “optimal” set of
weights. These are “optimal” in the sense that, empirically, they are usually very good and most often
equal or exceed the performance of the simpler classical techniques. However, it is extremely difficult
to prove rigorously any assertion about their performance, in contrast to classical techniques for which
such error bounds often exist. There are many times, however, when a classical technique will equal the
performance of the neural net, and some times when the classical technique will be better. The neural net approach is
almost always easier to program than a more sophisticated classical algorithm. The net effect is that
this Backpropagation Neural Network functions as a Pattern Recognizer.

Unsupervised Learning in a Perceptron
The Perceptron and the Backpropagation Network are supervised networks. This means that there is a “teacher” which uses knowledge of the correct outputs to change the weight vectors so that,
when the “test” comes, the network will provide answers which are correct as often as possible. For the
One-Layer Perceptron, with or without the sigmoid function, this essentially means making the weight
vector look more like the mean of the input class for which you want large outputs. For the multilayer
perceptron it is more difficult to describe what happens.

What does it mean to have “unsupervised learning”? This is the case where either you don’t
know the correct answer or you don’t use the information. What is it that the network is supposed to
learn? The answer is that the network will come to “learn” the statistical distribution of the inputs. For
instance, if all the inputs are the same except for a modest amount of noise, then the weight vector will
come to look like the mean of the distribution. If there are two outputs and there are two modes in the
distribution, each output’s weight vector will come to look like one of the modes. This generalizes to
many modes and many outputs. When there are either too few or too many outputs for the number of
modes in the input distribution, the network reaches some compromise which is ill-defined.

The algorithm below is essentially that of the Perceptron above.

[Figure: network schematic in which the maximum output is selected (“Winner takes all”)]

Given: lots of data vectors $x(k)$ whose association with a class is unknown.

Initial conditions: random $w_{ij}$’s such that $\|w_i\| = 1$. (Think of this as a unit vector pointing somewhere, which continually asks the question “How much does the data look like me?”)

Iterate for each vector k, then repeat the whole data set many (1000’s) times. (= “multiple presentations”)
• For each output i compute $y_i = w_i^T x$ (“how much does x look like $w_i$?”)
• Find the largest output $y_j$; increment $w_j = w_j + x_k \cdot \delta$ (“reward $w_j$” by making it look more like $x_k$)
• Renormalize the weights so $\|w_j\| = 1$ (keep $w_j$ as a unit vector)

Result: As best as possible, one of the outputs of the Perceptron will learn the “direction” of the mode
of the distribution of inputs because, if it happens to point in the “right” direction initially, the reward
algorithm will make it look more and more like the mean. A second output, which initially points in
another direction, closer to a second mode of the distribution, will be rewarded only when the inputs are
close to that mode and will come to “learn” that mode. The result of a case where the inputs are two dimensional and there are three outputs is illustrated below.
[Figure: Example with two inputs and three outputs, showing the weight vectors and the region where each output wins]
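The winner-takes-all procedure for unsupervised learning can be sketched as below. The function names, the step size `delta` and the number of presentations are hypothetical. The text starts from random unit vectors; this sketch instead takes the starting vectors as an argument so that the outcome does not depend on a lucky draw.

```python
import math

def normalize(w):
    """Rescale w to unit length (assumes w is not the zero vector)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

def train_winner_take_all(data, initial_weights, delta=0.1, presentations=50):
    """Reward only the output whose weight vector best matches each input."""
    weights = [normalize(w) for w in initial_weights]
    for _ in range(presentations):                      # "multiple presentations"
        for x in data:
            outputs = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
            j = max(range(len(weights)), key=lambda i: outputs[i])  # largest output
            rewarded = [wi + delta * xi for wi, xi in zip(weights[j], x)]
            weights[j] = normalize(rewarded)            # keep the winner a unit vector
    return weights
```

If the inputs cluster around two directions and each starting vector is nearer to a different cluster, each weight vector drifts toward the mean direction of its own cluster, as described in the Result paragraph above.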
A NEUROBIOLOGICAL MODELING EXAMPLE

Finally, we get to look at a case where a multilayer Backpropagation neural network is used as a
modeling tool to understand biology. As you read the paper keep in mind the observations above that
Backpropagation is a tool for recognizing patterns. Further, the patterns of the weights of the hidden
layer are not well understood statistically. Although they clearly play a critical role in the algorithm,
there is no unique set of weights which solves the problem. Not discussed above is the other major use
of the supervised network: it can be used for interpolation of mathematical functions whose analytic
expression is either unknown or too complex, but for which example values are known. The paper is:

“A backpropagation programmed network that simulates response properties of a subset
of posterior parietal neurons”, D. Zipser, R. A. Andersen, Nature 1988 331:679-684

Zipser did the analysis of data which Andersen had recorded some time earlier. The recordings
are from single units (neurons) in a monkey’s posterior parietal cortex, which is given the label “area
7a.” This area is in the anatomical and physiological path from primary visual cortex to the control of eye
position. Thus, this is part of the circuitry which impacts the ocular position response, including both
the optokinetic system, which was studied earlier, and other systems, such as the smooth pursuit system.

From past experimental work, this region is believed to use knowledge of eye position and
the position of an object on the retina in order to compute the object’s position in head centered
coordinates. Andersen had found three classes of neurons:
a) retinotopic: 21% respond to visual stimuli only
b) eye position: 15% respond to changes in the position of the eye only
c) both: 57% respond to either changes in the eye position or visual stimuli

Andersen’s hypothesis about the computation performed in area 7a is summarized by this diagram, from
which a neural network can be created.

[Figure: Hypothesis. The inputs are retinal position and eye position; the cells responding to both perform the neural net transformation; the output is head centered position]

Retinotopic Inputs: From experimental recordings, Andersen knows the functional dependence
of the retinotopic cells under controlled experimental conditions. Essentially, the monkey is trained to
focus on a specific point on a screen. Then points of light are shown at positions off—center while the
recordings are made. The experimenter can then reconstruct a function of firing rate versus retinal
position. In the actual experiments, 17 measurements were made: one at the center, and four each on
circles which were 10, 20, 30, and 40 degrees (visual angle) away from the center. The general
tendency is for one neuron to fire maximally for a light stimulus at a particular location, and for its response
to fall off as the light is moved away. Bimodal responses (e.g. to two locations) are rare.

The neural network model has 64 neurons each of which has a best location, equally spaced over
the field of view. Each has the same experimentally derived response function centered around its best
location. (See Figure X)

Eye Position Inputs: From experimental recordings, Andersen knows the functional dependence
of these cells to changes in gaze location. The monkey is trained to focus on spots on a screen and the
absolute angles (up/down; left/right) of the eye with respect to the head are measured along with the
firing rate of the neurons. It is found there are four classes of cells which respond to changes in eye
position. One class has a higher firing rate the further to the left the eye is positioned, but has a zero
response to rightward, upward, or downward positions. The other three classes favor the other three directions.

The neural network model has 32 inputs to represent these data, 8 each to represent left, right, up and down inputs.
Head Centered Position Outputs: There are no direct neural data which support this part of the
hypothesis. It is known that the brain is wonderful and that you know where things are in terms of head centered coordinates. Otherwise you would not be able to reach out and touch someone without great
difficulties (or AT&T). Generally, there must be some computation of this kind in order to correct head
position so as to maintain fixed position of an object on the retina. However, there are no recorded
signals of neurons which have the appropriate firing pattern. Thus, Zipser made two assumptions
about the form of the output, one of which assumed that head centered position was coded similarly to
the known form of the retinotopic neurons, the other that it was similar to that used for eye position.
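The backpropagation update formulae given earlier can be tried on a toy stand-in for this hypothesis. The sketch below is not Zipser's network: instead of 64 retinal and 32 eye position inputs it uses two scalar inputs, and the target is assumed to be simply their sum, scaled to lie between 0 and 1. The hidden layer size, gain constant, number of presentations and random seed are all illustrative assumptions. Biases are updated as weights on an input fixed at one.

```python
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W_xy, theta, W_yz, phi):
    """y = sigma(W_xy^T x + theta); z = sigma(W_yz^T y + phi)."""
    y = [logistic(sum(W_xy[i][j] * x[i] for i in range(len(x))) + theta[j])
         for j in range(len(theta))]
    z = [logistic(sum(W_yz[j][k] * y[j] for j in range(len(y))) + phi[k])
         for k in range(len(phi))]
    return y, z

def train(samples, n_hidden=3, eta=0.5, presentations=500, seed=1):
    """samples is a list of (input tuple, desired output tuple) pairs."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    W_xy = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    W_yz = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
    theta, phi = [0.0] * n_hidden, [0.0] * n_out
    for _ in range(presentations):
        for x, d in samples:
            y, z = forward(x, W_xy, theta, W_yz, phi)
            # Output layer: delta_k = z_k (1 - z_k) (d_k - z_k)
            delta_k = [z[k] * (1 - z[k]) * (d[k] - z[k]) for k in range(n_out)]
            # Hidden layer: delta_j = y_j (1 - y_j) sum_k delta_k w_{y_j z_k}
            delta_j = [y[j] * (1 - y[j]) * sum(delta_k[k] * W_yz[j][k] for k in range(n_out))
                       for j in range(n_hidden)]
            for j in range(n_hidden):
                for k in range(n_out):
                    W_yz[j][k] += eta * delta_k[k] * y[j]
            for i in range(n_in):
                for j in range(n_hidden):
                    W_xy[i][j] += eta * delta_j[j] * x[i]
            for k in range(n_out):
                phi[k] += eta * delta_k[k]      # bias = weight on a constant input of 1
            for j in range(n_hidden):
                theta[j] += eta * delta_j[j]
    return W_xy, theta, W_yz, phi

def sum_squared_error(samples, params):
    return sum((dk - zk) ** 2
               for x, d in samples
               for dk, zk in zip(d, forward(x, *params)[1]))
```

A possible training set is a grid of positions such as `[((r / 8.0, e / 8.0), ((r + e) / 8.0,)) for r in range(5) for e in range(5)]`. Comparing `sum_squared_error` before and after training shows whether the weights have moved toward the desired mapping.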
Hidden Layer Neurons: Although Andersen has data which he thinks represent the outputs of
the “hidden layer neurons” (the cells which respond to both retinal stimuli and eye position), Zipser
does not use these data directly. Instead, Zipser trains the neural network to make the computation
(retinal position) + (eye position) = (position in head centered coordinates). Whatever linearities and
non—linearities are necessary to perform this transformation are learned by the sets of weights between
the input and hidden layers, and between the hidden and output layers. After the neural network has
learned the association, the responses of the neural network hidden layer and the brain’s “hidden layer”
are compared. Zipser claims that there are striking similarities in the responses and that this strongly
supports the idea that the brain is acting as a neural network and that area 7a is performing this particular transformation.

Classical Analogies: The neural network is performing something very close to the pattern
recognition described above. If there were only one output active at a single time, it would have been
exact in the sense that the neural network would have been asked to determine which head centered
position coordinate most nearly matched the inputs coming from the retina and the eye position sensors.

The second analogy is that the neural network is being asked to do functional approximation,
which has not been described in this course. Given a variety of pairs of inputs and outputs, can the
network learn to interpolate between multiple outputs when the inputs do not exactly match the training
inputs? In either case, there is a first observation: it is not at all surprising that the neural net could
achieve this transformation, because it is a simple, straightforward linear transformation, whether it be
thought of as a pattern recognition problem or as a functional interpolation problem. There are many algorithms which will produce the same results.

Hidden Layer Response Insights: Zipser’s main argument is that the form of the hidden layer
response is similar to that of the real neurons and that this supports the hypothesis that they are involved in the computation. There are several problems with the argument, one of which has been mentioned in
the paragraphs immediately above: the transformation is not unique, nor does it have any a priori
special properties (e.g. linearity or fidelity to underlying physiology) which recommend it. The second
problem is that there is no measure of the similarity of the responses. Certainly, they visually appear to
have some similarities. To be more rigorous one would have to establish measures to quantify features
of the responses such as their multimodal nature which Zipser cites. Since this feature morphology is an
entirely new aspect of modeling, it will require many investigators showing how such features help in
identifying models before such measures are routinely accepted. Thus, although intuitively appealing,
there is no scientific basis for the argument.

Modeling Insights: To be balanced, one should remember that many criticisms can be leveled at linear models. Linear systems models are usually not tied to the underlying physiology and few biological systems are very linear. When one finds one that is linear, it is of more significance as at least
the system is relatively predictable. The purpose of the modeling exercise is rarely to predict, but rather
to gain insight into the biology. The question here is whether or not one has learned something about
area 7a as a result of the neural net model. With 20/20 hindsight it is easy to say that nothing new was
added; however, beforehand it may not have been so obvious that there was a strong likelihood that the
“hidden layer neurons” were participating in the computation of head centered coordinates. The next
question is what further hypotheses can Zipser and Andersen generate as a result of the modeling exercise. ...
 Spring '08
 PRINCIPE
 Signal Processing
