Neural Networks
CPS 170, Ron Parr

Neural Network Motivation
• Human brains are the only known example of actual intelligence
• Individual neurons are slow, boring
• Brains succeed by using massive parallelism
• Idea: Copy what works
• Raises many issues:
  – Is the computational metaphor suited to the computational hardware?
  – How do we know if we are copying the important part?
  – Are we aiming too low?

Why Neural Networks? Maybe computers should be more brain-like:

                       Computers               Brains
Computational Units    10^8 gates/CPU          10^11 neurons
Storage Units          10^10 bits RAM,         10^11 neurons,
                       10^13 bits HD           10^14 synapses
Cycle Time             10^-9 s                 10^-3 s
Bandwidth              10^10 bits/s            10^14 bits/s
Compute Power          10^10 ops/s             10^14 ops/s

Comments on Jaguar (world's fastest supercomputer as of 4/10)
• 2,332 Teraflops: 10^15 ops/s (Jaguar) vs. 10^14 ops/s (brain)
• 224,256 processor cores
• 300 TB RAM (10^15 bits)
• 10 PB disk storage
• 7 Megawatts power (~$500K/year in electricity [my estimate])
• ~$100M cost
• 4400 sq ft size (very large house)
• Pictures and other details: http://www.nccs.gov/jaguar/

More Comments on Modern Supercomputers vs. Brains
• What is wrong with this picture?
  – Weight
  – Size
  – Power consumption
• What is missing?
  – Still can't replicate human abilities (though vastly exceeds human abilities in many areas)
  – Are we running the wrong programs?
  – Is the architecture well suited to the programs we might need to run?

Artificial Neural Networks
• Develop abstraction of the function of actual neurons
• Simulate large, massively parallel artificial neural networks on conventional computers
• Some have tried to build the hardware too
• Try to approximate human learning, robustness to noise, robustness to damage, etc.

Use of Neural Networks
• Classic examples
  – Trained to pronounce English
    • Training set: sliding window over text, sounds
    • 95% accuracy on training set
    • 78% accuracy on test set
  – Trained to recognize handwritten digits w/ >99% accuracy
  – Trained to drive (Pomerleau's no-hands across America)
• Current examples
  – Credit risk evaluation, OCR systems, voice recognition, etc. (though not necessarily the best method for any of these tasks)
  – Built in to many software packages, e.g., Matlab

Neural Network Lore
• Neural nets have been adopted with an almost religious fervor within the AI community several times
• Often ascribed near-magical powers by people, usually those who know the least about computation or brains
• For most AI people, the magic is gone, but neural nets remain extremely interesting and useful mathematical objects

Artificial Neurons
[diagram: inputs x_j with weights w_{j,i} feeding a node/neuron h, producing output z_i]
  a_i = h(Σ_j w_{j,i} x_j)
a_i is the activation level of neuron i. h can be any function, but is usually a smoothed step function.

Threshold Functions
• h(x) = sgn(x) (perceptron)
• h(x) = tanh(x) or 1/(1+exp(-x)) (logistic sigmoid)
[figure: hard threshold sgn(x) and smooth sigmoid curves plotted over x from -10 to 10]

Network Architectures
• Cyclic vs. Acyclic
  – Cyclic is tricky, but more biologically plausible
    • Hard to analyze in general
    • May not be stable
    • Need to assume latches to avoid race conditions
  – Hopfield nets: special type of cyclic net useful for associative memory
• Single layer (perceptron)
• Multiple layer

Feedforward Networks
• We consider acyclic networks
• One or more computational layers
• Entire network can be viewed as computing a complex non-linear function
• Typical uses in learning:
  – Classification (usually involving complex patterns)
  – General continuous function approximation

Special Case: Perceptron
[diagram: inputs x_j with weights w_j feeding a node/neuron h, output Y]
h is a simple step function (sgn)

Perceptron Learning
• We are given a set of inputs x(1)…x(n)
• t(1)…t(n) is a set of target outputs (boolean) {-1, 1}
• w is our set of weights
• Output of perceptron = w^T x
• perceptron_error(x(i), w) = -w^T x(i) · t(i)
• Goal: Pick w to optimize:
  min_w Σ_{i ∈ misclassified} perceptron_error(x(i), w)

Update Rule
Repeat until convergence:
  ∀ i ∈ misclassified, ∀ j: w_j ← w_j + α x_j(i) t(i)
α is the "learning rate" (can be any constant)
• i iterates over samples
• j iterates over weights
http://neuron.eng.wayne.edu/java/Perceptron/New38.html
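The update rule above can be sketched directly in Python. This is a minimal illustration, not code from the slides: the toy dataset, the bias folded into each x as a constant 1 feature, and α = 0.1 are all assumptions.

```python
# Perceptron learning rule on a toy linearly separable dataset.
# Each x includes a leading 1.0 as a bias feature (an assumption, not
# from the slides); targets t are in {-1, 1}.
data = [([1.0, 2.0, 1.0], 1), ([1.0, 1.5, 2.0], 1),
        ([1.0, -1.0, -0.5], -1), ([1.0, -2.0, 0.0], -1)]
w = [0.0, 0.0, 0.0]
alpha = 0.1  # learning rate; any positive constant works for separable data

def predict(w, x):
    # sgn of w^T x (ties broken toward -1)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1

# Repeat until convergence: for each misclassified i,
#   w_j <- w_j + alpha * x_j(i) * t(i)
for _ in range(100):
    misclassified = [(x, t) for x, t in data if predict(w, x) != t]
    if not misclassified:
        break
    for x, t in misclassified:
        for j in range(len(w)):
            w[j] += alpha * x[j] * t
```

Each pass applies the update to every misclassified example until none remain.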
Perceptron Learning – The Good News First
• For functions that are representable using the perceptron architecture (more on this later):
  – Perceptron learning rule converges to a correct classifier for any choice of α
  – Online classification possible for streaming data (very efficient implementation)
• Positive perceptron results set off an explosion of research on neural networks

Perceptron Learning – Now the Bad News
• Perceptron computes a linear function of its inputs
• Asks if the input lies above a line (hyperplane, in general)
• Representable functions are functions that are "linearly separable"; i.e., there exists a line (hyperplane) that separates the positive and negative examples
• If the training data are not linearly separable:
  – No guarantees
  – Perceptron learning rule may produce oscillations

Visualizing Linearly Separable Functions
Is red linearly separable from green? Are the circles linearly separable from the squares?

Observations
• Linear separability is fairly weak
• We have other tricks:
  – Functions that are not linearly separable in one space may be linearly separable in another space
  – If we engineer the inputs to our neural network, then we change the space in which we are constructing linear separators
  – Every function has a linear separator (in some space)
• Perhaps other network architectures will help

Separability in One Dimension
If we have just a single input x, there is no way a perceptron can correctly classify these data
[figure: 1-D dataset around x=0; Copyright © 2001, 2003, Andrew W. Moore]

Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too, using 1, x, x^2 as inputs to the perceptron
[figure: 1-D dataset around x=0; Copyright © 2001, 2003, Andrew W. Moore]

Multilayer Networks
• Once people realized how simple perceptrons were, they lost interest in neural networks for a while (feature engineering turned out to be impractical in many cases)
• Multilayer networks turn out to be much more expressive (with a smoothed step function)
  – Use a sigmoid, e.g., tanh(w^T x) or logistic sigmoid: 1/(1+exp(-x))
  – With 2 layers, can represent any continuous function
  – With 3 layers, can represent many discontinuous functions
• Tricky part: How to adjust the weights

Smoothing Things Out
• Idea: Do gradient descent on a smooth error function
• Error function is sum of squared errors
• Consider a single training example first:
  E = 0.5 error(x(i), w)^2
Using the chain rule:
  ∂E/∂w_ij = (∂E/∂a_j)(∂a_j/∂w_ij)
Define δ_j = ∂E/∂a_j. With a_j = Σ_i w_ij z_i and z_j = h(a_j), we have ∂a_j/∂w_ij = z_i, so:
  ∂E/∂w_ij = δ_j z_i
[diagram: unit i (activation a_i, output z_i) feeding unit j through weight w_ij]
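The quantities a_j = Σ_i w_ij z_i and z_j = h(a_j) for one layer can be computed directly. A minimal sketch; the particular weights, inputs, and the choice of h = tanh are illustrative assumptions:

```python
import math

# One layer of a feedforward net, mirroring the definitions above:
# a_j = sum_i w_ij * z_i (weighted input), z_j = h(a_j) (unit output).
def forward_layer(z_in, W, h=math.tanh):
    a = [sum(W[i][j] * z_in[i] for i in range(len(z_in)))
         for j in range(len(W[0]))]
    z = [h(a_j) for a_j in a]
    return a, z

z_in = [1.0, 2.0]
W = [[0.5, -0.5],    # W[i][j] = weight from unit i to unit j
     [0.25, 0.75]]
a, z = forward_layer(z_in, W)
# a = [1.0*0.5 + 2.0*0.25, 1.0*(-0.5) + 2.0*0.75] = [1.0, 1.0]
```

Stacking such layers gives the forward pass used below.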
Propagating Errors
• For output units (assuming no weights on outputs):
  ∂E/∂a_j = δ_j = y - t
• For hidden units, apply the chain rule (recall a_j = Σ_i w_ij z_i, z_j = h(a_j)):
  ∂E/∂a_i = δ_i = Σ_k (∂E/∂a_k)(∂a_k/∂a_i) = h'(a_i) Σ_k w_ik δ_k
  where the sum ranges over all nodes k that receive input from i
[diagram: unit i feeding unit j through weight w_ij; for output units, z_j = output]

Differentiating h
• Recall the logistic sigmoid:
  h(x) = e^x / (1 + e^x) = 1 / (1 + e^-x)
  1 - h(x) = e^-x / (1 + e^-x) = 1 / (1 + e^x)
• Differentiating:
  h'(x) = e^-x / (1 + e^-x)^2 = [1 / (1 + e^-x)] [e^-x / (1 + e^-x)] = h(x)(1 - h(x))
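The identity h'(x) = h(x)(1 - h(x)) can be checked numerically against a centered finite difference; the test points below are arbitrary choices, not from the slides:

```python
import math

# Logistic sigmoid and its derivative via the identity derived above:
# h'(x) = h(x) * (1 - h(x)).
def h(x):
    return 1.0 / (1.0 + math.exp(-x))

def h_prime(x):
    return h(x) * (1.0 - h(x))

# Sanity check against a centered finite difference at a few points
eps = 1e-6
for x in [-2.0, 0.0, 1.5]:
    fd = (h(x + eps) - h(x - eps)) / (2 * eps)
    assert abs(fd - h_prime(x)) < 1e-8
```

This is why no extra transcendental evaluations are needed during the backward pass: h'(x) is computed from the already-known h(x).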
Putting it Together
• Apply input x to the network (sum for multiple inputs)
  – Compute all activation levels
  – Compute final output (forward pass)
• Compute δ for output units: δ = y - t
• Backpropagate δ's to hidden units:
  δ_j = Σ_k (∂E/∂a_k)(∂a_k/∂a_j) = h'(a_j) Σ_k w_jk δ_k
• Compute gradient update:
  ∂E/∂w_ij = δ_j z_i
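The recipe above can be sketched for a tiny network with one tanh hidden layer and a single linear output unit (so the output δ is just y - t). The input, target, and weights are illustrative assumptions, and the analytic gradient is checked against a finite difference:

```python
import math

def forward(x, W1, W2):
    # Hidden layer: a_j = sum_i w_ij * x_i, z_j = tanh(a_j)
    a = [sum(W1[i][j] * x[i] for i in range(len(x))) for j in range(len(W1[0]))]
    z = [math.tanh(a_j) for a_j in a]
    # Single linear output unit: y = sum_j w_j * z_j
    y = sum(W2[j] * z[j] for j in range(len(z)))
    return a, z, y

def backprop(x, t, W1, W2):
    a, z, y = forward(x, W1, W2)
    delta_out = y - t                                  # output delta: y - t
    # Hidden deltas: delta_j = h'(a_j) * w_j * delta_out, h'(a) = 1 - tanh(a)^2
    delta_h = [(1 - z[j] ** 2) * W2[j] * delta_out for j in range(len(z))]
    gW2 = [delta_out * z[j] for j in range(len(z))]    # dE/dw_j = delta * z_j
    gW1 = [[delta_h[j] * x[i] for j in range(len(z))] for i in range(len(x))]
    return gW1, gW2

x, t = [1.0, -2.0], 0.5
W1 = [[0.1, -0.3], [0.2, 0.4]]   # W1[i][j] = weight from input i to hidden j
W2 = [0.5, -0.25]                # W2[j] = weight from hidden j to output
gW1, gW2 = backprop(x, t, W1, W2)

# Check one weight against a centered finite difference of E = 0.5*(y - t)^2
def E(W1, W2):
    return 0.5 * (forward(x, W1, W2)[2] - t) ** 2

eps = 1e-6
W1p = [row[:] for row in W1]; W1p[0][1] += eps
W1m = [row[:] for row in W1]; W1m[0][1] -= eps
fd = (E(W1p, W2) - E(W1m, W2)) / (2 * eps)
assert abs(fd - gW1[0][1]) < 1e-8
```

Gradient descent would then update each weight as w_ij ← w_ij - α ∂E/∂w_ij.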
Summary of Gradient Update
• Gradient calculation and parameter update have a recursive formulation
• Decomposes into:
  – Local message passing
  – No transcendentals:
    • h'(x) = 1 - h(x)^2 for tanh(x)
    • h'(x) = h(x)(1 - h(x)) for logistic sigmoid
• Highly parallelizable
• Biologically plausible(?)
• Celebrated backpropagation algorithm

Good News
• Can represent any continuous function with two layers (1 hidden)
• Can represent essentially any function with 3 layers
• (But how many hidden nodes?)
• Multilayer nets are a universal approximation architecture with a highly parallelizable training algorithm

Backprop Issues
• Backprop = gradient descent on an error function
• Function is nonlinear (= powerful)
• Function is nonlinear (= local minima)
• Big nets:
  – Many parameters
    • Many optima
    • Slow gradient descent
    • Risk of overfitting
  – Biological plausibility ≠ electronic plausibility
• Many NN experts became experts in numerical analysis (by necessity)

Neural Network Tricks
• Many gradient descent acceleration tricks
• Early stopping (prevents overfitting)
• Methods of enforcing transformation invariance (e.g., if you have symmetric inputs)
  – Modify error function
  – Transform/augment training data
  – Weight sharing
• Handcrafted network architectures

Neural Nets in Practice
• Many applications for pattern recognition tasks
• Very powerful representation
  – Can overfit
  – Can fail to fit with too many parameters, poor features
• Very widely deployed AI technology, but
  – Few open research questions (Best way to get a machine learning paper rejected: "Neural Network" in title.)
  – Connection to biology still uncertain
  – Results are hard to interpret
• "Second best way to solve any problem"
  – Can do just about anything w/ enough twiddling
  – Now third or fourth to SVMs, boosting, and ???
This note was uploaded on 02/17/2012 for the course COMPSCI 170 taught by Professor Parr during the Spring '11 term at Duke.