Neural Networks, CPS 170, Ron Parr


Neural Network Motivation
•  Human brains are the only known example of actual intelligence
•  Individual neurons are slow, boring
•  Brains succeed by using massive parallelism
•  Idea: Copy what works
•  Raises many issues:
   –  Is the computational metaphor suited to the computational hardware?
   –  How do we know if we are copying the important part?
   –  Are we aiming too low?

Why Neural Networks?
Maybe computers should be more brain-like:

                      Computers           Brains
Computational Units   10^8 gates/CPU      10^11 neurons
Storage Units         10^10 bits RAM,     10^11 neurons,
                      10^13 bits HD       10^14 synapses
Cycle Time            10^-9 s             10^-3 s
Bandwidth             10^10 bits/s*       10^14 bits/s
Compute Power         10^10 Ops/s         10^14 Ops/s

Comments on Jaguar (world's fastest supercomputer as of 4/10)
•  2,332 Teraflops
•  10^15 Ops/s (Jaguar) vs. 10^14 Ops/s (brain)
•  224,256 processor cores
•  300 TB RAM (10^15 bits)
•  10 PB disk storage
•  7 Megawatts power (~$500K/year in electricity [my estimate])
•  ~$100M cost
•  4400 sq ft size (very large house)
•  Pictures and other details: http://

More Comments on Modern Supercomputers vs. Brains
•  What is wrong with this picture?
   –  Weight
   –  Size
   –  Power consumption
•  What is missing?
   –  Still can't replicate human abilities (though vastly exceeds human abilities in many areas)
   –  Are we running the wrong programs?
   –  Is the architecture well suited to the programs we might need to run?

Artificial Neural Networks
•  Develop abstraction of function of actual neurons
•  Simulate large, massively parallel artificial neural networks on conventional computers
•  Some have tried to build the hardware too
•  Try to approximate human learning, robustness to noise, robustness to damage, etc.

Use of neural networks
•  Classic examples
   –  Trained to pronounce English
      •  Training set: Sliding window over text, sounds
      •  95% accuracy on training set
      •  78% accuracy on test set
   –  Trained to recognize handwritten digits w/>99% accuracy
   –  Trained to drive (Pomerleau's no-hands across America)
•  Current examples
   –  Credit risk evaluation, OCR systems, voice recognition, etc. (though not necessarily the best method for any of these tasks)
   –  Built in to many software packages, e.g., MATLAB

Neural Network Lore
•  Neural nets have been adopted with an almost religious fervor within the AI community - several times
•  Often ascribed near magical powers by people, usually those who know the least about computation or brains
•  For most AI people, magic is gone, but neural nets remain extremely interesting and useful mathematical objects

Artificial Neurons
[Diagram: node/neuron with inputs x_j, weights w_{j,i}, function h, output z_i]
$a_i = h\left(\sum_j w_{j,i} x_j\right)$
a_i is the activation level of neuron i
h can be any function, but usually a smoothed step function

Threshold Functions
[Plots: h(x) = sgn(x) (perceptron); h(x) = tanh(x) or 1/(1+exp(-x)) (logistic sigmoid)]

Network Architectures
•  Cyclic vs.
   Acyclic
   –  Cyclic is tricky, but more biologically plausible
      •  Hard to analyze in general
      •  May not be stable
      •  Need to assume latches to avoid race conditions
   –  Hopfield nets: special type of cyclic net useful for associative memory
•  Single layer (perceptron)
•  Multiple layer

Feedforward Networks
•  We consider acyclic networks
•  One or more computational layers
•  Entire network can be viewed as computing a complex non-linear function
•  Typical uses in learning:
   –  Classification (usually involving complex patterns)
   –  General continuous function approximation

Special Case: Perceptron
[Diagram: single node/neuron with inputs x_j, weights w_j, function h, output Y]
h is a simple step function (sgn)

Perceptron Learning
•  We are given a set of inputs $x^{(1)} \ldots x^{(n)}$
•  $t^{(1)} \ldots t^{(n)}$ is a set of target outputs (boolean) {-1, 1}
•  w is our set of weights
•  output of perceptron = $w^T x$
•  Perceptron_error($x^{(i)}$, w) = $-w^T x^{(i)} \cdot t^{(i)}$
•  Goal: Pick w to optimize:
   $\min_w \sum_{i \in \text{misclassified}} \text{perceptron\_error}(x^{(i)}, w)$

Update Rule
Repeat until convergence:
   $\forall i \in \text{misclassified},\ \forall j:\quad w_j \leftarrow w_j + \alpha\, x_j^{(i)} t^{(i)}$
α is the "learning rate" (can be any positive constant)
•  i iterates over samples
•  j iterates over weights
http://

Perceptron Learning: The Good News First
•  For functions that are representable using the perceptron architecture (more on this later):
   –  Perceptron learning rule converges to correct classifier for any choice of α
   –  Online classification possible for streaming data (very efficient implementation)
•  Positive perceptron results set off an explosion of research on neural networks

Perceptron Learning: Now the Bad News
•  Perceptron computes a linear function of its inputs
•  Asks if the input lies above a line (hyperplane, in general)
•  Representable functions are functions that are "linearly separable"; i.e., there exists a line (hyperplane) that separates the positive and negative examples
•  If the training data are not linearly separable:
   –  No guarantees
   –  Perceptron learning rule may produce oscillations

Visualizing Linearly Separable
Functions
[Figures: Is red linearly separable from green? Are the circles linearly separable from the squares?]

Observations
•  Linear separability is fairly weak
•  We have other tricks:
   –  Functions that are not linearly separable in one space may be linearly separable in another space
   –  If we engineer our inputs to our neural network, then we change the space in which we are constructing linear separators
   –  Every function has a linear separator (in some space)
•  Perhaps other network architectures will help

Separability in One Dimension
If we have just a single input x, there is no way a perceptron can correctly classify these data
[Figure: 1-dimensional dataset, x=0 marked]
Copyright © 2001, 2003, Andrew W. Moore

Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too, using 1, x, x^2 as inputs to the perceptron
[Figure: 1-dimensional dataset, x=0 marked]

Multilayer Networks
•  Once people realized how simple perceptrons were, they lost interest in neural networks for a while (feature engineering turned out to be impractical in many cases)
•  Multilayer networks turn out to be much more expressive (with a smoothed step function)
   –  Use sigmoid, e.g., tanh(w^T x) or logistic sigmoid: 1/(1+exp(-x))
   –  With 2 layers, can represent any continuous function
   –  With 3 layers, can represent many discontinuous functions
•  Tricky part: How to adjust the weights

Smoothing Things Out
•  Idea: Do gradient descent on a smooth error function
•  Error function is sum of squared errors
•  Consider a single training example first
[Diagram: node i (a_i, z_i) connected by weight w_ij to node j (z_j)]
   $E = 0.5\,\text{error}(X^{(i)}, w)^2$
   $a_j = \sum_i w_{ij} z_i, \qquad z_j = h(a_j)$
   $\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ij}}$
   $\frac{\partial E}{\partial a_j} = \delta_j, \qquad \frac{\partial a_j}{\partial w_{ij}} = z_i$
   $\frac{\partial E}{\partial w_{ij}} = \delta_j z_i$

Propagating Errors
[Diagram: node i (a_i) connected by weight w_ij to node j; z_j = output]
   $a_j = \sum_i w_{ij} z_i, \qquad z_j = h(a_j)$
•  For output units (assuming no weights on outputs)
   $\frac{\partial E}{\partial a_j} = \delta_j = y - t$
•  For hidden units (chain rule)
   $\frac{\partial E}{\partial a_i} = \delta_i = \sum_k \frac{\partial E}{\partial a_k}\,\frac{\partial a_k}{\partial a_i} = \sum_k \frac{\partial E}{\partial a_k}\, w_{ik}\, \frac{\partial h(a_i)}{\partial a_i} = h'(a_i) \sum_k w_{ik}\,\delta_k$
   The sum over k covers all
   upstream nodes from i

Differentiating h
•  Recall the logistic sigmoid:
   $h(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$
   $1 - h(x) = \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{x}}$
•  Differentiating:
   $h'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{(1 + e^{-x})} \cdot \frac{e^{-x}}{(1 + e^{-x})} = h(x)\,(1 - h(x))$

Putting it together
•  Apply input x to network (sum for multiple inputs)
   –  Compute all activation levels
   –  Compute final output (forward pass)
•  Compute δ for output units: $\delta = y - t$
•  Backpropagate δ's to hidden units:
   $\delta_j = \sum_k \frac{\partial E}{\partial a_k}\,\frac{\partial a_k}{\partial a_j} = h'(a_j) \sum_k w_{jk}\,\delta_k$
•  Compute gradient update:
   $\frac{\partial E}{\partial w_{ij}} = \delta_j z_i$

Summary of Gradient Update
•  Gradient calculation, parameter update have recursive formulation
•  Decomposes into:
   –  Local message passing
   –  No transcendentals:
      •  h'(x) = 1 - h(x)^2 for tanh(x)
      •  h'(x) = h(x)(1 - h(x)) for logistic sigmoid
•  Highly parallelizable
•  Biologically plausible(?)
•  Celebrated backpropagation algorithm

Good News
•  Can represent any continuous function with two layers (1 hidden)
•  Can represent essentially any function with 3 layers
•  (But how many hidden nodes?)
•  Multilayer nets are a universal approximation architecture with a highly parallelizable training algorithm

Back-prop Issues
•  Backprop = gradient descent on an error function
•  Function is nonlinear (= powerful)
•  Function is nonlinear (= local minima)
•  Big nets:
   –  Many parameters
      •  Many optima
      •  Slow gradient descent
      •  Risk of overfitting
   –  Biological plausibility ≠ Electronic plausibility
•  Many NN experts became experts in numerical analysis (by necessity)

Neural Network Tricks
•  Many gradient descent acceleration tricks
•  Early stopping (prevents overfitting)
•  Methods of enforcing transformation invariance (e.g.
   if you have symmetric inputs)
   –  Modify error function
   –  Transform/augment training data
   –  Weight sharing
•  Handcrafted network architectures

Neural Nets in Practice
•  Many applications for pattern recognition tasks
•  Very powerful representation
   –  Can overfit
   –  Can fail to fit with too many parameters, poor features
•  Very widely deployed AI technology, but
   –  Few open research questions (Best way to get a machine learning paper rejected: "Neural Network" in title.)
   –  Connection to biology still uncertain
   –  Results are hard to interpret
•  "Second best way to solve any problem"
   –  Can do just about anything w/enough twiddling
   –  Now third or fourth to SVMs, boosting, and ???
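
The perceptron update rule from the "Update Rule" slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the course. It assumes the online variant, where each misclassified sample updates w immediately, and a tie at w^T x = 0 is treated as -1. The function names, the toy AND dataset, and the constant bias feature are all hypothetical choices made for the example.

```python
def predict(w, x):
    """Perceptron output sgn(w^T x), returned as -1 or +1 (0 maps to -1)."""
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if s > 0 else -1


def train_perceptron(xs, ts, alpha=1.0, max_epochs=100):
    """Repeat w_j <- w_j + alpha * x_j * t over misclassified samples.

    Stops when a full pass makes no mistakes or after max_epochs passes.
    """
    w = [0.0] * len(xs[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(xs, ts):          # i iterates over samples
            if predict(w, x) != t:
                # j iterates over weights
                w = [wj + alpha * xj * t for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


# Hypothetical toy data: logical AND with targets in {-1, +1}.
# The leading 1 in each input acts as a constant bias feature.
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
ts = [-1, -1, -1, 1]
w = train_perceptron(xs, ts)
print(w, [predict(w, x) for x in xs])
```

AND is linearly separable, so the rule is expected to stop with every sample classified correctly. On data that are not linearly separable the loop would simply run out of epochs.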
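
The backpropagation recipe from the "Putting it together" slide can be checked numerically. The sketch below assumes one hidden layer of logistic-sigmoid units and a single linear output with no output nonlinearity, so that δ = y - t at the output. The layout of `W1` and `w2`, the example input, and the weights are hypothetical values chosen for illustration. The computed gradient for one weight is compared against a finite-difference estimate.

```python
import math


def h(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W1, w2):
    """Forward pass: a_j = sum_i W1[i][j] x_i, z_j = h(a_j), y = sum_j w2[j] z_j."""
    n_hidden = len(w2)
    a = [sum(W1[i][j] * x[i] for i in range(len(x))) for j in range(n_hidden)]
    z = [h(aj) for aj in a]
    y = sum(w2[j] * z[j] for j in range(n_hidden))
    return z, y


def error(x, t, W1, w2):
    """E = 0.5 * (y - t)^2 for a single training example."""
    _, y = forward(x, W1, w2)
    return 0.5 * (y - t) ** 2


def backprop(x, t, W1, w2):
    """Return dE/dW1 and dE/dw2 using the delta recursion."""
    z, y = forward(x, W1, w2)
    delta_out = y - t                                   # output unit
    # Hidden units: delta_j = h'(a_j) * w_jk * delta_k, with h' = z (1 - z)
    delta_hid = [z[j] * (1.0 - z[j]) * w2[j] * delta_out for j in range(len(w2))]
    grad_w2 = [delta_out * z[j] for j in range(len(w2))]
    grad_W1 = [[delta_hid[j] * x[i] for j in range(len(w2))] for i in range(len(x))]
    return grad_W1, grad_w2


# Hypothetical example: 3 inputs, 2 hidden units.
x, t = [1.0, 0.5, -1.5], 0.8
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w2 = [0.7, -0.6]
grad_W1, grad_w2 = backprop(x, t, W1, w2)

# Central finite difference on one weight as a sanity check.
eps = 1e-6
W1[2][1] += eps
e_plus = error(x, t, W1, w2)
W1[2][1] -= 2 * eps
e_minus = error(x, t, W1, w2)
W1[2][1] += eps                                         # restore the weight
numeric = (e_plus - e_minus) / (2 * eps)
print(grad_W1[2][1], numeric)
```

The two printed numbers should agree to several decimal places. A gradient-descent step would then subtract a small multiple of each gradient entry from the corresponding weight.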

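
The "Harder 1-dimensional dataset" slide suggests feeding 1, x, x^2 to the perceptron. The sketch below illustrates that idea on a made-up dataset, since the original figure is not recoverable: points with |x| >= 2 are labelled +1 and the rest -1. The function names and the data are assumptions for illustration only. The same perceptron rule is trained once on the raw input (with a bias) and once on the basis-function inputs.

```python
def sgn(s):
    """Step function with sgn(0) taken as -1."""
    return 1 if s > 0 else -1


def dot(w, phi):
    return sum(wj * pj for wj, pj in zip(w, phi))


def train(features, targets, alpha=1.0, max_epochs=1000):
    """Online perceptron rule: w <- w + alpha * phi * t on each mistake."""
    w = [0.0] * len(features[0])
    for _ in range(max_epochs):
        updated = False
        for phi, t in zip(features, targets):
            if sgn(dot(w, phi)) != t:
                w = [wj + alpha * pj * t for wj, pj in zip(w, phi)]
                updated = True
        if not updated:
            break
    return w


def count_errors(w, features, targets):
    return sum(1 for phi, t in zip(features, targets) if sgn(dot(w, phi)) != t)


# Hypothetical 1-D data: positives on both sides of the negatives.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ts = [1, -1, -1, -1, 1]

raw = [(1.0, x) for x in xs]              # bias and x only
basis = [(1.0, x, x * x) for x in xs]     # 1, x, x^2 as on the slide

errors_raw = count_errors(train(raw, ts), raw, ts)
errors_basis = count_errors(train(basis, ts), basis, ts)
print(errors_raw, errors_basis)
```

No single threshold on x can separate these labels, so the raw-input perceptron always leaves at least one mistake, while a threshold on x^2 separates them and the rule converges in the 1, x, x^2 space.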
This note was uploaded on 02/17/2012 for the course COMPSCI 170 taught by Professor Parr during the Spring '11 term at Duke.