Adaptive_Critic__Class_presentation111902


Creative Learning for Intelligent Autonomous Mobile Robots: Beyond Adaptive Critic
Xiaoqun Liao and Ernie Hall
Center of Robotics Research, University of Cincinnati
http://www.robotics.uc.edu

Outline
- Artificial Neural Networks
- Dynamic Programming and Learning Algorithms
- Adaptive Critic Learning
- Adaptive Intelligent Robot Controller
- Creative Learning and Control
- Future Research

[Slide: example large-scale decision problems — trains and locomotives; Air Mobility Command: cargo holding, fuel, maintenance, ramp space, cargo handling]

The Brain As a Whole System
[Figure: sensory input, reinforcement, and action signals (Paul Werbos)]
The brain as a whole system is an intelligent controller.

Robot Control Strategies
- The human brain has been the model information-processing device for intelligent computers, or neural computers.
- A neural computer is a large interconnected mass of simple processing elements, or artificial neurons.
- The functionality of this mass is called an artificial neural network.

Artificial Neural Networks - Topology
One-hidden-layer sigmoidal neural network:
- Input: p ∈ ℜ^q; output: z = NN(p)
- Input weight matrix: W ≡ {w_ij} (s × q); output weight vector: v
- Input and output biases: d and b, with v, d ∈ ℜ^s
[Figure: network with hidden-node input variables n_1 … n_s, sigmoids σ_1 … σ_s, weights w, v, and biases d, b]
The hidden nodes use the sigmoid
σ(n) = (e^n − 1)/(e^n + 1), −∞ < n < ∞, so that −1 < σ(n) < 1.

Artificial Neural Networks - Learning Rules
- Supervised learning: the desired output of the neuron is known, e.g. provided by training samples.
- Unsupervised learning: there are no teaching examples; built-in rules are used for self-modification.
- Reinforcement learning: adaptive critic learning.
- Creative learning.

Learning & Approximation for Maximizing Utility Over Time
ANNs can provide suitable solutions for problems that are generally characterized by:
1) nonlinearities;
2) high dimensionality;
3) noisy, complex, imprecise, imperfect and/or error-prone sensor data;
4) a lack of a clearly stated mathematical solution or algorithm.
There is only one exact method for solving problems of optimization over time in the general case of nonlinearity with random disturbance: dynamic programming (DP). DP combines a model of reality, a utility function U, and a secondary (strategic) utility function J.

Adaptive Critic
In dynamic programming, the user normally provides the function U(X(t), u(t)), an interest rate r, and a stochastic model. The analyst then tries to solve for another function J(X(t)) so as to satisfy some form of the Bellman equation. The nonlinear function approximator J(X, W), defined by a set of parameters or weights W, is called a "critic". If the weights W are adapted or iteratively solved for, in real-time learning or offline iteration, we call the critic an adaptive critic. (Paul Werbos)

Adaptive Critic (continued)
A critic provides a grade to the controller of an action module such as a robot, e.g. by a distance criterion. A critic may be considered a teacher, programmer, customer, plant manager, or quality inspector.
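As a concrete illustration of the one-hidden-layer sigmoidal network described above, the sketch below computes z = NN(p) using the slide's notation (input weights W, input biases d, output weights v, output bias b) and the sigmoid σ(n) = (e^n − 1)/(e^n + 1). The specific sizes and weight values are illustrative assumptions only.

```python
import math

def sigma(n):
    # Hidden-node sigmoid from the slides: (e^n - 1)/(e^n + 1),
    # bounded in (-1, 1). Algebraically this equals tanh(n / 2).
    return (math.exp(n) - 1.0) / (math.exp(n) + 1.0)

def nn(p, W, d, v, b):
    """z = NN(p) for a one-hidden-layer sigmoidal network.

    W: s x q input weight matrix (list of s rows), d: s input biases,
    v: s output weights, b: scalar output bias -- the slide's notation.
    """
    # Hidden-node input variables n_1 ... n_s, then sigmoidal activations.
    hidden = [sigma(sum(wij * pj for wij, pj in zip(row, p)) + di)
              for row, di in zip(W, d)]
    # Scalar output: weighted sum of hidden activations plus output bias.
    return sum(vi * hi for vi, hi in zip(v, hidden)) + b

# Tiny illustrative network with q = 2 inputs and s = 2 hidden nodes
# (weights chosen arbitrarily, only for demonstration).
W = [[0.5, -0.3], [0.2, 0.8]]
d = [0.1, -0.1]
v = [1.0, -1.0]
b = 0.05
z = nn([1.0, 2.0], W, d, v, b)
```

The bounded, zero-centered activation is what lets the hidden layer approximate smooth nonlinear maps; note that σ here is simply tanh(n/2) in disguise.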
If the ultimate critic is the consumer, then the quality inspector must model the consumer's decisions and use this model in the design and manufacturing operations.

Levels of Adaptive Critics
- Level 1: BSA, the model-free adaptive critic. The critic outputs Ĵ(t) from the state X(t); the action network outputs u(t).
- Level 2: action-dependent adaptive critic. The J' critic predicts J'(X(t), u(t)), and u(t) is trained to maximize the predicted J'(X(t), u(t)).
- Level 3: HDP + BAC, model-based dynamic programming. The system is designed to train the critic to approximate J, with the derivatives calculated by generalized backpropagation. (P. Werbos)

Next Level: Heuristic Dynamic Programming (HDP)
a) Critic adaptation: the critic output J(t+1) is necessary in order to form the training signal γJ(t+1) + U(t), which is the target value for J(t).
b) Action adaptation: R is a vector of observables and A is a control vector; ∂J(t)/∂A(t) is used as the error signal to train the action network to minimize J.
[Figure: critic-adaptation and action-adaptation loops through the model, action, and critic networks]

Dual Heuristic Programming (DHP)
DHP has a critic network that estimates the ordered derivatives of J* with respect to the vector R,
λ_j(t) = ∂J(t)/∂R_j(t),
and it is a form of indirect adaptive control. Application of the chain rule for derivatives yields

dJ(t+1)/dR_j(t) = Σ_{i=1..n} λ_i(t+1)·∂R_i(t+1)/∂R_j(t) + Σ_{i=1..n} Σ_{k=1..m} λ_i(t+1)·∂R_i(t+1)/∂A_k(t)·∂A_k(t)/∂R_j(t),

where λ_i(t+1) = dJ(t+1)/dR_i(t+1), and n, m are the numbers of outputs of the model and the action networks, respectively. The critic's error term is

E2_j(t) = γ·dJ(t+1)/dR_j(t) + ∂U(t)/∂R_j(t) + Σ_{k=1..m} ∂U(t)/∂A_k(t)·∂A_k(t)/∂R_j(t) − ∂J(t)/∂R_j(t).

Globalized Dual Heuristic Programming (GDHP)
[Figure: an HDP-style critic and a dual network compared with a GDHP-style critic that outputs both J(t) and ∂⁺J(t)/∂Y(t) from Y(t); critic adaptation in the general GDHP design combines adaptation signals from J(t+1), U(t), and their derivatives with respect to R(t)]
Training the critic network in GDHP utilizes an error measure that is a combination of the error measures of HDP and DHP. This results in the following LMS update rule for the critic's weights:

ΔW_c = −η_1·(J(t) − γJ(t+1) − U(t))·dJ(t)/dW_c − η_2·Σ_{j=1..n} E2_j(t)·d(∂J(t)/∂R_j(t))/dW_c,

where E2_j(t) is the DHP error term and η_1, η_2 are learning rates.

To Move Beyond ACD
- Clear need for more advanced architectures
- Cellular structures necessary
- Same with SRNs (Werbos & Pang, '96 & '98)
- Combine them and ACDs
(D. Wunsch, NSF workshop, 2002)

Creative Learning
- Explore the unpredictable environment
- Permit the discovery of unknown problems
- Generalize the highest level of human learning: imagination
- Use domain knowledge to escape local optima
Experience with the guidance of a mobile robot has motivated this study to progress from simple line following to the more complex navigation and control in an unstructured environment.

Intelligent Robot Controller
Neural network approaches to robot control:
- Supervised control: a trainable controller that allows responsiveness to sensory inputs
- Direct inverse control: trained on the inverse dynamics of the robot
- Neural adaptive control: neural networks combined with adaptive controllers give greater robustness and the ability to handle nonlinearity
- Backpropagation of utility: involves information flowing backward through time
- Adaptive critic method: uses a critic to evaluate robot performance during training
- Creative control: involves self-initiating and corrective action

Robot Controller Learning
- Primary controller: a feedforward (inner-loop) design to track the desired trajectory Yd under ideal conditions.
- Secondary controller: a feedback (outer-loop) design to compensate for undesirable deviations (disturbances), using the sensors.

Recurrent Neural Controller
A neural learning controller is designed to utilize the data available from repetition in robot operation. Based on a recurrent neural network architecture, it has a time-variant feature: once one trajectory is learned, a second one should be learned in a shorter time.

Creative Controller for Intelligent Robot
A creative controller is designed to integrate domain knowledge, a criteria database, and a task control center into the adaptive critic neural network controller. It needs a well-defined structure appropriate to the autonomous mobile robot application.
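The HDP critic adaptation described above — moving J(t) toward the target γJ(t+1) + U(t) — can be sketched in a few lines. The linear critic over a scalar observable, the toy plant model, and the utility function below are illustrative assumptions, not part of the original controller.

```python
GAMMA = 0.95      # discount factor (the gamma in the training signal)
ETA = 0.05        # critic learning rate

def critic(w, r):
    # Linear critic J(t) = w0 + w1 * R(t) over a scalar observable R.
    return w[0] + w[1] * r

def hdp_critic_step(w, r_t, r_next, u_t):
    """One LMS update of the critic weights.

    Error e = (GAMMA * J(t+1) + U(t)) - J(t); move w along the gradient
    of J(t) with respect to w, scaled by the learning rate.
    """
    target = GAMMA * critic(w, r_next) + u_t
    e = target - critic(w, r_t)
    return [w[0] + ETA * e * 1.0,     # dJ/dw0 = 1
            w[1] + ETA * e * r_t]     # dJ/dw1 = R(t)

# Toy rollout: the state decays toward 0 and the utility penalizes large
# states, so the critic should learn that states near 0 are better.
w = [0.0, 0.0]
r = 1.0
for _ in range(200):
    r_next = 0.9 * r                  # assumed plant model R(t+1) = 0.9 R(t)
    u = -r * r                        # assumed utility U(t) = -R(t)^2
    w = hdp_critic_step(w, r, r_next, u)
    r = r_next if r_next > 1e-3 else 1.0   # restart the episode near 0
```

DHP and GDHP differ from this sketch only in the error signal: they fit the derivatives ∂J/∂R_j (or both J and its derivatives) rather than J alone.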
Creative Learning Architecture
[Figure: criteria filters selecting among Critic 1 … Critic n from a criteria (critic) knowledge database; an adaptive critic learning system with critic network, utility function, action network, and model-based action; states X_k, X_dk, X_dk+1; outputs J(t), J(t+1); a task control center; and a z⁻¹ delay element]

Task Control Center
The task control center integrates domain knowledge and the criteria database. It provides a variety of control constructs that are commonly needed in mobile robot applications and other autonomous mobile systems. The goal of the center is to enable an autonomous mobile robot system to easily specify hierarchical task-decomposition strategies, such as how to navigate to a particular location, how to collect a desired sample, or how to follow a track in an unstructured environment.
[Figure: the task control center exchanges goals and J values with the dynamic knowledge database and the adaptive critic controller; task control management uses inter-process communication and a Task Description Language (TDL) linking multiple task programs, the dynamic database, modeling, and analysis modules]

Adaptive Critic Controller - Online Training
[Figure: online training structure with critic, action, and model networks, a performance evaluator, and weight-tuning terms involving Ŵ1, V̂1, σ'(x1), and gain Kv]
Example: an n-link robot arm manipulator performing trajectory tracking, with dynamics equation
M(q)q̈ + V(q, q̇) + F(q, q̇) + G(q) + τ_d = τ.

Intelligent Robot Learning
- Mission for the robot, e.g. a mobile robot
- Task for the robot, J: task control
- Track for the robot to follow
- Learn the nonlinear system model (model discovery)
- Learn unknown parameters such as kinematic and dynamic parameters

Future Research Discussion
- Task control management: planning & execution
- Domain knowledge training (learning)
- Well-defined structures for creative learning
- Combine all the components of adaptive critics
- Intelligent mobile robot as a test bed
- More simulations and real implementations

Q & A?
www.robotics.uc.edu
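The criteria-filter idea in the creative learning architecture — several candidate critics held in a criteria knowledge database, with a filter selecting the critic that matches the task issued by the task control center — might be sketched as follows. Every name, criterion, and the table-lookup selection rule here is a hypothetical illustration, not the authors' implementation.

```python
def distance_critic(state):
    # Grade by a distance-to-goal criterion (closer -> higher J).
    return -abs(state["distance_to_goal"])

def track_critic(state):
    # Grade by lateral deviation from the track being followed.
    return -abs(state["track_offset"])

# Criteria (critic) knowledge database: task name -> critic (Critic 1 ... Critic n).
CRITIC_DATABASE = {
    "navigate": distance_critic,
    "follow_track": track_critic,
}

def criteria_filter(task):
    # Select the critic whose criterion matches the current task.
    return CRITIC_DATABASE[task]

def evaluate(task, state):
    # The task control center issues a task; the filtered critic grades J(t).
    return criteria_filter(task)(state)

state = {"distance_to_goal": 2.0, "track_offset": 0.1}
j_nav = evaluate("navigate", state)        # graded by the distance criterion
j_track = evaluate("follow_track", state)  # graded by the track criterion
```

The point of the lookup is that the grading signal, not just the controller, changes with the task — which is what distinguishes this architecture from a single fixed-utility adaptive critic.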

This note was uploaded on 10/13/2010 for the course MINE 636 taught by Professor Hall during the Fall '06 term at University of Cincinnati.
