l23_nr_mdp - Markov Decision Processes (and a small amount of reinforcement learning)
Markov Decision Processes
(and a small amount of reinforcement learning)
Nicholas Roy, 16.410/13, Session 23
Slides adapted from: Brian Williams, MIT; Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU

How Should a Rover Search for Its Landing Craft?
• State Space Search?
• As a Constraint Satisfaction Problem?
• Goal-directed Planning?
• Linear Programming?
• Is the real world well-behaved?
[Figure: rover searching for its Landing Craft]
How Should a Rover Search for Its Landing Craft?
• What if each action can have one of a set of different outcomes?
• What if the outcomes occur probabilistically?
[Figure: rover searching for its Landing Craft]
Ideas in this Lecture
• The problem is to accumulate rewards, rather than to achieve goal states.
• The approach is to generate reactive policies for how to act in all situations, rather than plans for a single starting situation.
• Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state.
• Value functions are iteratively approximated.

The MDP Problem
• The agent and environment interact in a loop: the agent observes the state and reward, then chooses an action.
• The interaction produces a trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
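The preview never writes "lifetime reward" out as a formula; a minimal sketch of the standard discounted formulation is below (the discount factor gamma is an assumption here, not shown on these slides):

```latex
% Value of following policy \pi from state s: expected discounted sum
% of rewards along the trajectory s_0, a_0, r_0, s_1, a_1, r_1, ...
% (\gamma is an assumed discount factor, not shown in this preview)
V^{\pi}(s) = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\; a_t = \pi(s_t) \right],
\qquad 0 \le \gamma < 1
```

The "greatest lifetime reward achievable at every state" is then the optimal value function, obtained by maximizing this quantity over policies.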
Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• (Probabilistic) state transitions, T(s_i, a_j, s_k)
• Reward for each state and action, R(s_i, a_i)

Process:
• Observe state s_t in S
• Choose action a_t in A
• Receive immediate reward r_t
• State changes to some s_{t+1} according to T(s_t, a_t, s_{t+1})

Example: [Figure: transition diagram starting from state s1 with action a1; legal transitions shown; rewards of 10 on some transitions, and rewards on unlabeled transitions are 0]

MDP Environment Assumptions
• Markov Assumption: the next state and reward are a function only of the current state and action:
  p(s_{t+1} | a_t, s_t, a_{t-1}, s_{t-1}, a_{t-2}, ...) = p(s_{t+1} | a_t, s_t)
  r(s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, ...) = r(s_t, a_t)
• Uncertain and Unknown Environment: p(s_{t+1} | a_t, s_t) and r may be nondeterministic and unknown.
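To make the Model and Process bullets concrete, here is a minimal Python sketch of a tabular MDP; the particular states, actions, probabilities, rewards, and the step helper are illustrative assumptions, not taken from the slides:

```python
import random

# Tabular MDP: finite states S, finite actions A,
# transitions T[s][a] = {s_next: probability}, rewards R[(s, a)].
S = ["s1", "s2", "s3"]
A = ["a1", "a2"]
T = {
    "s1": {"a1": {"s2": 0.8, "s1": 0.2}, "a2": {"s3": 1.0}},
    "s2": {"a1": {"s3": 1.0},            "a2": {"s1": 1.0}},
    "s3": {"a1": {"s3": 1.0},            "a2": {"s1": 1.0}},
}
R = {("s1", "a1"): 0.0,  ("s1", "a2"): 0.0,
     ("s2", "a1"): 10.0, ("s2", "a2"): 0.0,
     ("s3", "a1"): 0.0,  ("s3", "a2"): 0.0}

def step(s, a):
    """One turn of the Process: receive the immediate reward R(s, a),
    then move to a successor state drawn according to T(s, a, .)."""
    r = R[(s, a)]
    successors, probs = zip(*T[s][a].items())
    s_next = random.choices(successors, weights=probs)[0]
    return s_next, r
```

With these (made-up) tables, step("s1", "a1") returns ("s2", 0.0) with probability 0.8 and ("s1", 0.0) with probability 0.2, matching the Markov assumption: the outcome depends only on the current state and action.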
So what is the solution to an MDP?
• An MDP solution is a policy π : S → A
  • Selects an action for each state.
• Optimal policy π* : S → A
  • Selects the action for each state that maximizes lifetime reward.
• There are many policies; not all are necessarily optimal.
[Figure: grid worlds with goal states G and rewards of 10 on the transitions into them, illustrating several distinct policies]
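The slides say policies fall out of value functions and that value functions are iteratively approximated. A minimal value-iteration sketch over the tabular MDP above shows one way this works; gamma and eps are assumed parameters, and the slides do not name this specific algorithm in the preview:

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    """Iteratively approximate the optimal value function, then
    read off the greedy (optimal) policy pi*: S -> A."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup: best one-step lookahead over actions.
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    # The policy "falls out" of the value function: in each state,
    # pick the action whose one-step lookahead value is greatest.
    pi = {}
    for s in S:
        pi[s] = max(A, key=lambda a: R[(s, a)] + gamma *
                    sum(p * V[s2] for s2, p in T[s][a].items()))
    return V, pi

V, pi = value_iteration(S, A, T, R)  # uses the tabular MDP sketched earlier
```

Note that pi maps every state to an action, so it prescribes behavior for all situations, not just for a single starting state.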