07-reinforcement


9/28/2009

Markov Decision Processes and Reinforcement Learning
CS4700 – Fall 2009
Jesse Simons (based on notes by T. Joachims)

Reinforcement Learning

Problem
– Make sequence of decisions (policy) to get to goal / maximize utility
Search Problems so far
– Known environment
  • State space
  • Consequences of actions
– Known utility / cost function
– First compute the sequence of decisions, then execute (potentially recompute)
Real-World Problems
– Environment is unknown a priori and needs to be explored
– Utility function unknown – only examples are available for some states
  • No feedback on individual actions
  • Learn to act and to assign blame/credit to individual actions
– Need to quickly react to unforeseen events (have learned what to do)

Reinforcement Learning Issues

– Agent can be passive (watch) or active (explore)
– Feedback (i.e. rewards) in terminal states only; or a bit of feedback in any state
– How to measure and estimate the utility of each action
– Environment fully observable, or partially observable
– Have model of environment and effects of action…or not
Reinforcement Learning will address these issues!

Markov Decision Process

Representation of Environment:
– finite set of states S
– set of actions A for each state s in S
Process
– At each discrete time step, the agent
  • observes state s_t in S and then
  • chooses action a_t in A.
– After that, the environment
  • gives agent an immediate reward r_t
  • changes state to s_{t+1} (can be probabilistic)
– Examples

Markov Decision Process

Model:
– Initial state: S_0
– Transition function: T(s,a,s’)
  T(s,a,s’) is the probability of moving from state s to s’ when executing action a.
– Reward function: R(s)
  Real valued reward that the agent receives for entering state s.
Assumptions
– Markov property: T(s,a,s’) and R(s) only depend on current state s, but not on any states visited earlier.
– Extension: Function R may be non-deterministic as well

Utilities

Rating a state sequence [s_0, s_1, s_2, …]
We want preferences to be stationary
– If [s_0, s_1, s_2, …] is better than [s_0, s’_1, s’_2, …], then [s_1, s_2, …] is better than [s’_1, s’_2, …]
Two ways for stationary utility
– Additive rewards:
  • U_h([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + …
– Discounted rewards:
  • U_h([s_0, s_1, s_2, …]) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + …

Reward vs Utility
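The additive and discounted utility definitions above can be compared on a concrete reward sequence. The following is a minimal sketch, not part of the original slides: the function names `additive_utility` and `discounted_utility`, the sample reward list, and the choice γ = 0.9 are all assumptions made for illustration.

```python
from typing import Sequence


def additive_utility(rewards: Sequence[float]) -> float:
    """U_h = R(s_0) + R(s_1) + R(s_2) + ... (no discounting)."""
    return sum(rewards)


def discounted_utility(rewards: Sequence[float], gamma: float) -> float:
    """U_h = R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical reward sequence R(s_0), R(s_1), ...: three ordinary states
# followed by a terminal state with reward +1 (values chosen for illustration).
rewards = [-0.04, -0.04, -0.04, 1.0]

print("additive:  ", additive_utility(rewards))
print("discounted:", discounted_utility(rewards, gamma=0.9))
```

With γ = 1 the two functions agree, so additive rewards can be read as the special case of discounted rewards without discounting. With γ < 1, rewards that arrive later contribute less to the utility of the sequence.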

Example

[Figure: grid world with numbered rows and columns, a START cell, terminal cells marked +1 and -1, and move probabilities 0.8 / 0.1 / 0.1]

• move into desired direction with prob 80%
• move 90 degrees to left with prob 10%
• move 90 degrees to right with prob 10%
Reward:
– In terminal states reward of +1 / -1 and agent gets “stuck”
– Each other state has a reward of -0.04.
What is the probability that [up, up, right, right, right] ends in (4,3)?

Policy

Definition:
– A policy π describes which action an agent selects in
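The grid-world question above (the probability that [up, up, right, right, right] ends in (4,3)) can be answered by pushing a probability distribution over states through the transition function T(s,a,s’) one action at a time. The sketch below is not from the slides, and the figure did not survive extraction, so the layout is an assumption: the familiar textbook 4x3 grid with states written as (column, row), START at (1,1), an obstacle at (2,2), the +1 terminal at (4,3), and the -1 terminal at (4,2). It further assumes that a move into a wall or the obstacle leaves the agent where it is. All names in the code are hypothetical.

```python
from collections import defaultdict

# Assumed layout (the original figure is not recoverable): 4 columns x 3 rows,
# states are (column, row), obstacle at (2, 2), terminals at (4, 3) and (4, 2).
COLS, ROWS = 4, 3
OBSTACLES = {(2, 2)}
TERMINALS = {(4, 3), (4, 2)}

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# Direction reached by turning 90 degrees left / right of the intended one.
LEFT_OF = {"up": "left", "left": "down", "down": "right", "right": "up"}
RIGHT_OF = {"up": "right", "right": "down", "down": "left", "left": "up"}


def step(state, direction):
    """Deterministic move; bumping into a wall or obstacle keeps the state (assumption)."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt not in OBSTACLES else state


def transition(state, action):
    """Return {s': T(s, a, s')} using the 80% / 10% / 10% model from the slide."""
    if state in TERMINALS:
        return {state: 1.0}  # the agent gets "stuck" in terminal states
    dist = defaultdict(float)
    for direction, prob in ((action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)):
        dist[step(state, direction)] += prob
    return dict(dist)


def state_distribution(start, actions):
    """Distribution over states after executing a fixed action sequence from start."""
    belief = {start: 1.0}
    for action in actions:
        nxt = defaultdict(float)
        for state, p in belief.items():
            for state2, q in transition(state, action).items():
                nxt[state2] += p * q
        belief = dict(nxt)
    return belief


plan = ["up", "up", "right", "right", "right"]
final = state_distribution((1, 1), plan)
print("P(end in (4,3)) =", final.get((4, 3), 0.0))
```

Under these assumptions the result is about 0.32776: 0.8^5 for the intended path along the top row, plus 0.1^4 × 0.8 for the unlikely path that slips around the obstacle along the bottom row. A different layout or bump rule would change the number.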

This note was uploaded on 05/30/2010 for the course CS 4700 taught by Professor Joachims during the Fall '07 term at Cornell University (Engineering School).
