# mdps-f10: Markov Decision Processes


## Markov Decision Processes

*Based in part on slides by Alan Fern, Craig Boutilier, and Daniel Weld.*

## Atomic Model for Stochastic Environments with Generalized Rewards

- Deterministic worlds + goals of attainment
  - Atomic model: graph search
  - Propositional models: the PDDL planning that we discussed
- Stochastic worlds + generalized rewards
  - An action can take you to any one of a set of states, each with known probability
  - You get a reward for each state you visit
  - The objective is to maximize your "cumulative" reward
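"Cumulative" reward just means summing the per-state rewards along whatever trajectory the agent happens to follow. A minimal sketch (the state names and reward values are hypothetical):

```python
def cumulative_reward(visited_states, reward):
    """Sum of per-state rewards along a trajectory: the quantity the
    agent tries to maximize in the generalized-reward setting."""
    return sum(reward[s] for s in visited_states)

reward = {"s0": 0.0, "s1": 1.0, "goal": 10.0}   # assumed reward table
trajectory = ["s0", "s1", "s1", "goal"]
print(cumulative_reward(trajectory, reward))    # 12.0
```

Unlike goal-of-attainment planning, a trajectory that never reaches `goal` still earns partial credit for the states it visits.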

## Optimal Policies Depend on the Horizon and on Rewards
## Classical Planning Assumptions

(Figure: agent-environment loop of percepts and actions.) Percepts are perfect; the world is fully observable; actions are instantaneous and deterministic; the agent is the sole source of change.

## Stochastic/Probabilistic Planning: the Markov Decision Process (MDP) Model

(Figure: the same agent-environment loop.) Percepts are perfect; the world is fully observable; actions are instantaneous but stochastic; the agent is the sole source of change.
## Types of Uncertainty

- Disjunctive (used by non-deterministic planning): the next state could be any one of a set of states.
- Stochastic/probabilistic: the next state is drawn from a probability distribution over the set of states.

How are these models related?
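One way to see the relationship: every stochastic model induces a disjunctive one by forgetting the probabilities and keeping only the support of each distribution. A minimal sketch (the transition table `T` and its states are hypothetical):

```python
def disjunctive_successors(T, s, a):
    """Recover the disjunctive (non-deterministic) model from a stochastic one:
    the set of successor states reachable with non-zero probability."""
    return {s2 for s2, p in T[(s, a)].items() if p > 0}

# Hypothetical stochastic model: T[(s, a)] maps successor state -> probability.
T = {("s0", "go"): {"s1": 0.8, "s2": 0.2, "s3": 0.0}}
print(disjunctive_successors(T, "s0", "go"))  # {'s1', 's2'}
```

The reverse direction loses information: a disjunctive model alone does not determine the probabilities.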

## Markov Decision Processes

An MDP has four components: S, A, R, T.

- (finite) state set S (|S| = n)
- (finite) action set A (|A| = m)
- (Markov) transition function T(s, a, s') = Pr(s' | s, a)
  - The probability of going to state s' after taking action a in state s
  - How many parameters does it take to represent? (The full table has n·m·n probabilities; n·m·(n-1) of them are free, since each row must sum to 1.)
- bounded, real-valued (Markov) reward function R(s)
  - The immediate reward we get for being in state s
  - For example, in a goal-based domain R(s) may equal 1 for goal states and 0 for all others
  - Can be generalized to include action costs: R(s, a)
  - Can be generalized to be a stochastic function

This definition easily generalizes to countable or continuous state and action spaces (but the algorithms will be different).
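The four components can be written down directly as tables. A minimal sketch of a tabular MDP (the three-state domain, its names, and its numbers are all hypothetical, chosen only to illustrate the parameter count):

```python
# A tiny goal-based MDP: S, A, R, T as plain Python tables.
S = ["s0", "s1", "goal"]                  # state set, |S| = n
A = ["left", "right"]                     # action set, |A| = m
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}   # R(s) = 1 at goal states, 0 elsewhere

# T[s][a][s'] = Pr(s' | s, a); each row T[s][a] is a distribution.
T = {
    "s0":   {"left":  {"s0": 1.0, "s1": 0.0, "goal": 0.0},
             "right": {"s0": 0.2, "s1": 0.8, "goal": 0.0}},
    "s1":   {"left":  {"s0": 0.8, "s1": 0.2, "goal": 0.0},
             "right": {"s0": 0.0, "s1": 0.1, "goal": 0.9}},
    "goal": {"left":  {"s0": 0.0, "s1": 0.0, "goal": 1.0},
             "right": {"s0": 0.0, "s1": 0.0, "goal": 1.0}},
}

n, m = len(S), len(A)
print(n * m * n)  # 18 stored probabilities in the full table
# Sanity check: every row of T sums to 1.
assert all(abs(sum(T[s][a].values()) - 1.0) < 1e-9 for s in S for a in A)
```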
## Graphical View of MDP

(Figure: a chain S_t → S_{t+1} → S_{t+2} → ..., where each transition depends on the action A_t taken at time t, and each state S_t yields a reward R_t.)

## Assumptions

- First-order Markovian dynamics (history independence)
  - Pr(S_{t+1} | A_t, S_t, A_{t-1}, S_{t-1}, ..., S_0) = Pr(S_{t+1} | A_t, S_t)
  - The next state depends only on the current state and current action.
- First-order Markovian reward process
  - Pr(R_t | A_t, S_t, A_{t-1}, S_{t-1}, ..., S_0) = Pr(R_t | A_t, S_t)
  - The reward depends only on the current state and action.
  - As described earlier, we assume the reward is specified by a deterministic function R(s), i.e., Pr(R_t = R(S_t) | A_t, S_t) = 1.
- Stationary dynamics and reward
  - Pr(S_{t+1} | A_t, S_t) = Pr(S_{k+1} | A_k, S_k) for all t, k
  - The world dynamics do not depend on the absolute time.
- Full observability
  - Though we can't predict exactly which state we will reach when we execute an action, once it is realized we know what it is.
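These assumptions are exactly what makes an MDP easy to simulate: to roll out a trajectory, a simulator only ever needs the current state and action, and the same transition table works at every time step. A sketch, using a hypothetical two-state chain as the example domain:

```python
import random

def sample_trajectory(T, R, policy, s0, horizon, rng):
    """Roll out a policy for `horizon` steps. The successor is drawn from
    T[s][a], so it depends only on the current state and action
    (first-order Markov), and T itself never changes (stationarity)."""
    s = s0
    states, rewards = [s0], [R[s0]]
    for _ in range(horizon):
        a = policy(s)
        successors = list(T[s][a])
        probs = [T[s][a][s2] for s2 in successors]
        s = rng.choices(successors, weights=probs)[0]
        states.append(s)
        rewards.append(R[s])
    return states, rewards

# Hypothetical chain: action "go" moves s0 -> s1 and then stays in s1.
T = {"s0": {"go": {"s1": 1.0}}, "s1": {"go": {"s1": 1.0}}}
R = {"s0": 0.0, "s1": 1.0}
states, rewards = sample_trajectory(T, R, lambda s: "go", "s0", 3, random.Random(0))
print(states)   # ['s0', 's1', 's1', 's1']
print(rewards)  # [0.0, 1.0, 1.0, 1.0]
```

Full observability shows up here too: after each draw the realized state `s` is known exactly, even though it could not be predicted beforehand.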
## Policies ("plans" for MDPs)

- Nonstationary policy (even though we have stationary dynamics and reward?)
  - π : S × T → A, where T is the set of non-negative integers
  - π(s, t) is the action to do at state s with t stages-to-go
  - What if we want to keep acting indefinitely?
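A nonstationary policy is just a function of two arguments, state and stages-to-go. A sketch with a hypothetical pair of actions, showing why the extra argument matters even under stationary dynamics: with little time left, the best action can change.

```python
# Hypothetical nonstationary policy pi(s, t): the chosen action depends on
# the stages-to-go t, not only on the state s.
def pi(s, t):
    # With one stage left there is no time for a detour, so head home;
    # otherwise keep collecting reward. (Action names are made up.)
    return "go_home" if t <= 1 else "collect_reward"

print(pi("s0", 10))  # collect_reward
print(pi("s0", 1))   # go_home
```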
