Making Complex Decisions
CSci 5512: Artificial Intelligence II

Sequential Decision Problems

[Diagram relating the problem classes:]
- Search + explicit actions and subgoals -> Planning
- Search + uncertainty and utility -> Markov decision problems (MDPs)
- Planning + uncertainty and utility -> Decision-theoretic planning
- MDPs + explicit actions and subgoals -> Decision-theoretic planning
- MDPs + uncertain sensing (belief states) -> Partially observable MDPs (POMDPs)

Markov Decision Process

[Figure: 4 x 3 grid world with a START state and terminal states +1 and -1; each action moves in the intended direction with probability 0.8 and perpendicular to it with probability 0.1 each.]
- States s ∈ S, actions a ∈ A
- Model T(s, a, s') = P(s' | s, a)
- Reward function R(s) (or R(s, a), R(s, a, s'))
- R(s) = -0.04 (small penalty) for nonterminal states, ±1 for terminal states

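To make the model concrete, here is a minimal Python sketch of this grid world (not from the slides); it assumes the standard AIMA 4 x 3 layout, with START at cell (1, 1), a wall at (2, 2), and terminals +1 at (4, 3) and -1 at (4, 2).

```python
# Minimal sketch of the 4 x 3 grid-world MDP described on this slide.
# The state encoding, helper names, and dictionary layout are illustrative choices.

# States are (col, row) cells, 1-indexed; (2, 2) is assumed to be the wall.
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # the +1 and -1 terminal states
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
# Perpendicular "slip" directions for each intended action
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def reward(s):
    """R(s): -0.04 for nonterminal states, +/-1 for the terminal states."""
    return TERMINALS.get(s, -0.04)

def step(s, direction):
    """Deterministic move; hitting the wall or the grid edge leaves s unchanged."""
    dc, dr = MOVES[direction]
    s2 = (s[0] + dc, s[1] + dr)
    return s2 if s2 in STATES else s

def transition(s, a):
    """T(s, a, s') = P(s' | s, a): 0.8 intended direction, 0.1 each perpendicular."""
    probs = {}
    for direction, p in [(a, 0.8), (SLIPS[a][0], 0.1), (SLIPS[a][1], 0.1)]:
        s2 = step(s, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

For example, transition((1, 1), "up") returns {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}: the leftward slip bumps into the grid edge and leaves the agent in place.
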
Solving MDPs

- In search problems, the aim is to find an optimal action sequence
- In MDPs, the aim is to find an optimal policy π(s): the best action for every possible state s, since one cannot predict where one will end up
- An optimal policy maximizes the expected sum of rewards
- Optimal policy when the state penalty R(s) is -0.04:

[Figure: arrows showing the optimal action in each nonterminal cell of the 4 x 3 grid world.]

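As a rough illustration of how such a policy can be computed, here is a value-iteration sketch; value iteration is one standard method, not something these slides have named yet, and the sketch reuses the hypothetical grid-world definitions above.

```python
# A minimal value-iteration sketch (an assumption; the slides do not commit to a
# particular solution algorithm here). It reuses STATES, ACTIONS, TERMINALS,
# reward(), and transition() from the grid-world sketch above.

def value_iteration(reward_fn=reward, gamma=1.0, eps=1e-6):
    """Return (utilities U, greedy policy pi) for the grid-world MDP."""
    U = {s: 0.0 for s in STATES}
    while True:
        U_new, delta = {}, 0.0
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = reward_fn(s)   # terminal states are absorbing
            else:
                # Bellman backup: R(s) + gamma * max_a sum_s' T(s, a, s') U(s')
                best = max(sum(p * U[s2] for s2, p in transition(s, a).items())
                           for a in ACTIONS)
                U_new[s] = reward_fn(s) + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:   # fine for this absorbing grid world; use gamma < 1 in general
            break
    # Extract pi(s): the best action for every nonterminal state
    pi = {s: max(ACTIONS,
                 key=lambda a: sum(p * U[s2] for s2, p in transition(s, a).items()))
          for s in STATES if s not in TERMINALS}
    return U, pi
```

Running value_iteration() with the default R(s) = -0.04 should reproduce the policy pictured above.
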
Reward and Optimal Policy

[Figure: optimal policies in the 4 x 3 grid world for four ranges of the nonterminal reward r: r ∈ [-0.4278, -0.0850], r ∈ [-0.0480, -0.0274], r ∈ [-0.0218, 0.0000], and r ∈ (-∞, -1.6284].]

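A short, hypothetical sweep (reusing the sketches above, with one sample value of r from each band) shows how the recommended action can change with the nonterminal reward:

```python
# Illustrative only: one sample reward from each of the four ranges on this slide.
# Assumes the grid-world and value_iteration sketches above live in the same script.
for r in [-1.7, -0.2, -0.04, -0.01]:
    U, pi = value_iteration(reward_fn=lambda s: TERMINALS.get(s, r))
    print(f"r = {r:+.2f}: action chosen at START (1, 1) is {pi[(1, 1)]}")
```
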
Utility of State Sequences

- Need to understand preferences between sequences of states
- Typically consider stationary preferences on reward sequences:
  if [r, r_0, r_1, r_2, ...] ≻ [r, r'_0, r'_1, r'_2, ...] then [r_0, r_1, r_2, ...] ≻ [r'_0, r'_1, r'_2, ...]
- Theorem: there are only two ways to combine rewards over time:
  1) Additive utility function: U([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...
  2) Discounted utility function: U([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ..., with discount factor 0 < γ ≤ 1

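As a concrete check on the two definitions, here is a small sketch (the particular reward sequence and γ = 0.9 are illustrative choices) that evaluates both utility functions on the same finite reward sequence:

```python
# Compare additive and discounted utility on one finite reward sequence.
def additive_utility(rewards):
    return sum(rewards)

def discounted_utility(rewards, gamma=0.9):
    # U = R(s_0) + gamma*R(s_1) + gamma^2*R(s_2) + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]        # three -0.04 steps, then the +1 terminal
print(additive_utility(rewards))            # 0.88 (modulo float rounding)
print(discounted_utility(rewards, 0.9))     # -0.04 - 0.036 - 0.0324 + 0.729 = 0.6206
```

With γ < 1, the discounted sum also stays bounded for infinite reward sequences, which is the usual reason for preferring it.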