Markov Decision Processes and
Reinforcement Learning
CS4700 – Fall 2009
Jesse Simons
(based on notes by T. Joachims)
Reinforcement Learning
• Problem
– Make sequence of decisions (policy) to get to goal / maximize utility
• Search Problems so far
– Known environment
• State space
• Consequences of actions
– Known utility / cost function
– First compute the sequence of decisions, then execute (potentially recompute)
• Real-World Problems
– Environment is unknown a priori and needs to be explored
– Utility function unknown – only examples are available for some states
• No feedback on individual actions
• Learn to act and to assign blame/credit to individual actions
– Need to quickly react to unforeseen events (have learned what to do)
Reinforcement Learning
• Issues
– Agent can be passive (watch) or active (explore)
– Feedback (i.e. rewards) in terminal states only; or a bit of
feedback in any state
– How to measure and estimate the utility of each action
– Environment fully observable, or partially observable
– Have model of environment and effects of action…or not
Reinforcement Learning will address these issues!
Markov Decision Process
• Representation of Environment:
– finite set of states S
– set of actions A for each state s in S
• Process
– At each discrete time step, the agent
• observes state s_t in S and then
• chooses action a_t in A.
– After that, the environment
• gives agent an immediate reward r_t
• changes state to s_{t+1} (can be probabilistic)
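To make this observe/act/reward loop concrete, here is a minimal Python sketch of one episode. The env and agent objects and their methods (reset, step, choose_action) are hypothetical placeholders invented for illustration; they are not from the course materials or any particular library.

def run_episode(env, agent, max_steps=100):
    """One episode of the MDP interaction loop sketched above (hypothetical API)."""
    s = env.reset()                    # observe the initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = agent.choose_action(s)     # agent chooses a_t from the actions available in s
        s_next, r, done = env.step(a)  # environment returns reward r_t and next state s_{t+1}
        total_reward += r              # s_{t+1} may be drawn probabilistically
        s = s_next
        if done:                       # stop once a terminal state is reached
            break
    return total_reward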
– Examples
Markov Decision Process
• Model:
– Initial state: S_0
– Transition function: T(s,a,s’)
T(s,a,s’) is the probability of moving from state s to s’
when executing action a.
– Reward function: R(s)
Real-valued reward that the agent receives for entering
state s.
• Assumptions
– Markov property: T(s,a,s’) and R(s) only depend on current
state s, but not on any states visited earlier.
– Extension: Function R may be non-deterministic as well
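As a concrete illustration of the model, the components S, A(s), T, and R could be written down directly as Python dictionaries. This is a sketch only; the states, actions, probabilities, and rewards below are made-up assumptions, not an example from the lecture.

# Hypothetical 3-state MDP; all names and numbers are illustrative assumptions.
states = ["s1", "s2", "s3"]
actions = {"s1": ["go", "stay"], "s2": ["go", "stay"], "s3": []}   # A(s) for each state s

# Transition function T(s, a, s'): probability of reaching s' from s via action a.
T = {
    ("s1", "go"):   {"s2": 0.8, "s1": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s2", "go"):   {"s3": 0.9, "s1": 0.1},
    ("s2", "stay"): {"s2": 1.0},
}

# Reward function R(s): reward received for entering state s.
R = {"s1": -0.04, "s2": -0.04, "s3": 1.0}

# Markov property: T and R depend only on the current state (and chosen action),
# never on states visited earlier in the trajectory.

Note that the probabilities in each row of T sum to 1.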
Utilities
• Rating a state sequence [s_0, s_1, s_2, …]
– We want preferences to be stationary
– If [s_0, s_1, s_2, …] is better than [s_0, s'_1, s'_2, …], then
[s_1, s_2, …] is better than [s'_1, s'_2, …]
• Two ways to obtain stationary utility
– Additive rewards:
U_h([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + …
– Discounted rewards:
U_h([s_0, s_1, s_2, …]) = R(s_0) + γ R(s_1) + γ² R(s_2) + …
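Both definitions can be computed directly from the rewards collected along a state sequence. A minimal sketch follows; the reward values and the choice γ = 0.9 are illustrative assumptions.

def additive_utility(rewards):
    # U_h([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + …
    return sum(rewards)

def discounted_utility(rewards, gamma=0.9):
    # U_h([s_0, s_1, s_2, …]) = R(s_0) + γ·R(s_1) + γ²·R(s_2) + …
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, 1.0]            # R(s_0), R(s_1), R(s_2) along one sequence
print(additive_utility(rewards))         # -0.04 + (-0.04) + 1.0 = 0.92
print(discounted_utility(rewards, 0.9))  # -0.04 + 0.9*(-0.04) + 0.81*1.0 = 0.734

With 0 < γ < 1, rewards received later count less; γ = 1 recovers the additive case.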
• Reward vs Utility