mdp-4up: Machine Learning CS6375 (Fall 2010), Markov Decision Processes

1. Machine Learning CS6375, Fall 2010: Markov Decision Processes
Reading: Sections 16.1-16.6 and 17.1-17.4, R&N

2. Rewards
An assistant professor gets paid, say, 20K per year. How much, in total, will the A.P. earn in his life?
20 + 20 + 20 + 20 + 20 + … = infinity
What's wrong with this argument?

3. Horizon Problem
• The problem is that we did not put any limit on the "future", so this sum can be infinite.
• This definition is useless unless we consider a finite time horizon.
• But, in general, we don't have a good way to define such a time horizon.

4. Discounted Rewards
"A reward (payment) in the future is not worth quite as much as a reward now."
• Because of the chance of obliteration
• Because of inflation
Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s discounted sum of future rewards?

5. Discount Factors
People in economics and probabilistic decision making do this all the time. The discounted sum of future rewards using discount factor γ is
(reward now) + γ (reward in 1 time step) + γ^2 (reward in 2 time steps) + γ^3 (reward in 3 time steps) + … (an infinite sum).
(A small numeric sketch of this sum for the A.P. example appears at the end of these notes.)

6. Discounting
• Always converges if γ < 1 and the reward function R(·) is bounded.
• γ close to 0: instant gratification, don't pay attention to future reward.
• γ close to 1: extremely conservative, consider profits/losses no matter how far in the future.
• The resulting model is the discounted reward.
• Prefers expedient solutions (models impatience).
• Compensates for uncertainty in available time (models mortality).

7. The Academic Life
Define:
J_A = expected discounted future rewards starting in state A
J_B = expected discounted future rewards starting in state B
J_T = expected discounted future rewards starting in state T
J_S = expected discounted future rewards starting in state S
J_D = expected discounted future rewards starting in state D
How do we compute J_A, J_B, J_T, J_S, J_D?

8. A Markov System with Rewards
• Has a set of states {S_1, S_2, …, S_N}.
• Has an N x N transition probability matrix P with entries P_ij = Prob(NextState = S_j | ThisState = S_i).
• Each state has a reward: {r_1, r_2, …, r_N}.
• There is a discount factor γ, with 0 < γ < 1.
On each time step:
1. Call the current state S_i.
2. Receive reward r_i.
3. Randomly move to another state S_j with probability P_ij.
4. All future rewards are discounted by γ.

9. Solving a Markov System
Write J*(S_i) = expected discounted sum of future rewards starting in state S_i. (A sketch of one way to compute these values appears at the end of these notes.)
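
To make the discounted-sum formula from slides 4-5 concrete, here is a minimal Python sketch (an illustration added to these notes, not part of the original slides) that sums the A.P.'s constant 20K-per-year reward with discount factor γ = 0.9. The partial sums approach the closed form 20 / (1 - 0.9) = 200, i.e. 200K, instead of diverging to infinity.

def discounted_sum(reward_per_step, gamma, horizon):
    # Sum reward_per_step * gamma**n for n = 0, 1, ..., horizon - 1.
    return sum(reward_per_step * gamma**n for n in range(horizon))

gamma = 0.9    # discount factor from the slides' example
reward = 20    # 20K per year, the A.P.'s salary
for horizon in (10, 50, 200):
    print(horizon, round(discounted_sum(reward, gamma, horizon), 2))
print("closed form:", reward / (1 - gamma))   # = 200, i.e. 200K in total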
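
Slide 9 asks how to compute the values J*(S_i). One standard way (a general fact about Markov systems with rewards, not text taken from the slides) is to use the relation J = r + γ P J, i.e. (I - γP) J = r, and solve it as a linear system. The Python sketch below does this for a small made-up 3-state system; the transition matrix, rewards, and γ are illustrative assumptions, not the Academic Life numbers from slide 7.

import numpy as np

# Hypothetical 3-state Markov system with rewards. P, r, and gamma below are
# illustrative assumptions, not values taken from the lecture slides.
P = np.array([[0.6, 0.3, 0.1],    # P[i, j] = Prob(NextState = S_j | ThisState = S_i)
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])   # each row sums to 1
r = np.array([20.0, 30.0, 0.0])   # reward r_i received in state S_i
gamma = 0.9

# J satisfies J = r + gamma * P @ J, so (I - gamma * P) J = r.
J = np.linalg.solve(np.eye(3) - gamma * P, r)
print("J*(S_i) for each state:", J)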