
# 16.410/413 Principles of Autonomy and Decision Making: Lecture 23, Markov Decision Processes (Policy Iteration)


Emilio Frazzoli
Aeronautics and Astronautics, Massachusetts Institute of Technology
December 1, 2010

## Assignments

Readings: lecture notes; [AIMA] Ch. 17.1–3.
## Searching over policies

Value iteration converges exponentially fast, but still only asymptotically. Recall how the best policy is recovered from the current estimate of the value function:

$$\pi_i(s) = \arg\max_a \, \mathbb{E}\left[ R(s, a, s') + \gamma V_i(s') \right], \qquad s \in \mathcal{S}.$$

In order to find the optimal policy, it should not be necessary to compute the optimal value function exactly: as soon as the greedy policy induced by the current estimate is optimal, further refinement of the value function is wasted effort. Since there are only finitely many policies in a finite-state, finite-action MDP, it is reasonable to expect that a search over policies should terminate in a finite number of steps.
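
Below is a minimal sketch of this greedy-policy extraction for a tabular MDP, assuming hypothetical NumPy arrays `T[s, a, s2]` (transition probabilities), `R[s, a, s2]` (transition rewards), and `V[s]` (the current value estimate); none of these names come from the slides.

```python
import numpy as np

def greedy_policy(T, R, V, gamma):
    """Return pi with pi[s] = argmax_a sum_{s'} T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])."""
    # Q[s, a]: expected one-step reward plus discounted value of the successor state.
    Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
    # Greedy action in each state (ties broken toward the lowest action index).
    return Q.argmax(axis=1)
```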

## Policy evaluation

Let us assume we have a policy π : S → A that assigns an action to each state, i.e., action π(s) will be chosen each time the system is at state s.

Once the actions taken at each state are fixed, the MDP is turned into a Markov chain (with rewards), and one can compute the expected utility collected over time using that policy.

In other words, one can evaluate how well a certain policy does by computing the value function induced by that policy.
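
To make the reduction concrete, here is a small sketch (same hypothetical arrays as above) that collapses the MDP's transition tensor into the transition matrix and expected one-step reward vector of the induced Markov chain:

```python
import numpy as np

def induced_chain(T, R, pi):
    """Markov chain (with rewards) induced by the fixed policy pi."""
    n = T.shape[0]
    idx = np.arange(n)
    P = T[idx, pi, :]                    # P[s, s'] = T(s, pi(s), s')
    r = (P * R[idx, pi, :]).sum(axis=1)  # r[s] = E[R(s, pi(s), s')]
    return P, r
```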
## Policy evaluation example: naïve method

Same planning problem as in the previous lecture, in a smaller (4×4) world. Simple policy π: always go right, unless at the goal (or inside an obstacle).

To compute the expected utility (value function) starting from the top-left corner (cell (2, 2)), enumerate the paths the policy can generate, weight each path's utility by its probability, and sum. Keeping only the dominant contribution:

$$V^\pi(2, 2) \approx 0.06 \cdot 8.1 + \cdots \approx 0.5.$$

| Path | Prob. | Utility |
|------|-------|---------|
| …    | 0.75  | 0       |
| …    | 0.08  | 0       |
| …    | 0.08  | 0       |
| ↓→   | 0.06  | 8.1     |
| …    | …     | …       |
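
Enumerating every path quickly becomes intractable, so a rough Monte Carlo variant of the same idea is to sample trajectories under the fixed policy and average their discounted returns. This sketch reuses the hypothetical `induced_chain` helper above; the start state, horizon, and sample count are illustrative choices, not values from the slides:

```python
import numpy as np

def mc_value(T, R, pi, gamma, start, n_samples=10_000, horizon=100, seed=0):
    """Monte Carlo estimate of V^pi(start) from truncated sampled trajectories."""
    P, _ = induced_chain(T, R, pi)
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        s, ret, disc = start, 0.0, 1.0
        for _ in range(horizon):           # truncation error is O(gamma^horizon)
            s2 = rng.choice(P.shape[1], p=P[s])
            ret += disc * R[s, pi[s], s2]
            disc *= gamma
            s = s2
        total += ret
    return total / n_samples
```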

## Policy evaluation

Recalling the MDP properties, one can write the value function at a state as the expected reward collected at the first step plus the expected discounted value at the next state, under the given policy:

$$V^\pi(s) = \mathbb{E}\left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right] = \sum_{s' \in \mathcal{S}} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right], \qquad s \in \mathcal{S}.$$
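
Since this fixed-point condition is linear in the unknowns $V^\pi(s)$, a policy can be evaluated exactly by solving the $|\mathcal{S}| \times |\mathcal{S}|$ linear system $(I - \gamma P)V^\pi = r$. A minimal sketch under the same assumed arrays, reusing the hypothetical `induced_chain` helper:

```python
import numpy as np

def evaluate_policy(T, R, pi, gamma):
    """Solve (I - gamma * P) V = r exactly for the value of policy pi."""
    P, r = induced_chain(T, R, pi)
    n = P.shape[0]
    # For gamma < 1 the matrix I - gamma * P is invertible, so the solve is well posed.
    return np.linalg.solve(np.eye(n) - gamma * P, r)
```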