cs221-practice-sol6d

CS221 Midterm: Question 6d supplement solution

CS 221, Fall 2009
Practice Midterm Solutions: Question 6d Supplement

MDP question: We have an MDP with reward function $R(s)$, transition probabilities $P_{sa}(s')$, and discount factor $0 \le \gamma < 1$. We are also given a biased coin that lands Tails with probability $\alpha$ and Heads with probability $(1 - \alpha)$, where $\alpha$ is known. Suppose we have a policy $\pi$ that behaves in the following way: for a given state $s$, the policy first tosses the biased coin. If the coin lands Heads, it executes the optimal policy $\pi^\star$ (i.e., chooses action $\pi^\star(s)$); if the coin lands Tails, it executes some other fixed policy $\hat{\pi}$. In this problem, we will prove a bound on the difference between $V^{\pi}$ (the value function for $\pi$) and the optimal value function $V^{\pi^\star}$ (which is also written $V^*$).

a. [4 points] Express the transition probability $P_{s\pi(s)}(s')$ in terms of $\alpha$, $P_{s\pi^\star(s)}(s')$ and $P_{s\hat{\pi}(s)}(s')$.

b. [4 points] For any state $s$, prove the following statement relating the value function $V^{\pi}$ for policy $\pi$ with the value function $V^{\pi^\star}$ for the optimal policy $\pi^\star$:

\[
V^{\pi}(s) - V^{\pi^\star}(s) = \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \big( V^{\pi}(s') - V^{\pi^\star}(s') \big) + \big( P_{s\pi(s)}(s') - P_{s\pi^\star(s)}(s') \big) V^{\pi^\star}(s') \Big]
\]

Hint: Recall Bellman's equations for $V^{\pi}$ and $V^{\pi^\star}$:

\[
V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^{\pi}(s'), \qquad
V^{\pi^\star}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi^\star(s)}(s') V^{\pi^\star}(s')
\]

c. [4 points] Using the results of (a) and (b), prove the following for any state $s$:

\[
\big| V^{\pi}(s) - V^{\pi^\star}(s) \big| \le \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \big| V^{\pi}(s') - V^{\pi^\star}(s') \big| + \alpha \big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big| \cdot \big| V^{\pi^\star}(s') \big| \Big]
\]

d. [4 points] Recall that the "infinity norm" of a value function $V$ is defined as:

\[
\|V\|_\infty = \max_{s} |V(s)| .
\]

Without further assumptions, prove the following using the result of (c):

\[
\|V^{\pi} - V^{\pi^\star}\|_\infty \le \frac{2\gamma\alpha}{1 - \gamma} \|V^{\pi^\star}\|_\infty
\]

Detailed explanation of part (d):

\begin{align*}
& \|V^{\pi} - V^{\pi^\star}\|_\infty \tag{1} \\
&\le \max_{s} \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \big| V^{\pi}(s') - V^{\pi^\star}(s') \big| + \alpha \big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big| \big| V^{\pi^\star}(s') \big| \Big] \tag{2} \\
&\le \max_{s} \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \|V^{\pi} - V^{\pi^\star}\|_\infty + \alpha \big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big| \|V^{\pi^\star}\|_\infty \Big] \tag{3} \\
&\le \max_{s} \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \|V^{\pi} - V^{\pi^\star}\|_\infty + \alpha \Big( \big| P_{s\hat{\pi}(s)}(s') \big| + \big| P_{s\pi^\star(s)}(s') \big| \Big) \|V^{\pi^\star}\|_\infty \Big] \tag{4} \\
&\le \max_{s} \gamma \Big[ \|V^{\pi} - V^{\pi^\star}\|_\infty \sum_{s' \in S} P_{s\pi(s)}(s') + \alpha \|V^{\pi^\star}\|_\infty \Big( \sum_{s' \in S} P_{s\hat{\pi}(s)}(s') + \sum_{s' \in S} P_{s\pi^\star(s)}(s') \Big) \Big] \tag{5} \\
&\le \max_{s} \gamma \Big[ \|V^{\pi} - V^{\pi^\star}\|_\infty + 2 \cdot \alpha \|V^{\pi^\star}\|_\infty \Big] \tag{6} \\
&= \gamma \|V^{\pi} - V^{\pi^\star}\|_\infty + \gamma \cdot 2 \cdot \alpha \|V^{\pi^\star}\|_\infty . \tag{7}
\end{align*}

To explain each step in more detail:

• from (1) to (2): apply part (c) to the state $s$ that attains the maximum in the definition of $\|\cdot\|_\infty$
• from (2) to (3): use the definition of $\|\cdot\|_\infty$ as the maximum over all states, so $|V^{\pi}(s') - V^{\pi^\star}(s')| \le \|V^{\pi} - V^{\pi^\star}\|_\infty$ and $|V^{\pi^\star}(s')| \le \|V^{\pi^\star}\|_\infty$
• from (3) to (4): use the triangle inequality on $\big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big|$
• from (4) to (5): push in the summations and drop the absolute values around the probability values (since they are always non-negative)
• from (5) to (6): use the fact that probabilities sum to 1
• from (6) to (7): the expression inside the maximum no longer depends on $s$, so the maximum can be dropped

As in the reference solution, collecting the $\|V^{\pi} - V^{\pi^\star}\|_\infty$ terms on the left gives $(1 - \gamma) \|V^{\pi} - V^{\pi^\star}\|_\infty \le 2\gamma\alpha \|V^{\pi^\star}\|_\infty$, and dividing by $1 - \gamma$ (positive because $\gamma < 1$) yields the result.
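The bound in part (d) can be sanity-checked numerically. The following is a minimal sketch, not part of the original solution: it builds a small random MDP, forms the coin-flip policy's transition matrix as the mixture $(1-\alpha) P_{s\pi^\star(s)} + \alpha P_{s\hat{\pi}(s)}$ (the natural answer to part (a), assumed here), and compares both sides of the inequality. The function names (`evaluate_policy`, `random_distribution`, `check_bound`), the MDP sizes, and the sweep counts are all illustrative assumptions.

```python
import random


def evaluate_policy(P_pi, R, gamma, sweeps=2000):
    """Iterate V <- R + gamma * P_pi V until it has numerically converged."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s] + gamma * sum(P_pi[s][t] * V[t] for t in range(n))
             for s in range(n)]
    return V


def random_distribution(rng, n):
    """Return a random probability vector of length n."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def check_bound(n_states=5, n_actions=3, gamma=0.9, alpha=0.2, seed=0):
    """Return (lhs, rhs) of the part (d) inequality on a random MDP."""
    rng = random.Random(seed)
    # P[s][a] is the distribution over next states for action a in state s.
    P = [[random_distribution(rng, n_states) for _ in range(n_actions)]
         for _ in range(n_states)]
    R = [rng.uniform(-1.0, 1.0) for _ in range(n_states)]

    # Value iteration, then a greedy policy with respect to the result.
    V = [0.0] * n_states
    for _ in range(1000):
        V = [max(R[s] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
                 for a in range(n_actions))
             for s in range(n_states)]
    pi_star = [max(range(n_actions),
                   key=lambda a: sum(P[s][a][t] * V[t] for t in range(n_states)))
               for s in range(n_states)]
    # An arbitrary fixed "other" policy (hypothetical choice: uniformly random).
    pi_hat = [rng.randrange(n_actions) for _ in range(n_states)]

    P_star = [P[s][pi_star[s]] for s in range(n_states)]
    P_hat = [P[s][pi_hat[s]] for s in range(n_states)]
    # Coin-flip policy: Heads (prob 1 - alpha) follows pi_star, Tails follows pi_hat.
    P_mix = [[(1 - alpha) * p + alpha * q for p, q in zip(P_star[s], P_hat[s])]
             for s in range(n_states)]

    V_star = evaluate_policy(P_star, R, gamma)
    V_mix = evaluate_policy(P_mix, R, gamma)

    lhs = max(abs(a - b) for a, b in zip(V_mix, V_star))
    rhs = 2 * gamma * alpha / (1 - gamma) * max(abs(v) for v in V_star)
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = check_bound()
    print(f"||V^pi - V^pi*||_inf = {lhs:.6f}, bound = {rhs:.6f}")
```

Note that the derivation in steps (1) to (7) never uses the optimality of $\pi^\star$, so the check should pass for any pair of policies; a small tolerance is advisable when comparing `lhs` and `rhs` because the values are computed iteratively.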

This note was uploaded on 12/15/2009 for the course CS 221 taught by Professors Koller and Ng during the Fall '09 term at Stanford.
