CS221 Midterm: Question 6d supplement solution
CS 221, Fall 2009
Practice Midterm Solutions: Question 6d Supplement
MDP question: We have an MDP with reward function $R(s)$, transition probabilities $P_{sa}(s')$, and discount factor $0 \le \gamma < 1$. We are also given a biased coin that lands Tails with probability $\alpha$ and Heads with probability $(1 - \alpha)$, where $\alpha$ is known. Suppose we have a policy $\pi$ that behaves in the following way: for a given state $s$, the policy first tosses the biased coin. If the coin lands Heads, it executes the optimal policy $\pi^\star$ (i.e., chooses action $\pi^\star(s)$); if the coin lands Tails, it executes some other fixed policy $\hat{\pi}$. In this problem, we will prove a bound on the difference between $V^{\pi}$ (the value function for $\pi$) and the optimal value function $V^{\pi^\star}$ (which is also written $V^*$).

a. [4 points] Express the transition probability $P_{s\pi(s)}(s')$ in terms of $\alpha$, $P_{s\pi^\star(s)}(s')$ and $P_{s\hat{\pi}(s)}(s')$.

b. [4 points] For any state $s$, prove the following statement relating the value function $V^{\pi}$ for policy $\pi$ with the value function $V^{\pi^\star}$ for the optimal policy $\pi^\star$:

\[
V^{\pi}(s) - V^{\pi^\star}(s) = \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \big( V^{\pi}(s') - V^{\pi^\star}(s') \big) + \big( P_{s\pi(s)}(s') - P_{s\pi^\star(s)}(s') \big) V^{\pi^\star}(s') \Big]
\]

Hint: Recall Bellman's equations for $V^{\pi}$ and $V^{\pi^\star}$:

\[
V^{\pi}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^{\pi}(s'), \qquad
V^{\pi^\star}(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi^\star(s)}(s') V^{\pi^\star}(s')
\]

c. [4 points] Using the result of (a) and (b), prove the following for any state $s$:

\[
\big| V^{\pi}(s) - V^{\pi^\star}(s) \big| \le \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \big| V^{\pi}(s') - V^{\pi^\star}(s') \big| + \alpha \big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big| \cdot \big| V^{\pi^\star}(s') \big| \Big]
\]

d. [4 points] Recall that the "infinity norm" of a value function $V$ is defined as:

\[
\|V\|_\infty = \max_s |V(s)| .
\]

Without further assumptions, prove the following using the result of (c):

\[
\|V^{\pi} - V^{\pi^\star}\|_\infty \le \frac{2\gamma\alpha \, \|V^{\pi^\star}\|_\infty}{1 - \gamma} \tag{1}
\]

Detailed explanation of part (d):

\begin{align*}
\|V^{\pi} - V^{\pi^\star}\|_\infty
&\le \max_s \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \big| V^{\pi}(s') - V^{\pi^\star}(s') \big| + \alpha \big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big| \, \big| V^{\pi^\star}(s') \big| \Big] \tag{2} \\
&\le \max_s \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \, \|V^{\pi} - V^{\pi^\star}\|_\infty + \alpha \big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big| \, \|V^{\pi^\star}\|_\infty \Big] \tag{3} \\
&\le \max_s \gamma \sum_{s' \in S} \Big[ P_{s\pi(s)}(s') \, \|V^{\pi} - V^{\pi^\star}\|_\infty + \alpha \Big( \big| P_{s\hat{\pi}(s)}(s') \big| + \big| P_{s\pi^\star(s)}(s') \big| \Big) \|V^{\pi^\star}\|_\infty \Big] \tag{4} \\
&\le \max_s \gamma \Big[ \|V^{\pi} - V^{\pi^\star}\|_\infty \sum_{s' \in S} P_{s\pi(s)}(s') + \alpha \|V^{\pi^\star}\|_\infty \Big( \sum_{s' \in S} P_{s\hat{\pi}(s)}(s') + \sum_{s' \in S} P_{s\pi^\star(s)}(s') \Big) \Big] \tag{5} \\
&\le \max_s \gamma \Big[ \|V^{\pi} - V^{\pi^\star}\|_\infty + 2 \cdot \alpha \|V^{\pi^\star}\|_\infty \Big] \tag{6} \\
&= \gamma \|V^{\pi} - V^{\pi^\star}\|_\infty + \gamma \cdot 2 \cdot \alpha \|V^{\pi^\star}\|_\infty . \tag{7}
\end{align*}

To explain each step in more detail:
• from (1) to (2): use part (c), applied at every state $s$, and take the maximum over $s$
• from (2) to (3): use the definition of $\|\cdot\|_\infty$ as the maximum over all states
• from (3) to (4): use the triangle inequality on $\big| P_{s\hat{\pi}(s)}(s') - P_{s\pi^\star(s)}(s') \big|$
• from (4) to (5): push in the summations and drop the absolute values around the probability values (since they are always nonnegative)
• from (5) to (6): use the fact that probabilities sum to 1
• from (6) to (7): the bracketed expression no longer depends on $s$, so the maximum can be dropped

As in the reference solution, collecting the $\|V^{\pi} - V^{\pi^\star}\|_\infty$ terms on the left and dividing by $1 - \gamma$ (which is positive because $\gamma < 1$) yields the result.
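The bound in part (d) can also be sanity-checked numerically. The sketch below is not part of the original solution: it builds a small random MDP (an assumption made purely for illustration), forms the coin-flip policy's transition rows as the mixture $(1 - \alpha) P_{s\pi^\star(s)} + \alpha P_{s\hat{\pi}(s)}$, and compares $\|V^{\pi} - V^{\pi^\star}\|_\infty$ against $2\gamma\alpha\|V^{\pi^\star}\|_\infty / (1 - \gamma)$. The helper names `random_mdp`, `evaluate`, `optimal_policy`, and `check_bound` are hypothetical, and the fixed number of sweeps assumes $\gamma$ is not too close to 1.

```python
import random


def random_mdp(n_states, n_actions, rng):
    """Random rewards R[s] and transition rows P[s][a]; each row sums to 1."""
    R = [rng.uniform(-1.0, 1.0) for _ in range(n_states)]
    P = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
        P.append(rows)
    return R, P


def evaluate(R, rows, gamma, sweeps=500):
    """Iterate V(s) = R(s) + gamma * sum_s' rows[s][s'] * V(s')."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s] + gamma * sum(rows[s][t] * V[t] for t in range(n))
             for s in range(n)]
    return V


def optimal_policy(R, P, gamma, sweeps=500):
    """Value iteration, then the greedy action in each state."""
    n = len(R)
    n_actions = len(P[0])

    def q(s, a, V):
        return sum(P[s][a][t] * V[t] for t in range(n))

    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s] + gamma * max(q(s, a, V) for a in range(n_actions))
             for s in range(n)]
    return [max(range(n_actions), key=lambda a: q(s, a, V)) for s in range(n)]


def check_bound(alpha, gamma, seed=0, n_states=5, n_actions=3):
    """Return (gap, bound): the sup-norm gap and the part (d) bound."""
    rng = random.Random(seed)
    R, P = random_mdp(n_states, n_actions, rng)
    pi_star = optimal_policy(R, P, gamma)
    # pi_hat is an arbitrary fixed policy (an assumption: chosen at random).
    pi_hat = [rng.randrange(n_actions) for _ in range(n_states)]
    P_star = [P[s][pi_star[s]] for s in range(n_states)]
    P_hat = [P[s][pi_hat[s]] for s in range(n_states)]
    # Transition rows of the coin-flip policy: Heads (prob. 1 - alpha) follows
    # pi_star, Tails (prob. alpha) follows pi_hat.
    P_mix = [[(1 - alpha) * ps + alpha * ph
              for ps, ph in zip(P_star[s], P_hat[s])]
             for s in range(n_states)]
    V_star = evaluate(R, P_star, gamma)
    V_pi = evaluate(R, P_mix, gamma)
    gap = max(abs(a - b) for a, b in zip(V_pi, V_star))
    bound = 2 * gamma * alpha * max(abs(v) for v in V_star) / (1 - gamma)
    return gap, bound
```

Calling `check_bound(0.1, 0.9)` returns the pair `(gap, bound)`; if the derivation holds, the first entry should not exceed the second (up to floating-point error), and both should shrink to zero as $\alpha \to 0$.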
...
This note was uploaded on 12/15/2009 for the course CS 221, taught by Professors Koller and Ng during the Fall '09 term at Stanford.