Lecture 21 Notes

# DP example: should I stay or should I go?

$h(s) \approx J^*(s)$ or $h(s, a) \approx Q^*(s, a)$

If we have $h(s) = J^*(s)$, we only need to build the first two levels of the tree (action and outcome) to choose the optimal action at $s_1$. With $h(s, a) = Q^*(s, a)$, we only need to build the first (action) level. Often we try to use $h \approx J^\pi$ or $Q^\pi$ for some good policy $\pi$.

## Roll-outs

Want $h(s) \approx J^\pi(s)$.

Starting from $s_1 = s$, sample $a_1 \sim \pi(a \mid s_1)$, set $c_1 = c(s_1, a_1)$, and sample $s_2 \sim T(s' \mid s_1, a_1)$. Repeat until the goal is reached (or until $\gamma^t$ is small). Take

$$h(s) = \frac{1 - \gamma}{\gamma} \sum_t \gamma^t c_t$$

Used in UCT (best algorithm for Go).

## Dynamic programming

If there are a small number of states and actions, it makes sense to memoize tree search:

‣ compute an entire level of the tree at a time, working from the bottom up
‣ store only $S \times A$ numbers rather than $b^d$

[Figure: DP example, "should I stay or should I go?", showing Q(A, stay), Q(A, go), and J(A)]

DP example 2: each step costs 1, discount 0.8.

[Figures: DP example 2, iterations 0, 1, 3, and 4, each plotting Q(s, left) and Q(s, right) against State]

DP example 2—iteratio...
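The roll-out estimate and the level-at-a-time DP backup described above can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration, not taken from the lecture: a 5-state chain with the goal at state 0, deterministic `left`/`right` moves, and the helper names `transitions`, `sample_next`, `rollout_heuristic`, and `q_iteration`. The sum in the roll-out is assumed to start at t = 1 (matching $s_1$, $c_1$), and the DP backup uses the normalized form Q(s, a) = (1 − γ) c(s, a) + γ E[J(s')], which is what that normalization implies. Only the step cost of 1 and the discount of 0.8 come from the notes.

```python
import random

GAMMA = 0.8                  # discount from "DP example 2"
N_STATES = 5                 # hypothetical chain length; state 0 is the goal
ACTIONS = ("left", "right")


def is_goal(s):
    return s == 0


def cost(s, a):
    return 1.0               # each step costs 1


def transitions(s, a):
    """Return [(probability, next_state)]. Deterministic moves are an assumption."""
    nxt = max(s - 1, 0) if a == "left" else min(s + 1, N_STATES - 1)
    return [(1.0, nxt)]


def sample_next(s, a, rng):
    """Sample s' ~ T(s' | s, a) from the transition list."""
    r, acc = rng.random(), 0.0
    for p, nxt in transitions(s, a):
        acc += p
        if r < acc:
            return nxt
    return nxt               # guard against floating-point round-off


def rollout_heuristic(s, policy, rng, gamma=GAMMA, tol=1e-9):
    """One roll-out estimate h(s) = (1 - gamma) / gamma * sum_t gamma^t c_t."""
    total, discount = 0.0, gamma          # discount holds gamma^t, starting at t = 1
    while not is_goal(s) and discount > tol:
        a = policy(s, rng)                # a_t ~ pi(a | s_t)
        total += discount * cost(s, a)    # accumulate gamma^t c_t
        s = sample_next(s, a, rng)        # s_{t+1} ~ T(s' | s_t, a_t)
        discount *= gamma
    return (1 - gamma) / gamma * total


def q_iteration(sweeps, gamma=GAMMA):
    """Back up one whole level per sweep, storing only S x A numbers."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        # J(s) = min_a Q(s, a), since costs are being minimized
        j = {s: min(q[s, a] for a in ACTIONS) for s in range(N_STATES)}
        q = {
            (s, a): 0.0 if is_goal(s) else
            (1 - gamma) * cost(s, a)
            + gamma * sum(p * j[n] for p, n in transitions(s, a))
            for s in range(N_STATES) for a in ACTIONS
        }
    return q
```

On this toy chain, rolling out an always-left policy from state 3 with `rollout_heuristic(3, lambda s, rng: "left", random.Random(0))` pays three unit costs, so the estimate is 1 − 0.8³ = 0.488, and `q_iteration(50)` converges to J(s) = 1 − 0.8^s. The table `q` never holds more than `N_STATES * len(ACTIONS)` entries, however many sweeps are run.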