Lecture 21 Notes



Markov decision process: influence diagram
States, actions, initial state s1, (expected) costs C(s, a) ∈ [Cmin, Cmax], transitions T(s' | s, a).

Influence diagrams
Like a Bayes net, except:
‣ diamond nodes are costs/rewards; they must have no children
‣ square nodes are decisions; we pick their CPTs (before seeing anything)
‣ objective: minimize expected cost
Circles are ordinary r.v.s, as before.

Markov decision process: state space diagram
States, actions, costs C(s, a) ∈ [Cmin, Cmax], transitions T(s' | s, a), initial state s1.
Goal state: all costs = 0, self-transition with probability 100%.

Choosing actions
Execution trace: τ = (s1, a1, c1, s2, a2, c2, …)
‣ c1 = C(s1, a1), c2 = C(s2, a2), etc.
‣ s2 ~ T(s' | s1, a1), s3 ~ T(s' | s2, a2), etc.
Policy π: S → A
‣ or randomized, π(a | s)
Trace from π: a1 ~ π(a | s1), etc.
‣ τ is then an r.v. with a known distribution
‣ we'll write τ ~ π (rest of the MDP implicit)

Choosing good actions
Discount factor γ ∈ (0, 1).
Value of a policy: J^π = (1 − γ) E[ Σ_t γ^t c_t | τ ~ π ]
Objective: J* = min_π J^π, with π* ∈ argmin_π J^π.
Why a…
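The definitions above (trace sampling τ ~ π and the normalized discounted value J^π) can be sketched in code. The two-state MDP below, its transition probabilities, and the always-"move" policy are illustrative assumptions, not from the lecture; only the structure (C(s, a), T(s' | s, a), the trace, and the (1 − γ) Σ γ^t c_t value) follows the notes.

```python
import random

# Toy MDP, assumed for illustration: one non-goal state "s1" and a goal
# state with all costs 0 and a 100% self-transition, as in the notes.
def C(s, a):
    # Costs C(s, a) in [Cmin, Cmax]; here Cmin = 0, Cmax = 1.
    return 0.0 if s == "goal" else 1.0

def T(s, a):
    # Transition distribution T(s' | s, a) as a dict {s': probability}.
    if s == "goal":
        return {"goal": 1.0}          # goal self-transitions w.p. 1
    if a == "move":
        return {"goal": 0.8, "s1": 0.2}
    return {"s1": 1.0}

def policy(s):
    # A deterministic policy pi: S -> A (could instead be pi(a | s)).
    return "move"

def sample_trace(s1, pi, horizon, rng):
    # Trace tau = (s1, a1, c1, s2, a2, c2, ...), truncated at `horizon`.
    trace, s = [], s1
    for _ in range(horizon):
        a = pi(s)
        trace.append((s, a, C(s, a)))
        dist = T(s, a)
        s = rng.choices(list(dist), weights=list(dist.values()))[0]
    return trace

def discounted_cost(trace, gamma):
    # Normalized discounted cost (1 - gamma) * sum_t gamma^t * c_t,
    # which keeps the value inside [Cmin, Cmax].
    return (1 - gamma) * sum(gamma**t * c for t, (_, _, c) in enumerate(trace))

def estimate_J(pi, gamma, n_traces, horizon, seed=0):
    # Monte Carlo estimate of J^pi = (1 - gamma) E[sum_t gamma^t c_t | tau ~ pi].
    rng = random.Random(seed)
    return sum(discounted_cost(sample_trace("s1", pi, horizon, rng), gamma)
               for _ in range(n_traces)) / n_traces
```

Since the per-step costs here lie in [0, 1], any Monte Carlo estimate of J^π must also lie in [0, 1]; the normalization by (1 − γ) is what makes that hold regardless of γ.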

This note was uploaded on 01/24/2014 for the course CS 15-780 taught by Professor Bryant during the Fall '09 term at Carnegie Mellon.
