pomdps[1] - Partially Observable MDPs (POMDPs) CPS...

4/1/10
Partially Observable MDPs (POMDPs)
CPS 170
Ron Parr
With thanks to Christopher Painter-Wakefield

Example POMDP
Unidentified incoming target: Observe, Update P(Hostile)
Wait or shoot?
Must weigh cost of friendly fire vs. cost of potential attack
What is the state in this problem?

Other Example POMDPs
• Patient diagnosis/treatment
• Machine maintenance
• Robotic search problems (e.g., de-mining)

Straw Man
• What if we treat the observation as the state?
• Violates Markov assumption
• Can't distinguish between two states that coincidentally produce similar observations (no way to improve your estimate of what's going on over time)
• Leads to suboptimal policies

Partially Observable MDP (POMDP)
• State space: s ∈ S
• Action space: a ∈ A
• Observation space: z ∈ Z
• Reward model: R(s,a)
• Transition model: P(s'|s,a)
• Observation model: P(z|s',a)
• Discount: γ ∈ [0,1]
• MDP dynamics (transitions, rewards) are unchanged.
• After a state transition, agent observes z with probability P(z|s',a).
• State is hidden; agent only sees observation.

Belief States
True state is only partially observable
• b = belief state
• b[s] = probability of state s
• At each step, the agent
  – takes some action a
  – transitions to some state s' with probability p(s'|s,a)
  – makes observation z with probability p(z|s',a)
• Posterior belief given z, a, b:
  $b'(s') = \alpha \, p(z \mid s', a) \sum_{s} p(s' \mid s, a) \, b(s)$
  (α is the normalizing constant that makes b' sum to 1)
Compare with HMMs!
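The posterior belief formula above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name belief_update, the nested-list layouts T[a][s][s2] and O[a][s2][z], and the two-state "friendly/hostile" numbers (a sensor that is right 80% of the time, a "wait" action that leaves the state unchanged) are all hypothetical and do not come from the slides.

```python
def belief_update(b, a, z, T, O):
    """Return the posterior belief b' after taking action a and observing z.

    Assumed (hypothetical) layouts:
      b[s]         = current probability of state s
      T[a][s][s2]  = p(s2 | s, a)
      O[a][s2][z]  = p(z | s2, a)
    """
    n = len(b)
    # Unnormalized posterior: p(z | s', a) * sum_s p(s' | s, a) * b(s)
    unnorm = [
        O[a][s2][z] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnorm)  # this is p(z | b, a); alpha = 1 / total
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnorm]


# Hypothetical example: state 0 = friendly, state 1 = hostile.
# Single action 0 ("wait") that never changes the state.
T = [[[1.0, 0.0],
      [0.0, 1.0]]]
# Observation 1 = "looks hostile"; the sensor is correct 80% of the time.
O = [[[0.8, 0.2],
      [0.2, 0.8]]]

b0 = [0.5, 0.5]
b1 = belief_update(b0, 0, 1, T, O)  # one "looks hostile" reading
b2 = belief_update(b1, 0, 1, T, O)  # a second one
print(b1, b2)
```

Repeated "looks hostile" readings push P(Hostile) upward, which is exactly the improvement over time that the straw-man approach (observation as state) cannot provide.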
Belief Space
• Since belief is a probability distribution:
  – For n states, belief has n-1 degrees of freedom
  – Beliefs live in an (n-1)-dimensional simplex
[Figure: belief simplices for n = 2, n = 3, n = 4]

Belief Space Illustrated
[Figure: belief simplex for |S| = 3, with axes b(s0) and b(s1) running from 0 to 1 and vertices s0, s1, s2]
$b(s_2) = 1 - b(s_1) - b(s_0)$

POMDP Value Functions
• Bellman equation for POMDPs:
  $V^*(b) = \max_a \left[ \rho(b, a) + \gamma \sum_{b'} p(b' \mid a, b) \, V^*(b') \right]$
• Expectation of R given b, a:
  $\rho(b, a) = \sum_{s} R(s, a) \, b(s)$
• Belief transition probability derived from POMDP transition/observation models:
  $p(b' \mid a, b) = \sum_{z : b_a^z = b'} \sum_{s'} p(z \mid s', a) \sum_{s} p(s' \mid s, a) \, b(s)$
  where $b_a^z$ is the posterior belief reached from b after action a and observation z
• Why sum and not integral?

Finite State Machine Policies
• Policies represented as finite state machine.
  – States μ1 … μm labeled with actions
  – Deterministic transition function δ(μ,z)
  – Belief state not used in following policy

POMDP Policy Evaluation
• Policy x POMDP induces a Markov chain
  – States: $\sigma_{\mu,s}$ (∀ s ∈ S, μ ∈ FSM)
  – Reward function: $\rho_{\mu,s} = R(s, a_\mu)$
  – Transition function:
    $\tau(\sigma_{\mu,s}, \sigma_{\mu',s'}) = P(s' \mid s, a_\mu) \sum_{z : \delta(\mu,z) = \mu'} P(z \mid s', a_\mu)$
    i.e. $\Pr(\mu', s' \mid \mu, s) = \Pr(s' \mid \mu, s) \, \Pr(\mu' \mid s', \mu, s)$
  – Discount factor: γ
• POMDP value function can be extracted from Markov chain value function

POMDP Value Functions
• Γ = {α1 … αn}
• V is max surface of Γ
• $V(b) = \max_{\alpha \in \Gamma} \alpha \cdot b$
• Facets correspond to machine states
[Figure: V(b) plotted over belief b from 0 to 1 as the upper surface of the lines α1, α2, α3]

Policy Iteration for POMDPs (one of several possible methods)
• Basic idea of MDP policy iteration carries over to POMDPs
• Implementation is tricky
• Highlights:
  – Set of rules for adding new machine states to finite state controller, such that new controller is guaranteed to improve on old one
  – Alternate between policy evaluation phases and policy improvement phases
• Good news: Turns a nasty, continuous problem into a somewhat manageable discrete one
• Bad news: May add $O(m^{\#Z})$ new FSC states per iteration (m = current number of states, #Z = number of possible observations)
• In practice, it is possible to find optimal solutions only for fairly small POMDPs (high 10s to low 100s of states)

POMDP Conclusions
• Generalize MDPs to include imperfect information about the state
• Like HMMs in that we track a distribution over underlying states
• Every POMDP is a continuous state MDP, where MDP states correspond to POMDP belief states
• POMDPs are quite tricky and computationally expensive to solve in practice
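The policy-evaluation construction from the "POMDP Policy Evaluation" slide, together with the max-over-α-vectors form of V(b), can be sketched as follows. This is a minimal illustration, not the course's implementation: it evaluates the induced Markov chain by simple fixed-point iteration (assuming γ < 1) rather than solving the linear system, and the names evaluate_fsc and value_at_belief, the array layouts, and the tiny two-state example are all hypothetical.

```python
def evaluate_fsc(actions, delta, T, O, R, gamma, iters=200):
    """Evaluate a finite state controller (FSC) on a POMDP.

    Assumed (hypothetical) layouts:
      actions[m]   = action a_mu attached to machine state m
      delta[m][z]  = next machine state after observing z in machine state m
      T[a][s][s2]  = P(s2 | s, a)
      O[a][s2][z]  = P(z | s2, a)
      R[s][a]      = immediate reward
    Returns V with V[m][s] = value of the chain state (machine state m, world state s).
    """
    M, S, Z = len(actions), len(R), len(delta[0])
    V = [[0.0] * S for _ in range(M)]
    for _ in range(iters):
        new_V = [[0.0] * S for _ in range(M)]
        for m in range(M):
            a = actions[m]
            for s in range(S):
                future = 0.0
                for s2 in range(S):
                    for z in range(Z):
                        # Chain transition: P(s'|s,a) * P(z|s',a), landing in
                        # machine state delta(m, z) and world state s'.
                        future += T[a][s][s2] * O[a][s2][z] * V[delta[m][z]][s2]
                new_V[m][s] = R[s][a] + gamma * future
        V = new_V
    return V


def value_at_belief(V, b):
    """V(b) = max over machine states of alpha . b, where alpha = V[m]."""
    return max(sum(v * p for v, p in zip(alpha, b)) for alpha in V)


# Hypothetical example: two static world states, two actions, and reward 1
# when the action index matches the world state.
T = [[[1.0, 0.0], [0.0, 1.0]]] * 2                 # state never changes
O = [[[0.8, 0.2], [0.2, 0.8]]] * 2                 # 80%-accurate sensor
R = [[1.0, 0.0],                                   # R[s][a]
     [0.0, 1.0]]
actions = [0, 1]                                   # machine state m takes action m
delta = [[0, 1], [0, 1]]                           # jump to the state matching z

V = evaluate_fsc(actions, delta, T, O, R, gamma=0.5)
v_mid = value_at_belief(V, [0.5, 0.5])
v_sure = value_at_belief(V, [0.9, 0.1])
print(V, v_mid, v_sure)
```

Each row V[m] is one α-vector, so the two machine states here contribute the two facets of the max surface; a belief near either corner of the simplex is worth more than the uniform belief.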
