Partially Observable MDPs (POMDPs)
CPS 170, Ron Parr
With thanks to Christopher Painter-Wakefield

Example POMDP

Unidentified incoming target: observe, update P(Hostile).
Wait or shoot? Must weigh the cost of friendly fire vs. the cost of a potential attack.
What is the state in this problem?

Other Example POMDPs

•  Patient diagnosis/treatment
•  Machine maintenance
•  Robotic search problems (e.g., de-mining)

Straw Man

•  What if we treat the observation as the state?
•  Violates the Markov assumption
•  Can't distinguish between two states that coincidentally produce similar observations (no way to improve your estimate of what's going on over time)
•  Leads to suboptimal policies

Partially Observable MDP (POMDP)

•  State space: s ∈ S
•  Action space: a ∈ A
•  Observation space: z ∈ Z
•  Reward model: R(s,a)
•  Transition model: P(s'|s,a)
•  Observation model: P(z|s',a)
•  Discount: γ ∈ [0,1]
•  MDP dynamics (transitions, rewards) are unchanged.
•  After a state transition, the agent observes z with probability P(z|s',a).
•  The state is hidden; the agent only sees the observation.

Belief States

True state is only partially observable.
•  b = belief state
•  b[s] = probability of state s
•  At each step, the agent
   –  takes some action a
   –  transitions to some state s' with probability p(s'|s,a)
   –  makes observation z with probability p(z|s',a)
•  Posterior belief given z, a, b (α is a normalizing constant):

   b'(s') = α p(z | s', a) Σ_s p(s' | s, a) b(s)

•  Compare with HMMs!

Belief Space

•  Since belief is a probability distribution:
   –  For n states, belief has n−1 degrees of freedom
   –  Beliefs live in an (n−1)-dimensional simplex
(Figure: example simplices for n = 2, 3, 4.)

Belief Space Illustrated

(Figure: the belief simplex for |S| = 3, plotted over b(s0) and b(s1); the remaining coordinate is b(s2) = 1 − b(s0) − b(s1).)

POMDP Value Functions

•  Bellman equation for POMDPs:

   V*(b) = max_a [ ρ(b, a) + γ Σ_{b'} p(b' | a, b) V*(b') ]

•  Expectation of R given b, a:  ρ(b, a) = Σ_s R(s, a) b(s)
•  Belief transition probability, derived from the POMDP transition/observation models (b_a^z denotes the belief that results from b after taking action a and observing z):

   p(b' | a, b) = Σ_{z : b_a^z = b'} Σ_{s'} p(z | s', a) Σ_s p(s' | s, a) b(s)

•  Why a sum and not an integral?
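The belief update above is the computational core of tracking what the agent knows. Below is a minimal numerical sketch (not from the original slides); the array layouts for the transition model T and the observation model O, and the function name, are assumptions made for illustration.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One Bayes-filter step: b'(s') = α · P(z|s',a) · Σ_s P(s'|s,a) · b(s).

    b : length-|S| belief vector (b[s] = probability of state s)
    a : action index, z : observation index
    T : T[a, s, s'] = P(s' | s, a)   (transition model)
    O : O[a, s', z] = P(z | s', a)   (observation model)
    """
    predicted = b @ T[a]               # Σ_s P(s'|s,a) b(s), one entry per s'
    unnormalized = O[a, :, z] * predicted
    p_z = unnormalized.sum()           # P(z | b, a); equals 1/α
    if p_z == 0.0:
        raise ValueError("observation z has probability 0 under belief b and action a")
    return unnormalized / p_z
```

Note that the normalizer p_z is exactly p(z | b, a), the weight that the belief-transition probability above assigns to the successor belief b_a^z.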
Finite State Machine Policies

•  Policies are represented as finite state machines.
   –  States μ1, …, μm labeled with actions
   –  Deterministic transition function δ(μ, z)
   –  The belief state is not used when following the policy

POMDP Policy Evaluation

•  Policy × POMDP induces a Markov chain
   –  States: σμ,s (∀ s ∈ S, μ ∈ FSM)
   –  Reward function: ρμ,s = R(s, aμ)
   –  Transition function:

      τ(σμ,s , σμ',s') = P(s' | s, aμ) · Σ_{z : δ(μ,z) = μ'} P(z | s', aμ)

      (The left-hand side is Pr(μ', s' | μ, s); the first factor is Pr(s' | μ, s) and the second is Pr(μ' | s', μ, s).)
   –  Discount factor: γ
•  The POMDP value function can be extracted from the Markov chain value function (see the sketch at the end of these notes)

POMDP Value Functions

•  Γ = {α1, …, αn}
•  V is the max surface of Γ: V(b) = max_{α ∈ Γ} α · b
•  Facets correspond to machine states
(Figure: a piecewise-linear value function over a one-dimensional belief space, formed as the upper surface of vectors α1, α2, α3.)

Policy Iteration for POMDPs (one of several possible methods)

•  The basic idea of MDP policy iteration carries over to POMDPs
•  Implementation is tricky
•  Highlights:
   –  A set of rules for adding new machine states to the finite state controller, such that the new controller is guaranteed to improve on the old one
   –  Alternate between policy evaluation phases and policy improvement phases
•  Good news: turns a nasty, continuous problem into a somewhat manageable discrete one
•  Bad news: may add O(m^#Z) new FSC states per iteration (m = current number of states, #Z = number of possible observations)
•  In practice, it is possible to find optimal solutions only for fairly small POMDPs (high 10s to low 100s of states)

POMDP Conclusions

•  Generalize MDPs to include imperfect information about the state
•  Like HMMs in that we track a distribution over underlying states
•  Every POMDP is a continuous-state MDP, where MDP states correspond to POMDP belief states
•  POMDPs are quite tricky and computationally expensive to solve in practice
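As a concrete companion to the policy evaluation and alpha-vector slides above, here is a minimal sketch (not from the original slides; the array layouts and function names are assumptions) that evaluates a finite state machine policy by solving the induced Markov chain over (machine state, hidden state) pairs and reading off one alpha vector per machine state, so that V(b) = max_μ αμ · b.

```python
import numpy as np

def evaluate_fsm_policy(T, O, R, gamma, actions, delta):
    """Evaluate a finite-state-machine policy on a POMDP.

    T[a, s, s'] = P(s' | s, a),  O[a, s', z] = P(z | s', a),  R[s, a] = reward
    actions[m]  = action taken in machine state m
    delta       = (M, Z) integer array, delta[m, z] = next machine state after z

    Returns alpha[m, s] = expected discounted return when the controller is in
    machine state m and the POMDP is in hidden state s.
    """
    M, Z = delta.shape
    S = T.shape[1]
    n = M * S                      # one chain state per (machine state, POMDP state)
    P = np.zeros((n, n))           # chain transition matrix
    r = np.zeros(n)                # chain reward vector
    for m in range(M):
        a = actions[m]
        for s in range(S):
            i = m * S + s
            r[i] = R[s, a]
            for s2 in range(S):
                for z in range(Z):
                    j = delta[m, z] * S + s2
                    P[i, j] += T[a, s, s2] * O[a, s2, z]
    # Exact policy evaluation: solve (I - gamma * P) v = r.
    v = np.linalg.solve(np.eye(n) - gamma * P, r)
    return v.reshape(M, S)         # one alpha vector per machine state

def fsm_value(b, alphas):
    """Value of belief b under the FSM policy: V(b) = max_m alpha_m . b."""
    return np.max(alphas @ b)
```

Solving the linear system gives exact policy evaluation; the policy improvement phase described in the policy iteration slide would then add machine states whose alpha vectors dominate the current surface.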