efficient-mdp-f10 - Summary of MDPs(until Now Finitehorizon...



Summary of MDPs (until now)
  Finite-horizon MDPs
    - Nonstationary policy
    - Value iteration
        - Compute V0 .. Vk .. VT, the value functions for k stages to go
        - Vk is computed in terms of Vk-1
        - Policy k is MEU
  Infinite-horizon MDPs
    - Stationary policy
    - Value iteration
        - Converges because of the contraction property of the Bellman operator
    - Policy iteration
  Indefinite-horizon MDPs
    - Stochastic Shortest Path problems (with initial state given)
    - Proper policies
    - Can exploit start state

Ideas for Efficient Algorithms
  - Use heuristic search (and reachability information)
      - LAO*, RTDP
  - Use execution and/or simulation
      - "Actual execution": reinforcement learning
        (the main motivation for RL is to "learn" the model)
      - "Simulation": simulate the given model to sample possible futures
  - Use "factored" ...

Heuristic search (e.g. A*/AO...

Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)
  - VI and PI approaches use the dynamic programming update
      - Set the value of a state in terms of the maximum expected value achievable by doing actions from that state.
  - They do the update for every state in the state space
      - Wasteful if we know the initial state(s) that the agent is starting from

Real Time Dynamic Programming [Barto, Bradtke, Singh '95]
  - Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on visited states
  - RTDP: repeat trials until the cost function converges
  - RTDP was originally introduced for reinforcement learning
      - For RL, instead of "simulate" you "execute"
      - You also have to do "exploration" in addition to "exploitation": with probability p, follow the greedy policy; with probability 1-p, pick a random action

[Figure: RTDP trial on a stochastic shortest path MDP. At s0, Qn+1(s0,a) is computed for actions a1, a2, a3 from the Jn values of their successor states; Jn+1(s0) is the min over actions; the greedy action is a_greedy = a2; the trial continues toward the goal.]

Greedy "on-policy" RTDP without execution
  - Using the current utility values, select the action with the highest expected utility (greedy action) at each state, until you reach a terminating state.
  - Update the values along this path.
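The trial-and-backup procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the slides' own code: the toy model `TRANSITIONS`, the state and action names, the threshold `eps`, and the choice of a zero initial estimate are all hypothetical. It uses the cost-minimizing form (J with a min over actions) from the RTDP trial figure; the outer loop in `rtdp` plays the role of the "loop back until the values stabilize" step.

```python
import random

# Hypothetical toy SSP (not from the slides):
# TRANSITIONS[state][action] = list of (probability, next_state, cost).
TRANSITIONS = {
    "s0": {
        "a1": [(1.0, "s1", 1.0)],
        "a2": [(0.8, "goal", 2.0), (0.2, "s0", 2.0)],
    },
    "s1": {
        "a1": [(0.5, "goal", 1.0), (0.5, "s0", 1.0)],
    },
}
GOAL = "goal"


def q_value(J, s, a):
    """Expected cost of doing a in s, then following the current estimate J."""
    return sum(p * (c + J.get(s2, 0.0)) for p, s2, c in TRANSITIONS[s][a])


def bellman_backup(J, s):
    """Set J[s] to the min over actions; return (greedy action, residual)."""
    greedy = min(TRANSITIONS[s], key=lambda a: q_value(J, s, a))
    new_value = q_value(J, s, greedy)
    residual = abs(new_value - J.get(s, 0.0))
    J[s] = new_value
    return greedy, residual


def rtdp_trial(J, start, rng, max_steps=1000):
    """Simulate the greedy policy from start, backing up each visited state."""
    s, worst = start, 0.0
    for _ in range(max_steps):
        if s == GOAL:
            break
        a, residual = bellman_backup(J, s)
        worst = max(worst, residual)
        outcomes = TRANSITIONS[s][a]
        # Sample the next state according to the outcome probabilities.
        s = rng.choices([o[1] for o in outcomes],
                        weights=[o[0] for o in outcomes])[0]
    return worst


def rtdp(start, eps=1e-6, max_trials=10000, seed=0):
    """Repeat trials until the largest residual seen in a trial is below eps."""
    rng = random.Random(seed)
    J = {}  # missing entries read as 0, an admissible initial estimate here
    for _ in range(max_trials):
        if rtdp_trial(J, start, rng) < eps:
            break
    return J


J = rtdp("s0")
print(round(J["s0"], 3))
```

In this toy model the optimal choice at s0 is to keep trying a2, with expected cost 2 / 0.8 = 2.5. Once a2 becomes greedy, s1 stops being visited, so its estimate may stay below its true value: only states on the greedy path need to converge.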
  - Loop back until the values stabilize

RTDP: Comments
  Properties
    - If all states are visited infinitely often, then Jn -> J*
    - Only relevant states will be considered
        - A state is relevant if the optimal policy could visit it.
        - Notice the emphasis on "optimal policy": just because a rough neighborhood surrounds the National Mall doesn't mean that you will need to know what to do in that neighborhood
  Advantages
    - Anytime: more probable states explored quickly
  Disadvantages
  (Do we care about complete convergence? Think Cpt. Sullenberger.)

Labeled RTDP [Bonet & Geffner '03]
  - Initialise J0 with an admissible heuristic
      - Jn monotonically increases
  - Label a state as solved if the Jn for that state has converged
      - Converged means the Bellman residual is less than ε
  - Backpropagate the "solved" labeling
  - Stop trials when they reach any solved state
  [Figure: two examples. (1) A state s whose best action leads to the goal G while the other actions have high Q costs: J(s) won't change. (2) Two states s and t that lead to each other: both s and t get solved together.]

Labeled RTDP: Properties
  - Admissible J0 => optimal J*
  - Heuristic-guided: explores a subset of the reachable state space
  - Anytime: focusses attention on more probable states
  - Fast convergence

Recent Advances: Focused RTDP [Smith & Simmons '06]
  - Similar to Bounded RTDP, except
      - a more sophisticated definition of priority that combines the gap and the probability of reaching the state
      - adaptively increasing the max-trial length

Recent Advances: Learning DFS [Bonet & Geffner '06]
  - Iterative Deepening A* equivalent for MDPs
  - Find strongly connected components to check for a state being solved.

Other Advances
  - Ordering the Bellman backups to maximise information flow. [Wingate & Seppi '05] [Dai & Hansen '07]
  - Partition the state space and combine value iterations from different partitions. [Wingate & Seppi '05] [Dai & Goldsmith '07]
  - External memory version of value iteration [Edelkamp, Jabbar & Bonet '07]

Probabilistic Planning
  - The competition (IPPC)
  - The action language (PPDDL)

Factored Representations: Actions
  - Actions can be represented directly in terms of their effects on the individual state variables (fluents).
  - Write a Bayes network relating the value of fluents at the state before and after the action
      - Bayes networks representing fluents at different time points are called "Dynamic Bayes Networks"
      - We look at 2TBN (2-time-slice dynamic Bayes nets)
  - The CPTs of the BNs can be represented compactly too!
  - Go further by using the STRIPS assumption
      - Fluents not affected by the action are not represented explicitly in the model
      - Called the Probabilistic STRIPS Operator (PSO) model

[Figure: action CLK]

Not ergodic

How to compete?
  Offline policy generation
    - First compute the whole policy
        - Get the initial state
        - Compute the optimal policy given the initial state and the goals
    - Then just execute the policy
    [Figure: timeline with one policy computation phase followed by execution]
  Online action selection
    - Loop
        - Compute the best action for the current state
        - Execute it
        - Get the new state
    - Pros: provides fast first response
    [Figure: timeline alternating select, execute, select, execute, ...]

Two Models of Evaluating Probabilistic Planning
  - IPPC (Probabilistic Planning Competition)
      - How often did you reach the goal under the given time constraints
      - FF-HOP, FF-Replan
  - Evaluate on the quality of the policy
      - Converging to the optimal policy faster
      - LRTDP, mGPT, Kolobov's approach

1st IPPC & Post-Mortem
  IPPC competitors
    - Most IPPC competitors used different approaches for offline policy generation.
    - One group implemented simple online "replanning"
  Results and post-mortem
    - To everyone's surprise, the replanning approach wound up winning the competition.
    - Lots of hand wringing ensued.
        - Maybe we should require that the ...

FF-Replan
  - Simple replanner
  - Determinizes the probabilistic problem
  - Solves for a plan in the determinized problem
  [Figure: plan from S to G through actions a1 .. a5, with replanning from states reached off the plan]

All Outcome Replanning (FFRA), ICAPS-07
  [Figure: an action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into Action 1 with Effect 1 and Action 2 with Effect 2]

Reducing calls to FF
  - We can reduce calls to FF by memoizing successes
      - If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0--a0--s1--a1--s2--a2--s3...an--sG
      - Then in addition to sending a0 to the simulator, we can memoize {si--ai} as the partial policy.
      - Whenever a new state is given by the simulator, we can see if it is already in the partial policy
  - Additionally, FF-Replan can consider every state in the ...

Hindsight Optimization for Anticipatory Planning/Scheduling
  - Consider a deterministic planning (scheduling) domain, where the goals arrive probabilistically
      - Using up resources and/or doing greedy actions may preclude you from exploiting the later opportunities
  - How do you select actions to perform?
      - Answer: if you have a distribution of the goal arrival, then sample goals up to a certain horizon using this ...

[Figure: Probabilistic planning (goal-oriented). A two-step tree (Time 1, Time 2) from initial state I with actions A1 and A2, each with two probabilistic outcomes; left outcomes are more likely. Objective: maximize goal achievement. Legend: action, probabilistic outcome, state, dead end, goal state.]

[Figure: Probabilistic planning, all-outcome determinization. The same tree with each action split into one deterministic action per outcome (A1-1, A1-2, A2-1, A2-2). Objective: find goal.]

Problems of FF-Replan and better alternative sampling
  - FF-Replan's static determinizations don't respect probabilities.
  - We need "probabilistic and dynamic determinization"
      - Sample future outcomes and determinization in hindsight
      - Each future sample becomes a known-future deterministic problem

Hindsight Optimization (Online Computation of V_HS)
  - Pick action a with highest Q(s,a,H), where
      $Q(s,a,H) = R(s,a) + \sum_{s'} T(s,a,s')\, V^*(s',H-1)$
  - Compute V* by sampling
      - H-horizon future F_H for M = [S,A,T,R]: a mapping of state, action and time (h < H) to a state, $S \times A \times h \to S$
      - Common-random-number (correlated) vs. independent futures
      - Time-independent vs. time-dependent futures
  - V_HS overestimates V*. Why?
      - Intuitively, because V_HS can assume that it can use different policies in different futures, while V* needs to pick one policy that works best (in expectation) in all futures.
  - But then, V_FFRa overestimates V_HS
      - Viewed in terms of J*, V_HS ...

Solving stochastic planning problems via determinizations
  - Quite an old idea (e.g. envelope extension methods)
  - What is new is the increasing realization that determinizing approaches provide state-of-the-art performance
      - Even for probabilistically interesting domains
  - Should be a happy occasion.

Hindsight Optimization (Online Computation of V_HS)
  - H-horizon future F_H for M = [S,A,T,R]
      - Mapping of state, action and time (h < H) to a state, $S \times A \times h \to S$
      - Each future is a deterministic problem
  - Value of a policy $\pi$ for F_H: $R(s, F_H, \pi)$
  - Pick action a with highest Q(s,a,H), where
      $Q(s,a,H) = R(s,a) + \sum_{s'} T(s,a,s')\, V^*(s',H-1)$
      $V^*(s,H) = \max_\pi E_{F_H}[R(s,F_H,\pi)]$
  - Compare this and the real value:
      $V_{HS}(s,H) = E_{F_H}[\max_\pi R(s,F_H,\pi)]$  (the inner max is done by FF)
      $V_{FFRa}(s) = \max_F V(s,F) \ge V_{HS}(s,H) \ge V^*(s,H)$
      $Q(s,a,H) = R(s,a) + E_{F_{H-1}}[\max_\pi R(a(s),F_{H-1},\pi)]$
  - In our proposal, computation of $\max_\pi R(s,F_{H-1},\pi)$ is ...

Implementation: FF-Hindsight
  - Constructs a set of futures
  - Solves the planning problem using the H-horizon futures using FF
  - Sums the rewards of each of the plans
  - Chooses the action with the largest Q_hs value

Improvement Ideas
  Reuse
    - Generated futures that are still relevant
    - Scoring for action branches at each step
    - If expected outcomes occur, keep the plan
  Future generation
    - Not just probabilistic
    - Somewhat even distribution of the space
  Adaptation
    - Dynamic width and horizon for sampling
    - Actively detect and avoid unrecoverable failures on top of sampling

[Figure: Hindsight sample 1 on the goal-oriented tree. In this sampled future, A1 scores 1 and A2 scores 0.]

Exploiting Determinism
  - Find the longest prefix for all plans
  - Apply the actions in the prefix continuously until one is not applicable
  - Resume ZSL/OSL steps
  [Figure: plans generated for the chosen action a*, each from S1 to G. The longest prefix for each plan is identified and executed without running ZSL, OSL or FF!]

Handling unlikely outcomes: All-outcome Determinization
  - Assign each possible outcome an action
  - Solve for a plan
  - Combine the plan with the plans from the HOP solutions

Deterministic Techniques for Stochastic Planning
  No longer the Rodney Dangerfield of Stochastic Planning?

Determinizations
  - Most-likely outcome determinization
      - Inadmissible
      - e.g. if the only path to the goal relies on a less likely outcome of an action
  - All outcomes determinization
      - Admissible, but not very informed
      - e.g. a very unlikely action leads you straight to the goal

Relaxations for Stochastic Planning
  - Determinizations can also be used as a basis for heuristics to initialize the V for value iteration [mGPT; GOTH etc.]
  - Heuristics come from relaxation
  - We can relax along two separate dimensions:
      - Relax -ve interactions: consider +ve interactions alone using relaxed planning graphs
      - Relax uncertainty

Solving Determinizations
  - If we relax -ve interactions
      - Then compute a relaxed plan
      - Admissible if the optimal relaxed plan is computed; inadmissible otherwise
  - If we keep -ve interactions
      - Then use a deterministic planner (e.g. FF/LPG)
      - Inadmissible unless the underlying planner is optimal

Dimensions of Relaxation
  - Reducing uncertainty: bound the number of stochastic outcomes (stochastic "width")
  [Figure: two axes, negative interactions and uncertainty, with increasing consideration along each. Points: 1 = relaxed plan heuristic, 2 = McLUG, 3 = FF/LPG, 4 = limited width stochastic planning?]
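The two determinizations discussed above can be illustrated on a tiny action model. This is a sketch under stated assumptions: the dictionary `ACTIONS`, its probabilities, and the effect labels are hypothetical, and effects are plain strings rather than real PPDDL effects. The `A1-1`, `A1-2` naming follows the labels used in the all-outcome determinization figures.

```python
# Hypothetical probabilistic action model (not from the slides):
# action name -> list of (probability, effect label).
ACTIONS = {
    "A1": [(0.7, "move-left"), (0.3, "move-right")],
    "A2": [(0.9, "stay"), (0.1, "jump-to-goal")],
}


def most_likely_determinization(actions):
    """Keep only the most probable outcome of each action."""
    return {
        name: max(outcomes, key=lambda o: o[0])[1]
        for name, outcomes in actions.items()
    }


def all_outcomes_determinization(actions):
    """Create one deterministic action per outcome, ignoring probabilities."""
    deterministic = {}
    for name, outcomes in actions.items():
        for i, (_prob, effect) in enumerate(outcomes, start=1):
            deterministic[f"{name}-{i}"] = effect
    return deterministic


most_likely = most_likely_determinization(ACTIONS)
all_outcomes = all_outcomes_determinization(ACTIONS)
print(most_likely)
print(all_outcomes)
```

In this example the most-likely version drops `jump-to-goal` entirely, so a planner using it cannot find a path that depends on that outcome. The all-outcomes version keeps it as `A2-2` but treats it as if it were certain, which is why the resulting estimate is optimistic.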
Dimensions of Relaxation

                       -ve interactions
  Uncertainty     None            Some    Full
  None            Relaxed Plan            FF/LPG
  Some                                    Limited width stoch. planning
  Full            McLUG

[Figure: Expressiveness v. cost. Node expansions v. heuristic computation cost (axes: nodes expanded, computation cost). Labels include h=0, "FF R", FF, FF-Replan, McLUG and limited width stochastic planning.]

Reducing Heuristic Computation Cost by exploiting factored representations
  - The heuristics computed for a state might give us an idea about the heuristic value of other "similar" states
      - Similarity can be determined in terms of the state structure
  - Exploit the overlapping structure of heuristics for different states
      - E.g. SAG idea for McLUG
      - E.g. triangle tables idea for plans (c.f. Kolobov)

A Plan is a Terrible Thing to Waste
  - Suppose we have a plan s0--a0--s1--a1--s2--a2--s3...an--sG
  - We realized that this tells us not just the estimated value of s0, but also of s1, s2 ... sn
  - So we don't need to compute the heuristic for them again
  - Is that all?
      - If we have states and actions in factored representation, then we can explain exactly what ...

Triangle Table Memoization
  - Use triangle tables / memoization
  [Figure: a blocks problem rearranging the stack C, B, A into A, B, C]
  - If the above problem is solved, then we don't need to call FF again for the one below:
  [Figure: a blocks problem rearranging the stack B, A into A, B]

Explanation-based Generalization (of Successes and Failures)
  - Suppose we have a plan P that solves a problem [S, G].
  - We can first find out what aspects of S this plan actually depends on
      - Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
      - Now you can memoize this plan for just that subset of S

Factored Representations: Reward, Value and Policy Functions
  - Reward functions can be represented in factored form too. Possible representations include
      - Decision trees (made up of fluents)
      - ADDs (algebraic decision diagrams)
  - Value functions are like reward functions (so they too can be represented similarly)
  - The Bellman update can then be done directly using factored representations.

SPUDD's use of ADDs

Direct manipulation of ADDs in SPUDD ...
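The plan-reuse idea from "A Plan is a Terrible Thing to Waste" (a plan s0--a0--s1--...--sG gives an estimate for every state on it, and can be stored as a partial policy {si--ai}) can be sketched as follows. This is an illustration under assumptions: the function name `memoize_plan`, the unit action costs, and the flat state names are hypothetical, and it works on whole states rather than on the factored explanations that triangle tables would use.

```python
def memoize_plan(states, actions, costs=None):
    """Turn one deterministic plan into a partial policy and cost-to-go table.

    states  = [s0, s1, ..., sG]   (one more entry than actions)
    actions = [a0, a1, ..., an]
    costs   = per-action costs; unit costs are assumed if omitted.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("a plan needs exactly one more state than actions")
    if costs is None:
        costs = [1.0] * len(actions)

    policy = {}
    cost_to_go = {states[-1]: 0.0}
    remaining = 0.0
    # Walk backwards from the goal so each state gets the cost of its suffix.
    for i in range(len(actions) - 1, -1, -1):
        remaining += costs[i]
        s = states[i]
        if s not in cost_to_go:  # keep the occurrence closest to the goal
            policy[s] = actions[i]
            cost_to_go[s] = remaining
    return policy, cost_to_go


policy, cost_to_go = memoize_plan(
    ["s0", "s1", "s2", "sG"], ["a0", "a1", "a2"]
)
print(policy)
print(cost_to_go)
```

When the simulator later returns a state that is already a key of `policy`, the stored action can be used directly instead of calling the planner again, and `cost_to_go` supplies a heuristic value for that state without recomputation.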