Markov Decision Processes (MDPs)
Ron Parr, CPS 170

Covered in First Lecture
•  Decision Theory Review
•  MDPs
•  Algorithms for MDPs
   –  Value Determination
   –  Optimal Policy Selection
      •  Value Iteration
      •  Policy Iteration
      •  Linear Programming

Decision Theory
What does it mean to make an optimal decision?
•  Asked by economists to study consumer behavior
•  Asked by MBAs to maximize profit
•  Asked by leaders to allocate resources
•  Asked in OR to maximize the efficiency of operations
•  Asked in AI to model intelligence
•  Asked (sort of) by any intelligent person every day

Utility Functions
•  A utility function is a mapping from world states to real numbers
•  Also called a value function
•  Rational or optimal behavior is typically viewed as maximizing expected utility:

      max_a ∑_s P(s | a) U(s)        (a = actions, s = states)

Swept Under the Rug Today
•  Utility of money (assumed 1:1)
•  How to determine costs/utilities
•  How to determine probabilities

Playing a Game Show
•  Assume a series of questions
   –  Increasing difficulty
   –  Increasing payoff
•  Choice:
   –  Accept accumulated earnings and quit
   –  Continue and risk losing everything
•  "Who wants to be a millionaire?"

State Representation (simplified game)
[Figure: a chain of states — Start, 1 correct, 2 correct, 3 correct — with question payoffs $100, $1,000, $10,000, and $50,000. Quitting yields the accumulated winnings ($0, $100, $1,100, or $11,100); a wrong answer yields $0; answering every question yields $61,100.]

Making Optimal Decisions
•  Work backwards from the future to the present
•  Consider the $50,000 question
   –  Suppose P(correct) = 1/10
   –  V(stop) = $11,100
   –  V(continue) = 0.9 × $0 + 0.1 × $61.1K = $6.11K
•  The optimal decision stops

Working Recursively
[Figure: the same chain with success probabilities 9/10, 3/4, 1/2, and 1/10 and backed-up values V = $3,749, $4,166, $5,555, and $11.1K; the optimal policy continues until the last question and then stops.]

Decision Theory Review
•  Provides a theory of optimal decisions
•  Principle of maximizing utility
•  Easy for small, tree-structured spaces with
   –  Known utilities
   –  Known probabilities

Covered Today
•  Decision Theory
•  MDPs
•  Algorithms for MDPs
   –  Value Determination
   –  Optimal Policy Selection
      •  Value Iteration
      •  Policy Iteration
      •  Linear Programming

Dealing with Loops
Suppose you can pay $1,000 (from any losing state) to play again.
[Figure: the game-show chain with success probabilities 9/10, 3/4, 1/2, and 1/10, where losing now costs $1,000 and returns you to the start instead of ending the game with $0.]

From Policies to Linear Systems
•  Suppose we always pay until we win.
•  What is the value of following this policy?

      V(s0) = 0.10(−1000 + V(s0)) + 0.90 V(s1)
      V(s1) = 0.25(−1000 + V(s0)) + 0.75 V(s2)
      V(s2) = 0.50(−1000 + V(s0)) + 0.50 V(s3)
      V(s3) = 0.90(−1000 + V(s0)) + 0.10(61,100)

   (Losing pays $1,000 and returns us to the start; winning continues to the next question.)

And the Solution Is…
[Figure: under the always-continue policy the values become V(s0) = $32.47K, V(s1) = $32.58K, V(s2) = $32.95K, V(s3) = $34.43K, versus $3,749, $4,166, $5,555, and $11.11K without the pay-to-continue option.]
Is this optimal? How do we find the optimal policy?
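
The always-continue values above can be reproduced by solving the four Bellman equations as a linear system. The following Python sketch is not from the slides; it simply encodes the game-show example's transitions and rewards, rearranges V = P_π V + R into (I − P_π) V = R, and solves with NumPy.

import numpy as np

# Game-show MDP under the fixed policy "always pay $1,000 and continue".
# States s0..s3 count questions answered so far; no discounting here.
p_win = np.array([0.9, 0.75, 0.5, 0.1])    # P(correct) for each question
p_fail = 1.0 - p_win

# Build P_pi (state-to-state transitions under the policy) and the reward vector R.
P = np.zeros((4, 4))
R = p_fail * -1000.0                       # expected cost of failing and paying to restart
for s in range(4):
    P[s, 0] = p_fail[s]                    # a wrong answer sends us back to s0
    if s < 3:
        P[s, s + 1] = p_win[s]             # a correct answer advances one state
R[3] += p_win[3] * 61100                   # winning the last question pays out $61,100

# V = P V + R  =>  (I - P) V = R
V = np.linalg.solve(np.eye(4) - P, R)
print(V)   # ≈ [32470, 32581, 32952, 34433], matching the slide's $32.47K ... $34.43K

The same (I − γP_π)⁻¹R form reappears below when value determination is written in matrix form.
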
The MDP Framework
•  State space: S
•  Action space: A
•  Transition function: P
•  Reward function: R
•  Discount factor: γ
•  Policy: π
Objective: maximize expected, discounted return (decision-theoretic optimal behavior)

Applications of MDPs
•  AI/Computer Science
   –  Robotic control (Koenig & Simmons, Thrun et al., Kaelbling et al.)
   –  Air campaign planning (Meuleau et al.)
   –  Elevator control (Barto & Crites)
   –  Computation scheduling (Zilberstein et al.)
   –  Control and automation (Moore et al.)
   –  Spoken dialogue management (Singh et al.)
   –  Cellular channel allocation (Singh & Bertsekas)

Applications of MDPs
•  Economics/Operations Research
   –  Fleet maintenance (Howard, Rust)
   –  Road maintenance (Golabi et al.)
   –  Packet retransmission (Feinberg et al.)
   –  Nuclear plant management (Rothwell & Rust)

Applications of MDPs
•  EE/Control
   –  Missile defense (Bertsekas et al.)
   –  Inventory management (Van Roy et al.)
   –  Football play selection (Patek & Bertsekas)
•  Agriculture
   –  Herd management (Kristensen, Toft)

The Markov Assumption
•  Let S_t be a random variable for the state at time t
•  P(S_t | A_{t−1} S_{t−1}, …, A_0 S_0) = P(S_t | A_{t−1} S_{t−1})
•  Markov is a special kind of conditional independence
•  The future is independent of the past given the current state

Understanding Discounting
•  Mathematical motivation
   –  Keeps values bounded
   –  What if I promise you $0.01 every day you visit me?
•  Economic motivation
   –  Discount comes from inflation
   –  A promise of $1.00 in the future is worth $0.99 today
•  Probability of dying
   –  Suppose there is probability ε of dying at each decision interval
   –  Transition with probability ε to a state with value 0
   –  Equivalent to a 1 − ε discount factor

Discounting in Practice
•  Often chosen unrealistically low
   –  Faster convergence
   –  Slightly myopic policies
•  Can reformulate most algorithms for average reward
   –  Mathematically uglier
   –  Somewhat slower run time

Covered Today
•  Decision Theory
•  MDPs
•  Algorithms for MDPs
   –  Value Determination
   –  Optimal Policy Selection
      •  Value Iteration
      •  Policy Iteration
      •  Linear Programming

Value Determination
Determine the value of each state under policy π:

      V(s) = R(s, π(s)) + γ ∑_{s'} P(s' | s, π(s)) V(s')        (Bellman equation)

[Figure: a three-state example in which s1 has reward R = 1 and transitions to s2 with probability 0.4 and to s3 with probability 0.6, so that]

      V(s1) = 1 + γ (0.4 V(s2) + 0.6 V(s3))

Matrix Form

           [ P(s1|s1,π(s1))  P(s2|s1,π(s1))  P(s3|s1,π(s1)) ]
      Pπ = [ P(s1|s2,π(s2))  P(s2|s2,π(s2))  P(s3|s2,π(s2)) ]
           [ P(s1|s3,π(s3))  P(s2|s3,π(s3))  P(s3|s3,π(s3)) ]

      V = γ Pπ V + R

How do we solve this system?

Solving for Values

      V = γ Pπ V + R

For moderate numbers of states we can solve this system exactly:

      V = (I − γ Pπ)⁻¹ R

Guaranteed invertible because γ Pπ has spectral radius < 1.

Iteratively Solving for Values

      V = γ Pπ V + R

For larger numbers of states we can solve this system indirectly:

      V^(i+1) = γ Pπ V^(i) + R

Guaranteed convergent because γ Pπ has spectral radius < 1.
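
As a concrete illustration (not on the slides), here is a minimal Python sketch of the indirect scheme above, reusing the game-show policy's P and R from the earlier sketch. Because there is no discounting in that example, convergence is slow (and the intermediate numbers depend on the update order), but the iteration approaches the same fixed point as the direct solve.

import numpy as np

# Iterative evaluation of a fixed policy: V_{i+1} = R + gamma * P_pi V_i, starting from V_0 = 0.
gamma = 1.0                                # as in the game-show example
p_win = np.array([0.9, 0.75, 0.5, 0.1])
p_fail = 1.0 - p_win

P = np.zeros((4, 4))
R = p_fail * -1000.0
for s in range(4):
    P[s, 0] = p_fail[s]
    if s < 3:
        P[s, s + 1] = p_win[s]
R[3] += p_win[3] * 61100

V = np.zeros(4)
for i in range(2000):                      # many sweeps are needed since gamma = 1 here
    V = R + gamma * P @ V
print(np.round(V))   # ≈ [32470, 32581, 32952, 34433], the same values as the direct solve
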
Establishing Convergence
•  Eigenvalue analysis
•  Monotonicity
   –  Assume all values start pessimistic
   –  One value must always increase
   –  Can never overestimate
•  Contraction analysis…

Contraction Analysis
•  Define the max norm:  ‖V‖∞ = max_i |V_i|
•  Consider V1 and V2 with  ‖V1 − V2‖∞ = ε
•  WLOG say  V1 ≤ V2 + ε̄   (where ε̄ is the vector of all ε's)

Contraction Analysis Contd.
•  At the next iteration, for V2:   V2′ = R + γ P V2
•  For V1:

      V1′ = R + γ P V1 ≤ R + γ P (V2 + ε̄) = R + γ P V2 + γ P ε̄ = R + γ P V2 + γ ε̄

   (distribute P; since each row of P sums to 1, P ε̄ = ε̄)
•  Conclude:   ‖V2′ − V1′‖∞ ≤ γ ε

Importance of Contraction
•  Any two value functions get closer
•  The true value function V* is a fixed point
•  The max-norm distance from V* shrinks exponentially fast with iterations:

      ‖V^(0) − V*‖∞ = ε   ⇒   ‖V^(n) − V*‖∞ ≤ γ^n ε

   (superscripts indicate iterations here)

Iterative Policy Evaluation
[Figure: iterative evaluation of the always-continue policy on the game-show chain (success probabilities 9/10, 3/4, 1/2, 1/10; $1,000 replay cost), showing each state's value over successive iterations.]

Iterations Continued

      Iteration    V(S0)      V(S1)      V(S2)      V(S3)
      0               0.0        0.0        0.0        0.0
      1            -100.0     -250.0     -500.0     5210.0
      2            -335.0     -650.0     2055.0     4908.0
      3            -718.5     1207.5     1892.5     9908.5
      4             914.9      989.8     1595.0     4563.4
      5             882.3     1175.0     2239.1     6033.4
      10           2604.5     3166.7     4158.8     7241.8
      20           5994.8     6454.5     7356.0     10.32K
      200          29.73K     29.25K     29.57K     31.61K
      2000         32.47K     32.58K     32.95K     34.43K

Note: slow convergence because γ = 1.

Covered Today
•  Decision Theory
•  MDPs
•  Algorithms for MDPs
   –  Value Determination
   –  Optimal Policy Selection
      •  Value Iteration
      •  Policy Iteration
      •  Linear Programming

Finding Good Policies
Suppose an expert told you the "value" of each state:  V(S1) = 10, V(S2) = 5.
[Figure: Action 1 reaches S1 with probability 0.5 and S2 with probability 0.5; Action 2 reaches S1 with probability 0.7 and S2 with probability 0.3.]

Improving Policies
•  How do we get the optimal policy?
•  Take the optimal action in every state
•  Fixed-point equation with choices:

      V*(s) = max_a [ R(s, a) + γ ∑_{s'} P(s' | s, a) V*(s') ]

   (the decision-theoretic optimal choice given V*)

Value Iteration
We can't solve the system directly with a max in the equation. Can we solve it by iteration?

      V^(i+1)(s) = max_a [ R(s, a) + γ ∑_{s'} P(s' | s, a) V^(i)(s') ]

•  Called value iteration or simply successive approximation
•  Same as value determination, but we can change actions
•  Convergence:
   –  Can't do eigenvalue analysis (not linear)
   –  Still monotonic
   –  Still a contraction in max norm (exercise)
   –  Converges quickly

Optimality
•  VI converges to the optimal policy
•  Why?
•  The optimal policy is stationary
•  Why?

Covered Today
•  Decision Theory
•  MDPs
•  Algorithms for MDPs
   –  Value Determination
   –  Optimal Policy Selection
      •  Value Iteration
      •  Policy Iteration
      •  Linear Programming

Greedy Policy Construction
Pick the action with the highest expected future value:

      π_V(s) = argmax_a [ R(s, a) + γ ∑_{s'} P(s' | s, a) V(s') ]
      π_V = greedy(V)

   (the sum is an expectation over next-state values)

Consider Our First Policy
[Figure: the game-show chain with success probabilities 9/10, 3/4, 1/2, and 1/10 and the no-replay values V = $3.7K, $4.1K, $5.6K, and $11.1K.]
Recall: we played until the last state, then quit.
Is this greedy with the cheat (pay-$1,000) option?
The value of continuing in the last state is 0.1 × 111,100 + 0.9 × (3,749 − 1,000) = $13,584.

Bootstrapping: Policy Iteration
Idea: greedy selection is useful even with a suboptimal V.

      Guess π_V = π_0
      Repeat until the policy doesn't change:
         Vπ ← value of acting on π (solve the linear system)
         π_V ← greedy(Vπ)

•  Guaranteed to find the optimal policy
•  Usually takes a very small number of iterations
•  Computing the value functions is the expensive part

Comparing VI and PI
•  VI
   –  Value changes at every step
   –  Policy may change at every step
   –  Many cheap iterations
•  PI
   –  Alternates policy/value updates
   –  Solves for the value of each policy exactly
   –  Fewer, slower iterations (need to invert a matrix)
•  Convergence
   –  Both are contractions in max norm
   –  PI is shockingly fast in practice (why?)
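
To make the VI/PI comparison concrete, here is a small generic Python sketch of both algorithms (not from the slides). It assumes tabular inputs: P with shape (A, S, S) where P[a, s, s'] = P(s'|s, a), and R with shape (S, A); with γ < 1 both routines behave as described above.

import numpy as np

def value_iteration(P, R, gamma, iters=10_000, tol=1e-9):
    """Successive approximation: V_{i+1}(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V_i(s')]."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * np.einsum('asn,n->sa', P, V)    # Q[s, a] backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:             # max-norm stopping test
            V = V_new
            break
        V = V_new
    Q = R + gamma * np.einsum('asn,n->sa', P, V)
    return V, Q.argmax(axis=1)                          # values and a greedy policy

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation with greedy improvement until the policy is stable."""
    S, A = R.shape
    pi = np.zeros(S, dtype=int)
    while True:
        P_pi = P[pi, np.arange(S), :]                   # transition matrix under pi
        R_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # value determination
        Q = R + gamma * np.einsum('asn,n->sa', P, V)
        new_pi = Q.argmax(axis=1)                       # greedy improvement
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi

In practice the PI loop terminates after only a handful of improvement steps, while VI performs many cheap sweeps — the trade-off summarized on the "Comparing VI and PI" slide.
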
Computational Complexity
•  VI and PI are both contraction mappings with rate γ (we didn't prove this for PI in class)
•  VI costs less per iteration
•  PI tends to take O(n) iterations in practice
•  Open question: a subexponential bound on PI
•  Is there a guaranteed polynomial-time MDP algorithm?

Covered Today
•  Decision Theory
•  MDPs
•  Algorithms for MDPs
   –  Value Determination
   –  Optimal Policy Selection
      •  Value Iteration
      •  Policy Iteration
      •  Linear Programming

Linear Programming Review
•  Minimize:  cᵀx
•  Subject to:  Ax ≥ b
•  Can be solved in weakly polynomial time
•  Arguably the most common and important optimization technique in history

Linear Programming
The optimal values satisfy

      V(s) = max_a [ R(s, a) + γ ∑_{s'} P(s' | s, a) V(s') ]

Issue: turn the nonlinear max into a collection of linear constraints:

      ∀ s, a :   V(s) ≥ R(s, a) + γ ∑_{s'} P(s' | s, a) V(s')

      MINIMIZE:  ∑_s V(s)

The optimal action has tight constraints.
Weakly polynomial; slower than PI in practice. (A code sketch of this formulation appears at the end of these notes.)

MDP Difficulties → RL
•  MDPs operate at the level of states
   –  States = atomic events
   –  We usually have exponentially (or infinitely) many of these
•  We assume P and R are known
•  Machine learning to the rescue!
   –  Infer P and R (implicitly or explicitly) from data
   –  Generalize from a small number of states/policies

Coming Up Next
•  Multiple agents
•  Partial observability
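
Finally, the linear-programming formulation referenced above can be written out directly. The following Python sketch is not from the slides; it uses scipy.optimize.linprog with the same tabular P and R conventions as the earlier sketches, and assumes γ < 1 so the LP is bounded.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma):
    """Minimize sum_s V(s) subject to V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s') for all s, a.

    P has shape (A, S, S); R has shape (S, A). Returns the optimal values and a greedy policy.
    """
    A, S, _ = P.shape
    c = np.ones(S)                                   # objective: minimize the sum of values
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    row = 0
    for s in range(S):
        for a in range(A):
            # V(s) >= R(s,a) + gamma * P(.|s,a) . V   rewritten as   (gamma*P - e_s) . V <= -R(s,a)
            A_ub[row] = gamma * P[a, s]
            A_ub[row, s] -= 1.0
            b_ub[row] = -R[s, a]
            row += 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))   # values may be negative
    V = res.x
    Q = R + gamma * np.einsum('asn,n->sa', P, V)
    return V, Q.argmax(axis=1)                       # tight constraints correspond to greedy actions
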