SP11 Midterm Review 6PP - CS 188: Artificial Intelligence

CS 188: Artificial Intelligence
Spring 2011 Midterm Review
3/14/2011
Pieter Abbeel – UC Berkeley

Recap Search I
- Agents that plan ahead; formalization: search
- Search problem:
  - States (configurations of the world)
  - Successor function: a function from states to lists of (state, action, cost) triples; drawn as a graph
  - Start state and goal test
- Search tree:
  - Nodes represent plans for reaching states
  - Plans have costs (sum of action costs)
- Search algorithm:
  - Systematically builds a search tree
  - Chooses an ordering of the fringe (unexplored nodes)

Recap Search II
- Tree search vs. graph search
- Priority queue to store the fringe: different priority functions give different search methods
- Uninformed search methods: depth-first search, breadth-first search, uniform-cost search
- Heuristic search methods: greedy search, A* search (heuristic design! a minimal A* sketch follows below)
  - Admissibility: h(n) <= cost of the cheapest path from n to a goal state. Ensures that when a goal node is expanded, no other partial plan on the fringe could be extended into a cheaper path to a goal state.
  - Consistency: c(n -> n') >= h(n) - h(n'). Ensures that when any node n is expanded during graph search, the partial plan ending in n is the cheapest way to reach n.
- Time and space complexity, completeness, optimality
- Iterative deepening (space complexity!)

Goal-Based Agents vs. Reflex Agents
- Reflex agent:
  - Chooses an action based on the current percept (and maybe memory)
  - May have memory or a model of the world's current state
  - Does not consider the future consequences of its actions
  - Acts on how the world IS
  - Can a reflex agent be rational?
- Goal-based agent:
  - Plans ahead; asks "what if"
  - Decisions based on (hypothesized) consequences of actions
  - Must have a model of how the world evolves in response to actions
  - Acts on how the world WOULD BE

Search Problems / Example State Space Graph
- A search problem consists of:
  - A state space
  - A successor function
  - A start state and a goal test
- A solution is a sequence of actions (a plan) which transforms the start state into a goal state
[Figure: "Ridiculously tiny search graph for a tiny search problem" (omitted)]

Search Trees / General Tree Search
- A search tree:
  - A "what if" tree of plans and their outcomes
  - Start state at the root node
  - Children correspond to successors
  - Nodes contain states and correspond to PLANS to reach those states
  - For most problems, we can never actually build the whole tree
- Important ideas: fringe, expansion, exploration strategy
- Main question: which fringe nodes to explore?

Tree Search: Extra Work!
- Failure to detect repeated states can cause exponentially more work. Why?

Graph Search
- Very simple fix: never expand a state twice
- Can this wreck completeness? Optimality?

Admissible Heuristics
- A heuristic h is admissible (optimistic) if 0 <= h(n) <= h*(n), where h*(n) is the true cost from n to a nearest goal
- Often, admissible heuristics are solutions to relaxed problems, where new actions ("some cheating") are available
- Examples for the 15-puzzle: number of misplaced tiles; sum over all misplaced tiles of the Manhattan distances to their goal positions

Trivial Heuristics, Dominance
- Dominance: h_a >= h_c if h_a(n) >= h_c(n) for all n
- Heuristics form a semi-lattice: the max of admissible heuristics is admissible
- Trivial heuristics:
  - Bottom of the lattice is the zero heuristic (what does this give us?)
  - Top of the lattice is the exact heuristic
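To make the search recap concrete, here is a minimal A* graph-search sketch in Python. It is not the course's Pacman search code; the problem interface (get_start_state, is_goal, get_successors returning (state, action, cost) triples) and the heuristic callback are assumptions for illustration.

```python
import heapq
from itertools import count

def a_star_graph_search(problem, heuristic):
    """A* graph search: pop nodes by f = g + h, never expanding a state twice."""
    tie = count()                       # tie-breaker so the heap never compares states
    start = problem.get_start_state()
    fringe = [(heuristic(start), next(tie), 0.0, start, [])]   # (f, tie, g, state, plan)
    closed = set()
    while fringe:
        _, _, g, state, plan = heapq.heappop(fringe)
        if problem.is_goal(state):
            return plan                 # cheapest plan when the heuristic is consistent
        if state in closed:
            continue
        closed.add(state)
        for next_state, action, cost in problem.get_successors(state):
            if next_state not in closed:
                g2 = g + cost
                f2 = g2 + heuristic(next_state)
                heapq.heappush(fringe, (f2, next(tie), g2, next_state, plan + [action]))
    return None                         # no plan reaches a goal
```

Swapping the priority f = g + h for g alone gives uniform-cost search, and for h alone gives greedy search, which is exactly the "different priority functions give different search methods" point above.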
A* Heuristics: Pacman Trying to Eat All Food Pellets
- Consider a procedure that takes the distance to the closest food pellet, say at (x, y); then adds the distance from (x, y) to the food pellet closest to (x, y); and continues this way until no pellets are left, each time measuring from the last pellet. Is this heuristic admissible?
- What if we used Manhattan distance rather than distance in the maze in the above procedure?

Consistency
- Consistency: h(n) - h(n') <= c(n -> n') for every successor n' of n
- Required for A* graph search to be optimal
- Consistency implies admissibility

A* Heuristics
- A procedure that quickly finds a (perhaps suboptimal) solution to the search problem is in general not admissible as a heuristic.
- It is only admissible if it always finds the optimal solution (but then it is already solving the problem we care about, hence not that interesting as a heuristic).
- A procedure that quickly finds a (perhaps suboptimal) solution to a relaxed version of the search problem need not be admissible either.
- It will be admissible if it always finds the optimal solution to the relaxed problem.

Recap CSPs
- CSPs are a special kind of search problem:
  - States are defined by the values of a fixed set of variables
  - The goal test is defined by constraints on variable values
- Backtracking = depth-first search (why? tree or graph search?) with
  - Branching on only one variable per layer of the search tree
  - Incremental constraint checks ("fail fast")
- Heuristics at our points of choice to improve running time:
  - Ordering of variables: minimum remaining values and degree heuristic
  - Ordering of values: least constraining value
- Filtering: forward checking, arc consistency; computation of heuristics
- Structure: disconnected and tree-structured CSPs are efficient
  - A non-tree-structured CSP can become tree-structured after some variables have been assigned values
- Iterative improvement: min-conflicts is usually effective in practice

Example: Map-Coloring
- Variables: WA, NT, SA, Q, NSW, V, ...
- Domain: the set of available colors
- Constraints: adjacent regions must have different colors
  - Implicit: e.g. WA != NT
  - Explicit: a list of the allowed value pairs
- Solutions are assignments satisfying all constraints

Consistency of an Arc
- An arc X -> Y is consistent iff for every x in the tail there is some y in the head which could be assigned without violating a constraint
- To enforce consistency: delete from the tail!
- If X loses a value, the neighbors of X need to be rechecked
- Arc consistency detects failure earlier than forward checking, but is more work
- Can be run as a preprocessor or after each assignment
- Forward checking = enforcing consistency of each arc pointing to the new assignment

Tree-Structured CSPs
- Theorem: if the constraint graph has no loops, the CSP can be solved in O(n d^2) time
- Compare to general CSPs, where the worst-case time is O(d^n)
- This property also applies to probabilistic reasoning (later): an important example of the relation between syntactic restrictions and the complexity of reasoning

Nearly Tree-Structured CSPs
- Conditioning: instantiate a variable, prune its neighbors' domains
- Cutset conditioning: instantiate (in all ways) a set of variables such that the remaining constraint graph is a tree
- Cutset size c gives runtime O(d^c (n - c) d^2), very fast for small c

Hill Climbing
- Simple, general idea: start wherever; always choose the best neighbor; if no neighbor has a better score than the current state, quit
- Why can this be a terrible idea?
- Complete? Optimal?
- What's good about it? (A minimal sketch follows below.)
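Below is a minimal hill-climbing sketch in Python, with random restarts as on the next slide. The neighbors, score, and random_state callbacks are problem-specific assumptions, not anything from the slides.

```python
def hill_climb(start, neighbors, score):
    """Greedy local search: move to the best neighbor until no neighbor improves the score."""
    current = start
    while True:
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current              # stuck at a local optimum (or plateau)
        current = best

def hill_climb_with_restarts(random_state, neighbors, score, restarts=20):
    """Random restarts are a cheap way to escape bad local optima."""
    results = (hill_climb(random_state(), neighbors, score) for _ in range(restarts))
    return max(results, key=score)
```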
Hill Climbing Diagram
[Figure: hill-climbing objective landscape (omitted)]
- Random restarts?
- Random sideways steps?

Recap Games
- Want algorithms for calculating a strategy (policy) which recommends a move in each state
- Deterministic zero-sum games:
  - Minimax
  - Alpha-beta pruning: speed-up of up to O(b^d) -> O(b^(d/2)); exact for the root (lower nodes could be approximate); a code sketch follows the game slides below
  - Speed-up (suboptimal): limited depth and evaluation functions
  - Iterative deepening (can help alpha-beta through move ordering!)
- Stochastic games: expectimax
- Non-zero-sum games

Minimax Properties
- Optimal against a perfect player. Otherwise?
- Time complexity? O(b^m)
- Space complexity? O(bm)
- For chess, b ≈ 35, m ≈ 100: an exact solution is completely infeasible
- But do we need to explore the whole tree?
[Figure: small minimax tree with max and min layers (omitted)]

Pruning
[Figure: game tree illustrating which branches alpha-beta can prune (omitted)]

Evaluation Functions
- With depth-limited search:
  - A partial plan is returned
  - Only the first move of the partial plan is executed
  - When it is again the maximizer's turn, run a depth-limited search again and repeat
- How deep to search?

Expectimax
[Figure: expectimax tree with chance nodes and leaf utilities (omitted)]

Stochastic Two-Player
- E.g. backgammon
- Expectiminimax(!), similar to minimax:
  - The environment is an extra "player" that moves after each agent
  - Chance nodes take expectations; otherwise like minimax

Non-Zero-Sum Utilities
- Terminals have utility tuples
- Node values are also utility tuples
- Each player maximizes its own utility and propagates (backs up) values from its children
- Can give rise to cooperation and competition dynamically...
[Figure: three-player game tree with utility triples (omitted)]
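As a companion to the games recap, here is a depth-limited minimax with alpha-beta pruning in Python. The two-player game interface (is_terminal, evaluate, get_legal_actions, generate_successor) is an assumption for illustration, not the Pacman project API.

```python
def alpha_beta(state, depth, alpha, beta, agent, game):
    """Depth-limited minimax with alpha-beta pruning; agent 0 maximizes, agent 1 minimizes."""
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)           # evaluation function at the depth limit
    if agent == 0:                            # maximizer
        value = float('-inf')
        for action in game.get_legal_actions(state, 0):
            child = game.generate_successor(state, 0, action)
            value = max(value, alpha_beta(child, depth, alpha, beta, 1, game))
            if value > beta:                  # the minimizer above will never allow this
                return value
            alpha = max(alpha, value)
        return value
    else:                                     # minimizer; depth drops after both agents move
        value = float('inf')
        for action in game.get_legal_actions(state, 1):
            child = game.generate_successor(state, 1, action)
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, 0, game))
            if value < alpha:                 # the maximizer above will never allow this
                return value
            beta = min(beta, value)
        return value

# Typical top-level call: best value for the maximizer from the current position.
# value = alpha_beta(start_state, 4, float('-inf'), float('inf'), 0, game)
```

With good move ordering (for example from iterative deepening), pruning approaches the O(b^(d/2)) best case quoted on the Recap Games slide.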
Recap MDPs and RL
- Markov decision processes (MDPs):
  - Formalism (S, A, T, R, γ)
  - Solution: a policy π which prescribes an action for each state
  - Value iteration (vs. expectimax: VI is more efficient through dynamic programming)
  - Policy evaluation and policy iteration
- Reinforcement learning (we don't know T and R):
  - Model-based learning: estimate T and R first
  - Model-free learning: learn without estimating T or R
    - Direct evaluation [performs policy evaluation]
    - Temporal-difference learning [performs policy evaluation]
    - Q-learning [learns the optimal state-action value function Q*]
    - Policy search [learns the optimal policy from a subset of all policies]
  - Exploration
  - Function approximation: generalization

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s'): the probability that a from s leads to s', i.e. P(s' | s, a); also called the model
  - A reward function R(s, a, s') (sometimes just R(s) or R(s'))
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of non-deterministic search problems
- Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means:
  P(S_{t+1} = s' | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, ..., S_0 = s_0) = P(S_{t+1} = s' | S_t = s_t, A_t = a_t)
- We can make this happen by proper choice of the state space

Value Iteration
- V_i*(s): the expected discounted sum of rewards accumulated when starting from state s and acting optimally for a horizon of i time steps
- Value iteration:
  - Start with V_0*(s) = 0, which we know is right (why?)
  - Given V_i*, calculate the values for all states for horizon i+1:
    V_{i+1}*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i*(s') ]
  - This is called a value update or Bellman update
  - Repeat until convergence
- Theorem: will converge to unique optimal values
  - Basic idea: the approximations get refined towards the optimal values
  - The policy may converge long before the values do
- At convergence, we have found the optimal value function V* for the discounted infinite-horizon problem, which satisfies the Bellman equations:
  V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Complete Procedure
- 1. Run value iteration (off-line). This returns V, which (assuming sufficiently many iterations) is a good approximation of V*.
- 2. The agent acts. At time t the agent is in state s_t and takes the action
  a_t = argmax_a Σ_{s'} T(s_t, a, s') [ R(s_t, a, s') + γ V(s') ]

Policy Iteration
- Policy evaluation: with the current policy π fixed, find the values using simplified Bellman updates:
  V_{i+1}^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_i^π(s') ]
  - Iterate until the values converge
- Policy improvement: with the utilities fixed, find the best action according to a one-step look-ahead:
  π'(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^π(s') ]
- Will converge (the policy stops changing), and the resulting policy is optimal

Sample-Based Policy Evaluation?
- Who needs T and R? Approximate the expectation with samples (drawn from T!)
- Almost! (i) We will only be in state s once and then land in some s', hence we have only one sample; do we have to keep all samples around? (ii) Where do we get the value for s'?

Temporal-Difference Learning
- Big idea: learn from every experience!
  - Update V(s) each time we experience a transition (s, a, s', r)
  - Likely outcomes s' will contribute updates more often
- Temporal difference learning:
  - The policy is still fixed!
  - Move values toward the value of whatever successor occurs: a running average!
  - Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
  - Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample
  - Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))

Exponential Moving Average
- Exponential moving average: x̄_n = (1 - α) x̄_{n-1} + α x_n
- Makes recent samples more important
- Forgets about the past (distant-past values were wrong anyway)
- Easy to compute from the running average
- A decreasing learning rate can give converging averages

Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
  - Start with V_0(s) = 0, which we know is right (why?)
  - Given V_i, calculate the values for all states for depth i+1
- But Q-values are more useful!
  - Start with Q_0(s, a) = 0, which we know is right (why?)
  - Given Q_i, calculate the q-values for all q-states for depth i+1:
    Q_{i+1}(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_i(s', a') ]

Q-Learning
- Learn Q*(s, a) values (a code sketch follows below)
  - Receive a sample (s, a, s', r)
  - Consider your new sample estimate: sample = R(s, a, s') + γ max_{a'} Q(s', a')
  - Incorporate the new estimate into a running average: Q(s, a) ← (1 - α) Q(s, a) + α · sample
- Amazing result: Q-learning converges to the optimal policy
  - If you explore enough
  - If you make the learning rate small enough, but don't decrease it too quickly!
- Neat property: off-policy learning; learn the optimal policy without following it

Exploration Functions
- Simplest: random actions (ε-greedy)
  - Every time step, flip a coin: with probability ε act randomly, with probability 1 - ε act according to the current policy
  - Problems with random actions? You do explore the space, but you keep thrashing around once learning is done
  - One solution: lower ε over time
- Exploration functions
  - Explore areas whose badness is not (yet) established
  - Take a value estimate and a count, and return an optimistic utility, e.g. f(u, n) = u + k/n (the exact form is not important)
  - The Q-value update
    Q_{i+1}(s, a) ← (1 - α) Q_i(s, a) + α [ R(s, a, s') + γ max_{a'} Q_i(s', a') ]
    now becomes
    Q_{i+1}(s, a) ← (1 - α) Q_i(s, a) + α [ R(s, a, s') + γ max_{a'} f(Q_i(s', a'), N(s', a')) ]
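Here is a minimal tabular Q-learner in Python implementing the update and ε-greedy exploration from the slides above; the action set and hyperparameter values are illustrative assumptions, and the surrounding environment loop is left out.

```python
import random
from collections import defaultdict

class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.05):
        self.q = defaultdict(float)           # Q(s, a), initialized to 0
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose_action(self, s):
        """Epsilon-greedy: act randomly with probability epsilon, otherwise follow current Q."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, s_next, r):
        """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
        sample = r + self.gamma * max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] = (1 - self.alpha) * self.q[(s, a)] + self.alpha * sample
```

Replacing max_a' Q(s', a') in the sample with max_a' f(Q(s', a'), N(s', a')) for a visit counter N gives the exploration-function variant shown in the last equation above.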
Feature-Based Representations
- Solution: describe a state using a vector of features
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  - Example features: distance to the closest ghost, distance to the closest dot, number of ghosts, 1 / (distance to dot)^2, is Pacman in a tunnel? (0/1), ... etc.
  - Can also describe a q-state (s, a) with features (e.g. "this action moves closer to food")

Linear Feature Functions
- Using a feature representation, we can write a q-function (or value function) for any state using a few weights:
  V(s) = w_1 f_1(s) + ... + w_n f_n(s)
  Q(s, a) = w_1 f_1(s, a) + ... + w_n f_n(s, a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but be very different in value!

Overfitting
[Figure: a degree-15 polynomial fit to a handful of data points, illustrating overfitting (omitted)]

Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
- Solution: learn the policy that maximizes rewards rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter
- Simplest policy search:
  - Start with an initial linear value function or Q-function
  - Nudge each feature weight up and down and see if your policy is better than before
- Problems:
  - How do we tell the policy got better?
  - We need to run many sample episodes!
  - If there are a lot of features, this can be impractical

Part II: Probabilistic Reasoning
- Probability: random variables; joint and marginal distributions; conditional distributions; inference by enumeration; product rule, chain rule, Bayes' rule; independence
- Distributions over LARGE numbers of random variables: representation; inference [not yet covered for large numbers of random variables]

Probability Recap
- Conditional probability: P(x | y) = P(x, y) / P(y)
- Product rule: P(x, y) = P(x | y) P(y)
- Chain rule: P(x_1, x_2, ..., x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... = Π_i P(x_i | x_1, ..., x_{i-1})
- X, Y independent iff: ∀x, y: P(x, y) = P(x) P(y)
  - equivalently, iff: ∀x, y: P(x | y) = P(x)
  - equivalently, iff: ∀x, y: P(y | x) = P(y)
- X and Y are conditionally independent given Z iff: ∀x, y, z: P(x, y | z) = P(x | z) P(y | z)
  - equivalently, iff: ∀x, y, z: P(x | y, z) = P(x | z)
  - equivalently, iff: ∀x, y, z: P(y | x, z) = P(y | z)

Inference by Enumeration
- P(sun)? P(sun | winter)? P(sun | winter, hot)? (A small worked sketch follows below.)

  S        T     W     P
  summer   hot   sun   0.30
  summer   hot   rain  0.05
  summer   cold  sun   0.10
  summer   cold  rain  0.05
  winter   hot   sun   0.10
  winter   hot   rain  0.05
  winter   cold  sun   0.15
  winter   cold  rain  0.20
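The sketch below runs inference by enumeration over exactly the joint table above (plain Python; the prob helper is an illustrative utility, not course code).

```python
# Joint distribution P(S, T, W) from the table above.
joint = {
    ('summer', 'hot',  'sun'):  0.30, ('summer', 'hot',  'rain'): 0.05,
    ('summer', 'cold', 'sun'):  0.10, ('summer', 'cold', 'rain'): 0.05,
    ('winter', 'hot',  'sun'):  0.10, ('winter', 'hot',  'rain'): 0.05,
    ('winter', 'cold', 'sun'):  0.15, ('winter', 'cold', 'rain'): 0.20,
}
VARS = ('S', 'T', 'W')

def prob(query, evidence=None):
    """P(query | evidence): sum the consistent rows, then normalize by the evidence."""
    evidence = evidence or {}
    def consistent(row, assignment):
        return all(row[VARS.index(v)] == val for v, val in assignment.items())
    numerator = sum(p for row, p in joint.items() if consistent(row, {**query, **evidence}))
    denominator = sum(p for row, p in joint.items() if consistent(row, evidence))
    return numerator / denominator

print(prob({'W': 'sun'}))                               # P(sun) = 0.65
print(prob({'W': 'sun'}, {'S': 'winter'}))              # P(sun | winter) = 0.25 / 0.50 = 0.5
print(prob({'W': 'sun'}, {'S': 'winter', 'T': 'hot'}))  # P(sun | winter, hot) = 0.10 / 0.15 ≈ 0.67
```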
Chain Rule → Bayes' Net
- Chain rule: we can always write any joint distribution as an incremental product of conditional distributions:
  P(x_1, x_2, ..., x_n) = Π_i P(x_i | x_1, ..., x_{i-1})
- Bayes' nets: make conditional independence assumptions of the form
  P(x_i | x_1, ..., x_{i-1}) = P(x_i | parents(X_i)),
  giving us:
  P(x_1, x_2, ..., x_n) = Π_i P(x_i | parents(X_i))

Probabilities in BNs
- Bayes' nets implicitly encode joint distributions as a product of local conditional distributions
- To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together
- This lets us reconstruct any entry of the full joint
- Not every BN can represent every joint distribution: the topology enforces certain conditional independencies

Example: Alarm Network
- Structure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls
- CPTs (a worked joint-entry example follows below):

  P(B)              P(E)
  +b   0.001        +e   0.002
  ¬b   0.999        ¬e   0.998

  P(A | B, E)
  +b  +e:  +a  0.95     ¬a  0.05
  +b  ¬e:  +a  0.94     ¬a  0.06
  ¬b  +e:  +a  0.29     ¬a  0.71
  ¬b  ¬e:  +a  0.001    ¬a  0.999

  P(J | A)              P(M | A)
  +a  +j  0.9           +a  +m  0.7
  +a  ¬j  0.1           +a  ¬m  0.3
  ¬a  +j  0.05          ¬a  +m  0.01
  ¬a  ¬j  0.95          ¬a  ¬m  0.99
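To close the loop on "multiply all the relevant conditionals together", here is a short Python sketch that reads one full-joint entry off the alarm network's CPTs above; the particular query is just an illustration.

```python
# CPTs from the alarm-network slide; True/False stand for + / not.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,     # P(+a | b, e)
                (False, True): 0.29, (False, False): 0.001}
P_J_given_A = {True: 0.9, False: 0.05}                       # P(+j | a)
P_M_given_A = {True: 0.7, False: 0.01}                       # P(+m | a)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = P(b) P(e) P(a | b, e) P(j | a) P(m | a)."""
    p_a = P_A_given_BE[(b, e)] if a else 1 - P_A_given_BE[(b, e)]
    p_j = P_J_given_A[a] if j else 1 - P_J_given_A[a]
    p_m = P_M_given_A[a] if m else 1 - P_M_given_A[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

# P(+b, not e, +a, +j, +m) = 0.001 * 0.998 * 0.94 * 0.9 * 0.7
print(joint(True, False, True, True, True))   # approximately 0.000591
```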