CS 188: Artificial Intelligence
Spring 2011
Midterm Review
3/14/2011
Pieter Abbeel - UC Berkeley

Recap: Search I
- Agents that plan ahead; formalization: search
- Search problem:
  - States (configurations of the world)
  - Successor function: a function from states to lists of (state, action, cost) triples; drawn as a graph
  - Start state and goal test
- Search tree:
  - Nodes represent plans for reaching states
  - Plans have costs (sum of action costs)
- Search algorithm:
  - Systematically builds a search tree
  - Chooses an ordering of the fringe (unexplored nodes)

Recap: Search II
- Tree search vs. graph search
- Priority queue to store the fringe: different priority functions give different search methods
- Uninformed search methods: depth-first search, breadth-first search, uniform-cost search
- Heuristic search methods: greedy search, A* search -- heuristic design!
  - Admissibility: h(n) <= cost of the cheapest path from n to a goal state. Ensures that when a goal node is expanded, no other partial plan on the fringe could be extended into a cheaper path to a goal state.
  - Consistency: c(n -> n') >= h(n) - h(n'). Ensures that when any node n is expanded during graph search, the partial plan that ended in n is the cheapest way to reach n.
- Time and space complexity, completeness, optimality
- Iterative deepening (space complexity!)

Search Problems: Reflex Agents vs. Goal-Based Agents
- Reflex agents:
  - Choose an action based on the current percept (and maybe memory)
  - May have memory or a model of the world's current state
  - Do not consider the future consequences of their actions
  - Act on how the world IS
- Goal-based agents:
  - Plan ahead; ask "what if"
  - Decisions based on (hypothesized) consequences of actions
  - Must have a model of how the world evolves in response to actions
  - Act on how the world WOULD BE
- Can a reflex agent be rational?

Example: State Space Graph
- A search problem consists of:
  - A state space
  - A successor function
  - A start state and a goal test
- A solution is a sequence of actions (a plan) which transforms the start state to a goal state
[Figure: a "ridiculously tiny search graph for a tiny search problem": states such as START, a, b, c, d, e, f, h, p, q, r, and GOAL, with labeled edge costs]

Search Trees
- A search tree is a "what if" tree of plans and outcomes
- Start state at the root node; children correspond to successors (e.g. actions "N, 1.0" and "E, 1.0" in the Pac-Man example)
- Nodes contain states, but correspond to PLANS to reach those states
- For most problems, we can never actually build the whole tree

General Tree Search
- Important ideas: fringe, expansion, exploration strategy
- Main question: which fringe nodes to explore?

Tree Search: Extra Work!
- Failure to detect repeated states can cause exponentially more work. Why?

Graph Search
- Very simple fix: never expand a state twice
- Can this wreck completeness? Optimality?

Admissible Heuristics
- A heuristic h is admissible (optimistic) if 0 <= h(n) <= h*(n), where h*(n) is the true cost from n to a nearest goal
- Often, admissible heuristics are solutions to relaxed problems, with new actions ("some cheating") available
- Examples (sliding-tile puzzle):
  - Number of misplaced tiles
  - Sum over all misplaced tiles of the Manhattan distances to their goal positions

Trivial Heuristics, Dominance
- Dominance: h_a >= h_c if for all n: h_a(n) >= h_c(n)
- Heuristics form a semi-lattice: the max of admissible heuristics is admissible
- Trivial heuristics:
  - Bottom of the lattice is the zero heuristic (what does this give us?)
  - Top of the lattice is the exact heuristic
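The graph-search and admissible-heuristic recap above can be made concrete with a short sketch. This is not from the slides: the grid, the wall positions, and the function names (astar, neighbors) are hypothetical, with Manhattan distance serving as the admissible, consistent heuristic:

```python
import heapq

def astar(start, goal, neighbors, h):
    """A* graph search: never expands a state twice (closed set).

    With an admissible, consistent heuristic h, the first time the goal
    is popped from the fringe, the plan found is optimal."""
    fringe = [(h(start), 0, start, [start])]   # (f = g + h, g, state, plan)
    closed = set()
    while fringe:
        f, g, s, plan = heapq.heappop(fringe)
        if s == goal:
            return plan, g
        if s in closed:                        # graph search: skip repeated states
            continue
        closed.add(s)
        for s2, cost in neighbors(s):
            if s2 not in closed:
                heapq.heappush(fringe, (g + cost + h(s2), g + cost, s2, plan + [s2]))
    return None, float("inf")

# Hypothetical 4x4 grid with unit step costs and two wall cells.
walls = {(1, 1), (1, 2)}

def neighbors(s):
    x, y = s
    for x2, y2 in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= x2 < 4 and 0 <= y2 < 4 and (x2, y2) not in walls:
            yield (x2, y2), 1

goal = (3, 3)
manhattan = lambda s: abs(s[0] - goal[0]) + abs(s[1] - goal[1])  # never overestimates

plan, cost = astar((0, 0), goal, neighbors, manhattan)
```

Because Manhattan distance is consistent on a unit-cost grid, the graph-search fix ("never expand a state twice") does not cost us optimality here.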
Consistency
- Consistency: h(n) - h(n') <= c(n -> n') for each arc
- Required for A* graph search to be optimal
- Consistency implies admissibility

A* Heuristics: Pac-Man Trying to Eat All the Food Pellets
- Consider an algorithm that takes the distance to the closest food pellet, say at (x, y). It then adds the distance between (x, y) and the food pellet closest to (x, y), and continues this process until no pellets are left, each time adding the distance from the last pellet. Is this heuristic admissible?
- What if we used the Manhattan distance rather than the distance in the maze in the above procedure?
- A particular procedure that quickly finds a perhaps suboptimal solution to the search problem is in general not admissible. It is only admissible if it always finds the optimal solution (but then it is already solving the problem we care about, hence not that interesting as a heuristic).
- A particular procedure that quickly finds a perhaps suboptimal solution to a relaxed version of the search problem need not be admissible either. It will be admissible if it always finds the optimal solution to the relaxed problem.

Recap: CSPs
- CSPs are a special kind of search problem:
  - States defined by values of a fixed set of variables
  - Goal test defined by constraints on the variable values
- Backtracking = depth-first search (why? tree or graph search?) with:
  - Branching on only one variable per layer in the search tree
  - Incremental constraint checks ("fail fast")
- Heuristics at our points of choice to improve running time:
  - Ordering variables: minimum remaining values and the degree heuristic
  - Ordering values: least constraining value
- Filtering: forward checking, arc consistency
- Structure:
  - Disconnected and tree-structured CSPs are efficient
  - Non-tree-structured CSPs can become tree-structured after some variables have been assigned values
- Iterative improvement: min-conflicts is usually effective in practice
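The backtracking recipe above (one variable per layer, MRV ordering, forward checking) can be sketched in a few lines. This is an illustrative sketch, not the course's project code; the instance below is a hypothetical four-region fragment of the map-coloring problem:

```python
def backtrack(assignment, domains, constraints):
    """Depth-first backtracking: branch on one variable per layer, fail fast."""
    if len(assignment) == len(domains):
        return assignment
    # Minimum Remaining Values: branch on the tightest unassigned variable.
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for value in domains[var]:
        # Forward checking: prune unassigned neighbors; fail on a domain wipeout.
        pruned = {n: [x for x in domains[n] if x != value]
                  for n in constraints[var] if n not in assignment}
        if all(pruned[n] for n in pruned):
            new_domains = dict(domains, **pruned, **{var: [value]})
            result = backtrack({**assignment, var: value}, new_domains, constraints)
            if result:
                return result
    return None

# Hypothetical map-coloring fragment: adjacent regions must differ.
constraints = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
               "SA": ["WA", "NT", "Q"], "Q": ["NT", "SA"]}
domains = {v: ["red", "green", "blue"] for v in constraints}
solution = backtrack({}, domains, constraints)
```

Because forward checking keeps every domain consistent with the assignments made so far, any value still in a variable's domain is safe with respect to its already-assigned neighbors.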
Example: Map Coloring
- Variables: the regions WA, NT, SA, Q, NSW, V of the Australia map
- Domain: the set of available colors
- Constraints: adjacent regions must have different colors (implicit as a rule, or explicit as a list of allowed combinations)
- Solutions are assignments satisfying all constraints

Consistency of an Arc
- An arc X -> Y is consistent iff for every x in the tail there is some y in the head which could be assigned without violating a constraint
- To enforce it, delete from the tail!
- If X loses a value, the neighbors of X need to be rechecked!
- Arc consistency detects failure earlier than forward checking, but takes more work!
- Can be run as a preprocessor or after each assignment
- Forward checking = enforcing consistency of each arc pointing to the new assignment

Tree-Structured CSPs
- Theorem: if the constraint graph has no loops, the CSP can be solved in O(n d^2) time
- Compare to general CSPs, where the worst-case time is O(d^n)
- This property also applies to probabilistic reasoning (later): an important example of the relation between syntactic restrictions and the complexity of reasoning.
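The arc-consistency rules above (delete from the tail; recheck neighbors of a variable that lost a value) are the core of the AC-3 style procedure. A minimal sketch, with hypothetical function names and a two-variable coloring fragment as the example:

```python
from collections import deque

def revise(domains, tail, head, conflicts):
    """Make arc tail -> head consistent: delete tail values with no support."""
    removed = False
    for x in domains[tail][:]:
        if all(conflicts(x, y) for y in domains[head]):   # no consistent y in head
            domains[tail].remove(x)
            removed = True
    return removed

def ac3(domains, neighbors, conflicts):
    """Enforce consistency of every arc; recheck after each deletion."""
    queue = deque((t, h) for t in neighbors for h in neighbors[t])
    while queue:
        tail, head = queue.popleft()
        if revise(domains, tail, head, conflicts):
            if not domains[tail]:
                return False                  # wipeout: failure detected early
            for n in neighbors[tail]:
                if n != head:
                    queue.append((n, tail))   # neighbors of tail must be rechecked
    return True

# Hypothetical fragment: X and Y adjacent, Y already fixed to "red".
domains = {"X": ["red", "green"], "Y": ["red"]}
neighbors = {"X": ["Y"], "Y": ["X"]}
ok = ac3(domains, neighbors, lambda a, b: a == b)  # conflict = same color
```

Running this as a preprocessor prunes "red" from X before any search happens, which is exactly the "detects failure earlier than forward checking" behavior described above.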
Nearly Tree-Structured CSPs
- Conditioning: instantiate a variable, prune its neighbors' domains
- Cutset conditioning: instantiate (in all ways) a set of variables such that the remaining constraint graph is a tree
- A cutset of size c gives runtime O(d^c (n - c) d^2), very fast for small c

Hill Climbing
- Simple, general idea: start wherever; always choose the best neighbor; if no neighbors have better scores than the current state, quit
- Why can this be a terrible idea? Complete? Optimal?
- What's good about it?
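The hill-climbing loop described above fits in a few lines. A minimal sketch with invented names; the toy objective below (maximize -(x - 3)^2 over the integers) is purely illustrative:

```python
def hill_climb(start, neighbors, score):
    """Greedy local search: move to the best neighbor until no neighbor improves."""
    current = start
    while True:
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current            # local optimum (or plateau): quit
        current = best

# Toy example: integer line, neighbors are x-1 and x+1, single peak at x = 3.
peak = hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
```

On a multi-modal score this loop stops at whatever local optimum it first reaches, which is why the slides ask about random restarts and sideways steps: rerunning from random starting points and keeping the best result is the standard cheap fix.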
Hill Climbing Diagram
- Random restarts? Random sideways steps?

Recap: Games
- Want algorithms for calculating a strategy (policy) which recommends a move in each state
- Deterministic zero-sum games:
  - Minimax
  - Alpha-beta pruning: speedup from O(b^d) up to O(b^(d/2)); exact for the root (lower nodes could be approximate)
  - Suboptimal speedups: limited depth and evaluation functions
  - Iterative deepening (can help alpha-beta through ordering!)
- Stochastic games: expectimax
- Non-zero-sum games
Minimax Properties
- Optimal against a perfect player. Otherwise?
- Time complexity? O(b^m)
- Space complexity? O(bm)
- For chess, b is about 35 and m is about 100: an exact solution is completely infeasible
- But do we need to explore the whole tree?

Pruning
[Figure: two-ply minimax tree with max and min layers; leaf values such as 3, 12, 8, 2, 14, 5, 2 illustrate which branches alpha-beta can skip]

Evaluation Functions / Expectimax with Depth-Limited Search
- A partial plan is returned; only the first move of the partial plan is executed
- When it is again the maximizer's turn, run a depth-limited search again and repeat
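The pruning idea in the figure can be sketched as depth-limited minimax with alpha-beta cutoffs. This is an illustrative sketch, not project code; the two-ply toy tree below echoes the leaf values in the figure:

```python
def alphabeta(state, depth, alpha, beta, is_max, successors, evaluate):
    """Depth-limited minimax with alpha-beta pruning."""
    children = successors(state)
    if depth == 0 or not children:
        return evaluate(state)
    if is_max:
        v = float("-inf")
        for c in children:
            v = max(v, alphabeta(c, depth - 1, alpha, beta, False, successors, evaluate))
            alpha = max(alpha, v)
            if alpha >= beta:
                break                  # cutoff: min player would never allow this branch
        return v
    v = float("inf")
    for c in children:
        v = min(v, alphabeta(c, depth - 1, alpha, beta, True, successors, evaluate))
        beta = min(beta, v)
        if alpha >= beta:
            break                      # cutoff for the min player
    return v

# Toy two-ply game: max root over two min nodes whose children are leaf values.
tree = {"root": ["L", "R"], "L": [3, 12, 8], "R": [2, 14, 5]}
value = alphabeta("root", 2, float("-inf"), float("inf"), True,
                  lambda s: tree.get(s, []),
                  lambda s: s if isinstance(s, int) else 0)
```

In this run the right min node is cut off after its first leaf (2), since the max player already has 3 guaranteed on the left; that is exactly the pruning the figure illustrates.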
- How deep to search?
[Figure: depth-limited expectimax tree with leaf values 3, 12, 9, 2, 4, 6, 15, 6, 0]

Stochastic Two-Player Games
- E.g. backgammon
- Expectiminimax(!), similar to minimax:
  - The environment is an extra player that moves after each agent
  - Chance nodes take expectations; otherwise it is like minimax

Non-Zero-Sum Utilities
- Terminals have utility tuples
- Node values are also utility tuples
- Each player maximizes its own utility and propagates (backs up) values from its children
- Can give rise to cooperation and competition dynamically...
[Figure: three-player game tree with leaf utility tuples (1,6,6), (7,1,2), (6,1,2), (7,2,1), (5,1,7), (1,5,2), (7,7,1), (5,2,5)]
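The expectimax computation behind the stochastic-game discussion above can be sketched as a max layer over chance nodes. The tree, probabilities, and names below are hypothetical (the leaf values borrow from the depth-limited figure):

```python
def expectimax(node, is_max, children, prob, evaluate):
    """Max at our turn; expectation over outcomes at chance nodes."""
    kids = children(node)
    if not kids:
        return evaluate(node)
    vals = [expectimax(c, not is_max, children, prob, evaluate) for c in kids]
    if is_max:
        return max(vals)
    return sum(p * v for p, v in zip(prob(node), vals))  # chance node: expected value

# Toy tree: a max root over two chance nodes with uniform outcome probabilities.
tree = {"root": ["A", "B"], "A": [3, 12], "B": [9, 2]}
probs = {"A": [0.5, 0.5], "B": [0.5, 0.5]}
value = expectimax("root", True, lambda n: tree.get(n, []),
                   lambda n: probs[n], lambda n: n)
```

Here the root prefers the chance node with the higher expected value (7.5 over 5.5); note that, unlike minimax, expectimax values depend on magnitudes, so pruning is much more limited.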
Recap: MDPs and RL
- Markov decision processes (MDPs):
  - Formalism: (S, A, T, R, gamma)
  - Solution: a policy pi which gives an action for each state
  - Value iteration (vs. expectimax; VI is more efficient through dynamic programming)
  - Policy evaluation and policy iteration
- Reinforcement learning (we don't know T and R):
  - Model-based learning: estimate T and R first
  - Model-free learning: learn without estimating T or R
    - Direct evaluation [performs policy evaluation]
    - Temporal difference learning [performs policy evaluation]
    - Q-learning [learns the optimal state-action value function Q*]
    - Policy search [learns the optimal policy from a subset of all policies]
  - Exploration
  - Function approximation -> generalization

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s'): the probability that a from s leads to s', i.e. P(s' | s, a); also called the model
  - A reward function R(s, a, s'); sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs are a family of non-deterministic search problems
- Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
- "Markov" generally means that given the present state, the future and the past are independent
- For Markov decision processes, "Markov" means:
  P(s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0) = P(s_{t+1} = s' | s_t, a_t)
- Can make this happen by proper choice of the state space

Value Iteration
- V*_i(s): the expected discounted sum of rewards accumulated when starting from state s and acting optimally for a horizon of i time steps
- Value iteration:
  - Start with V*_0(s) = 0, which we know is right (why?)
  - Given V*_i, calculate the values for all states for horizon i+1:
    V*_{i+1}(s) = max_a sum_{s'} T(s, a, s') [ R(s, a, s') + gamma V*_i(s') ]
  - This is called a value update or Bellman update
  - Repeat until convergence
- Theorem: will converge to unique optimal values
  - Basic idea: the approximations get refined towards the optimal values
  - The policy may converge long before the values do
- At convergence, we have found the optimal value function V* for the discounted infinite-horizon problem, which satisfies the Bellman equations:
  V*(s) = max_a sum_{s'} T(s, a, s') [ R(s, a, s') + gamma V*(s') ]
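The Bellman updates above translate almost line for line into code. A minimal sketch with hypothetical names; the two-state chain MDP at the bottom is invented for illustration (staying in "b" pays 1 per step, so with gamma = 0.5 its value is 2):

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-9):
    """Repeat V_{i+1}(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V_i(s')]."""
    V = {s: 0.0 for s in states}               # V_0 = 0: exact for a horizon of 0
    while True:
        V2 = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                         for s2, p in T(s, a).items())
                     for a in actions(s))
              for s in states}
        if max(abs(V2[s] - V[s]) for s in states) < eps:
            return V2                          # converged to (near) fixed point
        V = V2

# Hypothetical chain: every action leads to "b"; being in "b" yields reward 1.
states = ["a", "b"]
actions = lambda s: ["go"]
T = lambda s, a: {"b": 1.0}
R = lambda s, a, s2: 1.0 if s == "b" else 0.0
V = value_iteration(states, actions, T, R, gamma=0.5)
```

The fixed point satisfies V(b) = 1 + 0.5 V(b) = 2 and V(a) = 0 + 0.5 V(b) = 1, matching the Bellman equations stated above.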
Complete Procedure
1. Run value iteration (offline). This returns V, which (assuming sufficiently many iterations) is a good approximation of V*.
2. The agent acts. At time t the agent is in state s_t and takes the action:
   a_t = argmax_a sum_{s'} T(s_t, a, s') [ R(s_t, a, s') + gamma V(s') ]

Policy Iteration
- Policy evaluation: with the current policy pi fixed, find the values with simplified Bellman updates; iterate until the values converge:
  V^pi_{i+1}(s) = sum_{s'} T(s, pi(s), s') [ R(s, pi(s), s') + gamma V^pi_i(s') ]
- Policy improvement: with the utilities fixed, find the best action according to a one-step lookahead
- Will converge (the policy will stop changing), and the resulting policy is optimal
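The evaluate/improve alternation above can be sketched directly. This is an illustrative sketch under invented names; the one-state MDP at the bottom ("right" pays 1 per step, "left" pays 0, gamma = 0.9) is hypothetical:

```python
def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation (simplified Bellman updates) and improvement."""
    pi = {s: actions(s)[0] for s in states}
    while True:
        # Policy evaluation: fixed pi, iterate the simplified updates.
        V = {s: 0.0 for s in states}
        for _ in range(200):
            V = {s: sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                        for s2, p in T(s, pi[s]).items()) for s in states}
        # Policy improvement: one-step lookahead with the utilities fixed.
        q = lambda s, a: sum(p * (R(s, a, s2) + gamma * V[s2])
                             for s2, p in T(s, a).items())
        new_pi = {s: max(actions(s), key=lambda a: q(s, a)) for s in states}
        if new_pi == pi:
            return pi, V               # policy stable: optimal
        pi = new_pi

# Hypothetical single-state MDP: "right" yields reward 1, "left" yields 0.
states = ["s"]
actions = lambda s: ["left", "right"]
T = lambda s, a: {"s": 1.0}
R = lambda s, a, s2: 1.0 if a == "right" else 0.0
pi, V = policy_iteration(states, actions, T, R, gamma=0.9)
```

Even though the initial policy ("left") evaluates to 0 everywhere, one improvement step switches to "right", and the next evaluation yields V = 1 / (1 - 0.9) = 10, after which the policy no longer changes.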
Sample-Based Policy Evaluation?
- Who needs T and R? Approximate the expectation with samples (drawn from T!)
- Almost works! (i) We will only be in state s once before landing in s', hence we have only one sample; do we have to keep all the samples around? (ii) Where do we get the value for s'?
[Figure: state s with policy action pi(s), and sampled successor states s1', s2', s3']

Temporal-Difference Learning
- Big idea: learn from every experience!
  - Update V(s) each time we experience (s, a, s', r)
  - Likely successors s' will contribute updates more often
- The policy is still fixed!
- Move values toward the value of whatever successor occurs: a running average!
  - Sample of V(s): sample = R(s, pi(s), s') + gamma V^pi(s')
  - Update to V(s): V^pi(s) <- (1 - alpha) V^pi(s) + alpha * sample
  - Same update: V^pi(s) <- V^pi(s) + alpha (sample - V^pi(s))
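The TD update above is a one-liner in code. A minimal sketch with invented names; the repeated transition at the bottom (state "s" to a terminal state with reward 1) is a made-up stream of experience:

```python
def td_update(V, s, r, s2, alpha, gamma):
    """V(s) <- V(s) + alpha * (sample - V(s)), with sample = r + gamma V(s2)."""
    sample = r + gamma * V[s2]
    V[s] = V[s] + alpha * (sample - V[s])
    return V[s]

# Running-average behavior: the same transition experienced over and over.
V = {"s": 0.0, "end": 0.0}
for _ in range(100):
    td_update(V, "s", 1.0, "end", alpha=0.5, gamma=1.0)
```

Each update halves the remaining error (alpha = 0.5), so V("s") approaches the sample value 1.0 exponentially fast; this is exactly the exponential-moving-average behavior discussed next.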
Exponential Moving Average
- Running update: xbar_n = (1 - alpha) xbar_{n-1} + alpha x_n
- Makes recent samples more important
- Forgets about the past (distant past values were wrong anyway)
- Easy to compute from the running average
- A decreasing learning rate can give converging averages

Detour: Q-Value Iteration
- Value iteration: find successive approximations of the optimal values
  - Start with V_0(s) = 0, which we know is right (why?)
  - Given V_i, calculate the values for all states for depth i+1
- But Q-values are more useful!
  - Start with Q_0(s, a) = 0, which we know is right (why?)
  - Given Q_i, calculate the Q-values for all Q-states for depth i+1:
    Q_{i+1}(s, a) = sum_{s'} T(s, a, s') [ R(s, a, s') + gamma max_{a'} Q_i(s', a') ]
Q-Learning
- Learn Q*(s, a) values:
  - Receive a sample (s, a, s', r)
  - Consider your new sample estimate: sample = r + gamma max_{a'} Q(s', a')
  - Incorporate the new estimate into a running average: Q(s, a) <- (1 - alpha) Q(s, a) + alpha * sample
- Amazing result: Q-learning converges to the optimal policy
  - If you explore enough
  - If you make the learning rate small enough... but do not decrease it too quickly!
- Neat property: off-policy learning; we learn the optimal policy without following it

Exploration
- Simplest: random actions (epsilon-greedy)
  - Every time step, flip a coin
  - With probability epsilon, act randomly
  - With probability 1 - epsilon, act according to the current policy
- Problems with random actions? You do explore the space, but you keep thrashing around once learning is done
  - One solution: lower epsilon over time
  - Another: exploration functions

Exploration Functions
- Explore areas whose badness is not (yet) established
- Take a value estimate and a count, and return an optimistic utility, e.g. f(u, n) = u + k/n (the exact form is not important)
- The Q-learning update
  Q_{i+1}(s, a) <- (1 - alpha) Q_i(s, a) + alpha [ R(s, a, s') + gamma max_{a'} Q_i(s', a') ]
  now becomes:
  Q_{i+1}(s, a) <- (1 - alpha) Q_i(s, a) + alpha [ R(s, a, s') + gamma max_{a'} f(Q_i(s', a'), N(s', a')) ]
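Q-learning with an exploration function can be sketched deterministically (no coin flips needed). Everything below is hypothetical: the bonus f(u, n) = u + k/(n+1) is one possible optimistic form, and the one-step "bandit" at the bottom ("right" pays 1, "left" pays 0) is invented for illustration:

```python
def q_learning(Q, N, episodes, start, step, actions, alpha=0.5, gamma=1.0, k=2.0):
    """Q-learning where action choice maximizes f(Q, N) = Q + k / (N + 1):
    optimistic for rarely tried actions, fading as their counts grow."""
    for _ in range(episodes):
        s = start
        while s is not None:
            f = lambda a: Q[(s, a)] + k / (N[(s, a)] + 1.0)   # optimism bonus
            a = max(actions(s), key=f)
            N[(s, a)] += 1
            s2, r = step(s, a)                                # s2 is None at episode end
            future = 0.0 if s2 is None else max(Q[(s2, b)] for b in actions(s2))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * future)
            s = s2
    return Q

# Hypothetical one-step bandit: "right" pays 1, "left" pays 0, then the episode ends.
acts = ["left", "right"]
Q = {("s", a): 0.0 for a in acts}
N = {("s", a): 0 for a in acts}
q_learning(Q, N, 50, "s",
           lambda s, a: (None, 1.0 if a == "right" else 0.0),
           lambda s: acts)
```

The bonus makes the agent try "left" exactly once, discover its true value of 0, and then commit to "right" as its estimate climbs toward 1; that is the "explore areas whose badness is not yet established" behavior without any epsilon thrashing.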
Feature-Based Representations
- Solution: describe a state using a vector of features
  - Features are functions from states to real numbers (often 0/1) that capture important properties of the state
- Example features:
  - Distance to the closest ghost
  - Distance to the closest dot
  - Number of ghosts
  - 1 / (distance to dot)^2
  - Is Pacman in a tunnel? (0/1)
  - ... etc.
- Can also describe a Q-state (s, a) with features (e.g. "action moves closer to food")

Linear Feature Functions
- Using a feature representation, we can write a Q-function (or value function) for any state using a few weights:
  V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
  Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a)
- Advantage: our experience is summed up in a few powerful numbers
- Disadvantage: states may share features but be very different in value!

Overfitting
[Figure: data points fit by a degree-15 polynomial; the curve passes through the points but oscillates wildly between them]
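The linear Q-function above pairs naturally with a gradient-style weight update: nudge each weight in proportion to its feature and the error. A minimal sketch; the two-feature vector and the target value below are hypothetical:

```python
def q_value(weights, features):
    """Linear Q: Q(s, a) = w1 f1(s, a) + ... + wn fn(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def update_weights(weights, features, target, alpha):
    """Move each weight along its feature by the TD error (target - Q)."""
    error = target - q_value(weights, features)
    return [w + alpha * error * f for w, f in zip(weights, features)]

# Hypothetical 2-feature Q-state, repeatedly updated toward a target of 2.0.
w = [0.0, 0.0]
for _ in range(100):
    w = update_weights(w, [0.5, 1.0], target=2.0, alpha=0.2)
q = q_value(w, [0.5, 1.0])
```

Because the error shrinks by a constant factor each step, the predicted Q converges to the target; with many states sharing features, the same few weights generalize (for better or worse, as the overfitting figure warns).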
Policy Search
- Problem: often the feature-based policies that work well aren't the ones that approximate V / Q best
- Solution: learn the policy that maximizes rewards, rather than the value that predicts rewards
- This is the idea behind policy search, such as what controlled the upside-down helicopter
- Simplest policy search:
  - Start with an initial linear value function or Q-function
  - Nudge each feature weight up and down and see if your policy is better than before
- Problems:
  - How do we tell the policy got better? We need to run many sample episodes!
  - If there are a lot of features, this can be impractical

Part II: Probabilistic Reasoning
- Probability:
  - Random variables
  - Joint and marginal distributions
  - Conditional distributions
  - Inference by enumeration
  - Product rule, chain rule, Bayes' rule
  - Independence
- Distributions over LARGE numbers of random variables:
  - Representation
  - Inference [not yet covered for large numbers of random variables]

Probability Recap
- Conditional probability: P(x | y) = P(x, y) / P(y)
- Product rule: P(x, y) = P(x | y) P(y)
- Chain rule: P(x_1, ..., x_n) = prod_i P(x_i | x_1, ..., x_{i-1})
- X, Y are independent iff: ∀x, y: P(x, y) = P(x) P(y)
  - equivalently, iff: ∀x, y: P(x | y) = P(x)
  - equivalently, iff: ∀x, y: P(y | x) = P(y)
- X and Y are conditionally independent given Z iff: ∀x, y, z: P(x, y | z) = P(x | z) P(y | z)
  - equivalently, iff: ∀x, y, z: P(x | y, z) = P(x | z)
  - equivalently, iff: ∀x, y, z: P(y | x, z) = P(y | z)

Inference by Enumeration
- Queries against the full joint: P(sun)? P(sun | winter)? P(sun | winter, hot)?
  S       T     W     P
  summer  hot   sun   0.30
  summer  hot   rain  0.05
  summer  cold  sun   0.10
  summer  cold  rain  0.05
  winter  hot   sun   0.10
  winter  hot   rain  0.05
  winter  cold  sun   0.15
  winter  cold  rain  0.20

Chain Rule -> Bayes Net
- Chain rule: we can always write any joint distribution as an incremental product of conditional distributions:
  P(x_1, ..., x_n) = prod_i P(x_i | x_1, ..., x_{i-1})
- Bayes nets make conditional independence assumptions of the form:
  P(x_i | x_1 ... x_{i-1}) = P(x_i | parents(X_i))
  giving us: P(x_1, ..., x_n) = prod_i P(x_i | parents(X_i))
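The enumeration queries above can be answered mechanically: sum the joint rows consistent with the evidence, then normalize. A minimal sketch (the function name and row encoding are invented) using the lecture's season/temperature/weather table:

```python
joint = [
    ({"S": "summer", "T": "hot",  "W": "sun"},  0.30),
    ({"S": "summer", "T": "hot",  "W": "rain"}, 0.05),
    ({"S": "summer", "T": "cold", "W": "sun"},  0.10),
    ({"S": "summer", "T": "cold", "W": "rain"}, 0.05),
    ({"S": "winter", "T": "hot",  "W": "sun"},  0.10),
    ({"S": "winter", "T": "hot",  "W": "rain"}, 0.05),
    ({"S": "winter", "T": "cold", "W": "sun"},  0.15),
    ({"S": "winter", "T": "cold", "W": "rain"}, 0.20),
]

def enumerate_query(joint, query_var, evidence):
    """P(query_var | evidence): sum consistent rows, then normalize."""
    totals = {}
    for row, p in joint:
        if all(row[v] == val for v, val in evidence.items()):
            totals[row[query_var]] = totals.get(row[query_var], 0.0) + p
    z = sum(totals.values())                       # normalization constant
    return {v: p / z for v, p in totals.items()}

p_sun = enumerate_query(joint, "W", {})["sun"]                      # P(sun)
p_sun_winter = enumerate_query(joint, "W", {"S": "winter"})["sun"]  # P(sun | winter)
p_sun_wh = enumerate_query(joint, "W", {"S": "winter", "T": "hot"})["sun"]
```

The three queries come out to 0.65, 0.5, and 2/3 respectively; the cost of this approach grows with the size of the full joint, which is exactly why the next slides turn to Bayes nets.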
Probabilities in BNs
- Bayes nets implicitly encode joint distributions as a product of local conditional distributions
- To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together
- This lets us reconstruct any entry of the full joint
- Not every BN can represent every joint distribution
- The topology enforces certain conditional independencies

Example: Alarm Network
[Network: Burglary (B) and Earthquake (E) are parents of Alarm (A); A is the parent of JohnCalls (J) and MaryCalls (M)]

  B    P(B)           E    P(E)
  +b   0.001          +e   0.002
  ¬b   0.999          ¬e   0.998

  B    E    A    P(A|B,E)
  +b   +e   +a   0.95
  +b   +e   ¬a   0.05
  +b   ¬e   +a   0.94
  +b   ¬e   ¬a   0.06
  ¬b   +e   +a   0.29
  ¬b   +e   ¬a   0.71
  ¬b   ¬e   +a   0.001
  ¬b   ¬e   ¬a   0.999

  A    J    P(J|A)         A    M    P(M|A)
  +a   +j   0.9            +a   +m   0.7
  +a   ¬j   0.1            +a   ¬m   0.3
  ¬a   +j   0.05           ¬a   +m   0.01
  ¬a   ¬j   0.95           ¬a   ¬m   0.99
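The "multiply all the relevant conditionals" rule above can be sketched directly against the alarm network's tables (the function name and the "-b"-style encoding of the negated values are invented; the CPT numbers are from the slides):

```python
def joint_probability(assignment, cpts, parents):
    """P(x1..xn) = product over variables of P(xi | parents(Xi))."""
    p = 1.0
    for var, cpt in cpts.items():
        key = (assignment[var],) + tuple(assignment[q] for q in parents[var])
        p *= cpt[key]                     # local conditional for this variable
    return p

# Alarm network: B -> A <- E, A -> J, A -> M ("-x" stands in for the slides' negation).
parents = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}
cpts = {
    "B": {("+b",): 0.001, ("-b",): 0.999},
    "E": {("+e",): 0.002, ("-e",): 0.998},
    "A": {("+a", "+b", "+e"): 0.95,  ("-a", "+b", "+e"): 0.05,
          ("+a", "+b", "-e"): 0.94,  ("-a", "+b", "-e"): 0.06,
          ("+a", "-b", "+e"): 0.29,  ("-a", "-b", "+e"): 0.71,
          ("+a", "-b", "-e"): 0.001, ("-a", "-b", "-e"): 0.999},
    "J": {("+j", "+a"): 0.9, ("-j", "+a"): 0.1,
          ("+j", "-a"): 0.05, ("-j", "-a"): 0.95},
    "M": {("+m", "+a"): 0.7, ("-m", "+a"): 0.3,
          ("+m", "-a"): 0.01, ("-m", "-a"): 0.99},
}
p = joint_probability({"B": "-b", "E": "-e", "A": "+a", "J": "+j", "M": "+m"},
                      cpts, parents)
```

For the assignment (no burglary, no earthquake, alarm, John calls, Mary calls) this multiplies 0.999 * 0.998 * 0.001 * 0.9 * 0.7, reconstructing that entry of the full joint from the five local tables.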