Goals for today
Learn that policies can be optimized
directly, without learning value functions,
by policy-gradient methods
Glimpse how one could learn real-valued
(continuous) actions
Glimpse how …
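As a sketch of the first goal, here is a minimal policy-gradient learner (REINFORCE without a baseline) on a two-armed Gaussian bandit. The softmax parameterization and all constants are illustrative assumptions, not taken from the course materials:

```python
import numpy as np

def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def reinforce_bandit(true_means, steps=5000, alpha=0.1, seed=0):
    """REINFORCE on a bandit: move the preferences in the direction of
    grad log pi(a), scaled by the received reward (no value function)."""
    rng = np.random.default_rng(seed)
    prefs = np.zeros(len(true_means))
    for _ in range(steps):
        pi = softmax(prefs)
        a = rng.choice(len(prefs), p=pi)
        r = rng.normal(true_means[a], 1.0)
        grad_log_pi = -pi                  # d/d_prefs log pi(a), part 1
        grad_log_pi[a] += 1.0              # part 2: indicator of chosen action
        prefs += alpha * r * grad_log_pi   # stochastic policy-gradient step
    return softmax(prefs)

pi = reinforce_bandit([1.0, 2.0])
```

The policy is optimized directly: no value function is ever estimated, only action preferences.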
Structuring your code for empirical AI research
No one can tell another what is the best way to write a program. Ultimately, everybody has their
own preferences and should do it the way that makes sense …
Unified View
[Figure: the unified view of RL methods, arranged by width of backup (sample backups vs. exhaustive search) and height/depth of backup (one-step vs. full return): Temporal-difference learning, Dynamic programming, Monte Carlo, and Exhaustive search occupy the four corners.]
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Comprehensive midterm exam covering chapters 2-7.
Review the practice questions in this directory.
One page (double-sided) of handwritten notes is allowed. You must prepare your own notes!
Review A1-A4.
A simulation environment for simple blackjack (see blackjack.pdf).
The state is playerSum in {12–20}, dealerCard in {1–10}, and usableAce
(boolean).
The terminal state is represented as False.
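Under the state description above, the state space can be enumerated directly. The environment API in blackjack.pdf is not reproduced here, so this is only a sketch of the representation:

```python
from itertools import product

# Enumerate the state space described above: playerSum in {12..20},
# dealerCard in {1..10}, usableAce a boolean.
STATES = [(player_sum, dealer_card, usable_ace)
          for player_sum, dealer_card, usable_ace
          in product(range(12, 21), range(1, 11), (False, True))]

TERMINAL = False  # the terminal state is represented as the value False
```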
Reinforcement Learning:
An Introduction
Second edition, in progress
*Draft*
Richard S. Sutton and Andrew G. Barto
© 2014, 2015, 2016
A Bradford Book
The MIT Press
Cambridge, Massachusetts
London, England
STUDENT NAME:
STUDENT ID:
Practice Exam Questions
CMPUT 366:
Intelligent Systems
University of Alberta
Department of Computing Science
This is (practice for) an in-class closed-book …
Chapter 3: The Reinforcement Learning Problem
(Markov Decision Processes, or MDPs)
Objectives of this chapter:
present Markov decision processes, an idealized form of
the AI problem for which we have p…
Chapter 4: Dynamic Programming
Objectives of this chapter:
Overview of a collection of classical solution methods
for MDPs known as dynamic programming (DP)
Show how DP can be used to compute value functions …
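A minimal instance of DP computing values is iterative policy evaluation. The two-state MDP below is an invented example for illustration, not from the chapter:

```python
import numpy as np

def policy_evaluation(P, R, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a fixed policy on a small MDP.
    P[s][s'] - transition probabilities under the policy,
    R[s]     - expected one-step reward under the policy."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V          # Bellman expectation backup
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

# Toy example: two states that alternate; reward 1 for leaving state 0.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
R = np.array([1.0, 0.0])
V = policy_evaluation(P, R)
```

The fixed point satisfies V(0) = 1 + γV(1) and V(1) = γV(0), so V(0) = 1/(1 − γ²).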
CMPUT 366 Fa17 - INTELLIGENT SYSTEMS
Combined LAB LEC Fa17
CMPUT 366 Intelligent Systems Course
Outline
General Information
Term: Fall, 2017 (Lecture A1)
Date and Time: Tu/Th 2-3:20pm starting September …
Question 1, part 5: Empirical search for best step size
by Dylan Ashley
What you have learned from bandits
The need to trade off exploitation and exploration, e.g., by an ε-greedy policy
The difference …
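The two ideas above can be combined in a small experiment: an ε-greedy agent on an assumed 10-armed Gaussian testbed, with an empirical sweep over step sizes (as in the practice question). All constants here are illustrative:

```python
import numpy as np

def run_bandit(step_size, epsilon=0.1, steps=1000, seed=0):
    """One epsilon-greedy run on a 10-armed Gaussian bandit with a
    constant step size; returns the average reward obtained."""
    rng = np.random.default_rng(seed)
    true_means = rng.normal(0.0, 1.0, 10)
    Q = np.zeros(10)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(10))      # explore: random action
        else:
            a = int(np.argmax(Q))          # exploit: greedy action
        r = rng.normal(true_means[a], 1.0)
        Q[a] += step_size * (r - Q[a])     # constant step-size update
        total += r
    return total / steps

# Empirical search for the best step size: try several and keep the best.
results = {a: run_bandit(a) for a in (0.01, 0.1, 0.5)}
best = max(results, key=results.get)
```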
Examples and Videos
of Markov Decision Processes (MDPs)
and Reinforcement Learning
Artificial Intelligence is
interaction to achieve a goal
[Figure: the agent-environment interaction loop: the Agent selects actions; the Environment returns states and rewards.]
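The interaction loop can be sketched generically. The env.reset/env.step/agent.act interface is an assumption, and the toy classes exist only to make the sketch runnable:

```python
def run_episode(env, agent):
    """Generic agent-environment loop (assumed interface:
    env.reset() -> state; env.step(a) -> (state, reward, done);
    agent.act(s) -> action)."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                 # agent chooses an action
        state, reward, done = env.step(action)    # environment responds
        total_reward += reward
    return total_reward

class CountdownEnv:
    """Toy environment: reward 1 per step for three steps, then done."""
    def reset(self):
        self.t = 3
        return self.t
    def step(self, action):
        self.t -= 1
        return self.t, 1.0, self.t == 0

class ConstantAgent:
    def act(self, state):
        return 0
```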
Chapter 6: Temporal Difference Learning
Objectives of this chapter:
Introduce Temporal Difference (TD) learning
Focus first on policy evaluation, or prediction, methods
Compare efficiency of TD learning …
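A minimal sketch of tabular TD(0) prediction, assuming episodes arrive as (state, reward, next-state) transitions with None marking termination (all constants illustrative):

```python
import numpy as np

def td0_episode(V, episode, alpha=0.1, gamma=1.0):
    """One TD(0) pass over an episode of (state, reward, next_state)
    transitions; the value of the terminal state is taken to be 0."""
    for s, r, s_next in episode:
        target = r + (0.0 if s_next is None else gamma * V[s_next])
        V[s] += alpha * (target - V[s])    # update toward the TD target
    return V

V = td0_episode(np.zeros(2), [(0, 0.0, 1), (1, 1.0, None)])
```

Unlike Monte Carlo, each update uses the current estimate V[s_next] instead of waiting for the full return.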
Reinforcement Learning
in Psychology and
Neuroscience
with thanks to
Elliot Ludvig
University of Warwick
Bidirectional Influences
[Figure: bidirectional influences among Psychology, Neuroscience, Artificial Intelligence, and Reinforcement Learning.]
2. The goal of reinforcement learning can be seen as producing a _,
which maps from _ to _.
3. From state x, taking action 1 always produces a reward of 2 and sends you to a
state y from which a return …
Academic Press
Discounted Dynamic Programming
1. Introduction
Consider a process that is observed at time points n = 0, 1, 2, … to be
in one of a number of possible states. The set …
Chapter 5: Monte Carlo Methods
Monte Carlo methods are learning methods
Experience → values, policy
Monte Carlo methods can be used in two ways:
! model-free: No model necessary and still attains optimality …
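First-visit Monte Carlo prediction, the model-free use above, can be sketched as follows. The (state, reward-on-leaving-state) episode format is an assumption for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction: estimate v(s) as the average
    of the returns that followed the first visit to s in each episode.
    Each episode is a list of (state, reward-on-leaving-state) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute returns G_t backward from the end of the episode.
        Gs = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            Gs[t] = G
        # Record the return from each state's first visit only.
        firsts = {}
        for t, (s, _) in enumerate(episode):
            firsts.setdefault(s, t)
        for s, t in firsts.items():
            returns[s].append(Gs[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}

V = first_visit_mc([[("a", 0.0), ("b", 1.0)], [("a", 2.0)]])
```

No model of the environment is used anywhere: values come purely from sampled returns.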
Heuristic Single-Agent Search
Robert Holte
A Heuristic Function Estimates
Distance to Goal
Heuristics Speed up Search
10,461,394,944,000 states
heuristic search examines 36,000
Example Heuristic Function
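The state count above corresponds to the 15-puzzle (16!/2 states). A standard heuristic function for it is Manhattan distance; the tuple state encoding and goal layout below are assumptions for illustration:

```python
def manhattan(state, goal):
    """Manhattan-distance heuristic for the 15-puzzle: sum over tiles of
    horizontal plus vertical distance from the tile's goal position.
    States are tuples of 16 ints on a 4x4 grid, 0 marking the blank."""
    where = {tile: i for i, tile in enumerate(goal)}
    h = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue                      # the blank does not count
        j = where[tile]
        h += abs(i // 4 - j // 4) + abs(i % 4 - j % 4)
    return h

GOAL = tuple(range(16))  # one common convention: blank in the top-left
```

The heuristic never overestimates the true distance (each move slides one tile one square), which is what lets heuristic search skip most of the state space.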
Eligibility Traces
Chapter 12
Eligibility traces are
Another way of interpolating between MC and TD methods
A way of implementing compound λ-return targets
A basic mechanistic idea: a short-term, fading memory …
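The short-term, fading-memory mechanism can be sketched as tabular TD(λ) with accumulating traces; the constants and episode format here are illustrative assumptions:

```python
import numpy as np

def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces: every
    visited state keeps a short-term, fading trace, and each TD error
    updates all traced states in proportion to their trace."""
    z = np.zeros_like(V)                       # eligibility traces
    for s, r, s_next in episode:               # (state, reward, next_state)
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]      # TD error
        z[s] += 1.0                            # accumulate trace on visit
        V += alpha * delta * z                 # credit all traced states
        z *= gamma * lam                       # traces fade each step
    return V

V = td_lambda_episode(np.zeros(2), [(0, 0.0, 1), (1, 1.0, None)])
```

With λ = 0 this reduces to one-step TD; with λ = 1 (and no bootstrapping bias) it approaches Monte Carlo, which is the interpolation the bullet describes.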
Deterministic Tree Search
aka Deterministic Tree-based Planning
aka Search
finding a path from a start state to a goal state
with thanks for slides to Russ Greiner
Search
! Search problem:
! States (con…
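Finding a path from a start state to a goal state can be sketched with breadth-first search; the neighbors-function interface and the toy graph are assumptions for illustration:

```python
from collections import deque

def bfs_path(start, goal, neighbors):
    """Breadth-first search: returns a shortest path from start to goal,
    or None if the goal is unreachable. neighbors(s) yields successors."""
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        s = frontier.popleft()
        if s == goal:
            path = []                      # reconstruct by walking parents
            while s is not None:
                path.append(s)
                s = parent[s]
            return path[::-1]
        for t in neighbors(s):
            if t not in parent:            # first visit is the shortest
                parent[t] = s
                frontier.append(t)
    return None

# Toy graph: four states on a line.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
path = bfs_path(0, 3, lambda s: graph[s])
```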
Chapter 8: Planning and Learning
Objectives of this chapter:
To think more generally about uses of environment models
Integration of (unifying) planning, learning, and execution
Model-based reinforcement learning …
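One concrete integration of planning, learning, and execution is a Dyna-Q-style update, sketched here under a deterministic-model assumption with invented constants:

```python
import random

def dyna_q_update(Q, model, s, a, r, s_next,
                  alpha=0.1, gamma=0.95, n_planning=10):
    """One Dyna-Q step: direct RL update from real experience, record the
    transition in the model, then replay n_planning simulated transitions."""
    def q_update(s, a, r, s_next):
        best = max(Q[s_next]) if s_next is not None else 0.0
        Q[s][a] += alpha * (r + gamma * best - Q[s][a])

    q_update(s, a, r, s_next)           # learning from real experience
    model[(s, a)] = (r, s_next)         # deterministic model assumption
    for _ in range(n_planning):         # planning from remembered experience
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_update(ps, pa, pr, ps_next)

Q = [[0.0, 0.0] for _ in range(2)]
model = {}
dyna_q_update(Q, model, 0, 1, 1.0, None)
```

The same update rule serves both learning (real transitions) and planning (transitions replayed from the model).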
What we learned last time
Value-function approximation by stochastic gradient descent
enables RL to be applied to arbitrarily large state spaces
Most algorithms just carry over the targets from the tabular case
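Carrying a tabular target over to function approximation can be sketched as semi-gradient TD(0) with linear features; the feature encoding and constants are assumptions for illustration:

```python
import numpy as np

def semi_gradient_td0(w, x, r, x_next, alpha=0.1, gamma=1.0):
    """Semi-gradient TD(0) with a linear value function v(s) = w . x(s):
    the TD target is the same as in the tabular case, and the gradient
    of v with respect to w is just the feature vector x."""
    v = w @ x
    v_next = 0.0 if x_next is None else w @ x_next
    delta = r + gamma * v_next - v        # same TD error as tabular TD(0)
    return w + alpha * delta * x          # step along the feature gradient

w = semi_gradient_td0(np.zeros(2), np.array([1.0, 0.0]), 1.0, None)
```

With one-hot features this reduces exactly to the tabular update, which is the sense in which the targets "carry over".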
What we learned last time
1. Intelligence is the computational part of the ability to achieve goals
looking deeper: 1) it's a continuum, 2) it's an appearance, 3) it varies
with observer and purpose
2.