Machine Learning
CMPSCI 383

Learning Machines and Hollywood
[figure: HAL plays chess]

Outline
• Motivation: why should agents learn?
• Different models of learning
• Learning from observation
• classification and regression
• Learning decision trees
• Linear regression

Machine Learning Is Everywhere!
• Every time you speak on the phone to an automated program (travel, UPS, FedEx, ...)
• Google, Amazon, and Facebook all extensively use machine learning to predict user behavior (they hire many of our PhDs)
• Machine learning is one of the most sought-after disciplines among prospective employers

IBM Watson Jeopardy Quiz Program
• Watson uses machine learning to select answers to a wide range of questions

Stanley: DARPA Grand Challenge Winner
• Autonomous car that drove several hundred miles through the desert using machine learning

The Future: ML Everywhere!
• Handheld devices will have terabytes of RAM and petabytes of disk space
• Massive use of machine learning across smartphones, web software, operating systems, and desktops
• Cars will increasingly use machine learning to drive autonomously
• Hard to overestimate the impact of ML

Human Learning
• Learning is a hallmark of intelligence
• Human abilities depend on learning
• Learning a language (e.g., English, French)
• Learning to drive
• Learning to recognize people (faces)
• Learning in the classroom

Bongard Problems
Identify a rule that separates the figures on the left from those on the right
[figure: several example Bongard problems]

Types of Learning
• There are many types of learning
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Evolutionary (genetic) learning

Supervised Learning
• Simplest model of learning
• An agent is given positive and negative examples of some concept or function
• The goal is to learn an approximation of the desired concept or function
• Classification: discrete concept spaces
• Regression: real-valued functions

Character Recognition
[figure: handwritten samples of "Apple" and "383"]
• Humans can effortlessly recognize complex visual patterns (characters, faces, text)
• This apparently simple problem is formidably difficult for machines

Classification
[figure: training examples labeled with classes 1, 2, and 3, to be separated by a classifier]

Attribute-Value Data
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:

Example   Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1        T    F    F    T    Some  $$$    F     T    French   0–10   T
X2        T    F    F    T    Full  $      F     F    Thai     30–60  F
X3        F    T    F    F    Some  $      F     F    Burger   0–10   T
X4        T    F    T    T    Full  $      F     F    Thai     10–30  T
X5        T    F    T    F    Full  $$$    F     T    French   >60    F
X6        F    T    F    T    Some  $$     T     T    Italian  0–10   T
X7        F    T    F    F    None  $      T     F    Burger   0–10   F
X8        F    F    F    T    Some  $$     T     T    Thai     0–10   T
X9        F    T    T    F    Full  $      T     F    Burger   >60    F
X10       T    T    T    T    Full  $$$    F     T    Italian  10–30  F
X11       F    F    F    F    None  $      F     F    Thai     0–10   F
X12       T    T    T    T    Full  $      F     F    Burger   30–60  T

Classification of examples is positive (T) or negative (F)
Decision Trees
One possible representation for hypotheses
E.g., here is the “true” tree for deciding whether to wait:
Patrons?
  None → F
  Some → T
  Full → WaitEstimate?
    >60 → F
    30–60 → Alternate?
      No → Reservation?
        No → Bar? (No → F, Yes → T)
        Yes → T
      Yes → Fri/Sat? (No → F, Yes → T)
    10–30 → Hungry?
      No → T
      Yes → Alternate?
        No → T
        Yes → Raining? (No → F, Yes → T)
    0–10 → T

Boolean Functions
Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:
A | B | A xor B
F | F |    F
F | T |    T
T | F |    T
T | T |    F

[figure: the corresponding decision tree, testing A at the root and B on each branch]

Trivially, there is a consistent decision tree for any training set
w/ one path to leaf for each example (unless f nondeterministic in x)
but it probably won’t generalize to new examples
Prefer to find more compact decision trees

Hypothesis Spaces
How many distinct decision trees with n Boolean attributes??
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)??
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses
More expressive hypothesis space
– increases chance that target function can be expressed
– increases number of hypotheses consistent w/ training set
⇒ may get worse predictions
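These counts are easy to sanity-check; the snippet below is only illustrative arithmetic for the figures quoted above.

n = 6
print(2 ** (2 ** n))  # Boolean functions of 6 attributes: 18446744073709551616
print(3 ** n)         # purely conjunctive hypotheses over 6 attributes: 729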
Learning Decision Trees
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose “most significant” attribute as root of (sub)tree
function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same classification then return the classification
    else if attributes is empty then return Mode(examples)
    else
        best ← ChooseAttribute(attributes, examples)
        tree ← a new decision tree with root test best
        for each value v_i of best do
            examples_i ← {elements of examples with best = v_i}
            subtree ← DTL(examples_i, attributes − best, Mode(examples))
            add a branch to tree with label v_i and subtree subtree
        return tree
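Below is a minimal Python sketch of the DTL pseudocode above. ChooseAttribute is left open in the pseudocode; this sketch fills it in with the information-gain idea developed on the next slides, and the example encoding (an attribute dict paired with a T/F label) and the tuple-based tree representation are assumptions of the sketch, not part of the slides.

import math
from collections import Counter

def entropy(labels):
    # H of the label distribution: -sum_i p_i log2 p_i
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def mode(examples):
    # Most common classification among the examples
    return Counter(label for _, label in examples).most_common(1)[0][0]

def choose_attribute(attributes, examples):
    # Pick the attribute whose split leaves the least expected entropy (highest gain)
    def remainder(attr):
        rem = 0.0
        for v in set(ex[attr] for ex, _ in examples):
            subset = [label for ex, label in examples if ex[attr] == v]
            rem += len(subset) / len(examples) * entropy(subset)
        return rem
    return min(attributes, key=remainder)

def dtl(examples, attributes, default):
    # Follows the DTL pseudocode; returns a label or (attribute, {value: subtree})
    if not examples:
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    branches = {}
    for v in set(ex[best] for ex, _ in examples):
        subset = [(ex, label) for ex, label in examples if ex[best] == v]
        branches[v] = dtl(subset, [a for a in attributes if a != best], mode(examples))
    return (best, branches)

# Tiny demo on the XOR data from the Boolean-functions slide.
xor_examples = [({"A": "F", "B": "F"}, "F"), ({"A": "F", "B": "T"}, "T"),
                ({"A": "T", "B": "F"}, "T"), ({"A": "T", "B": "T"}, "F")]
print(dtl(xor_examples, ["A", "B"], "F"))

Run on the 12 restaurant examples encoded after the data table, the same function should recover a tree like the learned tree shown later in these notes (the exact shape can depend on how ties between equally good attributes are broken).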
Attribute Selection (Choosing an Attribute)
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
[figure: splitting the 12 examples on Type? (French, Italian, Thai, Burger) vs. on Patrons? (None, Some, Full)]
Patrons? is a better choice: it gives information about the classification
Entropy
• We can apply ideas from the field of information theory to attribute selection
• Given a set of events, each of which occurs with probability p_i, entropy measures the surprise associated with a particular outcome
• Low-frequency events are more surprising

H(P) = − Σ_i p_i log₂ p_i
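A minimal sketch of this formula in Python; the function name and the sample distributions are illustrative only.

import math

def H(ps):
    # Entropy of a discrete distribution: H(P) = -sum_i p_i log2 p_i
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(H([0.5, 0.5]))    # fair coin: 1.0 bit
print(H([0.99, 0.01]))  # one outcome nearly certain: about 0.08 bits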
Information Theory
Suppose we have p positive and n negative examples at the root
⇒ H ( p/(p + n), n/(p + n) ) bits needed to classify a new example
E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit
An attribute splits the examples E into subsets E_i, each of which (we hope)
needs less information to complete the classification
Let E_i have p_i positive and n_i negative examples
⇒ H( p_i/(p_i + n_i), n_i/(p_i + n_i) ) bits needed to classify a new example
⇒ the expected number of bits per example over all branches is
Remainder = Σ_i [(p_i + n_i)/(p + n)] · H( p_i/(p_i + n_i), n_i/(p_i + n_i) )
For Patrons?, this is 0.459 bits; for Type?, it is (still) 1 bit
⇒ choose the attribute that minimizes the remaining information needed
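A quick check of the 0.459-bit figure. The per-branch (positive, negative) counts below are read off the 12 examples in the data table; the helper names are illustrative.

import math

def H(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def remainder(branches):
    # Expected bits still needed after a split; branches = [(p_i, n_i), ...]
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * H([p / (p + n), n / (p + n)]) for p, n in branches)

patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger
print(round(remainder(patrons), 3))  # 0.459 bits
print(round(remainder(type_), 3))    # 1.0 bit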
Choosing an Attribute: Entropy Reduction
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
[figure: entropy at the root = 1 bit; splitting on Type? still leaves an expected 1 bit of entropy, while splitting on Patrons? leaves an expected 0.459 bits]
Patrons? is a better choice: it gives information about the classification

Learned Decision Tree
Decision tree learned from the 12 examples:

Patrons?
  None → F
  Some → T
  Full → Hungry?
    No → F
    Yes → Type?
      French → T
      Italian → F
      Thai → Fri/Sat? (No → F, Yes → T)
      Burger → T

Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by the small amount of data
Regression
• Another common type of learning involves making continuous, real-valued predictions
• How much does it cost to fly to Europe?
• How long does it take to drive to Northampton?
• How much money will I make when I graduate?

Inductive Learning Method (Regression)
Construct/adjust h to agree with f on the training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting
[figure: a sequence of hypotheses h(x), from a straight line to higher-degree curves, fit to the same data points]
Polynomial Curve Fitting
[figure: example of fitting a polynomial curve to data points]

Linear Models (Linear Basis Function Models)
Generally, y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x), where the φ_j(x) are known as basis functions.
Typically, φ_0(x) = 1, so that w_0 acts as a bias.
In the simplest case, we use linear basis functions: φ_d(x) = x_d.

Polynomial Basis
Polynomial basis functions: φ_j(x) = x^j.
These are global: a small change in x affects all basis functions.
Gaussian Bases
Gaussian basis functions: φ_j(x) = exp( −(x − μ_j)² / (2s²) ).
These are local: a small change in x only affects nearby basis functions; μ_j and s control location and scale (width).
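A minimal NumPy sketch of the two families of basis functions just described; the function names, centers, and scale below are illustrative choices.

import numpy as np

def polynomial_basis(x, degree):
    # Global features phi_j(x) = x**j, j = 0..degree (phi_0 = 1 acts as the bias)
    return np.array([x ** j for j in range(degree + 1)])

def gaussian_basis(x, centers, s):
    # Local features phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)), plus a bias term
    return np.concatenate(([1.0], np.exp(-(x - centers) ** 2 / (2 * s ** 2))))

x = 0.3
print(polynomial_basis(x, degree=3))                                # ~[1.0, 0.3, 0.09, 0.027]
print(gaussian_basis(x, centers=np.array([0.0, 0.5, 1.0]), s=0.2))  # largest for the nearest center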
Minimize Squared Error
• The process of curve fitting is based on minimizing a loss function
• One example is minimizing the sum of squared errors between predicted and actual values:
E_D(w) = ½ Σ_n ( t_n − w^T φ(x_n) )²

Widrow-Hoff Algorithm
• Incremental algorithm that modifies the weights based on the gradient of the error
• For each example i in the dataset:
w_{t+1} ← w_t + α_t ( t_i − φ(x_i)^T w_t ) φ(x_i)
• Repeat until the error is small enough
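A minimal NumPy sketch of this incremental update (the LMS rule); the learning rate, number of passes, and toy data are illustrative choices.

import numpy as np

def widrow_hoff(Phi, t, alpha=0.05, epochs=200):
    # LMS: w <- w + alpha * (t_i - phi(x_i)^T w) * phi(x_i), one example at a time
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_i, t_i in zip(Phi, t):
            w += alpha * (t_i - phi_i @ w) * phi_i
    return w

# Toy usage: fit t = 1 + 2x with features phi(x) = [1, x]
x = np.linspace(0, 1, 20)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x
print(widrow_hoff(Phi, t))  # should approach [1.0, 2.0]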
Matrix Approach
Setting the gradient of the error function to zero gives
∂E_D(w)/∂w = − Σ_{n=1}^N ( t_n − w^T φ(x_n) ) φ(x_n)^T = 0
Solving for w, we get
w = Φ† t = (Φ^T Φ)^{−1} Φ^T t
where Φ† is the Moore-Penrose pseudo-inverse of the design matrix Φ.
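A minimal NumPy sketch of this closed-form solution on toy data; the data and feature map below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(30)  # noisy targets around a line
Phi = np.column_stack([np.ones_like(x), x])          # design matrix with a bias feature

w = np.linalg.pinv(Phi) @ t                        # Moore-Penrose pseudo-inverse solution
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # equivalent, numerically preferable
print(w, w_lstsq)                                  # both close to [1.0, 2.0]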
Summary
• Learning is a fundamental component of intelligence
• Classification is a way of discriminating among categories
• Decision trees are a simple classification method
• Regression is the estimation of real-valued functions
• Least-squares is a standard method for regression