CMPSCI 383, Fall 2011, Lecture 18: Machine Learning

Learning Machines and Hollywood
[Image: HAL plays chess]

Outline
• Motivation: why should agents learn?
• Different models of learning
• Learning from observation: classification and regression
• Learning decision trees
• Linear regression

Machine Learning is Everywhere!
• Every time you speak on the phone to an automated program (travel, UPS, FedEx, ...), you are using machine learning
• Google, Amazon, and Facebook all use machine learning extensively to predict user behavior (they hire many of our PhDs)
• Machine learning is one of the disciplines most sought after by prospective employers

IBM Watson Jeopardy! Quiz Program
• Watson uses machine learning to select answers to a wide range of questions

Stanley: DARPA Grand Challenge Winner
• An autonomous car that drove over a hundred miles through the desert using machine learning

The Future: ML Everywhere!
• Hand-held devices will have terabytes of RAM and petabytes of disk space
• Machine learning will be used massively across smartphones, web software, operating systems, and desktops
• Cars will increasingly use machine learning to drive autonomously
• It is hard to overestimate the impact of ML

Human Learning
• Learning is a hallmark of intelligence
• Human abilities depend on learning:
  • learning a language (e.g., English, French)
  • learning to drive
  • learning to recognize people (faces)
  • learning in the classroom

Bongard Problems
• Identify a rule that separates the figures on the left from those on the right
[Figures: sample Bongard problems]

Types of Learning
• There are many types of learning:
  • supervised learning
  • unsupervised learning
  • reinforcement learning
  • evolutionary (genetic) learning

Supervised Learning
• The simplest model of learning
• An agent is given positive and negative examples of some concept or function
• The goal is to learn an approximation of the desired concept or function
• Classification: discrete concept spaces
• Regression: real-valued functions

Character Recognition
[Figure: handwritten samples of the word "Apple" and the number "383"]
• Humans can effortlessly recognize complex visual patterns (characters, faces, text)
• This apparently simple problem is formidably difficult for machines

Classification
[Figures: scatter plots of points labeled 1, 2, and 3, with decision boundaries separating the classes]

Attribute-Value Data
Examples are described by attribute values (Boolean, discrete, continuous, etc.). E.g., situations where I will/won't wait for a table:

Example  Alt  Bar  Fri  Hun  Patrons  Price  Rain  Res  Type     Est    WillWait
X1       T    F    F    T    Some     $$$    F     T    French   0–10   T
X2       T    F    F    T    Full     $      F     F    Thai     30–60  F
X3       F    T    F    F    Some     $      F     F    Burger   0–10   T
X4       T    F    T    T    Full     $      F     F    Thai     10–30  T
X5       T    F    T    F    Full     $$$    F     T    French   >60    F
X6       F    T    F    T    Some     $$     T     T    Italian  0–10   T
X7       F    T    F    F    None     $      T     F    Burger   0–10   F
X8       F    F    F    T    Some     $$     T     T    Thai     0–10   T
X9       F    T    T    F    Full     $      T     F    Burger   >60    F
X10      T    T    T    T    Full     $$$    F     T    Italian  10–30  F
X11      F    F    F    F    None     $      F     F    Thai     0–10   F
X12      T    T    T    T    Full     $      F     F    Burger   30–60  T

The classification of each example is positive (T) or negative (F); WillWait is the target.

Decision Trees
One possible representation for hypotheses. E.g., here is the "true" tree for deciding whether to wait:

Patrons?
  None: F
  Some: T
  Full: WaitEstimate?
    >60: F
    30–60: Alternate?
      No: Reservation?
        No: Bar? (No: F, Yes: T)
        Yes: T
      Yes: Fri/Sat? (No: F, Yes: T)
    10–30: Hungry?
      No: T
      Yes: Alternate?
        No: T
        Yes: Raining? (No: F, Yes: T)
    0–10: T
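A concrete way to read the tree above is as nested data. Below is a minimal Python sketch (the encoding and names are mine, not from the slides) that stores the "true" tree and classifies example X2 from the table; the table's T/F values for tested attributes like Alternate become the branch labels Yes/No.

  # A decision tree as nested data: an internal node is a tuple
  # (attribute, {value: subtree}); a leaf is the boolean WillWait answer.
  TRUE_TREE = ("Patrons", {
      "None": False,
      "Some": True,
      "Full": ("WaitEstimate", {
          ">60": False,
          "30-60": ("Alternate", {
              "No": ("Reservation", {
                  "No": ("Bar", {"No": False, "Yes": True}),
                  "Yes": True}),
              "Yes": ("Fri/Sat", {"No": False, "Yes": True})}),
          "10-30": ("Hungry", {
              "No": True,
              "Yes": ("Alternate", {
                  "No": True,
                  "Yes": ("Raining", {"No": False, "Yes": True})})}),
          "0-10": True})})

  def classify(tree, example):
      """Walk the tree, following the branch chosen by each tested attribute."""
      while isinstance(tree, tuple):       # internal node: (attribute, branches)
          attribute, branches = tree
          tree = branches[example[attribute]]
      return tree                          # leaf: True (wait) or False (don't)

  # Example X2: Full patrons, 30-60 minute wait, Alternate = T, Fri = F.
  x2 = {"Patrons": "Full", "WaitEstimate": "30-60",
        "Alternate": "Yes", "Fri/Sat": "No"}
  print(classify(TRUE_TREE, x2))   # False, matching WillWait = F for X2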
Boolean Functions: Expressiveness
Decision trees can express any function of the input attributes. E.g., for Boolean functions, each row of the truth table maps to a path to a leaf:

A  B  A xor B
F  F  F
F  T  T
T  F  T
T  T  F

[Figure: the corresponding tree tests A at the root, then B on each branch, with leaves F, T, T, F]

Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples. Prefer to find more compact decision trees.

Hypothesis Spaces
How many distinct decision trees are there with n Boolean attributes?
  = the number of Boolean functions of n inputs
  = the number of distinct truth tables with 2^n rows
  = 2^(2^n)
E.g., with 6 Boolean attributes there are 2^64 = 18,446,744,073,709,551,616 trees.

How many purely conjunctive hypotheses are there (e.g., Hungry ∧ ¬Rain)? Each attribute can be in (positive), in (negative), or out, so there are 3^n distinct conjunctive hypotheses.

A more expressive hypothesis space:
• increases the chance that the target function can be expressed
• increases the number of hypotheses consistent with the training set ⇒ may yield worse predictions

Learning Decision Trees
Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.

function DTL(examples, attributes, default) returns a decision tree
  if examples is empty then return default
  else if all examples have the same classification then return the classification
  else if attributes is empty then return Mode(examples)
  else
    best ← Choose-Attribute(attributes, examples)
    tree ← a new decision tree with root test best
    for each value v_i of best do
      examples_i ← {elements of examples with best = v_i}
      subtree ← DTL(examples_i, attributes − best, Mode(examples))
      add a branch to tree with label v_i and subtree subtree
    return tree

Attribute Selection
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
[Figure: splitting on Patrons? yields None (all negative), Some (all positive), and Full (mixed); splitting on Type? yields four branches, each half positive and half negative]
Patrons? is the better choice: it gives information about the classification.

Entropy
• We can apply ideas from the field of information theory to attribute selection
• Given a set of events, each of which occurs with probability p_i, entropy measures the surprise associated with a particular outcome; low-frequency events are more surprising:

  H(P) = −Σ_i p_i log2 p_i

Information Theory
Suppose we have p positive and n negative examples at the root ⇒ H(p/(p+n), n/(p+n)) bits are needed to classify a new example. E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit.

An attribute splits the examples E into subsets E_i, each of which (we hope) needs less information to complete the classification. If E_i has p_i positive and n_i negative examples, then H(p_i/(p_i+n_i), n_i/(p_i+n_i)) bits are needed to classify a new example in that branch, so the expected number of bits per example over all branches is

  Remainder = Σ_i [(p_i + n_i)/(p + n)] · H(p_i/(p_i+n_i), n_i/(p_i+n_i))

For Patrons? this is 0.459 bits; for Type? it is (still) 1 bit ⇒ choose the attribute that minimizes the remaining information needed.

Entropy Reduction
[Figure: the Patrons?/Type? splits annotated with entropies: 1 bit before either split; still 1 bit expected after splitting on Type?, but only 0.459 bits after splitting on Patrons?]
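As a check on those numbers, here is a small Python sketch of the entropy and remainder computations (the function names are mine; the per-branch counts come straight from the 12-example table):

  from math import log2

  def entropy(p, n):
      """H(p/(p+n), n/(p+n)) in bits, for p positive and n negative examples."""
      return -sum(q * log2(q) for q in (p / (p + n), n / (p + n)) if q > 0)

  def remainder(branches):
      """Expected bits still needed after splitting into (p_i, n_i) subsets."""
      total = sum(p + n for p, n in branches)
      return sum((p + n) / total * entropy(p, n) for p, n in branches)

  # Positive/negative counts per branch, read off the restaurant table:
  patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
  type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

  print(entropy(6, 6))       # 1.0 bit needed at the root (p = n = 6)
  print(remainder(patrons))  # ~0.459 bits -> most information gained
  print(remainder(type_))    # 1.0 bit    -> no information gained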
Learned Decision Tree
Decision tree learned from the 12 examples:

Patrons?
  None: F
  Some: T
  Full: Hungry?
    No: F
    Yes: Type?
      French: T
      Italian: F
      Thai: Fri/Sat? (No: F, Yes: T)
      Burger: T

This is substantially simpler than the "true" tree: a more complex hypothesis isn't justified by such a small amount of data.

Regression
• Another common type of learning involves making continuous, real-valued predictions:
  • How much does it cost to fly to Europe?
  • How long does it take to drive to Northampton?
  • How much money will I make when I graduate?

Inductive Learning Method
• Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:
[Figures: the same data points f(x) vs. x fit by a sequence of hypotheses of increasing complexity, from a straight line to higher-degree curves]

Polynomial Curve Fitting: Linear Basis Function Models
• Generally, y(x, w) = Σ_j w_j φ_j(x), where the φ_j(x) are known as basis functions
• Typically φ_0(x) = 1, so that w_0 acts as a bias
• In the simplest case, we use linear basis functions: φ_d(x) = x_d

Polynomial Basis
• Polynomial basis functions: φ_j(x) = x^j
• These are global: a small change in x affects all basis functions

Gaussian Bases
• Gaussian basis functions: φ_j(x) = exp(−(x − μ_j)² / (2s²))
• These are local: a small change in x only affects nearby basis functions
• μ_j and s control location and scale (width)

Minimize Squared Error
• The process of curve fitting is based on minimizing a loss function
• One example is minimizing the sum of squared errors between predicted and actual values:

  E_D(w) = ½ Σ_{n=1}^{N} (t_n − w^T φ(x_n))²

Widrow-Hoff Algorithm
• An incremental algorithm that modifies the weights based on the gradient of the error
• For each example i in the dataset, update

  w_{t+1} ← w_t + α_t (t_i − φ(x_i)^T w_t) φ(x_i)

• Repeat until the error is small enough

Matrix Approach
Taking the gradient of the error function and setting it to zero gives

  ∇_w E_D(w) = −Σ_{n=1}^{N} (t_n − w^T φ(x_n)) φ(x_n)^T = 0

Solving for w, we get

  w = Φ† t

where Φ† = (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse of the design matrix Φ.
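To connect the two approaches above, here is a small NumPy sketch (illustrative data, step size, and function names are my own) that fits a polynomial basis model with the pseudo-inverse and then reaches roughly the same weights with incremental Widrow-Hoff updates:

  import numpy as np

  def poly_design(x, degree):
      """Design matrix Phi whose columns are the polynomial basis phi_j(x) = x**j."""
      return np.vander(x, degree + 1, increasing=True)  # column 0 is the bias phi_0 = 1

  rng = np.random.default_rng(0)
  x = np.linspace(-1.0, 1.0, 20)
  t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets

  Phi = poly_design(x, degree=3)

  # Batch solution via the Moore-Penrose pseudo-inverse: w = pinv(Phi) @ t
  w_batch = np.linalg.pinv(Phi) @ t

  # Incremental Widrow-Hoff: w <- w + alpha * (t_i - phi(x_i)^T w) * phi(x_i)
  w = np.zeros(Phi.shape[1])
  alpha = 0.1
  for _ in range(1000):                  # passes over the data
      for phi_i, t_i in zip(Phi, t):
          w += alpha * (t_i - phi_i @ w) * phi_i

  print(np.round(w_batch, 2))   # the incremental weights approach the batch solution
  print(np.round(w, 2))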
Summary
• Learning is a fundamental component of intelligence
• Classification is a way of discriminating among categories
• Decision trees are a simple classification method
• Regression is the estimation of real-valued functions
• Least squares is a standard method for regression
