Aims

COMP9417 Machine Learning and Data Mining, 11s1
Rule Learning
March 15, 2011

This lecture will enable you to describe machine learning approaches to the problem of discovering rules from data. Following it you should be able to:

• define a representation for rules
• describe the decision table and 1R approaches
• outline overfitting avoidance in rule learning using pruning
• reproduce the basic sequential covering algorithm

Acknowledgement: Material derived from slides for the book Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997, http://www2.cs.cmu.edu/~tom/mlbook.html and the book Data Mining, Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2000, http://www.cs.waikato.ac.nz/ml/weka

Relevant WEKA programs: OneR, ZeroR, DecisionTable, PART, Prism, JRip, Ridor

Introduction

In applications of machine learning, specialists may find that users:

• find it hard to understand what some representations for models mean
• expect to see in models similar types of "patterns" to those they can find using manual methods
• have other ideas about kinds of representations for models they think would help them

Machine Learning specialists often prefer certain models of data:
• decision trees
• neural networks
• nearest-neighbour
• ...

Potential Machine Learning users often prefer certain models of data:
• spreadsheets
• 2D plots
• OLAP
• ...

Message: very simple models may be useful at first to help users understand what is going on in the data. Later, we can use representations for models which may allow for greater predictive accuracy.
Data set for Weather

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no

Decision Tables

A simple representation for a model is to use the same format as the input: a decision table. Just look up the attribute values of an instance in the table to find the class value.

This is rote learning, or memorization: no generalization!

However, by selecting a subset of the attributes we can compress the table and still classify new instances.
Decision table:

1. a schema: a set of attributes
2. a body: a multiset of labelled instances, each with a value for each attribute and for the label

A multiset is a "set" which can have repeated elements.

Learning Decision Tables

Best-first search for the schema giving the decision table with the least error:

1. i := 0
2. attribute set Ai := A
3. schema Si := ∅
4. Do
   • Find the best attribute a ∈ Ai to add to Si by minimising the cross-validation estimate of error Ei
   • Ai := Ai \ {a}
   • Si := Si ∪ {a}
   • i := i + 1
5. While Ei is reducing

LOOCV

"Leave-one-out cross-validation". Given a data set, we often wish to estimate the error on new data of a model learned from this data set. What can we do?

We can use a holdout set: a subset of the data set which is NOT used for training but is used in testing our model. Often we use a 2:1 split of training:test data. BUT this means only 2/3 of the data set is available to learn our model . . .

So in LOOCV, for n examples, we repeatedly leave 1 out and train on the remaining n − 1 examples. Doing this n times, the mean error of all the train-and-test iterations is our estimate of the "true error" of our model.
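The LOOCV procedure can be sketched in a few lines of Python. This is an illustrative sketch, not from the lecture: the `learn`/`predict` pair is a placeholder interface, and a majority-class learner stands in for a real model.

```python
def loocv_error(examples, labels, learn, predict):
    """LOOCV: train n times on n-1 examples, test on the held-out one;
    return the mean error over the n train-and-test iterations."""
    n = len(examples)
    errors = 0
    for i in range(n):
        model = learn(examples[:i] + examples[i + 1:],
                      labels[:i] + labels[i + 1:])
        if predict(model, examples[i]) != labels[i]:
            errors += 1
    return errors / n

# Stand-in learner: predict the majority class, ignoring the attributes.
def learn_majority(xs, ys):
    return max(set(ys), key=ys.count)

play = "no no yes yes yes no yes no yes yes yes yes yes no".split()
xs = [None] * len(play)   # attributes unused by this learner
print(loocv_error(xs, play, learn_majority, lambda model, x: model))
# → 0.357... (5/14: every held-out "no" is misclassified as "yes")
```

On the weather labels, every train set still has "yes" in the majority, so exactly the five "no" instances are misclassified.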
k-fold Cross-Validation

A problem with LOOCV: we have to learn a model n times for n examples in our data set. Is this really necessary?

Partition the data set into k equal-size disjoint subsets. Each of these k subsets in turn is used as the test set while the remainder are used as the training set. The mean error of all the train-and-test iterations is our estimate of the "true error" of our model.

k = 10 is a reasonable choice (or k = 3 if the learning takes a long time).

Ensuring the class distribution in each subset is the same as that of the complete data set is called stratification.

We'll see cross-validation again . . .
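The k-fold procedure can be sketched similarly. Again a minimal illustration with an assumed `learn`/`predict` interface; note this simple version does not stratify the folds.

```python
import random

def kfold_error(examples, labels, learn, predict, k=10, seed=0):
    """k-fold CV: partition the data into k disjoint folds; each fold
    in turn is the test set, the remainder the training set."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal-size folds
    errors = 0
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = learn([examples[i] for i in train],
                      [labels[i] for i in train])
        errors += sum(predict(model, examples[i]) != labels[i]
                      for i in fold)
    return errors / len(examples)

# Stand-in learner: predict the majority class, ignoring the attributes.
def learn_majority(xs, ys):
    return max(set(ys), key=ys.count)

play = "no no yes yes yes no yes no yes yes yes yes yes no".split()
print(kfold_error([None] * 14, play, learn_majority,
                  lambda model, x: model, k=7))
# → 0.357... (5/14: "yes" is always the training majority)
```

A stratified version would build each fold so that it preserves the overall class distribution, as described above.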
Decision Table for play

Best-first search for the feature set, terminated after 5 non-improving subsets.
Evaluation (for feature selection): CV (leave one out)
Rules:
==================================
outlook    humidity   play
==================================
sunny      normal     yes
overcast   normal     yes
rainy      normal     yes
rainy      high       yes
overcast   high       yes
sunny      high       no
==================================
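A learned decision table is applied by simple lookup over the selected schema. A minimal sketch of this (the dictionary encoding and the majority-class fallback are illustrative choices, not from WEKA):

```python
# Decision table over the selected schema (outlook, humidity),
# as found by the best-first feature search.
schema = ("outlook", "humidity")
table = {
    ("sunny", "normal"): "yes",
    ("overcast", "normal"): "yes",
    ("rainy", "normal"): "yes",
    ("rainy", "high"): "yes",
    ("overcast", "high"): "yes",
    ("sunny", "high"): "no",
}

def classify(instance, default="yes"):
    """Look up the instance's values for the schema attributes;
    fall back to a default (majority) class for unseen combinations."""
    key = tuple(instance[a] for a in schema)
    return table.get(key, default)

print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}))
# → no
```

Any attribute outside the schema (e.g. windy) is simply ignored: that is the compression.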
Decision Table for play

Unfortunately, not particularly good at predicting play . . .

=== Stratified cross-validation ===
Correctly Classified Instances      6    42.8571 %
Incorrectly Classified Instances    8    57.1429 %

However, on a number of real-world domains the decision table has been shown to give predictive accuracy competitive with the C4.5 decision-tree learner, while using a simpler model representation.

Representing Rules

General form of a rule:

    Antecedent → Consequent

• Antecedent (precondition) is a series of tests or constraints on attributes (like the tests at decision tree nodes)
• Consequent (postcondition or conclusion) gives a class value or a probability distribution on class values (like the leaf nodes of a decision tree)
• Rules of this form (with a single conclusion) are classification rules
• The antecedent is true if the logical conjunction of its constraints is true
• The rule then "fires" and gives the class in the consequent

A rule also has a procedural interpretation: If antecedent Then consequent
Sets of Rules

Think of a set of rules as a logical disjunction:

    Rule1 ∨ Rule2 ∨ . . .

A problem: this can give rise to conflicts:

    Rule1: att1=red ∧ att2=circle → yes
    Rule2: att2=circle ∧ att3=heavy → no

Instance (red, circle, heavy) is classified as both yes and no!
Either give no conclusion, or the conclusion of the rule with highest coverage.

Another problem: some instances may not be covered by any rule.
Either give no conclusion, or the majority class of the training set.

Rules vs. Trees

We can solve both problems on the previous slide by using ordered rules with a default class, e.g. a decision list:

    If . . . Then . . .
    Else If . . . Then . . .

However, this is essentially back to trees (which don't suffer from these problems, due to their fixed order of execution).

So why not just use trees?

Rules can be modular (independent "nuggets" of information) whereas trees are not (easily) made of independent components. Rules can also be more compact than trees (see the lecture on "Decision Tree Learning").
Rules vs. Trees

How would you represent these rules as a tree, if each attribute w, x, y and z can have values 1, 2 or 3?

    If x = 1 and y = 1 Then class = a
    If z = 1 and w = 1 Then class = a
    Otherwise class = b

1R

A simple rule-learner which has nonetheless proved very competitive in some domains. Called 1R for "1-rule", it is a one-level decision tree expressed as a set of rules that all test one attribute.

    For each attribute a
        For each value v of a, make a rule:
            count how often each class appears
            find the most frequent class c
            set the rule to assign class c for attribute-value a = v
        Calculate the error rate of the rules for a
    Choose the set of rules with the lowest error rate
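The pseudocode above translates almost directly into Python. A sketch (the data-as-dictionaries encoding is an assumption for illustration), here run on the outlook and humidity attributes of the weather data:

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """1R: for each attribute, make one rule per value predicting that
    value's most frequent class; keep the attribute whose rule set
    has the lowest total error."""
    best = None
    for att in instances[0]:
        counts = defaultdict(Counter)        # value -> class counts
        for inst, cls in zip(instances, labels):
            counts[inst[att]][cls] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values())
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (att, rules, errors)
    return best   # (attribute, {value: class}, total errors)

outlook = ("sunny sunny overcast rainy rainy rainy overcast "
           "sunny sunny rainy sunny overcast overcast rainy").split()
humidity = ("high high high high normal normal normal "
            "high normal normal normal high normal high").split()
play = "no no yes yes yes no yes no yes yes yes yes yes no".split()
data = [{"outlook": o, "humidity": h} for o, h in zip(outlook, humidity)]

att, rules, errors = one_r(data, play)
print(att, errors)        # → outlook 4   (ties with humidity, found first)
print(rules["overcast"])  # → yes
```

On this data outlook and humidity tie at 4/14 errors, and the first attribute examined wins, matching the slide's outlook rule set.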
1R on play

attribute     rules              errors   total errors
outlook       sunny → no         2/5      4/14
              overcast → yes     0/4
              rainy → yes        2/5
temperature   hot → no           2/4      5/14
              mild → yes         2/6
              cool → yes         1/4
humidity      high → no          3/7      4/14
              normal → yes       1/7
windy         false → yes        2/8      5/14
              true → no          3/6

Two rule sets tie with the smallest number of errors; the first one is:

    outlook:
        sunny    → no
        overcast → yes
        rainy    → yes
    (10/14 instances correct)

Things are more complicated with missing or numeric attributes:

• treat "missing" as a separate value
• discretize numeric attributes by choosing breakpoints for threshold tests

However, too many breakpoints cause overfitting, so a parameter specifies the minimum number of examples lying between two thresholds. For example:

    humidity:
        < 82.5  → yes
        < 95.5  → no
        >= 95.5 → yes
    (11/14 instances correct)

ZeroR

What is this? Simply the 1R method, but testing zero attributes instead of one.

What does it do? It predicts the majority class in the training set (the mean, for numerical prediction).

What is the point? It provides a baseline for comparing classifier performance.

Stop and think about it . . . it is a most-general classifier, having no constraints on attributes. Usually it will be too general (e.g. "always play"). So we could try 1R, which is less general (more specific) . . .
What does this process of moving from ZeroR to 1R resemble?

Learning Disjunctive Sets of Rules

Method 1: Learn a decision tree, convert it to rules
• can be slow for large and noisy datasets
• improvements: e.g. C5.0, Weka PART

Method 2: Sequential covering algorithm:
1. Learn one rule with high accuracy, any coverage
2. Remove positive examples covered by this rule
3. Repeat

Sequential Covering Algorithm

Sequential-covering(Target_attribute, Attributes, Examples, Threshold)
• Learned_rules ← {}
• Rule ← learn-one-rule(Target_attribute, Attributes, Examples)
• while performance(Rule, Examples) > Threshold, do
    – Learned_rules ← Learned_rules + Rule
    – Examples ← Examples − {examples correctly classified by Rule}
    – Rule ← learn-one-rule(Target_attribute, Attributes, Examples)
• Learned_rules ← sort Learned_rules according to performance over Examples
• return Learned_rules
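The covering loop can be sketched concretely. This is a simplified illustration, not the lecture's algorithm verbatim: rules are single attribute=value tests, `best_single_test_rule` is a deliberately crude stand-in for learn-one-rule, and accuracy serves as the performance function.

```python
from collections import Counter

# A rule is a triple (att, val, cls): "if att = val then class = cls".
def covers(rule, x):
    att, val, _ = rule
    return x[att] == val

def accuracy(rule, examples):
    covered = [y for x, y in examples if covers(rule, x)]
    return covered.count(rule[2]) / len(covered) if covered else 0.0

def best_single_test_rule(examples):
    """Stand-in for learn-one-rule: the single attribute=value test whose
    majority class is most accurate on the examples it covers."""
    best = None
    for x, _ in examples:
        for att, val in x.items():
            cls = Counter(y for x2, y in examples
                          if x2[att] == val).most_common(1)[0][0]
            rule = (att, val, cls)
            if best is None or accuracy(rule, examples) > accuracy(best, examples):
                best = rule
    return best

def sequential_covering(examples, threshold=0.5):
    all_examples, learned_rules = examples, []
    while examples:
        rule = best_single_test_rule(examples)
        if rule is None or accuracy(rule, examples) <= threshold:
            break
        learned_rules.append(rule)
        # remove the examples this rule classifies correctly
        examples = [(x, y) for x, y in examples
                    if not (covers(rule, x) and y == rule[2])]
    learned_rules.sort(key=lambda r: accuracy(r, all_examples), reverse=True)
    return learned_rules

outlook = ("sunny sunny overcast rainy rainy rainy overcast "
           "sunny sunny rainy sunny overcast overcast rainy").split()
humidity = ("high high high high normal normal normal "
            "high normal normal normal high normal high").split()
play = "no no yes yes yes no yes no yes yes yes yes yes no".split()
weather = [({"outlook": o, "humidity": h}, y)
           for o, h, y in zip(outlook, humidity, play)]

rules = sequential_covering(weather)
print(rules[0])   # → ('outlook', 'overcast', 'yes'), 4/4 correct
```

Each pass removes the covered examples, so later rules are learned on ever-smaller remainders, exactly the behaviour the slide describes.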
Learn One Rule

General-to-specific search over candidate rules for PlayTennis:

    IF  THEN PlayTennis=yes
        IF Wind=weak THEN PlayTennis=yes
        IF Wind=strong THEN PlayTennis=no
        IF Humidity=normal THEN PlayTennis=yes
        IF Humidity=high THEN PlayTennis=no
            ...
            IF Humidity=normal AND Wind=weak THEN PlayTennis=yes
            IF Humidity=normal AND Wind=strong THEN PlayTennis=yes
            IF Humidity=normal AND Outlook=sunny THEN PlayTennis=yes
            IF Humidity=normal AND Outlook=rain THEN PlayTennis=yes

Algorithm "Learn One Rule"

LearnOneRule(Target_attribute, Attributes, Examples)
// Returns a single rule which covers some of the
// positive examples and none of the negatives.

    Pos := positive Examples
    Neg := negative Examples
    BestRule := (the empty rule)
    if Pos then
        NewAnte := most general rule antecedent possible
        NewRuleNeg := Neg
        while NewRuleNeg do
            for ClassVal in Target_attribute values do
                NewCons := (Target_attribute = ClassVal)
                // Add a new literal to specialize NewAnte, i.e. possible
                // constraints of the form att = val for att ∈ Attributes
                Candidate_literals := generate candidates
                Best_literal := argmax over L ∈ Candidate_literals of
                    Performance(SpecializeAnte(NewAnte, L) → NewCons)
                add Best_literal to NewAnte
                NewRule := NewAnte → NewCons
                if Performance(NewRule) > Performance(BestRule) then
                    BestRule := NewRule
                endif
                NewRuleNeg := subset of NewRuleNeg that satisfies NewAnte
            endfor
        endwhile
    endif
    return BestRule

Learn One Rule

• Called a covering approach because at each stage a rule is identified that covers some of the instances
• the evaluation function Performance(Rule) is unspecified
• a simple measure would be the number of negatives not covered by the antecedent, i.e. Neg − NewRuleNeg
• the consequent could then be the most frequent value of the target attribute among the examples covered by the antecedent
• this is sure not to be the best measure of performance!
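A runnable sketch of the greedy specialization idea, under simplifying assumptions not in the slide: discrete attributes only, a single fixed target class, no beam search, and coverage accuracy as the performance measure.

```python
def learn_one_rule(examples, target_class):
    """Greedy general-to-specific search: start from the most general
    (empty) antecedent and repeatedly add the attribute=value literal
    with the best performance, until no negatives are covered."""
    ante = {}   # conjunction of attribute = value constraints

    def covered():
        return [(x, y) for x, y in examples
                if all(x[a] == v for a, v in ante.items())]

    def negatives():
        return [e for e in covered() if e[1] != target_class]

    while negatives():
        # candidate literals: att = val pairs over unconstrained attributes
        candidates = {(a, v) for x, _ in examples for a, v in x.items()
                      if a not in ante}
        if not candidates:
            break   # nothing left to specialize on

        def performance(lit):
            a, v = lit
            cov = [y for x, y in covered() if x[a] == v]
            return cov.count(target_class) / len(cov) if cov else 0.0

        a, v = max(candidates, key=performance)
        ante[a] = v
    return ante, target_class

outlook = ("sunny sunny overcast rainy rainy rainy overcast "
           "sunny sunny rainy sunny overcast overcast rainy").split()
humidity = ("high high high high normal normal normal "
            "high normal normal normal high normal high").split()
play = "no no yes yes yes no yes no yes yes yes yes yes no".split()
data = [({"outlook": o, "humidity": h}, y)
        for o, h, y in zip(outlook, humidity, play)]

print(learn_one_rule(data, "yes"))   # → ({'outlook': 'overcast'}, 'yes')
```

On the weather data a single literal (outlook = overcast) already excludes every negative, so the loop stops after one specialization step.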
Example: generating a rule

[Figure: examples from classes a and b plotted against attributes x and y, with rules of increasing specificity covering the class-a region:
    If true then class = a
    If x > 1.2 then class = a
    If x > 1.2 and y > 2.6 then class = a]

Subtleties: Learn One Rule

1. May use beam search
2. Easily generalizes to multi-valued target functions
3. Choose an evaluation function to guide the search:
   • Entropy (i.e., information gain)
   • Sample accuracy: nc / n, where nc = correct rule predictions and n = all predictions
   • m-estimate: (nc + mp) / (n + m), where p is the prior probability of the class and m weights the prior

Aspects of Sequential Covering Algorithms

• Sequential Covering learns rules singly. Decision Tree induction learns all disjuncts simultaneously.
• Sequential Covering chooses between all attribute-value pairs at each specialisation step (i.e. between subsets of the examples covered). Decision Tree induction only chooses between all attributes (i.e. between partitions of the examples w.r.t. the added attribute).
• Assuming the final ruleset contains on average n rules with k conditions, sequential covering requires n × k primitive selection decisions. Choosing an attribute at an internal node of a decision tree equates to choosing attribute-value pairs for the conditions of all corresponding rules.
• If data is plentiful, then the greater flexibility for choosing attribute-value pairs might be desired and might lead to better performance.
• If a general-to-specific search is chosen, then start from a single node. If a specific-to-general search is chosen, then for a set of examples we need to determine what the starting nodes are.
• Depending on the number of conditions expected for rules relative to the number of conditions in the examples, most general rules may be closer to the target than most specific rules.
• General-to-specific sequential covering is a generate-and-test approach: all syntactically permitted specialisations are generated and tested against the data. Specific-to-general is typically example-driven, constraining the hypotheses generated.
• Variations on performance evaluation are often implemented: entropy, m-estimate, relative frequency, significance tests (e.g. likelihood ratio).
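The candidate evaluation functions listed under "Subtleties: Learn One Rule" are one-liners; a small sketch for reference (argument names follow the slide's notation):

```python
import math

def entropy(counts):
    """Entropy of the class distribution among the examples a rule covers."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def sample_accuracy(nc, n):
    """nc correct rule predictions out of n predictions."""
    return nc / n

def m_estimate(nc, n, p, m):
    """Accuracy smoothed toward the class prior p; m controls the
    weight given to the prior (m = 0 recovers sample accuracy)."""
    return (nc + m * p) / (n + m)

# A rule covering 3 examples, all correct, with class prior 0.5:
print(sample_accuracy(3, 3))          # → 1.0
print(m_estimate(3, 3, p=0.5, m=2))   # → 0.8, i.e. (3 + 1) / 5
```

The m-estimate shows why small-coverage rules are penalised: three correct predictions no longer look perfect once the prior is mixed in.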
Rules with exceptions

Idea: allow rules to have exceptions. Example rule for the iris data:

    If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor

New instance:

    Sepal length  Sepal width  Petal length  Petal width  Type
    5.1           3.5          2.6           0.2          Iris-setosa

Modified rule:

    If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor
        EXCEPT if petal-width < 1.0 then Iris-setosa

Exceptions to exceptions to exceptions . . .

    default: Iris-setosa
      except if petal-length ≥ 2.45 and petal-length < 5.355
                and petal-width < 1.75
             then Iris-versicolor
                  except if petal-length ≥ 4.95 and petal-width < 1.55
                         then Iris-virginica
                  else if sepal-length < 4.95 and sepal-width ≥ 2.45
                         then Iris-virginica
      else if petal-length ≥ 3.35
             then Iris-virginica
                  except if petal-length < 4.85 and sepal-length < 5.95
                         then Iris-versicolor
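The modified iris rule above can be represented as nested rules carrying exception lists. A minimal sketch (this tuple encoding is purely illustrative, not taken from WEKA or InductRDR):

```python
# A rule with exceptions: (condition, conclusion, exceptions), where the
# exceptions are themselves rules, tried only if the condition fired.
def classify(rule, x):
    cond, conclusion, exceptions = rule
    if cond(x):
        for ex in exceptions:
            verdict = classify(ex, x)
            if verdict is not None:
                return verdict       # an exception overrides the conclusion
        return conclusion
    return None                      # rule does not cover the instance

versicolor_rule = (
    lambda x: 2.45 <= x["petal_length"] < 4.45,
    "Iris-versicolor",
    [(lambda x: x["petal_width"] < 1.0, "Iris-setosa", [])],
)

# The new instance from the slide: covered by the rule, caught by the exception.
print(classify(versicolor_rule,
               {"petal_length": 2.6, "petal_width": 0.2}))
# → Iris-setosa
```

Because exceptions are only consulted after the enclosing condition fires, each conclusion is read in the local context of the rules leading to it, which is exactly the locality property discussed next.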
Advantages of using exceptions

Default . . . except if . . . then . . . is logically equivalent to if . . . then . . . else, where the else specifies the default. But exceptions offer a psychological advantage:

• Rules can be updated incrementally
  – Easy to incorporate new data
  – Easy to incorporate domain knowledge
• People often think in terms of exceptions
• Each conclusion can be considered just in the context of the rules and exceptions that lead to it
  – This locality property is important for understanding large rule sets
  – "Normal" rule sets don't offer this advantage
• Assumption: defaults and tests early on apply more widely than exceptions further down
• Exceptions reflect special cases

InductRDR

Gaines & Compton (1995)

• Learns "Ripple-Down Rules" from examples
• INDUCT's significance measure for a rule:
  – the probability of a completely random rule with the same coverage performing at least as well
• A random rule R selects t cases at random from the data set
• How likely is it that p of these belong to the correct class?
• The probability is given by the hypergeometric distribution
• approximated by the incomplete beta function
• works well if the target function suits the rules-with-exceptions bias

Issues for Rule Learning Programs

• Sequential or simultaneous covering of data?
• General → specific, or specific → general?
• Generate-and-test, or example-driven?
• Whether and how to post-prune?
• What statistical evaluation function?
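INDUCT's significance measure, described above, can be computed exactly from the hypergeometric distribution with only the standard library (no incomplete-beta approximation needed at this scale). Variable names follow the slide where possible; K, the number of class members in the data set, is notation introduced here:

```python
from math import comb

def hypergeom_pmf(k, n, K, t):
    """P(exactly k of t randomly drawn cases are in the class),
    with K class members among n cases in total."""
    return comb(K, k) * comb(n - K, t - k) / comb(n, t)

def rule_significance(p, t, K, n):
    """Probability that a completely random rule covering t cases gets
    at least p of them right: lower means more significant."""
    return sum(hypergeom_pmf(k, n, K, t) for k in range(p, min(t, K) + 1))

# Weather data: 9 of 14 examples are "yes". A rule covering 4 cases,
# all "yes" (e.g. outlook = overcast):
print(rule_significance(4, 4, 9, 14))   # → 0.1259 (= 126/1001)
```

So even a 4/4-correct rule has roughly a 1-in-8 chance of arising from a random selection of four cases, which is why INDUCT demands more than raw accuracy.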
Summary of Rule Learning

• A major class of representations (AI, business rules, . . . )
• Rule interpretation may need care
• Many common learning issues: search, evaluation, overfitting, etc.
• Can be related to numeric prediction by threshold functions
• Lifted to first-order representations in Inductive Logic Programming
This note was uploaded on 06/20/2011 for the course COMP 9417 taught by Professor Some during the Three '11 term at University of New South Wales.