3.1 Classification Rule Learning

Aims

11s1: COMP9417 Machine Learning and Data Mining
Rule Learning, March 15, 2011

This lecture will enable you to describe machine learning approaches to the problem of discovering rules from data. Following it you should be able to:

• define a representation for rules
• describe the decision table and 1R approaches
• outline overfitting avoidance in rule learning using pruning
• reproduce the basic sequential covering algorithm

Acknowledgement: material derived from slides for the book "Machine Learning", Tom M. Mitchell, McGraw-Hill, 1997 (http://www-2.cs.cmu.edu/~tom/mlbook.html), and the book "Data Mining", Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2000 (http://www.cs.waikato.ac.nz/ml/weka).

Relevant WEKA programs: OneR, ZeroR, DecisionTable, PART, Prism, JRip, Ridor

Introduction

Machine learning specialists often prefer certain models of data:
• decision trees
• neural networks
• nearest-neighbour
• ...

Potential machine learning users often prefer certain models of data:
• spreadsheets
• 2D plots
• OLAP
• ...

In applications of machine learning, specialists may find that users:
• find it hard to understand what some representations for models mean
• expect to see in models similar types of "patterns" to those they can find using manual methods
• have other ideas about the kinds of representations for models they think would help them

Message: very simple models may be useful at first, to help users understand what is going on in the data. Later, we can use representations for models which may allow for greater predictive accuracy.

Data set for Weather

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Decision Tables

A simple representation for a model is to use the same format as the input - a decision table. To classify an instance, just look up its attribute values in the table to find the class value. This is rote learning, or memorization - no generalization!

However, by selecting a subset of the attributes we can compress the table and classify new instances.

A decision table consists of:
1. a schema: a set of attributes
2. a body: a multiset of labelled instances, each of which has a value for every attribute and for the label

(A multiset is a "set" which can have repeated elements.)

Learning Decision Tables

Best-first search for the schema giving the decision table with the least error:

1. i := 0
2. attribute set A_i := A
3. schema S_i := ∅
4. Do
   • find the best attribute a ∈ A_i to add to S_i, by minimising the cross-validation estimate of error E_i
   • A_i := A_i \ {a}
   • S_i := S_i ∪ {a}
   • i := i + 1
5. While E_i is reducing
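To make the table-lookup idea concrete, here is a minimal Python sketch (illustrative only, not WEKA's DecisionTable; the function names and the fallback-to-default behaviour are my own choices). It projects the training instances onto a chosen schema and classifies new instances by lookup, falling back to a default class when an instance's projected values do not appear in the table:

    from collections import Counter

    def build_table(instances, labels, schema):
        """Project each instance onto the schema and record the labels seen."""
        table = {}
        for inst, label in zip(instances, labels):
            key = tuple(inst[a] for a in schema)
            table.setdefault(key, []).append(label)
        return table

    def classify(table, schema, inst, default):
        """Look up the projected instance; majority label of the matching row."""
        labels = table.get(tuple(inst[a] for a in schema))
        if labels is None:
            return default  # instance not covered by the compressed table
        return Counter(labels).most_common(1)[0][0]

    # Toy usage with schema {outlook, humidity} on part of the weather data
    X = [{"outlook": "sunny", "humidity": "high"},
         {"outlook": "sunny", "humidity": "normal"},
         {"outlook": "overcast", "humidity": "high"}]
    y = ["no", "yes", "yes"]
    table = build_table(X, y, ["outlook", "humidity"])
    print(classify(table, ["outlook", "humidity"],
                   {"outlook": "sunny", "humidity": "high"}, "yes"))  # -> no

Selecting which attributes go into the schema is the job of the best-first search above, scored by the cross-validation estimate of error.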
LOOCV

"Leave-one-out cross-validation".

Given a data set, we often wish to estimate the error on new data of a model learned from this data set. What can we do?

We can use a holdout set: a subset of the data set which is NOT used for training, but is used for testing our model. Often a 2:1 split of training to test data is used. BUT this means only 2/3 of the data set is available to learn our model ...

So in LOOCV, for n examples, we repeatedly leave 1 out and train on the remaining n − 1 examples. Doing this n times, the mean error over all the train-and-test iterations is our estimate of the "true error" of our model.

k-fold Cross-Validation

A problem with LOOCV: we have to learn a model n times for n examples in our data set. Is this really necessary?

Instead, partition the data set into k equal-size disjoint subsets. Each of these k subsets in turn is used as the test set, while the remainder are used as the training set. The mean error over all the train-and-test iterations is our estimate of the "true error" of our model.

k = 10 is a reasonable choice (or k = 3 if the learning takes a long time).

Ensuring the class distribution in each subset is the same as that of the complete data set is called stratification.

We'll see cross-validation again ...
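Both estimators are a few lines in any machine learning toolkit. A quick sketch using scikit-learn as a stand-in for WEKA's evaluator (the classifier and data set here are arbitrary placeholders):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=0)

    # LOOCV: n train-and-test iterations, each leaving one example out
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print("LOOCV error estimate:", 1 - loo.mean())

    # Stratified 10-fold CV: class distribution preserved in each fold
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    folds = cross_val_score(clf, X, y, cv=skf)
    print("10-fold CV error estimate:", 1 - folds.mean())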
Decision Table for play

Running the decision table learner on the weather data gives:

Best first search for feature set, terminated after 5 non-improving subsets.
Evaluation (for feature selection): CV (leave one out)
Rules:
==================================
outlook     humidity    play
==================================
sunny       normal      yes
overcast    normal      yes
rainy       normal      yes
rainy       high        yes
overcast    high        yes
sunny       high        no
==================================

Unfortunately, this table is not particularly good at predicting play ...

=== Stratified cross-validation ===
Correctly Classified Instances       6       42.8571 %
Incorrectly Classified Instances     8       57.1429 %

However, on a number of real-world domains the decision table has been shown to give predictive accuracy competitive with the C4.5 decision-tree learner, while using a simpler model representation.

Representing Rules

The general form of a rule:

Antecedent → Consequent

• the antecedent (pre-condition) is a series of tests or constraints on attributes (like the tests at decision tree nodes)
• the consequent (post-condition, or conclusion) gives the class value or a probability distribution on class values (like the leaf nodes of a decision tree)
• rules of this form (with a single conclusion) are classification rules
• the antecedent is true if the logical conjunction of its constraints is true
• the rule then "fires" and gives the class in the consequent

A rule also has a procedural interpretation: If antecedent Then consequent.

Sets of Rules

Think of a set of rules as a logical disjunction: Rule1 ∨ Rule2 ∨ ...

One problem: rule sets can give rise to conflicts:

Rule1: att1 = red ∧ att2 = circle → yes
Rule2: att2 = circle ∧ att3 = heavy → no

The instance ⟨red, circle, heavy⟩ is classified as both yes and no! Either give no conclusion, or take the conclusion of the rule with the highest coverage.

Another problem: some instances may not be covered by any rule. Either give no conclusion, or take the majority class of the training set.

Rules vs. Trees

Both problems on the previous slide can be solved by using ordered rules with a default class, e.g. a decision list: If ... Then ... Else If ... Then ... However, this is essentially back to trees (which do not suffer from these problems, due to their fixed order of execution).

So why not just use trees?

• Rules can be modular, independent "nuggets" of information, whereas trees are not (easily) made of independent components.
• Rules can be more compact than trees (see the lecture on "Decision Tree Learning"). For example: how would you represent these rules as a tree, if each of the attributes w, x, y and z can have values 1, 2 or 3?

If x = 1 and y = 1 Then class = a
If z = 1 and w = 1 Then class = a
Otherwise class = b

1R

A simple rule learner which has nonetheless proved very competitive in some domains. Called 1R for "1-rule", it is a one-level decision tree expressed as a set of rules that all test one attribute:

For each attribute a
    For each value v of a, make a rule:
        count how often each class appears
        find the most frequent class c
        set the rule to assign class c for the attribute-value a = v
    Calculate the error rate of the rules for a
Choose the set of rules with the lowest error rate

1R on play

attribute     rules                 errors   total errors
outlook       sunny → no            2/5      4/14
              overcast → yes        0/4
              rainy → yes           2/5
temperature   hot → no              2/4      5/14
              mild → yes            2/6
              cool → yes            1/4
humidity      high → no             3/7      4/14
              normal → yes          1/7
windy         false → yes           2/8      5/14
              true → no             3/6

Two rule sets tie with the smallest number of errors (4/14); the first one is:

outlook: sunny    -> no
         overcast -> yes
         rainy    -> yes
(10/14 instances correct)

Things are more complicated with missing or numeric attributes:
• treat "missing" as a separate value
• discretize numeric attributes by choosing breakpoints for threshold tests

Too many breakpoints cause overfitting, however, so a parameter specifies the minimum number of examples that must lie between two thresholds. On a numeric version of the weather data this gives, e.g.:

humidity: <  82.5 -> yes
          <  95.5 -> no
          >= 95.5 -> yes
(11/14 instances correct)

ZeroR

What is this? Simply the 1R method, but testing zero attributes instead of one.

What does it do? It predicts the majority class in the training set (the mean, for numeric prediction).

What is the point? It is a baseline for comparing classifier performance.

Stop and think about it ... it is a most-general classifier, having no constraints on attributes. Usually it will be too general (e.g. "always play"). So we could try 1R, which is less general (more specific) ...

What does this process of moving from ZeroR to 1R resemble?
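A minimal Python sketch of ZeroR and 1R on the weather data (illustrative, not WEKA's ZeroR/OneR; missing and numeric attributes are not handled):

    from collections import Counter

    attrs = ["outlook", "temperature", "humidity", "windy"]
    weather = [  # (outlook, temperature, humidity, windy) -> play
        (("sunny", "hot", "high", "false"), "no"),
        (("sunny", "hot", "high", "true"), "no"),
        (("overcast", "hot", "high", "false"), "yes"),
        (("rainy", "mild", "high", "false"), "yes"),
        (("rainy", "cool", "normal", "false"), "yes"),
        (("rainy", "cool", "normal", "true"), "no"),
        (("overcast", "cool", "normal", "true"), "yes"),
        (("sunny", "mild", "high", "false"), "no"),
        (("sunny", "cool", "normal", "false"), "yes"),
        (("rainy", "mild", "normal", "false"), "yes"),
        (("sunny", "mild", "normal", "true"), "yes"),
        (("overcast", "mild", "high", "true"), "yes"),
        (("overcast", "hot", "normal", "false"), "yes"),
        (("rainy", "mild", "high", "true"), "no"),
    ]

    def zero_r(data):
        """ZeroR: always predict the majority class of the training set."""
        return Counter(label for _, label in data).most_common(1)[0][0]

    def one_r(data):
        """1R: one rule per value of a single attribute; keep the attribute
        whose rules make the fewest errors on the training data."""
        best = None
        for i, a in enumerate(attrs):
            by_value = {}  # class counts for each value of attribute a
            for inst, label in data:
                by_value.setdefault(inst[i], Counter())[label] += 1
            rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
            errors = sum(sum(c.values()) - max(c.values())
                         for c in by_value.values())
            if best is None or errors < best[0]:
                best = (errors, a, rules)
        return best

    errors, attr, rules = one_r(weather)
    print("ZeroR:", zero_r(weather))               # -> yes
    print(f"1R on {attr}, {14 - errors}/14 correct:", rules)
    # -> 1R on outlook, 10/14 correct:
    #    {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}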
Learning Disjunctive Sets of Rules

Method 1: learn a decision tree, then convert it to rules.
• can be slow for large and noisy datasets
• improvements exist, e.g. C5.0 and Weka's PART

Method 2: the sequential covering algorithm (sketched in code below):
1. Learn one rule with high accuracy, any coverage
2. Remove the positive examples covered by this rule
3. Repeat

Sequential Covering Algorithm

Sequential-covering(Target_attribute, Attributes, Examples, Threshold)
• Learned_rules ← {}
• Rule ← Learn-One-Rule(Target_attribute, Attributes, Examples)
• while Performance(Rule, Examples) > Threshold, do
  – Learned_rules ← Learned_rules + Rule
  – Examples ← Examples − {examples correctly classified by Rule}
  – Rule ← Learn-One-Rule(Target_attribute, Attributes, Examples)
• Learned_rules ← sort Learned_rules according to performance over Examples
• return Learned_rules
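Here is a runnable Python sketch of sequential covering with a greedy, general-to-specific Learn-One-Rule (a simplified stand-in for the algorithm detailed on the next slides; rule accuracy is used for the otherwise unspecified Performance function, and all names are my own):

    def covers(ante, inst):
        """An antecedent is a dict of attribute -> required value."""
        return all(inst[a] == v for a, v in ante.items())

    def accuracy(ante, examples, target):
        """Fraction of covered examples whose label is the target class."""
        covered = [label for inst, label in examples if covers(ante, inst)]
        return covered.count(target) / len(covered) if covered else 0.0

    def learn_one_rule(examples, attributes, target):
        """Greedily add the attribute test that most improves accuracy."""
        ante, acc = {}, accuracy({}, examples, target)
        while acc < 1.0:
            pool = sorted({(a, inst[a]) for inst, _ in examples
                           for a in attributes if a not in ante})
            scored = [(accuracy({**ante, a: v}, examples, target), a, v)
                      for a, v in pool]
            if not scored or max(scored)[0] <= acc:
                break  # no specialization improves the rule
            acc, a, v = max(scored)
            ante[a] = v
        return ante, acc

    def sequential_covering(examples, attributes, target, threshold=0.5):
        """Learn rules one at a time, removing covered positive examples."""
        rules = []
        while True:
            ante, acc = learn_one_rule(examples, attributes, target)
            if acc <= threshold:
                return rules
            rules.append((ante, target))
            examples = [(i, l) for i, l in examples
                        if not (covers(ante, i) and l == target)]

    data = [({"outlook": "sunny", "humidity": "high"}, "no"),
            ({"outlook": "sunny", "humidity": "normal"}, "yes"),
            ({"outlook": "overcast", "humidity": "high"}, "yes"),
            ({"outlook": "rainy", "humidity": "normal"}, "yes"),
            ({"outlook": "rainy", "humidity": "high"}, "no")]
    print(sequential_covering(data, ["outlook", "humidity"], "yes"))
    # -> [({'outlook': 'overcast'}, 'yes'), ({'humidity': 'normal'}, 'yes')]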
Learn One Rule

Learn-One-Rule performs a general-to-specific search, starting from the most general antecedent (empty, so the rule covers everything) and adding one attribute test at a time, e.g. for PlayTennis:

IF  THEN PlayTennis=yes
    IF Wind=weak THEN PlayTennis=yes
    IF Wind=strong THEN PlayTennis=no
    IF Humidity=normal THEN PlayTennis=yes
    IF Humidity=high THEN PlayTennis=no
    ...
        IF Humidity=normal AND Wind=weak THEN PlayTennis=yes
        IF Humidity=normal AND Wind=strong THEN PlayTennis=yes
        IF Humidity=normal AND Outlook=sunny THEN PlayTennis=yes
        IF Humidity=normal AND Outlook=rain THEN PlayTennis=yes

Algorithm "Learn One Rule"

Learn-One-Rule(Target_attribute, Attributes, Examples)
    // Returns a single rule which covers some of the
    // positive examples and none of the negatives.
    Pos := positive Examples
    Neg := negative Examples
    BestRule := the empty rule
    if Pos is non-empty do
        NewAnte := most general rule antecedent possible
        NewRuleNeg := Neg
        while NewRuleNeg is non-empty do
            for ClassVal in Target_attribute values do
                NewCons := (Target_attribute = ClassVal)
                // Add a new literal to specialize NewAnte, i.e. possible
                // constraints of the form att = val for att ∈ Attributes
                Candidate_literals ← generate candidates
                Best_literal ← argmax over L ∈ Candidate_literals of
                    Performance(SpecializeAnte(NewAnte, L) → NewCons)
                add Best_literal to NewAnte
                NewRule := NewAnte → NewCons
                if Performance(NewRule) > Performance(BestRule)
                    then BestRule := NewRule
                endif
                NewRuleNeg := subset of NewRuleNeg that satisfies NewAnte
            endfor
        endwhile
    endif
    return BestRule

Notes:
• this is called a covering approach because at each stage a rule is identified that covers some of the instances
• the evaluation function Performance(Rule) is unspecified
• a simple measure would be the number of negatives not covered by the antecedent, i.e. Neg − NewRuleNeg
• the consequent could then be the most frequent value of the target attribute among the examples covered by the antecedent
• this is sure not to be the best measure of performance!

Subtleties: Learn One Rule

1. May use beam search.
2. Easily generalizes to multi-valued target functions.
3. Choose an evaluation function to guide the search, e.g.:
   • entropy (i.e., information gain)
   • sample accuracy: n_c / n, where n_c = correct rule predictions and n = all predictions
   • m-estimate: (n_c + m·p) / (n + m), where p is the prior probability of the predicted class and m weights the prior

Example: generating a rule

[Figure: a 2D scatter of examples from classes a and b over attributes x and y; successive specializations shrink the region covered by the rule around the a's.]

If true then class = a
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a

Aspects of Sequential Covering Algorithms

• Sequential covering learns rules singly; decision tree induction learns all disjuncts simultaneously.
• Sequential covering chooses between all attribute-value pairs at each specialisation step (i.e. between subsets of the examples covered); decision tree induction only chooses between attributes (i.e. between partitions of the examples w.r.t. the added attribute).
• Assuming the final rule set contains on average n rules with k conditions each, sequential covering makes n × k primitive selection decisions, whereas choosing an attribute at an internal node of a decision tree equates to choosing attribute-value pairs for the conditions of all corresponding rules.
• If data is plentiful, the greater flexibility of choosing attribute-value pairs might be desired, and might lead to better performance.
• If a general-to-specific search is chosen, it starts from a single node; if a specific-to-general search is chosen, then for a set of examples we need to determine what the starting nodes are.
• Depending on the number of conditions expected for rules relative to the number of conditions in the examples, most general rules may be closer to the target than most specific rules.
• General-to-specific sequential covering is a generate-and-test approach: all syntactically permitted specialisations are generated and tested against the data. Specific-to-general search is typically example-driven, constraining the hypotheses generated.
• Variations on performance evaluation are often implemented: entropy, m-estimate, relative frequency, significance tests (e.g. likelihood ratio).

Rules with exceptions

Idea: allow rules to have exceptions.

Example rule for the iris data:

If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor

A new instance arrives:

sepal-length  sepal-width  petal-length  petal-width  type
5.1           3.5          2.6           0.2          Iris-setosa

It is wrongly covered by the rule, so we modify the rule:

If petal-length ≥ 2.45 and petal-length < 4.45 then Iris-versicolor
EXCEPT if petal-width < 1.0 then Iris-setosa

Exceptions to exceptions to exceptions ...

default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355 and petal-width < 1.75
       then Iris-versicolor
       except if petal-length ≥ 4.95 and petal-width < 1.55
              then Iris-virginica
       else if sepal-length < 4.95 and sepal-width ≥ 2.45
              then Iris-virginica
else if petal-length ≥ 3.35
     then Iris-virginica
     except if petal-length < 4.85 and sepal-length < 5.95
            then Iris-versicolor

Advantages of using exceptions

• Rules can be updated incrementally:
  – easy to incorporate new data
  – easy to incorporate domain knowledge
• People often think in terms of exceptions.
• Each conclusion can be considered just in the context of the rules and exceptions that lead to it:
  – this locality property is important for understanding large rule sets
  – "normal" rule sets don't offer this advantage

Note that "Default ... except if ... then ..." is logically equivalent to "if ... then ... else ...", where the else specifies the default. BUT exceptions offer a psychological advantage:

• the assumption is that defaults and tests early on apply more widely than exceptions further down
• exceptions reflect special cases

Induct-RDR

Gaines & Compton (1995): learns "Ripple-Down Rules" from examples.

INDUCT's significance measure for a rule is the probability that a completely random rule with the same coverage performs at least as well. That is, a random rule R selects t cases at random from the data set: how likely is it that at least p of these belong to the correct class?
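This tail probability is given by the hypergeometric distribution, as noted on the next slide. A small sketch using scipy (the variable names and the worked example are mine; as the slides say, INDUCT actually approximates this quantity rather than computing it exactly):

    from scipy.stats import hypergeom

    def rule_significance(N, C, t, p):
        """Probability that a random rule selecting t of N cases gets at
        least p of them right, when C of the N cases belong to the rule's
        class. Smaller values mean the rule's accuracy is less likely to
        have arisen by chance."""
        return hypergeom(N, C, t).sf(p - 1)  # P(X >= p)

    # Weather data: 9 "yes" among 14 cases; a rule covering t = 4 cases,
    # all 4 of them "yes"
    print(rule_significance(N=14, C=9, t=4, p=4))  # ~0.126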
• The probability is given by the hypergeometric distribution; INDUCT approximates it by the incomplete beta function.
• The method works well if the target function suits the rules-with-exceptions bias.

Issues for Rule Learning Programs

• Sequential or simultaneous covering of data?
• General → specific, or specific → general?
• Generate-and-test, or example-driven?
• Whether and how to post-prune?
• What statistical evaluation function?

Summary of Rule Learning

• A major class of representations (AI, business rules, ...)
• Rule interpretation may need care
• Many common learning issues: search, evaluation, overfitting, etc.
• Can be related to numeric prediction by threshold functions
• Lifted to first-order representations in Inductive Logic Programming