# Tutorial02 - SEEM4630 2016-2017 Tutorial 1 Classification...


SEEM4630 2016-2017, Tutorial 1: Classification. Yingfan Liu
Classification: Definition

Given a collection of records (the training set), each record contains a set of attributes, one of which is the class. The task is to find a model for the class attribute as a function of the values of the other attributes. Examples of such models:

- Decision tree
- Naïve Bayes
- k-NN

Goal: previously unseen records should be assigned a class as accurately as possible.
Decision Tree

Goal: construct a tree so that instances belonging to different classes are separated.

Basic algorithm (a greedy algorithm):

- The tree is constructed in a top-down, recursive manner.
- At the start, all the training examples are at the root.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Examples are partitioned recursively based on the selected attributes.
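The greedy, top-down procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own implementation: it assumes categorical attributes, uses information gain (entropy reduction) as the selection measure, and stops when a node is pure or no attributes remain. The function names (`entropy`, `partition`, `info_gain`, `build_tree`) and the toy records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition(rows, labels, attr):
    """Group rows and labels by the value each row takes on attr."""
    parts = {}
    for row, label in zip(rows, labels):
        part_rows, part_labels = parts.setdefault(row[attr], ([], []))
        part_rows.append(row)
        part_labels.append(label)
    return parts


def info_gain(rows, labels, attr):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    after = sum(len(part_labels) / n * entropy(part_labels)
                for _, part_labels in partition(rows, labels, attr).values())
    return entropy(labels) - after


def build_tree(rows, labels, attrs):
    """Recursively build a tree as nested dicts; leaves are class labels."""
    # Assumed stopping rule: pure node or no attributes left -> majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    return {best: {value: build_tree(r, l, rest)
                   for value, (r, l) in partition(rows, labels, best).items()}}


# Hypothetical toy training set: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rain", "windy": True},
    {"outlook": "rain", "windy": False},
]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this toy data the root test falls on `outlook`, since splitting on it leaves both partitions pure, and the recursion stops immediately below it.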
Attribute Selection Measure 1: Information Gain

Let $p_i$ be the probability that a tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

Expected information (entropy) needed to classify a tuple in D:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$$

Information gained by branching on attribute A:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
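As a worked check of the three quantities above, the following sketch computes Info(D), Info_A(D), and Gain(A) directly from lists of attribute values and class labels. The helper names (`info`, `info_after_split`, `gain`) and the four-record data set are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_after_split(values, labels):
    """Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|."""
    n = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(part) / n * info(part) for part in partitions.values())


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(values, labels)


# Hypothetical toy data: two classes, two candidate attributes.
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]   # splits the classes perfectly
windy = [True, False, True, False]             # tells us nothing about the class

print(info(play))            # entropy of a 50/50 class distribution
print(gain(outlook, play))   # gain of the perfectly separating attribute
print(gain(windy, play))     # gain of the uninformative attribute
```

With two equally likely classes Info(D) is 1 bit. Splitting on `outlook` leaves pure partitions, so its gain is the full 1 bit, while `windy` leaves each partition at 50/50 and gains nothing.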
Attribute Selection Measure 2: Gain Ratio

The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$$

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$
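To see the bias and its correction in numbers, here is a small sketch that compares a many-valued attribute (a unique record ID) with a two-valued one. The helper names and the data are hypothetical, and returning 0.0 when SplitInfo is zero is a convention assumed here, not something the slides specify.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain(values, labels):
    """Information gain of splitting labels by attribute values."""
    n = len(labels)
    parts = {}
    for value, label in zip(values, labels):
        parts.setdefault(value, []).append(label)
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - info_a


def split_info(values):
    """SplitInfo_A(D): entropy of the partition sizes |D_j| / |D|."""
    return entropy(values)


def gain_ratio(values, labels):
    """Gain(A) / SplitInfo_A(D); 0.0 if the attribute has one value (assumed)."""
    si = split_info(values)
    return gain(values, labels) / si if si > 0 else 0.0


# Hypothetical toy data.
play = ["no", "no", "yes", "yes"]
record_id = [1, 2, 3, 4]                      # unique per record
outlook = ["sunny", "sunny", "rain", "rain"]  # two values

print(gain(record_id, play), gain_ratio(record_id, play))
print(gain(outlook, play), gain_ratio(outlook, play))
```

Both attributes reach the same information gain of 1 bit, but `record_id` pays a SplitInfo of 2 bits for its four partitions, so its gain ratio drops to 0.5 while `outlook` keeps a ratio of 1.0.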
If a data set D contains examples from n classes, gini index, gini(D)