Splitting criteria

In C4.5, given a node t, the splitting criterion is defined as follows. The information at the node is

$$\mathrm{Info}(t) = -\sum_{j=1}^{k} \frac{f(C_j, S)}{|S|} \log_2 \frac{f(C_j, S)}{|S|}$$

where $f(C_j, S)$ stands for the number of samples in S that belong to class $C_j$ (out of k possible classes) and $|S|$ denotes the total number of samples in the set S. The gain ratio is

$$\text{Gain ratio}(t) = \frac{\mathrm{Gain}(t)}{\text{Split info}(t)}$$

where

$$\mathrm{Info}_x(t) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{Info}(T_i)$$

$$\mathrm{Gain}(t) = \mathrm{Info}(t) - \mathrm{Info}_x(t)$$

$$\text{Split info}(t) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}$$

and where x refers to the splitting attribute currently being tested, and $T_1, \dots, T_n$ are the subsets into which the n outcomes of x partition the sample set T at the node.

Let us look at how these quantities are calculated for the Outlook data set shown in the table prepared by Quinlan. We choose the 'outlook' attribute as the splitting attribute to be tested ($x_1$).

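| Outlook  | Temp | Humidity | Windy | Class      |
|----------|------|----------|-------|------------|
| Overcast | 72   | 90       | TRUE  | Play       |
| Overcast | 83   | 78       | FALSE | Play       |
| Overcast | 64   | 65       | TRUE  | Play       |
| Overcast | 81   | 75       | FALSE | Play       |
| Rain     | 71   | 80       | TRUE  | Don't play |
| Rain     | 65   | 70       | TRUE  | Don't play |
| Rain     | 75   | 80       | FALSE | Play       |
| Rain     | 68   | 80       | FALSE | Play       |
| Rain     | 70   | 96       | FALSE | Play       |
| Sunny    | 75   | 70       | TRUE  | Play       |
| Sunny    | 80   | 90       | TRUE  | Don't play |
| Sunny    | 85   | 85       | FALSE | Don't play |
| Sunny    | 72   | 95       | FALSE | Don't play |
| Sunny    | 69   | 70       | FALSE | Play       |

The data set contains 9 'Play' and 5 'Don't play' samples, so

$$\mathrm{Info}(T) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940$$

Splitting on outlook gives three subsets: Overcast (4 Play, 0 Don't play), Rain (3 Play, 2 Don't play) and Sunny (2 Play, 3 Don't play). Taking $0 \log_2 0 = 0$,

$$\mathrm{Info}_{x_1}(T) = \tfrac{4}{14}\left(-\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4}\right) + \tfrac{5}{14}\left(-\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5}\right) + \tfrac{5}{14}\left(-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5}\right) = 0.694$$

$$\text{Split info}_{x_1}(T) = -\tfrac{4}{14}\log_2\tfrac{4}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 1.577$$

$$\mathrm{Gain}(x_1) = 0.940 - 0.694 = 0.246$$

$$\text{Gain ratio}(x_1) = \frac{0.246}{1.577} = 0.156$$

The split that provides the maximum information gain ratio is selected for the node.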
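The arithmetic for the outlook attribute can be checked with a short script. The following is a minimal Python sketch and not part of C4.5 itself: the names `DATA`, `info` and `gain_ratio` are illustrative assumptions, only the outlook and class columns of the data set are used, and classes with zero count are simply skipped, which corresponds to taking 0 log 0 = 0.

```python
from collections import Counter
from math import log2

# (outlook, class) pairs from the Outlook data set; other columns are omitted.
DATA = (
    [("Overcast", "Play")] * 4
    + [("Rain", "Don't play")] * 2 + [("Rain", "Play")] * 3
    + [("Sunny", "Play")] * 2 + [("Sunny", "Don't play")] * 3
)


def info(labels):
    """Info of a list of labels in bits; absent classes contribute nothing."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(rows):
    """Return (Info, Info_x, Gain, Split info, Gain ratio) for (value, class) rows."""
    total = len(rows)
    # Group class labels by attribute value: these are the subsets T_i.
    subsets = {}
    for value, cls in rows:
        subsets.setdefault(value, []).append(cls)

    info_t = info([cls for _, cls in rows])
    # Weighted average of the information in each subset.
    info_x = sum(len(s) / total * info(s) for s in subsets.values())
    gain = info_t - info_x
    # Split info is the information of the attribute values themselves.
    split_info = info([value for value, _ in rows])
    return info_t, info_x, gain, split_info, gain / split_info


info_t, info_x, gain, split_info, ratio = gain_ratio(DATA)
print(f"Info={info_t:.3f} Info_x={info_x:.3f} Gain={gain:.4f} "
      f"SplitInfo={split_info:.3f} GainRatio={ratio:.3f}")
```

At full precision the gain comes out at about 0.2467. The value 0.246 in the text comes from subtracting the already rounded figures 0.940 and 0.694. The gain ratio rounds to 0.156 either way.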
