CSc 401 – Data Mining
Name:______________________________
Exam #2
November 4, 2002
Score:_________________/100

Directions: Carefully answer each of the following questions. This is an open-book, open-note exam. You may use calculators. You are NOT to get help from others. Points will be assigned on answer quality as well as answer correctness. CLEARLY show all work.

1. Consider the following data:

#  A  B  C    CLASS
1  b  u  202  P
2  a  u  43   N
3  a  u  280  N
4  b  u  100  P
5  b  u  120  P
6  b  u  360  P
7  b  u  164  N
8  b  y  180  N

Calculate the gain(C) when attribute C is considered to have two sets of values; those less than or equal to (164 + 180)/2 and those > (164 + 180)/2. CLASS is the classification attribute. Clearly show work. [10 pts.]

This was on previous exam. Skipping for now.

$$\text{info}(T) = -\sum_{j=1}^{k} \frac{\text{freq}(C_j, T)}{|T|} \cdot \log_2\!\left(\frac{\text{freq}(C_j, T)}{|T|}\right)$$

$$\text{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \cdot \text{info}(T_i)$$
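As a check on problem 1, the two info formulas can be applied to the eight records directly. The sketch below is not part of the exam. It assumes the usual C4.5 definition gain(X) = info(T) - info_X(T), which the exam does not state explicitly, and the names `records`, `info`, and `gain_for_threshold` are hypothetical helpers chosen for illustration.

```python
from collections import Counter
from math import log2

# (C, CLASS) pairs copied from the data table in problem 1, records 1 to 8.
records = [(202, "P"), (43, "N"), (280, "N"), (100, "P"),
           (120, "P"), (360, "P"), (164, "N"), (180, "N")]


def info(labels):
    """info(T): entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def gain_for_threshold(rows, threshold):
    """Assumed definition: gain = info(T) - info_X(T) for a binary split on C."""
    labels = [cls for _, cls in rows]
    low = [cls for c, cls in rows if c <= threshold]   # C <= threshold
    high = [cls for c, cls in rows if c > threshold]   # C > threshold
    # info_X(T): size-weighted average of the subset entropies.
    info_x = sum(len(part) / len(rows) * info(part)
                 for part in (low, high) if part)
    return info(labels) - info_x


threshold = (164 + 180) / 2          # 172.0, the split point named in the problem
gain_c = gain_for_threshold(records, threshold)
print(threshold, gain_c)
```

Under that assumption, each side of the split holds two P and two N records, the same 50/50 mix as the full data set, so the split yields no information gain.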

2. Using the data shown in problem #1, develop datasets which can be used to do 3-fold cross-validation. [Careful] Clearly explain. [10 pts]

Split the data into three non-overlapping subsets (folds) that together cover the full data set. Assigning records to folds at random is a good way to do this. Because there are 8 records, the folds cannot all be the same size; sizes of 3, 3, and 2 work. The model is then trained in 3 passes. In each pass, two of the folds are used for training and the remaining fold is used for testing, so every fold serves as the test set exactly once.
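The splitting procedure described in the answer can be sketched in a few lines. This is an illustration only, not the exam's required method: the helper name `make_folds`, the fixed seed, and the round-robin assignment after shuffling are all assumptions made for the example.

```python
import random

# Record numbers from the "#" column of the data in problem 1.
record_ids = [1, 2, 3, 4, 5, 6, 7, 8]


def make_folds(ids, k, seed=0):
    """Shuffle the ids, then deal them round-robin into k disjoint folds."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    return [shuffled[i::k] for i in range(k)]


folds = make_folds(record_ids, k=3)

# One pass per fold: that fold is the test set, the other two are training data.
passes = []
for i, test_fold in enumerate(folds):
    train = [r for j, fold in enumerate(folds) if j != i for r in fold]
    passes.append((sorted(train), sorted(test_fold)))
    print(f"pass {i + 1}: train={sorted(train)} test={sorted(test_fold)}")
```

With 8 records and 3 folds, the round-robin deal gives fold sizes of 3, 3, and 2, and each record appears in a test set exactly once.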
