Introduction to Classification
Data Mining
Prof. Dawn Woodard
School of ORIE, Cornell University

Outline
1 Announcements
2 Big-O Notation
3 Heart Disease Detection

Announcements
Don't forget to sign up for Blackboard and get a departmental account.

Big-O Notation
The order of an algorithm, e.g. O(N^2 P), controls the efficiency as N and P get very large. This is because asymptotically the highest-order term dominates.
If algorithm A has order N^2 P and algorithm B has order N^3 P, then as N grows algorithm B becomes much slower than algorithm A.
For N large enough, B is arbitrarily slower than A. For instance, there is some N* such that for all N > N*, B is at least 10 times slower than A. There is some N** such that for all N > N**, B is at least 100 times slower than A, etc.
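The thresholds N* and N** can be made concrete with a small sketch. The cost functions below are assumptions chosen for illustration only: algorithm A is given a large constant factor (50) so that B is actually cheaper for small N, and the names cost_a, cost_b, and first_n_at_least_k_times_slower are hypothetical.

```python
def cost_a(n, p):
    # Hypothetical operation count for algorithm A: order N^2 P,
    # with an assumed constant factor of 50.
    return 50 * n**2 * p


def cost_b(n, p):
    # Hypothetical operation count for algorithm B: order N^3 P,
    # with an assumed constant factor of 1.
    return n**3 * p


def first_n_at_least_k_times_slower(k, p=10, n_max=10**6):
    """Return the smallest N at which B costs at least k times as much as A.

    The ratio cost_b / cost_a grows with N, so once the condition holds
    it keeps holding for every larger N.
    """
    for n in range(1, n_max + 1):
        if cost_b(n, p) >= k * cost_a(n, p):
            return n
    return None  # threshold not reached below n_max


n_star = first_n_at_least_k_times_slower(10)        # 10 times slower
n_star_star = first_n_at_least_k_times_slower(100)  # 100 times slower
print(n_star, n_star_star)
```

With these assumed constants the cost ratio is N / 50, so B is the faster algorithm for small N and falls behind only once N is large. The order describes the asymptotic behavior, not the behavior at every N.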

Heart Disease Detection
Can patients be effectively screened for the presence of heart disease (CAD) without the use of angiography?
Angiography is an invasive and expensive procedure where a tube is inserted into the artery of concern.
Rather than using angiography on all patients to detect CAD, it is better to use it on high-risk patients.

The authors use data from the Cleveland Clinic. The data has non-invasive clinical test results as well as angiography results (CAD / no CAD) for 303 patients.
They learn a classification rule for predicting CAD based on the non-invasive test results. This classification rule uses a model-based approach (logistic regression).
They check the predictive accuracy of their classification rule on data from patients in Hungary (74%) and California (77%).
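The workflow above (fit a logistic regression classification rule on one set of patients, then check its accuracy on other patients) can be sketched as follows. This is a minimal illustration and not the authors' model: the data are synthetic, the two "test scores" are invented, the fit uses plain gradient descent, and every function name is hypothetical. Only the training-set size of 303 is borrowed from the slides.

```python
import math
import random


def sigmoid(z):
    # Clip z so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def make_patients(n, rng):
    """Synthetic 'patients': two made-up test scores and a 0/1 CAD label."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = 1.0 if y == 1 else -1.0  # assumed class means
        x = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        data.append((x, y))
    return data


def predict_prob(w, b, x):
    # Logistic regression: P(CAD | x) = sigmoid(w . x + b)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def fit_logistic(data, steps=500, lr=0.1):
    """Fit weights by gradient descent on the average log-loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for x, y in data:
            err = predict_prob(w, b, x) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[j] - lr * gw[j] / n for j in range(2)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, data):
    """Fraction of patients whose label matches the 0.5-threshold rule."""
    correct = sum((predict_prob(w, b, x) >= 0.5) == (y == 1) for x, y in data)
    return correct / len(data)


rng = random.Random(0)
train = make_patients(303, rng)  # same size as the training set, synthetic values
held_out = make_patients(200, rng)
w, b = fit_logistic(train)
test_accuracy = accuracy(w, b, held_out)
print("weights:", w, "intercept:", b)
print("held-out accuracy:", test_accuracy)
```

Here the held-out patients come from the same synthetic distribution as the training patients. In the study the check is harder, because the validation patients come from different hospitals than the training patients.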
