Data Mining: Concepts and Techniques
Third Edition

Jiawei Han
University of Illinois at Urbana–Champaign

Micheline Kamber
Jian Pei
Simon Fraser University

Morgan Kaufmann is an imprint of Elsevier

Table of Contents

1. Introduction
  1.1. Why Data Mining?
    1.1.1. Moving toward the Information Age
    1.1.2. Data Mining as the Evolution of Information Technology
  1.2. What Is Data Mining?
  1.3. What Kinds of Data Can Be Mined?
    1.3.1. Database Data
    1.3.2. Data Warehouses
    1.3.3. Transactional Data
    1.3.4. Other Kinds of Data
  1.4. What Kinds of Patterns Can Be Mined?
    1.4.1. Class/Concept Description: Characterization and Discrimination
    1.4.2. Mining Frequent Patterns, Associations, and Correlations
    1.4.3. Classification and Regression for Predictive Analysis
    1.4.4. Cluster Analysis
    1.4.5. Outlier Analysis
    1.4.6. Are All Patterns Interesting?
  1.5. Which Technologies Are Used?
    1.5.1. Statistics
    1.5.2. Machine Learning
    1.5.3. Database Systems and Data Warehouses
    1.5.4. Information Retrieval
  1.6. Which Kinds of Applications Are Targeted?
    1.6.1. Business Intelligence
    1.6.2. Web Search Engines
  1.7. Major Issues in Data Mining
    1.7.1. Mining Methodology
    1.7.2. User Interaction
    1.7.3. Efficiency and Scalability
    1.7.4. Diversity of Database Types
    1.7.5. Data Mining and Society
  1.8. Summary

2. Getting to Know Your Data
  2.1. Data Objects and Attribute Types
    2.1.1. What Is an Attribute?
    2.1.2. Nominal Attributes
    2.1.3. Binary Attributes
    2.1.4. Ordinal Attributes
    2.1.5. Numeric Attributes
    2.1.6. Discrete versus Continuous Attributes
  2.2. Basic Statistical Descriptions of Data
    2.2.1. Measuring the Central Tendency: Mean, Median, and Mode
    2.2.2. Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
    2.2.3. Graphic Displays of Basic Statistical Descriptions of Data
  2.3. Data Visualization
    2.3.1. Pixel-Oriented Visualization Techniques
    2.3.2. Geometric Projection Visualization Techniques
    2.3.3. Icon-Based Visualization Techniques
    2.3.4. Hierarchical Visualization Techniques
    2.3.5. Visualizing Complex Data and Relations
  2.4. Measuring Data Similarity and Dissimilarity
    2.4.1. Data Matrix versus Dissimilarity Matrix
    2.4.2. Proximity Measures for Nominal Attributes
    2.4.3. Proximity Measures for Binary Attributes
      Table 2.4 Relational Table Where Patients Are Described by Binary Attributes
    2.4.4. Dissimilarity of Numeric Data: Minkowski Distance
    2.4.5. Proximity Measures for Ordinal Attributes
    2.4.6. Dissimilarity for Attributes of Mixed Types
    2.4.7. Cosine Similarity
  2.5. Summary

3. Data Preprocessing
  3.1. Data Preprocessing: An Overview
    3.1.1. Data Quality: Why Preprocess the Data?
    3.1.2. Major Tasks in Data Preprocessing
  3.2. Data Cleaning
    3.2.1. Missing Values
    3.2.2. Noisy Data
    3.2.3. Data Cleaning as a Process
  3.3. Data Integration
    3.3.1. Entity Identification Problem
    3.3.2. Redundancy and Correlation Analysis
    3.3.3. Tuple Duplication
    3.3.4. Data Value Conflict Detection and Resolution
  3.4. Data Reduction
    3.4.1. Overview of Data Reduction Strategies
    3.4.2. Wavelet Transforms
    3.4.3. Principal Components Analysis
    3.4.4. Attribute Subset Selection
    3.4.5. Regression and Log-Linear Models: Parametric Data Reduction
    3.4.6. Histograms
    3.4.7. Clustering
  3.5. Data Transformation and Data Discretization
    3.5.1. Data Transformation Strategies Overview
    3.5.2. Data Transformation by Normalization
    3.5.3. Discretization by Binning
    3.5.4. Discretization by Histogram Analysis
    3.5.5. Discretization by Cluster, Decision Tree, and Correlation Analyses
    3.5.6. Concept Hierarchy Generation for Nominal Data
  3.6. Summary

4. Data Warehousing and Online Analytical Processing
  4.1. Data Warehouse: Basic Concepts
    4.1.1. What Is a Data Warehouse?
    4.1.2. Differences between Operational Database Systems and Data Warehouses
    4.1.3. But, Why Have a Separate Data Warehouse?
    4.1.4. Data Warehousing: A Multitiered Architecture
    4.1.5. Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
    4.1.7. Metadata Repository
  4.2. Data Warehouse Modeling: Data Cube and OLAP
    4.2.1. Data Cube: A Multidimensional Data Model
    4.2.2. Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
    4.2.3. Dimensions: The Role of Concept Hierarchies
    4.2.4. Measures: Their Categorization and Computation
    4.2.5. Typical OLAP Operations
    4.2.6. A Starnet Query Model for Querying Multidimensional Databases
  4.3. Data Warehouse Design and Usage
    4.3.1. A Business Analysis Framework for Data Warehouse Design
    4.3.2. Data Warehouse Design Process
    4.3.3. Data Warehouse Usage for Information Processing
  4.4. Data Warehouse Implementation
    4.4.1. Efficient Data Cube Computation: An Overview
    4.4.2. Indexing OLAP Data: Bitmap Index and Join Index
    4.4.3. Efficient Processing of OLAP Queries
    4.4.4. OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP
  4.5. Data Generalization by Attribute-Oriented Induction
    4.5.1. Attribute-Oriented Induction for Data Characterization
    4.5.2. Efficient Implementation of Attribute-Oriented Induction
    4.5.3. Attribute-Oriented Induction for Class Comparisons
  4.6. Summary

5. Data Cube Technology
  5.1. Data Cube Computation: Preliminary Concepts
    5.1.1. Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell
    5.1.2. General Strategies for Data Cube Computation
  5.2. Data Cube Computation Methods
    5.2.1. Multiway Array Aggregation for Full Cube Computation
    5.2.2. BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
    5.2.3. Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure
    5.2.4. Precomputing Shell Fragments for Fast High-Dimensional OLAP
  5.3. Processing Advanced Kinds of Queries by Exploring Cube Technology
    5.3.1. Sampling Cubes: OLAP-Based Mining on Sampling Data
    5.3.2. Ranking Cubes: Efficient Computation of Top-k Queries
  5.4. Multidimensional Data Analysis in Cube Space
    5.4.1. Prediction Cubes: Prediction Mining in Cube Space
    5.4.2. Multifeature Cubes: Complex Aggregation at Multiple Granularities
    5.4.3. Exception-Based, Discovery-Driven Cube Space Exploration
  5.5. Summary

6. Mining Frequent Patterns, Associations, and Correlations
  6.1. Basic Concepts
    6.1.1. Market Basket Analysis: A Motivating Example
    6.1.2. Frequent Itemsets, Closed Itemsets, and Association Rules
  6.2. Frequent Itemset Mining Methods
    6.2.1. Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
    6.2.2. Generating Association Rules from Frequent Itemsets
    6.2.3. Improving the Efficiency of Apriori
    6.2.4. A Pattern-Growth Approach for Mining Frequent Itemsets
    6.2.5. Mining F...