# Assignment 3 - Data Cleaning and Data Mining

Assignment: Data Cleaning and Data Mining

Group Member Names:

## Assignment Prerequisite

This assignment requires the installation of the Excel Analytics Solver plugin.

## Data Mining Introduction

The overall goal of data mining is to discover patterns in a historical set of data. Once patterns are discovered, businesses can create predictions and classifications that increase the efficiency and effectiveness of their business processes. For example, Amazon has mined shoppers’ browsing and purchasing history to predict future purchases and generate the “Items to Consider” section for adaptable low-cost marketing.

This course focuses on two fundamental aspects of data mining:

Prediction: A pattern in a historical set of data with a non-binary outcome (i.e. not a yes/no outcome). For example, from historical video watching data (genre, tags, content creator, director, actor, etc.), Netflix predicts a percentage match for the shows we may want to watch next.
Classification: A pattern in a historical set of data with a binary (yes/no) outcome. For example, from historical mortgage approval data (income, assets, debts, location, etc.), banks are able to offer online mortgage pre-approvals based on a user’s input (yes/no: approved/not approved). The outcome is usually determined by comparing a predicted percentage against a threshold: predictions above 50% are classified as yes, and those below 50% as no. The threshold can be adjusted and should be based on the impact of an incorrect classification. For example, classifying someone as low risk for defaulting on their loan when they are actually high risk is worse than the other way around.

Data mining is an exploratory exercise. Data scientists, who generally specialize in mathematics and statistics, are given large sets of data to mine and experiment with different data models. It may take multiple attempts to find a valuable correlation in subsets of data, which may or may not benefit the business.

In this assignment we will focus on learning how to use the mining tool, Analytics Solver, and how to understand the basics of mining reports. The modeling and experimentation aspect of data mining will be a part of the final project.

The data mining process has four steps:

1. Cleaning the data
2. Creating a data partition
3. Running a prediction / classification mining tool against the cleaned data
4. Determining whether there is correlation between the parameters and the output using error analysis. In other words, was a pattern found between the inputs and the output?

## Cleaning Data Example

Before feeding historical data into a data mining tool such as Analytics Solver, the data should be cleansed, since data mining involves numerical statistics. In general, any words in a dataset are converted into numbers. For example, a column of answers containing “Yes” and “No” is converted into 1 and 0. Another example is a column containing a list of movie genres: Action, Romance, and Thriller are converted to 1, 2, and 3.
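
The word-to-number conversion described above can be sketched in a few lines of Python. This is only an illustration: the assignment itself performs this step in Excel, and the helper names (`encode_yes_no`, `encode_categories`) and the sample columns are hypothetical, not part of Analytics Solver.

```python
# Illustrative sketch only: helper names and sample data are assumptions,
# not part of the assignment or of Analytics Solver.

YES_NO_CODES = {"Yes": 1, "No": 0}


def encode_yes_no(values):
    """Convert a column of "Yes"/"No" answers into 1/0."""
    return [YES_NO_CODES[value] for value in values]


def encode_categories(values, categories):
    """Convert each word into its 1-based position in `categories`."""
    codes = {name: number for number, name in enumerate(categories, start=1)}
    return [codes[value] for value in values]


# Hypothetical sample columns.
answers = ["Yes", "No", "No", "Yes"]
genres = ["Thriller", "Action", "Romance", "Action"]

encoded_answers = encode_yes_no(answers)
encoded_genres = encode_categories(genres, ["Action", "Romance", "Thriller"])

print(encoded_answers)  # [1, 0, 0, 1]
print(encoded_genres)   # [3, 1, 2, 1]
```

Because both helpers look values up in a dictionary, an unexpected entry such as a misspelled genre raises a `KeyError` instead of being silently mapped to a number, which helps surface dirty data during cleaning.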