: A pattern in a historical set of data with a binary (yes/no) outcome.
For example, using historical mortgage approval data (income, assets, debts, location, etc.), banks can offer
online mortgage pre-approvals based on a user’s input (yes/no: approved/not approved).
This outcome is usually determined by comparing a predicted percentage with a threshold: predictions above 50%
are classified as yes, and those below 50% as no.
This threshold can be adjusted and should reflect the impact of each kind of incorrect classification.
For instance, classifying someone as low risk for defaulting on their loan when they are actually high risk is
worse than the other way around.
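The thresholding described above can be sketched in a few lines of Python. This is only an illustration: the function name classify, the sample probabilities, and the stricter 70% threshold are assumptions made for the example, not values from a real model or from Analytics Solver.

```python
def classify(probabilities, threshold=0.5):
    """Label a prediction 'yes' when its probability is above the threshold."""
    return ["yes" if p > threshold else "no" for p in probabilities]


# Hypothetical predicted probabilities of approval for five applicants.
approval_probs = [0.92, 0.65, 0.55, 0.48, 0.10]

# Default 50% threshold.
default_labels = classify(approval_probs)

# A stricter threshold (assumed here to be 70%) approves fewer applicants,
# which reduces the chance of approving someone who is actually high risk.
strict_labels = classify(approval_probs, threshold=0.7)

print(default_labels)
print(strict_labels)
```

Raising the threshold trades more incorrect "no" answers for fewer incorrect "yes" answers, which is the adjustment to make when a wrong approval costs more than a wrong rejection.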
Data mining is an exploratory exercise.
Data scientists, who generally specialize in mathematics and statistics, are given large sets of data to mine
and use them to experiment with different data models.
It may take multiple attempts to find a valuable correlation in subsets of the data, and that correlation may or
may not benefit the business.
In this assignment we will focus on learning how to use the mining tool, Analytics Solver, and how to understand
the basics of mining reports.
The modeling and experimentation aspect of data mining will be a part of the final
The data mining process has four steps:
1. Cleaning the data
2. Creating a data partition
3. Running a prediction/classification mining tool against the cleaned data
4. Determining whether there is a correlation between the parameters and the output using error analysis. In
other words, was a pattern found between the inputs and the output?
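To make the four steps concrete, here is a minimal Python sketch of the whole process on a tiny made-up dataset. Everything in it is an assumption for illustration: the records, the every-third-row partition, and the very simple cutoff classifier stand in for what a tool such as Analytics Solver does with its own methods and reports.

```python
# Hypothetical historical records: (income, debt, approved), amounts in thousands.
# None marks a missing value.
records = [
    (85, 10, "Yes"), (40, 30, "No"), (60, 5, "Yes"), (30, 25, "No"),
    (95, 20, "Yes"), (45, 40, "No"), (70, 15, "Yes"), (35, None, "No"),
    (80, 12, "Yes"), (38, 28, "No"),
]

# Step 1: clean the data (drop incomplete rows, convert Yes/No to 1/0).
cleaned = [
    (income, debt, 1 if outcome == "Yes" else 0)
    for income, debt, outcome in records
    if income is not None and debt is not None
]

# Step 2: partition the data. Every third row is held out for validation;
# a real tool would typically pick the rows at random.
valid = cleaned[::3]
train = [row for i, row in enumerate(cleaned) if i % 3 != 0]


# Step 3: fit a deliberately simple classifier on the training rows:
# a cutoff on (income - debt), halfway between the two class averages.
def score(row):
    return row[0] - row[1]


def mean(values):
    return sum(values) / len(values)


cutoff = (mean([score(r) for r in train if r[2] == 1])
          + mean([score(r) for r in train if r[2] == 0])) / 2


def predict(row):
    return 1 if score(row) > cutoff else 0


# Step 4: error analysis on the validation rows only.
false_positives = sum(1 for r in valid if predict(r) == 1 and r[2] == 0)
false_negatives = sum(1 for r in valid if predict(r) == 0 and r[2] == 1)
error_rate = (false_positives + false_negatives) / len(valid)

print(f"cutoff={cutoff:.1f}, error rate={error_rate:.0%}")
```

A low error rate on rows the classifier never saw during fitting suggests that a pattern between the inputs and the output was found. Counting false positives and false negatives separately matters because the two kinds of error rarely cost the same.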
Cleaning Data Example
Before feeding historical data into a data mining tool such as Analytics Solver, the data should be cleansed,
since data mining relies on numerical statistics.
In general, any words in a dataset are converted into numbers.
For example, a column of answers containing “Yes” and “No” is converted into 1 and 0.
Another example is a column containing a list of movie genres: Action, Romance, and Thriller are converted to
1, 2, and 3.
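The two conversions above can be sketched in Python as simple lookups. The column contents and variable names are made up for the example; only the mappings (Yes/No to 1/0, and Action, Romance, Thriller to 1, 2, 3) come from the text.

```python
# Hypothetical columns from a historical dataset.
answers = ["Yes", "No", "No", "Yes"]
genres = ["Action", "Thriller", "Romance", "Action"]

# Lookup tables that replace each word with its number.
yes_no_codes = {"Yes": 1, "No": 0}
genre_codes = {"Action": 1, "Romance": 2, "Thriller": 3}

encoded_answers = [yes_no_codes[a] for a in answers]
encoded_genres = [genre_codes[g] for g in genres]

print(encoded_answers)  # [1, 0, 0, 1]
print(encoded_genres)   # [1, 3, 2, 1]
```

Using a dictionary lookup means any unexpected word (a typo such as "yes " with a trailing space, for instance) raises an error instead of silently slipping through, which helps catch dirty data early. Note that the genre numbers are only labels; their order carries no meaning.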