Insurance Claims Settlement: find similar claims in the past
Real estate: Property price appraisal based on previous sales
K-Nearest Neighbor can be used for classification tasks.
Step 1: Using a chosen distance metric, compute the distance
between the n
The goal: ideal DM environment
All data mining algorithms want their input in tabular form rows &
columns as in a spreadsheet or database table
Contain data that describe aspects of the customer (e.g., sales
$ and quantity for each of product
GAs can be used to try different combinations of parameters for building
models, to see which build parameters are best.
Assume we arrived at ten formulae, using a variety of neural nets run
on different sample data and with different r
often work well for classes that are hard to separate using linear
methods or the axis-parallel splits used by decision trees.
K-Nearest neighbor works well even when there is some missing data
K-Nearest neighbor is good at specifying which predictions h
We would maintain a population of solutions, with each solutions
having a score. Solutions with a higher fitness score are given a
higher probability of mating.
The Traveling Sales Person is a well-known route optimization
problem with applications to tr
Based on the concept of similarity
The distance metric can be, for instance, simple Euclidian distance or
weighted Euclidian distance. The Euclidian distance d(X,Y) between
two points, X = (x1, x2, ., xn) and Y = (y1, y2, ., yn) is:
Regular database quer
Principal Component Analysis -1
Open the dataset Open the file Utilities.xls, which gives data on 22
public utilities in the US.
Principal Component Analysis -2
In XLMiner, select Data Reduction and Exploration -> Principal
In the Principal Co
CF is a variant of MBR particularly well suited to personalized
Starts with a history of peoples personal preferences
Uses a distance function people who like the same things are
Uses votes which are weighted by distances, so close
Survival Analysis aka Time-to-Event Analysis is very valuable for
Survival tells us when to start worrying about customers
Most important facet of customer behavior is the customers tenure
The analysis of time to event dat
Based on an analogy to biological processes similar to neural
networks and memory-based reasoning techniques
Goal is to maximize the fitness
Are being utilized for
Often used in tandem with other DM
After examining 20 cases of ownership, how many owners have been
correctly identified, based on the below lift chart?
Based on the table below, what is the probablity of classifying a small firm with
For the given neural network, if the cutoff value was 0.56, the output of this
network would be classified as 1?
What is the best value of k to select for the output below?
For categorical values, we need to convert them to numeric values.
We might treat being in class A as 1, and not in class A as 0.
Therefore, two items in the same class have distance 0 for that
attribute, and two items in different classes have distance
Principal Component Analysis
PCA is used to reduce the dimensionality of data.
Normalized data points are mapped onto a multi-dimensional graph.
Dimensions with the highest variances account for the greatest
effects on the data set.
Dimensions with the l
Fitness functions are used to decide how good a solution is. The
fitness function used depends on the application: e.g. for model
parameter selection we could use the accuracy of profitability of the
resultant model as our fitness function.
Steps in GAs:
MBR and K Nearest Neighbor are lazy approaches: they do not
construct a model in advance, but rather wait till they have to classify
a new example.
Contrast this with decision trees which construct meaningful
symbolic descriptions of classes from the tra
Memory-Based Reasoning (MBR) results are based on analogous
situations in the past
Collaborative Filtering results use preferences in addition to
analogous situations from the past
Decide how to represent cases: attribute selecti
Adaptation techniques include:
Predefined formulas or processes for adapting the solution
Interpolation: Assuming, you had two old bridges
Bridge 1 = 2 lanes, and cost $1 million
Bridge 2 = 4 lanes, and cost $2 million
Now, assuming you wanted a new 3 la
We can look up similar cases using a distance metric.
5 bins are made depending on the count of records and the values of
x3 are kept in them.
For all the values of x3 lying in one bin, the output variable,
Binned_x3 gets a value. (This value is the same
How many hidden layers are shown in the neural network below?
Recommender systems, like the system Amazon.com uses, are typically
based on which data mining technique