**DATA MINING**

Susan Holmes © Stats202, Lecture 30, Fall 2010
**Special Announcements**

- Hw6 has been returned to everyone who put a valid SUNet ID in their homework file names.
- Midterm: today in class is the last day to pick up your midterm.
- We are now going to review the material covered up to last Wednesday.
- Everything we studied may appear on the final.
- Review the solutions to Homework 6, as some questions will be of exactly the same type.

**BUZZWORDS**
- Bayesian / Frequentist.
- Parametric / Nonparametric.
- Robust / Breakdown point.
- Supervised / Unsupervised.
- Exploratory / Confirmatory.
- Continuous / Categorical.
- Density / Distribution.
- Dependent / Independent.
- Eigenvalues / Singular values.
- Populations / Samples / Mixtures.

**What is Data? Data Types: Examples**
- Continuous.
- Discrete/Nominal → factors (few levels).
- Discrete/Ordinal (rankings).
- Nominal/Identifiers (how do they differ from factors?).
- Time series, maps.
- Graphs.
- How are they represented in R? (see the sketch below)
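A minimal sketch of how these types are commonly represented in R; the example values are made up for illustration:

```r
# Hypothetical values illustrating R's representations of the data types above.
x_cont <- c(2.1, 3.7, 5.0)                        # continuous -> numeric vector
x_nom  <- factor(c("red", "blue", "red"))         # discrete/nominal -> factor (few levels)
x_ord  <- factor(c("low", "high", "med"),
                 levels = c("low", "med", "high"),
                 ordered = TRUE)                   # discrete/ordinal -> ordered factor
x_id   <- as.character(101:103)                   # identifiers -> character (not a factor)
x_ts   <- ts(rnorm(24), frequency = 12)           # time series -> ts object
str(list(x_cont, x_nom, x_ord, x_id, x_ts))
```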
**Table of Supervised Methods**

| Technique | Vars. to explain (response) | Explanatory vars. |
|---|---|---|
| Multiple Regression | 1 continuous | p continuous |
| Discriminant Analysis | 1 categorical | p continuous |
| Analysis of Variance | 1 continuous | p categorical |
| Multidimensional Analysis of Variance (MANOVA) | p continuous | p categorical |
| Multidimensional Analysis of Covariance | p continuous | p1 categorical, p2 continuous |
| Regression Tree | 1 continuous | p1 continuous, p2 categorical |
| Classification Tree | 1 categorical | p1 continuous, p2 categorical |
| Correspondence Analysis | 1 categorical | 1 categorical |
| Ensemble Methods | 1 categorical | p1 continuous, p2 categorical |
| Support Vector Machines | 1 categorical | p1 continuous |
**Table of Unsupervised Methods**

- Principal Components.
- Multiple Correspondence Analysis.
- Multidimensional Scaling (PCoA).
- Double PCoA.
- Clustering (either hierarchical or not).
- Association Rules.

**Using Distances for:**

- Binary data (asymmetric and symmetric).
- Discrete/Nominal: distances / contingency tables / CA.
- Discrete/Ordinal (rankings): distances / PCA.
**Distances, Similarities, Dissimilarities**

Distances:

- Euclidean.
- Chi-square:
  $$\chi^2(\mathrm{exp}, \mathrm{obs}) = \sum_j \frac{(\mathrm{exp}_j - \mathrm{obs}_j)^2}{\mathrm{exp}_j}$$
- Hamming / L1.
- Mahalanobis.

Similarity indices:

- Confusion (cognitive psychology) is a similarity.
- The matching coefficient is a similarity:
  $$\frac{\text{nb of matching attrs}}{\text{nb of attrs}} = \frac{f_{11} + f_{00}}{f_{11} + f_{00} + f_{10} + f_{01}}$$
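A quick sketch of the matching coefficient for two binary attribute vectors; the vectors below are made-up examples:

```r
# Matching coefficient for two binary vectors x and y (toy data).
x <- c(1, 0, 1, 1, 0)
y <- c(1, 1, 1, 0, 0)
f11 <- sum(x == 1 & y == 1)   # attributes present in both
f00 <- sum(x == 0 & y == 0)   # attributes absent in both
(f11 + f00) / length(x)       # matching attrs / total attrs
```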
**MDS Algorithm**

In summary, given an n × n matrix of interpoint distances D, one can solve for points achieving these distances by:

1. Double centering the squared interpoint distance matrix: $B = -\frac{1}{2} H D^2 H$.
2. Diagonalizing B: $B = U \Lambda U^T$.
3. Extracting X: $\tilde{X} = \tilde{U} \tilde{\Lambda}^{1/2}$.
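A sketch of the three steps on a small distance matrix; the use of the built-in USArrests data is just an example, and cmdscale() performs the same computation in one call:

```r
# Classical MDS: double center, diagonalize, extract coordinates.
D   <- as.matrix(dist(USArrests))          # n x n interpoint distances
n   <- nrow(D)
H   <- diag(n) - matrix(1 / n, n, n)       # centering matrix
B   <- -0.5 * H %*% (D^2) %*% H            # step 1: B = -1/2 H D^2 H
eig <- eigen(B)                            # step 2: B = U Lambda U^T
X2  <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))  # step 3: X = U Lambda^(1/2)
head(X2)
# cmdscale(dist(USArrests), k = 2) carries out the same three steps in one call.
```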
**Decision Trees and Classification Examples**

- Two sets of data: training and test.
- The response Y is a nominal/categorical variable.
- Explanatory variables can be continuous AND nominal AND ordinal.
- Indices of purity: Gini, entropy (deviance) and misclassification.
- The output of rpart and tree is a PRUNED tree.
**Impurity**

1. Deviance: $D_t = -2 \sum_k n_{tk} \log p_{tk}$
2. Entropy: $D_t = -\sum_k p_{tk} \log_2 p_{tk}$
3. Gini index: $D_t = 1 - \sum_k p_{tk}^2$
4. Misclassification error: $D_t = 1 - p_{t\,k(t)}$, where k(t) is the category at node t with the largest number of observations.
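A sketch computing these node impurities from hypothetical class counts at a single node:

```r
# Impurity measures at one node, from made-up class counts n_tk.
n_tk <- c(40, 25, 10)
p_tk <- n_tk / sum(n_tk)
c(deviance = -2 * sum(n_tk * log(p_tk)),   # -2 * sum n_tk log p_tk
  entropy  = -sum(p_tk * log2(p_tk)),      # -sum p_tk log2 p_tk
  gini     = 1 - sum(p_tk^2),              # 1 - sum p_tk^2
  misclass = 1 - max(p_tk))                # 1 - p_{t,k(t)}
```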
**Cross-validation**

To determine when to stop pruning a large tree we use cross-validation.

For each best tree of a fixed size, or equivalently for each distinct value of cp, we carry out a k-fold cross-validation. To do this we randomly divide the data X into k subsets such that
$$X = X_1 \cup X_2 \cup X_3 \cup \dots \cup X_k.$$
We leave out each of the k subsets $X_j$ in turn and fit the tree to the data set consisting of the remaining k − 1 subsets combined. The data that were left out are then used to calculate the impurity of the tree. This process is repeated once for each subset, and the relative average cross-validation error (xerror) is obtained as
$$\text{xerror} = \frac{\tfrac{1}{k}\sum_{j=1}^{k} CC_j}{CC(\text{null tree})},$$
where $CC_j$ is the cost (impurity) computed on the j-th left-out subset.
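A sketch of reading off xerror with rpart; the cp grid and the kyphosis data are simply the package's standard example, not part of the lecture:

```r
library(rpart)
# Grow a large tree with 10-fold cross-validation, then inspect xerror per cp value.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0.001, xval = 10))
printcp(fit)    # columns: CP, nsplit, rel error, xerror, xstd
plotcp(fit)     # prune at (about) the cp minimizing xerror
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```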
**Alternative Classification Methods**

- Rule based.
- Instance-based methods and nearest neighbors.
- Discriminant analysis: for continuous explanatory variables only.
- SVM.
**Discriminant Functions for LDA**

A discriminant function is a linear combination of the discriminating variables:
$$L = b_1 x_1 + b_2 x_2 + \dots + b_p x_p + c,$$
where the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant.

The first discriminant function (or variable, or axis) is the linear combination of the original variables that maximizes
$$\frac{a' B a}{a' T a}.$$
This is equivalent to maximizing the quadratic form $a' B a$ under the constraint $a' T a = 1$, which is also equivalent to finding the eigenvectors of $B W^{-1}$.
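A sketch of computing discriminant coefficients with lda() from the MASS package; the iris data set is just a convenient example:

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
fit$scaling                        # discriminant coefficients b_1, ..., b_p per axis
pred <- predict(fit, iris)
table(truth = iris$Species, predicted = pred$class)
```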
**SVM**

Let X be a classification dataset with n points in a p-dimensional space, $X = \{(x_i, y_i)\}$, $i = 1, 2, \dots, n$, where each $y_i$ is either +1 or −1.

A hyperplane gives a linear discriminant function in p dimensions and splits the original space into two half-spaces:
$$h(x) = w^T x + b,$$
where w is a p-dimensional weight vector and b is a scalar bias. Points on the hyperplane have h(x) = 0, i.e. the hyperplane is defined by all points for which $w^T x = -b$.
**SVM**

Given a separating hyperplane h(x) = 0, the distance between each point $x_i$ and the hyperplane is
$$\delta_i = \frac{y_i\, h(x_i)}{\lVert w \rVert}.$$
The margin of the linear classifier is defined as the minimum distance of all n points to the separating hyperplane:
$$\delta^* = \min_{x_i} \frac{y_i\, h(x_i)}{\lVert w \rVert}.$$
All points (vectors $x_i^*$) that achieve this minimum distance are called the support vectors of the linear classifier. In other words, a support vector is a point that lies precisely on the margin of the classifying hyperplane.
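A sketch of a linear SVM with the e1071 package; the two-class iris subset and cost value are only illustrative choices:

```r
library(e1071)
dat <- droplevels(iris[iris$Species != "setosa", ])        # two-class problem
fit <- svm(Species ~ ., data = dat, kernel = "linear", cost = 1)
length(fit$index)                                          # number of support vectors
table(truth = dat$Species, predicted = predict(fit, dat))
```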
**Parameters in SVM**

- Kernel parameters.
- Cost parameter C, which controls the trade-off between allowing training errors and forcing rigid margins.
- How to choose them: tune.svm provides a grid search (see the sketch below). For more information, see: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classific
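A sketch of the grid search mentioned above, using tune.svm from e1071; the gamma/cost grid values and the iris data are arbitrary:

```r
library(e1071)
set.seed(1)
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 10^(-3:1), cost = 10^(-1:2))
summary(tuned)          # cross-validated error over the gamma x cost grid
tuned$best.parameters   # the chosen (gamma, cost) pair
```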
**Ensemble Methods**

Iterative/replicative methods that combine 'weak learners' in different ways:

- Bagging: bootstrap aggregation.
- Boosting: combining weak learners.
- Random forests.
**Bootstrapping**

$$P(x_1 \text{ is in the bootstrap resample}) = 1 - \left(1 - \frac{1}{n}\right)^n$$
$$\left(1 - \frac{1}{n}\right)^n = \exp\!\left(n \log\left(1 - \tfrac{1}{n}\right)\right) \approx \exp\!\left(n \cdot \left(-\tfrac{1}{n}\right)\right) = \exp(-1) = \frac{1}{e} \approx 0.36788$$

OOB = out of bag (not included in the bootstrap resample). The OOB prediction is determined by a majority vote over all trees whose training set did not contain that observation.
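A quick numeric check of the 1 − 1/e limit above; n = 100 is an arbitrary choice:

```r
n <- 100
1 - (1 - 1/n)^n   # P(x1 is in the bootstrap resample), close to 1 - exp(-1) = 0.632
exp(-1)           # the limiting value of (1 - 1/n)^n, about 0.36788
```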
**About Boosting Weights**

Given: $(x_1, y_1), \dots, (x_n, y_n)$, $x_i \in X$, $y_i \in Y = \{-1, +1\}$.

Initialize $D_1(i) = \frac{1}{n}$, $i = 1, \dots, n$.

For t = 1, ..., T:

- Find the classifier $h_t : X \to \{-1, +1\}$ that minimizes the error with respect to the distribution $D_t$:
  $$h_t = \arg\min_{h_j \in H} \epsilon_j, \quad \text{where } \epsilon_j = \sum_{i=1}^{n} D_t(i)\,[y_i \neq h_j(x_i)].$$
- If $\epsilon_t \geq 0.5$ then stop.
- Choose $\alpha_t \in \mathbb{R}$, typically $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$, where $\epsilon_t$ is the weighted error rate of classifier $h_t$.
- Update the weights:
  $$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t},$$
  where $Z_t$ is a normalization factor.

Output the final classifier:
$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right).$$

The exponent in the update of the distribution $D_t$ is constructed so that
$$-\alpha_t\, y_i\, h_t(x_i) \;\begin{cases} < 0 & \text{if } y_i = h_t(x_i) \\ > 0 & \text{if } y_i \neq h_t(x_i), \end{cases}$$
so correctly classified points are down-weighted and misclassified points are up-weighted.
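A minimal numeric sketch of one boosting round's weight update; the labels and weak-classifier outputs below are made up:

```r
# One AdaBoost round: weighted error, alpha_t, and the new weights D_{t+1}.
y <- c(1, 1, -1, -1, 1)              # true labels y_i
h <- c(1, -1, -1, -1, 1)             # weak classifier outputs h_t(x_i)
D <- rep(1 / 5, 5)                   # current distribution D_t
eps   <- sum(D[y != h])              # weighted error rate epsilon_t
alpha <- 0.5 * log((1 - eps) / eps)  # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
D_new <- D * exp(-alpha * y * h)
D_new <- D_new / sum(D_new)          # divide by Z_t so the weights sum to 1
rbind(D, D_new)                      # the misclassified point gains weight
```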
**Clustering**

Supervised or unsupervised?

Methods:

- k-means
- k-medoids
- hierarchical clustering
- DBSCAN
- spectral clustering
- EM / model-based clustering
**Clustering Dichotomy**

- Clustering: partitional and hierarchical.
- Gap statistic: simulation from a model (sometimes called a parametric bootstrap).
- Hierarchical clustering: different shapes of trees.
- Heatmap (biclustering and image plots).
**k-medoids Algorithm**

1. For a given cluster assignment C, find the observation in each cluster minimizing the total distance to the other points in that cluster.
2. Given a current set of cluster centers $\{m_1, \dots, m_K\}$, minimize the total error by assigning each observation to the closest (current) cluster center: $C(i) = \arg\min_{1 \leq k \leq K} D(x_i, m_k)$.
3. Iterate steps 1 and 2 until the assignments do not change.
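A sketch of k-medoids via pam() from the cluster package; using k = 3 on the standardized USArrests data is an arbitrary illustration:

```r
library(cluster)
fit <- pam(scale(USArrests), k = 3)   # k-medoids on standardized data
fit$medoids                           # the medoid observations m_1, ..., m_K
table(fit$clustering)                 # cluster sizes
```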
**Density-Based Clustering (DBSCAN)**

Density-based clustering methods are for groups of data that differ in size or shape (arbitrary shapes are possible), where the number of clusters is not known a priori.

Clustering 'in context'.

Clustering based on density (a local cluster criterion), such as density-connected points, or based on an explicitly constructed density function.
**Jargon and Parameters**

- ϵ: radius of a neighborhood sphere; the ϵ-neighborhood of a point is the sphere of radius ϵ around it.
- MinPts: threshold on the number of points needed to form a cluster.
- Density is measured as the number of points within the specified radius ϵ.
- Core point: a point with more than a specified number of points (MinPts) within ϵ.
- Border point: a point with fewer than MinPts within ϵ, but lying in the neighborhood of a core point.
- Noise point: any point that is neither a core point nor a border point.
**Clusters Based on Density Reachability**

- A point q is directly density-reachable from a point p if it is no farther away than a given distance ϵ (i.e., q is part of p's ϵ-neighborhood), and if p is surrounded by sufficiently many points that one may consider p and q part of a cluster.
- q is called density-reachable from p if there is a sequence $p_1, \dots, p_n$ of points with $p_1 = p$ and $p_n = q$ where each $p_{i+1}$ is directly density-reachable from $p_i$.
- The relation density-reachable is not symmetric (q might lie on the edge of a cluster, having too few neighbors to count as a genuine cluster element itself).
- Density-connected: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o.
A cluster, which is a subset of the points of the database, satisfies two properties:

1. All points within the cluster are mutually density-connected.
2. If a point is density-connected to any point of the cluster, it is part of the cluster as well.

**Density Reachability** (figure)
**Algorithm**

DBSCAN uses two parameters:

- ϵ (eps): the neighborhood radius.
- minPts: the minimum number of points required to form a cluster.

Start from an arbitrary point that has not been visited. Its eps-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized eps-neighborhood of a different point and hence be made part of a cluster.

If a point is found to be part of a cluster, its eps-neighborhood is also part of that cluster. Hence, all points that are found within the eps-neighborhood are added, as is their own eps-neighborhood. This process continues until the cluster is completely found. Then a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.
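A sketch of running DBSCAN with the fpc package; the two simulated Gaussian blobs and the eps/MinPts values are made-up illustrations:

```r
library(fpc)
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
fit <- dbscan(x, eps = 0.7, MinPts = 5)
table(fit$cluster)                   # cluster 0 holds the noise points
plot(x, col = fit$cluster + 1, pch = 19)
```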
**Correlation between Data and Clusters**

Similarity built from distances:
$$s_{ij} = 1 - \frac{d_{ij} - \min d}{\max d - \min d}$$
Incidence matrix from the cluster identities:
$$M_{i,j} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ belong to the same cluster} \\ 0 & \text{otherwise.} \end{cases}$$
We can then compute cor(M, S).
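A sketch of this correlation; the hierarchical clustering of USArrests below is purely an example partition:

```r
d  <- as.matrix(dist(USArrests))
S  <- 1 - (d - min(d)) / (max(d) - min(d))     # similarities s_ij in [0, 1]
cl <- cutree(hclust(dist(USArrests)), k = 3)   # any cluster assignment works here
M  <- outer(cl, cl, "==") * 1                  # 1 if same cluster, 0 otherwise
cor(as.vector(S), as.vector(M))
```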
**Internal Measures: SSE**

The total sum of squares T is constant. The within-cluster sum of squares is
$$WSS = \sum_k \sum_{x_i \in C_k} d(x_i, m_k)^2.$$
**Clustering Silhouette**

Assume the data have been clustered. For each datum $x_i$, let $a(i)$ be the average dissimilarity of $x_i$ with all the other data within the same cluster:
$$a(i) = \frac{1}{n_k} \sum_{x_j \in C_k} d(x_i, x_j).$$
We can interpret $a(i)$ as how well matched $x_i$ is to the cluster it is assigned to (the smaller the value, the better the matching).

Then find the average dissimilarity of $x_i$ with the data of another single cluster, and repeat this for every cluster of which $x_i$ is not a member. Denote the lowest such average dissimilarity by $b(i)$. The cluster with this average dissimilarity is said to be the "neighbouring cluster" of $x_i$, as it is, aside from the cluster $x_i$ is assigned to, the cluster in which $x_i$ fits best.

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
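A sketch of silhouette widths with the cluster package; the k-medoids partition of USArrests is just an example:

```r
library(cluster)
cl  <- pam(scale(USArrests), k = 3)
sil <- silhouette(cl)          # one row per observation: cluster, neighbor, s(i)
summary(sil)                   # average silhouette width per cluster and overall
plot(sil)
```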
**Gap Statistic**

To the extent this scenario is realized, there will be a sharp decrease in successive differences in criterion value, $W_K - W_{K+1}$, at $K = K^*$. That is,
$$\{W_K - W_{K+1} \mid K < K^*\} \gg \{W_K - W_{K+1} \mid K \geq K^*\}.$$
An estimate $\hat{K}^*$ of $K^*$ is then obtained by identifying a 'kink' in the plot of $W_K$ as a function of K.

Heuristic: the gap statistic (Tibshirani, Walther, Hastie, 2001) compares the curve $\log W_K$ to the curve obtained from data uniformly distributed over a rectangle containing the data. It estimates the optimal number of clusters as the place where the gap between the two curves is largest. (Essentially this is an automatic way of locating the 'kink'.)
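A sketch of the gap statistic with clusGap from the cluster package; the choices of K.max, B, kmeans and the USArrests data are all arbitrary:

```r
library(cluster)
set.seed(1)
gap <- clusGap(scale(USArrests), FUN = kmeans, K.max = 8, B = 50, nstart = 20)
gap          # prints the gap values and a suggested number of clusters
plot(gap)    # look for the K where the gap between the two curves is largest
```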
**Probabilistic Clustering: EM Algorithm**

Probability that instance x belongs to cluster $C_A$:
$$P(C_A \mid x) = \frac{P(x \mid C_A)\, P(C_A)}{P(x)} = \frac{f(x; \mu_A, \sigma_A)\, p_A}{P(x)}$$
Most common form (Gaussian density):
$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Probability of an instance given the clusters:
$$P(x \mid \text{the clusters}) = \sum_k P(x \mid \text{cluster}_k)\, P(\text{cluster}_k)$$
**Probabilistic Clustering: Extension of k-means**

- Assume: we know there are k clusters.
- Learn the clusters: determine their parameters (means and standard deviations).
- Performance criterion: probability of the training data given the clusters.
- The EM algorithm finds a local maximum of the likelihood.
**EM = Expectation-Maximization**

Generalize k-means to a probabilistic setting. Iterative procedure:

- E ('expectation') step: calculate the cluster probability for each instance.
- M ('maximization') step: estimate the distribution parameters from the cluster probabilities (this maximizes the likelihood).
- Store the cluster probabilities as instance weights.
- Stop when the improvement is negligible.
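A sketch of EM-based model clustering with the mclust package; the iris measurements are just a standard example data set:

```r
library(mclust)
fit <- Mclust(iris[, 1:4])      # EM over Gaussian mixtures; picks the model and k by BIC
summary(fit)
head(fit$z)                     # E-step output: P(cluster_k | x_i) for each instance
fit$parameters$mean             # M-step output: fitted cluster means
```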
