Lect30-FinalReview

DATA MINING, Susan Holmes © Stats202, Lecture 30, Fall 2010

Special Announcements
- Hw6 has been returned to everyone who put a valid SUNet id in their hw file names.
- Midterm: today in class is the last day to pick up your midterm.
- We are now going to review the material covered up to last Wednesday. Everything we studied may appear on the final.
- We will also go over the solutions to Homework 6, as some questions will be of exactly the same type.

BUZZWORDS
- Bayesian / Frequentist.
- Parametric / Nonparametric.
- Robust / Breakdown point.
- Supervised / Unsupervised.
- Exploratory / Confirmatory.
- Continuous / Categorical.
- Density / Distribution.
- Dependent / Independent.
- Eigenvalues / Singular values.
- Populations / Samples / Mixtures.

What is Data? Data Types: Examples
- Continuous.
- Discrete/Nominal → factors (few levels).
- Discrete/Ordinal (rankings).
- Nominal/Identifiers (how are they different from factors?).
- Time series, maps.
- Graphs.
- How are they represented in R?

Table of Methods for Supervised Methods
- Multiple Regression: response 1 continuous; explanatory p continuous.
- Discriminant Analysis: response 1 categorical; explanatory p continuous.
- Analysis of Variance: response 1 continuous; explanatory p categorical.
- Multidimensional Analysis of Variance (MANOVA): response p continuous; explanatory p categorical.
- Multidimensional Analysis of Covariance: response p continuous; explanatory p1 categorical, p2 continuous.
- Regression Tree: response 1 continuous; explanatory p1 continuous, p2 categorical.
- Classification Tree: response 1 categorical; explanatory p1 continuous, p2 categorical.
- Correspondence Analysis: response 1 categorical; explanatory 1 categorical.
- Ensemble Methods: response 1 categorical; explanatory p1 continuous, p2 categorical.
- Support Vector Machines: response 1 categorical; explanatory p1 continuous.

Table of Unsupervised Methods
- Principal Components: p continuous variables.
- Multiple Correspondence Analysis: p categorical variables.
- Multidimensional Scaling (PCoA).
- Double PCoA.
- Clustering (either hierarchical or not): categorical and continuous variables.
- Association Rules.

Using Distances for:
- Binary data (asymmetric and symmetric).
- Discrete/Nominal: distances / contingency tables / CA.
- Discrete/Ordinal (rankings): distances / PCA.

Distances, Similarities, Dissimilarities
Distances:
- Euclidean.
- Chi-square: Chisquare(exp, obs) = ∑_j (exp_j − obs_j)^2 / exp_j.
- Hamming / L1.
- Mahalanobis.
Similarity indices:
- Confusion (cognitive psychology) is a similarity.
- The matching coefficient is a similarity:
  (number of matching attributes) / (number of attributes) = (f_11 + f_00) / (f_11 + f_00 + f_10 + f_01).

MDS Algorithm
In summary, given an n × n matrix of interpoint distances D, one can solve for points achieving these distances by:
1. Double centering the squared interpoint distance matrix: B = −(1/2) H D^(2) H.
2. Diagonalizing B: B = U Λ U^T.
3. Extracting the point configuration: X = U Λ^{1/2}.

Decision Trees and Classification Examples
- Two sets of data: training and test.
- The response Y is a nominal/categorical variable.
- Explanatory variables can be continuous AND nominal AND ordinal.
- Indices of purity: Gini, entropy (deviance) and misclassification.
- The output of rpart and tree is a PRUNED tree.

Impurity
1. Deviance: D_t = −2 ∑_k n_tk log p_tk.
2. Entropy: D_t = −∑_k p_tk log2 p_tk.
3. Gini index: D_t = 1 − ∑_k p_tk^2.
4. Misclassification error: D_t = 1 − p_t,k(t), where k(t) is the category at node t with the largest number of observations.
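As a quick illustration of these indices, here is a minimal R sketch (not taken from the lecture code; the function name node_impurity is made up) that computes deviance, entropy, Gini and misclassification error from the vector of class counts at a node.

# Impurity measures at a node t, given the class counts n_tk.
node_impurity <- function(counts) {
  p <- counts / sum(counts)          # class proportions p_tk
  keep <- p > 0                      # drop empty classes so log() is defined
  c(deviance = -2 * sum(counts[keep] * log(p[keep])),
    entropy  = -sum(p[keep] * log2(p[keep])),
    gini     = 1 - sum(p^2),
    misclass = 1 - max(p))
}

node_impurity(c(40, 10))   # a fairly pure node: low entropy, low Gini
node_impurity(c(25, 25))   # an even split: maximal impurity for two classes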
Cross-validation
To determine when to stop pruning a large tree we use cross-validation. For each best tree of a fixed size, or alternatively for each distinct value of cp, we carry out a k-fold cross-validation. To do this we randomly divide the data X into k subsets such that X = X_1 ∪ X_2 ∪ ... ∪ X_k. We leave out each of the k subsets X_j in turn and fit the tree to the data consisting of the remaining k − 1 subsets combined. The data that were left out are then used to calculate the impurity (cost) of the tree. This process is repeated for each of the k subsets, and the relative average cross-validation error (xerror) is obtained as
xerror = ( (1/k) ∑_j CC_j ) / CC(null tree),
where CC_j is the cost computed on the held-out subset X_j.

Alternative Classification Methods
- Rule based.
- Instance based methods and nearest neighbors.
- Discriminant Analysis: for continuous explanatory variables only.
- SVM.

Discriminant Functions for LDA
A discriminant function is a linear combination of the discriminating variables such that
L = b_1 x_1 + b_2 x_2 + ... + b_p x_p + c,
where the b's are discriminant coefficients, the x's are the discriminating variables, and c is a constant. The first discriminant function (or variable, or axis) is the linear combination of the original variables that maximizes a' B a / a' T a. This is equivalent to maximizing the quadratic form a' B a under the constraint a' T a = 1, which is in turn equivalent to finding the eigenvectors of B W^{−1}.

SVM
Let X be a classification dataset with n points in a p-dimensional space, X = {(x_i, y_i)}, i = 1, 2, ..., n, where each y_i is either +1 or −1. A hyperplane gives a linear discriminant function in p dimensions and splits the original space into two half-spaces:
h(x) = w^T x + b,
where w is a p-dimensional weight vector and b is a scalar bias. Points on the hyperplane have h(x) = 0, i.e. the hyperplane is defined by all points for which w^T x = −b.

SVM: margin and support vectors
Given a separating hyperplane h(x) = 0, the distance between a point x_i and the hyperplane is
δ_i = y_i h(x_i) / ||w||.
The margin of the linear classifier is defined as the minimum distance of all n points to the separating hyperplane:
δ* = min_{x_i} y_i h(x_i) / ||w||.
All points x_i* that achieve this minimum distance are called the support vectors of the linear classifier. In other words, a support vector is a point that lies precisely on the margin of the classifying hyperplane.

Parameters in SVM
- Kernel parameters.
- Cost parameter: C controls the trade-off between allowing training errors and forcing rigid margins.
- How to choose them: tune.svm provides a grid search.
- More information: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classific

Ensemble Methods
Iterative/replicative methods combining 'weak learners' in different ways:
- Bagging: bootstrap aggregation.
- Boosting: combining weak learners.
- Random forests.

Bootstrapping
P(x_1 is in the bootstrap resample) = 1 − (1 − 1/n)^n, and
(1 − 1/n)^n = exp(n log(1 − 1/n)) ≈ exp(n (−1/n)) = exp(−1) = 1/e ≈ 0.36788.
OOB = out of bag (not included in the bootstrap resample). The OOB prediction for an observation is determined by a majority vote over all trees whose training set did not contain that observation.
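A quick simulation (not part of the slides; the sample size and number of replicates are arbitrary) confirms the calculation above: roughly exp(−1) ≈ 37% of the observations are missing from a typical bootstrap resample, and those are the OOB observations.

# Fraction of observations left out of a bootstrap resample.
set.seed(1)
n <- 1000
oob_frac <- replicate(2000, {
  boot <- sample(n, n, replace = TRUE)   # one bootstrap resample of the indices
  mean(!(seq_len(n) %in% boot))          # proportion of indices never drawn
})
mean(oob_frac)   # close to exp(-1) = 0.3679
(1 - 1/n)^n      # exact probability for this n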
About Boosting Weights
Given (x_1, y_1), ..., (x_n, y_n) with x_i ∈ X and y_i ∈ Y = {−1, +1}:
1. Initialize D_1(i) = 1/n, i = 1, ..., n.
2. For t = 1, ..., T:
   - Find the classifier h_t : X → {−1, +1} that minimizes the error with respect to the distribution D_t:
     h_t = argmin_{h_j ∈ H} ϵ_j, where ϵ_j = ∑_{i=1}^n D_t(i) [y_i ≠ h_j(x_i)].
   - If ϵ_t ≥ 0.5 then stop.
   - Choose α_t ∈ R, typically α_t = (1/2) ln((1 − ϵ_t)/ϵ_t), where ϵ_t is the weighted error rate of the classifier h_t.
   - Update D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor.
3. Output the final classifier: H(x) = sign( ∑_{t=1}^T α_t h_t(x) ).
The update of the distribution D_t is constructed so that −α_t y_i h_t(x_i) is < 0 when y_i = h_t(x_i) and > 0 when y_i ≠ h_t(x_i), so correctly classified points are down-weighted and misclassified points are up-weighted.

Clustering
Supervised or unsupervised?
Methods:
- k-means
- k-medoids
- hierarchical clustering
- DBSCAN
- spectral clustering
- EM / model-based clustering

Clustering Dichotomy
- Clustering: partitional and hierarchical.
- Gap statistic: simulation from a model (sometimes called a parametric bootstrap).
- Hierarchical clustering: different shapes of trees.
- Heatmap (biclustering and image plots).

k-medoids algorithm
1. For a given cluster assignment C, find the observation in each cluster minimizing the total distance to the other points in that cluster; these are the medoids m_1, ..., m_K.
2. Given the current set of cluster centers {m_1, ..., m_K}, minimize the total error by assigning each observation to the closest (current) cluster center: C(i) = argmin_{1 ≤ k ≤ K} D(x_i, m_k).
3. Iterate steps 1 and 2 until the assignments do not change.
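The pam() function in the cluster package implements this k-medoids idea; below is a small sketch on simulated data (the data, the seed and the choice of k = 2 are purely illustrative).

# k-medoids with PAM on two simulated Gaussian blobs.
library(cluster)
set.seed(202)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
fit <- pam(x, k = 2)       # k-medoids with K = 2 clusters
fit$medoids                # the medoids are actual data points, unlike k-means centroids
table(fit$clustering)      # cluster sizes
plot(silhouette(fit))      # silhouette plot for this clustering (silhouettes are reviewed below)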
Density Based Clustering (DBSCAN)
Density-based clustering methods are designed for groups of data that differ in size or shape (arbitrary shapes are possible) and for which the number of clusters is not known a priori. Clustering 'in context'. Clustering is based on density (a local cluster criterion), such as density-connected points, or on an explicitly constructed density function.

Jargon and Parameters
- ϵ: radius of a neighborhood sphere.
- MinPts: threshold on the number of points needed to have a cluster.
- Density is measured as the number of points within the specified radius ϵ.
- Core point: has more than the specified number of points (MinPts) within ϵ.
- Border point: has fewer than MinPts within ϵ, but is in the neighborhood of a core point.
- Noise point: any point that is neither a core point nor a border point.
- ϵ-neighborhood: a sphere of radius ϵ.

Cluster based on density reachability
- A point q is directly density-reachable from a point p if it is not farther away than the given distance ϵ (i.e., it is part of p's ϵ-neighborhood) and p is surrounded by sufficiently many points, so that one may consider p and q to be part of a cluster.
- q is called density-reachable from p if there is a sequence p_1, ..., p_n of points with p_1 = p and p_n = q, where each p_{i+1} is directly density-reachable from p_i. The relation density-reachable is not symmetric (q might lie on the edge of a cluster, having too few neighbors to count as a genuine cluster element itself).
- Density-connected: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o.

A cluster, which is a subset of the points of the database, satisfies two properties:
1. All points within the cluster are mutually density-connected.
2. If a point is density-connected to any point of the cluster, it is part of the cluster as well.

Density reachability (figure).

Algorithm DBSCAN
DBSCAN uses the two parameters ϵ (eps), the neighborhood radius, and the minimum number of points required to form a cluster (minPts). Start from an arbitrary point that has not been visited. This point's ϵ-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise the point is labeled as noise. Note that this point might later be found in a sufficiently sized ϵ-neighborhood of a different point and hence be made part of a cluster. If a point is found to be part of a cluster, its ϵ-neighborhood is also part of that cluster. Hence, all points found within the ϵ-neighborhood are added, as are their own ϵ-neighborhoods. This process continues until the cluster is completely found. Then a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.

Correlation between Data and Clusters
Similarity built from distances: s_ij = 1 − (d_ij − min_d) / (max_d − min_d).
Incidence matrix built from cluster identity: M_ij = 1 if x_i and x_j belong to the same cluster, M_ij = 0 otherwise.
We can then compute cor(M, S).

Internal Measures: SSE
The total sum of squares T is constant. The within-cluster sum of squares is
WSS = ∑_k ∑_{x_i ∈ C_k} d(x_i, m_k)^2.

Clustering Silhouette
Assume the data have been clustered. For each datum x_i, let a(i) be the average dissimilarity of x_i with all other data within the same cluster C_k:
a(i) = (1 / (n_k − 1)) ∑_{x_j ∈ C_k, j ≠ i} d(x_i, x_j).
We can interpret a(i) as how well matched x_i is to the cluster it is assigned to (the smaller the value, the better the match). Then find the average dissimilarity of x_i with the data of another single cluster; repeat this for every cluster of which x_i is not a member, and denote the lowest such average dissimilarity by b(i). The cluster achieving this lowest average dissimilarity is called the "neighbouring cluster" of x_i because, aside from the cluster x_i is assigned to, it is the cluster in which x_i fits best. The silhouette is
s(i) = (b(i) − a(i)) / max{a(i), b(i)}.

Gap Statistic
When the data really do fall into K* well-separated clusters, there will be a sharp decrease in the successive differences of the criterion value, W_K − W_{K+1}, at K = K*. That is, {W_K − W_{K+1} | K < K*} is much larger than {W_K − W_{K+1} | K ≥ K*}. An estimate of K* is then obtained by identifying a 'kink' in the plot of W_K as a function of K.
Heuristic: the gap statistic (Tibshirani, Walther, Hastie, 2001) compares the curve log W_K to the curve obtained from data uniformly distributed over a rectangle containing the data. It estimates the optimal number of clusters as the value of K where the gap between the two curves is largest (essentially an automatic way of locating the 'kink').

Probabilistic Clustering: EM algorithm
Probability that instance x belongs to cluster C_A:
P(C_A | x) = P(x | C_A) P(C_A) / P(x) = f(x; µ_A, σ_A) p_A / P(x).
Most common form (a Gaussian density):
f(x; µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)^2 / (2σ^2)).
Probability of an instance given the clusters:
P(x | the clusters) = ∑_k P(x | cluster_k) P(cluster_k).

Probabilistic Clustering: extension of k-means
- Assume we know there are k clusters.
- Learn the clusters: determine their parameters (means and standard deviations).
- Performance criterion: probability of the training data given the clusters.
- The EM algorithm finds a local maximum of the likelihood.

EM = Expectation-Maximization
Generalizes k-means to a probabilistic setting. Iterative procedure:
- E 'expectation' step: calculate the cluster probability for each instance.
- M 'maximization' step: estimate the distribution parameters from the cluster probabilities (this is the likelihood maximization).
Cluster probabilities are stored as instance weights. Stop when the improvement is negligible.
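To make the E and M steps concrete, here is a minimal hand-rolled sketch of EM for a two-component, one-dimensional Gaussian mixture (simulated data; the variable names and the fixed number of iterations are illustrative, and in practice one would stop when the likelihood improvement is negligible, or use a package such as mclust).

# EM for a mixture of two 1-D Gaussians.
set.seed(202)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 5, sd = 1))

mu <- c(min(x), max(x)); sigma <- c(sd(x), sd(x)); p <- c(0.5, 0.5)  # crude starting values

for (iter in 1:100) {
  # E step: P(cluster A | x) is proportional to f(x; mu_A, sigma_A) * p_A
  dA <- dnorm(x, mu[1], sigma[1]) * p[1]
  dB <- dnorm(x, mu[2], sigma[2]) * p[2]
  wA <- dA / (dA + dB)                          # responsibility of cluster A for each x
  # M step: re-estimate the parameters from the weighted instances
  mu    <- c(weighted.mean(x, wA), weighted.mean(x, 1 - wA))
  sigma <- c(sqrt(weighted.mean((x - mu[1])^2, wA)),
             sqrt(weighted.mean((x - mu[2])^2, 1 - wA)))
  p     <- c(mean(wA), mean(1 - wA))
}
round(cbind(mu, sigma, p), 3)   # should recover means near 0 and 5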