DATA MINING
Susan Holmes © Stats202
Lecture 14, Fall 2010

Special Announcements

All requests should be sent to [email protected].
Homework: the deadline is Tuesday 5.00pm; homework not submitted by the deadline is rejected (we have an automatic system). Please don't forget to add your SUNet ID to your homework file name (at the end).
Midterm: you may bring a one-page cheat sheet; no cellphones, no laptops.

Last Time: Classification and Regression Trees

Examples.
Explanatory variables can be continuous AND nominal AND ordinal.
Indices of purity: Gini, entropy (deviance) and misclassification.
Cross-validation.
Choice of the cp tuning parameter.

Today: Alternative Classification Methods

Rule based.
Instance based methods and nearest neighbors.
Discriminant analysis: for continuous explanatory variables only.

Discrimination for Continuous Explanatory Variables

Discriminant functions are the essence of the output from a discriminant analysis. They are the linear combinations of the standardised independent variables that yield the biggest mean differences between the groups. If the response is a dichotomy (only two classes to be predicted) there is one discriminant function; if the response variable has k levels (i.e. there are k classes to predict), up to k − 1 discriminant functions can be extracted, and we can test how many are worth extracting.

Discriminant Functions

Successive discriminant functions are orthogonal to one another, like principal components, but they are not the same as the principal components you would obtain from a principal components analysis of the independent variables: they are constructed to maximise the differences between the classes of the response, that is, the between-class variance rather than the total variance. The initial input data do not have to be centered or standardized before the analysis, as is the case for principal components; the outcome of the final discriminant analysis is not affected by the scaling.

Discriminant Functions

A discriminant function, also called a canonical root, is a latent variable created as a linear combination of the discriminating (independent) variables,

    L = b1 x1 + b2 x2 + ... + bp xp + c,

where the b's are discriminant coefficients, the x's are discriminating variables, and c is a constant. This is similar to multiple regression, but the b's are discriminant coefficients chosen to maximize the distance between the group means of the criterion (dependent) variable. Note that the foregoing assumes the discriminant function is estimated by ordinary least squares, the traditional method; there is also a version based on maximum likelihood estimation.

Least Squares Method of Estimation of Discriminant Functions

The variance-covariance matrix can be decomposed into two parts, the variance within each class and the variability between classes; equivalently, we can decompose the matrix of sums of squares and cross products (the same up to a constant factor):

    T = B + W
    T = X′(In − P1n)X          (total)
    B = X′(PG − P1n)X          (between-class)
    W = X′(In − PG)X           (within-class)

Here In is the identity matrix and P1n is the orthogonal projection onto the space spanned by 1n, i.e. P1n = 1n 1n′ / n, so that (In − P1n)X is the matrix of centered cases. PG is the matrix projecting onto the subspace generated by the columns of the binary discriminating matrix G; this matrix has g columns and a one in row i, column j if observation i belongs to group j. For any vector a,

    a′Ta = a′Ba + a′Wa.
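To see the decomposition concretely, here is a minimal R sketch (not from the lecture) that builds the projection matrices and checks T = B + W numerically; the iris data and the object names X, grp, P1n, PG are assumptions made only for the illustration.

## Sketch (assumed data and names): verify T = B + W via projection matrices
X    <- as.matrix(iris[, 1:4])            # n x p matrix of explanatory variables
grp  <- iris$Species                      # group labels (g = 3 groups)
n    <- nrow(X)
G    <- model.matrix(~ grp - 1)           # n x g binary group-indicator matrix
P1n  <- matrix(1, n, n) / n               # projection onto the span of 1_n
PG   <- G %*% solve(t(G) %*% G) %*% t(G)  # projection onto the span of G
Tmat <- t(X) %*% (diag(n) - P1n) %*% X    # total sums of squares and cross products
B    <- t(X) %*% (PG - P1n) %*% X         # between-class part
W    <- t(X) %*% (diag(n) - PG) %*% X     # within-class part
all.equal(Tmat, B + W)                    # TRUE, up to rounding error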
The first discriminant function (or variable, or axis) is the linear combination of the original variables that maximises

    a′Ba / a′Ta.

This is equivalent to maximizing the quadratic form a′Ba under the constraint a′Ta = 1, and also to finding the eigenvectors of W−1B.

* The discriminant score, also called the DA score, is the value resulting from applying a discriminant function formula to the data for a given case. The Z score is the discriminant score for standardized data.
* Cutoff: if the discriminant score of the function is less than or equal to the cutoff, the case is classed as 0; if it is above, the case is classed as 1. When group sizes are equal, the cutoff is the mean of the two group centroids (for two-group DA). If the groups are unequal, the cutoff is the weighted mean.
* Unstandardized discriminant coefficients are used in the formula for making the classifications in DA, much as b coefficients are used in regression for making predictions. The product of the unstandardized coefficients with the observations yields the discriminant scores.
* Standardized discriminant coefficients are used to compare the relative importance of the independent variables, much as beta weights are used in regression.
* The group centroid is the mean value of the discriminant scores for a given category of the dependent variable. Two-group discriminant analysis has two centroids, one for each group.
* Number of discriminant functions: there is one discriminant function for 2-group discriminant analysis, but for higher-order DA the number of functions (each with its own cutoff value) is the lesser of g − 1, where g is the number of groups, and p, the number of discriminating (independent) variables. Each discriminant function is orthogonal to the others.

Mahalanobis Distance

Mahalanobis distances are used in analyzing cases in discriminant analysis. For instance, one might wish to analyze a new, unknown set of cases in comparison to an existing set of known cases. The Mahalanobis distance is the distance between a case and the centroid of each group in attribute space (the p-dimensional space defined by the p variables), taking into account the covariance of the variables.

The population version: suppose there are g groups and p variables, that the mean of group i is the vector µi = [µi1, µi2, ..., µip], 1 ≤ i ≤ g, and call Σ the variance-covariance matrix (which we suppose to be the same in all the groups). The (squared) Mahalanobis distance between group i and group j is

    D²ij = (µi − µj)′ Σ−1 (µi − µj).

The Mahalanobis distance is often used to compute the distance between a case x and the centre of the population:

    D²(x, µ) = (x − µ)′ Σ−1 (x − µ).

When the distribution is multivariate normal, D² follows a χ²p distribution (chi-square with p degrees of freedom).

Suppose now that we do not know the population variance-covariance matrix. We estimate it by the pooled variance-covariance matrix

    C = [ Σ_{i=1..g} (ni − 1) Ci ] / [ Σ_{i=1..g} (ni − 1) ].

Then the Mahalanobis distance between an observation x and the centroid x̄i of group i is

    D²(x, x̄i) = (x − x̄i)′ C−1 (x − x̄i).

We assign x to the group for which this Mahalanobis distance is smallest. Thus, the smaller the Mahalanobis distance, the closer the case is to the group centroid and the more likely it is to be classed as belonging to that group.
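The classification rule just described can be spelled out in a few lines of R. This is a hedged sketch rather than code from the slides: the iris data stand in for a generic data set, and the names X, grp, Ci, cent and D2 are assumptions.

## Sketch (assumed data and names): classify by smallest Mahalanobis distance
X    <- as.matrix(iris[, 1:4])
grp  <- iris$Species
ni   <- table(grp)                                   # group sizes n_i
## pooled variance-covariance matrix C = sum_i (n_i - 1) C_i / sum_i (n_i - 1)
Ci   <- lapply(levels(grp), function(g) cov(X[grp == g, , drop = FALSE]))
C    <- Reduce(`+`, Map(function(S, n) (n - 1) * S, Ci, as.numeric(ni))) / sum(ni - 1)
## group centroids (one row per group)
cent <- t(sapply(levels(grp), function(g) colMeans(X[grp == g, , drop = FALSE])))
## squared Mahalanobis distance of every case to every group centroid
D2   <- sapply(levels(grp), function(g) mahalanobis(X, center = cent[g, ], cov = C))
pred <- levels(grp)[apply(D2, 1, which.min)]         # assign to the nearest centroid
mean(pred != grp)                                    # resubstitution error rate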
R function lda()

lda(formula, data, ..., subset, na.action)
## Default S3 method:
lda(x, grouping, prior = proportions, method, CV = FALSE, nu, ...)

Example of Linear Discrimination

diabetes = read.table('diabetes.txt', header = T, row.names = 1)
diabetes[1:20, ]
   relwt glufast glutest steady insulin Group
1   0.81      80     356    124      55     3
3   0.94     105     319    143     105     3
5   1.00      90     323    240     143     3
7   0.91     100     350    221     119     3
9   0.99      97     379    142      98     3
11  0.90      91     353    221      53     3
13  0.96      78     290    136     142     3
15  0.74      86     312    208      68     3
17  1.10      90     364    152      76     3

> pairs(diabetes[,1:5], pch = 21, bg = c("red", "green3", "blue"

[Figure: pairs() scatterplot matrix of relwt, glufast, glutest, steady and insulin, with points coloured by Group.]
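A complete version of the pairs() call might look like the sketch below; the exact way the colour vector is indexed by Group is an assumption for illustration, not the original code.

## Assumed completion of the pairs() call: colour each point by its Group
pairs(diabetes[, 1:5], pch = 21,
      bg = c("red", "green3", "blue")[unclass(factor(diabetes$Group))])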
Example of Linear Discrimination

> diab.ld = lda(diabetes[, 1:5], grouping = diabetes[, 6])
> diab.ld
lda(diabetes[, 1:5], grouping = diabetes[, 6])

Prior probabilities of groups:
        1         2         3
0.2222222 0.2500000 0.5277778

Group means:
      relwt   glufast   glutest   steady  insulin
1 0.9915625 213.65625 1027.3750 108.8438 320.9375
2 1.0558333  99.30556  493.9444 288.0000 208.9722
3 0.9372368  91.18421  349.9737 172.6447 114.0000

Coefficients of linear discriminants:
                  LD1           LD2
relwt   -1.339546e+00 -3.7950612048
glufast  3.301944e-02  0.0373202882
glutest -1.263978e-02 -0.0068947755
steady   1.240248e-05 -0.0059924778
insulin -3.895587e-03  0.0005754322

[Figure: scatterplot of the cases on the first two discriminant axes (LD1 versus LD2), each case plotted with its group label 1, 2 or 3.]

Cross-validation

To obtain an estimate of the misclassification rate that is not biased, we use cross-validation. Usually for LDA we use leave-one-out cross-validation (n-fold), splitting the data as

    X1 ∪ X2 ∪ X3 ∪ ... ∪ Xn.
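Leave-one-out cross-validation for LDA is available directly through the CV = TRUE argument of lda() shown earlier. The short sketch below is an illustration with assumed object names (diab.cv, conf); it tabulates the leave-one-out predictions against the true groups to estimate the misclassification rate.

## Sketch: leave-one-out cross-validated error rate for the diabetes example
library(MASS)                       # lda() lives in the MASS package
diab.cv <- lda(diabetes[, 1:5], grouping = diabetes[, 6], CV = TRUE)
## with CV = TRUE, lda() returns the leave-one-out class assignments in $class
conf <- table(true = diabetes[, 6], predicted = diab.cv$class)
conf
1 - sum(diag(conf)) / sum(conf)     # estimated misclassification rate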