ohdatamineVISU

ohdatamineVISU - DATA MINING Susan Holmes © Stats202...


DATA MINING
Susan Holmes © Stats202
Lecture 8, Fall 2010

Special Announcements

Homework questions are handled in office hours (6 times a week); please come with questions.
All other requests, questions, and suggestions should be sent to stats202-aut1011-staff@lists.stanford.edu.
The homework deadline is Friday 5:00pm; any homework submitted after the deadline is rejected (we have an automatic system).
Hint on coursework: Jaccard(rep(0,n),rep(0,n))=0 is a possible convention.
Nice commands for changing classes: as.vector, as.dist, dist, as.matrix.

Recap

> sum(pcao$co[,1]^2)
[1] 3.418238
(which is the value of the first eigenvalue)
> sum(pcao$c1[,1]^2)
[1] 1
> sum(pcao$li[,1]^2)
[1] 112.8019
> sum(pcao$l1[,1]^2)
[1] 33
> sum(pcao$li[,1]^2)/33
[1] 3.418238

So the squared norm of the normalized principal components is taken to be 33. The difference between li and l1 is that l1 has all its columns normalized to a squared norm of 33, whereas the squared norm of a column of li is 33 times the corresponding eigenvalue.

Initialization is Important

mato=matrix(0,10,3)
for (i in 1:10){
  for (j in 1:3){
    mato[i,j]=rnorm(1)
  }
}
> mato
            [,1]       [,2]        [,3]
[1,] -1.31312857 -0.1226580 -1.16757231
[2,] -0.01357774 -0.9739593  0.04860852
[3,]  0.73723196  0.2661910 -0.67721466
[4,] -1.00701794 -1.1324707  1.11201355
[5,] -0.53416760 -0.1872749 -0.42778941
> mato[3,]
[1]  0.7372320  0.2661910 -0.6772147

Input Data

A typical example of an input matrix is the aggregate proximity matrix derived from a pilesort task. Each cell $x_{ij}$ of such a matrix records the number (or proportion) of respondents who placed items i and j into the same pile. It is assumed that the number of respondents placing two items into the same pile is an indicator of the degree to which they are similar. An MDS map of such data would place close together the items that were often sorted into the same piles.

Another typical example of an input matrix is a matrix of correlations among variables.
Treating these data as similarities (as one normally would) would cause the MDS program to put variables with high positive correlations near each other, and variables with strong negative correlations far apart.

Another type of input matrix is a flow matrix. For example, a dataset might consist of the number of business transactions occurring during a given period between a set of corporations. Running these data through MDS might reveal clusters of corporations whose members trade more heavily with one another than with outsiders. Although technically neither similarities nor dissimilarities, these data should be classified as similarities so that companies that trade heavily with each other show up close to each other on the map.

MDS Algorithm

In summary, given an n × n matrix of interpoint distances D, one can solve for points achieving these distances by:
1. Double centering the matrix of squared interpoint distances: $B = -\frac{1}{2} H D^{2} H$, where $D^{2}$ holds the elementwise squared distances and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ is the centering matrix.
2. Diagonalizing B: $B = U \Lambda U^{T}$.
3. Extracting $\tilde{X}$: $\tilde{X} = U \Lambda^{1/2}$.

data(eurodist)
eurodist

> Msig3[1:6,1:5]
      HEA26_EFFE_1 HEA26_MEM_1 HEA26_NAI_1 MEL36_EFFE_1 MEL36_M
3968   -2.61083608  -2.2592865  -0.2719081  -2.24310216 -2.6798
14831  -1.18579226  -0.4718562   0.8170308  -1.08060090 -0.1463
13492  -0.05612926   0.2773099   0.8126917  -0.23557156  0.2490
5108   -0.14702989   0.5438521   0.7198425  -0.17605950  0.95198
16348   0.52307578  -0.3659554  -0.9049818   0.64067822 -0.2040
585    -0.01816524   0.1078421   0.7481906   0.01000666  0.16530
> disjonct[1:6,1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,]   -1   -1    0   -1   -1
[2,]   -1   -1    1   -1    0
[3,]    0    0    1    0    0
[4,]    0    1    1    0    1
[5,]    1   -1   -1    1    0
[6,]    0    0    1    0    0

dist.disj=dist(disjonct,method="manhattan")
class(dist.disj)
[1] "dist"
str(dist.disj)
Class 'dist' atomic [1:12090] 13 31 36 31 33 8 41 52 39 29 ...
  ..- attr(*, "Size")= int 156
  ..- attr(*, "Diag")= logi FALSE
  ..- attr(*, "Upper")= logi FALSE
  ..- attr(*, "method")= chr "manhattan"
  ..- attr(*, "call")= language dist(x = disjonct, method = "man
dist.disj[1:20]
 [1] 13 31 36 31 33  8 41 52 39 29 24 49  9  8 25 43 10 38 49 14
as.matrix(dist.disj)[1:8,1:5]
   1  2  3  4  5
1  0 13 31 36 31
2 13  0 18 27 36
3 31 18  0 11 30
4 36 27 11  0 37

> cmd.micro=cmdscale(dist.disj)
> class(cmd.micro)
[1] "matrix"
> dim(cmd.micro)
[1] 156 2
plot(cmd.micro[,1],cmd.micro[,2])
#### But we don't know how good the plot is in representing
#### all the information
cmd.micro2=cmdscale(dist.disj,eig=TRUE,k=60)
names(cmd.micro2)
[1] "points" "eig" "x" "ac" "GOF"
> round(cmd.micro2$eig,2)
 [1] 65000.12 9714.37 3269.84 1753.16 1397.03 1027.31 827.
[10] 577.68 511.59 477.42 402.74 375.12 357.07 324.18
[19] 270.69 259.36 250.34 229.26 226.36 200.47 194.28
[28] 150.64 146.44 136.50 122.86 108.46 100.36 92.55
[37] 70.44 65.02 51.85 46.88 44.04 33.57 28.91
[46] 15.66 11.73 10.55 7.68 4.19 1.29 0.00
[55] 0.00 0.00 0.00 0.00 0.00 0.00

[Figure: Scree Plot; eigenvalue axis running from 0 to 60000]

Map of genes

plot(cmd.micro2$points[,1:2])

[Figure: scatterplot of cmd.micro2$points[, 1:2][,1] (horizontal, about -30 to 30) against cmd.micro2$points[, 1:2][,2] (vertical, about -20 to 20)]

Example of MDS on Multivariate Data

str(eurodist)
Class 'dist' atomic [1:210] 3313 2963 3175 3339 2762 ...
  ..- attr(*, "Size")= num 21
  ..- attr(*, "Labels")= chr [1:21] "Athens" "Barcelona" "Bruss
attr(eurodist,"Labels")
 [1] "Athens"          "Barcelona"       "Brussels"        "Calais"
 [5] "Cherbourg"       "Cologne"         "Copenhagen"      "Geneva"
 [9] "Gibraltar"       "Hamburg"         "Hook of Holland" "Lisbon"
[13] "Lyons"           "Madrid"          "Marseilles"      "Milan"
[17] "Munich"          "Paris"           "Rome"            "Stockholm"
[21] "Vienna"
cmd.euro1=cmdscale(eurodist)
plot(cmd.euro1,type="n")
text(cmd.euro1,attr(eurodist,"Labels"))

Configurations

[Figure: "European Cities"; the city labels plotted against axe 1 (horizontal, about -2000 to 2000) and axe 2 (vertical, about -1000 to 1000)]

Diabetes Data

Reaven and Miller (1979) examined the relationship between measures of glucose (from blood plasma) and insulin in 145 non-obese adult patients at the Stanford Clinical Research Center in order to examine ways of classifying people as "normal", "overt diabetic", or "chemical diabetic". Each patient underwent a glucose tolerance test, and the variables were: Relative weight (RELWT), a ratio (OBS/EXP); Fasting Plasma Glucose (GLUFAST); Test Plasma Glucose (GLUTEST, a measure of intolerance to insulin); and Steady State Plasma Glucose (STEADY, a measure of insulin resistance).

Multivariate Data Visualization

Without MDS:

pdf("/Users/susan/stat202/slides/pairsdiab.pdf")
pairs(diab[,-c(1,7)])
title("pairs(diab[,-c(1,7)])")
dev.off()
library(lattice)
cloud(diab[,6]~diab[,5]*diab[,4])
cloud(insulin~steady*glutest,data=diab)
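For contrast with the plots made without MDS, the three-step recipe from the MDS Algorithm slide can be written out by hand and checked against cmdscale. This is only a minimal sketch on the built-in eurodist distances: the variable names (D, H, B, X, ref, max_diff) and the choice k = 2 are illustrative assumptions, and $D^{2}$ is taken to mean elementwise squaring.

```r
# Classical MDS by hand on eurodist (a sketch, not the lecture's own code)
D <- as.matrix(eurodist)              # n x n matrix of interpoint distances
n <- nrow(D)

# Step 1: double centering of the squared distances, B = -1/2 H D^2 H
H <- diag(n) - matrix(1 / n, n, n)    # centering matrix I - (1/n) 1 1^T
B <- -0.5 * H %*% (D^2) %*% H

# Step 2: diagonalize B = U Lambda U^T (eigenvalues come back in decreasing order)
e <- eigen(B, symmetric = TRUE)

# Step 3: keep the k leading (positive) eigenpairs, X = U Lambda^(1/2)
k <- 2                                # assumed number of dimensions to keep
X <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(X) <- rownames(D)

# Eigenvector signs are arbitrary, so compare configurations through their
# interpoint distances rather than through raw coordinates.
ref <- cmdscale(eurodist, k = k)
max_diff <- max(abs(as.vector(dist(X)) - as.vector(dist(ref))))
print(max_diff)
```

Because road distances are not exactly Euclidean, some trailing eigenvalues of B are negative here; the sketch only uses the two leading ones, which are positive.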
Output: Diabetes

[Figure: pairs(diab[,-c(1,7)]); scatterplot matrix of relwt, glufast, glutest, steady, and insulin]

[Figure: cloud plot with axes insulin, glutest, and steady]

Multivariate Data Visualization

Problems with ink overload.
Always try to increase the Information/Ink ratio.
Avoid chart junk.

library(ggplot2)   # provides qplot
overload=matrix(rnorm(30000),ncol=3,nrow=10000)
plot(overload[,1:2])
# I() sets a fixed transparency; a bare alpha=1/10 would be mapped as an aesthetic
qplot(overload[,1],overload[,2],alpha=I(1/10))
x=overload[,1]
y=overload[,2]
z=overload[,3]
qplot(x[z>0.5],y[z>0.5])

Multivariate Data Subsetting: Real Example

COMBO=read.table('http://www.astro.psu.edu/users/edf/COMBO17
header=T,fill=T)
dim(COMBO) ; names(COMBO)
[1] 3463 65
 [1] "Nr" "Rmag" "e.Rmag" "ApDRmag" "mumax" "Mcz"
 [8] "MCzml" "chi2red" "UjMAG" "e.UjMAG" "BjMAG" "e.BjM
[15] "e.VjMAG" "usMAG" "e.usMAG" "gsMAG" "e.gsMAG" "rsMA
[22] "UbMAG" "e.UbMAG" "BbMAG" "e.BbMAG" "VnMAG" "e.VbM
[29] "e.S280MA" "W420FE" "e.W420FE" "W462FE" "e.W462FE" "W4
[36] "W518FE" "e.W518FE" "W571FS" "e.W571FS" "W604FE" "e.W
[43] "e.W646FD" "W696FE" "e.W696FE" "W753FE" "e.W753FE" "W8
[50] "W856FD" "e.W856FD" "W914FD" "e.W914FD" "W914FE" "e.W
[57] "e.UFS" "BFS" "e.BFS" "VFD" "e.VFD" "RFS"
[64] "IFD" "e.IFD"
loz_index=which((COMBO[,6]<0.3) & (COMBO[,12]<0) & (COMBO[,28
COMBO_loz=cbind(COMBO[loz_index,12],COMBO[loz_index,28] - COMBO[loz_index,12])

Multivariate Data Subsetting: Real Example

par(mfrow=c(1,2))
plot(COMBO_loz,pch=20,cex=0.5,xlim=c(-22,-7), ylim=c(-2,2.5)
xlab='M_B (mag)',ylab='M_280 - M_B (mag)',main='COMBO-17 galax
library(MASS)
COMBO_loz_sm=kde2d(COMBO_loz[,1],COMBO_loz[,2], h=c(1.6,0.4)
lims = c(-22,-7,-2,2.5), n=500)
image(COMBO_loz_sm,col=grey(13:0/15),xlab='M_B (mag)',ylab='
,xlim=c(-22,-7), ylim=c(-2,2.5),xaxp=c(-20,-10,2))
Output: Combo-17 galaxy

[Figure: two side-by-side panels, each with M_B (mag) on the horizontal axis (about -20 to -10) and M_280 - M_B (mag) on the vertical axis (about -2 to 2); panel title "COMBO-17 galaxies (z<0.3)"]
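The three remedies for ink overload used on these slides (transparency, subsetting, and a smoothed 2D density) can be tried side by side on simulated data. The following is a rough sketch using base graphics and MASS only, with adjustcolor standing in for the qplot alpha setting; the seed, the grid size n = 100, and the temporary output file are assumptions made for illustration, not values from the lecture.

```r
library(MASS)   # kde2d, as on the COMBO-17 slide

set.seed(202)   # assumed seed, only so the sketch is reproducible
overload <- matrix(rnorm(30000), ncol = 3, nrow = 10000)
x <- overload[, 1]
y <- overload[, 2]
z <- overload[, 3]

# Remedy 1: semi-transparent points, so dense regions look darker
pale <- adjustcolor("black", alpha.f = 0.1)

# Remedy 2: plot only a subset chosen through a third variable
keep <- z > 0.5

# Remedy 3: replace the points by a kernel density estimate on a grid
dens <- kde2d(x, y, n = 100)

# Draw the three versions next to each other in a temporary PDF
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 9, height = 3)
par(mfrow = c(1, 3))
plot(x, y, pch = 20, col = pale, main = "Transparency")
plot(x[keep], y[keep], pch = 20, cex = 0.5, main = "Subset z > 0.5")
image(dens, col = grey(13:0 / 15), xlab = "x", ylab = "y", main = "kde2d density")
dev.off()
```

Writing to a PDF device keeps the sketch usable in a non-interactive session; dropping the pdf() and dev.off() calls draws to the screen instead.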