DATA MINING ASSIGNMENT

GROUP MEMBERS:
1. LOHIT GUPTA
2. SUNIL KUSHWAHA
3. NATARAJAN

In this problem, we analyze three kernel functions: Uniform, Gaussian and Epanechnikov.

Cluster Analysis:

The analysis starts by clustering the dependent variable (the Y values). To get a rough idea of the dispersion of the Y values, we draw a scatterplot. The scatterplot is drawn in Excel and the output is given below:

[Figure: scatterplot of the Y values]

From the above chart, we see that there is a cluster at the lower values of Y, and that the data are most concentrated at moderate values of Y. Hence we choose a centroid-based method of clustering, that is, the K-means clustering method. We are given that the number of clusters should be 4, so we have to select 4 centroids as an initial guess.

Initial Centroids:

We select the initial centroids using the following procedure. Divide the range of the Y values into 4 class intervals of equal width:

Lower Class    Upper Class
0.00002        0.000288
0.000288       0.000555
0.000555       0.000823
0.000823       0.00109

The four initial centroids are then the midpoints of these class intervals:

Lower Class    Upper Class    Initial Centroid
0.00002        0.000288       0.000154
0.000288       0.000555       0.000421
0.000555       0.000823       0.000689
0.000823       0.00109        0.000956

Using these centroids as the initial guess, we start the iterations of the cluster analysis. We have written the code for the cluster analysis in MATLAB and the same is given below:

[MATLAB code listing not reproduced here]

The shift in the centroids after each iteration is given below:

Iteration    Centroid 1    Centroid 2    Centroid 3    Centroid 4
1            0.000154      0.000421      0.000689      0.000956
2            0.000093      0.000505      0.000713      0.000939
3            0.000093      0.000524      0.000754      0.000939
4            0.000093      0.000524      0.000762      0.000954
5            0.000093      0.000524      0.000762      0.000954

From the above table, we see that the centroids do not shift after the 4th iteration, so the algorithm has converged.
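The clustering procedure described above (equal-width class intervals, midpoints as initial centroids, then iterating until the centroids stop moving) can be sketched in Python. This is not the MATLAB code referred to in the text, which is not available here; the function names and the nine sample Y values are hypothetical and chosen only for illustration. Only the range endpoints 0.00002 and 0.00109 are taken from the tables above.

```python
def initial_centroids(y_min, y_max, k=4):
    """Midpoints of k equal-width class intervals spanning [y_min, y_max]."""
    width = (y_max - y_min) / k
    return [y_min + (i + 0.5) * width for i in range(k)]


def assign(y, centroids):
    """Index of the nearest centroid for every value in y."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in y]


def kmeans_1d(y, centroids, max_iter=100):
    """K-means on scalar data; returns (centroids, labels, history)."""
    centroids = list(centroids)
    history = [list(centroids)]          # history[0] is the initial guess
    for _ in range(max_iter):
        labels = assign(y, centroids)
        new = []
        for j, c in enumerate(centroids):
            members = [v for v, lab in zip(y, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            new.append(sum(members) / len(members) if members else c)
        history.append(new)
        if new == centroids:             # no shift: converged
            break
        centroids = new
    return centroids, assign(y, centroids), history


# Hypothetical Y values (NOT the assignment's data), spanning the same range.
y = [0.00002, 0.00005, 0.0001, 0.0005, 0.00055,
     0.00075, 0.0008, 0.00095, 0.00109]

c0 = initial_centroids(min(y), max(y))
final, labels, history = kmeans_1d(y, c0)

for it, cents in enumerate(history, start=1):
    print(it, [round(c, 6) for c in cents])
```

As in the iteration table, the first row printed is the initial guess, and the loop stops at the first iteration whose centroids equal those of the previous one.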
The scatterplot depicting the Y values and the class to which each Y value belongs is given below:

[Figure: scatterplot of the Y values by class]

The frequency distribution of the classes is given below:

CLASS ID    COUNT    PERCENTAGE
1           7        12
2           9        16
3           25       44
4           16       28
TOTAL       57       100

Out of the 57 samples, we have to randomly select 40 observations. This can be done with a Bernoulli experiment: 57 Bernoulli samples are generated with p = 40/57 ≈ 0.702. The training sample is selected using the following rule:

Bernoulli sample = 1: the data point is in the training sample
Bernoulli sample = 0: the data point is in the validation sample

Using this procedure, we collected a training sample of size 40.

Nonparametric Discriminant analysis:

For the given problem, we do not know the underlying distributions of the variables. Hence we approximate those distributions with the help of kernel functions. The kernel functions considered for the analysis are as follows:

Uniform Kernel:

$$K_t(Z) = \begin{cases} \dfrac{1}{V_r(t)} & \text{if } Z' V_t^{-1} Z \le r^2 \\ 0 & \text{otherwise} \end{cases}$$

where $V_r(t) = r^p \, |V_t|^{1/2} \, v_0$, $v_0 = \dfrac{\pi^{p/2}}{\Gamma\left((p/2) + 1\right)}$, and $V_t$ is the diagonal matrix of the covariance matrix of group 't'. ...
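Two of the steps above lend themselves to a short sketch: the Bernoulli selection of the training sample and the evaluation of the uniform kernel. The Python below is an illustration only. The function names, the seed, and the example inputs are hypothetical, and the kernel assumes the standard form in which the normalising volume is r^p |V_t|^(1/2) v_0 with v_0 = π^(p/2) / Γ(p/2 + 1), with V_t passed as the list of its diagonal entries.

```python
import math
import random


def bernoulli_split(n, p, seed=0):
    """Draw n Bernoulli(p) flags; 1 -> training sample, 0 -> validation sample."""
    rng = random.Random(seed)
    flags = [1 if rng.random() < p else 0 for _ in range(n)]
    train = [i for i, f in enumerate(flags) if f == 1]
    valid = [i for i, f in enumerate(flags) if f == 0]
    return train, valid


def uniform_kernel(z, variances, r):
    """Uniform kernel K_t(z) for a diagonal covariance matrix V_t.

    z         : list of p coordinates (difference from an observation)
    variances : the p diagonal entries of V_t
    r         : smoothing radius
    """
    p = len(z)
    # Quadratic form z' V_t^{-1} z for a diagonal V_t.
    quad = sum(zi * zi / vi for zi, vi in zip(z, variances))
    # Volume of the unit ball in p dimensions.
    v0 = math.pi ** (p / 2) / math.gamma(p / 2 + 1)
    # Volume of the ellipsoid {z : z' V_t^{-1} z <= r^2}.
    volume = v0 * r ** p * math.sqrt(math.prod(variances))
    return 1.0 / volume if quad <= r * r else 0.0


# 57 observations, selection probability 40/57 as in the text.
train, valid = bernoulli_split(57, 40 / 57)
print(len(train), len(valid))

# Hypothetical two-variable example with unit variances and r = 1.
print(uniform_kernel([0.3, -0.2], [1.0, 1.0], 1.0))
```

Note that the number of ones in 57 Bernoulli draws is random with expected value 40, so a single draw need not give exactly 40 training observations; how the assignment arrived at exactly 40 is not stated in the text.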
This note was uploaded on 03/13/2012 for the course STATISTICS SI406 taught by Professor Rrj during the Spring '12 term at IIT Kanpur.