Learning Vector Quantization and K-Nearest Neighbor


Jia Li
Department of Statistics, The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/jiali

Learning Vector Quantization

- Developed by Kohonen. A package with documentation is available at http://www.cis.hut.fi/nnrc/nnrc-programs.html.
- When positioning the prototypes, LVQ uses the information given by the class labels, in contrast to k-means, which selects prototypes without using class labels.
- LVQ often works better than k-means.
- The idea is to move a prototype toward the training samples of its own class and away from samples of different classes.

The Algorithm

1. Start from a set of initial prototypes with classes assigned. Denote the M prototypes by Z = {z_1, ..., z_M} and their associated classes by C(z_m), m = 1, 2, ..., M. The initial prototypes can be provided by k-means.
2. Sweep through the training samples and update z_m after visiting each sample. Suppose x_i is assigned to the m-th prototype z_m by the nearest-neighbor rule:
       ||x_i - z_m|| <= ||x_i - z_{m'}||,  m' != m,  1 <= m' <= M.
   If g_i = C(z_m), move z_m toward the training sample:
       z_m <- z_m + ε (x_i - z_m),
   where ε is the learning rate. If g_i != C(z_m), move z_m away from the training sample:
       z_m <- z_m - ε (x_i - z_m).
3. Step 2 can be repeated a number of times.

A short code sketch of this update rule follows the experiment notes below.

Experiments

- Use the diabetes data set.
- Use prototypes obtained by k-means as the initial prototypes.
- Use LVQ with ε = 0.1. Results obtained after 1, 2, and 5 passes are shown in the figures (not included in this preview).
- Error rates reported in the figures: 27.74%, 27.61%, 27.86%, 32.37%.
- Classification is not guaranteed to improve after adjusting the prototypes. One pass with a small ε usually helps, but do not overdo it.
- Comments: fine tuning often helps. Select the initial prototypes carefully, adjust the learning rate ε, and read the package documentation for details.
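As a concrete illustration of step 2, here is a minimal Python sketch of one LVQ pass. It is not the Kohonen LVQ_PAK implementation referenced above; the function name lvq_pass, the argument layout, and the use of squared Euclidean distance for the nearest-prototype search are assumptions made for illustration.

    import numpy as np

    def lvq_pass(X, g, Z, C, eps=0.1):
        """One LVQ sweep over the training data (illustrative sketch).

        X: (n, p) training samples; g: (n,) class labels;
        Z: (M, p) prototypes; C: (M,) prototype classes; eps: learning rate.
        """
        Z = Z.copy()
        for x_i, g_i in zip(X, g):
            # nearest prototype by (squared) Euclidean distance
            m = np.argmin(((Z - x_i) ** 2).sum(axis=1))
            if C[m] == g_i:
                Z[m] += eps * (x_i - Z[m])   # same class: move toward the sample
            else:
                Z[m] -= eps * (x_i - Z[m])   # different class: move away
        return Z

In the spirit of the experiments above, one would initialize Z and C from a k-means run within each class and call lvq_pass a small number of times with eps around 0.1.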
K-Nearest Neighbor Classifiers

- Given a query point x_0, find the k training samples x_(r), r = 1, ..., k, closest in distance to x_0, and then classify by majority vote among the k neighbors.
- Feature normalization is often performed in pre-processing.
- Classification boundaries become smoother as k increases.

A Comparative Study (ElemStatLearn)

- Two simulated problems. There are 10 independent features X_j, each uniformly distributed on [0, 1]. The two-class 0-1 target variable is defined as follows:
      problem 1 ("easy"):      Y = I( X_1 > 1/2 );
      problem 2 ("difficult"): Y = I( sign( prod_{j=1}^{3} (X_j - 1/2) ) > 0 ).
- For problem 1 (problem 2), all features other than X_1 (X_1, X_2, X_3) are "noise".
- The Bayes error rates are zero.
- In each run, 100 samples are used for training and 1000 for testing.
- The figure (not included) shows the mean and standard deviation of the misclassification error for nearest-neighbors, k-means, and LVQ over 10 realizations (10 simulated data sets), as the tuning parameters change.
- K-means and LVQ give almost identical results.
- For the first problem, k-means and LVQ outperform nearest-neighbors, assuming the best choice of tuning parameters for each. For the second problem, they perform similarly.
- The optimal k for k-nearest-neighbor classification differs significantly between the two problems.

Adaptive Nearest-Neighbor Methods

- When the dimension is high, data become relatively sparse.
- Implicit in nearest-neighbor classification is the assumption that the class probabilities are roughly constant within the neighborhood, so that simple averages give good estimates.
- In a high-dimensional space, the neighborhood represented by the few nearest samples may not be local.
- Consider N data points uniformly distributed in the unit cube [-1/2, 1/2]^p. Let R be the radius of a 1-nearest-neighborhood centered at the origin. Then
      median(R) = v_p^{-1/p} ( 1 - (1/2)^{1/N} )^{1/p},
  where v_p r^p is the volume of the sphere of radius r in p dimensions.
- The median radius quickly approaches 0.5, the distance to the edge of the cube, as the dimension increases.
- The remedy is to adjust the distance metric locally, so that the resulting neighborhoods stretch out in directions along which the class probabilities do not change much.

Discriminant Adaptive Nearest-Neighbor (DANN)

- At each query point, a neighborhood of, say, 50 points is formed. Class probabilities are NOT assumed constant in this neighborhood.
- This neighborhood is used only to decide how to define the adapted metric. After the metric is decided, a normal k-nearest-neighbor rule is applied to classify the query.
- The metric changes with the query.
- The DANN metric at a query point x_0 is defined by
      D(x, x_0) = (x - x_0)^T Σ (x - x_0),
  where
      Σ = W^{-1/2} [ (W^{-1/2})^T B W^{-1/2} + ε I ] (W^{-1/2})^T
        = W^{-1/2} [ B* + ε I ] (W^{-1/2})^T.
  W is the pooled within-class covariance matrix and B is the between-class covariance matrix; B* is the between-class covariance after sphering by W^{-1/2}.
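To make the metric concrete, the sketch below computes Σ from a query's neighborhood using numpy. The function names (inv_sqrt, dann_metric, dann_distance), the default ε = 1, and the small eigenvalue floor added to W are assumptions made for illustration, not code from a specific package.

    import numpy as np

    def inv_sqrt(S):
        """W^{-1/2} of a symmetric positive (semi)definite matrix via eigendecomposition."""
        vals, vecs = np.linalg.eigh(S)
        vals = np.maximum(vals, 1e-8)              # guard against near-zero eigenvalues
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    def dann_metric(Xn, yn, eps=1.0):
        """Sketch of the DANN matrix Sigma from the neighbors Xn (n, p) with labels yn (n,)."""
        n, p = Xn.shape
        xbar = Xn.mean(axis=0)
        W = np.zeros((p, p))                       # pooled within-class covariance
        B = np.zeros((p, p))                       # between-class covariance
        for c in np.unique(yn):
            Xc = Xn[yn == c]
            mc = Xc.mean(axis=0)
            W += (Xc - mc).T @ (Xc - mc) / n
            B += (len(Xc) / n) * np.outer(mc - xbar, mc - xbar)
        W_ihalf = inv_sqrt(W)
        B_star = W_ihalf @ B @ W_ihalf.T           # between-class covariance after sphering
        return W_ihalf @ (B_star + eps * np.eye(p)) @ W_ihalf.T

    def dann_distance(x, x0, Sigma):
        d = x - x0
        return float(d @ Sigma @ d)

To classify a query, one would take its (say) 50 nearest neighbors under the ordinary Euclidean metric, compute Sigma from them as above, and then run a standard k-nearest-neighbor vote using dann_distance.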
Intuition

- We compute W and B as in LDA. Recall that we did a similar computation when deriving the discriminant coordinates.
- Eigen-decompose B* = V D_B V^T. Then
      Σ = W^{-1/2} [ B* + ε I ] (W^{-1/2})^T
        = W^{-1/2} [ V D_B V^T + ε I ] (W^{-1/2})^T
        = W^{-1/2} V ( D_B + ε I ) ( W^{-1/2} V )^T.
- Note that the column vectors of W^{-1/2} V are simply the discriminant coordinates.
- ε I is added to avoid using samples far away from the query point.

Geometric interpretation

- To compute the DANN metric, x - x_0 is projected onto the discriminant coordinates.
- The projection values on the significant discriminant coordinates are magnified; those on the insignificant DCs are shrunk.
- The implication is that the neighborhood is stretched in the directions of the insignificant DCs and squeezed in those of the significant ones.
- (W^{-1/2})^T is introduced to sphere the data so that the within-class covariance matrix is the identity matrix.
- Significant discriminant coordinates represent directions in which the class probabilities change substantially. Hence, when we form a neighborhood, we want it to have a small span in these directions.
- Conversely, we want the neighborhood to have a large span in directions along which the class probabilities do not change much.
- To summarize: we want to form a neighborhood that contains as many samples as possible but at the same time has approximately constant class probabilities.

Global Dimension Reduction

- At each training sample x_i, the between-centroids sum-of-squares matrix B_i is computed, using a neighborhood of, say, 50 points.
- Average B_i over all training samples:
      B̄ = (1/N) sum_{i=1}^{N} B_i.
- Let v_1, v_2, ..., v_p be the eigenvectors of the matrix B̄, ordered from largest to smallest eigenvalue θ_k. Then a rank-L approximation to B̄, with L < p, is
      B̄_[L] = sum_{l=1}^{L} θ_l v_l v_l^T.
- B̄_[L] is optimal in the sense of solving the least-squares problem
      min_{rank(M)=L} trace[ (B̄ - M)^2 ],
  and hence also solves
      min_{rank(M)=L} sum_{i=1}^{N} trace[ (B_i - M)^2 ].
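Below is a rough numpy sketch of the averaging and rank-L truncation described above. The function name, the brute-force neighbor search, the 50-point neighborhood default, and the class-proportion weighting of the centroid terms are assumptions made for illustration.

    import numpy as np

    def global_reduction_subspace(X, y, L, n_neighbors=50):
        """Average the local between-centroids matrices B_i and keep the top-L eigenvectors.

        X: (N, p) training samples; y: (N,) class labels; returns a (p, L) basis.
        """
        N, p = X.shape
        B_bar = np.zeros((p, p))
        for i in range(N):
            # neighborhood of x_i by brute-force Euclidean distance
            idx = np.argsort(((X - X[i]) ** 2).sum(axis=1))[:n_neighbors]
            Xn, yn = X[idx], y[idx]
            xbar = Xn.mean(axis=0)
            for c in np.unique(yn):
                Xc = Xn[yn == c]
                mc = Xc.mean(axis=0)
                # class-proportion weighting is an assumed convention
                B_bar += (len(Xc) / len(yn)) * np.outer(mc - xbar, mc - xbar)
        B_bar /= N
        vals, vecs = np.linalg.eigh(B_bar)
        order = np.argsort(vals)[::-1]             # largest eigenvalues first
        return vecs[:, order[:L]]                  # columns span the rank-L subspace

Projecting the data onto these L directions before running the (adaptive) nearest-neighbor rule is the intended use of the reduction.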

