Learning Vector Quantization and K-Nearest Neighbor
Jia Li
Department of Statistics, The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/jiali

Learning Vector Quantization

Developed by Kohonen. A package with documentation is available at: http://www.cis.hut.fi/nnrc/nnrcprograms.html

When positioning the prototypes, LVQ uses the information given by the class labels, in contrast to k-means, which selects prototypes without using class labels. It often works better than k-means. The idea is to move a prototype closer to the training samples in its class, and away from samples of different classes.

The Algorithm
1. Start from a set of initial prototypes with classes assigned. Denote the M prototypes by Z = {z1, ..., zM} and their associated classes by C(zm), m = 1, 2, ..., M. The initial prototypes can be provided by k-means.
2. Sweep through the training samples and update zm after visiting each sample. Suppose xi is assigned to the m*-th prototype zm* by the nearest neighbor rule:

   ||xi − zm*|| ≤ ||xi − zm||, for all m ≠ m*, 1 ≤ m ≤ M.

   If gi = C(zm*), move zm* toward the training sample:

   zm* ← zm* + ε (xi − zm*),

   where ε is the learning rate. If gi ≠ C(zm*), move zm* away from the training sample:

   zm* ← zm* − ε (xi − zm*).

3. Step 2 can be repeated a number of times.
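The sweep in step 2 can be sketched in Python with numpy. This is a minimal sketch of the basic LVQ1 update described above, not the Kohonen package's implementation; the function name and interface are my own.

```python
import numpy as np

def lvq_epoch(X, y, prototypes, proto_labels, lr=0.1):
    """One sweep of the LVQ update over the training samples.

    X: (n, p) training samples; y: (n,) class labels.
    prototypes: (M, p) initial prototypes (e.g. from k-means);
    proto_labels: (M,) class C(zm) assigned to each prototype.
    """
    Z = prototypes.astype(float).copy()
    for xi, gi in zip(X, y):
        # assign xi to its nearest prototype zm*
        m = int(np.argmin(np.linalg.norm(Z - xi, axis=1)))
        if proto_labels[m] == gi:
            Z[m] += lr * (xi - Z[m])   # same class: move toward the sample
        else:
            Z[m] -= lr * (xi - Z[m])   # different class: move away
    return Z
```

Repeating the call implements step 3; shrinking `lr` across passes is the usual refinement.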
Experiments

Use the diabetes data set. Use prototypes obtained by k-means as initial prototypes. Use LVQ with ε = 0.1. Results obtained after 1, 2, and 5 passes are shown below. Classification is not guaranteed to improve after adjusting prototypes. One pass with a small ε usually helps, but don't overdo it.

Comments: Fine tuning often helps: select the initial prototypes carefully, and adjust the learning rate ε. Read the package documents for details.
[Figures: LVQ on the diabetes data after successive passes; error rates 27.74%, 27.61%, 27.86%, and 32.37%.]

K-Nearest Neighbor Classifiers

Given a query point x0, find the k training samples x(r), r = 1, ..., k, closest in distance to x0, and then classify by majority vote among the k neighbors. Feature normalization is often performed in preprocessing. Classification boundaries become smoother with larger k.

A Comparative Study (ElemStatLearn)

Two simulated problems. There are 10 independent features Xj, each uniformly distributed on [0, 1]. The two-class 0-1 target variable is defined as follows:

Problem 1 ("easy"): Y = I(X1 > 1/2).

Problem 2 ("difficult"): Y = I( sign( (X1 − 1/2)(X2 − 1/2)(X3 − 1/2) ) > 0 ).

For problem 1 (problem 2), all features other than X1 (X1, X2, X3) are "noise". The Bayes error rates are zero. In each run, 100 samples are used for training and 1000 for testing.

The figure shows the mean and standard deviation of the misclassification error for nearest-neighbors, K-means, and LVQ over 10 realizations (10 simulated data sets), as the tuning parameters change. K-means and LVQ give almost identical results.
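The majority-vote rule and the two simulated problems can be sketched as follows. This is a minimal illustration, not the code used for the study; the function name, the random seed, and the use of Euclidean distance on pre-normalized features are my assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(x0, X, y, k=5):
    """Classify query x0 by majority vote among its k nearest
    training samples (Euclidean distance)."""
    d = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

# Simulated problems: 10 independent Uniform[0, 1] features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 10))
y_easy = (X[:, 0] > 0.5).astype(int)                        # problem 1
y_hard = (np.prod(X[:, :3] - 0.5, axis=1) > 0).astype(int)  # problem 2
```

Only X1 matters for `y_easy`, and only X1, X2, X3 for `y_hard`; the remaining features are noise, which is what drives the difference in the optimal k.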
For the first problem, K-means and LVQ outperform nearest-neighbors, assuming the best choice of tuning parameters for each. For the second problem, they perform similarly. The optimal k for k-nearest-neighbor classification differs significantly between the two problems.

Adaptive Nearest-Neighbor Methods

When the dimension is high, data become relatively sparse. Implicit in nearest-neighbor classification is the assumption that the class probabilities are roughly constant within the neighborhood, so that simple averages give good estimates. In high-dimensional space, the neighborhood represented by the few nearest samples may not be local.

Consider N data points uniformly distributed in the unit cube [−1/2, 1/2]^p. Let R be the radius of a 1-nearest-neighborhood centered at the origin. Then

median(R) = v_p^{−1/p} ( 1 − (1/2)^{1/N} )^{1/p},

where v_p r^p is the volume of the sphere of radius r in p dimensions. The median radius quickly approaches 0.5, the distance to the edge of the cube, as the dimension increases.

The remedy: adjust the distance metric locally, so that the resulting neighborhoods stretch out in directions along which the class probabilities don't change much.

Discriminant Adaptive Nearest-Neighbor (DANN)

At each query point, a neighborhood of, say, 50 points is formed. Class probabilities are NOT assumed constant in this neighborhood. The neighborhood is used only to decide how to define the adapted metric.
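The median-radius formula above is easy to evaluate numerically; a quick sketch (the function name is mine, and v_p = pi^{p/2} / Gamma(p/2 + 1) is the standard unit-ball volume constant):

```python
import math

def median_radius(p, N):
    """Median distance from the origin to the nearest of N points
    uniform in [-1/2, 1/2]^p:
        median(R) = v_p^{-1/p} * (1 - (1/2)^{1/N})^{1/p},
    with v_p = pi^{p/2} / Gamma(p/2 + 1)."""
    log_vp = (p / 2) * math.log(math.pi) - math.lgamma(p / 2 + 1)
    return math.exp(-log_vp / p) * (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

# Median radius grows rapidly with dimension for fixed N.
for p in (2, 5, 10, 20):
    print(p, round(median_radius(p, 500), 3))
```

Even with N = 500 samples, by p = 10 the median 1-nearest-neighbor radius is already near 0.5, the distance to the edge of the cube, so the "neighborhood" is no longer local.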
After the metric is decided, a normal k-nearest-neighbor rule is applied to classify the query. The metric changes with the query.

The DANN metric at a query point x0 is defined by

D(x, x0) = (x − x0)^T Σ (x − x0),

where

Σ = W^{−1/2} [ (W^{−1/2})^T B W^{−1/2} + ε I ] (W^{−1/2})^T = W^{−1/2} [ B* + ε I ] (W^{−1/2})^T.

W is the pooled within-class covariance matrix and B is the between-class covariance matrix; B* = (W^{−1/2})^T B W^{−1/2} is the between-class covariance of the sphered data.

Intuition

We compute W and B as in LDA. Recall we did a similar computation when deriving discriminant coordinates. Take the eigen-decomposition B* = V D_B V^T. Then

Σ = W^{−1/2} [ B* + ε I ] (W^{−1/2})^T
  = W^{−1/2} [ V D_B V^T + ε I ] (W^{−1/2})^T
  = W^{−1/2} V (D_B + ε I) (W^{−1/2} V)^T.

Note that the column vectors of W^{−1/2} V are simply the discriminant coordinates. The term ε I is added to avoid using samples far away from the query point.
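The construction of Σ from a local neighborhood can be sketched as follows. This is a minimal sketch assuming a well-conditioned W (no shrinkage of W toward its diagonal, which practical DANN implementations add); the function names and the use of class-proportion weights are my assumptions.

```python
import numpy as np

def dann_sigma(X_nbr, y_nbr, eps=1.0):
    """DANN metric matrix Sigma built from the ~50 neighborhood points.

    Sigma = W^{-1/2} [ W^{-1/2} B W^{-1/2} + eps*I ] W^{-1/2},
    with W the pooled within-class covariance and B the
    between-class covariance of the neighborhood.
    """
    n, p = X_nbr.shape
    grand_mean = X_nbr.mean(axis=0)
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for c in np.unique(y_nbr):
        Xc = X_nbr[y_nbr == c]
        pi_c = len(Xc) / n
        mc = Xc.mean(axis=0)
        W += pi_c * np.cov(Xc, rowvar=False, bias=True)
        B += pi_c * np.outer(mc - grand_mean, mc - grand_mean)
    # symmetric inverse square root of W (spheres the data)
    evals, evecs = np.linalg.eigh(W)
    W_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    B_star = W_inv_half @ B @ W_inv_half   # between-class cov, sphered data
    return W_inv_half @ (B_star + eps * np.eye(p)) @ W_inv_half

def dann_dist(x, x0, Sigma):
    """Squared DANN distance D(x, x0) = (x - x0)^T Sigma (x - x0)."""
    d = x - x0
    return float(d @ Sigma @ d)
```

Distances grow fastest along the significant discriminant directions, so the k-nearest-neighbor ball under this metric is squeezed in those directions and stretched in the others.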
Geometric interpretation: To compute the DANN metric, x − x0 is projected onto the discriminant coordinates. The projection values on the significant discriminant coordinates are magnified; those on the insignificant DCs are shrunk. The implication is that the neighborhood is stretched in the directions of the insignificant DCs and squeezed in those of the significant ones. (W^{−1/2})^T is introduced to sphere the data so that the within-class covariance matrix is the identity matrix.

Significant discriminant coordinates represent directions in which the class probabilities change substantially. Hence, when we form a neighborhood, we want it to have a small span in these directions. Conversely, we want the neighborhood to have a large span in directions in which the class probabilities don't change much. To summarize: we want to form a neighborhood that contains as many samples as possible but at the same time has approximately constant class probabilities.

Global Dimension Reduction

At each training sample xi, the between-centroids sum of squares matrix Bi is computed, using a neighborhood of, say, 50 points. Average Bi over all training samples:
B̄ = (1/N) Σ_{i=1}^{N} B_i.

Let v1, v2, ..., vp be the eigenvectors of the matrix B̄, ordered from largest to smallest eigenvalue λ_k. Then a rank-L (L < p) approximation to B̄ is
B̄[L] = Σ_{l=1}^{L} λ_l v_l v_l^T.

B̄[L] is optimal in the sense of solving the least-squares problem:
min_{rank(M)=L} trace[ (B̄ − M)^2 ],

and hence also solves:
min_{rank(M)=L} Σ_{i=1}^{N} trace[ (B_i − M)^2 ].
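The averaging and rank-L truncation above can be sketched with an eigen-decomposition. This is a minimal sketch; the function name is mine, and computing the local B_i matrices themselves is omitted.

```python
import numpy as np

def best_rank_L(B_list, L):
    """Average the local between-centroid matrices B_i and return the
    rank-L least-squares approximation to their mean.

    B_bar = (1/N) sum_i B_i;  B_bar[L] = sum_{l<=L} lambda_l v_l v_l^T
    minimizes sum_i trace[(B_i - M)^2] over rank-L symmetric M.
    """
    B_bar = np.mean(B_list, axis=0)
    evals, evecs = np.linalg.eigh(B_bar)   # ascending eigenvalues
    top = np.argsort(evals)[::-1][:L]      # indices of the L largest
    V = evecs[:, top]                      # reduced subspace basis
    return V @ np.diag(evals[top]) @ V.T, V
```

The columns of `V` span the global subspace onto which the data can be projected before running the nearest-neighbor rule.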
This note was uploaded on 02/04/2012 for the course STAT 557 taught by Professor Jia Li during the Fall '09 term at Pennsylvania State University, University Park.