
…s to run on a Sun Sparc5 workstation. This is comparable to the time it takes to run PCA.

Figure 3.3: A scatter plot of a 400 point sample from a two-dimensional density. The density is a mixture of two horizontally stretched Gaussians. The PCA and ECA principal axes are also plotted as vectors from the origin.

In general, PCA does not find the highest-entropy projection of non-Gaussian densities. For more complex densities the PCA axis is very different from the entropy-maximizing axis. Figure 3.3 shows a density for which the PCA and ECA axes are very different. The PCA axis, which is vertical, spreads the points in the sample as far apart as possible. The ECA axis, which is oblique, spreads nearby points in the sample as far apart as possible. The resulting densities, Y_PCA and Y_ECA, are graphed in Figure 3.4. The PCA density is very tightly peaked; the ECA density is broadly spread out. Though the final variance of Y_PCA is larger, 2.005 vs. 1.626, the entropy of the Y_ECA distribution is much higher: h(Y_PCA) = -0.17 and h(Y_ECA) = 1.61.

Figure 3.4: The Parzen density estimates of Y_PCA and Y_ECA.

Linsker has argued that the PCA axis separates the clusters of a distribution (Linsker, 1988). To justify this claim, he uses figures much like Figure 3.3 and Figure 3.4. These graphs show the PCA axis projecting points from separated clusters so that they remain separate. It is then proposed that the PCA axis is useful for cluster classification of high-dimensional data; in other words, that high-dimensional data can be projected down into a low-dimensional space without perturbing the cluster structure. In general this is not true. PCA only separates clusters when the variance between clusters is higher than the variance within clusters.

Ironically, it is the minimum entropy projection that should separate clusters well. Let us assume that each cluster is generated from a prototypical point that has been perturbed by random noise. If there is very little noise, the sample points associated with a cluster prototype will be clustered together tightly. The resulting density is sharply peaked around the cluster prototypes and has low entropy. Additional noise acts to spread out each cluster, adding entropy to the density. Most of the entropy in such a density arises from the noise, not the clusters. An entropy-maximizing algorithm will find a projection vector that maximizes the projection of the noise. An entropy-minimizing algorithm, on the other hand, should, if possible, find a projection that is perpendicular to the noise.

ECA can be used to find both the entropy-maximizing (ECA-MAX) and the entropy-minimizing (ECA-MIN) axis. Figure 3.5 shows a distribution where the noise, or spread, of the clusters is perpendicular to the axis that separates the clusters. As a result, the PCA axis does not separate these clusters. The ECA axis shown is the minimum entropy axis, which is obtained by running the EMMA algorithm with a negative learning rate. The ECA-MIN axis separates the clusters much better than the PCA axis (see Figure 3.6).

To provide further intuition regarding the behavior of ECA, we have run ECA, PCA, and two related procedures, BCM and BINGO, on the same density.
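Before turning to BCM and BINGO, the remark above, that ECA-MIN is obtained by running EMMA with a negative learning rate, can be made concrete. The following is a rough sketch, not the thesis's code: the batch size, Parzen width sigma, step count, and the renormalization of v to unit length are all assumed details. Each step draws two small samples, forms a Parzen estimate of the projected density, and takes a gradient step on the empirical entropy h(Y); flipping the sign of the rate switches ECA-MAX to ECA-MIN.

```python
# Rough sketch of an EMMA-style stochastic gradient search for an ECA axis.
# Assumed details (illustrative choices, not from the thesis): two resampled
# batches per step, Gaussian Parzen window of width sigma, and renormalizing
# v to unit length after each update (only the direction of v matters).
import numpy as np

def eca_axis(x, rate, sigma=0.25, batch=25, steps=3000, seed=1):
    """rate > 0 ascends h(Y) (ECA-MAX); rate < 0 descends it (ECA-MIN)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=x.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(steps):
        a = x[rng.choice(len(x), batch, replace=False)]   # sample A
        b = x[rng.choice(len(x), batch, replace=False)]   # sample B
        d = (a @ v)[:, None] - (b @ v)[None, :]           # y_a - y_b
        w = np.exp(-d**2 / (2 * sigma**2))                # Parzen kernel weights
        diff = a[:, None, :] - b[None, :, :]              # x_a - x_b
        # Gradient of h(Y) ~= -(1/|A|) sum_a log[(1/|B|) sum_b g(y_a - y_b)];
        # the tiny epsilon guards against an underflowing denominator.
        coef = w * d / (sigma**2 * (w.sum(axis=1, keepdims=True) + 1e-300))
        grad = (coef[:, :, None] * diff).sum(axis=1).mean(axis=0)
        v += rate * grad
        v /= np.linalg.norm(v)
    return v

# Density in the spirit of Figure 3.5: the clusters' spread (horizontal) is
# perpendicular to the axis that separates them (vertical).
rng = np.random.default_rng(0)
x = np.vstack([rng.normal([0.0,  1.0], [2.5, 0.3], size=(200, 2)),
               rng.normal([0.0, -1.0], [2.5, 0.3], size=(200, 2))])
print("ECA-MAX axis:", eca_axis(x, rate=+0.05))  # expect roughly horizontal
print("ECA-MIN axis:", eca_axis(x, rate=-0.05))  # expect roughly vertical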
BCM is a learning rule that was originally proposed to explain the development of receptive field patterns in visual cortex (Bienenstock et al., 1982). More recently it has been argued that the rule finds projections that are far from Gaussian (Intrator and Cooper, 1992). Under a limited set of conditions, BCM finds the minimum entropy projection. BINGO was proposed to find axes along which there is a bimodal distribution (Schraudolph and Sejnowski, 1993).

Figure 3.5: A scatter plot of a 400 point sample from a two-dimensional density. The density is a mixture of two horizontally stretched Gaussians. The PCA and ECA minimum entropy axes are also plotted as vectors from the origin.

Figure 3.7 displays a 400 point sample and the five different projection axes found by the algorithms discussed above. The density is a mixture of two clusters. Each cluster has high kurtosis in the horizontal direction. The oblique axis projects the data so that it is most uniform and hence has the highest entropy; ECA-MAX finds this axis. Along the vertical axis the data is clustered and has low entropy; ECA-MIN finds this axis. Interestingly, because the vertical axis has high variance, PCA finds the entropy-minimizing axis. BCM, while it may find minimum entropy projections for some densities, is attracted to the kurtosis along the horizontal axis; the horizontal axis neither minimizes nor maximizes entropy. Finally, BINGO successfully discovers that the vertical axis is very bimodal.
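The five-way comparison of Figure 3.7 can be approximated in the same spirit. In this minimal sketch the two-cluster, horizontally kurtotic density is an illustrative stand-in (Laplace horizontal spread), and the axes are found by brute-force angle scans over the relevant criteria; these scans are not the actual BCM or BINGO learning rules, only proxies for the quantities those rules are said to favor.

```python
# Minimal sketch approximating the Figure 3.7 comparison. Density parameters,
# the Parzen width, and the angle-scan "stand-ins" are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# Two clusters separated vertically; within each cluster the horizontal
# spread is heavy-tailed (high kurtosis), here a Laplace distribution.
n = 200
horiz = rng.laplace(0.0, 1.0, size=(2 * n, 1))
vert = np.vstack([rng.normal( 1.5, 0.3, size=(n, 1)),
                  rng.normal(-1.5, 0.3, size=(n, 1))])
x = np.hstack([horiz, vert])

def parzen_entropy(y, sigma=0.25):
    """Leave-one-out Gaussian Parzen estimate of h(Y) = -E[log p(Y)]."""
    d2 = (y[:, None] - y[None, :]) ** 2
    k = np.exp(-d2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)
    return -np.mean(np.log(k.sum(axis=1) / (len(y) - 1)))

def excess_kurtosis(y):
    z = (y - y.mean()) / y.std()
    return np.mean(z**4) - 3.0

# Scan unit projection vectors over half the circle.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
units = np.stack([np.cos(angles), np.sin(angles)], axis=1)
h = np.array([parzen_entropy(x @ u) for u in units])
k4 = np.array([excess_kurtosis(x @ u) for u in units])

# PCA axis: leading eigenvector of the sample covariance.
eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
print("PCA axis          :", eigvecs[:, np.argmax(eigvals)])
print("max-entropy axis  :", units[np.argmax(h)])   # ECA-MAX criterion
print("min-entropy axis  :", units[np.argmin(h)])   # ECA-MIN criterion
print("max-kurtosis axis :", units[np.argmax(k4)])  # what attracts BCM
```

Per the discussion above, the entropy scan should place ECA-MAX on an oblique axis and ECA-MIN near the vertical; PCA picks the vertical axis because it has the highest variance; and the kurtosis scan, a crude proxy for what attracts BCM, prefers the horizontal axis. The vertical axis is also the bimodal one that BINGO is designed to detect.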