s to run on Sun Sparc5 workstation.
3.4. PRINCIPAL COMPONENTS ANALYSIS AND INFORMATION (AI-TR 1548)
Figure 3.3: A scatter plot of a 400 point sample from a two dimensional density. The density
is a mixture of two horizontally stretched Gaussians. The PCA and ECA principal axes are
also plotted as vectors from the origin. [Plot not reproduced; both axes range from -4 to 4.]
This is comparable to the time it takes to run PCA.
In general, PCA does not find the highest entropy projection of non-Gaussian densities.
For more complex densities the PCA axis is very different from the entropy maximizing
axis. Figure 3.3 shows a density for which the PCA and ECA axes are very different. The
PCA axis, which is vertical, spreads the points in the sample as far apart as possible. The
ECA axis, which is oblique, spreads nearby points in the sample as far apart as possible.
The resulting densities, $Y_{PCA}$ and $Y_{ECA}$, are graphed in Figure 3.4. The PCA density is
very tightly peaked, whereas the ECA density is broadly spread out. Though the final variance
of $Y_{PCA}$ is larger, 2.005 vs. 1.626, the entropy of the $Y_{ECA}$ distribution is much higher:
$h(Y_{PCA}) = -0.17$ and $h(Y_{ECA}) = 1.61$.
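The contrast between variance and entropy can be reproduced with a small experiment. The following is a minimal sketch, not the EMMA algorithm: it swaps in a brute-force search over projection angles and a leave-one-out Parzen estimate of entropy. The cluster centers, noise levels, kernel width (`kernel_std`), and random seed are all assumptions chosen for illustration, not values from the text.

```python
import numpy as np


def parzen_entropy(y, kernel_std=0.2):
    """Leave-one-out Parzen-window estimate of differential entropy, in nats."""
    y = np.asarray(y, dtype=float)
    diff = y[:, None] - y[None, :]
    kernel = np.exp(-0.5 * (diff / kernel_std) ** 2) / (kernel_std * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(kernel, 0.0)  # leave each point out of its own density estimate
    density = kernel.sum(axis=1) / (y.size - 1)
    return float(-np.mean(np.log(np.maximum(density, np.finfo(float).tiny))))


def sample_two_clusters(n=400, centers=((0.0, 1.5), (0.0, -1.5)), noise_std=(1.0, 0.1), seed=0):
    """Two horizontally stretched Gaussian clusters (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    points = np.asarray(centers)[labels] + rng.normal(size=(n, 2)) * np.asarray(noise_std)
    return points, labels


def pca_axis(points):
    """Unit eigenvector of the sample covariance with the largest eigenvalue."""
    _, eigenvectors = np.linalg.eigh(np.cov(points, rowvar=False))
    return eigenvectors[:, -1]  # eigh returns eigenvalues in ascending order


def max_entropy_axis(points, n_angles=180):
    """Brute-force search over directions; a stand-in for ECA, not EMMA itself."""
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    axes = np.column_stack([np.cos(angles), np.sin(angles)])
    entropies = [parzen_entropy(points @ axis) for axis in axes]
    return axes[int(np.argmax(entropies))]


if __name__ == "__main__":
    points, _ = sample_two_clusters()
    for name, axis in (("PCA", pca_axis(points)), ("max entropy", max_entropy_axis(points))):
        y = points @ axis
        print(f"{name:12s} variance={y.var():.3f} entropy={parzen_entropy(y):.3f}")
```

With these assumed parameters the PCA axis should have the larger projected variance by construction, while the searched axis should have the larger estimated entropy. The exact numbers depend on the sample and the kernel width.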
Linsker has argued that the PCA axis separates the clusters of a distribution (Linsker,
1988). To justify this claim, he uses figures much like Figure 3.3 and Figure 3.4. These
graphs show the PCA axis projecting points from separated clusters so that they remain
separate. It is then proposed that the PCA axis is useful for cluster classification of high
dimensional data. In other words, the proposal is that high dimensional data can be projected
down into a low dimensional space without perturbing the cluster structure. In general this is not true.
PCA only separates clusters when the variance between clusters is higher than the variance
within clusters.
Paul A. Viola. CHAPTER 3. EMPIRICAL ENTROPY MANIPULATION AND STOCHASTIC GRADIENT DESCENT
Figure 3.4: The Parzen density estimates of $Y_{PCA}$ and $Y_{ECA}$. [Plot not reproduced; the horizontal axis ranges from -4 to 4.]
Ironically, it is the minimum entropy projection that should separate clusters well. Let
us assume that each cluster is generated from a prototypical point that has been perturbed
by random noise. If there is very little noise, the sample points associated with a cluster
prototype will be clustered together tightly. The resulting density is sharply peaked around
the cluster prototypes and has low entropy. Additional noise acts to spread out each cluster,
adding entropy to the density. Most of the entropy in this density arises from the noise, not
the clusters. An entropy maximizing algorithm will find a projection vector that maximizes
the projection of the noise. On the other hand, an entropy minimizing algorithm should, if
possible, find a projection that is perpendicular to the noise. ECA can be used to find both
the entropy maximizing (ECA-MAX) and the entropy minimizing (ECA-MIN) axes.
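The argument can be made concrete under simplifying assumptions that are ours rather than the text's: $K$ equally likely cluster prototypes, Gaussian noise with standard deviation $\sigma$ along the projection, and clusters far enough apart that their overlap is negligible. The entropy of the projected density is then approximately

\[
h(Y) \approx \log K + \tfrac{1}{2} \log\left(2 \pi e \sigma^{2}\right)
\]

The first term comes from cluster identity and can never exceed $\log K$. The second term is the entropy of the noise: it grows without bound as $\sigma$ increases and tends to $-\infty$ as $\sigma \to 0$. As an illustration only, $K = 2$ and $\sigma = 0.1$ give roughly $0.693 + 1.419 - 2.303 \approx -0.19$ nats. A projection perpendicular to the noise shrinks $\sigma$ and therefore lowers $h(Y)$, which is why an entropy minimizing algorithm favors it.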
Figure 3.5 shows a distribution where the noise, or spread, of the clusters is perpendicular
to the axis that separates the clusters. As a result, the PCA axis does not separate these
clusters. The ECA axis shown is the minimum entropy axis, which is obtained by running the
EMMA algorithm with a negative learning rate. The ECA-MIN axis separates the clusters
much better than the PCA axis (see Figure 3.6).
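The situation described above can be sketched numerically. This example is a stand-in under stated assumptions: instead of running EMMA with a negative learning rate, it searches projection angles exhaustively for the lowest leave-one-out Parzen entropy. The cluster centers, noise levels, kernel width, seed, and the `separation` score (distance between projected cluster means divided by pooled spread) are hypothetical choices for illustration.

```python
import numpy as np


def parzen_entropy(y, kernel_std=0.2):
    """Leave-one-out Parzen-window estimate of differential entropy, in nats."""
    y = np.asarray(y, dtype=float)
    diff = y[:, None] - y[None, :]
    kernel = np.exp(-0.5 * (diff / kernel_std) ** 2) / (kernel_std * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(kernel, 0.0)  # leave each point out of its own density estimate
    density = kernel.sum(axis=1) / (y.size - 1)
    return float(-np.mean(np.log(np.maximum(density, np.finfo(float).tiny))))


def separation(y, labels):
    """Distance between projected cluster means, in units of pooled spread."""
    a, b = y[labels == 0], y[labels == 1]
    pooled_std = np.sqrt(0.5 * (a.var() + b.var()))
    return float(abs(a.mean() - b.mean()) / pooled_std)


def compare_axes(n=400, seed=1):
    """Return {name: (axis, separation)} for the PCA and minimum entropy axes."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    # Assumed layout: clusters separated vertically, noise mostly horizontal.
    centers = np.array([[0.0, 1.0], [0.0, -1.0]])
    points = centers[labels] + rng.normal(size=(n, 2)) * np.array([2.0, 0.2])

    # PCA axis: eigenvector of the sample covariance with the largest eigenvalue.
    _, eigenvectors = np.linalg.eigh(np.cov(points, rowvar=False))
    pca = eigenvectors[:, -1]

    # Minimum entropy axis by brute-force angle search (not EMMA).
    angles = np.linspace(0.0, np.pi, 180, endpoint=False)
    axes = np.column_stack([np.cos(angles), np.sin(angles)])
    min_axis = axes[int(np.argmin([parzen_entropy(points @ a) for a in axes]))]

    return {
        "PCA": (pca, separation(points @ pca, labels)),
        "min entropy": (min_axis, separation(points @ min_axis, labels)),
    }


if __name__ == "__main__":
    for name, (axis, score) in compare_axes().items():
        print(f"{name:12s} axis={np.round(axis, 2)} separation={score:.2f}")
```

Under these assumptions the PCA axis should follow the horizontal noise, while the minimum entropy axis should be close to vertical and give a much larger separation score.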
Figure 3.5: A scatter plot of a 400 point sample from a two dimensional density. The density
is a mixture of two horizontally stretched Gaussians. The PCA and ECA minimum entropy
axes (labeled PCA and ECA Min) are also plotted as vectors from the origin. [Plot not reproduced; both axes range from -4 to 4.]
To provide further intuition regarding the behavior of ECA we have run ECA, PCA, and
two related procedures, BCM and BINGO, on the same density. BCM is a learning rule that
was originally proposed to explain development of receptive field patterns in visual cortex
(Bienenstock et al., 1982). More recently it has been argued that the rule finds projections
that are far from Gaussian (Intrator and Cooper, 1992). Under a limited set of conditions
BCM finds the minimum entropy projection. BINGO was proposed to find axes along which
there is a bimodal distribution (Schraudolph and Sejnowski, 1993).
Figure 3.7 displays a 400 point sample and the five different projection axes found by
the algorithms discussed above. The density is a mixture of two clusters.
Each cluster has high kurtosis in the horizontal direction. The oblique axis projects the data
so that it is most uniform and hence has the highest entropy; ECA-MAX finds this axis.
Along the vertical axis the data is clustered and has low entropy; ECA-MIN finds this axis.
Interestingly, because the vertical axis has high variance, PCA finds the entropy minimizing
axis. BCM, while it may find minimum entropy projections for some densities, is attracted to
the kurtosis along the horizontal axis. The horizontal axis neither minimizes nor maximizes
entropy. Finally, BINGO successfully discovers that the vertical axis is very bimodal....
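To illustrate what "high kurtosis in the horizontal direction" means for such a sample, the sketch below measures the excess kurtosis of the horizontal and vertical projections. It does not implement BCM or BINGO. The text does not say which heavy-tailed distribution was used, so the Laplace horizontal noise, the cluster positions, and the seed are assumptions.

```python
import numpy as np


def excess_kurtosis(y):
    """Sample excess kurtosis: fourth standardized moment minus 3 (0 for a Gaussian)."""
    centered = np.asarray(y, dtype=float) - np.mean(y)
    return float(np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0)


def sample_heavy_tailed_clusters(n=400, seed=2):
    """Two clusters at (0, +1) and (0, -1); horizontal noise is Laplace (an assumption)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    x = rng.laplace(scale=1.0, size=n)  # heavy-tailed horizontal spread
    y = np.where(labels == 0, 1.0, -1.0) + rng.normal(scale=0.1, size=n)
    return np.column_stack([x, y])


if __name__ == "__main__":
    points = sample_heavy_tailed_clusters()
    print("horizontal excess kurtosis:", round(excess_kurtosis(points[:, 0]), 2))
    print("vertical excess kurtosis:  ", round(excess_kurtosis(points[:, 1]), 2))
```

The horizontal projection should typically show positive excess kurtosis, while the bimodal vertical projection should show strongly negative excess kurtosis. That gap is consistent with a kurtosis-seeking rule and a bimodality-seeking rule preferring different axes.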