10.5.2 Absolute performance: quadratic discriminants
In theory, quadratic discrimination is the best procedure to use when the data are normally distributed, especially so if the covariances differ. Because it makes very specific distributional assumptions, and is therefore very efficient for normal distributions, it is inadvisable to use quadratic discrimination for non-normal distributions (a common situation with parametric procedures - they are not robust to departures from the assumptions). Because it uses many more parameters, it is also not advisable to use quadratic discrimination when the sample sizes are small. We will now relate these facts to our measures for the datasets.
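For readers who want the rule spelled out, a minimal sketch of maximum-likelihood quadratic discrimination follows (in Python with NumPy; the helper names fit_qda and qda_predict are invented for illustration and are not part of the StatLog software). Each class gets its own prior, mean and covariance matrix, and an observation is assigned to the class with the largest score delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x - mu_k)' Sigma_k^{-1} (x - mu_k) + log pi_k.

    import numpy as np

    def fit_qda(X, y):
        """Estimate per-class priors, means and covariances by maximum likelihood."""
        params = {}
        for k in np.unique(y):
            Xk = X[y == k]
            params[k] = {
                "prior": len(Xk) / len(X),
                "mean": Xk.mean(axis=0),
                "cov": np.cov(Xk, rowvar=False),  # needs enough samples per class
            }
        return params

    def qda_predict(X, params):
        """Assign each row of X to the class with the largest quadratic score."""
        classes = list(params)
        scores = np.empty((len(X), len(classes)))
        for j, k in enumerate(classes):
            p = params[k]
            diff = X - p["mean"]
            cov_inv = np.linalg.inv(p["cov"])
            _, logdet = np.linalg.slogdet(p["cov"])
            maha = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # Mahalanobis terms
            scores[:, j] = -0.5 * logdet - 0.5 * maha + np.log(p["prior"])
        return np.array(classes)[np.argmax(scores, axis=1)]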
The ideal dataset for quadratic discrimination would be a very large, normally distributed dataset with widely differing covariance matrices. In terms of the measures, ideally we want skewness = 0, kurtosis = 3, and SD ratio much greater than unity.
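The three measures are straightforward to approximate in code. The sketch below is illustrative only: it computes class-wise skewness and kurtosis with SciPy and a crude attribute-wise stand-in for the SD ratio. The SD ratio used in this study is a multivariate quantity computed from the class covariance matrices, so the proxy here is a deliberate simplification, not the published definition.

    import numpy as np
    from scipy.stats import skew, kurtosis

    def normality_measures(X, y):
        """Rough summaries of a dataset: mean |skewness|, mean kurtosis (normal = 3),
        and a simplified spread-ratio proxy (1 would mean equal spread across classes)."""
        labels = np.unique(y)
        skews = [np.abs(skew(X[y == k], axis=0)) for k in labels]
        kurts = [kurtosis(X[y == k], axis=0, fisher=False) for k in labels]  # Pearson form
        sds = np.array([X[y == k].std(axis=0, ddof=1) for k in labels])
        sd_ratio = np.mean(sds.max(axis=0) / sds.min(axis=0))  # crude proxy only
        return float(np.mean(skews)), float(np.mean(kurts)), float(sd_ratio)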
The most normal dataset in our study is the KL digits dataset: skewness = 0.18 (small), kurtosis = 2.92 (near 3), and, most importantly, SD ratio = 1.97 (much greater than unity). This dataset is nearest the ideal, so it is predictable that quadratic discrimination will achieve a very low error rate here. In fact, quadratic discriminants achieve an error rate of 2.5%, which is bettered only by k-NN with an error rate of 2.0% and by ALLOC80 with an error rate of 2.4%.
At the other extreme, the least normal dataset is probably the shuttle dataset, with skewness = 4.4 (very large), kurtosis = 160.3 (nowhere near 3), and, to make matters worse, SD ratio = 1.12 (not much greater than unity). We can therefore predict that this is the least appropriate dataset for quadratic discrimination, and it is no surprise that quadratic discriminants achieve an error rate of 6.72%, the worst of all our results for the shuttle dataset. The decision tree methods get error rates smaller than this by a factor of 100!
The important proviso that there must be enough data to estimate all parameters accurately should always be borne in mind.
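To make the proviso concrete, it helps to count the free parameters that quadratic discrimination must estimate: each class needs d mean components plus d(d+1)/2 distinct covariance entries, in addition to the class priors. The small helper below is hypothetical, but the arithmetic is standard.

    def qda_parameter_count(d, k):
        """Free parameters in quadratic discrimination with k classes and d attributes:
        per class, d mean components plus d*(d+1)/2 distinct covariance entries,
        plus k - 1 independent prior probabilities."""
        return k * (d + d * (d + 1) // 2) + (k - 1)

    # For example, a problem with 40 attributes and 10 classes already requires
    # qda_parameter_count(40, 10) = 8609 estimates, so small samples are quickly exhausted.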
10.5.3 Relative performance: Logdisc vs. DIPOL92
Another fruitful way of looking at the behaviour of algorithms is by making paired compar-
isons between closely related algorithms. This extremely useful device is best illustrated
by comparing logistic discrimination (Logdisc) and DIPOL92. From their construction, we
can see that DIPOL92 and logistic discrimination have exactly the same formal decision
procedure in one special case, namely the case of two-class problems in which there is no
clustering (i.e. both classes are “pure”). Where the two differ, then, will be in multi-class
problems (such as the digits or letters datasets) or in two-class problems in which the classes
are impure (such as the Belgian Power dataset).
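To make the shared special case concrete, a minimal sketch of two-class logistic discrimination follows, assuming features in a NumPy array and classes coded 0/1; plain gradient ascent is used only to keep the example self-contained and is not how either Logdisc or DIPOL92 is actually trained. The fitted rule w.x + b > 0 is a single hyperplane, which is the form both procedures reduce to in the pure two-class case described above.

    import numpy as np

    def fit_logistic(X, y, lr=0.1, n_iter=5000):
        """Two-class logistic discrimination: maximise the log-likelihood of
        P(class 1 | x) = 1 / (1 + exp(-(w.x + b))) by gradient ascent."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(class 1 | x)
            w += lr * X.T @ (y - p) / len(y)
            b += lr * np.mean(y - p)
        return w, b

    # Decision rule: predict class 1 when X @ w + b > 0, i.e. a single linear boundary.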
With this in mind, it is of interest to compare the performance of DIPOL92