Rules and trees from data: first principles
However, exchanges initiated by ML-leaning statisticians (see Spiegelhalter, 1986) and statistically inclined ML theorists (see Pearl, 1988) may change this.
Although marching to a different drum, ML people have for some time been seen as a
possibly useful source of algorithms for certain data-analyses required in industry. There
are two broad circumstances that might favour applicability:
(1) categorical rather than numerical attributes;
(2) strong and pervasive conditional dependencies among attributes.
As an example of what is meant by a conditional dependency, let us take the classification
of vertebrates and consider two variables, namely “breeding-ground” (values: sea, fresh-
water, land) and “skin-covering” (values: scales, feathers, hair, none). As a value for the
first, “sea” votes overwhelmingly for FISH. If the second attribute has the value “none”,
then on its own this would virtually clinch the case for AMPHIBIAN. But in combination
with “breeding-ground = sea” it switches identification decisively to MAMMAL. Whales
and some other sea mammals now remain the only possibility. “Breeding-ground” and
“skin-covering” are said to exhibit strong conditional dependency. Problems characterised
by violent attribute-interactions of this kind can sometimes be important in industry. In
predicting automobile accident risks, for example, information that a driver is in the
age-group 17–23 acquires great significance if and only if sex = male.
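This interaction is easy to state as explicit rules. The following Python sketch is purely illustrative (the rule set, attribute names, and class labels are assumptions chosen for the example, not taken from any algorithm discussed here); it shows how the vote cast by skin_covering = "none" is reversed by breeding_ground = "sea":

```python
# Toy rule-based classifier (illustrative only) for the vertebrate
# example: the meaning of skin_covering == "none" is reversed by
# the value of breeding_ground.
def classify_vertebrate(breeding_ground: str, skin_covering: str) -> str:
    if skin_covering == "feathers":
        return "BIRD"
    if skin_covering == "hair":
        return "MAMMAL"
    if skin_covering == "none":
        # Alone, naked skin virtually clinches AMPHIBIAN; combined with
        # sea breeding it switches decisively to MAMMAL (whales etc.).
        return "MAMMAL" if breeding_ground == "sea" else "AMPHIBIAN"
    if skin_covering == "scales":
        # "sea" votes overwhelmingly for FISH.
        return "FISH" if breeding_ground in ("sea", "fresh-water") else "REPTILE"
    return "UNKNOWN"

print(classify_vertebrate("fresh-water", "none"))  # AMPHIBIAN
print(classify_vertebrate("sea", "none"))          # MAMMAL
```

No single attribute, examined in isolation, could reproduce this behaviour: the classifier must test the two attributes jointly, which is exactly the kind of test that rule and tree languages make cheap to express.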
To examine the “horses for courses” aspect of comparisons between ML, neural-net
and statistical algorithms, a reasonable principle might be to select datasets approximately
evenly among four main categories as shown in Figure 5.2.
[Fig. 5.2: Relative performance of ML algorithms. A four-cell grid contrasting attribute type ("all or mainly numerical" vs. "all or mainly categorical") with expected ML performance ("expected to do well", "expected to do well, marginally", "expected to do poorly, marginally").]
In practice, collection of datasets necessarily followed opportunity rather than design,
so that for light upon these particular contrasts the reader will find much that is suggestive,
but less that is clear-cut. Attention is, however, called to the Appendices, which contain
additional information for readers interested in following up particular algorithms and
datasets for themselves.
Classification learning is characterised by (i) the data-description language, (ii) the
language for expressing the classifier (as formulae, rules, and the like), and (iii) the learning
algorithm. Of these, (i) and (ii) correspond to the "observation language" and "hypothesis
language" respectively.
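As a minimal sketch of this three-part characterisation (the attribute names and the deliberately naive inducer below are assumptions for illustration, not the method of any system described here): observations are attribute-value records, the classifier is expressed as value-to-class rules, and the learning element induces those rules from data.

```python
from collections import Counter, defaultdict

# (i)   data-description language: attribute-value records (dicts)
# (ii)  classifier language: rules of the form  attribute = v -> class
# (iii) learning element: a naive inducer that assigns to each value
#       of one chosen attribute its majority class in the data.
def learn_one_attribute_rules(observations, attribute):
    by_value = defaultdict(Counter)
    for record, klass in observations:
        by_value[record[attribute]][klass] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}

data = [
    ({"skin": "feathers", "breeds": "land"}, "BIRD"),
    ({"skin": "hair", "breeds": "land"}, "MAMMAL"),
    ({"skin": "none", "breeds": "fresh-water"}, "AMPHIBIAN"),
    ({"skin": "none", "breeds": "sea"}, "MAMMAL"),
]
print(learn_one_attribute_rules(data, "skin"))
```

Note that so weak a hypothesis language cannot capture the skin-covering/breeding-ground interaction of the earlier example: "skin = none" is forced to name a single class. Trees and rule sets earn their keep precisely by letting the hypothesis language test attributes jointly.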