involve non-standard data types, such as text. The only requirement
for being able to include a data type is the existence of an appropriate metric.

© 1999 Two Crows Corporation

Logistic regression

Logistic regression is a generalization of linear regression. It is used primarily for predicting binary
variables (with values such as yes/no or 0/1) and occasionally multi-class variables. Because the
response variable is discrete, it cannot be modeled directly by linear regression. Therefore, rather than
predict whether the event itself (the response variable) will occur, we build the model to predict the
logarithm of the odds of its occurrence. This logarithm is called the log odds or the logit.

The odds ratio,

    (probability of an event occurring) / (probability of the event not occurring),

has the same interpretation as in the more casual use of odds in games of chance or sporting events.
When we say that the odds are 3 to 1 that a particular team will win a soccer game, we mean that the
probability of their winning is three times as great as the probability of their losing. So we believe
they have a 75% chance of winning and a 25% chance of losing. Similar terminology can be applied
to the chances of a particular type of customer (e.g., a customer with a given gender, income, marital
status, etc.) replying to a mailing. If we say the odds are 3 to 1 that the customer will respond, we
mean that the probability of that type of customer responding is three times as great as the probability
of him or her not responding.
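The 3-to-1 arithmetic above can be sketched directly; this is a minimal illustration in Python, and the helper names are ours, not from the text:

```python
def odds_to_probability(odds):
    """Convert odds of p/(1-p) into the probability p."""
    return odds / (1.0 + odds)

def probability_to_odds(p):
    """Convert a probability p into odds p/(1-p)."""
    return p / (1.0 - p)

# Odds of 3 to 1 mean a 75% chance of winning and a 25% chance of losing.
print(odds_to_probability(3.0))   # 0.75
print(probability_to_odds(0.75))  # 3.0
```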
Having predicted the log odds, you then take the anti-log of this number to find the odds, and from
the odds the probability. If the resulting probability exceeds 50% (say, 62%), the case is assigned to
the class designated "1" or "yes," for example.
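The anti-log step can be sketched in Python (the function name here is illustrative, not from the text):

```python
import math

def logit_to_probability(log_odds):
    """Take the anti-log of the log odds, then convert the odds to a probability."""
    odds = math.exp(log_odds)       # anti-log (base e) recovers the odds
    return odds / (1.0 + odds)      # odds -> probability

# Log odds of 0 correspond to even odds, i.e. a probability of 0.5;
# any positive log odds put the case in the "1"/"yes" class.
print(logit_to_probability(0.0))
print(logit_to_probability(1.1) > 0.5)
```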
While logistic regression is a very powerful modeling tool, it assumes that the response variable (the
log odds, not the event itself) is linear in the coefficients of the predictor variables. Furthermore, the
modeler, based on his or her experience with the data and data analysis, must choose the right inputs
and specify their functional relationship to the response variable. So, for example, the modeler must
choose among income, (income)², or log(income) as a predictor variable. Additionally, the modeler
must explicitly add terms for any interactions. It is up to the model builder to search for the right
variables, find their correct expression, and account for their possible interactions. Doing this
effectively requires a great deal of skill and experience on the part of the analyst.
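A minimal sketch of this search over functional forms, using simulated data in which the true response happens to be linear in log(income). Everything below — the variable names, the simulation, and the bare-bones gradient-ascent fitter — is illustrative, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated mailing data: the response is truly linear in log(income),
# but the modeler does not know that and must try candidate forms.
n = 2000
income = rng.uniform(20, 120, n)                 # thousands of dollars
true_logit = -6.0 + 1.5 * np.log(income)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

def fit_logistic(x, y, lr=0.1, steps=10000):
    """Fit p = sigmoid(a + b*x) by gradient ascent on the log-likelihood
    (a bare-bones sketch, not a production optimizer)."""
    x = x - x.mean()                             # center for stability
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    p = 1.0 / (1.0 + np.exp(-X @ w))
    loglik = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return w, loglik

# The modeler tries income, (income)^2, and log(income) and compares fits:
for name, x in [("income", income / 100.0),
                ("(income)^2", (income / 100.0) ** 2),
                ("log(income)", np.log(income))]:
    w, ll = fit_logistic(x, y)
    print(f"{name:12s} slope={w[1]:6.2f}  log-likelihood={ll:8.1f}")
```

With enough data, the correctly specified form (here, log(income)) achieves the best log-likelihood, which is exactly the kind of comparison the modeler must carry out by hand.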
Neural nets, on the other hand, use their hidden layers to estimate the forms of the non-linear terms
and interactions in a semi-automated way. Users need a different set of analytic skills to apply neural
nets successfully. For example, the choice of an activation function will affect the speed with which a
neural net trains.
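One way to see the hidden layer estimating an interaction is the classic XOR pattern, where the response depends entirely on how the two inputs interact. The following numpy sketch is only illustrative — the architecture, learning rate, and iteration count are assumptions chosen for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR data: the response is 1 only when exactly one input is 1 -- a pure
# interaction that no linear combination of the raw inputs can capture.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# One hidden layer of tanh units; the choice of activation function
# affects how quickly the net trains, as noted above.
W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, 8);      b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)                    # hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # output probability
    return h, p

losses, lr = [], 0.5
for _ in range(3000):
    h, p = forward(X)
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    # Backpropagation, full-batch gradient descent.
    d_out = (p - y) / len(y)
    gW2, gb2 = h.T @ d_out, d_out.sum()
    d_h = np.outer(d_out, W2) * (1.0 - h ** 2)
    gW1, gb1 = X.T @ d_h, d_h.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print("initial loss:", round(losses[0], 3), " final loss:", round(losses[-1], 3))
```

The hidden units jointly learn a representation of the interaction that the analyst would otherwise have to specify as an explicit term in a logistic regression.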
Discriminant analysis

Discriminant analysis is the oldest mathematical classification technique, having been first published
by R. A. Fisher in 1936 to classify the famous Iris botanical data into three species. It finds
hyperplanes (e.g., lines in two dimensions, planes in three, etc.) that separate the classes. The resultant
model is very easy to interpret because all the user has to do is determine on which side of the line (or
hyper-plane) a point falls. Training is simple and scalable. The technique is very sensitive to patterns
in the data. It is used very often in certain disciplines such as medicine, the social sciences, and field
biology.

Discriminant analysis is not popular in data mining, however, for three main reasons. First, it assumes
that all of the predictor variables are normally distributed (i.e., their histograms look like bell-shaped
curves), which may not be the case. Second, unordered categorical predictor variables (e.g., red/blue/
green) cannot be used at all. Third, the boundaries that separate the classes are all linear forms (such
as lines or planes), but sometimes the data just can’t be separated that way.
Recent versions of discriminant analysis address some of these problems by allowing the boundaries
to be quadratic as well as linear, which significantly increases the sensitivity in certain cases. There
are also techniques that allow the normality assumption to be replaced with an estimate of the real
distribution (i.e., replace the theoretical bell-shaped curve with the histograms of the predictor
variables). Ordered categorical data can be modeled by forming the histogram from the bins defined
by the categorical variables.
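The hyperplane machinery can be sketched for the two-class linear case. The synthetic data and names below are illustrative assumptions (Fisher's original Iris example would work the same way, with three classes):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two synthetic classes drawn from normal distributions with a shared
# covariance -- exactly the assumption linear discriminant analysis makes.
n = 200
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, n)
X1 = rng.multivariate_normal([3.0, 2.0], cov, n)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

# Fisher's linear discriminant: the separating hyperplane's normal
# direction is w = S^{-1} (mu1 - mu0), with S the pooled covariance.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S = (np.cov(X0.T) + np.cov(X1.T)) / 2.0
w = np.linalg.solve(S, mu1 - mu0)
c = w @ (mu0 + mu1) / 2.0    # threshold halfway between the class means

# To classify, just check which side of the hyperplane a point falls on.
pred = (X @ w > c).astype(float)
print("training accuracy:", (pred == y).mean())
```

This also makes the interpretability claim concrete: the whole model is the single direction w and the cutoff c.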
Generalized Additive Models (GAM)
There is a class of models extending both linear and logistic regression, known as generalized
additive models or GAM. They are called additive because we assume that the model can be written
as the sum of possibly non-linear functions, one for each predictor. GAM can be used either for
regression or for classification of a binary response. The response variable can be virtually any