Acknowledgement: Material derived from slides for the book
Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997
http://www2.cs.cmu.edu/~tom/mlbook.html
and the book Data Mining, Ian H. Witten and Eibe Frank,
Morgan Kaufmann, 2000. http://www.cs.waikato.ac.nz/ml/weka

11s1: COMP9417 Machine Learning and Data Mining

Evaluating Hypotheses

May 3, 2011

Aims

[Recommended reading: Mitchell, Chapter 5]
[Recommended exercises: 5.2 – 5.4]
Relevant WEKA programs: weka.gui.experiment.Experimenter

This lecture will enable you to apply statistical and graphical methods
to the evaluation of hypotheses in machine learning. Following it you
should be able to:

• describe the problem of estimating hypothesis accuracy (error)
• define sample error and true error
• derive confidence intervals for observed hypothesis error
• understand learning algorithm comparisons using paired t-tests
• define and apply common evaluation measures
• generate lift charts and ROC curves

Evaluation in machine learning

Machine learning is a highly empirical science . . .

"In theory, there is no difference between theory and practice.
But, in practice, there is."

Estimating Hypothesis Accuracy

• how well does a hypothesis generalize beyond the training set?
  – need to estimate off-training-set error
• what is the probable error in this estimate?
• if one hypothesis is more accurate than another on a data set, how
  probable is this difference in general?
Two Definitions of Error

The sample error of h with respect to target function f and data sample
S is the proportion of examples h misclassifies:

    errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise (cf. 0-1 loss).

The true error of hypothesis h with respect to target function f and
distribution D is the probability that h will misclassify an instance drawn
at random according to D:

    errorD(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]

Question: How well does errorS(h) estimate errorD(h)?

Estimators

Experiment:
1. choose sample S of size n according to distribution D
2. measure errorS(h)

errorS(h) is a random variable (i.e., the result of an experiment).
errorS(h) is an unbiased estimator for errorD(h).
Given observed errorS(h), what can we conclude about errorD(h)?
Problems Estimating Error

1. Bias: If S is the training set, errorS(h) is optimistically biased:

    bias ≡ E[errorS(h)] − errorD(h)

   For an unbiased estimate, h and S must be chosen independently.

2. Variance: Even with a selection of S that gives an unbiased estimate,
   errorS(h) may still vary from errorD(h).

Note: Estimation bias is not to be confused with inductive bias – the former
is a numerical quantity [from statistics], the latter is a set of assertions
[from concept learning]. More on this in the lecture on ensemble methods.
Example

Hypothesis h misclassifies 12 of the 40 examples in S:

    errorS(h) = 12/40 = .30

What is errorD(h)?

Confidence Intervals

If
• S contains n examples, drawn independently of h and each other
• n ≥ 30

Then
• With approximately 95% probability, errorD(h) lies in the interval

    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )
Confidence Intervals

If
• S contains n examples, drawn independently of h and each other
• n ≥ 30

Then
• With approximately N% probability, errorD(h) lies in the interval

    errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )

Where do the zN values come from? Statistical tables, e.g.

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
Confidence Intervals

Example:

Hypothesis h misclassifies 12 of the 40 examples in S:

    errorS(h) = 12/40 = .30

What is errorD(h)? Given no other information, our best estimate is .30 . . .

Example (continued):

. . ., but for repeated samples of 40 examples, expect some variation in
the sample error. With approximately 95% probability, errorD(h) lies in
the interval

    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )
    = .30 ± 1.96 √( .30 × .70 / 40 )
    = .30 ± 1.96 × .072
    = .30 ± .14
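This calculation is easy to reproduce programmatically. Below is a minimal Python sketch (an illustration added here, not part of the original slides), using only the Normal-approximation formula given above; the function name and example numbers are just the worked example restated.

    import math

    def error_confidence_interval(r, n, z=1.96):
        """Normal-approximation confidence interval for errorD(h), given
        r misclassifications observed on a test sample of size n."""
        error_s = r / n                              # sample error errorS(h)
        sd = math.sqrt(error_s * (1 - error_s) / n)  # estimated standard deviation
        return error_s - z * sd, error_s + z * sd

    # Worked example from above: 12 errors out of 40 test examples.
    low, high = error_confidence_interval(12, 40)
    print(f"errorS(h) = {12/40:.2f}, 95% CI = ({low:.2f}, {high:.2f})")  # about (0.16, 0.44)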
errorS(h) is a Random Variable

Rerun the experiment with different randomly drawn S (of size n).
Probability of observing r misclassified examples:

    P(r) = [ n! / (r!(n − r)!) ] errorD(h)^r (1 − errorD(h))^(n−r)

Binomial Probability Distribution

[Figure: Binomial distribution for n = 40, p = 0.3 — P(r) plotted against r = 0 . . . 40]

Probability P(r) of r heads in n coin flips, if p = Pr(heads):

    P(r) = [ n! / (r!(n − r)!) ] p^r (1 − p)^(n−r)

• Expected, or mean, value of X, E[X], is

    E[X] ≡ Σ_{i=0}^{n} i P(i) = np

• Variance of X is

    Var(X) ≡ E[(X − E[X])²] = np(1 − p)

• Standard deviation of X, σX, is

    σX ≡ √( E[(X − E[X])²] ) = √( np(1 − p) )
Examples

Suppose you test a hypothesis h and find that it commits r = 12 errors
on a sample S of n = 40 randomly drawn test examples. An unbiased
estimate for errorD(h) is given by errorS(h) = r/n = 0.3.

The variance in this estimate arises from r alone (n is a constant). From
the Binomial distribution, this variance is np(1 − p). We can substitute
r/n as an estimate for p. Then the variance for r is estimated to be
40 × 0.3(1 − 0.3) = 8.4 and the standard deviation is √8.4 ≈ 2.9.

Therefore the standard deviation in errorS(h) = r/n is approximately
2.9/40 = 0.07. errorS(h) is observed to be 0.30 with a standard deviation
of approximately 0.07.

Suppose you test a hypothesis h and find that it commits r = 300 errors
on a sample S of n = 1000 randomly drawn test examples. What is the
standard deviation in errorS(h)?

The standard deviation for r is estimated to be √(1000 × 0.3(1 − 0.3)) ≈ 14.5.
Therefore the standard deviation in errorS(h) = r/n is approximately
14.5/1000 = .0145. errorS(h) is observed to be 0.30 with a standard
deviation of approximately .0145.
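A short Python sketch (my own illustration, not from the slides) that reproduces both standard-deviation calculations above from r and n alone:

    import math

    def error_estimate_sd(r, n):
        """Estimated standard deviation of errorS(h) = r/n, using the Binomial
        variance n*p*(1-p) with p estimated by r/n."""
        p = r / n
        sd_r = math.sqrt(n * p * (1 - p))  # standard deviation of the error count r
        return sd_r / n                    # standard deviation of the error rate r/n

    print(error_estimate_sd(12, 40))      # ~0.072, i.e. approximately 0.07 as above
    print(error_estimate_sd(300, 1000))   # ~0.0145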
Normal Distribution Approximates Binomial

errorS(h) follows a Binomial distribution, with

• mean µ_errorS(h) = errorD(h)
• standard deviation σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )

Approximate this by a Normal distribution with

• mean µ_errorS(h) = errorD(h)
• standard deviation σ_errorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )
Normal Probability Distribution

[Figure: Normal distribution with mean 0, standard deviation 1]

    p(x) = (1 / √(2πσ²)) e^( −(1/2) ((x − µ)/σ)² )

The probability that X will fall into the interval (a, b) is given by

    ∫_a^b p(x) dx

• Expected, or mean, value of X, E[X], is E[X] = µ
• Variance of X is Var(X) = σ²
• Standard deviation of X, σX, is σX = σ
Normal Probability Distribution

[Figure: standard Normal density; 80% of the area lies within µ ± 1.28σ]

80% of area (probability) lies in µ ± 1.28σ.
N% of area (probability) lies in µ ± zN σ.

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

Note: with 80% confidence the value of the random variable will lie in the
two-sided interval [−1.28, 1.28]. With 10% confidence it will lie to the
right of this interval (resp. left). With 90% confidence it will lie in the
one-sided interval [−∞, 1.28].

Let α be the probability that the value lies outside the interval. Then a
100(1 − α)% two-sided confidence interval with lower bound L and upper
bound U can be converted into a 100(1 − (α/2))% one-sided confidence
interval with lower bound L and no upper bound (resp. upper bound U and
no lower bound).
Confidence Intervals, More Correctly

If
• S contains n examples, drawn independently of h and each other
• n ≥ 30

Then
• With approximately 95% probability, errorS(h) lies in the interval

    errorD(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

equivalently, errorD(h) lies in the interval

    errorS(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )

which is approximately

    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )
Central Limit Theorem

Consider a set of independent, identically distributed random variables
Y1 . . . Yn, all governed by an arbitrary probability distribution with mean
µ and finite variance σ². Define the sample mean

    Ȳ ≡ (1/n) Σ_{i=1}^{n} Yi

Central Limit Theorem. As n → ∞, the distribution governing Ȳ
approaches a Normal distribution, with mean µ and variance σ²/n.

In other words, the sum of a large number of independent, identically
distributed (i.i.d.) random variables follows a distribution that is
approximately Normal.

Calculating Confidence Intervals

1. Pick parameter p to estimate
   • errorD(h)
2. Choose an estimator
   • errorS(h)
3. Determine the probability distribution that governs the estimator
   • errorS(h) governed by a Binomial distribution, approximated by a
     Normal distribution when n ≥ 30
4. Find interval (L, U) such that N% of the probability mass falls in the
   interval
   • Use the table of zN values

Difference Between Hypotheses

Two classifiers h1, h2. Test h1 on sample S1, test h2 on S2.

Apply the four-step procedure:

1. Pick the parameter to estimate

    d ≡ errorD(h1) − errorD(h2)

2. Choose an estimator

    d̂ ≡ errorS1(h1) − errorS2(h2)

3. Determine the probability distribution that governs the estimator

    σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )

4. Find interval (L, U) such that N% of the probability mass falls in the
   interval

    d̂ ± zN √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
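As a concrete illustration of the four-step procedure (a sketch of mine, with made-up error counts), the following Python snippet computes d̂ and its N% confidence interval from two independent test samples, using the zN table given earlier.

    import math

    # zN values from the statistical table above
    Z = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

    def difference_interval(r1, n1, r2, n2, confidence=95):
        """N% confidence interval for d = errorD(h1) - errorD(h2), estimated
        from independent test samples S1 (size n1) and S2 (size n2)."""
        e1, e2 = r1 / n1, r2 / n2
        d_hat = e1 - e2
        sd = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
        z = Z[confidence]
        return d_hat, (d_hat - z * sd, d_hat + z * sd)

    # Hypothetical counts: h1 makes 30 errors on 100 examples, h2 makes 20 on 100.
    print(difference_interval(30, 100, 20, 100))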
Paired t test to compare hA, hB

1. Partition the data into k disjoint test sets T1, T2, . . . , Tk of equal size,
   where this size is at least 30.

2. For i from 1 to k, do

    δi ← errorTi(hA) − errorTi(hB)

3. Return the value δ̄, where

    δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

   the sample mean of the difference in error between the 2 learning methods.

N% confidence interval estimate for d (the difference between the true
errors of the hypotheses):

    δ̄ ± t_{N,k−1} · s_δ̄

    s_δ̄ ≡ √( (1 / (k(k−1))) Σ_{i=1}^{k} (δi − δ̄)² )

where s_δ̄ is the estimated standard deviation.

Note: the δi are approximately Normally distributed.
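A minimal Python sketch of this procedure (my own illustration, assuming the per-test-set differences δi have already been measured); the value t_{N,k−1} would normally come from a t-table, so it is simply passed in here.

    import math

    def paired_t_interval(deltas, t_value):
        """Confidence interval for d, given the per-test-set differences
        delta_i = errorTi(hA) - errorTi(hB) and the t value t_{N,k-1}
        looked up for the desired confidence level."""
        k = len(deltas)
        mean = sum(deltas) / k                                              # delta-bar
        s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))  # estimated sd of delta-bar
        return mean - t_value * s, mean + t_value * s

    # Hypothetical differences over k = 5 test sets; t for 95% with 4 d.o.f. is about 2.776.
    print(paired_t_interval([0.02, 0.05, 0.01, 0.03, 0.04], t_value=2.776))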
Comparing learning algorithms LA and LB

What we'd like to estimate:

    E_{S⊂D} [ errorD(LA(S)) − errorD(LB(S)) ]

where L(S) is the hypothesis output by learner L using training set S,
i.e., the expected difference in true error between hypotheses output
by learners LA and LB, when trained using randomly selected training
sets S drawn according to distribution D.

But, given limited data D0, what is a good estimator?

• could partition D0 into training set S0 and test set T0, and measure

    errorT0(LA(S0)) − errorT0(LB(S0))

• even better, repeat this many times and average the results (next slide)
Comparing learning algorithms LA and LB

1. Partition data D0 into k disjoint test sets T1, T2, . . . , Tk of equal size,
   where this size is at least 30.

2. For i from 1 to k, do
   use Ti for the test set, and the remaining data for training set Si
   • Si ← {D0 − Ti}
   • hA ← LA(Si)
   • hB ← LB(Si)
   • δi ← errorTi(hA) − errorTi(hB)

3. Return the value δ̄, where

    δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

Notice we'd like to use the paired t test on δ̄ to obtain a confidence
interval, but this is not really correct, because the training sets in this
algorithm are not independent (they overlap!).

It is more correct to view the algorithm as producing an estimate of

    E_{S⊂D0} [ errorD(LA(S)) − errorD(LB(S)) ]

instead of

    E_{S⊂D} [ errorD(LA(S)) − errorD(LB(S)) ]

but even this approximation is better than no comparison.
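The procedure above can be sketched directly in Python. This is an illustrative skeleton of mine, not WEKA's implementation: learner_a and learner_b are assumed to be training functions that return a hypothesis, and error is assumed to measure a hypothesis's error rate on a test set.

    def compare_learners(data, learner_a, learner_b, error, k=10):
        """Estimate delta-bar, the mean difference in test-set error between two
        learning algorithms, using k disjoint test sets as described above.
        `learner_*` take a training set and return a hypothesis; `error` takes
        a hypothesis and a test set and returns its error rate."""
        folds = [data[i::k] for i in range(k)]   # k disjoint test sets T1..Tk
        deltas = []
        for i, test_set in enumerate(folds):
            # Si = D0 - Ti: everything not in the current test fold
            train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
            h_a = learner_a(train_set)
            h_b = learner_b(train_set)
            deltas.append(error(h_a, test_set) - error(h_b, test_set))  # delta_i
        return sum(deltas) / k                   # delta-bar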
Parameter tuning

• It is important that the test data is not used in any way to create the
  classifier
• Some learning schemes operate in two stages:
  – Stage 1: builds the basic structure
  – Stage 2: optimizes parameter settings
• The test data can't be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and
  test data
• Validation data is used to optimize parameters

Making the most of the data

• Once evaluation is complete, all the data can be used to build the final
  classifier
• Generally, the larger the training data the better the classifier (but
  returns diminish)
• The larger the test data, the more accurate the error estimate
• Holdout procedure: method of splitting the original data into training
  and test set
  – Dilemma: ideally we want both a large training and a large test set
Loss functions

• Most common performance measure: predictive accuracy (cf. sample
  error)
• Also called the 0-1 loss function:
    0 if prediction is correct
    1 if prediction is incorrect
• Classifiers can produce class probabilities
• What is the accuracy of the probability estimates?
• 0-1 loss is not appropriate for this

Quadratic loss function

• p1, . . . , pk are probability estimates of all possible outcomes for an
  instance
• c is the index of the instance's actual class
• i.e. a1, . . . , ak are zero, except for ac which is 1
• the quadratic loss is:

    Σ_j (pj − aj)² = Σ_{j≠c} pj² + (1 − pc)²

• leads to a preference for predictors giving the best guess at the true
  probabilities
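To make the definitions concrete, here is a small Python sketch (mine, not from the book) computing the 0-1 loss and the quadratic loss for a single instance, given the predicted class probabilities p1 . . . pk and the index c of the actual class.

    def zero_one_loss(probs, actual):
        """0-1 loss: 0 if the most probable class is the actual class, else 1."""
        predicted = max(range(len(probs)), key=lambda j: probs[j])
        return 0 if predicted == actual else 1

    def quadratic_loss(probs, actual):
        """Quadratic loss: sum_j (p_j - a_j)^2, with a_j = 1 for the actual class, else 0."""
        return sum((p - (1 if j == actual else 0)) ** 2 for j, p in enumerate(probs))

    probs = [0.7, 0.2, 0.1]                  # probability estimates for a 3-class instance
    print(zero_one_loss(probs, actual=0))    # 0
    print(quadratic_loss(probs, actual=0))   # 0.09 + 0.04 + 0.01 = 0.14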
Informational loss function

• the informational loss function is −log2(pc), where c is the index of the
  actual class of an instance
• the number of bits required to communicate the actual class
• if p*1, . . . , p*k are the true class probabilities
• then the expected value of the informational loss function is:

    −p*1 log2(p1) − . . . − p*k log2(pk)

• which is minimized for pj = p*j
• giving the entropy of the true distribution:

    −p*1 log2(p*1) − . . . − p*k log2(p*k)

Which loss function?

• quadratic loss takes into account all the class probability estimates for
  an instance
• informational loss focuses only on the probability estimate for the
  actual class
• quadratic loss is bounded by 1 + Σ_j pj², and can never exceed 2
• informational loss can be infinite
• informational loss is related to the MDL principle (can use bits for
  complexity as well as accuracy)
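A corresponding Python sketch (again my own illustration) for the informational loss, and for its expected value, which is minimized at the entropy of the true distribution.

    import math

    def informational_loss(probs, actual):
        """Informational loss: -log2 of the probability assigned to the actual class."""
        return -math.log2(probs[actual])

    def expected_informational_loss(true_probs, probs):
        """Expected informational loss -sum_j p*_j log2(p_j) under the true probabilities."""
        return -sum(p_star * math.log2(p) for p_star, p in zip(true_probs, probs))

    true_probs = [0.5, 0.3, 0.2]
    print(expected_informational_loss(true_probs, [0.7, 0.2, 0.1]))  # larger than the entropy
    print(expected_informational_loss(true_probs, true_probs))       # minimized: entropy, ~1.49 bits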
Costs of predictions

• In practice, different types of classification errors often incur different
  costs
• Examples:
  – Medical diagnosis (has cancer vs. not)
  – Loan decisions
  – Fault diagnosis
  – Promotional mailing

Confusion matrix

Two-class prediction case:

                          Predicted Class
                          Yes                   No
    Actual Class   Yes    True Positive (TP)    False Negative (FN)
                   No     False Positive (FP)   True Negative (TN)

Two kinds of error:
False Positive and False Negative may have different costs.

Two kinds of correct prediction:
True Positive and True Negative may have different "benefits".

Note: total number of test set examples N = TP + FN + FP + TN
Common evaluation measures

Accuracy

    (TP + TN) / N

Error rate: equivalent to 1 − Accuracy, i.e.,

    (FP + FN) / N

Precision

    TP / (TP + FP)

(also called: Correctness, Positive Predictive Value)

Recall

    TP / (TP + FN)

(also called: TP rate, Hit rate, Sensitivity, Completeness)
Common evaluation measures

Sensitivity (True Positive (TP) Rate)

    TP / (TP + FN)

Specificity: equivalent to 1 − FP rate, i.e.,

    TN / (TN + FP)

(also called: TN rate)

False Positive (FP) Rate: equivalent to 1 − Specificity, i.e.,

    FP / (FP + TN)

(also called: False alarm rate)
Common evaluation measures

Negative Predictive Value

    TN / (TN + FN)

Coverage

    (TP + FP) / N

Note:
• this is not an exhaustive list . . .
• the same measures are used under different names in different disciplines

                          Predicted Class
                          Yes    No
    Actual Class   Yes    TP     FN
                   No     FP     TN

E.g., in concept learning, the number of instances in a sample predicted
to be in (resp. not in) the concept is the sum of the first (resp. second)
column. The number of positive (resp. negative) examples of the concept
in a sample is the sum of the first (resp. second) row.

    Npred     = TP + FP
    Nnot pred = FN + TN
    Npos      = TP + FN
    Nneg      = FP + TN
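The measures above are simple ratios over the confusion-matrix counts. A compact Python sketch (my own, with hypothetical counts) that computes them:

    def evaluation_measures(tp, fn, fp, tn):
        """Common evaluation measures derived from the two-class confusion matrix."""
        n = tp + fn + fp + tn
        return {
            "accuracy":         (tp + tn) / n,
            "error rate":       (fp + fn) / n,
            "precision":        tp / (tp + fp),   # positive predictive value
            "recall":           tp / (tp + fn),   # sensitivity, TP rate
            "specificity":      tn / (tn + fp),   # TN rate
            "FP rate":          fp / (fp + tn),   # false alarm rate
            "neg. pred. value": tn / (tn + fn),
            "coverage":         (tp + fp) / n,
        }

    # Hypothetical confusion matrix: TP=40, FN=10, FP=20, TN=30
    for name, value in evaluation_measures(40, 10, 20, 30).items():
        print(f"{name}: {value:.3f}")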
Common evaluation measures

We can treat the evaluation measures as conditional probabilities:

    P(pred | pos)     = TP / (TP + FN)   (Sensitivity)
    P(pred | neg)     = FP / (FP + TN)   (FP rate)
    P(not pred | pos) = FN / (TP + FN)   (FN rate)
    P(not pred | neg) = TN / (FP + TN)   (Specificity)
    P(pos | pred)     = TP / (TP + FP)   (Pos. Pred. Value)
    P(neg | pred)     = FP / (TP + FP)
    P(pos | not pred) = FN / (FN + TN)
    P(neg | not pred) = TN / (FN + TN)   (Neg. Pred. Value)

Tradeoff

Good coverage of positive examples: increase TP at the risk of increasing
FP, i.e. increase generality.

Good proportion of positive examples: decrease FP at the risk of decreasing
TP, i.e. decrease generality, i.e. increase specificity.

Different techniques give different tradeoffs and can be plotted as two
different lines on any of the graphical charts: Lift, ROC or recall-precision
curves.
Lift charts

• In practice, costs are rarely known precisely
• Instead, decisions are often made by comparing possible scenarios
• Lift comes from market research, where a typical goal is to identify a
  "profitable" target subgroup out of the total population
• Example: promotional mailout to a population of 1,000,000 potential
  respondents
  – Baseline is that 0.1% of all households in the total population will
    respond (1000)
  – Situation 1: classifier 1 identifies a target subgroup of 100,000 most
    promising households of which 0.4% will respond (400)
  – Situation 2: classifier 2 identifies a target subgroup of 400,000 most
    promising households of which 0.2% will respond (800)
Lift charts

• Lift = response rate of target subgroup / response rate of total population
• Situation 1 gives a lift of 0.4 / 0.1 = 4
• Situation 2 gives a lift of 0.2 / 0.1 = 2
• Note that which situation is more profitable depends on cost estimates
• A lift chart allows for a visual comparison

Use lift to see how well a classifier is doing compared to base level (e.g.,
guessing the most common class, as in ZeroR).
Hypothetical Lift Chart

[Figure: hypothetical lift chart]

Generating a lift chart

Instances are sorted according to their predicted probability of being a
true positive:

    Rank   Predicted probability   Actual class
    1      0.95                    Yes
    2      0.93                    Yes
    3      0.93                    No
    4      0.88                    Yes
    ...    ...                     ...

In the lift chart, the x axis is sample size and the y axis is the number of
true positives.
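A small Python sketch (my own) of the procedure just described: sort instances by predicted probability and accumulate true positives, giving the points of the lift chart.

    def lift_chart_points(predictions):
        """Given (predicted_probability, actual_is_positive) pairs, return
        (sample_size, cumulative_true_positives) points for a lift chart."""
        ranked = sorted(predictions, key=lambda p: p[0], reverse=True)  # highest probability first
        points, true_positives = [], 0
        for size, (_, is_positive) in enumerate(ranked, start=1):
            true_positives += 1 if is_positive else 0
            points.append((size, true_positives))
        return points

    # The ranked examples from the table above (probability, actual class == Yes)
    data = [(0.95, True), (0.93, True), (0.93, False), (0.88, True)]
    print(lift_chart_points(data))   # [(1, 1), (2, 2), (3, 2), (4, 3)]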
ROC curves

• ROC curves are similar to lift charts
  – ROC stands for receiver operating characteristic
  – Used in signal detection to show the tradeoff between hit rate and
    false alarm rate over a noisy channel
• Differences to lift chart:
  – y axis shows percentage of true positives in the sample (rather than
    absolute number)
  – x axis shows percentage of false positives in the sample (rather than
    sample size)

A sample ROC curve

[Figure: a sample ROC curve]
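In the same spirit, a Python sketch (mine) that turns ranked predictions into ROC points, i.e. FP rate against TP rate as the decision threshold is lowered through the ranking.

    def roc_points(predictions):
        """Given (predicted_probability, actual_is_positive) pairs, return
        (FP_rate, TP_rate) points obtained by thresholding at each ranked instance."""
        pos = sum(1 for _, y in predictions if y)
        neg = len(predictions) - pos
        ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
        points, tp, fp = [(0.0, 0.0)], 0, 0
        for _, is_positive in ranked:
            if is_positive:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))  # x = % false positives, y = % true positives
        return points

    data = [(0.95, True), (0.93, True), (0.93, False), (0.88, True)]
    print(roc_points(data))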
F-measure combines precision and recall

"Information Retrieval", K. van Rijsbergen (1979)

    F = 2 × (precision × recall) / (precision + recall)

Combines both precision and recall in a single measure giving equal weight
to both (there are variants that weight each component differently).

Numeric prediction evaluation measures

Based on differences between predicted (pi) and actual (ai) values on a
test set of n examples:

Mean squared error

    ( (p1 − a1)² + . . . + (pn − an)² ) / n

Root mean squared error

    √( ( (p1 − a1)² + . . . + (pn − an)² ) / n )
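A brief Python sketch (mine) of these numeric prediction measures; it also includes the mean absolute error and relative absolute error, which are defined in the continuation just below.

    import math

    def numeric_prediction_measures(predicted, actual):
        """Mean squared, root mean squared, mean absolute and relative absolute
        error over paired predicted/actual values."""
        n = len(actual)
        mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
        mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
        a_bar = sum(actual) / n
        rae = sum(abs(p - a) for p, a in zip(predicted, actual)) / sum(abs(a - a_bar) for a in actual)
        return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "RAE": rae}

    # Hypothetical predicted and actual values
    print(numeric_prediction_measures([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))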
Numeric prediction evaluation measures

Mean absolute error

    ( |p1 − a1| + . . . + |pn − an| ) / n

Relative absolute error

    ( |p1 − a1| + . . . + |pn − an| ) / ( |a1 − ā| + . . . + |an − ā| ),
    where ā = (1/n) Σ_i ai

plus others, see, e.g., Weka.

Summary

• Evaluation for machine learning and data mining is a complex issue
• Some commonly used methods have been found to work well in practice
  – 10 × 10-fold cross-validation
  – corrected resampled t-test (Weka)
• Issues to be aware of
  – multiple testing: from a large set of hypotheses, some will appear
    good at random
  – does the off-training-set distribution match that of the training set?
A cautionary tale . . .

In a WW2 survey carried out by RAF Bomber Command, all bombers returning
from bombing raids over Germany over a particular period were inspected. All
damage inflicted by German air defences was noted, and an initial recommendation
was given that armour be added in the most heavily damaged areas. Further
analysis instead made the surprising and counter-intuitive recommendation that the
armour be placed in the areas which were completely untouched by damage. The
reasoning was that the survey was biased, since it only included aircraft that
successfully came back from Germany. The untouched areas were probably vital
areas, which, if hit, would result in the loss of the aircraft.