Classification accuracy
The figure shows the value of
the discriminate function
p(Y 1  X)
f (x) log
p(Y 0  X)
across the test examples
The only test error is also the
decision with the lowest
confidence
FDA Approves GeneBased
Breast Cancer Test*
Mam
Classifying cancer types
We select a subset of the
genes (more in our feature
selection class later in the
course).
We compute the mean for
each of the genes in each of
the classes
= 1.8
= 0.6
2=
2=
1.1
0.4
Classifying cancer types
We compute the me
Important points
Problems with estimating full joints
Advantages of Nave Bayes assumptions
Applications to discrete and continuous cases
Problems with Nave Bayes classifiers
Possible problems with Nave
Bayes classifiers: Assumptions
In most cases, the assumption of conditional independence given
the class label is violated
 much more likely to find the word George if we saw the word
Bush regardless of the class
This is, u
MLE for Gaussian Nave Bayes
Classifier
For each class we need to estimate one global value (prior) and
two values for each feature (mean and variance)
The prior is computed in the same way we did before (counting)
which is the MLE estimate For each featur
Gaussian Bayes Classification
To determine the class when using the
Gaussian assumption we need to compute
p(xy):
P(x  y)
1
1/ 2
(2 )

exp (X
1/ 2

P(y
)T
v  x)
1
(X
Once again, we need lots of data to
compute the values of the covariance
matrix
p(x
Example
Dictionary
Washington
Congress
54. Mccain
55. Obama
56. Nader
j
54= 54(x )
j
55= 55(x )
j
56= 56(x )
=1
=1
=0
Assume we would like to classify documents
as election related or not.
Example: cont.
We would like to classify documents
as election rel
Nave Bayes Classifier
Nave Bayes classifiers assume that given the class label (Y) the
attributes are conditionally independent of each other:
p(x  y)
pi (x i  y)
i
Specific model for
atribute i
Product of probability
terms
Using this idea the full cl
Data likelihood
The global likelihood of the data can be expressed as:
L(X,Y) L(X Y)L(Y)
Since the two parts of this product do not share parameters we can
maximize them separately.
For binary attributes, assume we observe n0 instances of xi=0 and n1
Feature transformation
How do we encode the set of features (words) in the document?
What type of information do we wish to represent? What can we
ignore?
Most common encoding: Bag of Words
Treat document as a collection of words and encode each docum
Bayes decision rule
If we know the conditional probability P(X  Y) we
can determine the appropriate class by using
Bayes rule:
P(y i  x)
P(x  y i)P(y i) def
qi (x)
P(x)
But how do we determine
p(XY)?
Computing p(xy)
Consider a
dataset with 16
attri
Organic Chemistry 334
Syllabus
Fall 2014
Class Meeting Information:
Lecture Days: Tuesday and Thursday
Lecture Times and Location: 8:00  9:50 AM in Hoffmann Hall 109
Tentative Course Outline: We are going to cover chapters 110
Instructor Information:
Dr