CS 6375 Machine Learning Fall 2010 Assignment 1: Decision Tree Induction Part I: Due by Thursday, September 9, 11:59 p.m. Part II: Due by Tuesday, September 21, 11:59 p.m.
For Part I (the written problems), you may either slip a written (hard-copy) soluti
Homework-3
Question 1
The following neural network has 3 layers with nodes that compute the sigmoid function. There are no bias
connections.
[Figure: inputs x1 and x2 feed node V1 through weights w1 and w2; V1 feeds V2 through w3, V2 feeds V3 through w4, and V3 produces the output O.]
A.1 Give explicit expressions for the values of all nodes in forward propagation
Bayesian Classifiers
Bayesian theory can be used to define and identify the optimal classifier. Consider the hypothesis
h and the training data D to be results of experiments where h is drawn from a set of hypotheses
and the data in D is classified accord
Homework-2
Question 1
Consider a perceptron that computes its output O according to:
O = g(h)
h = Σ_{i=1}^{n} w_i x_i

where

g(h) = 1 / (1 + |h|).
Show that the DELTA rule for updating the weights is given by:
for i = 1, . . . , n,

w_i ← w_i − η δ x_i / (1 + h)^2    if h ≥ 0
w_i ← w_i + η δ x_i / (1 − h)^2    if h < 0

where δ = y − O is the output error and η is the learning rate.
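As a sanity check on a piecewise rule of this shape, the chain-rule gradient of the squared error E = (1/2)(y − O)^2 can be compared against finite differences. This sketch is not part of the handout; the test point is made up.

```python
# Sketch: numerically check the chain-rule gradient of
# E = (1/2)(y - O)^2 with O = g(h), h = sum_i w_i x_i, g(h) = 1/(1 + |h|).

def g(h):
    return 1.0 / (1.0 + abs(h))

def output(w, x):
    h = sum(wi * xi for wi, xi in zip(w, x))
    return h, g(h)

def analytic_grad(w, x, y):
    """dE/dw_i = -(y - O) * g'(h) * x_i, where for h != 0
    g'(h) = -sign(h) / (1 + |h|)^2."""
    h, o = output(w, x)
    delta = y - o
    sign = 1.0 if h >= 0 else -1.0
    gprime = -sign / (1.0 + abs(h)) ** 2
    return [-delta * gprime * xi for xi in x]

def numeric_grad(w, x, y, eps=1e-6):
    """Central finite differences of E with respect to each weight."""
    grads = []
    for i in range(len(w)):
        wp = list(w); wp[i] += eps
        wm = list(w); wm[i] -= eps
        ep = 0.5 * (y - output(wp, x)[1]) ** 2
        em = 0.5 * (y - output(wm, x)[1]) ** 2
        grads.append((ep - em) / (2 * eps))
    return grads

w, x, y = [0.5, -0.3], [1.0, 2.0], 1.0
for a, n in zip(analytic_grad(w, x, y), numeric_grad(w, x, y)):
    assert abs(a - n) < 1e-6
```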
The Nearest-Neighbor and the k-Nearest-Neighbor algorithms
In their simplest form, these algorithms do not perform any computation during training. The
computation is performed only when a test example is presented. Therefore, they are described
with inpu
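A minimal sketch of this lazy behavior (the dataset and query are made up): "training" just stores the examples, and all distance computation happens when a query arrives.

```python
# Sketch: k-NN with Euclidean distance and majority vote.
from collections import Counter

def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label); query: feature vector."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # All work happens here, at query time.
    neighbors = sorted(train, key=lambda ex: dist2(ex[0], query))[:k]
    # Majority vote among the k nearest labels.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0, 0), "N"), ((0, 1), "N"), ((5, 5), "P"), ((6, 5), "P")]
print(knn_predict(train, (5, 6), k=3))  # prints P
```

With k = 1 this is the Nearest-Neighbor algorithm; larger k trades sensitivity to noise for smoother decision boundaries.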
Naive-Bayes-Example-Solutions
Question 1
Part I
Consider the following data set with three Boolean predictive attributes, W, X, Y, and Boolean classification C.

W   X   Y   C
T   T   T   T
T   F   T   F
T   F   F   F
F   T   T   F
F   F   F   T
We now encounter a new example: W = F, X = T, Y = F.
Momentum-Example
Question 1
[Figure: a linear unit with input x1 and a constant input 1, connected through weights w1 and w0.]
The above neural net is implemented with a linear unit, and the loading algorithm is the ADALINE. Take
the initial values of the weights as w0 = 3, w1 = 0, and the learning rate as η = 0.1. The network
Linear discriminants
The input for the learning task (the training data) is the pairs (xi, yi), where xi = (xi(1), . . . , xi(n))
is a feature vector, and yi is the desired prediction. Consider a simple linear function that attempts
to predict y from
k-fold Cross Validation
k-fold cross validation is a common technique for estimating the performance of a classifier. Given
a set of m training examples, a single run of k-fold cross validation proceeds as follows:
1. Arrange the training examples in a random order.
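The splitting step above can be sketched as follows (a sketch; assigning folds by striding through the shuffled indices is one common choice, not necessarily the handout's):

```python
# Sketch: shuffle m example indices, partition into k folds, and pair
# each fold (as a test set) with the remaining examples (as a train set).
import random

def kfold_indices(m, k, seed=0):
    idx = list(range(m))
    random.Random(seed).shuffle(idx)          # step 1: random order
    folds = [idx[i::k] for i in range(k)]     # k nearly equal parts
    # Each fold serves once as the test set; the rest train.
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(m=10, k=5)
assert len(splits) == 5
assert all(len(test) == 2 for _, test in splits)
```

A classifier is then trained on each train set and evaluated on the matching test set, and the k error estimates are averaged.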
ID3-Example-Solutions
Question 1
You are given the following training data:
#   Height   Hair    Eyes    Sensitivity
1   short    blond   blue    yes
2   tall     blond   brown   n
3   tall     red     blue
4   tall     dark    brown
5   short    dark    blue
6   tall     dark    blue
7   tall     blond   blue
8   short    blond   brown
Entropy notes
We are considering instances that are identified by feature values. For example, the feature can
be color, and the feature values can be Red, Green, Blue. Probabilities are associated with the
likelihood that an instance has a particular feature value.
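The entropy of such a distribution over feature values can be sketched in a few lines (the example probabilities are made up):

```python
# Sketch: entropy of a probability distribution, in bits.
from math import log2

def entropy(probs):
    # Terms with p = 0 contribute nothing (lim p*log p = 0).
    return -sum(p * log2(p) for p in probs if p > 0)

# e.g. a color feature that is Red/Green/Blue with equal likelihood:
assert abs(entropy([1/3, 1/3, 1/3]) - log2(3)) < 1e-12
assert entropy([1.0]) == 0.0          # a certain outcome carries no information
assert entropy([0.5, 0.5]) == 1.0     # one fair bit
```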
Homework-5
Question 1
Apply the linear discriminant algorithm to the following training data:
x          y
(0, -1)    P
(0, 0)     N
(0, 1)     P
(0, 2)     N
(0, 3)     N
(1, 0)     P
(2, 0)     P
(2, 1)     P
1. Compute the matrix X and the vector y.
X=
y=
2. Compute the matrix B = X T X and the vector h = X T y.
B=
h=
3.
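The quantities in steps 1 and 2 are mechanical to compute. The sketch below uses conventions NOT stated in the truncated handout: each row of X gets a leading 1 for a bias term, and y is encoded +1 for P and −1 for N; the handout's own encoding may differ.

```python
# Sketch under assumed conventions: rows of X are (1, x1, x2), y is +/-1.
# B = X^T X and h = X^T y then follow directly.

points = [(0, -1), (0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (2, 1)]
labels = ["P", "N", "P", "N", "N", "P", "P", "P"]

X = [[1, a, b] for a, b in points]
y = [1 if c == "P" else -1 for c in labels]

def matT_mat(X):
    """Compute X^T X for a list-of-rows matrix."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)]
            for i in range(n)]

def matT_vec(X, y):
    """Compute X^T y."""
    n = len(X[0])
    return [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]

B = matT_mat(X)   # 3 x 3
h = matT_vec(X, y)
print(B, h)
```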
On probabilities and conditional probabilities
Intuitively, a probability is a measure of likelihood of a Boolean outcome of an experiment. If the
outcome of the experiment is the Boolean (random) variable x then we can discuss Prob(x). If x is
real, we c
Back-Propagation-Example
Question 1
[Figure: inputs x1 and x2 feed a hidden node V through weights w1 and w2; V feeds the nodes V1, V2, V3 through weights w3, w4, w5, which produce the outputs O1, O2, O3.]
The above neural network has two layers (one hidden layer), two inputs, and three outputs. (There are NO
bias connections.)
Error Back Propagation
The algorithm is described with respect to a single node in the network. The network is initialized
by assigning small random values to each weight. It is trained by repeated epochs.
[Figure: a node and its connections to the nodes of layer m + 1.]
The momentum enhancement to Back Propagation
The weights update rule in Back-Propagation is according to Steepest-Descent. The weight w is
updated by:
new w = old w − η (∂E/∂w)

where E is the error minimized by the algorithm and η is the learning rate. If η is chosen too large, the algorithm
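The text truncates here, but the standard momentum rule adds a fraction α of the previous weight change to each steepest-descent step. A sketch (the quadratic error, α, and all constants are made up for illustration):

```python
# Sketch: steepest descent with the momentum term
#   dw(t) = -eta * dE/dw + alpha * dw(t-1)
# on a one-dimensional quadratic error E(w) = (w - 3)^2.

def descend(grad, w, eta=0.1, alpha=0.5, steps=100, momentum=True):
    dw = 0.0
    for _ in range(steps):
        dw = -eta * grad(w) + (alpha * dw if momentum else 0.0)
        w += dw
    return w

grad = lambda w: 2 * (w - 3)          # dE/dw for E(w) = (w - 3)^2
w_m = descend(grad, w=0.0)
assert abs(w_m - 3) < 1e-3            # converges to the minimum at w = 3
```

The momentum term smooths successive steps, which damps the oscillation that a too-large η produces in plain steepest descent.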
Notes on Deep Learning
Our (very restricted) view of deep learning: a neural net with many hidden layers. The motivation:
the hidden layers are viewed as levels of abstraction: while the low levels handle the raw data, the
higher levels can represent abstractions
More on decision trees
Windowing in ID3
Windowing is applied in ID3 as a way of dealing with large sets of training instances. Without
windowing, such an algorithm can be really slow, as it needs to do entropy calculations over huge
amounts of data. With
A.1 Give explicit expressions for the values of all nodes in forward propagation when the network is given
the input x1 = 3, x2 = 9, with the desired output y1 = 1, y2 = 0, y3 = 1. Your answer should be in terms of
the old weights w1, w2, w3, w4, w5.
Gaussian Naive Bayesian
(Taken mostly from the book by Duda, Hart, and Stork.)
Gaussian density/distribution
The Gaussian density function of n-dimensional vectors is:

g(x; μ, C) = (1 / ((2π)^(n/2) |C|^(1/2))) e^(−(1/2)(x − μ)^T C^(−1) (x − μ))

Here μ is the distribution mean and C is the covariance matrix.
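A sketch evaluating this density for n = 2, writing the 2x2 determinant and inverse out by hand (the example covariance is made up):

```python
# Sketch: the n = 2 Gaussian density
#   g(x; mu, C) = exp(-0.5 (x-mu)^T C^{-1} (x-mu)) / ((2*pi)^(n/2) |C|^(1/2))
from math import pi, exp, sqrt

def gaussian2(x, mu, C):
    (a, b), (c, d) = C
    det = a * d - b * c                       # |C| for a 2x2 matrix
    inv = [[d / det, -b / det], [-c / det, a / det]]
    u = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(u[i] * inv[i][j] * u[j] for i in range(2) for j in range(2))
    return exp(-0.5 * quad) / (2 * pi * sqrt(det))

# At x = mu the exponent vanishes, so the density is 1 / (2*pi*sqrt(|C|)):
C = [[2.0, 0.5], [0.5, 1.0]]
assert abs(gaussian2([1, 2], [1, 2], C) - 1 / (2 * pi * sqrt(1.75))) < 1e-12
```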
Representation of CNF and DNF by a neural net
Consider neural nets with thresholds (and not sigmoids) at each node. These can easily compute CNF and DNF Boolean functions. A Boolean function of n Boolean variables is a function
f (x1 , . . . xn ) that pro
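A sketch of how threshold units capture the Boolean connectives (the particular DNF used here is made up, not from the handout):

```python
# Sketch: a threshold unit fires iff w . x >= theta.  With inputs in {0, 1},
# OR of n inputs is weights 1 with threshold 1; AND is weights 1 with
# threshold n; NOT x is weight -1 with threshold 0.

def unit(weights, theta, xs):
    return 1 if sum(w * x for w, x in zip(weights, xs)) >= theta else 0

def OR(xs):  return unit([1] * len(xs), 1, xs)
def AND(xs): return unit([1] * len(xs), len(xs), xs)

# The DNF (x1 AND x2) OR (NOT x3) as a two-layer threshold net:
def dnf(x1, x2, x3):
    c1 = AND([x1, x2])            # first term
    c2 = unit([-1], 0, [x3])      # NOT x3
    return OR([c1, c2])           # disjunction of the terms

assert dnf(1, 1, 1) == 1
assert dnf(0, 1, 1) == 0
assert dnf(0, 0, 0) == 1
```

A CNF is built the same way with the AND and OR layers swapped.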
Momentum-Example-Solutions
Question 1
[Figure: a linear unit with input x1 and a constant input 1, connected through weights w1 and w0.]
ADALINE, w0 = 3, w1 = 0, η = 0.1.
      x1    y
e1    1     2
e2    0.8   1
Training with e1 we have: O = 3, δ = y − O = 2 − 3 = −1.
New value of w0 = old w0 + ηδ·1 = 3 + (0.1)(−1)(1) = 2.9
New value of w1 = old w1 + ηδx1 = 0 + (0.1)(−1)(1) = −0.1
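The update above can be checked with a few lines of code (a sketch, not part of the handout):

```python
# Sketch: one ADALINE (delta-rule) step on the linear unit
# O = w0 * 1 + w1 * x1, with learning rate eta = 0.1.

def adaline_step(w0, w1, x1, y, eta=0.1):
    O = w0 * 1 + w1 * x1
    delta = y - O
    return w0 + eta * delta * 1, w1 + eta * delta * x1

w0, w1 = adaline_step(3.0, 0.0, x1=1.0, y=2.0)   # example e1
assert abs(w0 - 2.9) < 1e-12
assert abs(w1 - (-0.1)) < 1e-12
```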
Naive-Bayes-Example
Question 1
Part I
Consider the following data set with three Boolean predictive attributes, W, X, Y, and Boolean classification C.

W   X   Y   C
T   T   T   T
T   F   T   F
T   F   F   F
F   T   T   F
F   F   F   T
We now encounter a new example: W = F , X = T , Y = F
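A sketch of the Naive Bayes computation for this query, scoring each class by its prior times the product of per-attribute conditional frequencies from the table (no smoothing; this is a sketch, not the handout's worked solution):

```python
# Sketch: Naive Bayes scores for the query W=F, X=T, Y=F.
rows = [  # (W, X, Y, C)
    ("T", "T", "T", "T"),
    ("T", "F", "T", "F"),
    ("T", "F", "F", "F"),
    ("F", "T", "T", "F"),
    ("F", "F", "F", "T"),
]
query = {"W": "F", "X": "T", "Y": "F"}

def score(c):
    sub = [r for r in rows if r[3] == c]
    p = len(sub) / len(rows)                      # prior P(C = c)
    for i, attr in enumerate(["W", "X", "Y"]):
        # conditional frequency P(attr = query value | C = c)
        p *= sum(r[i] == query[attr] for r in sub) / len(sub)
    return p

assert score("T") > score("F")    # so Naive Bayes predicts C = T
print(score("T"), score("F"))
```

Here score("T") = (2/5)(1/2)(1/2)(1/2) = 0.05 and score("F") = (3/5)(1/3)(1/3)(1/3) ≈ 0.022.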
SVMs with Slack
Nicholas Ruozzi
University of Texas at Dallas
Based roughly on the slides of David Sontag
Primal SVM

min_{w,b} (1/2)‖w‖^2
such that y_i (w · x_i + b) ≥ 1, for all i

Note that Slater's condition holds as long as the data is
linearly separable.

Dual SVM

max_{λ ≥ 0} Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j (x_i · x_j)
such that Σ_i λ_i y_i = 0
Lagrange Multipliers
Kernel Trick
Nicholas Ruozzi
University of Texas at Dallas
Based roughly on the slides of David Sontag
General Optimization
A mathematical detour; we'll come back to SVMs soon!

min_x f_0(x)
subject to:
f_i(x) ≤ 0,  i = 1, . . . , m
h_j(x) = 0,  j = 1, . . . , p
General
ID3-Example
Question 1
You are given the following training data:
#   Height   Hair    Eyes
1   short    blond   blue
2   tall     blond   brown
3   tall     red     blue
4   tall     dark    brown
5   short    dark    blue
6   tall     dark    blue
7   tall     blond   blue
8   short    blond   brown
The target attribute is Sensitivity.
The exact (optimal) steepest descent algorithm
This algorithm finds a local minimum of f(x) = f(x1, . . . , xn) when given ∇f.

Start with an arbitrary x.
Repeat:
    r = −∇f(x)
    Compute a, where t = a minimizes φ(t) = f(x + t r)
    x = x + a r
Terminate when
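A sketch of this loop on a made-up quadratic, with the one-dimensional minimization of φ(t) = f(x + t r) done by ternary search over a fixed bracket (the function, bracket, and iteration counts are all illustrative choices, not part of the handout):

```python
# Sketch: exact-line-search steepest descent on
# f(x1, x2) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimum at (1, -3).

def f(x):
    return (x[0] - 1) ** 2 + 2 * (x[1] + 3) ** 2

def grad_f(x):
    return [2 * (x[0] - 1), 4 * (x[1] + 3)]

def line_min(phi, lo=0.0, hi=2.0, iters=200):
    """Minimize a unimodal phi on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

x = [0.0, 0.0]
for _ in range(50):
    r = [-g for g in grad_f(x)]            # steepest-descent direction
    a = line_min(lambda t: f([x[0] + t * r[0], x[1] + t * r[1]]))
    x = [x[0] + a * r[0], x[1] + a * r[1]]

assert abs(x[0] - 1) < 1e-6 and abs(x[1] + 3) < 1e-6
```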
CS 6375 Machine Learning Fall 2010 Assignment 4: Neural Networks, MLE, and Instance-Based Learning Part I: Due by Tuesday, November 9, 11:59 p.m. Part II: Due by Monday, November 15, 11:59 p.m.
Submission instructions for the written problems: Slip a hard