CS 6375 Machine Learning Fall 2010 Assignment 1: Decision Tree Induction Part I: Due by Thursday, September 9, 11:59 p.m. Part II: Due by Tuesday, September 21, 11:59 p.m.
For Part I (the written problems), you may either slip a written (hard-copy) soluti
Homework-3
Question 1
The following neural network has 3 layers with nodes that compute the sigmoid function. There are no bias connections.
[Network diagram: inputs x1 and x2 feed node V1 through weights w1 and w2; V1 feeds V2 through weight w3; V2 feeds V3 through weight w4; the output of V3 is O.]
A.1 Give explicit expressions to the values of all nodes in forward propa
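In code, the forward pass through this chain of sigmoid nodes looks as follows (a sketch; the exercise keeps the weights symbolic, so they are passed in as parameters):

```python
import math

def sigmoid(h):
    """Logistic sigmoid, the node activation used in this network."""
    return 1.0 / (1.0 + math.exp(-h))

def forward(x1, x2, w1, w2, w3, w4):
    """Forward propagation through the chain x1, x2 -> V1 -> V2 -> V3 = O."""
    v1 = sigmoid(w1 * x1 + w2 * x2)   # V1 sees both inputs
    v2 = sigmoid(w3 * v1)             # V2 sees only V1
    v3 = sigmoid(w4 * v2)             # V3's output is O
    return v1, v2, v3
```

A quick sanity check: with all weights zero, every node outputs sigmoid(0) = 0.5.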
ID3-Example-Solutions
Question 1
You are given the following training data:
 #   Height   Hair    Eyes    Sensitivity
 1   short    blond   blue    yes
 2   tall     blond   brown   no
 3   tall     red     blue
 4   tall     dark    brown
 5   short    dark    blue
 6   tall     dark    blue
 7   tall     blond   blue
 8   short    blond   brown
Representation of CNF and DNF by a neural net
Consider neural nets with thresholds (and not sigmoids) at each node. These can easily compute CNF and DNF Boolean functions. A Boolean function of n Boolean variables is a function
f (x1 , . . . xn ) that pro
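As a concrete sketch of this construction, the following assumes a particular DNF, (x1 AND NOT x2) OR (x2 AND x3), chosen purely for illustration: each conjunction becomes one hidden threshold unit, and the output unit ORs the hidden units.

```python
def threshold(h, theta):
    """Threshold (step) unit: fires iff the weighted input sum reaches theta."""
    return 1 if h >= theta else 0

def dnf_net(x1, x2, x3):
    """Two-layer threshold net computing (x1 AND NOT x2) OR (x2 AND x3)."""
    # Hidden layer: one unit per conjunction. A term uses weight +1 on each
    # positive literal, -1 on each negated literal, and threshold equal to
    # the number of positive literals.
    t1 = threshold(x1 - x2, 1)   # fires exactly when x1 = 1 and x2 = 0
    t2 = threshold(x2 + x3, 2)   # fires exactly when x2 = 1 and x3 = 1
    # Output layer: OR of the hidden units (threshold 1).
    return threshold(t1 + t2, 1)
```

The same weight scheme handles any DNF; a CNF is the dual construction with an AND (threshold = number of clauses) at the output.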
Momentum-Example-Solutions
Question 1
[Network diagram: a single linear unit with a constant input 1 weighted by w0 and an input x1 weighted by w1.]
ADALINE, with w0 = 3, w1 = 0, and learning rate η = 0.1.
      x1    y
 e1   1     2
 e2   0.8   1
Training with e1 we have: O = 3, δ = y - O = 2 - 3 = -1.
New value of w0 = old w0 + ηδ·1 = 3 + (0.1)(-1)(1) = 2.9
New value of w1 = old w1 + ηδ·x1 = 0 + (0.1)(-1)(1) = -0.1
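The same update, transcribed into Python with the same numbers:

```python
import math

def adaline_step(w0, w1, x1, y, eta):
    """One ADALINE (linear unit) update on a single example (x1, y)."""
    O = w0 * 1 + w1 * x1            # linear output; the bias input is 1
    delta = y - O                   # error on this example
    return w0 + eta * delta * 1, w1 + eta * delta * x1

# Training on e1: x1 = 1, y = 2, starting from w0 = 3, w1 = 0, eta = 0.1.
w0, w1 = adaline_step(3.0, 0.0, x1=1.0, y=2.0, eta=0.1)
# O = 3, delta = -1, so the new weights are w0 = 2.9 and w1 = -0.1.
```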
Naive-Bayes-Example
Question 1
Part I
Consider the following data set with three Boolean predictive attributes, W, X, Y, and Boolean classification C.

 W   X   Y   C
 T   T   T   T
 T   F   T   F
 T   F   F   F
 F   T   T   F
 F   F   F   T
We now encounter a new example: W = F , X = T , Y = F
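A minimal naive Bayes sketch for this query, with T/F encoded as Python booleans (an encoding choice of this sketch):

```python
data = [  # (W, X, Y, C) rows from the table above, True = T
    (True,  True,  True,  True),
    (True,  False, True,  False),
    (True,  False, False, False),
    (False, True,  True,  False),
    (False, False, False, True),
]

def nb_score(w, x, y, c):
    """Unnormalized naive Bayes score: P(C=c) * prod_i P(attr_i = val_i | C=c)."""
    rows = [r for r in data if r[3] == c]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for idx, val in ((0, w), (1, x), (2, y)):
        likelihood *= sum(r[idx] == val for r in rows) / len(rows)
    return prior * likelihood

# Score both classes for the new example W = F, X = T, Y = F.
scores = {c: nb_score(False, True, False, c) for c in (True, False)}
prediction = max(scores, key=scores.get)
```

Here the C = T score is (2/5)(1/2)(1/2)(1/2) = 0.05 and the C = F score is (3/5)(1/3)(1/3)(1/3) ≈ 0.022, so the predicted class is T.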
One-layer neural nets (perceptrons)
The input to a neural net is denoted by the variables x1, . . . , xn. The output of a single unit
perceptron is denoted by O. It is obtained by applying a function g to a weighted sum of the inputs:

O = g(h),  where  h = Σ_{i=1}^{n} w_i x_i
The exact (optimal) steepest descent algorithm
This algorithm finds a local minimum of f(x) = f(x1, . . . , xn) when given f.
Start with arbitrary x.
Repeat:
r = -∇f(x)
Compute a, where t = a minimizes φ(t) = f(x + t·r)
x = x + a·r
Terminate when
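For a quadratic f the minimization of φ(t) has a closed form, so the whole loop is easy to sketch (the particular quadratic below is an illustrative choice):

```python
import numpy as np

def steepest_descent(A, b, x, tol=1e-10, max_iter=500):
    """Exact line-search steepest descent for f(x) = 0.5 x^T A x - b^T x,
    with A symmetric positive definite. Here phi(t) = f(x + t r) is
    minimized in closed form by a = (r^T r) / (r^T A r)."""
    for _ in range(max_iter):
        r = b - A @ x                    # r = -grad f(x)
        if np.linalg.norm(r) < tol:      # terminate when the gradient vanishes
            break
        a = (r @ r) / (r @ (A @ r))      # argmin_t phi(t)
        x = x + a * r
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = steepest_descent(A, b, np.zeros(2))   # converges to A^{-1} b
```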
Ensemble Methods: Bagging
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag
Last Time
PAC learning
Bias/variance tradeoff
small hypothesis spaces (not enough flexibility) can have high bias
rich hypoth
Decision Trees
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag
Supervised Learning
Input: labelled training data
i.e., data plus desired output
Assumption: there exists a function that maps data items
Ensemble Methods: Boosting
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and Rob Schapire
Last Time
Variance reduction via bagging
Generate new training data sets by sampling with replacement from the empirical distr
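That resampling step can be sketched as follows (toy data; the fixed seed is only for reproducibility):

```python
import random

def bootstrap_sample(data, rng=random.Random(0)):
    """Draw one bootstrap replicate: len(data) samples with replacement."""
    m = len(data)
    return [data[rng.randrange(m)] for _ in range(m)]

train = list(range(10))           # placeholder training set
replicate = bootstrap_sample(train)
```

Each replicate has the same size as the original set; on average about 63% of the original examples appear in it, some more than once.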
Nearest Neighbor Methods
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag
Decision Trees
Decision Tree Learning
Basic decision tree building algorithm:
Pick the feature/attribute with the highest info
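The selection step can be sketched with an entropy/information-gain computation; the four-row dataset below is illustrative, not taken from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def info_gain(examples, attr, labels):
    """Information gain from splitting on attribute `attr` (a dict key)."""
    m = len(labels)
    gain = entropy(labels)
    for value in set(e[attr] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attr] == value]
        gain -= (len(subset) / m) * entropy(subset)
    return gain

# Illustrative mini-dataset (the labels are made up for this example).
examples = [{'Hair': 'blond'}, {'Hair': 'blond'}, {'Hair': 'red'}, {'Hair': 'dark'}]
labels = ['yes', 'no', 'yes', 'no']
gain_hair = info_gain(examples, 'Hair', labels)   # = 0.5 bits
```

The tree builder evaluates this gain for every remaining attribute and splits on the largest.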
Learning Theory
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag
Last Time
Probably approximately correct (PAC)
The only reasonable expectation of a learner is that with
high probability it learns
Support Vector Machines
Nicholas Ruozzi
University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate
Announcements
Homework 1 is now available online
Join the Piazza discussion group
Reminder: my office hours are 11am-12pm on Tuesda
Learning Theory
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate and David Sontag
Announcements
TA: Baoye Xue
Office hours: Monday and Wednesday 5pm-6pm in the
Clark Center CN 1.202D
Email: [email protected]
k-fold Cross Validation
k-fold cross validation is a common technique for estimating the performance of a classifier. Given
a set of m training examples, a single run of k-fold cross validation proceeds as follows:
1. Arrange the training examples in a ran
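The steps of this procedure can be sketched as follows; dealing the shuffled indices round-robin into folds is one reasonable implementation choice:

```python
import random

def k_fold_splits(m, k, seed=0):
    """Randomly order the m example indices, then deal them into k folds."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)       # step 1: random arrangement
    return [idx[i::k] for i in range(k)]   # fold i gets every k-th index

folds = k_fold_splits(m=10, k=3)
```

Each run then trains on k-1 folds and tests on the held-out fold, rotating the held-out fold k times.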
Linear discriminants
The input for the learning task (the training data) is the pairs (xi, yi), where xi = (xi(1), . . . , xi(n))
is a feature vector, and yi is the desired prediction. Consider a simple linear function that attempts
to predict y from
Momentum-Example
Question 1
[Network diagram: a single linear unit with a constant input 1 weighted by w0 and an input x1 weighted by w1.]
The above neural net is implemented with a linear unit, and the loading algorithm is the ADALINE. Take
the initial values of the weights as w0 = 3, w1 = 0, and the learning rate as η = 0.1. The network
On probabilities and conditional probabilities
Intuitively, a probability is a measure of likelihood of a Boolean outcome of an experiment. If the
outcome of the experiment is the Boolean (random) variable x then we can discuss Prob(x). If x is
real, we c
Back-Propagation-Example
Question 1
[Network diagram: inputs x1 and x2 feed a single hidden node V through weights w1 and w2; V feeds the nodes V1, V2, V3 through weights w3, w4, w5, and these produce the outputs O1, O2, O3.]
The above neural network has two layers (one hidden layer), two inputs, and three outputs. (There are NO bias connections.)
Error Back Propagation
The algorithm is described with respect to a single node in the network. The network is initialized
by assigning small random values to each weight. It is trained by repeated epochs.
The momentum enhancement to Back Propagation
The weight update rule in Back-Propagation follows Steepest Descent. The weight w is updated by:

new w = old w - η ∂E/∂w

where E is the error minimized by the algorithm. If η is chosen too large, the algorith
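A sketch of the momentum remedy on a one-dimensional error E(w) = w^2 (the error function, η, and α below are illustrative choices): each weight change blends the plain gradient step with the previous change, which damps oscillation.

```python
def momentum_descent(grad, w, eta=0.1, alpha=0.5, steps=200):
    """Gradient descent with momentum: new step = gradient step + alpha * old step."""
    dw = 0.0
    for _ in range(steps):
        dw = -eta * grad(w) + alpha * dw   # momentum term reuses the last change
        w += dw
    return w

# E(w) = w^2 has gradient 2w and its minimum at w = 0.
w_final = momentum_descent(lambda w: 2.0 * w, 5.0)
```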
Notes on Deep Learning
Our (very restricted) view of deep learning: a neural net with many hidden layers. The motivation:
the hidden layers are viewed as levels of abstraction. While the low levels handle the raw data, the
higher levels can represent abst
More on decision trees
Windowing in ID3
Windowing is applied in ID3 as a way of dealing with large sets of training instances. Without
windowing, such an algorithm can be really slow, as it needs to do entropy calculations over huge
amounts of data. With
A.1 Give explicit expressions for the values of all nodes in forward propagation when the network is given
the input x1 = 3, x2 = 9, with the desired output y1 = 1, y2 = 0, y3 = 1. Your answer should be in terms of
the old weights w1, w2, w3, w4, w5. Y
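A numeric version of that forward pass can be sketched as follows; the weight values are placeholders, since the exercise keeps them symbolic:

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def forward(x1, x2, w1, w2, w3, w4, w5):
    """Forward pass: single hidden node V, then the three outputs O1, O2, O3."""
    V = sigmoid(w1 * x1 + w2 * x2)                       # hidden node
    return sigmoid(w3 * V), sigmoid(w4 * V), sigmoid(w5 * V)

# Placeholder weights (the exercise asks for symbolic expressions instead).
O1, O2, O3 = forward(3, 9, w1=0.1, w2=-0.05, w3=0.2, w4=0.3, w5=-0.1)
```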
Gaussian Naive Bayesian
(Taken mostly from the book by Duda, Hart, and Stork.)
Gaussian density/distribution
The Gaussian density function of n-dimensional vectors is:
g(x; μ, C) = 1 / ((2π)^{n/2} |C|^{1/2}) · exp(-(1/2) (x - μ)^T C^{-1} (x - μ))

Here μ is the distribution mean and C is the covariance matrix.
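The density can be evaluated directly; the sketch below uses NumPy and solves C z = x - μ rather than forming C^{-1} explicitly:

```python
import numpy as np

def gaussian_density(x, mu, C):
    """n-dimensional Gaussian density g(x; mu, C)."""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(C) ** 0.5
    # diff @ solve(C, diff) computes (x - mu)^T C^{-1} (x - mu).
    return float(np.exp(-0.5 * diff @ np.linalg.solve(C, diff)) / norm)

# 1-D standard normal at its mean: the density is 1 / sqrt(2 pi).
p = gaussian_density(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
```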
Homework-5
Question 1
Apply the linear discriminant algorithm to the following training data:
 x        y
 (0, -1)  P
 (0, 0)   N
 (0, 1)   P
 (0, 2)   N
 (0, 3)   N
 (1, 0)   P
 (2, 0)   P
 (2, 1)   P
1. Compute the matrix X and the vector y.
X=
y=
2. Compute the matrix B = X^T X and the vector h = X^T y.
B=
h=
3.
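Steps 1-3 can be checked numerically; note that the label encoding P -> +1, N -> -1 and the leading bias column of 1s are assumptions of this sketch, not stated in the problem:

```python
import numpy as np

# Training data from the table above.
pts = [(0, -1), (0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (2, 1)]
lab = [+1, -1, +1, -1, -1, +1, +1, +1]      # assumed encoding: P = +1, N = -1

X = np.array([[1.0, a, b] for a, b in pts])  # assumed bias column of 1s
y = np.array(lab, dtype=float)

B = X.T @ X                 # step 2: B = X^T X
h = X.T @ y                 # step 2: h = X^T y
w = np.linalg.solve(B, h)   # least-squares weights: solve B w = h
```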
Entropy notes
We are considering instances that are identified by feature values. For example, the feature can
be color, and the feature values can be Red, Green, Blue. Probabilities are associated with the
likelihood that an instance has a particular fea
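A minimal sketch of the entropy of such a distribution over feature values:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair Red/Green split carries one bit; a certain color carries none.
h_fair = entropy([0.5, 0.5])   # = 1.0
h_sure = entropy([1.0])        # = 0.0
```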
Bayesian Classifiers
Bayesian theory can be used to define and identify the optimal classifier. Consider the hypothesis
h and the training data D to be results of experiments where h is drawn from a set of hypotheses
and the data in D is classified accord
Homework-2
Question 1
Consider a perceptron that computes its output O according to:

O = g(h),  where  h = Σ_{i=1}^{n} w_i x_i  and  g(h) = 1 / (1 + |h|).

Show that the DELTA rule for updating the weights is given by: for i = 1, . . . , n,

w_i ← w_i + η δ x_i · ( -1/(1 + h)^2  if h ≥ 0 ;  1/(1 - h)^2  if h < 0 )

where δ = y - O.
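The case split in the target update comes from differentiating g(h) = 1/(1 + |h|). A finite-difference sanity check of that derivative, evaluated away from the kink at h = 0, can be sketched as:

```python
def g(h):
    """Perceptron activation from the exercise."""
    return 1.0 / (1.0 + abs(h))

def g_prime(h):
    """Analytic derivative of g: -1/(1+h)^2 for h >= 0, 1/(1-h)^2 for h < 0."""
    return -g(h) ** 2 if h >= 0 else g(h) ** 2

# Central finite differences at points on each side of the kink.
eps = 1e-6
fd_pos = (g(2.0 + eps) - g(2.0 - eps)) / (2 * eps)     # should match g_prime(2.0)
fd_neg = (g(-1.5 + eps) - g(-1.5 - eps)) / (2 * eps)   # should match g_prime(-1.5)
```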
QUESTION 1
Consider the Bayes net shown below:
How many total parameters are needed to define the conditional probability tables for this network?
Options: 4, 1, 2, 8, 6
(10 points)
QUESTION 2
For the scenario shown in the previous question, what is the ratio of parameter
4. State whether the following statement is true or false:
A decision tree can only work with data that is linearly separable
True
False
Question 3
If you have a dataset with d attributes, how many total nodes will there be in the decision tree?
d^2
d+
1.
Consider the training data on the left and the decision tree model based on it shown on the right.
What is the training error, expressed as a percentage of the total number of instances in the
training dataset? [x] %
(Write a number with 0 decimal plac
QUESTION 1
Consider the joint probability distribution of three Boolean variables as shown below:
What are the prior probabilities of smart and study, i.e., P(smart) and P(study)?
P(Smart) = 0.3, P(Study) = 0.4
P(Smart) = 0.6, P(Study) = 0.4
P(Smart) = 0.6, P(Study) =
Question 1
1. Suppose n candidates have been called for a job and have been ranked 1, 2, 3, . . . , n.
Let X = the rank of a randomly selected candidate, so that X has pmf
p(x) = 1/n if x = 1, 2, 3, . . . , n, and p(x) = 0 otherwise.
Compute E(X) and Var(X) for this scenario.
(10 points)
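The closed forms E(X) = (n+1)/2 and Var(X) = (n^2 - 1)/12 for this discrete uniform pmf can be verified numerically:

```python
def uniform_rank_moments(n):
    """E(X) and Var(X) for X uniform on {1, ..., n}, computed from the pmf
    p(x) = 1/n; the closed forms are (n+1)/2 and (n^2 - 1)/12."""
    xs = range(1, n + 1)
    mean = sum(xs) / n                              # E(X)
    var = sum((x - mean) ** 2 for x in xs) / n      # Var(X)
    return mean, var

mean, var = uniform_rank_moments(10)   # (5.5, 8.25)
```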