ular task, a natural way to train a deep network is to frame it as an optimization problem: specify a supervised cost function on the output layer with
respect to the desired target, then use a gradient-based optimization algorithm to adjust the weights and biases of the network so that its output has low cost on
samples in the training set. Unfortunately, deep networks trained in that manner have generally been found to perform worse than neural networks with one or two hidden
layers.
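To make the training setup described above concrete, here is a minimal sketch of full-batch gradient descent on a supervised cost. Everything specific in it is an assumption chosen for illustration and not taken from the text: the XOR toy data, the layer sizes, the tanh hidden units, the sigmoid output with a cross-entropy cost, the learning rate, and the helper names (`init_params`, `forward`, `cost`, `gradients`, `train`).

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Small random weights and zero biases for each layer (illustrative choice)."""
    return [(rng.normal(0.0, 0.5, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for i, (W, b) in enumerate(params):
        z = activations[-1] @ W + b
        is_output = i == len(params) - 1
        # Sigmoid on the output layer, tanh on hidden layers.
        activations.append(1.0 / (1.0 + np.exp(-z)) if is_output else np.tanh(z))
    return activations

def cost(params, x, y):
    """Supervised cost: mean cross-entropy between output and target."""
    p = forward(params, x)[-1]
    eps = 1e-12
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def gradients(params, x, y):
    """Backpropagation for tanh hidden layers and a single sigmoid output."""
    acts = forward(params, x)
    delta = (acts[-1] - y) / x.shape[0]  # gradient w.r.t. output pre-activation
    grads = []
    for i in reversed(range(len(params))):
        W, _ = params[i]
        grads.append((acts[i].T @ delta, delta.sum(axis=0)))
        if i > 0:
            delta = (delta @ W.T) * (1.0 - acts[i] ** 2)  # tanh derivative
    return grads[::-1]

def train(params, x, y, lr=0.2, steps=3000):
    """Plain gradient descent on all weights and biases at once."""
    for _ in range(steps):
        grads = gradients(params, x, y)
        params = [(W - lr * gW, b - lr * gb)
                  for (W, b), (gW, gb) in zip(params, grads)]
    return params

# Toy training set (XOR), an assumption used only to exercise the code.
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
params = init_params([2, 4, 1], rng)
initial_cost = cost(params, x, y)
trained = train(params, x, y)
final_cost = cost(trained, x, y)
print(initial_cost, final_cost)
```

Passing a longer list such as `[2, 4, 4, 4, 1]` to `init_params` gives a deeper network trained by the same procedure, which is the setting whose difficulties the following paragraphs discuss.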
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of
the non-convex training criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000) also influence the chance that a random
initialization falls in the basin of attraction (via gradient descent) of a poor solution. It may be that with more layers, the number or the width of such poor basins increases.
To reduce the difficulty, it has been suggested to train a neural network in a constructive manner in order to divide the hard optimization problem into several greedy but
simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches
have been shown to be very effective for learning particularly complex functions, such as a highly nonlinear classification problem in 2 dimensions. However, these are
exceptionally hard problems, and for learning tasks usually found in practice, these approaches commonly overfit.
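The layer-at-a-time idea mentioned above can be sketched as follows. This is a generic supervised greedy scheme written for illustration, not a reproduction of the methods of Fahlman and Lebiere (1990) or Lengellé and Denoeux (1996): each new hidden layer is trained together with a temporary output layer on the representation produced by the layers already in place, then frozen. The function names, toy data, layer sizes, and hyperparameters are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_hidden_layer(h, y, n_hidden, rng, lr=0.2, steps=3000):
    """Train one tanh hidden layer plus a sigmoid output on (h, y)."""
    W = rng.normal(0.0, 0.5, size=(h.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.5, size=(n_hidden, y.shape[1]))
    c = np.zeros(y.shape[1])
    n = h.shape[0]
    for _ in range(steps):
        a = np.tanh(h @ W + b)                   # hidden activations
        p = sigmoid(a @ V + c)                   # network output
        d_out = (p - y) / n                      # cross-entropy gradient at output
        d_hid = (d_out @ V.T) * (1.0 - a ** 2)   # backpropagate through tanh
        V -= lr * (a.T @ d_out)
        c -= lr * d_out.sum(axis=0)
        W -= lr * (h.T @ d_hid)
        b -= lr * d_hid.sum(axis=0)
    return (W, b), (V, c)

def greedy_layerwise(x, y, hidden_sizes, rng):
    """Add hidden layers one at a time; earlier layers stay frozen."""
    frozen, h, output = [], x, None
    for n_hidden in hidden_sizes:
        (W, b), output = fit_one_hidden_layer(h, y, n_hidden, rng)
        frozen.append((W, b))
        h = np.tanh(h @ W + b)  # representation fed to the next layer
    # Only the output layer trained with the last hidden layer is kept.
    return frozen, output

def predict(frozen, output, x):
    h = x
    for W, b in frozen:
        h = np.tanh(h @ W + b)
    V, c = output
    return sigmoid(h @ V + c)

# Toy 2-dimensional classification problem (XOR), used only as an example.
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
frozen, output = greedy_layerwise(x, y, hidden_sizes=[4, 3], rng=rng)
predictions = predict(frozen, output, x)
print(predictions.ravel())
```

Each call to `fit_one_hidden_layer` solves a small problem with a single hidden layer, which is the sense in which the hard optimization problem is divided into simpler greedy ones. Because every layer is fitted directly to the training targets, nothing in this sketch guards against the overfitting noted above.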
This observation leads to a second hypothesis. For high-capacity and highly flexible deep networks, there actually exist many basins of attraction in their parameter space (i.e.,
yielding different solutions with gradient descent) that can give low training error but that can have very different generalization errors. So even when gradient descent is able
to find a (possibly local) good minimum in terms of training error, there is no guarantee that the associated parameter conf...