5. Thus, training an RNN amounts to unrolling it for a given size of input
(and, correspondingly, the expected output) and then training the unrolled RNN by
computing the gradients and applying stochastic gradient descent.
As mentioned earlier in the chapter, RNNs can deal with arbitrarily long inputs and, correspondingly,
they need to be trained on arbitrarily long inputs. Figure 67 illustrates how an RNN is unrolled for different
sizes of inputs. Note that once the RNN is unrolled, the process of training it is identical to training
a regular neural network, which we have covered in earlier chapters. In Figure 67, the RNN described in
Figure 61 is unrolled for input sizes of 1, 2, 3, and 4.
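The unrolling described above can be sketched in a few lines of plain Python. This is only an illustration, not the chapter's implementation: it assumes the hidden-state update has the form h(t) = tanh(U x(t) + W h(t-1) + b), and the helper names (`matvec`, `rnn_step`, `unroll`) and the toy parameter values are hypothetical. The point it shows is that the same parameters are reused at every step, so unrolling for an input of size T simply produces T copies of the same cell.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(U, W, b, x_t, h_prev):
    """One time step, assuming h_t = tanh(U x_t + W h_prev + b)."""
    ux = matvec(U, x_t)
    wh = matvec(W, h_prev)
    return [math.tanh(p + q + r) for p, q, r in zip(ux, wh, b)]

def unroll(U, W, b, xs):
    """Unroll the RNN over the sequence xs, reusing U, W, b at every step."""
    h = [0.0] * len(b)           # assumed all-zero initial hidden state
    states = []
    for x_t in xs:               # one copy of the cell per input element
        h = rnn_step(U, W, b, x_t, h)
        states.append(h)
    return states

# Hypothetical toy parameters: 2-dimensional input and hidden state.
U = [[0.5, -0.1], [0.2, 0.3]]
W = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

# Unroll the same network for input sizes 1, 2, 3, and 4.
for T in (1, 2, 3, 4):
    xs = [[1.0, 0.0]] * T
    print(T, len(unroll(U, W, b, xs)))
```

Once the network is unrolled for a fixed T, it is an ordinary feedforward computation with shared weights, which is why gradient computation and stochastic gradient descent apply unchanged.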
CHAPTER 6
■
RECURRENT NEURAL NETWORKS
87
Figure 67. Unrolling the RNN corresponding to Figure 61 for different sizes of inputs
Figure 68. Teacher Forcing (Top: Training, Bottom: Prediction)
Given that the data set to be trained on consists of sequences of varying sizes, the input sequences are
grouped so that the sequences of the same size fall in one group. Then, for a given group, we can unroll the RNN
for that group's sequence length and train. Training on a different group requires the RNN to be unrolled for a
different sequence length. Thus, the RNN can be trained on inputs of varying sizes by unrolling it to match
the sequence length of each group.
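The grouping step can be illustrated with a small sketch. The function name `group_by_length` and the toy data are hypothetical; the only assumption is that each sequence is a Python list whose length is the sequence length the RNN would be unrolled for.

```python
from collections import defaultdict

def group_by_length(sequences):
    """Group sequences so that all sequences of the same length share a group."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[len(seq)].append(seq)
    return dict(groups)

# Hypothetical toy data set with sequences of lengths 1, 2, and 3.
data = [[3, 1], [4], [1, 5, 9], [2, 6], [5]]

for length, group in sorted(group_by_length(data).items()):
    # The RNN would be unrolled for `length` steps and trained on `group`.
    print(length, group)
```

Each group can then be treated as a batch for a network unrolled to that group's length.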
It must be noted that training the unrolled RNN (illustrated in Figure 61) is essentially a sequential
process, as the hidden states are dependent on each other. In the case of RNNs wherein the recurrence is
over the output instead of the hidden state (Figure 62), it is possible to use a technique called teacher
forcing, as illustrated in Figure 68. The key idea here is to use $y^{(t-1)}$ instead of $\hat{y}^{(t-1)}$
in the computation of $h^{(t)}$ while training. While making predictions (when the model is deployed for
usage), however, $\hat{y}^{(t-1)}$ is used.
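A minimal sketch of teacher forcing follows. It assumes an output-recurrent network of the form h(t) = tanh(U x(t) + W y_prev + b) and ŷ(t) = softmax(V h(t) + c); this form, the function names (`step`, `run`), the all-zero initial value of the previous output, and the toy parameters are assumptions made for illustration. The only difference between the two modes is what is fed back as the previous output: the true y(t-1) during training, or the model's own ŷ(t-1) during prediction.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def step(params, x_t, y_prev):
    """One step of an RNN whose recurrence is over the output (assumed form)."""
    U, W, b, V, c = params
    h = [math.tanh(p + q + r)
         for p, q, r in zip(matvec(U, x_t), matvec(W, y_prev), b)]
    y_hat = softmax([p + q for p, q in zip(matvec(V, h), c)])
    return h, y_hat

def run(params, xs, targets=None):
    """Run over xs. With targets (training), feed y(t-1); otherwise feed y_hat(t-1)."""
    y_prev = [0.0] * len(params[4])   # assumed all-zero "previous output" at t = 0
    outputs = []
    for t, x_t in enumerate(xs):
        _, y_hat = step(params, x_t, y_prev)
        outputs.append(y_hat)
        # Teacher forcing: the ground truth replaces the prediction in the feedback path.
        y_prev = targets[t] if targets is not None else y_hat
    return outputs

# Hypothetical toy parameters: 1-dimensional input, 2 hidden units, 2 output classes.
params = (
    [[0.5], [-0.3]],              # U
    [[0.2, -0.2], [0.1, 0.4]],    # W (acts on the previous output)
    [0.0, 0.0],                   # b
    [[1.0, -1.0], [-1.0, 1.0]],   # V
    [0.0, 0.0],                   # c
)
xs = [[1.0], [0.5], [-1.0]]
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # one-hot targets

forced = run(params, xs, targets=ys)   # training mode
free = run(params, xs)                 # prediction mode
print(forced[1], free[1])
```

With teacher forcing, the feedback input at every step is known before training begins, so the computation of one step no longer has to wait for the prediction of the previous step.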
Bidirectional RNNs
Let us now take a look at another variation on RNNs, namely, the bidirectional RNN. The key intuition
behind a bidirectional RNN is to use the entities that lie later in the sequence to make a prediction for the
current entity. For all the RNNs we have considered so far, we have been using the previous entities (captured
by the hidden state) and the current entity in the sequence to make the prediction. However, we have not
been using information about the entities that come after the current one. A
bidirectional RNN leverages this information and can give improved predictive accuracy in many cases.
A bidirectional RNN can be described using the following equations:
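$$
\begin{aligned}
h_f^{(t)} &= \tanh\left(U_f x^{(t)} + W_f h_f^{(t-1)} + b_f\right) \\
h_b^{(t)} &= \tanh\left(U_b x^{(t)} + W_b h_b^{(t+1)} + b_b\right) \\
\hat{y}^{(t)} &= \mathrm{softmax}\left(V_b h_b^{(t)} + V_f h_f^{(t)} + c\right)
\end{aligned}
$$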
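The bidirectional computation can be sketched in plain Python as follows. This is an illustrative forward pass only: the function names (`hidden_states`, `birnn_forward`), the all-zero boundary states at both ends of the sequence, and the toy parameters are assumptions. The forward hidden state is computed left to right, the backward hidden state right to left, and the two are combined at each position to produce the output.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    """Element-wise sum of several equal-length vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def hidden_states(U, W, b, xs, reverse=False):
    """Hidden states for every position, scanning forward or backward."""
    T = len(xs)
    states = [None] * T
    h = [0.0] * len(b)   # assumed all-zero boundary state
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h = [math.tanh(v) for v in add(matvec(U, xs[t]), matvec(W, h), b)]
        states[t] = h
    return states

def birnn_forward(xs, fwd, bwd, V_f, V_b, c):
    """Combine forward and backward hidden states into one output per position."""
    h_f = hidden_states(*fwd, xs)                 # depends on x(1..t)
    h_b = hidden_states(*bwd, xs, reverse=True)   # depends on x(t..T)
    return [softmax(add(matvec(V_b, hb), matvec(V_f, hf), c))
            for hf, hb in zip(h_f, h_b)]

# Hypothetical toy parameters: 1-dimensional input, 2 hidden units per direction.
fwd = ([[0.5], [-0.3]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])   # U_f, W_f, b_f
bwd = ([[0.4], [0.2]], [[0.3, 0.0], [0.0, 0.3]], [0.0, 0.0])    # U_b, W_b, b_b
V_f = [[1.0, 0.0], [0.0, 1.0]]
V_b = [[1.0, 0.0], [0.0, 1.0]]
c = [0.0, 0.0]

outputs = birnn_forward([[1.0], [1.0], [1.0]], fwd, bwd, V_f, V_b, c)
print(outputs[0])
```

Because the backward hidden state at position t summarizes the inputs from t to the end of the sequence, changing the last input changes the prediction at the first position, which a unidirectional RNN cannot do.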
The following points are to be noted:
1. The RNN computation involves first computing the forward hidden state and
backward hidden state for an entity in the sequence. These are denoted by
$h_f^{(t)}$ and $h_b^{(t)}$, respectively.