Recurrent Neural Network - LSTM and GRU
1 Recap
The first post in the series discussed the basic structure of recurrent cells and their limitations. We defined two families of functions. The first is $\mathcal{F}^{\phi}_{n,m}$, which contains all the affine transformations $x \mapsto Wx + b$ for any $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$, followed by an element-wise activation function $\phi$. And another family, $\mathcal{R}^{\phi}_{n,m}$, which is a kind of extension of $\mathcal{F}^{\phi}_{n,m}$, in the sense that its input space is $\mathbb{R}^{n} \times \mathbb{R}^{m}$ rather than $\mathbb{R}^{n}$. Formally, the definition of $\mathcal{R}^{\phi}_{n,m}$ is

$$\mathcal{R}^{\phi}_{n,m} = \left\{ f \;\middle|\; f(x, h) = \phi\!\left(Wx + Uh + b\right),\; W \in \mathbb{R}^{m \times n},\; U \in \mathbb{R}^{m \times m},\; b \in \mathbb{R}^{m} \right\}$$
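To make the recap concrete, here is a minimal numpy sketch of a member of $\mathcal{R}^{\phi}_{n,m}$. The function name `recurrent_affine`, the dimensions, and the choice of $\tanh$ as the activation are illustrative assumptions, not part of the definition above.

```python
import numpy as np

def recurrent_affine(x, h, W, U, b, phi=np.tanh):
    """A member of R^phi_{n,m}: an affine map of the pair (x, h)
    followed by an element-wise activation phi."""
    return phi(W @ x + U @ h + b)

# Illustrative sizes: input dimension n, state dimension m.
n, m = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))
U = rng.normal(size=(m, m))
b = np.zeros(m)

x_t = rng.normal(size=n)      # current input
h_prev = np.zeros(m)          # previous state
h_t = recurrent_affine(x_t, h_prev, W, U, b)   # new state, shape (m,)
```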
In this post, which is the second in the series, we will focus on two activation functions: the sigmoid $\sigma$ and the hyperbolic tangent $\tanh$. The next section describes what gate functions are; then we move on to review two common improvements to the basic recurrent cell: the LSTM and GRU cells.
2 Gate Functions
Recall that a recurrent layer operates in two steps. The first is updating its inner state based on both the current input and the previous state vector; the second is producing an output by applying some other function to the new state. So the input to the layer at each time step $t$ is actually the tuple $(x_t, h_{t-1})$.
A gate is a function that takes such a tuple and produces a vector of values between zero and one. Note that any function in $\mathcal{F}^{\sigma}_{n,m}$ and in $\mathcal{R}^{\sigma}_{n,m}$ is actually a gate, because the range of the $\sigma$ function is $(0, 1)$. We add gates to a recurrent neuron in order to control how much information flows over the “recurrent connection”. That is, suppose that $s_t$ is the layer’s current state that should be used at the next time step, and that $g_t$ is the output of a gate; a gated layer will use the vector $g_t \odot s_t$ instead of using $s_t$ (where the symbol $\odot$ denotes element-wise multiplication). Note that any coordinate of $g_t \odot s_t$ is a “moderated” version of the corresponding coordinate of $s_t$, because each entry of the gate’s output is in $(0, 1)$. We will see a concrete example in the next section.
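Before that, here is a small numerical illustration of moderation by a gate. The gate values below are made up for the example, and `gate` is just an assumed helper name, not notation from the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# An illustrative gate in R^sigma_{n,m}: the same affine form as before,
# but with a sigmoid activation so every output entry lies in (0, 1).
def gate(x, h, W, U, b):
    return sigmoid(W @ x + U @ h + b)

# Suppose s_t is the state the layer would normally carry forward...
s_t = np.array([2.0, -1.0, 0.5])
# ...and g_t is the gate's output for the current (x_t, h_{t-1}) tuple.
g_t = np.array([0.9, 0.1, 0.5])   # made-up values in (0, 1)

moderated = g_t * s_t             # element-wise multiplication
# moderated == [1.8, -0.1, 0.25]: each coordinate of s_t is shrunk
# according to the corresponding gate entry.
```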
3 LSTM Cell
The Long Short-Term Memory cell (LSTM cell) is an improved version of the recurrent neuron that was proposed in 1997. It went through minor modifications until the version presented below (which is from 2013). LSTM cells address both limitations of the basic recurrent neuron: they mitigate the exploding and vanishing gradients problem, and they can remember as well as forget.
The main addition to the recurrent layer structure is the use of gates together with a memory vector for each time step, denoted by $c_t$. The LSTM layer gets as inputs the tuple $(x_t, h_{t-1})$ and the previous memory vector $c_{t-1}$, and outputs $h_t$ together with an updated memory vector $c_t$.

Here is how an LSTM layer computes its outputs. It has three gates $f, i, o \in \mathcal{R}^{\sigma}_{n,m}$, named the forget, input and output gates respectively, and a state-transition function $g \in \mathcal{R}^{\tanh}_{n,m}$. It first updates the memory vector

$$c_t = f(x_t, h_{t-1}) \odot c_{t-1} + i(x_t, h_{t-1}) \odot g(x_t, h_{t-1})$$

and then computes the output

$$h_t = o(x_t, h_{t-1}) \odot \tanh(c_t)$$
These equations can be explained as follows. The first equation is an element-wise summation of two terms. The first term is the previous memory vector $c_{t-1}$, moderated by the forget gate; that is, the layer uses the current input $x_t$ and the previous output $h_{t-1}$ to determine how much to shrink each coordinate of the previous memory vector. The second term is the candidate for the new state, $g(x_t, h_{t-1})$, moderated by the input gate. Note that all the gates operate on the same input tuple $(x_t, h_{t-1})$.
The input and forget gates control the long- and short-term dependencies (i.e. the recurrent connection), and allow the LSTM layer to adaptively balance the new information coming from the state-transition function against the history information stored in the memory vector; hence the names of the gates: input and forget.
Another difference is that the LSTM layer controls how much of its inner memory to expose by using the output gate; this is formulated in the second equation.
The addition of gates is what mitigates the exploding and vanishing gradients problem. It enables the LSTM layer to learn both long- and short-term dependencies, at the cost of increasing the number of parameters that need to be trained, which makes the network harder to optimize.
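To make the update rule concrete, here is a minimal numpy sketch of a single LSTM step following the equations above. The helper name `lstm_step`, the parameter layout, and the dimensions are illustrative assumptions; a real implementation would add batching and learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new output h_t and memory vector c_t."""
    W_f, U_f, b_f = params["f"]   # forget gate
    W_i, U_i, b_i = params["i"]   # input gate
    W_o, U_o, b_o = params["o"]   # output gate
    W_g, U_g, b_g = params["g"]   # state-transition function

    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)
    g_t = np.tanh(W_g @ x_t + U_g @ h_prev + b_g)   # candidate state

    c_t = f_t * c_prev + i_t * g_t   # moderated history + moderated candidate
    h_t = o_t * np.tanh(c_t)         # expose part of the memory
    return h_t, c_t

# Illustrative dimensions and random parameters.
n, m = 4, 3
rng = np.random.default_rng(1)
params = {k: (rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
          for k in ("f", "i", "o", "g")}
h_t, c_t = lstm_step(rng.normal(size=n), np.zeros(m), np.zeros(m), params)
```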
4 GRU
Another gated cell, proposed in 2014, is the Gated Recurrent Unit (GRU). It has similar advantages to the LSTM cell, but fewer parameters to train, since the separate memory vector and one of the gates were removed.
The GRU has two gate functions $z, r \in \mathcal{R}^{\sigma}_{n,m}$, named the update and reset gates respectively, and a state transition $g \in \mathcal{R}^{\tanh}_{n,m}$. The input to a GRU layer is only the tuple $(x_t, h_{t-1})$, and the output is $h_t$, computed as follows: first the layer computes its gates for the current time step, denoted by $z_t = z(x_t, h_{t-1})$ for the update gate and by $r_t = r(x_t, h_{t-1})$ for the reset gate. Then the output of the layer is

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot g(x_t,\, r_t \odot h_{t-1})$$

with all the arithmetic operations being done element-wise.
The term $g(x_t,\, r_t \odot h_{t-1})$ is a candidate for the next state. Note that the state-transition function receives the previous state moderated by the reset gate, allowing it to forget past states (hence the name of the gate: reset). The output of the layer is then a linear interpolation between the previous state and the candidate state, controlled by the update gate.
As opposed to the LSTM cell, the GRU doesn’t have an output gate to control how much of its inner state to expose; therefore, the entire state is exposed at each time step.
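Here is the analogous sketch for a single GRU step, again under the equation above; the helper name `gru_step`, the parameter layout, and the dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step: returns the new state/output h_t."""
    W_z, U_z, b_z = params["z"]   # update gate
    W_r, U_r, b_r = params["r"]   # reset gate
    W_g, U_g, b_g = params["g"]   # state-transition function

    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)
    g_t = np.tanh(W_g @ x_t + U_g @ (r_t * h_prev) + b_g)  # candidate state

    # Linear interpolation between the previous state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * g_t

# Illustrative dimensions and random parameters.
n, m = 4, 3
rng = np.random.default_rng(2)
params = {k: (rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
          for k in ("z", "r", "g")}
h_t = gru_step(rng.normal(size=n), np.zeros(m), params)
```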
Both LSTM and GRU cells are very common in many recurrent network architectures and achieve great results on many tasks. LSTMs and GRUs can learn dependencies of various lengths, which makes the network very expressive. However, an overly expressive network can sometimes lead to overfitting; to prevent that, it is common to use some type of regularization, such as Dropout.
In the next post, I will discuss the dropout variants that are specific to RNNs.