Recurrent Neural Network - LSTM and GRU
1 Recap
The first post in the series discussed the basic structure of recurrent cells and their limitations. We defined two families of functions: the first is $\mathcal{F}_{\phi}$, which contains all the affine transformations $x \mapsto Wx + b$ for any $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$, followed by an element-wise activation function $\phi$. And another family, $\mathcal{F}^2_{\phi}$, which is an extension of $\mathcal{F}_{\phi}$ in the sense that the input space is $\mathbb{R}^{n} \times \mathbb{R}^{m}$ rather than $\mathbb{R}^{n}$. Formally, the definition of $\mathcal{F}^2_{\phi}$ is

$$\mathcal{F}^2_{\phi} = \left\{ (x, h) \mapsto \phi\left(Wx + Uh + b\right) \;\middle|\; W \in \mathbb{R}^{m \times n},\; U \in \mathbb{R}^{m \times m},\; b \in \mathbb{R}^{m} \right\}$$
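To make the definition concrete, here is a minimal NumPy sketch of a member of $\mathcal{F}^2_{\phi}$; the helper name `make_f2` and the random dimensions are only for illustration, not anything from the original post.

```python
import numpy as np

def make_f2(W, U, b, phi=np.tanh):
    """A member of the family F^2_phi: (x, h) -> phi(W x + U h + b)."""
    def f(x, h):
        return phi(W @ x + U @ h + b)
    return f

# Example: input dimension n = 3, state dimension m = 2.
rng = np.random.default_rng(0)
n, m = 3, 2
f = make_f2(rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
print(f(rng.normal(size=n), np.zeros(m)))  # a vector in R^2
```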
In this post, which is the second in the series, we will focus on two activation functions: the sigmoid $\sigma(z) = 1/(1 + e^{-z})$ and the hyperbolic tangent $\tanh$. The next section describes what gate functions are, and then we move on to review two common improvements to the basic recurrent cell: the LSTM and GRU cells.
2 Gate Functions
Recall that a recurrent layer has two steps. The first is updating its inner state based on both the current input and the previous state vector; the second is producing an output by applying some other function to the new state. So the input to the layer at each time step $t$ is actually the tuple $(x_t, h_{t-1})$.
A gate is a function that takes such a tuple and produces a vector of values between zero and one. Note that any function in $\mathcal{F}_{\sigma}$ and in $\mathcal{F}^2_{\sigma}$ is actually a gate, because of the properties of the sigmoid function $\sigma$: every coordinate of its output lies in $(0, 1)$. We add gates to a recurrent neuron in order to control how much information flows over the "recurrent connection". That is, suppose that $\tilde{h}_t$ is the layer's current state that should be used at the next time step; a gated layer will use the vector $g_t \odot \tilde{h}_t$ instead of using $\tilde{h}_t$ (where the symbol $\odot$ denotes element-wise multiplication and $g_t$ is the gate's output). Note that every coordinate of $g_t \odot \tilde{h}_t$ is a "moderated" version of the corresponding coordinate of $\tilde{h}_t$, because each entry of the gate's output is in $(0, 1)$. We will see a concrete example in the next section.
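As a small illustration of the gating mechanism, here is a minimal NumPy sketch; the names `sigmoid` and `gate` and the random dimensions are my own, chosen only to show the element-wise moderation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(W, U, b, x, h_prev):
    """A gate: an affine map of (x, h_prev) squashed coordinate-wise into (0, 1)."""
    return sigmoid(W @ x + U @ h_prev + b)

rng = np.random.default_rng(1)
n, m = 3, 2
x, h_prev = rng.normal(size=n), rng.normal(size=m)
g_t = gate(rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m), x, h_prev)
h_tilde = np.tanh(rng.normal(size=m))   # some candidate state
moderated = g_t * h_tilde               # element-wise: each coordinate is shrunk toward 0
print(g_t, moderated)
```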
3 LSTM Cell
The Long Short Term Memory cell (LSTM cell) is an improved version of the recurrent neuron that was proposed in 1997. It went through minor modifications until the version presented below (which is from 2013). The LSTM cell addresses both limitations of the basic recurrent neuron: it mitigates the exploding and vanishing gradients problem, and it can remember as well as forget.
The main addition to the recurrent layer structure is the use of gates and a memory vector for each time step, denoted by $c_t$. The LSTM layer gets as inputs the tuple $(x_t, h_{t-1})$ and the previous memory vector $c_{t-1}$, then outputs $h_t$ and an updated memory vector $c_t$.

Here is how an LSTM layer computes its outputs: it has three gates $f, i, o \in \mathcal{F}^2_{\sigma}$, named the forget, input and output gate respectively, and a state-transition function $g \in \mathcal{F}^2_{\tanh}$. It first updates the memory vector

$$c_t = f(x_t, h_{t-1}) \odot c_{t-1} + i(x_t, h_{t-1}) \odot g(x_t, h_{t-1})$$

and then computes

$$h_t = o(x_t, h_{t-1}) \odot \tanh(c_t)$$
Those equations can be explained as follows. The first equation is an element-wise summation of two terms. The first term is the previous memory vector $c_{t-1}$, moderated by the forget gate; that is, the layer uses the current input $x_t$ and the previous output $h_{t-1}$ to determine how much to shrink each coordinate of the previous memory vector. The second term is the candidate for the new state, i.e. $g(x_t, h_{t-1})$, moderated by the input gate. Note that all the gates operate on the same input tuple $(x_t, h_{t-1})$.
The input and forget gates control the long- and short-term dependencies (i.e. the recurrent connection), and allow the LSTM layer to adaptively balance the new information that comes from the state-transition function against the historical information that comes from the memory vector, hence the names of the gates: input and forget.
Another difference is that the LSTM layer controls how much of its inner memory to expose by using the output gate. That is formulated in the second equation.
The addition of gates is what prevents the exploding and vanishing gradients problem. It makes the LSTM layer able to learn both long- and short-term dependencies, at the cost of increasing the number of parameters that need to be trained, which makes the network harder to optimize.
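For readers who prefer to see the equations as code, below is a minimal NumPy sketch of a single LSTM step under the notation above; the parameter layout, the name `lstm_step` and the random initialization are assumptions made for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step. `params` holds (W, U, b) triples for the forget (f),
    input (i) and output (o) gates and the state-transition function (g);
    the key names are illustrative."""
    def affine(name):
        W, U, b = params[name]
        return W @ x + U @ h_prev + b

    f_t = sigmoid(affine("f"))          # forget gate
    i_t = sigmoid(affine("i"))          # input gate
    o_t = sigmoid(affine("o"))          # output gate
    g_t = np.tanh(affine("g"))          # candidate state
    c_t = f_t * c_prev + i_t * g_t      # updated memory vector
    h_t = o_t * np.tanh(c_t)            # exposed output
    return h_t, c_t

# Usage with random parameters (n = input size, m = state size):
rng = np.random.default_rng(2)
n, m = 4, 3
params = {k: (rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
          for k in ("f", "i", "o", "g")}
h, c = lstm_step(params, rng.normal(size=n), np.zeros(m), np.zeros(m))
```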
4 GRU
Another gated cell, proposed in 2014, is the Gated Recurrent Unit (GRU). It has advantages similar to those of the LSTM cell, but fewer parameters to train, because the memory vector and one of the gates were removed.
A GRU has two gate functions $z, r \in \mathcal{F}^2_{\sigma}$, named the update and reset gate respectively, and a state-transition function $g \in \mathcal{F}^2_{\tanh}$. The input to a GRU layer is only the tuple $(x_t, h_{t-1})$ and the output is $h_t$, computed as follows: first the layer computes its gates for the current time step, denoted by $z_t = z(x_t, h_{t-1})$ for the update gate and by $r_t = r(x_t, h_{t-1})$ for the reset gate. Then the output of the layer is

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot g(x_t, r_t \odot h_{t-1})$$

with all the arithmetic operations being done element-wise.

The term $g(x_t, r_t \odot h_{t-1})$ is a candidate for the next state. Note that the state-transition function accepts the previous state moderated by the reset gate, allowing it to forget past states (hence the name of the gate: reset). The output of the layer is then a linear interpolation between the previous state and the candidate state, controlled by the update gate.
As opposed to the LSTM cell, the GRU doesn't have an output gate to control how much of its inner state to expose; therefore the entire state is exposed at each time step.
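Analogously, here is a minimal NumPy sketch of a single GRU step under the notation above; again the parameter layout and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(params, x, h_prev):
    """One GRU step. `params` holds (W, U, b) triples for the update (z) and
    reset (r) gates and the state-transition function (g); names are illustrative."""
    def affine(name, h):
        W, U, b = params[name]
        return W @ x + U @ h + b

    z_t = sigmoid(affine("z", h_prev))            # update gate
    r_t = sigmoid(affine("r", h_prev))            # reset gate
    h_cand = np.tanh(affine("g", r_t * h_prev))   # candidate state from reset-moderated history
    return z_t * h_prev + (1.0 - z_t) * h_cand    # interpolation controlled by z_t

# Usage with random parameters (n = input size, m = state size):
rng = np.random.default_rng(3)
n, m = 4, 3
params = {k: (rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
          for k in ("z", "r", "g")}
h = gru_step(params, rng.normal(size=n), np.zeros(m))
```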
Both LSTM and GRU cells are very common in recurrent network architectures and achieve great results on many tasks. LSTMs and GRUs can learn dependencies of various lengths, which makes the network very expressive. However, an overly expressive network can sometimes overfit; to prevent that, it is common to use some type of regularization, such as dropout.
In the next post I will discuss the dropout variants that are specific to RNNs.