1 Recap

The first post in the series discussed the basic structure of recurrent cells and their limitations. We defined two families of functions: the first, $\mathcal{F}$, contains all the affine transformations $x \mapsto Wx + b$, for any matrix $W$ and bias vector $b$, followed by an element-wise activation function $\phi$; the second, $\mathcal{F}_2$, is an extension of $\mathcal{F}$ in the sense that its input space is $\mathbb{R}^n \times \mathbb{R}^m$ rather than $\mathbb{R}^n$. Formally, the definition of $\mathcal{F}_2$ is

$$\mathcal{F}_2 = \left\{ (x, h) \mapsto \phi\left(Wx + Uh + b\right) \right\}.$$

In this post, which is the second in the series, we will focus on two activation functions: the sigmoid $\sigma$ and the hyperbolic tangent $\tanh$. The next section describes what gate functions are; then we move on to review two common improvements to the basic recurrent cell: the LSTM and GRU cells.
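For concreteness, here are representative members of the two families with the two activations used throughout the rest of the post (the parameter symbols are illustrative):

$$x \mapsto \sigma(Wx + b) \in \mathcal{F}, \qquad (x, h) \mapsto \tanh(Wx + Uh + b) \in \mathcal{F}_2.$$

These two shapes are exactly the building blocks used below: sigmoid members will act as gates, and $\tanh$ members will act as state-transition functions.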

2 Gate Functions

Recall that a recurrent layer operates in two steps. The first is updating its inner state based on both the current input and the previous state vector; the second is producing an output by applying some other function to the new state. So the input to the layer at each time step is actually the tuple $(x_t, s_{t-1})$, where $x_t$ is the current input and $s_{t-1}$ is the previous state.
A gate is a function that takes such a tuple and produces a vector of values between zero and one. Note that any function in $\mathcal{F}$ or in $\mathcal{F}_2$ whose activation is the sigmoid $\sigma$ is actually a gate, because $\sigma$ squashes every value into the interval $(0, 1)$. We add gates to a recurrent neuron in order to control how much information flows over the “recurrent connection”. That is, suppose that $s_t$ is the layer’s current state that should be used at the next time step; a gated layer will use the vector $\gamma_t \odot s_t$ instead of $s_t$, where $\gamma_t$ is the gate’s output and the symbol $\odot$ denotes element-wise multiplication. Note that any coordinate of $\gamma_t \odot s_t$ is a “moderated” version of the corresponding coordinate of $s_t$, because each entry of the gate’s output is in $(0, 1)$. We will see a concrete example in the next section.
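To make this concrete, here is a minimal NumPy sketch of a gate and of how its output moderates a state vector; the names `gate`, `W`, `U` and `b` are illustrative only:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gate(x, s_prev, W, U, b):
    """An affine map of the tuple (x, s_prev) squashed by a sigmoid,
    so every entry of the result lies in (0, 1)."""
    return sigmoid(W @ x + U @ s_prev + b)

# Toy dimensions: input of size 4, state of size 3.
rng = np.random.default_rng(0)
x, s_prev = rng.normal(size=4), rng.normal(size=3)
W, U, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

g = gate(x, s_prev, W, U, b)   # every entry is in (0, 1)
s_new = rng.normal(size=3)     # some state vector to be carried forward
moderated = g * s_new          # element-wise moderation of the state
```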

3 LSTM Cell

The Long Short-Term Memory cell (LSTM cell) is an improved version of the recurrent neuron that was proposed in 1997. It went through minor modifications until the version presented below, which is from 2013. The LSTM cell addresses both limitations of the basic recurrent neuron: it mitigates the exploding and vanishing gradients problem, and it can remember as well as forget.
The main addition to the recurrent layer structure is the use of gates together with a memory vector for each time step, denoted by $c_t$. The LSTM layer takes as input the tuple $(x_t, y_{t-1})$ (the previous output plays the role of the previous state here) and the previous memory vector $c_{t-1}$, and then outputs $y_t$ and an updated memory vector $c_t$.
Here is how an LSTM layer computes its outputs. It has three gates $f, i, o \in \mathcal{F}_2$, named the forget, input and output gate respectively (all with $\sigma$ as their activation), and a state-transition function $g \in \mathcal{F}_2$ (with $\tanh$ as its activation). Writing $f_t$, $i_t$ and $o_t$ for the gates’ outputs at time $t$, the layer first updates the memory vector

$$c_t = f_t \odot c_{t-1} + i_t \odot g(x_t, y_{t-1})$$

and then computes

$$y_t = o_t \odot \tanh(c_t).$$

These equations can be explained as follows. The first equation is an element-wise summation of two terms. The first term is the previous memory vector moderated by the forget gate; that is, the layer uses the current input and the previous output to determine how much to shrink each coordinate of the previous memory vector. The second term is the candidate for the new state, i.e. $g(x_t, y_{t-1})$, moderated by the input gate. Note that all the gates operate on the same input tuple $(x_t, y_{t-1})$.
The input and forget gates control the long- and short-term dependencies (i.e. the recurrent connection), and allow the LSTM layer to adaptively balance the new information that comes from the state-transition function against the history that is kept in the memory vector, hence the names of the gates: input and forget.
Another difference is that an LSTM layer controls how much of its inner memory to expose, by using the output gate; this is what the second equation formulates.
The addition of gates is what mitigates the exploding and vanishing gradients problem. It makes the LSTM layer able to learn both long- and short-term dependencies, at the cost of increasing the number of parameters that need to be trained, which makes the network harder to optimize.
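To make the data flow concrete, here is a minimal NumPy sketch of a single LSTM forward step that follows the equations above. The parameter names (`Wf`, `Uf`, and so on) are illustrative, and a practical implementation would batch the inputs and fuse the matrix multiplications:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, y_prev, c_prev, params):
    """One LSTM forward step: returns the new output y_t and memory vector c_t."""
    Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo, Wg, Ug, bg = params

    f = sigmoid(Wf @ x + Uf @ y_prev + bf)   # forget gate
    i = sigmoid(Wi @ x + Ui @ y_prev + bi)   # input gate
    o = sigmoid(Wo @ x + Uo @ y_prev + bo)   # output gate
    g = np.tanh(Wg @ x + Ug @ y_prev + bg)   # state-transition (candidate memory)

    c = f * c_prev + i * g                   # update the memory vector
    y = o * np.tanh(c)                       # expose part of the memory as output
    return y, c

# Toy usage: scan a random sequence with input size 4 and state size 3.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
shapes = [(n_hid, n_in), (n_hid, n_hid), (n_hid,)] * 4
params = [rng.normal(scale=0.1, size=s) for s in shapes]
y, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    y, c = lstm_step(x, y, c, params)
```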

4 GRU

Another gated cell, proposed in 2014, is the Gated Recurrent Unit (GRU). It has advantages similar to those of the LSTM cell, but fewer parameters to train, since the memory vector and one of the gates were removed.
The GRU has two gate functions $z, r \in \mathcal{F}_2$, named the update and reset gate respectively (both with $\sigma$ as their activation), and a state-transition function $g \in \mathcal{F}_2$ (with $\tanh$ as its activation). The input to a GRU layer is only the tuple $(x_t, s_{t-1})$, and its output is computed as follows: first the layer computes its gates for the current time step, denoted by $z_t$ for the update gate and by $r_t$ for the reset gate. Then the output of the layer is

$$s_t = (1 - z_t) \odot s_{t-1} + z_t \odot g(x_t, r_t \odot s_{t-1}),$$

with all the arithmetic operations done element-wise.
The term $g(x_t, r_t \odot s_{t-1})$ is a candidate for the next state. Note that the state-transition function receives the previous state moderated by the reset gate, allowing the layer to forget past states (hence the name of the gate: reset). The output of the layer is then a linear interpolation between the previous state and the candidate state, controlled by the update gate.
As opposed to the LSTM cell, the GRU doesn’t have an output gate to control how much of its inner state to expose; the entire state is therefore exposed at each time step.
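Here is the matching NumPy sketch of a single GRU step. Again the parameter names are illustrative, and note that some references write the interpolation with $z_t$ and $1 - z_t$ swapped:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, s_prev, params):
    """One GRU forward step: returns the new state, which is also the output."""
    Wz, Uz, bz, Wr, Ur, br, Wg, Ug, bg = params

    z = sigmoid(Wz @ x + Uz @ s_prev + bz)                 # update gate
    r = sigmoid(Wr @ x + Ur @ s_prev + br)                 # reset gate
    candidate = np.tanh(Wg @ x + Ug @ (r * s_prev) + bg)   # candidate state

    # Linear interpolation between the previous state and the candidate,
    # controlled by the update gate.
    return (1.0 - z) * s_prev + z * candidate

# Toy usage with input size 4 and state size 3.
rng = np.random.default_rng(0)
shapes = [(3, 4), (3, 3), (3,)] * 3
params = [rng.normal(scale=0.1, size=s) for s in shapes]
s = np.zeros(3)
for x in rng.normal(size=(5, 4)):
    s = gru_step(x, s, params)
```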
Both LSTM and GRU cells are very common in recurrent network architectures and achieve great results on many tasks. LSTMs and GRUs can learn dependencies of various lengths, which makes the network very expressive. However, an overly expressive network can sometimes overfit; to prevent that, it is common to use some type of regularization, such as dropout.
In the next post I will discuss the dropout variants that are specific to RNNs.