Recurrent Neural Network - LSTM and GRU
1 Recap
The first post in the series discussed the basic structure of recurrent cells and their limitations. We defined two families of functions: the first is $\mathcal{F}_{\phi}$, which contains all the affine transformations $x \mapsto Wx + b$ for any $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$, followed by an element-wise activation function $\phi$. And another family, $\mathcal{F}^2_{\phi}$, which is an extension of $\mathcal{F}_{\phi}$ in the sense that the input space is $\mathbb{R}^{n} \times \mathbb{R}^{m}$ rather than $\mathbb{R}^{n}$. Formally, the definition of $\mathcal{F}^2_{\phi}$ is

$$\mathcal{F}^2_{\phi} = \left\{ (x, h) \mapsto \phi\left(Wx + Uh + b\right) \;\middle|\; W \in \mathbb{R}^{m \times n},\; U \in \mathbb{R}^{m \times m},\; b \in \mathbb{R}^{m} \right\}$$
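To make the definition concrete, here is a minimal NumPy sketch of a member of $\mathcal{F}^2_{\phi}$; the helper name `make_f2` and the random dimensions are only for illustration, not anything from the original post.

```python
import numpy as np

def make_f2(W, U, b, phi=np.tanh):
    """A member of the family F^2_phi: (x, h) -> phi(W x + U h + b)."""
    def f(x, h):
        return phi(W @ x + U @ h + b)
    return f

# Example: input dimension n = 3, state dimension m = 2.
rng = np.random.default_rng(0)
n, m = 3, 2
f = make_f2(rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
print(f(rng.normal(size=n), np.zeros(m)))  # a vector in R^2
```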
In this post, which is the second in the series, we will focus on two activation functions: the sigmoid $\sigma(z) = 1/(1 + e^{-z})$ and the hyperbolic tangent $\tanh$. The next section describes what gate functions are, and then we move on to review two common improvements to the basic recurrent cell: the LSTM and GRU cells.
2 Gate Functions
Recall that a recurrent layer has two steps. The first is updating its inner state based on both the current input and the previous state vector; the second is producing an output by applying some other function to the new state. So the input to the layer at each time step $t$ is actually the tuple $(x_t, h_{t-1})$.
A gate is a function that takes such a tuple and produces a vector of values between zero and one. Note that any function in $\mathcal{F}_{\sigma}$ and in $\mathcal{F}^2_{\sigma}$ is actually a gate, because of the properties of the sigmoid function $\sigma$: every coordinate of its output lies in $(0, 1)$. We add gates to a recurrent neuron in order to control how much information flows over the "recurrent connection". That is, suppose that $\tilde{h}_t$ is the layer's current state that should be used at the next time step; a gated layer will use the vector $g_t \odot \tilde{h}_t$ instead of using $\tilde{h}_t$ (where the symbol $\odot$ denotes element-wise multiplication and $g_t$ is the gate's output). Note that every coordinate of $g_t \odot \tilde{h}_t$ is a "moderated" version of the corresponding coordinate of $\tilde{h}_t$, because each entry of the gate's output is in $(0, 1)$. We will see a concrete example in the next section.
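As a small illustration of the gating mechanism, here is a minimal NumPy sketch; the names `sigmoid` and `gate` and the random dimensions are my own, chosen only to show the element-wise moderation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(W, U, b, x, h_prev):
    """A gate: an affine map of (x, h_prev) squashed coordinate-wise into (0, 1)."""
    return sigmoid(W @ x + U @ h_prev + b)

rng = np.random.default_rng(1)
n, m = 3, 2
x, h_prev = rng.normal(size=n), rng.normal(size=m)
g_t = gate(rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m), x, h_prev)
h_tilde = np.tanh(rng.normal(size=m))   # some candidate state
moderated = g_t * h_tilde               # element-wise: each coordinate is shrunk toward 0
print(g_t, moderated)
```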
3 LSTM Cell
The Long Short Term Memory cell (LSTM cell) is an improved version of the recurrent neuron that was proposed in 1997. It went through minor modifications until the version presented below (which is from 2013). The LSTM cell addresses both limitations of the basic recurrent neuron: it mitigates the exploding and vanishing gradients problem, and it can remember as well as forget.
The main addition to the recurrent layer structure is the use of gates and a memory vector for each time step, denoted by $c_t$. The LSTM layer gets as inputs the tuple $(x_t, h_{t-1})$ and the previous memory vector $c_{t-1}$, then outputs $h_t$ and an updated memory vector $c_t$.

Here is how an LSTM layer computes its outputs: it has three gates $f, i, o \in \mathcal{F}^2_{\sigma}$, named the forget, input and output gate respectively, and a state-transition function $g \in \mathcal{F}^2_{\tanh}$. It first updates the memory vector

$$c_t = f(x_t, h_{t-1}) \odot c_{t-1} + i(x_t, h_{t-1}) \odot g(x_t, h_{t-1})$$

and then computes

$$h_t = o(x_t, h_{t-1}) \odot \tanh(c_t)$$
Those equations can be explained as follows. The first equation is an element-wise summation of two terms. The first term is the previous memory vector $c_{t-1}$, moderated by the forget gate; that is, the layer uses the current input $x_t$ and the previous output $h_{t-1}$ to determine how much to shrink each coordinate of the previous memory vector. The second term is the candidate for the new state, i.e. $g(x_t, h_{t-1})$, moderated by the input gate. Note that all the gates operate on the same input tuple $(x_t, h_{t-1})$.
The input and forget gates control the long- and short-term dependencies (i.e. the recurrent connection), and allow the LSTM layer to adaptively balance the new information that comes from the state-transition function against the historical information that comes from the memory vector, hence the names of the gates: input and forget.
Another difference is that the LSTM layer controls how much of its inner memory to expose by using the output gate. That is formulated in the second equation.
The addition of gates is what prevents the exploding and vanishing gradients problem. It makes the LSTM layer able to learn both long- and short-term dependencies, at the cost of increasing the number of parameters that need to be trained, which makes the network harder to optimize.
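For readers who prefer to see the equations as code, below is a minimal NumPy sketch of a single LSTM step under the notation above; the parameter layout, the name `lstm_step` and the random initialization are assumptions made for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step. `params` holds (W, U, b) triples for the forget (f),
    input (i) and output (o) gates and the state-transition function (g);
    the key names are illustrative."""
    def affine(name):
        W, U, b = params[name]
        return W @ x + U @ h_prev + b

    f_t = sigmoid(affine("f"))          # forget gate
    i_t = sigmoid(affine("i"))          # input gate
    o_t = sigmoid(affine("o"))          # output gate
    g_t = np.tanh(affine("g"))          # candidate state
    c_t = f_t * c_prev + i_t * g_t      # updated memory vector
    h_t = o_t * np.tanh(c_t)            # exposed output
    return h_t, c_t

# Usage with random parameters (n = input size, m = state size):
rng = np.random.default_rng(2)
n, m = 4, 3
params = {k: (rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
          for k in ("f", "i", "o", "g")}
h, c = lstm_step(params, rng.normal(size=n), np.zeros(m), np.zeros(m))
```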
4 GRU
Another gated cell, proposed in 2014, is the Gated Recurrent Unit (GRU). It has advantages similar to those of the LSTM cell, but fewer parameters to train, because the memory vector and one of the gates were removed.
A GRU has two gate functions $z, r \in \mathcal{F}^2_{\sigma}$, named the update and reset gate respectively, and a state-transition function $g \in \mathcal{F}^2_{\tanh}$. The input to a GRU layer is only the tuple $(x_t, h_{t-1})$ and the output is $h_t$, computed as follows: first the layer computes its gates for the current time step, denoted by $z_t = z(x_t, h_{t-1})$ for the update gate and by $r_t = r(x_t, h_{t-1})$ for the reset gate. Then the output of the layer is

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot g(x_t, r_t \odot h_{t-1})$$

with all the arithmetic operations being done element-wise.

The term $g(x_t, r_t \odot h_{t-1})$ is a candidate for the next state. Note that the state-transition function accepts the previous state moderated by the reset gate, allowing it to forget past states (hence the name of the gate: reset). The output of the layer is then a linear interpolation between the previous state and the candidate state, controlled by the update gate.
As opposed to the LSTM cell, the GRU doesn't have an output gate to control how much of its inner state to expose; therefore the entire state is exposed at each time step.
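Analogously, here is a minimal NumPy sketch of a single GRU step under the notation above; again the parameter layout and names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(params, x, h_prev):
    """One GRU step. `params` holds (W, U, b) triples for the update (z) and
    reset (r) gates and the state-transition function (g); names are illustrative."""
    def affine(name, h):
        W, U, b = params[name]
        return W @ x + U @ h + b

    z_t = sigmoid(affine("z", h_prev))            # update gate
    r_t = sigmoid(affine("r", h_prev))            # reset gate
    h_cand = np.tanh(affine("g", r_t * h_prev))   # candidate state from reset-moderated history
    return z_t * h_prev + (1.0 - z_t) * h_cand    # interpolation controlled by z_t

# Usage with random parameters (n = input size, m = state size):
rng = np.random.default_rng(3)
n, m = 4, 3
params = {k: (rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
          for k in ("z", "r", "g")}
h = gru_step(params, rng.normal(size=n), np.zeros(m))
```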
Both LSTM and GRU cells are very common in recurrent network architectures and achieve great results on many tasks. LSTMs and GRUs can learn dependencies of various lengths, which makes the network very expressive. However, an overly expressive network can sometimes overfit; to prevent that, it is common to use some type of regularization, such as dropout.
In the next post I will discuss the dropout variants that are specific to RNNs.