1 Last Post of the Series

This post is the last in the series that discussed Recurrent Neural Networks (RNNs). The first post introduced the basic recurrent cells and their limitations, the second post presented the formulation of gated cells such as LSTMs and GRUs, and the third post overviewed dropout variants for LSTMs. This post presents some of the recent advancements in the field of RNNs (mostly related to LSTM units).
We start by reviewing some of the notation used throughout the series.
We defined $\mathcal{F}_\phi$ to be the family of all affine transformations from $\mathbb{R}^n$ to $\mathbb{R}^m$, for any dimensions $n$ and $m$, followed by an element-wise activation function $\phi$, that is,
$$\mathcal{F}_\phi = \left\{ f \;\middle|\; f(z) = \phi(Wz + b),\ W \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m \right\}$$
and noted that with the sigmoid $\sigma$ as the activation function, all the functions in $\mathcal{F}_\sigma$ are gate functions, that is, their output belongs to $(0,1)^m$.
Usually we do not specify $n$ and $m$; we assume that whenever there is a matrix multiplication or an element-wise operation the dimensions match, as they are not important for the discussion.
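As a concrete illustration, here is a minimal NumPy sketch of one member of $\mathcal{F}_\sigma$; the dimensions and the random parameters are arbitrary choices for the example, not anything taken from the series or the papers discussed below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One member of F_sigma: an affine transformation followed by an
# element-wise sigmoid. Dimensions n=4, m=3 are arbitrary toy values.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)

def gate(z):
    return sigmoid(W @ z + b)

print(gate(rng.normal(size=4)))  # every entry lies in (0, 1)
```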

2 Variations of the Gate Functions for LSTMs

2.1 Recap: LSTM Formulation - The Use of Gates

An LSTM layer is a recurrent layer that keeps a memory vector $c_t$ for every time step $t$. Upon receiving the current time-step input $x_t$ and the output of the previous time step $h_{t-1}$, the layer computes the gate activations by applying the corresponding gate functions $G_i, G_f, G_o \in \mathcal{F}_\sigma$ to the tuple $(x_t, h_{t-1})$ (treated as one concatenated vector):
$$i_t = G_i(x_t, h_{t-1}), \qquad f_t = G_f(x_t, h_{t-1}), \qquad o_t = G_o(x_t, h_{t-1})$$
Note that these activations are vectors in $(0,1)^{n_h}$, where $n_h$ is the dimension of the layer's output.
A new memory vector is then computed by
$$c_t = f_t \odot c_{t-1} + i_t \odot g(x_t, h_{t-1})$$
where $\odot$ is element-wise multiplication and $g \in \mathcal{F}_{\tanh}$ is the state-transition function. The layer then outputs
$$h_t = o_t \odot \tanh(c_t)$$
The gate functions are of great importance for the LSTM layer: they are what allows the layer to learn short- and long-term dependencies, and they also help avoid the exploding and vanishing gradients problem.
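To make the recap concrete, here is a minimal NumPy sketch of a single LSTM step under the notation above; the weight names, the toy dimensions and the concatenation of $(x_t, h_{t-1})$ into one vector are my own illustrative choices, not code from any of the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: gate functions from F_sigma applied to (x_t, h_prev)."""
    z = np.concatenate([x_t, h_prev])                    # the tuple as one vector
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])     # input gate
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])     # forget gate
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])     # output gate
    g_t = np.tanh(params["W_g"] @ z + params["b_g"])     # state-transition function
    c_t = f_t * c_prev + i_t * g_t                       # new memory vector
    h_t = o_t * np.tanh(c_t)                             # layer output
    return h_t, c_t

# Toy dimensions, for illustration only.
n_x, n_h = 5, 3
rng = np.random.default_rng(1)
params = {f"W_{g}": rng.normal(size=(n_h, n_x + n_h)) for g in "ifog"}
params.update({f"b_{g}": np.zeros(n_h) for g in "ifog"})
h_t, c_t = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), params)
```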

2.2 Simple Variations

The more expressive power a gate function has, the better it can learn short- and long-term dependencies. That is, if a gate function is very rich, it can account for subtle changes in the input as well as for subtle changes in the “past”, i.e. in $h_{t-1}$. One way to enrich a gate function is to make it depend on the previous memory vector $c_{t-1}$ in addition to the regular tuple $(x_t, h_{t-1})$. That redefines our gate functions family to be
$$\left\{ g \;\middle|\; g(x_t, h_{t-1}, c_{t-1}) = \sigma(W x_t + U h_{t-1} + P c_{t-1} + b) \right\}$$
for any $W$, $U$, $b$ and diagonal $P$. Such a construction can be found here; note that we can think of $P c_{t-1} + b$ as a bias vector that is memory-dependent, thus there are formulations in which the vector $b$ is omitted.
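A minimal sketch of such a memory-dependent gate, assuming the diagonal matrix $P$ is stored as a vector (called p_diag below) so that $P c_{t-1}$ becomes an element-wise product; the names and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_dependent_gate(x_t, h_prev, c_prev, W, U, p_diag, b):
    """Gate that also peeks at c_prev; since P is diagonal, P @ c_prev
    reduces to the element-wise product p_diag * c_prev."""
    return sigmoid(W @ x_t + U @ h_prev + p_diag * c_prev + b)

# Toy usage with arbitrary dimensions.
n_x, n_h = 5, 3
rng = np.random.default_rng(2)
a = memory_dependent_gate(rng.normal(size=n_x), rng.normal(size=n_h),
                          rng.normal(size=n_h), rng.normal(size=(n_h, n_x)),
                          rng.normal(size=(n_h, n_h)), rng.normal(size=n_h),
                          np.zeros(n_h))
```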

2.3 Multiplicative LSTMs

In this paper from October 2016, the authors took that idea one step further: they wanted a unique gate function for each possible input. Recall that any gate function applied to the tuple $(x_t, h_{t-1})$ is of the form
$$g(x_t, h_{t-1}) = \sigma(W x_t + U h_{t-1} + b)$$
where $W$ and $U$ are the blocks of the affine transformation that multiply $x_t$ and $h_{t-1}$ respectively. For simplicity, consider the case where $b = 0$.
We end up with a sum of two components: one that depends on the current input ($W x_t$) and one that depends on the past ($U h_{t-1}$, which encodes information about previous time steps), and the component with the larger magnitude will dominate the transition. If $W x_t$ is larger, the layer will not be sensitive enough to the past, and if $U h_{t-1}$ is larger, the layer will not be sensitive to subtle changes in the input.
The authors noted that since in most cases the input $x_t$ is a one-hot vector, multiplying it by $W$ simply selects a specific column of $W$. So we may think of any gate function as a fixed base affine transformation of $h_{t-1}$ combined with an input-dependent bias vector $W x_t$; that formulation emphasizes the additive effect of the current input vector on the transition of $h_{t-1}$.
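The column-selection observation is easy to check numerically; the vocabulary size and the active index below are arbitrary toy values used only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, n_h = 7, 3
W = rng.normal(size=(n_h, vocab_size))

k = 4                              # index of the active symbol
x_t = np.zeros(vocab_size)
x_t[k] = 1.0                       # one-hot input vector

# Multiplying a one-hot vector by W selects the k-th column of W,
# i.e. an input-dependent bias added to the h-dependent part of the gate.
assert np.allclose(W @ x_t, W[:, k])
```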
The goal in multiplicative LSTM (mLSTM) is to have a unique affine transformation (in our case, a unique matrix $U^{(x_t)}$) for each possible input. Obviously, if the number of possible inputs is very large, the number of parameters will explode and it won't be feasible to train the network. To overcome that, the authors of the paper suggested learning shared intermediate matrices that are used to construct a unique $U^{(x_t)}$ for each input; because the factorization is shared, there are fewer parameters to learn.
The factorization is defined as follows: the unique matrix is constructed by
$$U^{(x_t)} = W_{hm} \, \mathrm{diag}(W_{mx} x_t) \, W_{mh}$$
where $W_{hm}$, $W_{mx}$ and $W_{mh}$ are intermediate matrices which are shared across all the possible inputs, and $\mathrm{diag}(\cdot)$ is an operation that maps any vector to a square diagonal matrix with the elements of that vector on its diagonal.
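A quick numerical check of the factorization (the names $W_{hm}$, $W_{mx}$, $W_{mh}$ and the toy dimensions are assumptions made for the illustration): building $\mathrm{diag}(W_{mx} x_t)$ explicitly is equivalent to routing $h_{t-1}$ through an element-wise product, which is the cheaper form used in practice.

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, n_h, n_m = 6, 4, 5
W_hm = rng.normal(size=(n_h, n_m))
W_mx = rng.normal(size=(n_m, n_x))
W_mh = rng.normal(size=(n_m, n_h))

x_t = rng.normal(size=n_x)
h_prev = rng.normal(size=n_h)

# The input-dependent hidden-to-hidden matrix, built explicitly ...
U_x = W_hm @ np.diag(W_mx @ x_t) @ W_mh
# ... is equivalent to an element-wise product between the two projections.
assert np.allclose(U_x @ h_prev, W_hm @ ((W_mx @ x_t) * (W_mh @ h_prev)))
```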
Note that the target dimension $n_m$ of the intermediate projections $W_{mx} x_t$ and $W_{mh} h_{t-1}$ may be arbitrarily large (in the paper they chose it to be the same as the dimension of $h_t$).
The difference between LSTM and mLSTM is only in the definition of the gate functions family; all the equations for updating the memory cell and the output stay the same. In mLSTM the family $\mathcal{F}_\sigma$ is replaced with
$$\left\{ g \;\middle|\; g(x_t, h_{t-1}) = \sigma\!\left(W x_t + W_{hm}\,\mathrm{diag}(W_{mx} x_t)\, W_{mh}\, h_{t-1} + b\right) \right\}$$
Note that we can reduce the number of parameters even further by forcing all the gate functions to use the same $W_{mx}$ and $W_{mh}$ matrices. That is, each gate will be parametrized only by $W$, $W_{hm}$ and $b$.
Formally, define the transformation $m : \mathbb{R}^{n_x} \times \mathbb{R}^{n_h} \to \mathbb{R}^{n_m}$ (where $n_x$ and $n_h$ are the dimensions of $x_t$ and $h_t$) such that
$$m(x_t, h_{t-1}) = (W_{mx} x_t) \odot (W_{mh} h_{t-1})$$
and then, for that fixed learned transformation $m$, we can define the gate family as
$$\left\{ g \;\middle|\; g(x_t, h_{t-1}) = \sigma\!\left(W x_t + W_{hm}\, m(x_t, h_{t-1}) + b\right) \right\}$$
In other words, mLSTM is an LSTM that applies its gates to the tuple $(x_t, m_t)$ rather than $(x_t, h_{t-1})$, where $m_t = m(x_t, h_{t-1})$ is another learned transformation. If you look again at the exact formula of $m_t$, you will see the multiplicative effect of the input vector on the transformation of $h_{t-1}$. That way, the authors say, the layer can yield a much richer family of input-dependent transformations of $h_{t-1}$, which can be sensitive to the past as well as to subtle changes in the current input.
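Putting the pieces together, here is a minimal NumPy sketch of one mLSTM step under the formulation above; the parameter names, the gate set and the toy dimensions are illustrative assumptions rather than the paper's exact code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x_t, h_prev, c_prev, p):
    """One mLSTM step: the gates see (x_t, m_t) instead of (x_t, h_prev)."""
    # Shared intermediate transformation m(x_t, h_prev).
    m_t = (p["W_mx"] @ x_t) * (p["W_mh"] @ h_prev)
    # Each gate keeps its own input weights, intermediate-state weights and bias.
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_im"] @ m_t + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fm"] @ m_t + p["b_f"])
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_om"] @ m_t + p["b_o"])
    g_t = np.tanh(p["W_gx"] @ x_t + p["W_gm"] @ m_t + p["b_g"])
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Toy dimensions, illustration only.
n_x, n_h, n_m = 6, 4, 5
rng = np.random.default_rng(5)
p = {"W_mx": rng.normal(size=(n_m, n_x)), "W_mh": rng.normal(size=(n_m, n_h))}
for g in "ifog":
    p[f"W_{g}x"] = rng.normal(size=(n_h, n_x))
    p[f"W_{g}m"] = rng.normal(size=(n_h, n_m))
    p[f"b_{g}"] = np.zeros(n_h)
h_t, c_t = mlstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), p)
```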

3 Recurrent Batch Normalization

Batch normalization is an operator applied to a layer's pre-activations, before they go through the non-linearity, in order to “normalize” the values before activation. The operator has two parameter vectors, $\gamma$ and $\beta$, that are learned during training, and two statistics that it accumulates internally. Formally, given a vector of pre-activation values $z$, batch normalization is
$$\mathrm{BN}(z; \gamma, \beta) = \beta + \gamma \odot \frac{z - \widehat{\mathbb{E}}[z]}{\sqrt{\widehat{\mathrm{Var}}[z] + \epsilon}}$$
where $\widehat{\mathbb{E}}[z]$ and $\widehat{\mathrm{Var}}[z]$ are the empirical mean and variance of the current batch respectively, $\epsilon$ is a small constant for numerical stability, and the division is computed element-wise. Note that at inference time, the population statistics are estimated by averaging the empirical statistics across all the batches.
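A minimal NumPy sketch of the training-time operator as written above; the $\epsilon$ value and the toy batch are arbitrary, and this is not any framework's exact implementation.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Z has shape (batch, features); statistics are taken over the batch axis."""
    mean = Z.mean(axis=0)
    var = Z.var(axis=0)
    return beta + gamma * (Z - mean) / np.sqrt(var + eps)

# Toy usage: pre-activations of 8 examples with 3 features each.
rng = np.random.default_rng(6)
Z = rng.normal(loc=2.0, scale=3.0, size=(8, 3))
out = batch_norm(Z, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))   # approximately 0 and 1
```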
That paper from February 2017 applied batch normalization also to the recurrent connections in LSTM layers, and showed empirically that it helps the network converge faster. The usage is simple: each gate function and the state-transition function applies its activation to
$$\mathrm{BN}(W x_t; \gamma_x, \beta_x) + \mathrm{BN}(U h_{t-1}; \gamma_h, \beta_h) + b$$
instead of to $W x_t + U h_{t-1} + b$, where the parameters $\gamma$ and $\beta$ are shared across the different gates. The output of the layer is then
$$h_t = o_t \odot \tanh\!\left(\mathrm{BN}(c_t; \gamma_c, \beta_c)\right)$$
The authors suggested setting $\beta_x = \beta_h = 0$, because there is already a bias parameter $b$, to prevent redundancy. In addition, they reported that sharing the internal statistics across time degrades performance severely; therefore, one should use “fresh” $\mathrm{BN}$ operators for each time step, with their own internal statistics (but share the $\gamma$ and $\beta$ parameters across time).
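As an illustration of how this plugs into the LSTM step, here is a hedged NumPy sketch of one batch-normalized step for a whole batch. The fused four-gate weight layout, the parameter names and the toy dimensions are illustrative assumptions; $\gamma$ is initialized to 0.1, a value the paper recommends as initialization. The per-time-step population statistics needed at inference are omitted, since at training time the statistics come from the current batch anyway.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_norm(Z, gamma, beta, eps=1e-5):
    # Training-time BN over the batch axis; the paper keeps separate
    # population statistics per time step for inference.
    return beta + gamma * (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)

def bn_lstm_step(X_t, H_prev, C_prev, p):
    """One batch-normalized LSTM step (rows of X_t, H_prev, C_prev are examples).
    gamma/beta are shared across the four gates; beta of the input and recurrent
    terms is fixed to zero because the cell already has a bias b."""
    zeros = np.zeros_like(p["gamma_x"])
    pre = (batch_norm(X_t @ p["W_x"].T, p["gamma_x"], zeros)
           + batch_norm(H_prev @ p["W_h"].T, p["gamma_h"], zeros)
           + p["b"])
    i, f, o, g = np.split(pre, 4, axis=1)
    C_t = sigmoid(f) * C_prev + sigmoid(i) * np.tanh(g)
    H_t = sigmoid(o) * np.tanh(batch_norm(C_t, p["gamma_c"], p["beta_c"]))
    return H_t, C_t

# Toy dimensions, illustration only.
B, n_x, n_h = 8, 6, 4
rng = np.random.default_rng(7)
p = {"W_x": rng.normal(size=(4 * n_h, n_x)), "W_h": rng.normal(size=(4 * n_h, n_h)),
     "b": np.zeros(4 * n_h),
     "gamma_x": np.full(4 * n_h, 0.1), "gamma_h": np.full(4 * n_h, 0.1),
     "gamma_c": np.full(n_h, 0.1), "beta_c": np.zeros(n_h)}
H_t, C_t = bn_lstm_step(rng.normal(size=(B, n_x)), np.zeros((B, n_h)),
                        np.zeros((B, n_h)), p)
```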

4 The Ever-growing Field

RNNs are a rapidly growing field and this series only covers a small part of it. There are many more advanced models, some of them published only a few months ago. If you are interested in this field, you may find very interesting stuff here.