1 Last Post of the Series

This post is the last in the series that discussed Recurrent Neural Networks (RNNs). The first post introduced the basic recurrent cells and their limitations, the second post presented the formulation of gated cells such as LSTMs and GRUs, and the third post overviewed dropout variants for LSTMs. This post presents some of the recent advancements in the field of RNNs (mostly related to LSTM units).
We start by reviewing some of the notation used throughout the series.
We defined $\mathcal{F}_\phi$ to be the family of all affine transformations from $\mathbb{R}^n$ to $\mathbb{R}^m$, for any dimensions $n$ and $m$, followed by an element-wise activation function $\phi$, that is,
$$\mathcal{F}_\phi = \left\{ f \;\middle|\; f(z) = \phi(Wz + b),\ W \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m \right\}$$
and noted that with the sigmoid $\sigma$ as the activation function, all the functions in $\mathcal{F}_\sigma$ are gate functions, that is, their output belongs to $(0,1)^m$.
Usually we do not specify $n$ and $m$; we assume that whenever there is a matrix multiplication or an element-wise operation the dimensions match, as they are not important for the discussion.
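As a concrete illustration, here is a minimal NumPy sketch of one member of $\mathcal{F}_\sigma$; the dimensions and the random parameters are arbitrary choices for the example, not anything taken from the series or the papers discussed below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One member of F_sigma: an affine transformation followed by an
# element-wise sigmoid. Dimensions n=4, m=3 are arbitrary toy values.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)

def gate(z):
    return sigmoid(W @ z + b)

print(gate(rng.normal(size=4)))  # every entry lies in (0, 1)
```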

2 Variations of the Gate Functions for LSTMs

2.1 Recap: LSTM Formulation - The Use of Gates

An LSTM layer is a recurrent layer that keeps a memory vector $c_t$ for every time step $t$. Upon receiving the current time-step input $x_t$ and the output of the previous time step $h_{t-1}$, the layer computes the gate activations by applying the corresponding gate functions $G_i, G_f, G_o \in \mathcal{F}_\sigma$ to the tuple $(x_t, h_{t-1})$ (treated as one concatenated vector):
$$i_t = G_i(x_t, h_{t-1}), \qquad f_t = G_f(x_t, h_{t-1}), \qquad o_t = G_o(x_t, h_{t-1})$$
Note that these activations are vectors in $(0,1)^{n_h}$, where $n_h$ is the dimension of the layer's output.
A new memory vector is then computed by
$$c_t = f_t \odot c_{t-1} + i_t \odot g(x_t, h_{t-1})$$
where $\odot$ is element-wise multiplication and $g \in \mathcal{F}_{\tanh}$ is the state-transition function. The layer then outputs
$$h_t = o_t \odot \tanh(c_t)$$
The gate functions are of great importance for the LSTM layer: they are what allows the layer to learn short- and long-term dependencies, and they also help avoid the exploding and vanishing gradients problem.
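To make the recap concrete, here is a minimal NumPy sketch of a single LSTM step under the notation above; the weight names, the toy dimensions and the concatenation of $(x_t, h_{t-1})$ into one vector are my own illustrative choices, not code from any of the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: gate functions from F_sigma applied to (x_t, h_prev)."""
    z = np.concatenate([x_t, h_prev])                    # the tuple as one vector
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])     # input gate
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])     # forget gate
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])     # output gate
    g_t = np.tanh(params["W_g"] @ z + params["b_g"])     # state-transition function
    c_t = f_t * c_prev + i_t * g_t                       # new memory vector
    h_t = o_t * np.tanh(c_t)                             # layer output
    return h_t, c_t

# Toy dimensions, for illustration only.
n_x, n_h = 5, 3
rng = np.random.default_rng(1)
params = {f"W_{g}": rng.normal(size=(n_h, n_x + n_h)) for g in "ifog"}
params.update({f"b_{g}": np.zeros(n_h) for g in "ifog"})
h_t, c_t = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), params)
```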

2.2 Simple Variations

The more expressive power a gate function has, the better it can learn short- and long-term dependencies. That is, if a gate function is very rich, it can account for subtle changes in the input as well as for subtle changes in the “past”, i.e. in $h_{t-1}$. One way to enrich a gate function is to make it depend on the previous memory vector $c_{t-1}$ in addition to the regular tuple $(x_t, h_{t-1})$. That redefines our gate functions family to be
$$\left\{ g \;\middle|\; g(x_t, h_{t-1}, c_{t-1}) = \sigma(W x_t + U h_{t-1} + P c_{t-1} + b) \right\}$$
for any $W$, $U$, $b$ and diagonal $P$. Such a construction can be found here; note that we can think of $P c_{t-1} + b$ as a bias vector that is memory-dependent, thus there are formulations in which the vector $b$ is omitted.
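A minimal sketch of such a memory-dependent gate, assuming the diagonal matrix $P$ is stored as a vector (called p_diag below) so that $P c_{t-1}$ becomes an element-wise product; the names and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_dependent_gate(x_t, h_prev, c_prev, W, U, p_diag, b):
    """Gate that also peeks at c_prev; since P is diagonal, P @ c_prev
    reduces to the element-wise product p_diag * c_prev."""
    return sigmoid(W @ x_t + U @ h_prev + p_diag * c_prev + b)

# Toy usage with arbitrary dimensions.
n_x, n_h = 5, 3
rng = np.random.default_rng(2)
a = memory_dependent_gate(rng.normal(size=n_x), rng.normal(size=n_h),
                          rng.normal(size=n_h), rng.normal(size=(n_h, n_x)),
                          rng.normal(size=(n_h, n_h)), rng.normal(size=n_h),
                          np.zeros(n_h))
```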

2.3 Multiplicative LSTMs

In this paper from October 2016, the authors took that idea one step further: they wanted a unique gate function for each possible input. Recall that any gate function applied to the tuple $(x_t, h_{t-1})$ is of the form
$$g(x_t, h_{t-1}) = \sigma(W x_t + U h_{t-1} + b)$$
where $W$ and $U$ are the blocks of the affine transformation that multiply $x_t$ and $h_{t-1}$ respectively. For simplicity, consider the case where $b = 0$.
We end up with a sum of two components: one that depends on the current input ($W x_t$) and one that depends on the past ($U h_{t-1}$, which encodes information about previous time steps), and the component with the larger magnitude will dominate the transition. If $W x_t$ is larger, the layer will not be sensitive enough to the past, and if $U h_{t-1}$ is larger, the layer will not be sensitive to subtle changes in the input.
The authors noted that since in most cases the input $x_t$ is a one-hot vector, multiplying it by $W$ simply selects a specific column of $W$. So we may think of any gate function as a fixed base affine transformation of $h_{t-1}$ combined with an input-dependent bias vector $W x_t$; that formulation emphasizes the additive effect of the current input vector on the transition of $h_{t-1}$.
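The column-selection observation is easy to check numerically; the vocabulary size and the active index below are arbitrary toy values used only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, n_h = 7, 3
W = rng.normal(size=(n_h, vocab_size))

k = 4                              # index of the active symbol
x_t = np.zeros(vocab_size)
x_t[k] = 1.0                       # one-hot input vector

# Multiplying a one-hot vector by W selects the k-th column of W,
# i.e. an input-dependent bias added to the h-dependent part of the gate.
assert np.allclose(W @ x_t, W[:, k])
```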
The goal in multiplicative LSTM (mLSTM) is to have a unique affine transformation (in our case, a unique matrix $U^{(x_t)}$) for each possible input. Obviously, if the number of possible inputs is very large, the number of parameters will explode and it won't be feasible to train the network. To overcome that, the authors of the paper suggested learning shared intermediate matrices that are used to construct a unique $U^{(x_t)}$ for each input; because the factorization is shared, there are fewer parameters to learn.
The factorization is defined as follows: the unique matrix is constructed by
$$U^{(x_t)} = W_{hm} \, \mathrm{diag}(W_{mx} x_t) \, W_{mh}$$
where $W_{hm}$, $W_{mx}$ and $W_{mh}$ are intermediate matrices which are shared across all the possible inputs, and $\mathrm{diag}(\cdot)$ is an operation that maps any vector to a square diagonal matrix with the elements of that vector on its diagonal.
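A quick numerical check of the factorization (the names $W_{hm}$, $W_{mx}$, $W_{mh}$ and the toy dimensions are assumptions made for the illustration): building $\mathrm{diag}(W_{mx} x_t)$ explicitly is equivalent to routing $h_{t-1}$ through an element-wise product, which is the cheaper form used in practice.

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, n_h, n_m = 6, 4, 5
W_hm = rng.normal(size=(n_h, n_m))
W_mx = rng.normal(size=(n_m, n_x))
W_mh = rng.normal(size=(n_m, n_h))

x_t = rng.normal(size=n_x)
h_prev = rng.normal(size=n_h)

# The input-dependent hidden-to-hidden matrix, built explicitly ...
U_x = W_hm @ np.diag(W_mx @ x_t) @ W_mh
# ... is equivalent to an element-wise product between the two projections.
assert np.allclose(U_x @ h_prev, W_hm @ ((W_mx @ x_t) * (W_mh @ h_prev)))
```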
Note that the target dimension $n_m$ of the intermediate projections $W_{mx} x_t$ and $W_{mh} h_{t-1}$ may be arbitrarily large (in the paper they chose it to be the same as the dimension of $h_t$).
The difference between LSTM and mLSTM is only in the definition of the gate functions family; all the equations for updating the memory cell and the output stay the same. In mLSTM the family $\mathcal{F}_\sigma$ is replaced with
$$\left\{ g \;\middle|\; g(x_t, h_{t-1}) = \sigma\!\left(W x_t + W_{hm}\,\mathrm{diag}(W_{mx} x_t)\, W_{mh}\, h_{t-1} + b\right) \right\}$$
Note that we can reduce the number of parameters even further by forcing all the gate functions to use the same $W_{mx}$ and $W_{mh}$ matrices. That is, each gate will be parametrized only by $W$, $W_{hm}$ and $b$.
Formally, define the transformation $m : \mathbb{R}^{n_x} \times \mathbb{R}^{n_h} \to \mathbb{R}^{n_m}$ (where $n_x$ and $n_h$ are the dimensions of $x_t$ and $h_t$) such that
$$m(x_t, h_{t-1}) = (W_{mx} x_t) \odot (W_{mh} h_{t-1})$$
and then, for that fixed learned transformation $m$, we can define the gate family as
$$\left\{ g \;\middle|\; g(x_t, h_{t-1}) = \sigma\!\left(W x_t + W_{hm}\, m(x_t, h_{t-1}) + b\right) \right\}$$
In other words, mLSTM is an LSTM that applies its gates to the tuple $(x_t, m_t)$ rather than $(x_t, h_{t-1})$, where $m_t = m(x_t, h_{t-1})$ is another learned transformation. If you look again at the exact formula of $m_t$, you will see the multiplicative effect of the input vector on the transformation of $h_{t-1}$. That way, the authors say, the layer can yield a much richer family of input-dependent transformations of $h_{t-1}$, which can be sensitive to the past as well as to subtle changes in the current input.
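Putting the pieces together, here is a minimal NumPy sketch of one mLSTM step under the formulation above; the parameter names, the gate set and the toy dimensions are illustrative assumptions rather than the paper's exact code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x_t, h_prev, c_prev, p):
    """One mLSTM step: the gates see (x_t, m_t) instead of (x_t, h_prev)."""
    # Shared intermediate transformation m(x_t, h_prev).
    m_t = (p["W_mx"] @ x_t) * (p["W_mh"] @ h_prev)
    # Each gate keeps its own input weights, intermediate-state weights and bias.
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_im"] @ m_t + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fm"] @ m_t + p["b_f"])
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_om"] @ m_t + p["b_o"])
    g_t = np.tanh(p["W_gx"] @ x_t + p["W_gm"] @ m_t + p["b_g"])
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Toy dimensions, illustration only.
n_x, n_h, n_m = 6, 4, 5
rng = np.random.default_rng(5)
p = {"W_mx": rng.normal(size=(n_m, n_x)), "W_mh": rng.normal(size=(n_m, n_h))}
for g in "ifog":
    p[f"W_{g}x"] = rng.normal(size=(n_h, n_x))
    p[f"W_{g}m"] = rng.normal(size=(n_h, n_m))
    p[f"b_{g}"] = np.zeros(n_h)
h_t, c_t = mlstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), p)
```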

3 Recurrent Batch Normalization

Batch normalization is an operator applied to a layer's pre-activations, before they go through the non-linearity, in order to “normalize” the values before activation. The operator has two parameter vectors, $\gamma$ and $\beta$, that are learned during training, and two statistics that it accumulates internally. Formally, given a vector of pre-activation values $z$, batch normalization is
$$\mathrm{BN}(z; \gamma, \beta) = \beta + \gamma \odot \frac{z - \widehat{\mathbb{E}}[z]}{\sqrt{\widehat{\mathrm{Var}}[z] + \epsilon}}$$
where $\widehat{\mathbb{E}}[z]$ and $\widehat{\mathrm{Var}}[z]$ are the empirical mean and variance of the current batch respectively, $\epsilon$ is a small constant for numerical stability, and the division is computed element-wise. Note that at inference time, the population statistics are estimated by averaging the empirical statistics across all the batches.
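A minimal NumPy sketch of the training-time operator as written above; the $\epsilon$ value and the toy batch are arbitrary, and this is not any framework's exact implementation.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Z has shape (batch, features); statistics are taken over the batch axis."""
    mean = Z.mean(axis=0)
    var = Z.var(axis=0)
    return beta + gamma * (Z - mean) / np.sqrt(var + eps)

# Toy usage: pre-activations of 8 examples with 3 features each.
rng = np.random.default_rng(6)
Z = rng.normal(loc=2.0, scale=3.0, size=(8, 3))
out = batch_norm(Z, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.var(axis=0))   # approximately 0 and 1
```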
That paper from February 2017 applied batch normalization also to the recurrent connections in LSTM layers, and showed empirically that it helps the network converge faster. The usage is simple: each gate function and the state-transition function applies its activation to
$$\mathrm{BN}(W x_t; \gamma_x, \beta_x) + \mathrm{BN}(U h_{t-1}; \gamma_h, \beta_h) + b$$
instead of to $W x_t + U h_{t-1} + b$, where the parameters $\gamma$ and $\beta$ are shared across the different gates. The output of the layer is then
$$h_t = o_t \odot \tanh\!\left(\mathrm{BN}(c_t; \gamma_c, \beta_c)\right)$$
The authors suggested setting $\beta_x = \beta_h = 0$, because there is already a bias parameter $b$, to prevent redundancy. In addition, they reported that sharing the internal statistics across time degrades performance severely; therefore, one should use “fresh” $\mathrm{BN}$ operators for each time step, with their own internal statistics (but share the $\gamma$ and $\beta$ parameters across time).
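As an illustration of how this plugs into the LSTM step, here is a hedged NumPy sketch of one batch-normalized step for a whole batch. The fused four-gate weight layout, the parameter names and the toy dimensions are illustrative assumptions; $\gamma$ is initialized to 0.1, a value the paper recommends as initialization. The per-time-step population statistics needed at inference are omitted, since at training time the statistics come from the current batch anyway.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_norm(Z, gamma, beta, eps=1e-5):
    # Training-time BN over the batch axis; the paper keeps separate
    # population statistics per time step for inference.
    return beta + gamma * (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)

def bn_lstm_step(X_t, H_prev, C_prev, p):
    """One batch-normalized LSTM step (rows of X_t, H_prev, C_prev are examples).
    gamma/beta are shared across the four gates; beta of the input and recurrent
    terms is fixed to zero because the cell already has a bias b."""
    zeros = np.zeros_like(p["gamma_x"])
    pre = (batch_norm(X_t @ p["W_x"].T, p["gamma_x"], zeros)
           + batch_norm(H_prev @ p["W_h"].T, p["gamma_h"], zeros)
           + p["b"])
    i, f, o, g = np.split(pre, 4, axis=1)
    C_t = sigmoid(f) * C_prev + sigmoid(i) * np.tanh(g)
    H_t = sigmoid(o) * np.tanh(batch_norm(C_t, p["gamma_c"], p["beta_c"]))
    return H_t, C_t

# Toy dimensions, illustration only.
B, n_x, n_h = 8, 6, 4
rng = np.random.default_rng(7)
p = {"W_x": rng.normal(size=(4 * n_h, n_x)), "W_h": rng.normal(size=(4 * n_h, n_h)),
     "b": np.zeros(4 * n_h),
     "gamma_x": np.full(4 * n_h, 0.1), "gamma_h": np.full(4 * n_h, 0.1),
     "gamma_c": np.full(n_h, 0.1), "beta_c": np.zeros(n_h)}
H_t, C_t = bn_lstm_step(rng.normal(size=(B, n_x)), np.zeros((B, n_h)),
                        np.zeros((B, n_h)), p)
```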

4 The Ever-growing Field

RNNs are a rapidly growing field and this series only covers a small part of it. There are many more advanced models, some of them published only a few months ago. If you are interested in this field, you may find very interesting stuff here.