Recurrent Neural Network - Recent Advancements
1 Last Post of the Series
This post is the last in the series discussing Recurrent Neural Networks (RNNs). The first post introduced the basic recurrent cells and their limitations, the second post presented the formulation of gated cells such as LSTMs and GRUs, and the third post overviewed dropout variants for LSTMs. This post will present some of the recent advancements in the field of RNNs (mostly related to LSTM units).
We start by reviewing some of the notation that we used during the series.

We defined

$$\mathcal{F}_{\phi} = \left\{ f(x, h) = \phi\left(Wx + Uh + b\right) \right\}$$

to be the family of all affine transformations $(x, h) \mapsto Wx + Uh + b$ for any dimensions $n$ and $m$, followed by an element-wise activation function $\phi$, and noted that with the sigmoid $\sigma$ as the activation function, all the functions in $\mathcal{F}_{\sigma}$ are gate functions, that is, their output belongs to $(0, 1)^n$.

Usually we do not specify $n$ and $m$; we assume that whenever there is a matrix multiplication or an element-wise operation the dimensions always match, as they are not important for the discussion.
2 Variations of the Gate Functions for LSTMs
2.1 Recap: LSTM Formulation - The Use of Gates
An LSTM layer is a recurrent layer that keeps a memory vector $c_t$ for every time step $t$. Upon receiving the current time-step input $x_t$ and the output from the previous time step $h_{t-1}$, the layer computes the gate activations $i_t, f_t, o_t$ (input, forget and output gates) by applying the corresponding gate functions from $\mathcal{F}_{\sigma}$ to the tuple $(x_t, h_{t-1})$; note that these activations are vectors in $(0, 1)^n$.

A new memory vector is then computed by

$$c_t = f_t \odot c_{t-1} + i_t \odot g(x_t, h_{t-1})$$

where $\odot$ is element-wise multiplication and $g \in \mathcal{F}_{\tanh}$ is the state-transition function. The layer then outputs

$$h_t = o_t \odot \tanh(c_t).$$
The gate functions are of great importance for the LSTM layer: they are what allows the layer to learn short- and long-term dependencies, and they also help avoid the exploding and vanishing gradients problem.
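To make the recap concrete, here is a minimal NumPy sketch of a single LSTM step following the equations above. The parameter layout and names (`params`, `"i"`, `"f"`, `"o"`, `"g"`) are illustrative choices of this sketch, not taken from any specific library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; `params` maps each gate name to its (W, U, b)."""
    def affine(name):
        W, U, b = params[name]
        return W @ x_t + U @ h_prev + b

    i_t = sigmoid(affine("i"))          # input gate,  a vector in (0,1)^n
    f_t = sigmoid(affine("f"))          # forget gate, a vector in (0,1)^n
    o_t = sigmoid(affine("o"))          # output gate, a vector in (0,1)^n
    g_t = np.tanh(affine("g"))          # state-transition candidate

    c_t = f_t * c_prev + i_t * g_t      # element-wise memory update
    h_t = o_t * np.tanh(c_t)            # layer output
    return h_t, c_t

# toy usage with random parameters
n, m = 4, 3                             # hidden and input dimensions
rng = np.random.default_rng(0)
params = {k: (rng.normal(size=(n, m)), rng.normal(size=(n, n)), np.zeros(n))
          for k in ("i", "f", "o", "g")}
h, c = lstm_step(rng.normal(size=m), np.zeros(n), np.zeros(n), params)
```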
2.2 Simple Variations
The more expressive power a gate function has, the better it can learn short/long-term dependencies. That is, if a gate function is very rich, it can account for subtle changes in the input $x_t$ as well as for subtle changes in the “past”, i.e. in $h_{t-1}$. One way to enrich a gate function is by making it depend on the previous memory vector $c_{t-1}$ in addition to the regular tuple $(x_t, h_{t-1})$. That redefines our gate functions family to be

$$\mathcal{F}'_{\sigma} = \left\{ f(x, h, c) = \sigma\left(Wx + Uh + Vc + b\right) \right\}$$

for any $W, U$ and diagonal $V$. Such a construction can be found here; note that we can think of $Vc_{t-1} + b$ as a bias vector that is memory-dependent, thus there are formulations in which the vector $b$ is omitted.
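A minimal sketch of such a memory-dependent gate, assuming the diagonal matrix $V$ is stored as a vector `v` so that $Vc_{t-1}$ becomes an element-wise product (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_dependent_gate(x_t, h_prev, c_prev, W, U, v, b):
    """Gate that also looks at the previous memory vector c_{t-1}.

    V is diagonal, so V @ c_prev is simply v * c_prev element-wise;
    (v * c_prev + b) plays the role of a memory-dependent bias.
    """
    return sigmoid(W @ x_t + U @ h_prev + v * c_prev + b)
```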
2.3 Multiplicative LSTMs
In this paper from October 2016, the authors took that idea one step further: they wanted a unique gate function for each possible input. Recall that any function in $\mathcal{F}_{\sigma}$ is of the form $f(x_t, h_{t-1}) = \sigma\left(Wx_t + Uh_{t-1} + b\right)$. For simplicity, consider the case where $b = 0$.

We end up with a sum of two components: one that depends on the current input ($Wx_t$) and one that depends on the past ($Uh_{t-1}$, where $h_{t-1}$ encodes information about previous time steps), and the component with the larger magnitude will dominate the transition. If $Wx_t$ is larger, the layer will not be sensitive enough to the past, and if $Uh_{t-1}$ is larger, then the layer will not be sensitive to subtle changes in the input.
The authors noted that since in most cases the input $x_t$ is a 1-hot vector, multiplying by $W$ just selects a specific column of $W$. So we may think of any gate function $f \in \mathcal{F}_{\sigma}$ as a fixed base affine transformation $Uh_{t-1}$ combined with an input-dependent bias vector $W_{(x_t)}$ (the column of $W$ selected by $x_t$). That is,

$$f(x_t, h_{t-1}) = \sigma\left(Uh_{t-1} + W_{(x_t)}\right).$$

This formula emphasizes the additive effect of the current input vector on the transition of $h_{t-1}$.
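A tiny NumPy illustration of that observation, assuming a 1-hot input (the concrete numbers are made up for the example):

```python
import numpy as np

W = np.arange(12.0).reshape(3, 4)   # 3 hidden units, vocabulary of size 4
x = np.array([0.0, 0.0, 1.0, 0.0])  # 1-hot input selecting token #2

# multiplying by a 1-hot vector just selects the corresponding column of W,
# i.e. an input-dependent bias added on top of the fixed term U @ h
assert np.allclose(W @ x, W[:, 2])
```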
The goal in multiplicative LSTM (mLSTM) is to have a unique affine transformation (in our case, a unique matrix $U^{(x_t)}$) for each possible input. Obviously, if the number of possible inputs is very large, the number of parameters will explode and it won't be feasible to train the network. To overcome that, the authors of the paper suggested learning shared intermediate matrices that are used to construct a unique $U^{(x_t)}$ for each input; because the factorization is shared, there are fewer parameters to learn.
The factorization is defined as follows: the unique matrix $U^{(x_t)}$ is constructed by

$$U^{(x_t)} = U_1 \cdot \mathrm{diag}\left(W_m x_t\right) \cdot U_2$$

where $U_1$, $U_2$ and $W_m$ are intermediate matrices which are shared across all the possible inputs, and $\mathrm{diag}(v)$ is an operation that maps any vector $v$ to a square diagonal matrix with the elements of $v$ on its diagonal.

Note that the target dimension of $W_m x_t$ may be arbitrarily large (in the paper they chose it to be the same as the dimension of $h_{t-1}$).
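A quick NumPy check of the factorization (the matrix names $U_1$, $U_2$, $W_m$ follow the reconstruction above, not necessarily the paper's exact symbols): multiplying by $\mathrm{diag}(W_m x_t)$ is the same as an element-wise product, which is exactly what the reformulation below exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 4, 5, 6                      # hidden dim, input dim, intermediate dim
U1 = rng.normal(size=(n, k))
U2 = rng.normal(size=(k, n))
Wm = rng.normal(size=(k, d))
x = rng.normal(size=d)
h = rng.normal(size=n)

# input-dependent matrix built from shared factors
U_x = U1 @ np.diag(Wm @ x) @ U2

# diag(Wm @ x) applied to (U2 @ h) is just an element-wise product
assert np.allclose(U_x @ h, U1 @ ((Wm @ x) * (U2 @ h)))
```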
The difference between LSTM and mLSTM is only in the definition of the gate functions family; all the equations for updating the memory cell and the output stay the same. In mLSTM the family $\mathcal{F}_{\sigma}$ is replaced with

$$\mathcal{F}^{m}_{\sigma} = \left\{ f(x, h) = \sigma\left(Wx + U_1\, \mathrm{diag}\left(W_m x\right) U_2\, h + b\right) \right\}.$$

Note that we can reduce the number of parameters even further by forcing all the gate functions to use the same $W_m$ and $U_2$ matrices. That is, each gate will be parametrized only by $W$ and $U_1$.
Formally, define the following transformation $m$ such that

$$m(x_t, h_{t-1}) = \left(W_m x_t\right) \odot \left(U_2 h_{t-1}\right),$$

and then we can define $\mathcal{F}^{m}_{\sigma}$, for some fixed learned transformation $m$, as

$$\mathcal{F}^{m}_{\sigma} = \left\{ f(x, h) = \sigma\left(Wx + U_1\, m(x, h) + b\right) \right\};$$

in other words, mLSTM is an LSTM that applies its gates to the tuple $(x_t, m_t)$ rather than $(x_t, h_{t-1})$, where $m_t = m(x_t, h_{t-1})$ is another learned transformation. If you look again at the exact formula of $m(x_t, h_{t-1})$, you will see the multiplicative effect of the input vector on the transformation of $h_{t-1}$. That way, the authors said, $m$ can yield a much richer family of input-dependent transformations of $h_{t-1}$, which can be sensitive to the past as well as to subtle changes in the current input.
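A minimal NumPy sketch of the mLSTM gate computation under the notation above (the shared factors `Wm`, `U2` and the per-gate `W`, `U1`, `b` follow this post's reconstruction, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_gates(x_t, h_prev, Wm, U2, gate_params):
    """Compute all gate activations of one mLSTM step.

    Wm, U2:       intermediate matrices shared across gates and inputs.
    gate_params:  dict mapping gate name -> (W, U1, b), one entry per gate.
    """
    m_t = (Wm @ x_t) * (U2 @ h_prev)          # shared multiplicative state
    return {name: sigmoid(W @ x_t + U1 @ m_t + b)
            for name, (W, U1, b) in gate_params.items()}

# toy usage
rng = np.random.default_rng(1)
n, d, k = 4, 5, 4
Wm, U2 = rng.normal(size=(k, d)), rng.normal(size=(k, n))
gate_params = {g: (rng.normal(size=(n, d)), rng.normal(size=(n, k)), np.zeros(n))
               for g in ("i", "f", "o")}
acts = mlstm_gates(rng.normal(size=d), np.zeros(n), Wm, U2, gate_params)
```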
3 Recurrent Batch Normalization
Batch normalization is an operator applied to a layer before going through the non-linearity, in order to “normalize” the values before the activation. The operator has two learnable parameters, $\gamma$ and $\beta$, and two statistics that it accumulates internally. Formally, given a vector of pre-activation values $z$, batch normalization is

$$\mathrm{BN}\left(z; \gamma, \beta\right) = \beta + \gamma \odot \frac{z - \widehat{\mathbb{E}}\left[z\right]}{\sqrt{\widehat{\mathrm{Var}}\left[z\right] + \epsilon}}$$

where $\widehat{\mathbb{E}}[z]$ and $\widehat{\mathrm{Var}}[z]$ are the empirical mean and variance of the current batch respectively, and $\epsilon$ is a small constant for numerical stability. $\gamma$ and $\beta$ are vectors, and the division in the operator is computed element-wise. Note that at inference time, the population statistics are estimated by averaging the empirical statistics across all the batches.
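A minimal NumPy sketch of the training-time operator (the running averages needed for inference are omitted, and the names are illustrative):

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Batch-normalize pre-activations Z of shape (batch, features)."""
    mean = Z.mean(axis=0)                    # empirical mean of the batch
    var = Z.var(axis=0)                      # empirical variance of the batch
    Z_hat = (Z - mean) / np.sqrt(var + eps)  # element-wise normalization
    return beta + gamma * Z_hat              # scale and shift

# toy usage
rng = np.random.default_rng(2)
Z = rng.normal(loc=3.0, scale=2.0, size=(8, 5))
out = batch_norm(Z, gamma=np.ones(5), beta=np.zeros(5))
```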
That paper from February 2017 applied batch normalization also to the recurrent connections in LSTM layers, and the authors showed empirically that it helps the network converge faster. The usage is simple: each gate function $f$ and the transition function $g$ computes

$$\mathrm{BN}\left(Wx_t; \gamma_x, \beta_x\right) + \mathrm{BN}\left(Uh_{t-1}; \gamma_h, \beta_h\right) + b$$

where the $\gamma$ and $\beta$ parameters are shared across the different gates. Then the output of the layer is

$$h_t = o_t \odot \tanh\left(\mathrm{BN}\left(c_t; \gamma_c, \beta_c\right)\right).$$

The authors suggested setting $\beta_x = \beta_h = 0$, because there is already a bias parameter $b$, to prevent redundancy. In addition, they said that sharing the internal $\mathrm{BN}$ statistics across time degrades performance severely. Therefore, one should use “fresh” $\mathrm{BN}$ operators for each time step with their own internal statistics (but share the $\gamma$ and $\beta$ parameters across time).
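A sketch of how this might look inside one LSTM step, assuming a `batch_norm` helper like the one above and `W`, `U` that produce the concatenated pre-activations of all four gates (an organizational choice of this sketch, not the paper's code). Statistics are recomputed from the current batch at every call, so each time step naturally gets its own statistics during training; separate running averages per time step for inference are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_norm(Z, gamma, beta, eps=1e-5):
    mean, var = Z.mean(axis=0), Z.var(axis=0)
    return beta + gamma * (Z - mean) / np.sqrt(var + eps)

def bn_lstm_step(X_t, H_prev, C_prev, W, U, b, gamma_x, gamma_h, gamma_c, beta_c):
    """One batch-normalized LSTM step on a whole batch (rows = examples).

    W (4n x d) and U (4n x n) map to the concatenated pre-activations of
    (i, f, o, g); beta_x and beta_h are fixed to 0 since b already acts as a bias.
    """
    n = H_prev.shape[1]
    pre = (batch_norm(X_t @ W.T, gamma_x, 0.0)
           + batch_norm(H_prev @ U.T, gamma_h, 0.0) + b)
    i, f, o = (sigmoid(pre[:, k * n:(k + 1) * n]) for k in range(3))
    g = np.tanh(pre[:, 3 * n:])
    C_t = f * C_prev + i * g
    H_t = o * np.tanh(batch_norm(C_t, gamma_c, beta_c))  # normalized memory at the output
    return H_t, C_t
```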
4 The Ever-growing Field
RNNs are a rapidly growing field, and this series covers only a small part of it. There are many more advanced models, some of them only a few months old. If you are interested in this field, you may find very interesting material here.