1 Recap

This is the third post in the series about Recurrent Neural Networks. The first post defined the basic notations and the formulation of a Recurrent Cell, and the second post discussed its extensions such as LSTM and GRU. Recall that an LSTM layer receives a vector $x_t$ (the input for the current time step $t$), and the memory-vector $c_{t-1}$ and the output-vector $h_{t-1}$ from the previous time-step respectively. The layer then computes a new memory vector
$$c_t = f_t \odot c_{t-1} + i_t \odot z_t$$
where $\odot$ denotes element-wise multiplication. Then the layer outputs
$$h_t = o_t \odot \tanh(c_t)$$
where $f_t$, $i_t$ and $o_t$ are the forget, input and output gates respectively, and $z_t$ is the state transition function; each of them is computed by some member of $\mathcal{A}^{\phi}_{n,m}$ applied to $(x_t, h_{t-1})$, with $\phi = \sigma$ for the gates and $\phi = \tanh$ for $z_t$. Here $\mathcal{A}^{\phi}_{n,m}$ denotes the family of all affine transformations followed by an element-wise activation function $\phi$, for any input-dimension $n$ and output-dimension $m$.
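To make the recap concrete, here is a minimal numpy sketch of a single LSTM step. The stacked parameter layout (a single matrix `W` and bias `b` holding the four affine maps) is a convenience of this sketch, not notation from the series.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, c_prev, h_prev, W, b):
    """One LSTM step. W has shape (4 * n, d + n) and b has shape (4 * n,),
    stacking the affine maps of the forget, input and output gates and of
    the state transition z, where n is the hidden size and d the input size."""
    n = c_prev.shape[0]
    a = W @ np.concatenate([x_t, h_prev]) + b   # all four affine maps at once
    f_t = sigmoid(a[0 * n:1 * n])               # forget gate
    i_t = sigmoid(a[1 * n:2 * n])               # input gate
    o_t = sigmoid(a[2 * n:3 * n])               # output gate
    z_t = np.tanh(a[3 * n:4 * n])               # state transition
    c_t = f_t * c_prev + i_t * z_t              # new memory vector
    h_t = o_t * np.tanh(c_t)                    # new output vector
    return c_t, h_t
```

The later sketches in this post reuse the `sigmoid` and `lstm_step` helpers defined here.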
In this post, we will focus on the different variants of dropout regularization that are used in LSTM networks.

2 Dropout Regularization

Dropout is a very popular regularization mechanism that is attached to layers of a neural network in order to reduce overfitting and improve generalization. Applying dropout to a layer starts by fixing some dropout probability $p \in (0,1)$. Then, any time that layer produces an output during training, each one of its neurons is zeroed-out independently with probability $p$, and with probability $1-p$ it is scaled by $\frac{1}{1-p}$. Scaling the output is important because it keeps the expected output-value of the neuron unchanged. During test time, the dropout mechanism is turned off, that is, the output of each neuron is passed as is.
For fully-connected layers, we can formulate dropout as follows: suppose that $y \in \mathbb{R}^n$ should be the output of some layer that has a dropout mechanism. Generate a mask $m \in \{0,1\}^n$ by picking each coordinate i.i.d. from $\mathrm{Bernoulli}(1-p)$, and output $m \odot y$ instead of $y$. Note that we hide here the output-scaling for simplicity, and that the mask is regenerated any time the layer is required to produce an output.
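As a quick illustration, here is a sketch of this (inverted) dropout for a single output vector; the function name and the use of a numpy `Generator` are choices of this sketch.

```python
import numpy as np

def dropout(y, p, rng, training=True):
    """Inverted dropout on an output vector y: each coordinate is zeroed
    with probability p, and surviving coordinates are scaled by 1 / (1 - p)
    so the expected value of every neuron stays unchanged. At test time
    (training=False) y is returned as is."""
    if not training or p == 0.0:
        return y
    mask = (rng.random(y.shape) >= p).astype(y.dtype)
    return y * mask / (1.0 - p)
```

For example, `dropout(np.ones(10), 0.5, np.random.default_rng(0))` zeroes roughly half of the coordinates and scales the surviving ones to 2.0.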
At first glance it seems trivial to apply dropout to LSTM layers. However, when we come to implement the mechanism we encounter some technical issues that should be addressed. For instance, we need to decide whether to apply the mask only to the input $x_t$ or also to the recurrent connection (i.e. to $h_{t-1}$), and if we choose to apply it to both, should it be the same mask or different masks? Should a mask be shared across time or be generated anew for each time-step?
Several works tried to apply dropout to LSTM layers in naive ways, but without success. It seems that just randomly dropping some coordinates of the recurrent connections impairs the ability of the LSTM layer to learn long/short-term dependencies and does not improve generalization. In the next section we will review some works that applied dropout to LSTMs in a way that successfully yields better generalization.

3 Variants

3.1 Mask Only Inputs; Regenerate Masks

Wojciech Zaremba, Ilya Sutskever and Oriol Vinyals published in 2014 a paper that describes a successful dropout variation for LSTM layers. Their idea was that dropout should be applied only to the inputs of the layer and not to the recurrent connections. Moreover, a new mask should be generated for each time step.
Formally, for each time-step $t$, generate a mask $m_t$ and compute $\tilde{x}_t = m_t \odot x_t$. Then continue to compute the LSTM layer as usual, but use $\tilde{x}_t$ as the input to the layer rather than $x_t$.
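A sketch of this variant over a whole sequence, reusing the `lstm_step` helper from the recap (the function name and the inverted scaling are choices of this sketch):

```python
def lstm_input_dropout_sequence(xs, c0, h0, W, b, p, rng):
    """Run an LSTM over a sequence of inputs xs, dropping only the inputs:
    a new mask is generated at every time step and applied to x_t, while
    the recurrent connections (c, h) are left untouched."""
    c, h = c0, h0
    outputs = []
    for x_t in xs:
        m_t = (rng.random(x_t.shape) >= p) / (1.0 - p)  # fresh mask each step
        c, h = lstm_step(m_t * x_t, c, h, W, b)
        outputs.append(h)
    return outputs
```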

3.2 rnnDrop: Mask Only the Memory; Fixed Mask

In 2015, Taesup Moon, Heeyoul Choi, Hoshik Lee and Inchul Song published a different dropout variation: rnnDrop.
They suggested generating a mask for each training sequence and fixing it for all the time-steps in that sequence, that is, the mask is shared across time. The mask is then applied to the memory vector of the layer rather than to the input. In their formulation, only the formula of the memory vector changes: $c_t = m \odot (f_t \odot c_{t-1} + i_t \odot z_t)$, where $m$ is the fixed mask for the entire current training sequence.
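A sketch of a single rnnDrop step under this post's notation, reusing the `sigmoid` helper from the recap; the mask `m` is assumed to be drawn once per training sequence, e.g. `m = (rng.random(n) >= p) / (1.0 - p)`:

```python
def rnndrop_step(x_t, c_prev, h_prev, W, b, m):
    """One rnnDrop step: the mask m, fixed for the whole training sequence,
    multiplies the newly computed memory vector."""
    n = c_prev.shape[0]
    a = W @ np.concatenate([x_t, h_prev]) + b
    f_t, i_t, o_t = [sigmoid(a[k * n:(k + 1) * n]) for k in range(3)]
    z_t = np.tanh(a[3 * n:4 * n])
    c_t = m * (f_t * c_prev + i_t * z_t)   # the only change from the plain cell
    h_t = o_t * np.tanh(c_t)
    return c_t, h_t
```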

3.3 Mask Input and Hidden; Fixed Mask

A relatively recent work of Yarin Gal and Zoubin Ghahramani from 2016 also uses a mask that is shared across time; however, it is applied to the inputs as well as to the recurrent connections. This is one of the first successful dropout variants that actually applies the mask to the recurrent connection.
Formally, for each training sequence generate two masks $m^x$ and $m^h$, and for every time-step compute $\tilde{x}_t = m^x \odot x_t$ and $\tilde{h}_{t-1} = m^h \odot h_{t-1}$. Then use $\tilde{x}_t$ and $\tilde{h}_{t-1}$ as the inputs to the "regular" LSTM layer.
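A sketch over a whole sequence, again reusing `lstm_step`; drawing the two masks once per sequence is the essential difference from the variant of Section 3.1:

```python
def lstm_variational_dropout_sequence(xs, c0, h0, W, b, p, rng):
    """Two masks, fixed for the whole sequence: one applied to the input
    and one applied to the recurrent hidden state at every time step."""
    m_x = (rng.random(xs[0].shape) >= p) / (1.0 - p)  # fixed input mask
    m_h = (rng.random(h0.shape) >= p) / (1.0 - p)     # fixed recurrent mask
    c, h = c0, h0
    outputs = []
    for x_t in xs:
        c, h = lstm_step(m_x * x_t, c, m_h * h, W, b)
        outputs.append(h)
    return outputs
```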

3.4 Mask Gates; Fixed Mask

Another paper from 2016, by Stanislau Semeniuta, Aliaksei Severyn and Erhardt Barth, demonstrates the mask being applied to some of the gates rather than to the input or hidden vectors. For each time-step $t$, generate a mask $m_t$ and use it to mask the input gate, that is, $c_t = f_t \odot c_{t-1} + (m_t \odot i_t) \odot z_t$, while the output equation $h_t = o_t \odot \tanh(c_t)$ is left unchanged.
A small note: the authors also addressed in their paper some issues related to scaling the non-dropped coordinates, which won't be covered here.
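A sketch of one step of this gate-masking idea under this post's notation (the inverted scaling shown here is a choice of this sketch; as noted above, the paper's own scaling discussion is not reproduced):

```python
def gate_dropout_step(x_t, c_prev, h_prev, W, b, p, rng):
    """One step with a per-time-step mask on the input gate;
    everything else is the plain LSTM cell."""
    n = c_prev.shape[0]
    a = W @ np.concatenate([x_t, h_prev]) + b
    f_t, i_t, o_t = [sigmoid(a[k * n:(k + 1) * n]) for k in range(3)]
    z_t = np.tanh(a[3 * n:4 * n])
    m_t = (rng.random(n) >= p) / (1.0 - p)   # fresh mask every time step
    c_t = f_t * c_prev + (m_t * i_t) * z_t   # only the input gate is masked
    h_t = o_t * np.tanh(c_t)
    return c_t, h_t
```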

3.5 Zoneout

The most recent dropout variation (as of the day this post was written) is by David Krueger et al., from 2017. They suggested treating the memory vector and the output vector (a.k.a. the hidden vector) as follows: each coordinate in each of the vectors is either updated as usual or preserves its value from the previous time-step. As opposed to regular dropout, where "dropping a coordinate" means setting it to zero, in zoneout the coordinate just keeps its previous value, acting as a random identity map that allows gradients to propagate through more time steps. Note that preserving a coordinate in the memory vector should not affect the computation of the hidden vector. Therefore we have to rewrite the formulas for the LSTM layer: given $c_{t-1}$ and $h_{t-1}$, generate two masks $m^c_t$ and $m^h_t$ for the current time-step $t$. Start by computing a candidate memory-vector with the regular formula, $\tilde{c}_t = f_t \odot c_{t-1} + i_t \odot z_t$, and then use only some of its coordinates to update the memory-vector, that is, $c_t = m^c_t \odot c_{t-1} + (1 - m^c_t) \odot \tilde{c}_t$. Similarly, for the hidden-vector, compute a candidate $\tilde{h}_t = o_t \odot \tanh(\tilde{c}_t)$ and selectively update: $h_t = m^h_t \odot h_{t-1} + (1 - m^h_t) \odot \tilde{h}_t$. Observe again that the computation of the candidate hidden-vector is based on $\tilde{c}_t$, and is therefore unaffected by the mask $m^c_t$.
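A training-time sketch of one zoneout step under this post's notation (test-time handling is left to the paper):

```python
def zoneout_step(x_t, c_prev, h_prev, W, b, p_c, p_h, rng):
    """One training-time zoneout step: each coordinate of the memory and of
    the hidden vector either takes its newly computed value or keeps its
    previous one, chosen independently per coordinate."""
    n = c_prev.shape[0]
    a = W @ np.concatenate([x_t, h_prev]) + b
    f_t, i_t, o_t = [sigmoid(a[k * n:(k + 1) * n]) for k in range(3)]
    z_t = np.tanh(a[3 * n:4 * n])
    c_cand = f_t * c_prev + i_t * z_t       # candidate memory vector
    h_cand = o_t * np.tanh(c_cand)          # candidate hidden, built from c_cand
    m_c = rng.random(n) < p_c               # True = keep the previous value
    m_h = rng.random(n) < p_h
    c_t = np.where(m_c, c_prev, c_cand)
    h_t = np.where(m_h, h_prev, h_cand)
    return c_t, h_t
```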

4 More and More and More

Over the last year, dropout for LSTM networks has gained more attention, and the list of different variants is growing quickly. This post covered only a fraction of the variants available out there. It is left for the curious reader to search the literature for more on this topic.
Note that some of the dropout variations discussed above can be applied to basic RNN and GRU cells without much modification; please refer to the papers themselves for more details.
In the next post, I will discuss recent advancements in the field of RNNs, such as Multiplicative LSTMs, Recurrent Batch Normalization and more.