1 The First Post in the Series

The Recurrent Neural Network (RNN) is a neural network architecture that achieves state-of-the-art results in many tasks, such as machine translation, speech recognition, image captioning and many more. Because it is a rapidly evolving field, it is difficult to find all the information (both basic and advanced material) in one place, so I decided to write a series of posts that will discuss Recurrent Neural Networks from their basic formulation to their most recent variations.
This post, the first in the series, presents the concept of a recurrent neuron and its limitations. The next posts will discuss gated neurons like LSTM and GRU, variants of dropout regularization for RNNs, and some advanced material that I found interesting (bi-directional RNNs, Multiplicative LSTM, etc.).
The next section is dedicated to reviewing the notation and basic terminology that will be used throughout the series. I will assume that the reader is familiar with the basics of neural networks (the fully connected architecture, the back-propagation algorithm, etc.). If you are not familiar with one of those concepts, I strongly recommend reviewing them online before going further. Those notes provide a concise introduction to the field. There is also an excellent video lecture (in Hebrew) by Prof. Shai Shalev-Shwartz from 2016. If you really want to dive deeper, you can find a broad introduction in Chapter 6 of the Deep Learning book (a legally free electronic version is available here).

2 Notations and Basic Terminology

An Artificial Neural Network (ANN) is a hierarchical mathematical model inspired by neuroscience. The basic computation unit is a neuron, which is a function from $\mathbb{R}^n$ to $\mathbb{R}$ of the form $f(x) = \sigma(w^\top x + b)$, where $x, w \in \mathbb{R}^n$ are the input and weight vectors respectively, and $b \in \mathbb{R}$ is the bias. The function $\sigma$ is a non-linear activation function, for example $\tanh$ or $\mathrm{ReLU}$. Note that a neuron can ignore some of its input coordinates by setting their corresponding entries in its weight vector to zero.
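To make the definition concrete, here is a minimal sketch of a single neuron in NumPy; the dimensions, the specific weights, and the choice of $\tanh$ are arbitrary, for illustration only:

```python
import numpy as np

# A single neuron: f(x) = sigma(w . x + b), here with sigma = tanh.
def neuron(x, w, b):
    return np.tanh(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 0.0, 2.0])   # input vector in R^4
w = np.array([0.1, 0.3, -0.2, 0.05])  # weights (a zero entry ignores that input)
b = 0.1                               # bias
print(neuron(x, w, b))                # a single scalar output
```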
Neurons are organized within layers. In the most basic ANN architecture, called a fully connected (FC) neural network, a layer is just a collection of neurons that are computed in parallel (meaning that they are not connected to each other). For any layer $i$, each of its neurons accepts as inputs all the neurons from layer $i-1$.
A layer in the FC architecture can be formulated mathematically as a function from $\mathbb{R}^n$ to $\mathbb{R}^m$ such that $L(x) = (f_1(x), \ldots, f_m(x))$, where each $f_i$ is a neuron function as defined above, with its own weight vector and bias scalar. The formulation can be simplified using matrix notation: $L(x) = \sigma(Wx + b)$, where $W \in \mathbb{R}^{m \times n}$ contains in its $i$-th row the weight vector of $f_i$, and $b_i$ is the bias of $f_i$. The function $\sigma$ is the same as before, but since its input is now a vector, it applies element-wise on each coordinate of the input, producing an output vector of the same length.
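In code the matrix form is a one-liner; a minimal sketch of an FC layer, again assuming $\tanh$ as the activation and hypothetical dimensions:

```python
import numpy as np

# An FC layer: L(x) = sigma(Wx + b), sigma applied element-wise.
def fc_layer(x, W, b):
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))  # row i holds the weight vector of neuron f_i
b = np.zeros(3)              # one bias per neuron
x = rng.normal(size=4)
print(fc_layer(x, W, b))     # a layer mapping R^4 -> R^3
```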
We denote the family of all possible FC layer functions based on some fixed activation function $\sigma$ as $\mathcal{F}_\sigma = \{x \mapsto \sigma(Wx + b)\}$; that family contains all possible layers for any input/output dimensions. For brevity, I will not mention the dimensions of layers or the specific activation function when they are not important for the discussion.
Applying several layers sequentially yields a feedforward FC neural network. That is, given layers $L_1, \ldots, L_k \in \mathcal{F}_\sigma$ for some fixed $\sigma$, an FC network with $k$ layers is $N(x) = L_k(L_{k-1}(\cdots L_1(x) \cdots))$.
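Composing layers is then just function composition. A sketch of a two-layer feedforward network under the same assumptions (the dimensions are hypothetical):

```python
import numpy as np

def fc_layer(x, W, b):
    return np.tanh(W @ x + b)

# N(x) = L_k(...L_1(x)): apply the layers sequentially.
def network(x, layers):
    for W, b in layers:
        x = fc_layer(x, W, b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(5, 4)), np.zeros(5)),   # L_1: R^4 -> R^5
          (rng.normal(size=(2, 5)), np.zeros(2))]   # L_2: R^5 -> R^2
print(network(rng.normal(size=4), layers))
```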
The number of layers in the network is its depth, and the size of the layers is its width (assuming all the layers have the same dimension). Training the network means optimizing all the weight matrices and bias vectors so that the network approximates some target (usually unknown) function. We can think of the optimization process as iteratively updating the choice of layer functions from the family $\mathcal{F}_\sigma$.
The FC architecture is simple, yet yields a very powerful model. When wide enough, even a network with a single hidden layer can approximate almost any continuous function to arbitrary precision (under some mild assumptions). However, for some tasks it might not be optimal, and different architectures are used instead of the FC one.

3 The Need for Context

There are tasks that require some notion of context in order to make a decent prediction. For example, tasks that involve time steps can consider the history up to the current time step as a context (a.k.a. time-series tasks). For a concrete example, consider predicting the next word in a given sequence. Each word has as its context all the words preceding it. Note that there are more ways to define the notion of a context, and it mostly depends on the task at hand. For didactic purposes we will focus on the example of next-word prediction.
How can we build a model for next-word prediction using an FC neural network?
The obvious answer is to fix some window size $k$, and build a network that gets the last $k$ words in the sentence as features, in order to predict the next word. One immediate limitation of this approach is that the window size is the same for all the examples in the training set, and it is not simple to find a suitable window size. On the one hand, choosing a small $k$ may cause the model to miss some important long-term relations between words that are more than $k$ steps apart (for sentences longer than $k$). On the other hand, choosing a very large $k$ increases the number of parameters in the model, which makes the optimization harder, and increases the computation time per example, which affects both training time and prediction time.
Another limitation is that FC layers regard their inputs as ordered information. That is, an FC layer assigns a specific weight to each coordinate in its input vector, making the order of the features meaningful. But in some tasks the order of the features may vary; for example, the subject of a sentence is not limited to a specific location within an English sentence. If we want an FC layer to account for that, we have to provide a lot of examples (and they need to cover all the different cases).
What we would like to have here is called parameter sharing. We will come back to that term later in this post.

4 Recurrent Cell - Learning to Remember

We want our network to be able to remember information about previous invocations (i.e. the past inputs). In order to do that, our network has to maintain a state. So we need to extend our neuron formulation to include another variable, $h$, that will be its state and will be kept across consecutive invocations. That state variable is usually initialized to zero so that it doesn't affect the first computation.
We will use a superscript to denote the time step, so $h^{(t)}$ is the state of a neuron at time $t$. Since the neurons need to pass information to themselves (i.e. their state from the previous invocation), they become recurrent neurons, a.k.a. recurrent cells.
A recurrent neuron uses two functions internally. The first function updates the inner state based on the current input and the previous state, which is the transition from time $t-1$ to time $t$: $h^{(t)} = \sigma(w^\top x^{(t)} + u h^{(t-1)} + b)$, where $u$ is a new parameter denoting the weight for the previous state.
The second function computes the output of the neuron at the current time $t$, based on its new state: $y^{(t)} = \sigma(v h^{(t)} + c)$, with $v$ and $c$ as the weight and bias of that computation respectively.
We can move again to matrix notation to denote several parallel recurrent cells that form a recurrent layer. The first function, $R$, updates the vector of states for the entire layer, $h^{(t)} = R(x^{(t)}, h^{(t-1)}) = \sigma(W x^{(t)} + U h^{(t-1)} + b)$, and the second function, $O$, computes the output of the layer based on the current state, $y^{(t)} = O(h^{(t)}) = \sigma(V h^{(t)} + c)$. When we compare that to the FC formulation, we see that the output function itself hasn't actually changed, but it now accepts as input the state of the layer $h^{(t)}$ rather than the input to the layer $x^{(t)}$. Another change is that before the layer computes its output, it first updates its hidden state.
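Here is a minimal sketch of one invocation of a recurrent layer, matching the two formulas above; the shapes and the use of $\tanh$ for both functions are assumptions for illustration:

```python
import numpy as np

# One invocation of a recurrent layer:
#   state update: h_t = sigma(W x_t + U h_prev + b)   (the function R)
#   output:       y_t = sigma(V h_t + c)              (the function O)
def rnn_step(x_t, h_prev, W, U, b, V, c):
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    y_t = np.tanh(V @ h_t + c)
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 3, 2                  # arbitrary dimensions
W = rng.normal(size=(n_hid, n_in))
U = rng.normal(size=(n_hid, n_hid))           # weights for the previous state
b = np.zeros(n_hid)
V = rng.normal(size=(n_out, n_hid))
c = np.zeros(n_out)

h = np.zeros(n_hid)                           # h^(0) = 0: no effect on the first step
h, y = rnn_step(rng.normal(size=n_in), h, W, U, b, V, c)
print(h, y)
```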
Note that we can extend the definition of the family $\mathcal{F}_\sigma$ to contain functions like the state-transition function $R$, by making each function accept two vectors as input instead of one (we also need another weight matrix for the second input). So denote the family of all hidden-state transition functions as $\mathcal{R}_\sigma = \{(x, h) \mapsto \sigma(Wx + Uh + b)\}$; usually the activation function is $\tanh$. We will use that family later on, when discussing gated units such as LSTM and GRU.

5 Comparing the Architectures - an Example

Consider again the next-word prediction problem. Suppose we have an ordered dictionary that contains the entire vocabulary we will ever need. We will represent any word by an integer indicating its index in the dictionary. In this section we will construct two single-neuron networks, once using the FC architecture and once using the recurrent architecture.
FC network for next-word prediction: Let the window size be $k = 3$. The input to the network can be represented as a vector $x = (x_1, x_2, x_3)$ and the output as an integer $y$. The FC network amounts to $y = N(x_1, x_2, x_3)$ for some $N$ that maps from $\mathbb{N}^3$ to $\mathbb{N}$.
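A toy sketch of this FC version (keeping words as raw integer indices purely to mirror the formulation above; a real model would use embeddings and a softmax over the vocabulary):

```python
import numpy as np

# Toy FC next-word predictor over a fixed window of k = 3 word indices.
def fc_predictor(window, layers):
    x = np.asarray(window, dtype=float)       # always exactly 3 features
    for W, b in layers:
        x = np.tanh(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 3)), np.zeros(8)),  # window size 3 is baked in here
          (rng.normal(size=(1, 8)), np.zeros(1))]
print(fc_predictor([12, 7, 42], layers))      # three word indices in, one score out
```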
The recurrent neural network (RNN), however, leads to a different representation: the input and the output are both single integers, so the state-transition function and the output function both work on one word at a time rather than on a whole window. The idea of the recurrent architecture is to apply the same recurrent layer several times on different inputs (each input from a different time step) in order to get a prediction.
Here is a detailed description of how to do that: we take a series of three consecutive words from the beginning of a sentence, $x^{(1)}, x^{(2)}, x^{(3)}$, and a target next word $y$. First, feed the network with $x^{(1)}$; it updates the inner state from $h^{(0)}$ to some hidden state $h^{(1)}$ and outputs $y^{(1)}$. We ignore that output and feed the next word $x^{(2)}$; the network updates its state to $h^{(2)}$ and produces $y^{(2)}$, which is also ignored. Lastly, we feed $x^{(3)}$ to the network; the inner state is updated to $h^{(3)}$ and another output $y^{(3)}$ is produced, which is now considered the prediction of the network. Note that in each invocation of the network, the output depends on both the current input and the previous state; therefore $y^{(3)}$ is the result of the entire history $x^{(1)}, x^{(2)}, x^{(3)}$ and $h^{(0)}$ (which was initialized to zero).
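The walkthrough translates to a simple loop in which the same parameters are reused at every step and only the last output is kept; a self-contained sketch (scalar word inputs and a 3-dimensional hidden state are arbitrary choices):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, V, c):
    h_t = np.tanh(W @ x_t + U @ h_prev + b)   # update the state first
    y_t = np.tanh(V @ h_t + c)                # then compute the output
    return h_t, y_t

# Feed the words one at a time; the SAME parameters are reused at every
# step, and only the final output y^(3) is kept as the prediction.
def predict_next(words, params, n_hid):
    W, U, b, V, c = params
    h = np.zeros(n_hid)                       # h^(0) = 0
    y = None
    for x_t in words:                         # x^(1), x^(2), x^(3)
        h, y = rnn_step(np.array([float(x_t)]), h, W, U, b, V, c)
    return y                                  # intermediate y's were ignored

rng = np.random.default_rng(0)
n_hid = 3
params = (rng.normal(size=(n_hid, 1)), rng.normal(size=(n_hid, n_hid)),
          np.zeros(n_hid), rng.normal(size=(1, n_hid)), np.zeros(1))
print(predict_next([12, 7, 42], params, n_hid))
```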
At first glance it looks like there is no meaningful difference between the architectures. We have an implicit sliding window over the words in the sentence for the recurrent neuron rather than an explicit sliding window as with the FC neuron. Moreover, the output of the FC neuron also depends on all previous time steps, because its input is just the values of $x_1, x_2, x_3$.
However, note that in the FC architecture each word in the input vector has its own weight and bias, whereas in the recurrent architecture we use the same weights over and over again (by applying the same function). In other words, in recurrent layers we share the parameters across time. The formula below demonstrates how we apply the same function $R$ to update the hidden state across different time steps, then make a prediction using some $O$ on the current state: $y^{(3)} = O(R(x^{(3)}, R(x^{(2)}, R(x^{(1)}, h^{(0)}))))$. Parameter sharing has a very powerful effect on the learnability of the network, both because it reduces the number of parameters that need to be optimized and because it makes the network robust to small changes in the context (recall the example about the order of features in the input). In addition, the recurrent layer practically has a dynamic window size. Going back to our next-word prediction network, we can continue to apply the network on consecutive inputs, teaching it dependencies between words that are more than 3 time steps apart.
Note that in our example we ignored the "intermediate" predictions of the network, $y^{(1)}$ and $y^{(2)}$, but we could keep them and feed them as inputs to another recurrent layer, making our RNN network deeper. We will discuss this in a later post.

6 Limitations of Recurrent Cells

Unfortunately, RNNs are not perfect, and maintaining memory is not enough; sometimes the network needs to be able to forget. For instance, when predicting the next word in a sentence, the network needs to reset its state once it reaches the end of the sentence.
Furthermore, parameter sharing has a serious drawback. Recall that applying the network for $t$ time steps leads to a composition of the same $R$ upon itself $t$ times, something like this: $h^{(t)} = R(x^{(t)}, R(x^{(t-1)}, \ldots R(x^{(1)}, h^{(0)}) \ldots))$, which may cause the gradient during back-propagation to either explode or vanish. This problem is known as the vanishing and exploding gradients problem. In order to get an intuition about it, we make the following simplifications: assume that the activation function inside $R$ is the identity instead of $\tanh$, and that $x^{(t)} = 0$ for all $t$. Suppose that $h^{(0)}$ is given to us and is not all zeros (that is, it encodes some information about the past).
We have that $h^{(t)} = U h^{(t-1)}$, so applying the recurrent neuron for $t$ time steps amounts to $h^{(t)} = U^t h^{(0)}$. What happens to $U^t$ as $t$ increases? It is easier to see the answer in the scalar case. If $u$ were a real positive number, then either $u^t \to \infty$ when $u > 1$, or $u^t \to 0$ when $u < 1$. The same goes for matrices: suppose $U$ has an eigenvalue decomposition $U = Q \Lambda Q^\top$ for an orthogonal $Q$, so $U^t = Q \Lambda^t Q^\top$, causing the small entries on the diagonal of $\Lambda$ to vanish and the large entries to explode. Since the gradients in our network are scaled according to $U^t$, they either explode or vanish. The problem with very large gradients is that they cause the network to diverge (large updates to the parameters); on the other hand, very small gradients cause infinitesimally small updates to the parameters, which prevents the network from learning. This is described much more formally and rigorously here.
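The two regimes are easy to see numerically. A minimal sketch using a diagonal $U$ (so the eigenvalues sit right on the diagonal), with one eigenvalue below 1 and one above:

```python
import numpy as np

# Diagonal U, so its eigenvalues sit on the diagonal:
# one below 1 (vanishes) and one above 1 (explodes).
U = np.diag([0.5, 1.5])
h0 = np.array([1.0, 1.0])        # h^(0): not all zeros

for t in [1, 5, 10, 20]:
    h_t = np.linalg.matrix_power(U, t) @ h0   # h^(t) = U^t h^(0)
    print(t, h_t)
```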
In the next post I will discuss two popular improvements to recurrent neurons that help avoid the limitations above: the LSTM and GRU cells.
Acknowledgments: I want to thank Ron Shiff for his useful comments and corrections to this post.