What is a Recurrent Neural Network, what is a Long Short Term Memory (LSTM) network, and is it always better?

machine-learning neural-network sequence lstm

10,587

Solution 1

What is a Recurrent neural network?

The basic idea is that recurrent networks have loops. These loops allow the network to use information from previous passes, which acts as memory. The length of this memory depends on a number of factors but it is important to note that it is not indefinite. You can think of the memory as degrading, with older information being less and less usable.

For example, let's say we just want the network to do one thing: Remember whether an input from earlier was 1, or 0. It's not difficult to imagine a network which just continually passes the 1 around in a loop. However every time you send in a 0, the output going into the loop gets a little lower (This is a simplification, but displays the idea). After some number of passes the loop input will be arbitrarily low, making the output of the network 0. As you are aware, the vanishing gradient problem is essentially the same, but in reverse.

Why not just use a window of time inputs?

You offer an alternative: A sliding window of past inputs being provided as current inputs. That's is not a bad idea, but consider this: While the RNN may have eroded over time, you will always lose the entirety of your time information after you window ends. And while you would remove the vanishing gradient problem, you would have to increase the number of weights of your network by several times. Having to train all those additional weights will hurt you just as badly as (if not worse than) vanishing gradient.

What is an LSTM network?

You can think of LSTM as a special type of RNN. The difference is that LSTM is able to actively maintain self connecting loops without them degrading. This is accomplished through a somewhat fancy activation, involving an additional "memory" output for the self looping connection. The network must then be trained to select what data gets put onto this bus. By training the network to explicit select what to remember, we don't have to worry about new inputs destroying important information, and the vanishing gradient doesn't affect the information we decided to keep.

There are two main drawbacks:

It is more expensive to calculate the network output and apply back propagation. You simply have more math to do because of the complex activation. However this is not as important as the second point.
The explicit memory adds several more weights to each node, all of which must be trained. This increases the dimensionality of the problem, and potentially makes it harder to find an optimal solution.

Is it always better?

Which structure is better depends on a number of factors, like the number of nodes you need for you problem, the amount of available data, and how far back you want your network's memory to reach. However if you only want the theoretical answer, I would say that given infinite data and computing speed, an LSTM is the better choice, however one should not take this as practical advice.

Solution 2

A feed forward neural network has connections from layer n to layer n+1.

A recurrent neural network allows connections from layer n to layer n as well.

These loops allow the network to perform computations on data from previous cycles, which creates a network memory. The length of this memory depends on a number of factors and is an area of active research, but could be anywhere from tens to hundreds of time steps.

To make it a bit more clear, the carried 1 in your example is stored in the same way as the inputs: in a pattern of activation of a neural layer. It's just the recurrent (same layer) connections that allow the 1 to persist through time.

Obviously it would be infeasible to replicate every input stream for more than a few past time steps, and choosing which historical streams are important would be very difficult (and lead to reduced flexibility).

LSTM is a very different model which I'm only familiar with by comparison to the PBWM model, but in that review LSTM was able to actively maintain neural representations indefinitely, so I believe it is more intended for explicit storage. RNNs are more suited to non-linear time series learning, not storage. I don't know if there are drawbacks to using LSTM rather RNNs.

Solution 3

Both RNN and LSTM can be sequence learners. RNN suffers from vanishing gradient point problem. This problem causes the RNN to have trouble in remembering values of past inputs after more than 10 timesteps approx. (RNN can remember previously seen inputs for a few time steps only)

LSTM is designed to solve the vanishing gradient point problem in RNN. LSTM has the capability of bridging long time lags between inputs. In other words, it is able to remember inputs from up to 1000 time steps in the past (some papers even made claims it can go more than this). This capability makes LSTM an advantage for learning long sequences with long time lags. Refer to Alex Graves Ph.D. thesis Supervised Sequence Labelling with Recurrent Neural Networks for some details. If you are new to LSTM, I recommend Colah's blog for super simple and easy explanation.

However, recent advances in RNN also claim that with careful initialization, RNN can also learn long sequences comparable to the performance of LSTM. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.

10,587

Author by

Essam Al-Mansouri

Updated on June 28, 2022

Comments

Essam Al-Mansouri almost 2 years
First, let me apologize for cramming three questions in that title. I'm not sure what better way is there.

I'll get right to it. I think I understand feedforward neural networks pretty well.

But LSTM really escapes me, and I feel maybe this is because I don't have a very good grasp of Recurrent neural networks in general. I have went through Hinton's and Andrew Ng's course on Coursera. A lot of it still doesn't make sense to me.

From what I understood, recurrent neural networks are different from feedforward neural networks in that past values influence the next prediction. Recurrent neural network are generally used for sequences.

The example I saw of recurrent neural network was binary addition.
```
    010
+   011
```
A recurrent neural network would take the right most 0 and 1 first, output a 1. Then take the 1,1 next, output a zero, and carry the 1. Take the next 0,0 and output a 1 because it carried the 1 from last calculation. Where does it store this 1? In feed forward networks the result is basically:
```
    y = a(w*x + b)
where w = weights of connections to previous layer
and x = activation values of previous layer or inputs
```
How is a recurrent neural network calculated? I am probably wrong but from what I understood, recurrent neural networks are pretty much feedforward neural network with T hidden layers, T being number of timesteps. And each hidden layer takes the X input at timestep T and it's outputs are then added to the next respective hidden layer's inputs.
```
    a(l) = a(w*x + b + pa)
where l = current timestep
and x = value at current timestep
and w = weights of connections to input layer
and pa = past activation values of hidden layer 
    such that neuron i in layer l uses the output value of neuron i in layer l-1

    y = o(w*a(l-1) + b)
where w = weights of connections to last hidden layer
```
But even if I understood this correctly, I don't see the advantage of doing this over simply using past values as inputs to a normal feedforward network (sliding window or whatever it's called).

For example, what is the advantage of using a recurrent neural network for binary addition instead of than training a feedforward network with two output neurons. One for the binary result and the other for the carry? And then take the carry output and plug it back into the feedforward network.

However, I'm not sure how is this different than simply having past values as inputs in a feedforward model.

It seems to me that the more timesteps there are, recurrent neural networks are only a disadvantage over feedforward networks because of vanishing gradient. Which brings me to my second question, from what I understood, LSTM is a solution to the problem of vanishing gradient. But I have no actual grasp of how they work. Furthermore, are they simply better than recurrent neural networks, or are there sacrifices to using a LSTM?