Shuffling training data with LSTM RNN


In general, when you shuffle the training data (a set of sequences), you shuffle the order in which sequences are fed to the RNN; you don't shuffle the ordering within individual sequences. This is fine to do when your network is stateless:

Stateless Case:

The network's memory only persists for the duration of a sequence. Training on sequence B before sequence A doesn't matter because the network's memory state does not persist across sequences.

On the other hand:

Stateful Case:

The network's memory persists across sequences. Here, you cannot blindly shuffle your data and expect optimal results. Sequence A should be fed to the network before sequence B because A comes before B, and we want the network to evaluate sequence B with memory of what was in sequence A.
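
A minimal Keras sketch of the two cases may help (everything here, the layer sizes, sequence shapes, and toy data, is an illustrative assumption rather than something from the question; it uses the tf.keras API):

```python
import numpy as np
from tensorflow import keras

# Toy data: 20 independent sequences of 10 timesteps x 1 feature,
# one scalar target per sequence (shapes are illustrative only).
X = np.random.rand(20, 10, 1).astype("float32")
y = np.random.rand(20, 1).astype("float32")

# Stateless case: memory is reset after every sequence, so shuffling
# the ORDER of sequences is safe. shuffle=True permutes whole
# sequences; it never reorders the timesteps inside one.
stateless = keras.Sequential([
    keras.layers.LSTM(8, input_shape=(10, 1)),  # stateless by default
    keras.layers.Dense(1),
])
stateless.compile(optimizer="adam", loss="mse")
stateless.fit(X, y, epochs=2, shuffle=True, verbose=0)

# Stateful case: memory persists across batches, so the sequences are
# fed in their original order (shuffle=False) and the state is cleared
# manually at epoch boundaries.
stateful = keras.Sequential([
    keras.layers.LSTM(8, batch_input_shape=(1, 10, 1), stateful=True),
    keras.layers.Dense(1),
])
stateful.compile(optimizer="adam", loss="mse")
for _ in range(2):
    stateful.fit(X, y, epochs=1, batch_size=1, shuffle=False, verbose=0)
    stateful.reset_states()
```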


Comments

  • hellowill89
    hellowill89 almost 2 years

    Since an LSTM RNN uses previous events to predict current sequences, why do we shuffle the training data? Don't we lose the temporal ordering of the training data? How is it still effective at making predictions after being trained on shuffled training data?

  • hellowill89
    hellowill89 almost 7 years
So are all sequences in a batch in order, or have they all been shuffled at that point?
  • hellowill89
    hellowill89 almost 7 years
    Thanks for the response by the way.
  • Brian Bartoldson
    Brian Bartoldson almost 7 years
    I think it depends on your data. Try taking a look at this discussion of Keras LSTMs and statefulness philipperemy.github.io/keras-stateful-lstm. The author talks about when shuffling matters. Lmk if it doesn't make sense.
  • hellowill89
    hellowill89 almost 7 years
    What is your opinion on whether a stock market predictor is stateful or not?
  • Brian Bartoldson
    Brian Bartoldson almost 7 years
    "Stateful" is a choice. If you believe that the last N observations are all you need to predict a future stock price, then you can train without statefulness and sequence length=N (this means you are free to shuffle your sequences). Otherwise, you should look into using a stateful approach to allow the hidden state to persist across batches. My choice would depend on the frequency of the observations. If I had daily observations, I might believe that I can model the future stock price with sequences of length N=50 days.
  • mspadaccino
    mspadaccino almost 7 years
Excellent answer Brian! I only have some doubts about splitting a shuffled dataset for a stateless LSTM model: in this way, don't we in some sense "contaminate" the test set, since some of its samples will contain, inside the sequence, some "deja-vu" (i.e., already-seen data)? Thanks for your opinion...
  • Brian Bartoldson
    Brian Bartoldson almost 7 years
    Let me see if I understand what you're asking. I'm imagining a setup where we have a training set of 20 sequences that are each 4 elements (stock prices) long, and a test set with 5 sequences. Is your concern that a sequence in the training set might also appear in the test set?
  • mspadaccino
    mspadaccino almost 7 years
Not the whole sequence, but since I suppose the sequences are overlapping, some elements (i.e., stock prices observed on some specific day j) will be present in sequences in both the training and the test sets, at different locations. Starting from your example, the first sequence of the test set will contain a price (at t-3, for example) that also appeared inside the last sequences of the training set (possibly indexed at a location other than t-3 in that sequence). Would this be a problem?
  • Brian Bartoldson
    Brian Bartoldson almost 7 years
    The sequences shouldn't be overlapping. If you have 100 price observations (one price per day for 100 consecutive days) then you can make a dataset of 25 sequences, each sequence being the prices from 4 contiguous/consecutive days. Then, you randomly select 5 sequences, and those 5 become your test data. You train on the other 20 sequences. So there's no data point that's used for both training and testing. It's possible for a price that showed up in the training set to show up in the testing set, but that's not a problem. I can explain why if you want. Lmk if this didn't clear things up.
  • Brian Bartoldson
    Brian Bartoldson almost 7 years
I think I see what you meant now. You want to create more than 25 sequences from your 100 prices by making overlapping sequences (e.g., the first sequence is days 1-4, the second is days 2-5, etc.)? That sounds okay, but do not let sub-sequences (e.g., days 2-4) from your training data appear as sub-sequences in your test data. Your original hunch is right: you would be contaminating the test data by doing that. (A short sketch at the end of this page illustrates both the non-overlapping split and a leakage-free way to build overlapping windows.)
  • MysteryGuy
    MysteryGuy over 5 years
Hi @BrianBartoldson, I appreciate your answer, but something didn't convince me. When you write "Sequence A should be fed to the network before sequence B because A comes before B, and we want the network to evaluate sequence B with memory of what was in sequence A.": actually, if I understand well, the sequences are split between batches, and what is really important is to keep the same order of sequences from one batch to another, but this particular order is not so important in itself (as you seem to suggest). (1/2)
  • MysteryGuy
    MysteryGuy over 5 years
(Following previous post) Look at @Daniel Möller's brilliant answer at stackoverflow.com/questions/38714959/understanding-keras-lstms/… and give me your feedback (2/2)
  • Brian Bartoldson
    Brian Bartoldson over 5 years
    I agree that the answer you link to is very nice, and better illustrated than mine! But the content of our answers is consistent. Please read my answer as if the batch size is 1 sequence, then let me know if it still seems inconsistent with the other answer. E.g.: say we have a character RNN and our data is sequences of 5 characters. If sequence A is "Hello", and sequence B is "Brian", then we probably want stateful (we should feed A as batch 1, and B as batch 2). If the two sequences have nothing to do with each other, we can use stateless and shuffle the order in which we feed them.
  • Yoan B. M.Sc
    Yoan B. M.Sc about 4 years
There's one thing I still don't get. I thought the point of using an RNN was that the state of the previous instance at time "t" does weigh into how the step at "t+1" is processed. If you shuffle the sequence and "t+1" has nothing to do with "t", why use an RNN at all and not a CNN instead? Thank you
  • Brian Bartoldson
    Brian Bartoldson about 4 years
Say you have two sequences A and B: A = [1,2,3]; B = [4,5,6]. When you shuffle the sequences, you're shuffling the collection of sequences (A,B): shuffle 1 = (A,B); shuffle 2 = (B,A). You do not shuffle the items within A or within B; that would seemingly defeat the purpose of the RNN, yes. (A runnable version of this example appears at the end of the page.)
  • Yoan B. M.Sc
    Yoan B. M.Sc over 3 years
@BrianBartoldson a stateless LSTM resets at each batch, not each sequence. So if you shuffle the sequences inside a batch, you do lose the temporal context of each element. Let's say I'm forecasting a stock price using a year of data with a sequence length of a day. How can the model learn long-term relations (on the scale of weeks or months) if the sequences are randomly fed? What would forecasting values for the "next" day mean in this context?
  • Yoan B. M.Sc
    Yoan B. M.Sc over 3 years
@BrianBartoldson: unless these posts are incorrect: stackoverflow.com/questions/39681046/… stackoverflow.com/questions/43882796/…
  • Brian Bartoldson
    Brian Bartoldson about 3 years
    @YoanB.M.Sc: I agree with the post that you linked to and think it agrees with my understanding and response above. Maybe you're noticing that I did not use the word "batch", while the post that you linked to uses that word. Instead of saying that a stateless LSTM resets the states at the end of a batch, I said that it does so at the end of a "sequence". I went on to say that two sequences [A] and [B] could be parts of a larger sequence [A,B] (this could be like two days of price data in your example), in which case the stateful approach would be good for the reasons we each stated.
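
To make the train/test discussion in the comments concrete, here is a small NumPy sketch of the setup Brian describes: 100 daily prices chopped into 25 non-overlapping length-4 sequences, plus the leakage-free way to build overlapping windows from his last point on contamination. The helper name and split fractions are illustrative assumptions.

```python
import numpy as np

prices = np.arange(100.0)  # stand-in for 100 consecutive daily prices

# Non-overlapping windows: 25 sequences of 4 consecutive days each.
sequences = prices.reshape(25, 4)

# Randomly hold out 5 sequences for testing; train on the other 20.
rng = np.random.default_rng(0)
order = rng.permutation(25)
test_seqs, train_seqs = sequences[order[:5]], sequences[order[5:]]
# No (day, price) observation appears in both sets, even if similar
# price VALUES do, which is harmless.

# Overlapping windows (days 1-4, days 2-5, ...): split by TIME first,
# then window within each split, so no sub-sequence leaks across sets.
def windows(series, length=4):
    return np.stack([series[i:i + length]
                     for i in range(len(series) - length + 1)])

train_seqs = windows(prices[:80])  # windows drawn only from days 1-80
test_seqs = windows(prices[80:])   # windows drawn only from days 81-100
```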
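
And a runnable version of the A = [1,2,3], B = [4,5,6] shuffling example from the comments (plain Python; purely illustrative):

```python
import random

A = [1, 2, 3]
B = [4, 5, 6]
dataset = [A, B]

# Shuffling permutes the COLLECTION of sequences, so the result is
# either [A, B] or [B, A]...
random.shuffle(dataset)
# ...but each sequence keeps its internal order, which preserves the
# temporal structure the RNN is meant to learn.
```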