doubts regarding batch size and time steps in RNN


Solution 1

Batch size pertains to the number of training samples considered at a time when updating your network weights. So, in a feedforward network, if you want to update your network weights based on gradients computed from one word at a time, your batch_size = 1. As the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it also makes training very erratic.

To understand what happens during the training of such a feedforward network, I'll refer you to this very nice visual example of single-batch versus mini-batch versus single-sample training.
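To make the batch_size idea concrete, here is a toy sketch (the linear model and the helper name `sgd_epoch` are my own invention, not from any library) showing that batch_size only controls how many samples go into each gradient update:

```python
import numpy as np

def sgd_epoch(w, xs, ys, batch_size, lr=0.1):
    """One epoch of SGD on the toy model y = w * x with squared error,
    updating w once per batch of `batch_size` samples."""
    n_updates = 0
    for start in range(0, len(xs), batch_size):
        xb = xs[start:start + batch_size]
        yb = ys[start:start + batch_size]
        # Gradient of mean squared error wrt w, averaged over the batch.
        grad = np.mean(2 * (w * xb - yb) * xb)
        w -= lr * grad
        n_updates += 1
    return w, n_updates

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs  # the true weight is 3

# batch_size=1: four cheap but noisy updates per epoch;
# batch_size=4: one smoother update per epoch.
w1, u1 = sgd_epoch(0.0, xs, ys, batch_size=1)
w4, u4 = sgd_epoch(0.0, xs, ys, batch_size=4)
print(u1, u4)  # 4 updates vs 1 update
```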

However, you want to understand what happens with your num_steps variable. This is not the same as your batch_size. As you might have noticed, so far I have referred to feedforward networks. In a feedforward network, the output is determined solely by the current network inputs, and the input-output relation is given by the learned network mapping:

hidden_activations(t) = f(input(t))

output(t) = g(hidden_activations(t)) = g(f(input(t)))
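A minimal numeric sketch of these two equations, with f and g taken to be single dense layers (the weight values W and V are arbitrary, purely for illustration):

```python
import numpy as np

# Hypothetical weights, chosen only to make the example concrete.
W = np.array([[1.0, -1.0]])   # input -> hidden
V = np.array([[0.5], [0.5]])  # hidden -> output

def f(x):
    return np.tanh(x @ W)     # hidden_activations(t) = f(input(t))

def g(h):
    return h @ V              # output(t) = g(f(input(t)))

x_t = np.array([[2.0]])
hidden_activations = f(x_t)
output = g(hidden_activations)
# The output depends only on the current input x_t: feeding the same
# x_t always yields the same output, with no memory of earlier inputs.
```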

After a training pass of size batch_size, the gradient of your loss function with respect to each of the network parameters is computed and your weights updated.

In a recurrent neural network (RNN), however, your network functions a tad differently:

hidden_activations(t) = f(input(t), hidden_activations(t-1))

output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))

= g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = ... = g(f(input(t), f(input(t-1), ..., f(input(0), hidden_initial_state))))
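A toy sketch of this recurrence (the scalar weights Wx and Wh are made up for illustration):

```python
import numpy as np

def f(x, h_prev, Wx=0.5, Wh=0.8):
    # Each hidden state mixes the current input with the previous state.
    return np.tanh(Wx * x + Wh * h_prev)

def g(h):
    return 2.0 * h  # toy readout

inputs = [1.0, 0.0, 0.0, 0.0]
h = 0.0  # hidden_initial_state
for x in inputs:
    h = f(x, h)
out = g(h)
# Even though the last three inputs are zero, `out` is nonzero: the
# network still "remembers" the first input through the hidden state.
```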

As you might have surmised from the naming, the network retains a memory of its previous state, and the neuron activations now also depend on the previous network state and, by extension, on every state the network has ever been in. Most RNNs employ a forgetfulness factor to attach more importance to recent network states, but that is beside the point of your question.

Then, as you might surmise, it is computationally very, very expensive to calculate the gradients of the loss function with respect to the network parameters if you have to backpropagate through all states since the creation of your network. There is a neat little trick to speed up the computation: approximate your gradients using only the most recent num_steps historical network states.
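A sketch of that trick, commonly called truncated backpropagation through time: the long sequence is cut into windows of num_steps states, and gradients are only propagated within a window (the helper name `truncate` is my own):

```python
def truncate(sequence, num_steps):
    """Split a long sequence into chunks of length num_steps; each chunk
    becomes one training window for backpropagation through time."""
    return [sequence[i:i + num_steps]
            for i in range(0, len(sequence), num_steps)]

corpus = list(range(10))  # a toy "sequence" of 10 tokens
windows = truncate(corpus, num_steps=4)
print(windows)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```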

If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.

Solution 2

I found this diagram which helped me visualize the data structure.

Data Structure

From the image, 'batch size' is the number of sequence examples you train your RNN with in that batch. 'Values per timestep' are your inputs (in my case, my RNN takes 6 inputs), and finally, your time steps are the 'length', so to speak, of the sequence you're training on.
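The shape in the diagram can be sketched as a 3-D array of shape (batch_size, time_steps, values_per_timestep). The sizes here (a batch of 3 sequences, 5 time steps, 6 inputs each, matching the "6 inputs" mentioned above) are just for illustration:

```python
import numpy as np

batch_size, time_steps, n_features = 3, 5, 6
batch = np.zeros((batch_size, time_steps, n_features))

print(batch.shape)        # (3, 5, 6)
print(batch[0].shape)     # one sequence: (time_steps, n_features)
print(batch[:, 0].shape)  # all inputs at t=0: (batch_size, n_features)
```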

I'm also learning about recurrent neural nets and how to prepare batches for one of my projects (and stumbled upon this thread trying to figure it out).

Batching for feedforward and recurrent nets is slightly different, and when looking at different forums, the terminology for both gets thrown around and becomes really confusing, so visualizing it is extremely helpful.

Hope this helps.

Solution 3

  1. RNN's "batch size" is there to speed up computation (as there are multiple lanes in parallel computation units); it's not a mini-batch for backpropagation. An easy way to see this is to play with different batch size values: an RNN cell with batch size = 4 might be roughly 4 times faster than one with batch size = 1, and their losses are usually very close.

  2. As to RNN's "time steps", let's look into the following code snippets from rnn.py. static_rnn() calls the cell once for each input_, and BasicRNNCell.call() implements the forward-pass logic. In a text prediction case, say batch size = 8, we can think of input_ here as 8 words from different sentences in a big text corpus, not 8 consecutive words in one sentence. In my experience, we decide the value of time steps based on how deeply we would like to model the "time" or "sequential dependency". Again, to predict the next word in a text corpus with BasicRNNCell, a small number of time steps might work. A large time step size, on the other hand, might suffer from the exploding gradient problem.

    def static_rnn(cell,
                   inputs,
                   initial_state=None,
                   dtype=None,
                   sequence_length=None,
                   scope=None):
        """Creates a recurrent neural network specified by RNNCell `cell`.

        The simplest form of RNN network generated is:

            state = cell.zero_state(...)
            outputs = []
            for input_ in inputs:
                output, state = cell(input_, state)
                outputs.append(output)   # collect the output of every step
            return (outputs, state)
        """

    class BasicRNNCell(_LayerRNNCell):
        def call(self, inputs, state):
            """Most basic RNN:
               output = new_state = act(W * input + U * state + B).
            """
            gate_inputs = math_ops.matmul(
                array_ops.concat([inputs, state], 1), self._kernel)
            gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
            output = self._activation(gate_inputs)
            return output, output
    
  3. To visualize how these two parameters are related to the data set and the weights, Erik Hallström's post is worth reading. From this diagram and the above code snippets, it's obvious that RNN's "batch size" will not affect the weights (wa, wb, and b), but "time steps" does. So, one could decide RNN's "time steps" based on their problem and network model, and RNN's "batch size" based on the computation platform and data set.
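A numpy sketch of the point in item 3, mimicking the BasicRNNCell math above: the kernel has shape (input_size + hidden_size, hidden_size), so it depends on the input and hidden sizes but not on the batch size (the sizes here are made up for illustration):

```python
import numpy as np

input_size, hidden_size = 6, 4
kernel = np.zeros((input_size + hidden_size, hidden_size))  # W and U stacked
bias = np.zeros(hidden_size)

def cell(inputs, state):
    # act(W * input + U * state + B), with [input, state] concatenated
    gate_inputs = np.concatenate([inputs, state], axis=1) @ kernel + bias
    output = np.tanh(gate_inputs)
    return output, output

# The same kernel serves any batch size; only the activations' leading
# dimension changes, never the weight shapes.
for batch_size in (1, 4, 32):
    x = np.zeros((batch_size, input_size))
    h = np.zeros((batch_size, hidden_size))
    out, _ = cell(x, h)
    print(out.shape)  # (batch_size, hidden_size)
```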

Author: derek

Updated on June 07, 2022

Comments

  • derek (almost 2 years ago)

    In Tensorflow's tutorial on RNNs, https://www.tensorflow.org/tutorials/recurrent, it mentions two parameters: batch size and time steps. I am confused by the concepts. In my opinion, RNN introduces batches because the to-train sequence can be very long, such that backpropagation cannot compute over that whole length (exploding/vanishing gradients). So we divide the long to-train sequence into shorter sequences, each of which is a mini-batch and whose size is called "batch size". Am I right here?

    Regarding time steps, an RNN consists of only a cell (an LSTM or GRU cell, or another cell), and this cell is sequential. We can understand the sequential concept by unrolling it. But unrolling a sequential cell is a concept, not literal, meaning we do not implement it in an unrolled way. Suppose the to-train sequence is a text corpus. Then we feed one word at a time to the RNN cell and then update the weights. So why do we have time steps here? Combining this with my understanding of the above "batch size", I am even more confused. Do we feed the cell one word or multiple words (batch size)?