How do I create padded batches in Tensorflow for tf.train.SequenceExample data using the DataSet API?

Solution 1

You need to pass a tuple of shapes, one shape for each component of a dataset element. In your case you should pass

dataset = dataset.padded_batch(4, padded_shapes=([vectorSize],[None]))

or try

dataset = dataset.padded_batch(4, padded_shapes=([None],[None]))

Check this code for more details. I had to debug this method to figure out why it wasn't working for me.
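To make the "one shape per component" idea concrete, here is a minimal pure-Python sketch of what padding a batch of (inputs, labels) pairs amounts to. It is not TensorFlow code and not how padded_batch is implemented. The helper names pad_to_longest and padded_batch_sketch, the toy data, and the small vector_size are all made up for illustration. It mirrors the shapes the asker later reported as working, ([None, vectorSize], [None]): the time dimension of each component is padded to the longest sequence in the batch, while the vector dimension stays fixed.

```python
import copy


def pad_to_longest(sequences, pad_item):
    """Pad every sequence with copies of pad_item up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [
        list(seq) + [copy.deepcopy(pad_item) for _ in range(longest - len(seq))]
        for seq in sequences
    ]


def padded_batch_sketch(examples, vector_size):
    """Pad a list of (inputs, labels) pairs component by component."""
    inputs = [example[0] for example in examples]
    labels = [example[1] for example in examples]
    # inputs: each time step is a whole vector, so pad with zero vectors.
    padded_inputs = pad_to_longest(inputs, [0.0] * vector_size)
    # labels: each time step is a scalar, so pad with a single zero.
    padded_labels = pad_to_longest(labels, 0)
    return padded_inputs, padded_labels


vector_size = 3  # hypothetical; the question uses vectorSize = 40
examples = [
    ([[1.0, 1.0, 1.0]], [7]),                        # sequence of length 1
    ([[2.0, 2.0, 2.0], [3.0, 3.0, 3.0]], [8, 9]),    # sequence of length 2
]
batch_inputs, batch_labels = padded_batch_sketch(examples, vector_size)
print(batch_labels)
```

Each component gets its own padding rule, which is why a single shape such as [None] cannot describe a two-component element.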

Solution 2

If your current Dataset object contains a tuple, you also need to specify the padded shape of each element of that tuple.

For example, I have a (same_sized_images, labels) dataset in which each label has a different length but the same rank.

def process_label(resized_img, label):
    # Perform some tensor transformations
    # ......

    return resized_img, label

dataset = dataset.map(process_label)
dataset = dataset.padded_batch(batch_size, 
                               padded_shapes=([None, None, 3], 
                                              [None, None]))  # my label has rank 2
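As a rough illustration of how the entries in padded_shapes are interpreted, the sketch below resolves a requested shape against the elements of one batch: a None dimension is padded to the largest size found in the batch, while an integer dimension is kept at that fixed size. This is plain Python, not TensorFlow. The function name resolve_padded_shape and the image shapes are hypothetical, and raising an error when an element exceeds a fixed size is my assumption about sensible behaviour, not a copy of the library's logic.

```python
def resolve_padded_shape(padded_shape, element_shapes):
    """Return the concrete shape every element of one batch is padded to.

    padded_shape:   one entry per dimension, either an int or None.
    element_shapes: concrete shapes (lists of ints), one per batch element.
    """
    resolved = []
    for dim, requested in enumerate(padded_shape):
        largest = max(shape[dim] for shape in element_shapes)
        if requested is None:
            # Unknown dimension: pad to the largest size seen in this batch.
            resolved.append(largest)
        elif largest > requested:
            # Assumed behaviour: a fixed size cannot shrink an element.
            raise ValueError(
                "dimension %d: element size %d exceeds fixed size %d"
                % (dim, largest, requested)
            )
        else:
            # Fixed dimension: every element ends up with exactly this size.
            resolved.append(requested)
    return resolved


# Hypothetical image shapes: height and width vary, channels are always 3.
image_shapes = [[32, 48, 3], [40, 40, 3]]
print(resolve_padded_shape([None, None, 3], image_shapes))  # [40, 48, 3]
```

Writing 3 for the channel dimension simply records that it never varies, whereas None leaves the size to be decided per batch.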

Solution 3

You can also take the padded shapes directly from the dataset's own output shapes:

padded_shapes = dataset.output_shapes
Marijn Huijbregts
Author by

Marijn Huijbregts

Updated on June 08, 2022

Comments

  • Marijn Huijbregts
    Marijn Huijbregts almost 2 years

    For training an LSTM model in Tensorflow, I have structured my data in the tf.train.SequenceExample format and stored it in a TFRecord file. I would now like to use the new DataSet API to generate padded batches for training. In the documentation there is an example for using padded_batch, but for my data I can't figure out what the value of padded_shapes should be.

    To read the TFRecord file into batches, I have written the following Python code:

    import math
    import tensorflow as tf
    import numpy as np
    import struct
    import sys
    import array
    
    if len(sys.argv) != 2:
      print("Usage: createbatches.py [TFRecord file]")
      sys.exit(0)
    
    
    vectorSize = 40
    inFile = sys.argv[1]
    
    def parse_function_dataset(example_proto):
      sequence_features = {
          'inputs': tf.FixedLenSequenceFeature(shape=[vectorSize],
                                               dtype=tf.float32),
          'labels': tf.FixedLenSequenceFeature(shape=[],
                                               dtype=tf.int64)}
    
      _, sequence = tf.parse_single_sequence_example(example_proto, sequence_features=sequence_features)
    
      length = tf.shape(sequence['inputs'])[0]
      return sequence['inputs'], sequence['labels']
    
    sess = tf.InteractiveSession()
    
    filenames = tf.placeholder(tf.string, shape=[None])
    dataset = tf.contrib.data.TFRecordDataset(filenames)
    dataset = dataset.map(parse_function_dataset)
    # dataset = dataset.batch(1)
    dataset = dataset.padded_batch(4, padded_shapes=[None])
    iterator = dataset.make_initializable_iterator()
    
    batch = iterator.get_next()
    
    # Initialize `iterator` with training data.
    training_filenames = [inFile]
    sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
    
    print(sess.run(batch))
    

    The code works well if I use dataset = dataset.batch(1) (no padding needed in that case), but when I use the padded_batch variant, I get the following error:

    TypeError: If shallow structure is a sequence, input must also be a sequence. Input has type: .

    Can you help me figure out what I should pass for the padded_shapes parameter?

    (I know there is lots of example code using threading and queues for this, but I'd rather use the new DataSet API for this project)

  • Marijn Huijbregts
    Marijn Huijbregts over 6 years
    Thanks! That makes a lot of sense. The following worked for my example: padded_shapes=([None,vectorSize],[None]). The first tensor is a list of vectors with dimension vectorSize and the second is a list with integer labels.
  • Conchylicultor
    Conchylicultor about 6 years
    Just as a complement: padded_shapes is sensitive to the type of the nested structure (if the dataset returns a tuple, padded_shapes should be a tuple too, not a list).
  • gary69
    gary69 about 4 years
    Why did you use 3 for third image dimension instead of None?
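The tuple-versus-list point raised in the comments can be sketched in plain Python. My understanding is that the Dataset API treats a tuple as nesting (one entry per component) and a list as a single shape, so padded_shapes has to repeat the tuple structure of the dataset element. The function same_structure below is a hypothetical stand-in for that check, not TensorFlow's implementation, and the strings "inputs" and "labels" only mark where the two components would be.

```python
def same_structure(element, padded_shapes):
    """Check that padded_shapes mirrors the tuple nesting of element.

    Tuples count as structure; anything else, including a list such as
    [None, 40], counts as a single leaf (one shape).
    """
    if isinstance(element, tuple):
        return (
            isinstance(padded_shapes, tuple)
            and len(element) == len(padded_shapes)
            and all(same_structure(e, p) for e, p in zip(element, padded_shapes))
        )
    # element is a leaf, so padded_shapes must be a single shape, not a tuple.
    return not isinstance(padded_shapes, tuple)


element = ("inputs", "labels")  # the dataset yields a 2-tuple

print(same_structure(element, ([None, 40], [None])))   # True: tuple of two shapes
print(same_structure(element, [None]))                 # False: one shape for two components
print(same_structure(element, [[None, 40], [None]]))   # False: a list, not a tuple
```

Under this reading, the question's padded_shapes=[None] fails because it supplies one shape where the element has two components.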