How to store NumPy arrays as TFRecord?

Solution 1

The function _float_feature described in the TensorFlow guide expects a scalar (either float32 or float64) as input.

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

As you can see, the input scalar is wrapped in a list (value=[value]), which is then passed to tf.train.FloatList. tf.train.FloatList expects an iterable that yields one float per element, which the list provides.

If your feature is not a scalar but a vector, _float_feature can be rewritten to pass the iterable directly to tf.train.FloatList (instead of wrapping it in a list first).

def _float_array_feature(value):
  return tf.train.Feature(float_list=tf.train.FloatList(value=value))
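
As a quick check of the 1-D case, here is a minimal sketch. The vector below is hypothetical example data, not from the question. Note that tf.train.FloatList stores 32-bit floats, so float64 input is narrowed when it is written; the values here are chosen so that they survive that conversion exactly.

```python
import numpy as np
import tensorflow as tf


def _float_array_feature(value):
    """Returns a float_list from a 1-D sequence of floats."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


# Hypothetical 1-D example data (exactly representable as float32).
vector = np.array([1.5, 2.5, 3.5], dtype=np.float64)

feature = _float_array_feature(vector)

# Read the stored values back from the protobuf message.
restored = list(feature.float_list.value)
print(restored)
```

Passing a 2-D array to the same function fails, because each element of the iterable is then itself an array rather than a number.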

However, if your feature has two or more dimensions, this solution no longer works. As @mmry described in his answer, flattening the feature or splitting it into several one-dimensional features is a solution in this case. The disadvantage of both approaches is that the information about the feature's actual shape is lost unless you take extra steps to preserve it.

Another way to write an example message for a higher-dimensional array is to convert the array into a byte string and then use the _bytes_feature function described in the TensorFlow guide to write the example message for it. The example message is then serialized and written into a TFRecord file.

import tensorflow as tf
import numpy as np

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))): # if value is a tensor
        value = value.numpy() # get the tensor's value as a byte string
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def serialize_array(array):
  array = tf.io.serialize_tensor(array)
  return array


#----------------------------------------------------------------------------------
# Create example data
array_blueprint = np.arange(4, dtype='float64').reshape(2,2)
arrays = [array_blueprint+1, array_blueprint+2, array_blueprint+3]

#----------------------------------------------------------------------------------
# Write TFRecord file
file_path = 'data.tfrecords'
with tf.io.TFRecordWriter(file_path) as writer:
  for array in arrays:
    serialized_array = serialize_array(array)
    feature = {'b_feature': _bytes_feature(serialized_array)}
    example_message = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example_message.SerializeToString())

The serialized example messages stored in the TFRecord file can be accessed via tf.data.TFRecordDataset. After the example messages have been parsed, the original array needs to be restored from the byte string it was converted to. This is possible via tf.io.parse_tensor.

# Read TFRecord file
def _parse_tfr_element(element):
  parse_dic = {
    'b_feature': tf.io.FixedLenFeature([], tf.string), # Note that it is tf.string, not tf.float32
    }
  example_message = tf.io.parse_single_example(element, parse_dic)

  b_feature = example_message['b_feature'] # get byte string
  feature = tf.io.parse_tensor(b_feature, out_type=tf.float64) # restore 2D array from byte string
  return feature


tfr_dataset = tf.data.TFRecordDataset('data.tfrecords') 
for serialized_instance in tfr_dataset:
  print(serialized_instance) # print serialized example messages

dataset = tfr_dataset.map(_parse_tfr_element)
for instance in dataset:
  print()
  print(instance) # print parsed example messages with restored arrays
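
Applied to the question's setting, the same approach might look like the sketch below. It assumes c2d and c3d are float64 arrays of shape (2,10) and (3,10); the arrays here are hypothetical stand-ins filled with np.arange, and the file is written to a temporary directory. Each array gets its own bytes feature, and both are restored with tf.io.parse_tensor.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):  # unwrap eager tensors
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


# Hypothetical stand-ins for the question's c2d and c3d arrays.
c2d = np.arange(20, dtype=np.float64).reshape(2, 10)
c3d = np.arange(30, dtype=np.float64).reshape(3, 10)

# Write one example that holds both arrays as serialized tensors.
path = os.path.join(tempfile.mkdtemp(), 'coords.tfrecords')
with tf.io.TFRecordWriter(path) as writer:
    feature = {
        'train/coord2d': _bytes_feature(tf.io.serialize_tensor(c2d)),
        'train/coord3d': _bytes_feature(tf.io.serialize_tensor(c3d)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    writer.write(example.SerializeToString())


def _parse(element):
    """Parses one serialized example and restores both arrays."""
    spec = {
        'train/coord2d': tf.io.FixedLenFeature([], tf.string),
        'train/coord3d': tf.io.FixedLenFeature([], tf.string),
    }
    parsed = tf.io.parse_single_example(element, spec)
    coord2d = tf.io.parse_tensor(parsed['train/coord2d'], out_type=tf.float64)
    coord3d = tf.io.parse_tensor(parsed['train/coord3d'], out_type=tf.float64)
    return coord2d, coord3d


dataset = tf.data.TFRecordDataset(path).map(_parse)
restored_2d, restored_3d = next(iter(dataset))
print(restored_2d.shape, restored_3d.shape)
```

The out_type passed to tf.io.parse_tensor has to match the dtype the array had when it was serialized (float64 here).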

Solution 2

The tf.train.Feature class only supports lists (or 1-D arrays) when using the float_list argument. Depending on your data, you might try one of the following approaches:

  1. Flatten the data in your array before passing it to tf.train.Feature:

    def _floats_feature(value):
      return tf.train.Feature(float_list=tf.train.FloatList(value=value.reshape(-1)))
    

    Note that you might need to add another feature to indicate how this data should be reshaped when you parse it again (and you could use an int64_list feature for that purpose).

  2. Split the multidimensional feature into multiple 1-D features. For example, if c2d contains an N * 2 array of x- and y-coordinates, you could split that feature into separate train/coord2d/x and train/coord2d/y features, each containing the x- and y-coordinate data, respectively.
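
The first approach, including the extra shape feature, could be sketched as follows. The feature names 'values' and 'shape' and the example array are hypothetical, and the sketch assumes the number of dimensions (two) is known when parsing.

```python
import numpy as np
import tensorflow as tf

# Hypothetical 2-D example data (exactly representable as float32).
array = np.arange(6, dtype=np.float64).reshape(2, 3)

# Store the flattened values plus the original shape.
feature = {
    'values': tf.train.Feature(
        float_list=tf.train.FloatList(value=array.reshape(-1).tolist())),
    'shape': tf.train.Feature(
        int64_list=tf.train.Int64List(value=list(array.shape))),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))
serialized = example.SerializeToString()

# Parse: the values have variable length, the shape has two entries.
spec = {
    'values': tf.io.VarLenFeature(tf.float32),
    'shape': tf.io.FixedLenFeature([2], tf.int64),
}
parsed = tf.io.parse_single_example(serialized, spec)

# VarLenFeature yields a SparseTensor; densify it, then restore the shape.
flat = tf.sparse.to_dense(parsed['values'])
restored = tf.reshape(flat, parsed['shape'])
print(restored)
```

Because float_list holds 32-bit floats, the restored tensor is float32 even though the input array was float64.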

Solution 3

The TFRecord documentation recommends using tf.io.serialize_tensor:

TFRecord and tf.train.Example

Note: To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use tf.io.serialize_tensor to convert tensors to binary-strings. Strings are scalars in tensorflow. Use tf.io.parse_tensor to convert the binary-string back to a tensor.

Two lines of code do the trick for me:

tensor = tf.convert_to_tensor(array)
result = tf.io.serialize_tensor(tensor)
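
For completeness, a minimal round trip with tf.io.parse_tensor might look like this. The array is hypothetical example data; the serialized result is a scalar string tensor that can be stored in a bytes_list feature.

```python
import numpy as np
import tensorflow as tf

# Hypothetical example data.
array = np.arange(6, dtype=np.float64).reshape(2, 3)

tensor = tf.convert_to_tensor(array)
result = tf.io.serialize_tensor(tensor)  # scalar tf.string tensor

# out_type must match the dtype of the serialized tensor.
restored = tf.io.parse_tensor(result, out_type=tf.float64)
print(restored)
```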
Author by

csbk

Updated on June 07, 2022

Comments

  • csbk
    csbk almost 2 years

    I am trying to create a dataset in tfrecord format from numpy arrays. I am trying to store 2d and 3d coordinates.

    2d coordinates are numpy arrays of shape (2,10) and type float64. 3d coordinates are numpy arrays of shape (3,10) and type float64.

    This is my code:

    def _floats_feature(value):
        return tf.train.Feature(float_list=tf.train.FloatList(value=value))
    
    
    train_filename = 'train.tfrecords'  # address to save the TFRecords file
    writer = tf.python_io.TFRecordWriter(train_filename)
    
    
    for c in range(0,1000):
    
        #get 2d and 3d coordinates and save in c2d and c3d
    
        feature = {'train/coord2d': _floats_feature(c2d),
                       'train/coord3d': _floats_feature(c3d)}
        sample = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(sample.SerializeToString())
    
    writer.close()
    

    When I run this, I get the error:

      feature = {'train/coord2d': _floats_feature(c2d),
      File "genData.py", line 19, in _floats_feature
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))
      File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\python_message.py", line 510, in init
    copy.extend(field_value)
      File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\containers.py", line 275, in extend
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
      File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\containers.py", line 275, in <listcomp>
    new_values = [self._type_checker.CheckValue(elem) for elem in elem_seq_iter]
      File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\google\protobuf\internal\type_checkers.py", line 109, in CheckValue
    raise TypeError(message)
    TypeError: array([-163.685,  240.818, -114.05 , -518.554,  107.968,  427.184,
        157.418, -161.798,   87.102,  406.318]) has type <class 'numpy.ndarray'>, but expected one of: ((<class 'numbers.Real'>,),)
    

    I don't know how to fix this. Should I store the features as int64 or bytes? I have no clue how to go about this since I am completely new to TensorFlow. Any help would be great! Thanks.