Converting pandas.DataFrame to bytes


You can use df.to_records() to convert your DataFrame to a NumPy record array, then call .tobytes() (the modern replacement for the deprecated .tostring()) to serialize it to a byte string:

rec = df.to_records(index=False)

print(repr(rec))
# rec.array([(10, 18446744073709551615, 13240000000.0), (15, 230498234019, 3.14159),
#  (20, 32094812309, 234.1341)], 
#           dtype=[('a', '|u1'), ('b', '<u8'), ('c', '<f8')])

s = rec.tobytes()
rec2 = np.frombuffer(s, dtype=rec.dtype)

print(np.all(rec2 == rec))
# True
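
If you need to reconstruct the frame from those bytes, note that the dtype has to travel separately, since the byte string itself carries no type information. A minimal sketch, assuming the rec.dtype from above is still available:

import numpy as np
import pandas as pd

# Rebuild the DataFrame from the raw bytes; the record dtype must be
# supplied separately because the bytes carry no type information.
arr = np.frombuffer(s, dtype=rec.dtype)
df2 = pd.DataFrame(arr)

print(df2.dtypes)
# a      uint8
# b     uint64
# c    float64
# dtype: object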

Comments

  • Paul Joireman about 2 years

    I need to convert the data stored in a pandas.DataFrame into a byte string where each column can have a separate data type (integer or floating point). Here is a simple set of data:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([10, 15, 20], dtype='u1', columns=['a'])
    df['b'] = np.array([np.iinfo('u8').max, 230498234019, 32094812309], dtype='u8')
    df['c'] = np.array([1.324e10, 3.14159, 234.1341], dtype='f8')
    

    and df looks something like this:

        a                     b             c
    0  10  18446744073709551615  1.324000e+10
    1  15          230498234019  3.141590e+00
    2  20           32094812309  2.341341e+02
    

    The DataFrame knows the types of each column (df.dtypes), so I'd like to do something like this:

    data_to_pack = [tuple(record) for _, record in df.iterrows()]
    data_array = np.array(data_to_pack, dtype=list(zip(df.columns, df.dtypes)))
    data_bytes = data_array.tobytes()
    

    This typically works fine, but in this case it fails due to the maximum value stored in df['b'][0]. The second line above, which converts the array of tuples to an np.array with the given set of types, raises the following error:

    OverflowError: Python int too large to convert to C long
    

    The error results (I believe) from the first line, which extracts each record as a Series with a single data type (defaulting to float64); the float64 representation chosen for the maximum uint64 value is not directly convertible back to uint64.
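
    A minimal sketch of the failing round trip (the float64 nearest to the uint64 maximum is exactly 2**64, which no longer fits in a uint64):

    import numpy as np

    big = np.iinfo('u8').max    # 18446744073709551615 == 2**64 - 1
    as_float = float(big)       # the nearest float64 is exactly 2**64
    print(int(as_float))        # 18446744073709551616
    np.array([int(as_float)], dtype='u8')  # OverflowError: value no longer fits in uint64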

    1) Since the DataFrame already knows the types of each column, is there a way to avoid building a list of tuples as input to the typed numpy.array constructor? Or is there a better way than the one outlined above to preserve the type information in such a conversion?

    2) Is there a way to go directly from a DataFrame to a byte string representing the data, using the type information for each column?

  • user17242583 about 2 years
    While this code may solve the question, including an explanation of how and why this solves the problem would really help to improve the quality of your post. Remember that you are answering the question for readers in the future, not just the person asking now. Please edit your answer to add explanations and give an indication of what limitations and assumptions apply.