Writing a pickle file to an S3 bucket in AWS


Solution 1

I've found the solution: for pickle files you need to write into a BytesIO buffer instead of a StringIO buffer (which is for CSV files).

import io
import boto3

pickle_buffer = io.BytesIO()
s3_resource = boto3.resource('s3')

new_df.to_pickle(pickle_buffer)
s3_resource.Object(bucket, key).put(Body=pickle_buffer.getvalue())
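
Depending on your pandas version, calling to_pickle() on an in-memory buffer may raise ValueError: Unrecognized compression type: infer or an "I/O operation on closed file" error (see the comments below). As one commenter notes, passing compression=None sidesteps the first; the pickle.dumps approach in Solution 2 sidesteps the second. A minimal sketch, with placeholder bucket/key names and a toy DataFrame standing in for yours:

import io
import boto3
import pandas as pd

bucket = 'mybucket'                      # placeholder bucket name
key = 'path/to/new_df.pkl'               # placeholder object key
new_df = pd.DataFrame({'a': [1, 2, 3]})  # toy DataFrame standing in for yours

pickle_buffer = io.BytesIO()
s3_resource = boto3.resource('s3')

# compression=None avoids the "Unrecognized compression type: infer" error
# that some pandas versions raise when writing to a file-like object.
new_df.to_pickle(pickle_buffer, compression=None)
s3_resource.Object(bucket, key).put(Body=pickle_buffer.getvalue())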

Solution 2

Further to your answer, you don't need to convert to CSV: pickle.dumps returns a bytes object. See https://docs.python.org/3/library/pickle.html

import boto3
import pickle

bucket = 'your_bucket_name'
key = 'your_pickle_filename.pkl'
pickle_byte_obj = pickle.dumps([var1, var2, ..., varn])
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, key).put(Body=pickle_byte_obj)
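
Since pickle.dumps accepts any picklable object, the same pattern covers a DataFrame directly, and Object.get() lets you read it back. A minimal sketch with the same placeholder names and a toy DataFrame:

import boto3
import pickle
import pandas as pd

bucket = 'your_bucket_name'          # placeholder bucket name
key = 'your_pickle_filename.pkl'     # placeholder object key
df = pd.DataFrame({'a': [1, 2, 3]})  # toy DataFrame

s3_resource = boto3.resource('s3')

# A DataFrame is just another picklable object as far as pickle.dumps is concerned.
s3_resource.Object(bucket, key).put(Body=pickle.dumps(df))

# Reading it back: get() returns a dict whose 'Body' is a streaming object.
body = s3_resource.Object(bucket, key).get()['Body'].read()
df_restored = pickle.loads(body)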

Solution 3

This worked for me with pandas 0.23.4 and boto3 1.7.80:

import boto3

bucket = 'your_bucket_name'
key = 'your_pickle_filename.pkl'
s3_resource = boto3.resource('s3')

new_df.to_pickle(key)  # writes the pickle to a local file named after the key
s3_resource.Object(bucket, key).put(Body=open(key, 'rb'))
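
With this approach the pickle is written to local disk first and the file handle passed to put() is never closed explicitly. A sketch that tidies up after itself, using the same placeholder names and a toy DataFrame standing in for new_df:

import os
import boto3
import pandas as pd

bucket = 'your_bucket_name'              # placeholder bucket name
key = 'your_pickle_filename.pkl'         # placeholder object key
new_df = pd.DataFrame({'a': [1, 2, 3]})  # toy DataFrame

s3_resource = boto3.resource('s3')

new_df.to_pickle(key)                    # write the pickle to a local file
with open(key, 'rb') as f:               # the with-block closes the handle
    s3_resource.Object(bucket, key).put(Body=f)
os.remove(key)                           # optional: remove the local copy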

Solution 4

This solution (using s3fs) worked perfectly and elegantly for my team:

import s3fs
from pickle import dump

fs = s3fs.S3FileSystem(anon=False)

bucket = 'bucket1'
key = 'your_pickle_filename.pkl'

dump(data, fs.open(f's3://{bucket}/{key}', 'wb'))
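
One caveat: s3fs buffers writes and only completes the upload when the file object is flushed or closed, so wrapping the open in a with-block is the safer form. A sketch of that, plus reading the pickle back, using the same placeholder names and a toy DataFrame as the data:

import s3fs
import pandas as pd
from pickle import dump, load

fs = s3fs.S3FileSystem(anon=False)

bucket = 'bucket1'                     # placeholder bucket name
key = 'your_pickle_filename.pkl'       # placeholder object key
data = pd.DataFrame({'a': [1, 2, 3]})  # any picklable object

# The context manager flushes and closes the S3 object, completing the upload.
with fs.open(f's3://{bucket}/{key}', 'wb') as f:
    dump(data, f)

# Reading it back the same way.
with fs.open(f's3://{bucket}/{key}', 'rb') as f:
    restored = load(f)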


Author: himi64, software developer for machine learning applications.

Updated on July 27, 2021

Comments

  • himi64 (almost 3 years ago)

    I'm trying to write a pandas DataFrame as a pickle file to an S3 bucket in AWS. I know that I can write the DataFrame new_df as a CSV to an S3 bucket as follows:

    bucket='mybucket'
    key='path'
    
    csv_buffer = StringIO()
    s3_resource = boto3.resource('s3')
    
    new_df.to_csv(csv_buffer, index=False)
    s3_resource.Object(bucket,path).put(Body=csv_buffer.getvalue())
    

    I've tried using the same code as above with to_pickle() but with no success.

  • Sip (over 5 years ago)
    Do you have a suggestion on how to use this with a pandas DataFrame? I tried pickle_byte_obj = df.to_pickle(None).encode(), but it doesn't seem to work.
  • whs2k (about 5 years ago)
    Import s3fs and then you can call df.to_csv('s3://bucket/path/fn.csv') directly.
  • Falc (over 4 years ago)
    I get the error ValueError: Unrecognized compression type: infer when using this code.
  • TheProletariat (almost 4 years ago)
    I get the error ValueError: I/O operation on closed file.
  • TheProletariat (almost 4 years ago)
    I think you mean to put key instead of path, so that it reads s3_resource.Object(bucket, key).put(Body=open(key, 'rb')), right? Also, this worked for me and did not throw an "I/O operation on closed file" error once I replaced 'path'. Thanks!
  • Mehrad Eslami (over 3 years ago)
    If you are getting the ValueError mentioning infer, change the compression to None; it's set to 'infer' by default: df.to_pickle(buffer_pickle, compression=None).
  • Med Zamrik (over 3 years ago)
    Using this method caused a ValueError: I/O operation on closed file. error for me as well; I used buffer = pickle.dumps(df) and then passed buffer as the Body for the S3 put.