Write a Pandas DataFrame to Google Cloud Storage or BigQuery

Solution 1

Try the following working example:

from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# DataFrame to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

I used this example, and the _table.py file from the datalab GitHub site, as a reference. You can find other datalab source code files at this link.
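
Given the comment thread below about rows not showing up right away, a quick sanity check is to read the table back after the insert. This is a minimal sketch and assumes the datalab Table class exposes a to_dataframe() method:

# Read the table back to verify the insert landed; recently inserted rows can
# take a short while to become visible.
written_rows = table.to_dataframe()
print(len(written_rows))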

Solution 2

Uploading to Google Cloud Storage without writing a temporary file, using only the standard GCS module:

from google.cloud import storage
import os
import pandas as pd

# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
    
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
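
The same upload_from_string pattern works for other serializations. As one of the comments below points out, a JSON upload looks like this; the object path is purely illustrative:

# Upload the DataFrame as JSON instead of CSV; the blob name is just an example.
bucket.blob('upload_test/test.json').upload_from_string(df.to_json(), 'application/json')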

Solution 3

I spent a lot of time trying to find the easiest way to solve this:

import pandas as pd

df = pd.DataFrame(...)

df.to_csv('gs://bucket/path')
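
As the comments below note, this relies on pandas' gs:// support via the gcsfs package, so gcsfs needs to be installed first. A slightly fuller sketch, with a placeholder bucket and path:

# Requires gcsfs (pip install gcsfs); pandas uses it to resolve gs:// paths.
# index=False leaves the DataFrame index out of the CSV.
df.to_csv('gs://bucket/path/data.csv', index=False)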

Solution 4

Using the Google Cloud Datalab documentation:

import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')

Solution 5

Writing a Pandas DataFrame to BigQuery

An update to @Anthonios Partheniou's answer.
The code is a bit different now, as of Nov. 29, 2017.

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

# define a BigQuery dataset    
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id and the table name to bq.Table.

# define a BigQuery table    
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)

Create the dataset/table and write to the table in BigQuery

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)
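
For the newer google-cloud-bigquery client mentioned in the comments below, a minimal sketch of the same load looks roughly like this; the table identifier is a placeholder and pyarrow must be installed for DataFrame loads:

from google.cloud import bigquery

client = bigquery.Client()

# Load the DataFrame into the destination table; requires pyarrow.
job = client.load_table_from_dataframe(dataFrame_name, 'project_id.dataset_id.table_name')
job.result()  # wait for the load job to complete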

Comments

  • EcoWarrior
    EcoWarrior almost 2 years

    Hello and thanks for your time and consideration. I am developing a Jupyter Notebook on Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:

    import gcp
    import gcp.storage as storage
    project = gcp.Context.default().project_id    
    bucket_name = 'steve-temp'           
    bucket_path  = bucket_name   
    bucket = storage.Bucket(bucket_path)
    bucket.exists()  
    

    I have tried various approaches based on the Google Datalab documentation but continue to fail. Thanks.

  • dartdog
    dartdog about 8 years
    Just a note: I believe you need to execute the %%storage commands in a separate cell from the Python code?
  • Anthonios Partheniou
    Anthonios Partheniou about 8 years
    It depends on whether you want to execute a line magic or cell magic command. For cell magic it is %%storage, for line magic it is %storage. It's ok to use line magic commands in the same cell as other code. Cell magic commands must be in a separate cell from other code
  • dartdog
    dartdog about 8 years
    Thanks for the clarification
  • EcoWarrior
    EcoWarrior about 8 years
    Thanks very much Anthonios... I was able to successfully create all of the objects (e.g., the table and the schema are in my Project/Dataset in BQ). However, no rows were actually written to the table and no error messages were generated.
  • EcoWarrior
    EcoWarrior about 8 years
    A populated table was generated in the Jupyter Notebook after table.Insert_data(out) and this line was at the bottom of that table: (rows: 0, edw-p19090000:ClickADS2.ADS_Logit1)
  • Anthonios Partheniou
    Anthonios Partheniou about 8 years
    I found a similar stackoverflow question related to delayed data. Please check the solution at the following link to see if you are experiencing a similar issue: stackoverflow.com/questions/35656910/…
  • Elona Mishmika
    Elona Mishmika over 6 years
    The direct conversion from a DataFrame to BigQuery is very slow. Is there any faster way?
  • Anthonios Partheniou
    Anthonios Partheniou over 6 years
    One possible faster method: write the CSV to Google Cloud Storage first, then use the command-line bq tool to load from GCS into BigQuery (a rough sketch of this appears after these comments). You could also look into using Google Cloud Dataflow.
  • Elona Mishmika
    Elona Mishmika over 6 years
    I'm doing this now :) However, I found that when I wrote it to GCS, there was no comma separating the columns. Do you have this problem too?
  • pascalwhoop
    pascalwhoop about 5 years
    The exists() function doesn't exist for me on version 1.11.2 of google-cloud-bigquery in Python.
  • adamc
    adamc over 4 years
    Really appreciate this one for using no other modules and an existing bucket.
  • Amjad Desai
    Amjad Desai about 3 years
    If you only want to push the file to a bucket on GCS, then this is the more suitable solution. It can also be used to push out JSON format: bucket.blob('upload_test/test.json').upload_from_string(df.to_json(), 'text/json')
  • Nermin
    Nermin almost 3 years
    Use df.to_csv(index=False) if you don't want the index as a column in your file
  • bsplosion
    bsplosion over 2 years
    This is hilariously simple. Just make sure to also install gcsfs as a prerequisite (though it'll remind you anyway). If you're coming here in 2020 or later, just skip the complexity and do this.
  • Danish Bansal
    Danish Bansal over 2 years
    Is there a way to make a saved file publicly accessible directly by passing an argument?
  • Shiv Krishna Jaiswal
    Shiv Krishna Jaiswal about 2 years
    It is not working for me. I created an Ubuntu server and ran pip install pandas fsspec gcsfs. I am able to read a CSV file using pd.read_csv('gs://BUCKET_PATH') but not able to write.
  • Shiv Krishna Jaiswal
    Shiv Krishna Jaiswal about 2 years
    Got the answer to my own question. It is an access issue. See this link.
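
A rough sketch of the "write to GCS, then load with the bq command-line tool" approach mentioned in the comments above, run from a notebook cell; the bucket, dataset, and table names are placeholders:

# Shell out to the bq CLI from a notebook cell to load the CSV written to GCS
# into a BigQuery table; --autodetect infers the schema from the file.
!bq load --source_format=CSV --autodetect --skip_leading_rows=1 my_dataset.my_table gs://my-bucket/upload_test/test.csv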