Write a Pandas DataFrame to Google Cloud Storage or BigQuery
Solution 1
Try the following working example:
from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd
# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'
# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)
# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()
# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object
# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)
I used this example and the _table.py file from the Datalab GitHub repository as references. You can find other Datalab source code files at this link.
Solution 2
Upload to Google Cloud Storage without writing a temporary file, using only the standard GCS module:
from google.cloud import storage
import os
import pandas as pd
# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
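For anyone who wants to try the in-memory step without touching a real bucket, here is a minimal sketch. The helper name upload_dataframe_as_csv and the FakeBucket / _FakeBlob stand-ins are my own, purely for a dry run; with a real bucket from client.get_bucket(...) the same helper should work, since it only calls blob() and upload_from_string(). It also passes index=False so the DataFrame index is not written as an extra column.

```python
import pandas as pd


def upload_dataframe_as_csv(bucket, df, blob_name):
    """Serialize df to CSV in memory and upload it; no temporary file is written."""
    # to_csv() with no path returns the CSV as a string.
    # index=False leaves the DataFrame index out of the output.
    csv_text = df.to_csv(index=False)
    bucket.blob(blob_name).upload_from_string(csv_text, content_type="text/csv")
    return csv_text


class _FakeBlob:
    """Hypothetical stand-in for google.cloud.storage.Blob (dry run only)."""

    def __init__(self, store, name):
        self._store = store
        self._name = name

    def upload_from_string(self, data, content_type="text/plain"):
        self._store[self._name] = (data, content_type)


class FakeBucket:
    """Hypothetical stand-in for google.cloud.storage.Bucket; keeps uploads in a dict."""

    def __init__(self):
        self.objects = {}

    def blob(self, name):
        return _FakeBlob(self.objects, name)


df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
fake_bucket = FakeBucket()
upload_dataframe_as_csv(fake_bucket, df, "upload_test/test.csv")
```

To use it for real, pass the bucket object returned by client.get_bucket('my-bucket-name') instead of FakeBucket().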
Solution 3
After spending a lot of time on this, the easiest way I found is:
import pandas as pd
df = pd.DataFrame(...)
df.to_csv('gs://bucket/path')
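One thing to watch with to_csv is the index column. The sketch below uses a local temporary file as a stand-in for the gs:// path (writing to a real gs:// URL additionally needs gcsfs installed and working credentials, which this sketch does not assume) and reads both files back to show the difference. The file names are placeholders.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Local stand-ins for a destination such as 'gs://bucket/path'.
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    df.to_csv(with_index)                   # index is written as an unnamed first column
    df.to_csv(without_index, index=False)   # only the data columns are written

    # Read both files back to compare the resulting columns.
    columns_with_index = list(pd.read_csv(with_index).columns)
    columns_without_index = list(pd.read_csv(without_index).columns)
```

With the default settings the file gains an extra unnamed column when read back, so index=False is usually what you want for a plain export.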
Solution 4
Using the Google Cloud Datalab documentation
import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')
Solution 5
Writing a Pandas DataFrame to BigQuery
Update on @Anthonios Partheniou's answer.
The code is a bit different now - as of Nov. 29 2017
To define a BigQuery dataset
Pass a tuple containing project_id and dataset_id to bq.Dataset.
# define a BigQuery dataset
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)
To define a BigQuery table
Pass a tuple containing project_id, dataset_id and the table name to bq.Table.
# define a BigQuery table
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)
Create the dataset and table, then write to the table in BigQuery
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)
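If you are working outside Datalab, the google-cloud-bigquery client has a load_table_from_dataframe method that submits a load job for a DataFrame. A minimal sketch, assuming google-cloud-bigquery and pyarrow are installed and credentials are configured; the helper name load_dataframe_to_bigquery and the table ID shown in the comments are placeholders of mine, not part of the answers above. The client is passed in so the helper itself has no import-time dependency.

```python
def load_dataframe_to_bigquery(client, df, table_id):
    """Load df into table_id ('project_id.dataset_id.table_name') via a load job.

    client is expected to be a google.cloud.bigquery.Client.
    """
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # block until the load job finishes; raises if the job failed
    return job


# Real usage (assumes credentials are set up; names are placeholders):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   load_dataframe_to_bigquery(client, dataFrame_name, "project_id.dataset_id.table_name")
```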
EcoWarrior
Updated on July 09, 2022
Comments
-
EcoWarrior almost 2 years
Hello and thanks for your time and consideration. I am developing a Jupyter Notebook in Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write it to Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have created the following objects with this code:
import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id
bucket_name = 'steve-temp'
bucket_path = bucket_name
bucket = storage.Bucket(bucket_path)
bucket.exists()
I have tried various approaches based on the Google Datalab documentation, but they keep failing. Thanks
-
dartdog about 8 years
Just a note: I believe you need to execute the %%storage commands in a separate cell from the Python code?
-
Anthonios Partheniou about 8 years
It depends on whether you want to execute a line magic or a cell magic command. For cell magic it is %%storage, for line magic it is %storage. It's OK to use line magic commands in the same cell as other code. Cell magic commands must be in a separate cell from other code.
-
dartdog about 8 years
Thanks for the clarification
-
EcoWarrior about 8 years
Thanks very much Anthonios... I was able to successfully create all of the objects (e.g., the table and the schema are in my Project/Dataset in BQ). However, no rows were actually written to the table and no error messages were generated.
-
EcoWarrior about 8 years
A populated table was generated in the Jupyter Notebook after table.Insert_data(out), and this line was at the bottom of that table: (rows: 0, edw-p19090000:ClickADS2.ADS_Logit1)
-
Anthonios Partheniou about 8 years
I found a similar Stack Overflow question related to delayed data. Please check the solution at the following link to see if you are experiencing a similar issue: stackoverflow.com/questions/35656910/…
-
Elona Mishmika over 6 years
The direct conversion from DataFrame to BigQuery is very slow. Is there a faster way?
-
Anthonios Partheniou over 6 years
One possible faster method: write the CSV to Google Cloud Storage first, then use the bq command-line tool to load from GCS to BigQuery. You could also look into using Google Cloud Dataflow.
-
Elona Mishmika over 6 years
I'm doing this now :) However, I found that when I wrote it to GCS, the columns were not separated by commas. Do you have this problem too?
-
pascalwhoop about 5 years
The exists() function doesn't exist for me on 1.11.2 for google-cloud-bigquery in Python.
-
adamc over 4 years
Really appreciate this one for using no other modules and an existing bucket.
-
Amjad Desai about 3 years
If you only want to push the file to a bucket on GCS, then this is a more suitable solution. It can also be used if you want to push out JSON: bucket.blob('upload_test/test.json').upload_from_string(df.to_json(), 'text/json')
-
Nermin almost 3 years
Use df.to_csv(index=False) if you don't want the index as a column in your file.
-
bsplosion over 2 years
This is hilariously simple. Just make sure to also install gcsfs as a prerequisite (though it'll remind you anyway). If you're coming here in 2020 or later, just skip the complexity and do this.
-
Danish Bansal over 2 years
Is there a way to make a saved file publicly accessible directly by passing an argument?
-
Shiv Krishna Jaiswal about 2 years
It is not working. I created an Ubuntu server and ran pip install pandas fsspec gcsfs. I am able to read a CSV file using pd.read_csv(gs://BUCKET_PATH) but not able to write.
-
Shiv Krishna Jaiswal about 2 years
Found the answer to my own question: it is an access issue. See this link