Efficiently write a Pandas dataframe to Google BigQuery
Solution 1
I did the comparison for alternatives 1 and 3 in Datalab using the following code:
from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time
# Dataframe to write
my_data = [(1, 2, 3)]  # use tuples, not sets: sets are unordered, so column order would be undefined
for i in range(0, 100000):
    my_data.append((1, 2, 3))
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])
#Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False)
end = time.time()
print("time alternative 1 " + str(end - start))
#Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'
# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name).create(schema=table_schema, overwrite=True)
# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object
# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
and here are the results for n = {10000,100000,1000000}:
n          alternative_1    alternative_3
10000      30.72s           8.14s
100000     162.43s          70.64s
1000000    1473.57s         688.59s
Judging from the results, alternative 3 (staging via Cloud Storage) is roughly two to four times faster than alternative 1 (to_gbq) at every size tested.
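As a side note, the 100,001-row test DataFrame used above can be built much faster with NumPy than by appending rows in a Python loop; a minimal sketch producing the same shape and column names:

```python
import numpy as np
import pandas as pd

# Repeat the row (1, 2, 3) 100,001 times, matching the append loop above
n_rows = 100_001
data = np.tile([1, 2, 3], (n_rows, 1))
not_so_simple_dataframe = pd.DataFrame(data, columns=['a', 'b', 'c'])
```

This avoids 100,000 list appends and a slow row-by-row DataFrame construction.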
Solution 2
Having also had performance issues with to_gbq(), I just tried the native Google client and it's miles faster (approximately 4x), and if you omit the step where you wait for the result, it's approximately 20x faster.
It's worth noting that best practice would be to wait for the result and check it, but in my case there are extra steps later on that validate the results.
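If you want to reproduce this kind of comparison yourself, a small timing helper is enough to measure each approach; this hypothetical `time_call` (not part of any library) just wraps a call with a wall-clock timer:

```python
import time

def time_call(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), print the elapsed wall-clock time, and return both."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed

# Example with a stand-in workload; pass your load function instead
result, elapsed = time_call("sleep demo", time.sleep, 0.1)
```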
I'm using pandas_gbq version 0.15 (the latest at the time of writing). Try this:
from google.cloud import bigquery
import pandas
df = pandas.DataFrame(
{
'my_string': ['a', 'b', 'c'],
'my_int64': [1, 2, 3],
'my_float64': [4.0, 5.0, 6.0],
'my_timestamp': [
pandas.Timestamp("1998-09-04T16:03:14"),
pandas.Timestamp("2010-09-13T12:03:45"),
pandas.Timestamp("2015-10-02T16:00:00")
],
}
)
client = bigquery.Client()
table_id = 'my_dataset.new_table'
# Since string columns use the "object" dtype, pass in a (partial) schema
# to ensure the correct BigQuery data type.
job_config = bigquery.LoadJobConfig(schema=[
bigquery.SchemaField("my_string", "STRING"),
])
job = client.load_table_from_dataframe(
df, table_id, job_config=job_config
)
# Wait for the load job to complete. (I omit this step)
# job.result()
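The partial schema above is needed because pandas stores strings in the generic "object" dtype (or a string dtype in newer versions), so the client can't always infer the intended BigQuery type from the DataFrame alone. You can inspect the dtypes locally to see this; a minimal sketch using the same columns as above:

```python
import pandas

df = pandas.DataFrame({
    'my_string': ['a', 'b', 'c'],
    'my_int64': [1, 2, 3],
    'my_float64': [4.0, 5.0, 6.0],
})

# Numeric columns carry precise types; the string column does not.
print(df.dtypes)
```

Only the ambiguous columns need to appear in the schema you pass to LoadJobConfig; the rest are inferred.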
Pablo
Updated on December 03, 2021

Comments
- Pablo, over 2 years ago:

I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading the file directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each one of a similar size, so I want to know which is the faster alternative.

This is the script that I'm using:
dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I have tried with several chunk sizes, it runs faster when it's one big chunk (at least for me)
                 if_exists='append',
                 verbose=False)

dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB, this takes almost 2 seconds
# manually upload the file into the GCS GUI

print(dataframe.shape)
(363364, 21)
My question is: which is faster?

- Uploading the Dataframe using the pandas.DataFrame.to_gbq() function
- Saving the Dataframe as CSV and then uploading it as a file to BigQuery using the Python API
- Saving the Dataframe as CSV and then uploading the file to Google Cloud Storage using this procedure and then reading it from BigQuery
Update:
Alternative 1 seems faster than Alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file(), which took 17.9 secs more on average over 3 loops):

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = 'text/csv'
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
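When comparing these alternatives, it helps to separate the CSV serialization cost from the upload cost; you can time the to_csv() step alone by writing to an in-memory buffer. A minimal sketch (the DataFrame here is a stand-in; substitute your real one):

```python
import io
import time

import pandas as pd

# Stand-in DataFrame; replace with the real one to measure its CSV cost
df = pd.DataFrame({'a': range(100_000), 'b': range(100_000)})

start = time.perf_counter()
buf = io.StringIO()
df.to_csv(buf, index=False)
elapsed = time.perf_counter() - start

csv_bytes = len(buf.getvalue().encode('utf-8'))
print(f"CSV size: {csv_bytes / 1e6:.1f} MB, serialized in {elapsed:.2f}s")
```

If serialization is cheap (as the ~2 seconds for 37.3 MB above suggests), the differences between the alternatives come almost entirely from the upload and load-job steps.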