Efficiently write a Pandas dataframe to Google BigQuery

36,597

Solution 1

I did the comparison for alternative 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write
my_data = [{1,2,3}]
for i in range(0,100000):
    my_data.append({1,2,3})
not_so_simple_dataframe = pd.DataFrame(data=my_data,columns=['a','b','c'])

#Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable', 
                 Context.default().project_id,
                 chunksize=10000, 
                 if_exists='append',
                 verbose=False
                 )
end = time.time()
print("time alternative 1 " + str(end - start))

#Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))

and here are the results for n = {10000,100000,1000000}:

n       alternative_1  alternative_3
10000   30.72s         8.14s
100000  162.43s        70.64s
1000000 1473.57s       688.59s

Judging from the results, alternative 3 is faster than alternative 1.

Solution 2

Having also had performance issues with to_gbq() I just tried the native google client and it's miles faster (approx 4x), and if you omit the step where you wait for the result, it's approx 20x faster.

Worth noting that best practice would be to wait for the result and check it, but in my case there's extra steps later on that validate the results.

I'm using pandas_gbq version 0.15 (the latest at the time of writing). Try this:

from google.cloud import bigquery
import pandas

df = pandas.DataFrame(
    {
        'my_string': ['a', 'b', 'c'],
        'my_int64': [1, 2, 3],
        'my_float64': [4.0, 5.0, 6.0],
        'my_timestamp': [
            pandas.Timestamp("1998-09-04T16:03:14"),
            pandas.Timestamp("2010-09-13T12:03:45"),
            pandas.Timestamp("2015-10-02T16:00:00")
        ],
    }
)

client = bigquery.Client()
table_id = 'my_dataset.new_table'

# Since string columns use the "object" dtype, pass in a (partial) schema
# to ensure the correct BigQuery data type.
job_config = bigquery.LoadJobConfig(schema=[
    bigquery.SchemaField("my_string", "STRING"),
])

job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)

# Wait for the load job to complete. (I omit this step)
# job.result()
Share:
36,597
Pablo
Author by

Pablo

Updated on December 03, 2021

Comments

  • Pablo
    Pablo over 2 years

    I'm trying to upload a pandas.DataFrame to Google Big Query using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes while uploading directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32) each one with a similar size, so I want to know what is the faster alternative.

    This is the script that I'm using:

    dataframe.to_gbq('my_dataset.my_table', 
                     'my_project_id',
                     chunksize=None, # I have tried with several chunk sizes, it runs faster when it's one big chunk (at least for me)
                     if_exists='append',
                     verbose=False
                     )
    
    dataframe.to_csv(str(month) + '_file.csv') # the file size its 37.3 MB, this takes almost 2 seconds 
    # manually upload the file into GCS GUI
    print(dataframe.shape)
    (363364, 21)
    

    My question is, what is faster?

    1. Upload Dataframe using pandas.DataFrame.to_gbq() function
    2. Saving Dataframe as CSV and then upload it as a file to BigQuery using the Python API
    3. Saving Dataframe as CSV and then upload the file to Google Cloud Storage using this procedure and then reading it from BigQuery

    Update:

    Alternative 1 seems faster than Alternative 2 , (using pd.DataFrame.to_csv() and load_data_from_file() 17.9 secs more in average with 3 loops):

    def load_data_from_file(dataset_id, table_id, source_file_name):
        bigquery_client = bigquery.Client()
        dataset_ref = bigquery_client.dataset(dataset_id)
        table_ref = dataset_ref.table(table_id)
        
        with open(source_file_name, 'rb') as source_file:
            # This example uses CSV, but you can use other formats.
            # See https://cloud.google.com/bigquery/loading-data
            job_config = bigquery.LoadJobConfig()
            job_config.source_format = 'text/csv'
            job_config.autodetect=True
            job = bigquery_client.load_table_from_file(
                source_file, table_ref, job_config=job_config)
    
        job.result()  # Waits for job to complete
    
        print('Loaded {} rows into {}:{}.'.format(
            job.output_rows, dataset_id, table_id))