Read csv from Google Cloud storage to pandas dataframe
Solution 1
UPDATE
As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the bucket like this:
df = pd.read_csv('gs://bucket/your_path.csv')
read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).
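If you are unsure whether the optional dependency is present, a quick check before calling read_csv can give a clearer message than the exception. This is a minimal sketch using only the standard library; the helper name has_module is made up for illustration.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not has_module("gcsfs"):
    print("gcsfs is not installed; pd.read_csv('gs://...') will fail. Try: pip install gcsfs")
```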
I leave three other options for the sake of completeness.
- Home-made code
- gcsfs
- dask
I will cover them below.
The hard way: do-it-yourself code
I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.
It works equally well on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.
How to use it:
fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
The code:
from io import BytesIO

from google.cloud import storage
from google.oauth2 import service_account


def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    # Rewind so that the caller reads from the start of the data.
    byte_stream.seek(0)
    return byte_stream


def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    # Newer google-cloud-storage releases call this download_as_bytes();
    # download_as_string() is kept there as a deprecated alias.
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    # Without an explicit key file, fall back to the default credentials.
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
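get_bytestring returns raw bytes, so they need to be wrapped (or decoded) before pandas can parse them. A minimal sketch, using a hard-coded byte-string as a stand-in for the downloaded data and assuming the file is UTF-8 encoded:

```python
from io import BytesIO, StringIO

import pandas as pd

# Stand-in for: raw = get_bytestring('my-project', 'my-bucket', 'my-path')
raw = b"name,value\na,1\nb,2\n"

# Option 1: hand the bytes to pandas through an in-memory binary file object.
df_from_bytes = pd.read_csv(BytesIO(raw))

# Option 2: decode to text first (assumes UTF-8), then use a text file object.
df_from_text = pd.read_csv(StringIO(raw.decode("utf-8")))

print(df_from_bytes.equals(df_from_text))
```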
gcsfs
gcsfs is a "Pythonic file-system for Google Cloud Storage".
How to use it:
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
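The same file-system object can also list objects with glob, which is handy when a dataset is split across several CSVs. The sketch below is only an illustration: read_matching_csvs is a hypothetical helper, and it assumes nothing about fs beyond fsspec-style glob() and open() methods, which GCSFileSystem provides.

```python
import pandas as pd


def read_matching_csvs(fs, pattern: str) -> pd.DataFrame:
    """Read every CSV matching `pattern` on `fs` and stack them into one DataFrame.

    `fs` is any object with fsspec-style glob() and open() methods,
    for example gcsfs.GCSFileSystem(project='my-project').
    """
    paths = sorted(fs.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r}")
    frames = []
    for path in paths:
        # Open each object in turn and let pandas parse it.
        with fs.open(path) as f:
            frames.append(pd.read_csv(f))
    # ignore_index=True renumbers the rows across all files.
    return pd.concat(frames, ignore_index=True)
```

With the fs object from above, a call would look like read_matching_csvs(fs, 'bucket/path/*.csv').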
dask
Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.
It has its own read_csv.
How to use it:
import dask.dataframe as dd
df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!
# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()
Solution 2
Another option is to use TensorFlow which comes with the ability to do a streaming read from Google Cloud Storage:
import pandas as pd
from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
    df = pd.read_csv(f)
Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:
Reading wildcard CSV into Pandas
Here is code that will read all CSVs that match a specific pattern (e.g: gs://bucket/some/dir/train-*) into a Pandas dataframe:
import os

import pandas as pd
import tensorflow as tf
from tensorflow.python.lib.io import file_io


def read_csv_file(filename):
    # Stream a single CSV; header=None means the files are read as having no header row.
    with file_io.FileIO(filename, 'r') as f:
        df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df


def read_csv_files(filename_pattern):
    # tf.gfile.Glob is the TensorFlow 1.x name; in TensorFlow 2.x it is tf.io.gfile.glob.
    filenames = tf.gfile.Glob(filename_pattern)
    dataframes = [read_csv_file(filename) for filename in filenames]
    return pd.concat(dataframes)

Usage:
DATADIR = 'gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))
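One detail to be aware of: pd.concat keeps each frame's original row labels, so the combined frame returned by read_csv_files has repeated index values (0, 1, ... for every file). If that matters, ignore_index=True renumbers the rows. A small self-contained illustration, with in-memory frames standing in for the per-file results:

```python
import pandas as pd

# Two small frames standing in for the per-file DataFrames.
part_a = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
part_b = pd.DataFrame({"col1": [3, 4], "col2": ["c", "d"]})

# Default behaviour: each frame keeps its own row labels.
stacked = pd.concat([part_a, part_b])

# ignore_index=True: rows are renumbered from 0 across all frames.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```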
Solution 3
As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.
Until the official release you can try it out with pip install pandas==0.24.0rc1.
Solution 4
Since Pandas 1.2 it's super easy to load files from google storage into a DataFrame.
If you work on your local machine it looks like this:
df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "credentials.json"})
It's important that you pass the credentials.json file from Google as the token.
If you work on Google Cloud do this:
df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "cloud"})
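The two cases can be folded into one small helper. This is only a sketch: gcs_storage_options is a hypothetical name, and the rule "use the key file if one is given, otherwise fall back to the cloud token" is an assumption about your setup rather than anything pandas prescribes.

```python
from typing import Optional


def gcs_storage_options(credentials_path: Optional[str] = None) -> dict:
    """Build the storage_options dict for pd.read_csv on a gcs:// or gs:// URL."""
    if credentials_path:
        # Local machine: point gcsfs at a service-account JSON key file.
        return {"token": credentials_path}
    # Running on Google Cloud: let gcsfs obtain credentials from the environment.
    return {"token": "cloud"}


print(gcs_storage_options("credentials.json"))
print(gcs_storage_options())
```

It would then be passed as storage_options=gcs_storage_options('credentials.json') in the read_csv call.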
Solution 5
I was taking a look at this question and didn't want to have to go through the hassle of installing another library, gcsfs, which literally says in the documentation, "This software is beta, use at your own risk"... but I found a great workaround that I wanted to post here in case this is helpful to anyone else, using just the google.cloud storage library and some native python libraries. Here's the function:
import io
import os

import pandas as pd
from google.cloud import storage

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/creds.json'


def gcp_csv_to_df(bucket_name, source_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # source_file_name is the object's path inside the bucket, e.g. path/to/file.csv
    blob = bucket.blob(source_file_name)
    data = blob.download_as_string()
    df = pd.read_csv(io.BytesIO(data))
    print(f'Pulled down file from bucket {bucket_name}, file name: {source_file_name}')
    return df
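The function takes the bucket name and the object path separately. If you start from a gs:// URL instead, the two parts can be split off with the standard library. A minimal sketch: split_gcs_url is a hypothetical helper, and it assumes the object name contains no '?' or '#' characters, which urlparse would treat as query or fragment separators.

```python
from urllib.parse import urlparse


def split_gcs_url(url: str) -> tuple:
    """Split 'gs://bucket/path/to/file.csv' into ('bucket', 'path/to/file.csv')."""
    parsed = urlparse(url)
    if parsed.scheme not in ("gs", "gcs") or not parsed.netloc:
        raise ValueError(f"Not a GCS URL: {url!r}")
    # netloc is the bucket; the path keeps a leading slash that we drop.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, source_file_name = split_gcs_url("gs://my-bucket/path/to/file.csv")
print(bucket_name, source_file_name)
```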
Further, although it is outside of the scope of this question, if you would like to upload a pandas dataframe to GCP using a similar function, here is the code to do so:
def df_to_gcp_csv(df, dest_bucket_name, dest_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(dest_bucket_name)
    blob = bucket.blob(dest_file_name)
    blob.upload_from_string(df.to_csv(), 'text/csv')
    print(f'DataFrame uploaded to bucket {dest_bucket_name}, file name: {dest_file_name}')
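Note that df.to_csv() writes the row index as an extra first column by default, so a frame uploaded this way and read back with read_csv gains an "Unnamed: 0" column. Passing index=False avoids that. The round trip can be checked locally, with an in-memory string standing in for the uploaded object:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```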
Hope this is helpful! I know I'll be using these functions for sure.
Admin
Updated on September 29, 2021
Comments
-
Admin over 2 years
I am trying to read a csv file present in a Google Cloud Storage bucket into a pandas dataframe.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)
It shows this error message:
FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist
What am I doing wrong? I am not able to find any solution which does not involve Google Datalab.
-
MT467 about 5 years
new version does 0.24.2
-
dkapitan almost 5 years
To add to @LukaszTracewski, I find that fs_gcsfs is more robust than gcsfs. Passing a bucket object to a BytesIO works for me.
-
Lukasz Tracewski over 4 years
@JohnAndrews It's outside the scope of this question, but AFAIK read_excel will nowadays work the same way as read_csv. According to github.com/pandas-dev/pandas/issues/19454, read_* have been implemented.
-
Hansang over 4 years
gcsfs is nice! If connecting to a secured GCS bucket, see gcsfs.readthedocs.io/en/latest/#credentials on how to add your credentials. I have tested this and it works.
-
Akhilesh_IN about 4 years
Thanks. This made BytesIO() more simple, I was downloading to the path and then removing it.
-
Lukasz Tracewski over 3 years
You don't need to import gcsfs, but indeed the gcsfs dependency has to be installed. I edited my answer to make sure it is clear.
-
norman123123 almost 3 years
In the first example the variable source_blob_name would be the path to the file inside the bucket?
-
Lle.4 almost 3 years
Exactly! So it's path/to/file.csv
-
william_grisaitis almost 3 years
pip install pandas>=0.24.0
-
Nilo Araujo almost 2 years
You can also set the GOOGLE_APPLICATION_CREDENTIALS environment variable with the absolute path of your token instead of using storage_options.