Google Cloud Storage + Python : Any way to list obj in certain folder in GCS?

Solution 1

Update: the answer below applies to the older "Google API Client Library" for Python. Unless you are already tied to that client, prefer the newer "Google Cloud Client Library" for Python (https://googleapis.dev/python/storage/latest/index.html). With the newer library, the equivalent of the code below is:

from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs('bucketname', prefix='abc/myfolder'):
  print(str(blob))

Answer for older client follows.

You may find it easier to work with the JSON API, which has a full-featured Python client. It has a function for listing objects that takes a prefix parameter, which you could use to check for a certain directory and its children in this manner:

import json

from apiclient import discovery

# Auth goes here if necessary: create an authorized http or credentials object
# and pass it to discovery.build() below.
client = discovery.build('storage', 'v1')  # add http=... or credentials=... if auth is needed
request = client.objects().list(
    bucket="mybucket",
    prefix="abc/myfolder")
while request is not None:
  response = request.execute()
  print(json.dumps(response, indent=2))
  request = client.objects().list_next(request, response)
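
If authentication is needed (the "Auth goes here" comment above), one way to build an authorized client is with Application Default Credentials. This is only a sketch, assuming the google-auth package is installed and credentials are available in your environment:

import google.auth
from apiclient import discovery  # 'apiclient' is an alias for 'googleapiclient'

# Assumption: Application Default Credentials are configured, e.g. via
# `gcloud auth application-default login` or a service account key.
credentials, project = google.auth.default()
client = discovery.build('storage', 'v1', credentials=credentials)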

Fuller documentation of the list call is here: https://developers.google.com/storage/docs/json_api/v1/objects/list

And the Google Python API client is documented here: https://code.google.com/p/google-api-python-client/

Solution 2

This worked for me:

from google.cloud import storage

client = storage.Client()
BUCKET_NAME = 'DEMO_BUCKET'
bucket = client.get_bucket(BUCKET_NAME)

blobs = bucket.list_blobs()

for blob in blobs:
    print(blob.name)

The list_blobs() method returns an iterator over the blobs in the bucket. You can iterate over it to access every object in the bucket; in this example I just print out each object's name.
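
For the original question (checking whether abc.txt exists under a particular folder), the same call can be restricted with a prefix. This is a minimal sketch; the bucket name and folder prefix below are placeholders:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('DEMO_BUCKET')  # placeholder bucket name

# Only objects whose names start with the prefix are returned.
blobs = bucket.list_blobs(prefix='abc/myfolder/')

names = [blob.name for blob in blobs]
if 'abc/myfolder/abc.txt' in names:
    print('Found it!')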

The google-cloud-storage client documentation helped me a lot.

I hope this helps!

Solution 3

You might also want to look at gcloud-python and its documentation.

from gcloud import storage
connection = storage.get_connection(project_name, email, private_key_path)
bucket = connection.get_bucket('my-bucket')

for key in bucket:
  if key.name == 'abc.txt':
    print('Found it!')
    break

However, you might be better off just checking if the file exists:

if 'abc.txt' in bucket:
  print('Found it!')
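
The gcloud package shown above has since been superseded by google-cloud-storage. With the current library, a roughly equivalent existence check is Blob.exists(). A minimal sketch, assuming the newer client and a placeholder bucket name:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket('my-bucket')  # placeholder bucket name

# exists() issues a single metadata request for this one object
# instead of listing the whole bucket.
if bucket.blob('abc.txt').exists():
    print('Found it!')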

Solution 4

Install the google-cloud-storage Python package (with pip or through PyCharm) and use the code below:

from google.cloud import storage
client = storage.Client()
for blob in client.list_blobs(BUCKET_NAME, prefix=FOLDER_NAME):
  print(str(blob))

Solution 5

I know this is an old question, but I stumbled on it because I was looking for the exact same answer. The answers from Brandon Yarbrough and Abhijit worked for me, but I wanted to go into more detail.

When you run this:

from google.cloud import storage
storage_client = storage.Client()
blobs = list(storage_client.list_blobs(bucket_name, prefix=PREFIX, fields="items(name)"))

You will get a list of Blob objects, with just the name field populated, for all files under the given bucket and prefix, like this:

[<Blob: BUCKET_NAME, PREFIX, None>, 
 <Blob: xml-BUCKET_NAME, [PREFIX]claim_757325.json, None>, 
 <Blob: xml-BUCKET_NAME, [PREFIX]claim_757390.json, None>,
 ...]

If you are like me and you want to 1) filter out the first item in the list because it does NOT represent a file (it's just the prefix), 2) get just the name string value, and 3) remove the PREFIX from the file name, you can do something like this:

blob_names = [blob.name[len(PREFIX):] for blob in blobs if blob.name != PREFIX]

Complete code to get just the string file names from a storage bucket:

from google.cloud import storage

bucket_name = "my-bucket"   # placeholder: your bucket name
PREFIX = "abc/myfolder/"    # placeholder: the folder prefix to list

storage_client = storage.Client()
blobs = list(storage_client.list_blobs(bucket_name, prefix=PREFIX, fields="items(name)"))
blob_names = [blob.name[len(PREFIX):] for blob in blobs if blob.name != PREFIX]
print(f"blob_names = {blob_names}")

Comments

  • Reed_Xia (almost 2 years ago)

    I'm going to write a Python program to check if a file is in a certain folder of my Google Cloud Storage bucket. The basic idea is to get the list of all objects in a folder, i.e. a list of file names, and then check whether the file abc.txt is in that list.

    The problem is that it looks like Google only provides one way to get the object list, which is uri.get_bucket(); see the code below, which is from https://developers.google.com/storage/docs/gspythonlibrary#listing-objects

    uri = boto.storage_uri(DOGS_BUCKET, GOOGLE_STORAGE)
    for obj in uri.get_bucket():
        print '%s://%s/%s' % (uri.scheme, uri.bucket_name, obj.name)
        print '  "%s"' % obj.get_contents_as_string()
    

    The drawback of uri.get_bucket() is that it appears to fetch all of the objects first, which is not what I want. I just need the object name list for a particular folder (e.g. gs://mybucket/abc/myfolder), which should be much quicker.

    Could someone help answer? Appreciate every answer!

  • Reed_Xia (about 10 years ago)
    Could you advise how to define client? I already import json and apiclient, but it throws NameError: name 'client' is not defined. I checked the docs and did not find this part of the code, thank you!
  • Reed_Xia (about 10 years ago)
    I'm working on Windows 7 and failed to easy_install gcloud; it ends with warning: GMP or MPIR library not found; Not building Crypto.PublicKey._fastmath. error: Setup script exited with error: Unable to find vcvarsall.bat. Could you advise? Thank you!
  • JJ Geewax (about 10 years ago)
    Do you have PyCrypto and all those installed? Windows installers for those are available online I believe.
  • ShanEllis (about 10 years ago)
    Added a bit above with example syntax.
  • John (about 5 years ago)
    And if you want to filter files in a particular folder, use bucket.list_blobs(prefix="path")
  • CpILL (about 4 years ago)
    Is there any way to speed this up? It's slow for millions of blobs
  • Ema Il (over 2 years ago)
    Works great for older versions of the Google API. Thanks