How to get list of folders in a given bucket using Google Cloud API


Solution 1

You can use the Python GCS API Client Library. See the Samples and Libraries for Google Cloud Storage documentation page for relevant links to documentation and downloads.

In your case, I first want to point out that you're using the term "bucket" incorrectly. I recommend reading the Key Terms page of the documentation. What you're actually talking about are object name prefixes.

You can start with the list-objects.py sample on GitHub. Looking at the list reference page, you'll want to pass bucket=abc, prefix=xyz/ and delimiter=/.
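
For reference, here is a minimal sketch of that call (my own illustration, not part of the linked sample), using the example bucket abc and prefix xyz/ from the question:

import googleapiclient.discovery

# Build the Cloud Storage JSON API client and issue an objects.list request.
service = googleapiclient.discovery.build('storage', 'v1')
resp = service.objects().list(bucket='abc', prefix='xyz/', delimiter='/').execute()

# With a delimiter set, the "sub-folders" come back under the 'prefixes' key.
print(resp.get('prefixes', []))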

Solution 2

This question is about listing the folders inside a bucket/folder. None of the suggestions worked for me, and after experimenting with the google.cloud.storage SDK, I suspect it is not possible (as of November 2019) to list the sub-directories of an arbitrary path in a bucket through the SDK alone. It is possible with the REST API, so I wrote this little wrapper...

from google.api_core import page_iterator
from google.cloud import storage

def _item_to_value(iterator, item):
    # Return each 'prefixes' entry from the API response unchanged.
    return item

def list_directories(bucket_name, prefix):
    # Make sure the prefix ends with '/' so only entries below it match.
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    extra_params = {
        "projection": "noAcl",
        "prefix": prefix,
        "delimiter": '/'
    }

    gcs = storage.Client()

    # objects.list endpoint for this bucket in the JSON API.
    path = "/b/" + bucket_name + "/o"

    # Page through the raw JSON API response, collecting the 'prefixes' key
    # (the "sub-directories") rather than the objects themselves.
    iterator = page_iterator.HTTPIterator(
        client=gcs,
        api_request=gcs._connection.api_request,
        path=path,
        items_key='prefixes',
        item_to_value=_item_to_value,
        extra_params=extra_params,
    )

    return [x for x in iterator]

For example, if you have my-bucket containing:

  • dog-bark
    • datasets
      • v1
      • v2

Then calling list_directories('my-bucket', 'dog-bark/datasets') will return:

['dog-bark/datasets/v1', 'dog-bark/datasets/v2']

Solution 3

Here's an update to this answer thread:

from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# Get GCS bucket
bucket = storage_client.get_bucket(bucket_name)

# Get blobs in bucket (including all subdirectories)
blobs_all = list(bucket.list_blobs())

# Get blobs in a specific subdirectory
blobs_specific = list(bucket.list_blobs(prefix='path/to/subfolder/'))
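
Note that the snippet above lists blobs, not folders. If you want the sub-folders themselves with the same client, you can pass a delimiter and read the iterator's prefixes once its pages have been consumed; a rough sketch (my addition, reusing the hypothetical 'path/to/subfolder/' prefix):

# Sketch: list only the "sub-folders" under a prefix with google-cloud-storage.
iterator = bucket.list_blobs(prefix='path/to/subfolder/', delimiter='/')
for page in iterator.pages:
    pass  # consume the pages so iterator.prefixes gets populated
print(sorted(iterator.prefixes))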

Solution 4

To get a list of folders in a bucket, you can use the code snippet below:

import googleapiclient.discovery


def list_sub_directories(bucket_name, prefix):
    """Returns a list of sub-directories within the given bucket."""
    service = googleapiclient.discovery.build('storage', 'v1')

    req = service.objects().list(bucket=bucket_name, prefix=prefix, delimiter='/')
    res = req.execute()
    return res.get('prefixes', [])  # 'prefixes' is absent when there are no sub-directories

# For the example (gs://abc/xyz), bucket_name is 'abc' and the prefix would be 'xyz/'
print(list_sub_directories(bucket_name='abc', prefix='xyz/'))

Solution 5

I also need to simply list the contents of a bucket. Ideally I would like something similar to what tf.gfile provides. tf.gfile has support for determining if an entry is a file or a directory.

I tried the various links provided by @jterrace above, but my results were not optimal. With that said, it's worth showing the results.

Given a bucket with a mix of "directories" and "files", it's hard to navigate the "filesystem" to find items of interest. I've provided some comments in the code on how the code referenced above works.

In either case, I am using a datalab notebook with credentials included by the notebook. Given the results, I will need to use string parsing to determine which files are in a particular directory. If anyone knows how to expand these methods or an alternate method to parse the directories similar to tf.gfile, please reply.
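
As a rough illustration of the string parsing I mean (my own sketch, assuming '/'-separated object names like those printed by the methods below):

# Hypothetical helper: derive the immediate "sub-directories" of a prefix
# from a flat list of object names.
def sub_dirs(names, prefix):
    if prefix and not prefix.endswith('/'):
        prefix += '/'
    dirs = set()
    for name in names:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if '/' in rest:
                dirs.add(prefix + rest.split('/', 1)[0] + '/')
    return sorted(dirs)

# e.g. sub_dirs(['UrbanSound/data/dog_bark/100032.csv'], 'UrbanSound/data')
#      -> ['UrbanSound/data/dog_bark/']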

Method One

import sys
import json
import argparse
import googleapiclient.discovery

BUCKET = 'bucket-sounds' 

def create_service():
    return googleapiclient.discovery.build('storage', 'v1')


def list_bucket(bucket):
    """Returns a list of metadata of the objects within the given bucket."""
    service = create_service()

    # Create a request to objects.list to retrieve a list of objects.
    fields_to_return = 'nextPageToken,items(name,size,contentType,metadata(my-key))'
    #req = service.objects().list(bucket=bucket, fields=fields_to_return)  # returns everything
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound')  # returns everything. UrbanSound is top dir in bucket
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREE') # returns the file FREESOUNDCREDITS.TXT
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREESOUNDCREDITS.txt', delimiter='/') # same as above
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark', delimiter='/') # returns nothing
    req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark/', delimiter='/') # returns files in dog_bark dir

    all_objects = []
    # If you have too many items to list in one request, list_next() will
    # automatically handle paging with the pageToken.
    while req:
        resp = req.execute()
        all_objects.extend(resp.get('items', []))
        req = service.objects().list_next(req, resp)
    return all_objects

# usage
print(json.dumps(list_bucket(BUCKET), indent=2))

This generates results like this:

[
  {
    "contentType": "text/csv", 
    "name": "UrbanSound/data/dog_bark/100032.csv", 
    "size": "29"
  }, 
  {
    "contentType": "application/json", 
    "name": "UrbanSound/data/dog_bark/100032.json", 
    "size": "1858"
  }
  ... (remaining items snipped)
]

Method Two

import sys
from google.cloud import storage

BUCKET = 'bucket-sounds'

# Create a Cloud Storage client.
gcs = storage.Client()

def my_list_bucket(bucket_name, limit=sys.maxsize):
    # Print object names until the limit is reached.
    a_bucket = gcs.lookup_bucket(bucket_name)
    bucket_iterator = a_bucket.list_blobs()
    for resource in bucket_iterator:
        print(resource.name)
        limit = limit - 1
        if limit <= 0:
            break

my_list_bucket(BUCKET, limit=5)

This generates output like this:

UrbanSound/FREESOUNDCREDITS.txt
UrbanSound/UrbanSound_README.txt
UrbanSound/data/air_conditioner/100852.csv
UrbanSound/data/air_conditioner/100852.json
UrbanSound/data/air_conditioner/100852.mp3
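
A small aside (my note, not from the original answer): list_blobs also accepts a max_results parameter, so the manual limit counter above can be avoided:

# Sketch: let the API cap the listing instead of counting down a limit by hand.
for blob in gcs.lookup_bucket(BUCKET).list_blobs(max_results=5):
    print(blob.name)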

Comments

  • Shamshad Alam
    Shamshad Alam over 2 years

    I wanted to get all the folders inside a given Google Cloud bucket or folder using Google Cloud Storage API.

    For example, if gs://abc/xyz contains three folders gs://abc/xyz/x1, gs://abc/xyz/x2 and gs://abc/xyz/x3, the API should return all three folders under gs://abc/xyz.

    It can easily be done using gsutil

    gsutil ls gs://abc/xyz

    But I need to do it using Python and the Google Cloud Storage API.

  • Shamshad Alam
    Shamshad Alam about 8 years
    Well, when we call objects().list() with prefix and delimiter, we get a list of matching objects AND matching prefixes. As @jterrace answered, if we pass prefix=abc/xyz with delimiter=/ we get all objects whose names start with abc/xyz, as well as prefixes that can logically be considered subfolders.
  • Maelstorm
    Maelstorm over 5 years
    This is the real answer! Thanks
  • RNHTTR
    RNHTTR over 4 years
    Brilliant. I'm going to edit your answer to replace the first few instances of "path" with "prefix", so as not to confuse it with the path passed to the HTTPIterator.
  • RNHTTR
    RNHTTR over 4 years
    While this might work for listing objects, this question is about listing subfolders, and this does not do that. @AntPhitlok's answer is correct.
  • Ekaba Bisong
    Ekaba Bisong over 4 years
    @RNHTTR. You're right :) Leaving it here for posterity's sake.
  • Robino
    Robino almost 4 years
    OP asks for behaviour like "gsutil ls ...", which lists items in a folder. Your code lists all items in all subfolders, recursively. For a large folder structure you could get seriously more than you bargained for!
  • Robino
    Robino almost 4 years
    OP did ask to use the google.cloud.storage api...
  • Robino
    Robino almost 4 years
    Looks like a bit of a hack, using the "private" member _connection. There's a much easier/safer method using list_blobs(..).
  • Robino
    Robino almost 4 years
    This gets all items from "dir" and all items from all subfolders, recursively. OP was asking about folders/items just at the folder level (non-recursive).
  • Robino
    Robino almost 4 years
    You don't say what values we need to use for prefix or delimiter. Can you add those to your answer please?
  • Robino
    Robino almost 4 years
    OP does not want to use gsutil!
  • Robino
    Robino almost 4 years
    I don't follow this answer. If the "url" is gs://abc/xyz then the bucket will be abc. If you also pass the bucket name in with the prefixes you are probably not going to match anything, and certainly not what you want.
  • jterrace
    jterrace almost 4 years
    @Robino you're right - I messed that up. Updated the answer.
  • PeNpeL
    PeNpeL almost 4 years
    @Robino I added an example. The prefix is used to list only files and folders that start with the prefix. It's most useful when you only want to list the files and folders of a specific directory. The important thing is that the prefix should end with '/'. The delimiter, on the other hand, helps separate the files from the folders within said directory, and as I have written, I used '/' as the delimiter.
  • Never_Give_Up
    Never_Give_Up over 3 years
    Is there a way to execute the list operation like gsutil -l -a gs://bucket-name/*
  • PeNpeL
    PeNpeL over 3 years
    I haven't tested it, but according to the documentation (cloud.google.com/storage/docs/listing-objects), you can use the -r flag like so: gsutil ls -r gs://BUCKET-NAME/PREFIX**
  • Rui Yang
    Rui Yang over 3 years
    Tried with the latest google-cloud-storage (1.35.1); blobs.prefixes always returns an empty set for me, even though there are prefixes ending with /.
  • PeNpeL
    PeNpeL over 3 years
    @RuiYang In order to help you please share a code snippet of what you tried, and if possible, explain the folder structure and what directory you want to list.
  • Yet Another User
    Yet Another User about 3 years
    I've added a bit to the conditional on the prefix thing to resolve a bug where you couldn't list the root of the bucket with a prefix of '' :) other than that this works perfectly for me, thanks for posting it!
  • Robino
    Robino over 2 years
    @RuiYang I am getting the same issue on 1.42 and 1.43 (tested on OSX and Linux).
  • Robino
    Robino over 2 years
    For buckets with a massive number of files this will take a massive amount of time, even if it only contains one subfolder.
  • Robino
    Robino over 2 years
    This cycles through every file path in the bucket. For massive buckets this will take a massive amount of time. GCP also charges you per lookup, so watch out!
  • Phillip Maire
    Phillip Maire over 2 years
    Thanks for the heads up. To avoid this, would I use something similar to your answer with max_results=1, like so: blobs = list(bucket.list_blobs(max_results=1, prefix=base_folder))?
  • Boorhin
    Boorhin over 2 years
    This is not working on the latest version, 1.43.