Export BigQuery Data to CSV without using Google Cloud Storage


Solution 1

You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained in the documentation; also check the variants for the different path syntaxes.

Then you can download the files from GCS to your local storage.

The gsutil tool can help you download the files from GCS to your local machine.

You cannot download the data locally in a single step; you first need to export to GCS and then transfer the files to your local machine, as sketched below.
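
For illustration, here is a minimal sketch of those two steps using the google-cloud-bigquery and google-cloud-storage Python clients; the project, dataset, table, and bucket names are placeholders, and gsutil cp would work just as well for the second step.

# Sketch: export a BigQuery table to GCS, then download the shards locally.
# All names below are placeholders; adjust them to your own project and bucket.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="your-project-id")

# Step 1: export the table to a GCS bucket (the wildcard allows sharded output).
extract_job = bq.extract_table(
    "your-project-id.your_dataset.your_table",
    "gs://your-bucket/export/data-*.csv",
)
extract_job.result()  # wait for the export job to finish

# Step 2: download the exported file(s) from GCS to the local machine.
gcs = storage.Client(project="your-project-id")
for blob in gcs.bucket("your-bucket").list_blobs(prefix="export/"):
    blob.download_to_filename(blob.name.split("/")[-1])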

Solution 2

You can download all the data directly (without routing it through Google Cloud Storage) using the paging mechanism. Basically you need to generate a page token for each page, download the data in that page, and iterate until all data has been downloaded, i.e. no more page tokens are available. Here is example code in Java, which hopefully clarifies the idea:

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.JsonFactory;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

/* your class starts here */

private String projectId = ""; /* fill in the project id here */
private String query = ""; /* enter your query here */
private Bigquery bigQuery;
private Job insert;
private TableDataList tableDataList;
private Iterator<TableRow> rowsIterator;
private List<TableRow> rows;
private long maxResults = 100000L; /* max number of rows in a page */

/* run query */
public void open() throws Exception {
    HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();
    JsonFactory jsonFactory = new JacksonFactory();
    GoogleCredential credential = GoogleCredential.getApplicationDefault(transport, jsonFactory);
    if (credential.createScopedRequired())
        credential = credential.createScoped(BigqueryScopes.all());
    bigQuery = new Bigquery.Builder(transport, jsonFactory, credential).setApplicationName("my app").build();

    JobConfigurationQuery queryConfig = new JobConfigurationQuery().setQuery(query);
    JobConfiguration jobConfig = new JobConfiguration().setQuery(queryConfig);
    Job job = new Job().setConfiguration(jobConfig);
    insert = bigQuery.jobs().insert(projectId, job).execute();
    JobReference jobReference = insert.getJobReference();

    while (true) {
        Job poll = bigQuery.jobs().get(projectId, jobReference.getJobId()).execute();
        String state = poll.getStatus().getState();
        if ("DONE".equals(state)) {
            ErrorProto errorResult = poll.getStatus().getErrorResult();
            if (errorResult != null)
                throw new Exception("Error running job: " + poll.getStatus().getErrors().get(0));
            break;
        }
        Thread.sleep(10000);
    }

    tableDataList = getPage();
    rows = tableDataList.getRows();
    rowsIterator = rows != null ? rows.iterator() : null;
}

/* read data row by row */
public /* your data object here */ read() throws Exception {
    if (rowsIterator == null) return null;

    if (!rowsIterator.hasNext()) {
        String pageToken = tableDataList.getPageToken();
        if (pageToken == null) return null;
        tableDataList = getPage(pageToken);
        rows = tableDataList.getRows();
        if (rows == null) return null;
        rowsIterator = rows.iterator();
    }

    TableRow row = rowsIterator.next();
    for (TableCell cell : row.getF()) {
        Object value = cell.getV();
        /* extract the data here */
    }

    /* return the data */
}

private TableDataList getPage() throws IOException {
    return getPage(null);
}

private TableDataList getPage(String pageToken) throws IOException {
    TableReference sourceTable = insert
            .getConfiguration()
            .getQuery()
            .getDestinationTable();
    if (sourceTable == null)
        throw new IllegalArgumentException("Source table not available. Please check the query syntax.");
    return bigQuery.tabledata()
            .list(projectId, sourceTable.getDatasetId(), sourceTable.getTableId())
            .setPageToken(pageToken)
            .setMaxResults(maxResults)
            .execute();
}
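
If you are working from Python instead, the newer google-cloud-bigquery client runs the same page-token loop for you behind the scenes. A rough sketch, where the query and project id are placeholders:

import csv
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")
job = client.query("SELECT * FROM `your-project-id.your_dataset.your_table`")
rows = job.result(page_size=100000)  # results are fetched page by page

with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([field.name for field in rows.schema])
    for row in rows:  # iterating fetches the next page token for you
        writer.writerow(row.values())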

Solution 3

You can run a tabledata.list() operation on the table and set alt=csv, which will return the beginning of the table as CSV.
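
A hedged sketch of what that raw REST call might look like in Python; the alt=csv behavior is as described in this answer, and the project, dataset, and table names are placeholders.

import google.auth
import google.auth.transport.requests
import requests

# Obtain application-default credentials with a read-only BigQuery scope.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery.readonly"])
credentials.refresh(google.auth.transport.requests.Request())

# tabledata.list endpoint; names below are placeholders.
url = ("https://www.googleapis.com/bigquery/v2/projects/your-project-id/"
       "datasets/your_dataset/tables/your_table/data")
resp = requests.get(url, params={"alt": "csv"},
                    headers={"Authorization": "Bearer " + credentials.token})
resp.raise_for_status()

with open("table_head.csv", "w") as f:
    f.write(resp.text)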

Solution 4

If you install the Google BigQuery API and pandas and pandas.io, you can run Python inside a Jupyter notebook, query the BQ Table, and get the data into a local dataframe. From there, you can write it out to CSV.
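
A minimal sketch of that workflow, assuming the pandas-gbq package is installed; the query and project id are placeholders.

import pandas as pd

# Query BigQuery straight into a local DataFrame, then write it out as CSV.
df = pd.read_gbq(
    "SELECT * FROM `your-project-id.your_dataset.your_table`",
    project_id="your-project-id")
df.to_csv("result.csv", index=False)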

Solution 5

Another way to do this is from the BigQuery UI: once the query results have returned, you can select the "Download as CSV" button.

Author: mkjoerg

Updated on August 16, 2020

Comments

  • mkjoerg, over 3 years ago

    I am currently writing software to export large amounts of BigQuery data and store the queried results locally as CSV files. I used Python 3 and the client provided by Google. I did the configuration and authentication, but the problem is that I can't store the data locally. Every time I execute, I get the following error message:

    googleapiclient.errors.HttpError: https://www.googleapis.com/bigquery/v2/projects/round-office-769/jobs?alt=json returned "Invalid extract destination URI 'response/file-name-*.csv'. Must be a valid Google Storage path.">

    This is my Job Configuration:

    def export_table(service, cloud_storage_path,
                     projectId, datasetId, tableId, sqlQuery,
                     export_format="CSV",
                     num_retries=5):

        # Generate a unique job_id so retries
        # don't accidentally duplicate export
        job_data = {
            'jobReference': {
                'projectId': projectId,
                'jobId': str(uuid.uuid4())
            },
            'configuration': {
                'extract': {
                    'sourceTable': {
                        'projectId': projectId,
                        'datasetId': datasetId,
                        'tableId': tableId,
                    },
                    'destinationUris': ['response/file-name-*.csv'],
                    'destinationFormat': export_format
                },
                'query': {
                    'query': sqlQuery,
                }
            }
        }
        return service.jobs().insert(
            projectId=projectId,
            body=job_data).execute(num_retries=num_retries)

    I hoped I could just use a local path instead of Cloud Storage to store the data, but I was wrong.

    So my question is:

    Can I download the queried data locally (or to a local database), or do I have to use Google Cloud Storage?