Flattening Google Analytics data (with repeated fields) not working anymore

13,877

Solution 1

This is actually the result of a bugfix I submitted last week, and is preventing you from getting incorrect results.

BigQuery by default flattens all query results before returning them, but we only want to flatten one independently repeated field to avoid a cross-product expansion of data. The bug was that our checks for multiple repeated fields failed to take into account repeatedness of parent records in some cases, which caused us to fail to flatten some independently repeated fields. This meant that we could return flat rows where independently repeated values were actually "flattened" into dependently repeated values, instead of generating the cross-product, which is actually a wrong result.

What you're seeing here is a result of the stricter check: you have (at least) two repeated fields in your output schema before we attempt to flatten the results.

Another important thing to note is that the FLATTEN([table-value], [field]) function only flattens the repeatedness of the field you specify as the second argument. When you say flatten([xxxxxxxx.ga_sessions_20140829],hits), you are flattening only the "hits" record. If you also want to flatten its repeated children (product, promotion, etc.) you must explicitly add another flatten for those fields, like:

FLATTEN(FLATTEN([xxxxxxxx.ga_sessions_20140829],hits),hits.product)

--

You have a couple options to make your example work:

1) Select fewer fields. If you only care about getting flattened output of a few fields, you can remove the independently repeated fields from your query results by explicitly selecting only the fields you care about.

2) Add more FLATTENs. You'll need to flatten on each repeated field, which looks to include at least hits, hits.product, and customDimensions. You may find that the error message will then complain about different repeated fields: add more FLATTENs on the repeated fields in your schema until it works.

Solution 2

If you're using the BigQuery Web Console, select a Destination Table, click Allow Large Results and uncheck Flatten Results.

If you're using bq command-line tool:

bq query --allow_large_results --noflatten --destination_table NAME_OF_TABLE  "SELECT * from FLATTEN( [dataset], hits)"

Solution 3

For those who want to import GA data to a relational database: Please be aware that a nested schema is actually putting multiple relational tables into one nested structure - they are equivalent, so plain flattening is not the best solution here: you want to separate the tables again! That way you save a lot of space in storage and you queries might get faster.

You can think of GA data being 3 relational tables:

  • sessions table
  • hits table
  • products table

To get sessions table you can use something like

-- session table 
SELECT
  * EXCEPT(hits,
    customDimensions),
  CONCAT('{',(
    SELECT
      STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
    FROM
      t.customdimensions),'}') as customDimensions
FROM
  `project.dataset.ga_sessions_20171031` AS t

customDimensions is aggregated to a json string.

Similarly with the hits table:

-- hits table
SELECT
 fullvisitorid,
 visitid,
 visitstarttime,
 h.* EXCEPT(product,
 customdimensions, customMetrics, customVariables, promotion, experiment),
 CONCAT('{',(
   SELECT
     STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
   FROM
     h.customdimensions),'}') AS hitsCustomDimensions,
 CONCAT('{',(
   SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
   FROM h.custommetrics),'}') AS hitsCustomMetrics
FROM
 `project.dataset.ga_sessions_20171031` AS t,
 t.hits AS h

You can add promotion and experiment as I did with customDimensions using a sub-select.

Join sessions with hits using a concatenation of fullvisitorid + visitstarttime/visitid (GA sessions include midnight split: use visitStartTime instead of visitid! If you want to ignore midnight split use visitid - it stays the same despite the split)

For products you'd add the hit number to your 'session id' to get a unique identifier:

-- product table
SELECT
  fullvisitorid,
  visitid,
  visitstarttime,
  h.hitNumber,
  p.* EXCEPT(customdimensions, customMetrics),
  CONCAT('{',(
    SELECT STRING_AGG(CONCAT(CAST(index AS string),':"',value,'"') )
    FROM p.customdimensions),'}') AS productsCustomDimensions,
  CONCAT('{',(
    SELECT STRING_AGG(CONCAT(CAST(index AS string),':',cast(value as string)) )
    FROM p.custommetrics),'}') AS productsCustomMetrics
FROM
  `project.dataset.ga_sessions_20171031` AS t,
  t.hits AS h, h.product as p

Now you have 3 relational / "flat" tables you can join as needed in any relational db.

Solution 4

I suspect the tables generated by Google Analytics Premium have been augmented with the addition of new columns. Adding columns shouldn't be a problem, unless your queries use the * selector, and one of the new columns happens to contain nested values.

Recommended solution: Instead of using *, explicitly ask for the columns you need.

Solution 5

You have to use flatten the query before connecting. Simply pick the fields that are mentioned in the error and use the nested flattening:

SELECT * FROM 
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140829],hits),customDimensions_value),hits_product_productSKU),
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140828],hits),customDimensions_value),hits_product_productSKU),
flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140827],hits),customDimensions_value),hits_product_productSKU),    flatten(flatten(flatten([xxxxxxxx.ga_sessions_20140826],hits),customDimensions_value),hits_product_productSKU)
Share:
13,877
M Bitter
Author by

M Bitter

Updated on June 04, 2022

Comments

  • M Bitter
    M Bitter almost 2 years

    We have a premium Google Analytics account, which will give us access to row level event data. This data is exported daily to Google Bigquery and for every day a new table will be created in a data set.

    Until a week ago we were able to export this Google Analytics data to CSV by flattening it to a temp staging table and then export it to CSV. The query we used to do so was like this:

    SELECT * FROM 
        flatten([xxxxxxxx.ga_sessions_20140829],hits),
        flatten([xxxxxxxx.ga_sessions_20140828],hits),
        flatten([xxxxxxxx.ga_sessions_20140827],hits),
        flatten([xxxxxxxx.ga_sessions_20140826],hits)
    

    Yesterday I noticed this query will now throw an error:

    Cannot output multiple independently repeated fields at the same time. Found customDimensions_value and hits_product_productSKU
    

    Apparently something has changed regarding the flatten() function, as hits_product_productSKU is child of the hits field.

    I also tried some old queries which were in the query history, but they're broken as well. There is no release note mentioning any change, so what is happening?

    How can I export everything in the Google Analytics BigQuery export files again?

  • NathanS
    NathanS over 9 years
    Felipe, in my case I am not using GA or any * selections. In the past select specific columns would flatten the nested objects out. The data was loaded from nested JSON. Example query: SELECT [datasource_id] AS [datasource_id], [mentions.mentioned_name] AS [mentions_mentioned_name], [links.title] AS [links_title], [links.url] AS [links_url] FROM [data.posts] [posts] The error I now get is: Cannot output multiple independently repeated fields at the same time. Found mentions_mentioned_name and links_title In the past it would expand the empty columns to nulls.
  • Felipe Hoffa
    Felipe Hoffa over 9 years
    Job id or working (failing) query to investigate please!
  • M Bitter
    M Bitter over 9 years
    How about a flatten_all function? We want to import all the data into our own data warehouse, so nesting 5 or more flatten queries seems like a little bit hacky...
  • Mullefa
    Mullefa over 9 years
    Out of interest, why can't FLATTEN take an arbitrary number of fields to flatten by? e.g. FLATTEN(table, field1, field2)
  • fraxture
    fraxture over 6 years
    This works. It is just very annoying to write out all the flattens. Seems like a better API might be in order.