how to do subqueries in bigquery?

13,304

You might want to do something like below (just guess):

SELECT 
  p.subreddit, 
  p.body AS first_body,
  p.score AS first_score,
  CONCAT('t1_',p.id) AS first_id ,
  c.last_body,
  c.last_score,
  c.last_id 
FROM 
[fh-bigquery:reddit_comments.2016_01] p
JOIN (
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
ON  p.link_id = c.parent_id  
WHERE p.score > 1 
AND  p.author != '[deleted]' 
AND p.body != '[deleted]'
LIMIT 100

See more about JOINs

Please note, I just transformed your query to proper use JOINs, but logic of query is still for you to polish as you see needed

Added to address additional info in your comment:

SELECT 
  subreddit, 
  first_body,
  first_score,
  first_id ,
  last_body,
  last_score,
  last_id 
FROM (
  SELECT 
    subreddit, 
    body AS first_body,
    score AS first_score,
    CONCAT('t1_',id) AS first_id 
  FROM [fh-bigquery:reddit_comments.2016_01]
  WHERE score > 1 
  AND author != '[deleted]' 
  AND body != '[deleted]'
) p
JOIN (
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
ON  p.first_id = c.parent_id  
LIMIT 100  
Share:
13,304
jas
Author by

jas

Updated on June 25, 2022

Comments

  • jas
    jas almost 2 years

    Im trying to play with the reddit data on bigquery and I want to see comments and replies in one row. I see bigquery supports subqueries, but I am unable to construct the query. I have to use a subquery to self join the same table because of the structure of the data, specifically i want to join id and parent_id together, but I need to modify id before I can join. Here is how im trying to do the query:

    SELECT 
      p.subreddit, 
      p.body AS first_body,
      p.score AS first_score,
      CONCAT('t1_',p.id) AS first_id ,
      c.last_body,
      c.last_score,
      c.last_id 
    FROM 
    [fh-bigquery:reddit_comments.2016_01] p,
    (
      SELECT 
        body AS last_body,
        score AS last_score,
        CONCAT('t1_',id) AS last_id,
        parent_id,
        author,
        body 
      FROM  [fh-bigquery:reddit_comments.2016_01] 
      WHERE body != '[deleted]' 
      AND author != '[deleted]' 
      AND score > 1
    )  c
    WHERE  p.first_id = c.parent_id  
    AND p.score > 1 
    AND  p.author != '[deleted]' 
    AND p.body != '[deleted]';
    

    The error I get is:

    Field 'c.parent_id' not found in table 'fh-bigquery:reddit_comments.2016_01'; did you mean 'parent_id'?
    

    Here is where you can run the query: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2016_01

    Im not sure how to fix this. what is the proper way to join this and get this query to run?

  • jas
    jas about 8 years
    Mikhail, I couldn't use this style specifically because of the join clause. I need to join: on concat('t1_',p.id) = c.parent_id . the id is missing the "t1_" string in front of it. bigquery does not allow joins for fields that dont exist in the table. So I need to modify the query to use subselect I believe. here is the error I get when I try to use concat: Query Failed Error: ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name. .
  • Mikhail Berlyant
    Mikhail Berlyant about 8 years
    Added to my original answer to reflect your spec for ON clause