Hadoop Hive Query: Multi-join

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

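Hive supports subqueries only in the FROM clause (through Hive 0.12; Hive 0.13 added limited subquery support in the WHERE clause).

You can't use a subquery as a 'column' (that is, inside the SELECT list) in Hive.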

To work around this, move the subquery into the FROM clause and JOIN to it. (The query below won't run as written, but it shows the idea.)

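-- THING, col1 and col2 are placeholders for whatever key and columns you join on and select.
SELECT url, 
       COUNT(url) AS access_url, 
       t2.col1, t2.col2 ...
FROM   aaa_hit
-- Each subquery in FROM needs its own alias (t1, t2) and its own ON condition.
JOIN   (SELECT events.event_id AS evt, 
               COUNT(events.event_id) AS access_evt
        FROM   aaa_event events 
               LEFT OUTER JOIN aaa_hit hits 
                 ON ( events.hit_key = hits.hit_key )
        GROUP  BY events.event_id
        ORDER  BY access_evt DESC LIMIT 1) t1
ON (aaa_hit.THING = t1.THING)
JOIN   (SELECT sessions.remote_address AS remote_address, 
               COUNT(sessions.remote_address) AS access_addr
        FROM   aaa_session sessions 
               RIGHT OUTER JOIN aaa_hit hits 
                 ON ( sessions.session_key = hits.session_key )
        GROUP  BY sessions.remote_address
        ORDER  BY access_addr DESC LIMIT 1) t2
ON (aaa_hit.THING = t2.THING)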

Check out https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins for more information on using JOINs in Hive.
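
For what it's worth, here is one way the whole thing might look using only FROM-clause subqueries. Treat it as an untested sketch, not a definitive answer: it assumes the table and column names from the question, a Hive version with windowing functions (ROW_NUMBER() arrived in Hive 0.11), and that "top 1" means the most frequent value per URL. The aliases u, e, a, event_counts, ranked_events, addr_counts, ranked_addrs and rn are names I made up for illustration.

```sql
-- Sketch only: assumes Hive 0.11+ (ROW_NUMBER) and the schema given in the question.
SELECT u.url                AS url,
       u.num_url            AS num_url,
       e.event_id           AS event_id,
       e.num_event_id       AS num_event_id,
       a.remote_address     AS remote_address,
       a.num_remote_address AS num_remote_address
FROM  (-- number of hits per URL
       SELECT url, COUNT(*) AS num_url
       FROM   aaa_hit
       GROUP  BY url) u
LEFT OUTER JOIN
      (-- most frequent event per URL: count, rank within each URL, keep rank 1
       SELECT url, event_id, num_event_id
       FROM  (SELECT url, event_id, num_event_id,
                     ROW_NUMBER() OVER (PARTITION BY url
                                        ORDER BY num_event_id DESC) AS rn
              FROM  (SELECT h.url       AS url,
                            ev.event_id AS event_id,
                            COUNT(*)    AS num_event_id
                     FROM   aaa_hit h
                            JOIN aaa_event ev
                              ON ( h.hit_key = ev.hit_key )
                     GROUP  BY h.url, ev.event_id) event_counts) ranked_events
       WHERE  rn = 1) e
  ON ( u.url = e.url )
LEFT OUTER JOIN
      (-- most frequent remote address per URL, same pattern
       SELECT url, remote_address, num_remote_address
       FROM  (SELECT url, remote_address, num_remote_address,
                     ROW_NUMBER() OVER (PARTITION BY url
                                        ORDER BY num_remote_address DESC) AS rn
              FROM  (SELECT h.url            AS url,
                            s.remote_address AS remote_address,
                            COUNT(*)         AS num_remote_address
                     FROM   aaa_hit h
                            JOIN aaa_session s
                              ON ( h.session_key = s.session_key )
                     GROUP  BY h.url, s.remote_address) addr_counts) ranked_addrs
       WHERE  rn = 1) a
  ON ( u.url = a.url )
ORDER  BY num_url DESC;
```

To load the result table, the same SELECT can be prefixed with INSERT OVERWRITE TABLE result. Note that ROW_NUMBER() breaks ties arbitrarily, so if two events are equally frequent for a URL, either one may be returned.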

Author by

batman

Updated on February 20, 2020

Comments

  • batman
    batman about 4 years

    How can I do sub-selections in Hive? I think I might be making a really obvious mistake that's not so obvious to me...

    Error I'm receiving: FAILED: Parse Error: line 4:8 cannot recognize input 'SELECT' in expression specification

    Here are my three source tables:

    aaa_hit     -> [SESSION_KEY, HIT_KEY, URL]
    aaa_event   -> [SESSION_KEY, HIT_KEY, EVENT_ID]
    aaa_session -> [SESSION_KEY, REMOTE_ADDRESS]
    

    ...and what I want to do is insert the result into a result table like this:

    result -> [url, num_url, event_id, num_event_id, remote_address, num_remote_address]
    

    ...where column 1 is the URL, column 3 is the top 1 "event" per URL, and column 5 is the top 1 REMOTE_ADDRESS to visit that URL. (Even columns are "count"s of the previous column.)

    Soooooo... what did I do wrong here?

    INSERT OVERWRITE TABLE result2
    SELECT url, 
           COUNT(url) AS access_url, 
           (SELECT events.event_id as evt, 
                   COUNT(events.event_id) as access_evt
            FROM   aaa_event events 
                   LEFT OUTER JOIN aaa_hit hits 
                     ON ( events.hit_key = hit_key )
                     ORDER BY access_evt DESC LIMIT 1), 
           (SELECT sessions.remote_address as remote_address, 
                   COUNT(sessions.remote_address) as access_addr
            FROM   aaa_session sessions 
                   RIGHT OUTER JOIN aaa_hit hits 
                     ON ( sessions.session_key = session_key )
                     ORDER BY access_addr DESC LIMIT 1) 
    FROM   aaa_hit
    ORDER  BY access_url DESC;
    

    Thank you so much :)

  • batman
    batman almost 13 years
    So do I have to make another table, then?
  • batman
    batman almost 13 years
    Good to know I can't do this, but how should I get around it?
  • QuinnG
    QuinnG almost 13 years
    @Travis Powell: added details
  • WhatsThePoint
    WhatsThePoint about 6 years
    While this link may answer the question, link-only answers are discouraged on Stack Overflow. You can improve this answer by copying the vital parts of the linked page into it, so that it still stands on its own if the link changes or is removed :)