How does Hive decide when to use map reduce and when not to?


Solution 1

In general, any aggregation, such as min/max/count, requires a MapReduce job. That rule alone probably won't explain every case, though.

Like many RDBMSs, Hive has an EXPLAIN keyword that outlines how a query is translated into MapReduce jobs. Try running EXPLAIN on both of your example queries to see what Hive does behind the scenes.
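As a minimal sketch of that suggestion, the two queries from the question can be prefixed with EXPLAIN. The table name tablename is the placeholder used in the question, and the plan details noted in the comments are what one would typically expect, not guaranteed output; the exact plan text varies with Hive version, execution engine, and settings.

```sql
-- Plan for the plain scan: typically a single root stage with a
-- Fetch Operator and no Map Reduce stage.
EXPLAIN SELECT * FROM tablename;

-- Plan for the aggregation: typically includes a Map Reduce stage
-- (map and reduce operator trees) followed by a fetch stage.
EXPLAIN SELECT count(*) FROM tablename;
```

EXPLAIN only prints the plan; it does not run the query, so it is cheap to try even on large tables.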

Solution 2

When we run a query like select * from tablename, Hive simply reads the data files and returns all of the data without doing any aggregation (min/max/count etc.). It runs a FetchTask rather than a MapReduce job.

This is also an optimization technique in Hive. The hive.fetch.task.conversion property controls which queries are converted to a FETCH task, which avoids the latency of MapReduce job overhead.
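For illustration, the property can be inspected and changed per session from the Hive CLI or Beeline. The values none, minimal, and more are the documented settings; which one is the default depends on the Hive version, so treat the comments below as a rough guide and not as a statement about your installation.

```sql
-- Print the current value of the property for this session.
SET hive.fetch.task.conversion;

-- none:    disable fetch-task conversion entirely
-- minimal: SELECT *, filters on partition columns, and LIMIT only
-- more:    SELECT, FILTER, and LIMIT in general
SET hive.fetch.task.conversion=more;
```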

It is similar to reading a Hadoop file directly: hadoop fs -cat filename

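But if we use select colNames from tablename, older Hive versions require a MapReduce job, because Hive needs to extract the column from each row by parsing the file it loads. Newer Hive versions can serve such simple projections with a fetch task as well, depending on hive.fetch.task.conversion.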
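Whether a column projection launches a job depends on the Hive version and its settings, so it is worth checking on your own installation. A hedged sketch, where colname is a hypothetical column of the question's placeholder table tablename:

```sql
-- With conversion disabled, even a simple projection is expected
-- to be planned as a cluster job.
SET hive.fetch.task.conversion=none;
EXPLAIN SELECT colname FROM tablename;

-- With the "more" setting, the same projection is expected to be
-- planned as a fetch task, subject to other limits Hive may apply.
SET hive.fetch.task.conversion=more;
EXPLAIN SELECT colname FROM tablename;
```

Comparing the two plans shows directly whether the projection is served by a fetch task on that version.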

Solution 3

select * from tablename;

This just reads the raw data from the files in HDFS, so no MapReduce job is needed and it is much faster.

Author by

Lazer

Updated on July 18, 2022

Comments

  • Lazer
    Lazer almost 2 years

    As a simple example,

    select * from tablename;
    

    DOES NOT kick in map reduce, while

    select count(*) from tablename;
    

    DOES. What is the general principle used to decide when to use map reduce (by hive)?

  • ernesto
    ernesto about 9 years
    But for a large file, it has to read from all the nodes in parallel. Does Hive do that without MR?
  • coderplus
    coderplus almost 6 years
    With newer versions of Hive, the second part isn't true anymore. select column from tablename won't run an MR job with the minimal or more setting of hive.fetch.task.conversion.