Hive Query Execution Plan

hadoop hive mapreduce hiveql

11,151

Solution 1

This answers both questions.
Statistics are gathered automatically by defalut and for that a reduce step is needed.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics

hive.stats.autogather

Default Value: true

Added In: Hive 0.7 with HIVE-1361

A flag to gather statistics automatically during the INSERT OVERWRITE command.

Solution 2

Record writing is also a responsibility of the reduce phase. Since you are writing back to a distributed FS (HDFS) ,it would be able to write in parallel by spawning required/specified number of reducers.
"Stats aggregation" is for gathering stats from the table you are writing to. For example, the number of rows in the partition, the column data pattern etc. This data is used to generate a query plan when querying that table.

11,151

Author by

outlier229

Masters Student at Texas A&M University. "My biggest fear is not living up to my full potential"

Updated on June 04, 2022

Comments

outlier229 almost 2 years

Here is my hive query

Insert into schemaB.employee partition(year) 
select * from schemaA.employee;

Below is the Query Execution plan produced by this query.

hive> explain <query>;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: employee
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: Col1 (type: binary), col2 (type: binary), col3 (type: array<string>), year (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col3 (type: int)
                sort order: +
                Map-reduce partition columns: _col3 (type: int)
                Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: binary), _col1 (type: binary), _col2 (type: array<string>), _col3 (type: int)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: true
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: schemaB.employee

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            year 
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: schemaB.employee

  Stage: Stage-2
    Stats-Aggr Operator

I have two questions related to the query execution plan:

Why is there a reduce step in the query plan? In my understanding, all it needs to do is copy the data from one HDFS location to another, which can be achieved by mappers alone. Is the reduce step has something to do with the partitions present in the table?
What is the Stats-Aggr Operator step present in Stage 2? I am not able to find a relevant documentation explaining this.