Hive Query Execution Plan

11,151

Solution 1

This answers both questions.
Statistics are gathered automatically by defalut and for that a reduce step is needed.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics

hive.stats.autogather

Default Value: true 

Added In: Hive 0.7 with HIVE-1361

A flag to gather statistics automatically during the INSERT OVERWRITE command.

Solution 2

  1. Record writing is also a responsibility of the reduce phase. Since you are writing back to a distributed FS (HDFS) ,it would be able to write in parallel by spawning required/specified number of reducers.
  2. "Stats aggregation" is for gathering stats from the table you are writing to. For example, the number of rows in the partition, the column data pattern etc. This data is used to generate a query plan when querying that table.
Share:
11,151
outlier229
Author by

outlier229

Masters Student at Texas A&M University. "My biggest fear is not living up to my full potential"

Updated on June 04, 2022

Comments

  • outlier229
    outlier229 almost 2 years

    Here is my hive query

    Insert into schemaB.employee partition(year) 
    select * from schemaA.employee;
    

    Below is the Query Execution plan produced by this query.

    hive> explain <query>;
    
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
      Stage-2 depends on stages: Stage-0
    
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
              TableScan
                alias: employee
                Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: Col1 (type: binary), col2 (type: binary), col3 (type: array<string>), year (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3
                  Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col3 (type: int)
                    sort order: +
                    Map-reduce partition columns: _col3 (type: int)
                    Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col0 (type: binary), _col1 (type: binary), _col2 (type: array<string>), _col3 (type: int)
          Reduce Operator Tree:
            Extract
              Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: true
                Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                    output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                    serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                    name: schemaB.employee
    
      Stage: Stage-0
        Move Operator
          tables:
              partition:
                year 
              replace: false
              table:
                  input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                  output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                  serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                  name: schemaB.employee
    
      Stage: Stage-2
        Stats-Aggr Operator
    

    I have two questions related to the query execution plan:

    1. Why is there a reduce step in the query plan? In my understanding, all it needs to do is copy the data from one HDFS location to another, which can be achieved by mappers alone. Is the reduce step has something to do with the partitions present in the table?
    2. What is the Stats-Aggr Operator step present in Stage 2? I am not able to find a relevant documentation explaining this.