Hive Query Execution Plan
Solution 1
This answers both questions.
Statistics are gathered automatically by default, and that requires a reduce step. From the Hive configuration documentation:
hive.stats.autogather
Default Value: true
Added In: Hive 0.7 with HIVE-1361
A flag to gather statistics automatically during the INSERT OVERWRITE command.
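If the statistics step is not wanted, autogathering can be switched off for the session. A hedged sketch based on the configuration property quoted above (table and partition names are taken from the question):

```sql
-- Turn off automatic stats gathering for this session;
-- with this set, the extra stats-aggregation stage should not be planned.
SET hive.stats.autogather=false;

-- Statistics can still be computed later, on demand:
ANALYZE TABLE schemaB.employee PARTITION (year) COMPUTE STATISTICS;
```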
Solution 2
- Record writing is also a responsibility of the reduce phase. Since you are writing back to a distributed file system (HDFS), Hive can write in parallel by spawning the required/specified number of reducers.
- "Stats aggregation" gathers statistics about the table you are writing to: for example, the number of rows in each partition, column data patterns, and so on. These statistics are used to generate a query plan when that table is queried later.
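The statistics that the stats-aggregation stage stores can be inspected through the table metadata. A hedged sketch (table name taken from the question; the partition value is hypothetical):

```sql
-- Table-level stats appear under "Table Parameters"
-- (e.g. numRows, rawDataSize, totalSize):
DESCRIBE FORMATTED schemaB.employee;

-- Partition-level stats for one partition
-- (year=2020 is an assumed example value):
DESCRIBE FORMATTED schemaB.employee PARTITION (year=2020);
```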
Author: outlier229
Updated on June 04, 2022

Comments
- outlier229 (almost 2 years ago):
Here is my Hive query:
INSERT INTO schemaB.employee PARTITION (year) SELECT * FROM schemaA.employee;
Below is the query execution plan produced for this query:
hive> explain <query>;
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: employee
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: Col1 (type: binary), col2 (type: binary), col3 (type: array<string>), year (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col3 (type: int)
                sort order: +
                Map-reduce partition columns: _col3 (type: int)
                Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: binary), _col1 (type: binary), _col2 (type: array<string>), _col3 (type: int)
      Reduce Operator Tree:
        Extract
          Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: true
            Statistics: Num rows: 65412411 Data size: 59121649936 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: schemaB.employee

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            year
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: schemaB.employee

  Stage: Stage-2
    Stats-Aggr Operator
I have two questions related to the query execution plan:
- Why is there a reduce step in the query plan? In my understanding, all it needs to do is copy the data from one HDFS location to another, which mappers alone can achieve. Does the reduce step have something to do with the partitions present in the table?
- What is the Stats-Aggr Operator step in Stage-2? I am not able to find any relevant documentation explaining it.