DISTRIBUTE BY clause in HIVE

hive hiveql hadoop2

28,256

Solution 1

The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

A question by the OP:

Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?

For 2 reasons:

In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions https://oren.lederman.name/?p=32)
You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.

Solution 2

In addition to @Dudu's answer, the Distribute By only distributes the rows among the reducers which is determined from the input size.

The number of reducers to be used for a Hive job will be determined by this property hive.exec.reducers.bytes.per.reducer which is dependent on the input.

As of Hive 0.14, if the input is < 256MB, only one reducer (one reducer per 256MB of input) will be used unless the number of reducers is overridden by hive.exec.reducers.max or mapred.reduce.tasks properties.

28,256

Author by

User9523

A Beginner. Mathematics(H) student.

Updated on July 05, 2022

Comments

User9523 almost 2 years
I am not able to understand what this DISTRIBUTE BY clause does in Hive. I know the definition that says, if we have DISTRIBUTE BY (city), this would send each city in a different reducer but I am not getting the same. Let us consider the data as follows:

Say we have a table called data with columns username and amount:
```
+----------+--------+
| username | amount |
+----------+--------+
| user_1   | 25     |
+----------+--------+
| user_1   | 53     |
+----------+--------+
| user_1   | 28     |
+----------+--------+
| user_1   | 50     |
+----------+--------+
| user_2   | 20     |
+----------+--------+
| user_2   | 50     |
+----------+--------+
| user_2   | 10     |
+----------+--------+
| user_2   | 5      |
+----------+--------+
```
Now If I say -
```
SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)
```
Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?
User9523 over 7 years

As in if I say there are four cities A,B,C and D , A should go a different reducer , B to different and so on ?
User9523 over 7 years

So If I want a different reducer for say each (city) , I am supposed to know the number of DISTINCT cities right ?
David דודו Markovitz over 7 years

No. All A would go to 1 reducer. All B will go to 1 reducer You have no guarantee that these are going to be different reducers.
User9523 over 7 years

All right so we can say at a time a reducer will contain only (1) kind of city ?
David דודו Markovitz over 7 years

Also no. Every record that is being read by a mapper is being copied to one of the reducers, decided by Hash function(s) over the distribution value(s), in this case city and this take place only after the number of reducers is being decided.
David דודו Markovitz over 7 years

No. It is clear that the number of the reducers must be greater or equal to the number of cities in order to have each city on a different reducer, but nothing guarantees it. It is a hash function and theoretically you can have 10 cities and 100 reducers and still all the cities will be on a single reducer.
User9523 over 7 years

Seriously I don't get it. Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?
David דודו Markovitz over 7 years

For 2 reasons (1) In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions oren.lederman.name/?p=32) (2) You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.
User9523 over 7 years

Finally got it. Thank you very much !
Hemakshi Sachdev about 6 years

Hiii @franklinsijo both hive.exec.reducers.max and mapred.reduce.tasks doesnt seem to work. I want to set the no. of reducers to 1, so that all files go into one reducer and get merged as a single one. Since, my table does not have partitions I am not able to use DISTRIBUTE BY clause to send files of single partition into one reducer. Do you anyway I can set no. of reducers to 1??
Hemakshi Sachdev about 6 years

Hey @DavidדודוMarkovitz I used DISTRIBUTE BY clause as you said... But what if my tables don't have partitions? Then how can I make sure that my files goes into single reducer? I am trying to address the small files issue using insert overwrite of hive to merge them into bigger one.. For partitioned table distribute by seems to work fine but what when I don't have any partition columns?