DISTRIBUTE BY clause in HIVE
Solution 1
The only thing DISTRIBUTE BY (city)
says is that records with the same city
will go to the same reducer. Nothing else.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
A question by the OP:
Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?
For 2 reasons:
In the beginning of hive
DISTRIBUTE BY
,SORT BY
andCLUSTER BY
where used to process data in a way that today is being done automatically (e.g. analytic functions https://oren.lederman.name/?p=32)You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use
DISTRIBUTE BY
+SORT BY
orCLUSTER BY
. WithDISTRIBUTE BY
it is guaranteed that you'll have the whole group in the same reducer. WithSORT BY
that you'll get all the records of a group continuously.
Solution 2
In addition to @Dudu's answer, the Distribute By only distributes the rows among the reducers which is determined from the input size.
The number of reducers to be used for a Hive job will be determined by this property hive.exec.reducers.bytes.per.reducer
which is dependent on the input.
As of Hive 0.14, if the input is < 256MB, only one reducer (one reducer per 256MB of input) will be used unless the number of reducers is overridden by hive.exec.reducers.max
or mapred.reduce.tasks
properties.
Comments
-
User9523 almost 2 years
I am not able to understand what this
DISTRIBUTE BY
clause does in Hive. I know the definition that says, if we haveDISTRIBUTE BY (city)
, this would send each city in a different reducer but I am not getting the same. Let us consider the data as follows:Say we have a table called data with columns username and amount:
+----------+--------+ | username | amount | +----------+--------+ | user_1 | 25 | +----------+--------+ | user_1 | 53 | +----------+--------+ | user_1 | 28 | +----------+--------+ | user_1 | 50 | +----------+--------+ | user_2 | 20 | +----------+--------+ | user_2 | 50 | +----------+--------+ | user_2 | 10 | +----------+--------+ | user_2 | 5 | +----------+--------+
Now If I say -
SELECT username, SUM(amount) FROM data DISTRIBUTE BY (username)
Shouldn't this run 2 separate reducers? It is still running a single reducer and I don't know why. I thought this may have to do with clustering into buckets or partitioning but I tried everything, and it still runs a single reducer. Can anyone explain why?
-
User9523 over 7 yearsAs in if I say there are four cities A,B,C and D , A should go a different reducer , B to different and so on ?
-
User9523 over 7 yearsSo If I want a different reducer for say each (city) , I am supposed to know the number of DISTINCT cities right ?
-
David דודו Markovitz over 7 yearsNo. All A would go to 1 reducer. All B will go to 1 reducer You have no guarantee that these are going to be different reducers.
-
User9523 over 7 yearsAll right so we can say at a time a reducer will contain only (1) kind of city ?
-
David דודו Markovitz over 7 yearsAlso no. Every record that is being read by a mapper is being copied to one of the reducers, decided by Hash function(s) over the distribution value(s), in this case
city
and this take place only after the number of reducers is being decided. -
David דודו Markovitz over 7 yearsNo. It is clear that the number of the reducers must be greater or equal to the number of cities in order to have each city on a different reducer, but nothing guarantees it. It is a hash function and theoretically you can have 10 cities and 100 reducers and still all the cities will be on a single reducer.
-
User9523 over 7 yearsSeriously I don't get it. Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?
-
David דודו Markovitz over 7 yearsFor 2 reasons (1) In the beginning of hive DISTRIBUTE BY, SORT BY and CLUSTER BY where used to process data in a way that today is being done automatically (e.g. analytic functions oren.lederman.name/?p=32) (2) You might want to stream you data through a script (Hive "Transform") and you want your script to process your data in certain groups and order. For that you can use DISTRIBUTE BY + SORT BY or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that you'll have the whole group in the same reducer. With SORT BY that you'll get all the records of a group continuously.
-
User9523 over 7 yearsFinally got it. Thank you very much !
-
Hemakshi Sachdev about 6 yearsHiii @franklinsijo both
hive.exec.reducers.max
andmapred.reduce.tasks
doesnt seem to work. I want to set the no. of reducers to 1, so that all files go into one reducer and get merged as a single one. Since, my table does not have partitions I am not able to useDISTRIBUTE BY
clause to send files of single partition into one reducer. Do you anyway I can set no. of reducers to 1?? -
Hemakshi Sachdev about 6 yearsHey @DavidדודוMarkovitz I used
DISTRIBUTE BY
clause as you said... But what if my tables don't have partitions? Then how can I make sure that my files goes into single reducer? I am trying to address the small files issue usinginsert overwrite
of hive to merge them into bigger one.. For partitioned tabledistribute by
seems to work fine but what when I don't have any partition columns?