Number of reducers in Hadoop


Solution 1

1 - "The number of reducers is the same as the number of partitions" - False. A single reducer might work on one or more partitions, but a given partition is processed entirely by the reducer on which it was started.

2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster. It also depends very much on the kind of data you are processing (which decides how much heavy lifting the reducers are burdened with).

3 - The mapred-site.xml configuration is just a hint to YARN. Internally, the ResourceManager runs its own scheduling algorithm, optimizing things on the go, so that value is not necessarily the number of reducer tasks that run every time.

4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128 * 5 as the minimum number of reducers. That again is false, I believe.

There is no fixed number of reducer tasks that can be configured or calculated once and for all. It depends on how many resources are actually available to allocate at that moment.

Solution 2

Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition per reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so and produce at least one HDFS block's worth of output. With too many reducers you end up with lots of small files.
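The partitioning described above can be sketched in plain Java. This is a standalone illustration, not Hadoop source code: the `partitionFor` formula matches what Hadoop's default `HashPartitioner` computes, while the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a single map task partitions its output: one bucket
// ("partition") per reduce task, with all records for a given key
// landing in the same bucket.
public class MapOutputPartitioning {

    // Same formula as Hadoop's default HashPartitioner: mask the sign
    // bit so negative hash codes still yield a valid partition index.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group keys into numReduceTasks buckets; duplicate keys always
    // land together, so one reducer sees every value for its keys.
    static List<List<String>> partition(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String k : keys) {
            buckets.get(partitionFor(k, numReduceTasks)).add(k);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("apple", "banana", "apple", "cherry", "banana");
        // Many keys can share a bucket, but each key appears in only one bucket.
        System.out.println(partition(keys, 2));
    }
}
```

Note that the number of buckets is fixed by the number of reduce tasks, not by the number of distinct keys.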

Solution 3

The number of reducers is calculated internally from the size of the data being processed, unless you explicitly set it in the driver program using the API below:

job.setNumReduceTasks(x)

By default, one reducer is used per 1 GB of data.

So if you are working with less than 1 GB of data and do not explicitly set the number of reducers, one reducer is used.

Similarly, if your data is 10 GB, 10 reducers are used.

You can change this configuration as well: instead of 1 GB you can specify a bigger or smaller size.
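Under that rule the reducer count is essentially a ceiling division of the input size by the per-reducer threshold. A minimal standalone sketch, assuming the 1 GB default described above (the class and method names are hypothetical, not a Hive API):

```java
// Sketch of how a reducer count can be estimated from input size:
// one reducer per BYTES_PER_REDUCER of input, rounded up, minimum 1.
public class ReducerEstimate {

    // 1 GB default; in Hive this threshold is configurable via
    // hive.exec.reducers.bytes.per.reducer.
    static final long BYTES_PER_REDUCER = 1L << 30;

    // Ceiling division without floating point.
    static long estimateReducers(long inputBytes) {
        return Math.max(1, (inputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
    }

    public static void main(String[] args) {
        System.out.println(estimateReducers(500L * 1024 * 1024)); // 500 MB -> 1
        System.out.println(estimateReducers(10L << 30));          // 10 GB  -> 10
    }
}
```

Lowering the threshold yields more, smaller reducers; raising it yields fewer, larger ones.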

The property in Hive for setting the data size per reducer is:

hive.exec.reducers.bytes.per.reducer

You can view this property by running the set command in the Hive CLI.

The partitioner only decides which data goes to which reducer.

Solution 4

The partitioner makes sure that the same keys from multiple mappers go to the same reducer. This doesn't mean that the number of partitions is equal to the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, like job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, then it is picked up from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output will go to the same reducer.
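The default of a single reduce task makes the hash routing degenerate: any hash modulo 1 is 0, so every key goes to partition 0. A standalone sketch of that formula (it matches Hadoop's `HashPartitioner`; the class here is illustrative, not the Hadoop source):

```java
// Demonstrates why a default of 1 reduce task sends all mapper
// output to the same reducer.
public class DefaultReducerRouting {

    // Hadoop's HashPartitioner formula: mask the sign bit, then take
    // the remainder modulo the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With the default of 1 reducer, every key maps to partition 0.
        for (Object key : new Object[]{"word", 42, "another"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 1)); // always 0
        }
    }
}
```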

Also, note that the programmer does not have control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.


Author: Mohit Jain

Updated on September 14, 2022

Comments

  • Mohit Jain over 1 year

    While learning Hadoop, I found the number of reducers very confusing:

    1) The number of reducers is the same as the number of partitions.

    2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).

    3) The number of reducers is set by mapred.reduce.tasks.

    4) The number of reducers is closest to: a multiple of the block size, a task time between 5 and 15 minutes, and creating the fewest files possible.

    I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce framework itself?

    How is the number of reducers calculated? Please tell me how to work it out.

  • Mohit Jain almost 8 years
    Thanks for the comment. If there are three partitions and we set the number of reduce tasks to 2, how will the data be divided? Will the data from two partitions go to one reducer and the data from the remaining partition to the other? Also, we can set the input split size, so we can control the number of mappers.
  • gunner87 almost 8 years
    If there are 3 partitions, then the data is already divided and the master will assign the reducers to the 3 partitions. The master receives heartbeat messages from the data nodes containing information about their availability, resources, etc., and uses this information while scheduling. The reducer that gets the 2 partitions will process one partition after the other. More information about the number of reducers and mappers can be found at this link: stackoverflow.com/questions/6885441/…
  • Bemipefe over 5 years
    @gunner87 I believe that if mapred.reduce.tasks is not provided, the default of 1 applies only if all the partitions can fit on a single node. What if the produced partition size exceeds the HDFS free space on a single node?