How many mappers/reducers should be set when configuring a Hadoop cluster?

Solution 1

There is no formula. It depends on how many cores and how much memory you have. In general, the number of mappers plus the number of reducers should not exceed the number of cores. Keep in mind that the machine is also running the TaskTracker and DataNode daemons. A common suggestion is to run more mappers than reducers. If I were you, I would run one of my typical jobs with a reasonable amount of data and experiment, starting from something like the sketch below.
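For instance, on a classic MRv1 cluster the per-node concurrency is capped by slot counts in mapred-site.xml. A minimal sketch, assuming an 8-core node and an illustrative split of 5 map + 2 reduce slots (leaving a core for the daemons; these are starting values to tune, not recommendations):

    <?xml version="1.0"?>
    <!-- mapred-site.xml on each TaskTracker node (Hadoop 1.x / MRv1) -->
    <configuration>
      <!-- Max map tasks running concurrently on this node;
           5 is an illustrative value for an 8-core box. -->
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>5</value>
      </property>
      <!-- Max reduce tasks running concurrently on this node. -->
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>
      </property>
    </configuration>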

Solution 2

Quoting from "Hadoop: The Definitive Guide, 3rd edition", page 306:

Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization.

The amount of oversubscription depends on the CPU utilization of jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks (counting both map and reduce tasks) than processors.

A processor in the quote above is equivalent to one logical core.

But this is just theory; in practice, every use case is different from the next, so some tests need to be performed. Still, this number is a good starting point.
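As a back-of-the-envelope illustration of the rule of thumb (the node specs here are assumed for the example, not taken from the book):

    physical cores          = 8
    logical cores (HT on)   = 8 x 2 = 16   <- "processors" in the quote
    oversubscription factor = 1x to 2x
    total task slots        = 16 to 32 per node (map + reduce combined)

You would then split that budget between map and reduce slots, leaving some headroom for the TaskTracker and DataNode daemons.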

Comments

  • techlad, almost 2 years ago
    When configuring a Hadoop cluster, what's the scientific method for setting the number of mappers/reducers for the cluster?
  • techlad, about 12 years ago
    Thanks, root. Say we have 48 GB of RAM and 8 cores on each machine, and we reserve 1 GB of RAM for each MapReduce task: 48 GB - 1 GB for the DataNode - 1 GB for the TaskTracker = 46 GB available. In this case, should we have 8 mappers per machine, or should we increase it to, say, 46, considering the case where all reducers start only after the mappers complete?
  • root1982, about 12 years ago
    Most CPUs come with hyper-threading, and it is enabled by default. So if you have 16 CPU threads, you can probably increase the numbers further. I would focus on the number of CPUs. As for memory, even if you are not using all of it, the system can always find a good use for it, such as caching. 1 GB for the daemons is the default; I would monitor the system and consider a higher number. Most of the time, mappers run in parallel with reducers.
  • root1982, about 12 years ago
    Running out of characters... So, if I were you, I would start with 10 mappers and 4 reducers. How many disks do you have? The mappers are going to read in parallel. Do you have multiple disk devices?
  • techlad, about 12 years ago
    Thanks, root. As of now I have only 1 disk, plus a RAID for redundancy. What I notice right now is that even though I allocate 1 GB of RAM for each mapper, when I run the top command I see it occupying around 1.6 GB of virtual memory and around 0.5 GB of resident memory. Is this behaviour normal?
  • root1982, about 12 years ago
    Redundancy on each node is generally not recommended; HDFS has its own redundancy. Say you run 10 mappers: they will all be reading the disk at the same time. For a normal 7200 rpm disk, 2-3 mappers is a good number. For your system, with 48 GB of memory and 16 CPU threads, I/O is likely to be the bottleneck. I suggest you get multiple disks for each node and set them up as JBOD. Regarding the memory issue, I wouldn't worry about it too much: if you specify a 1 GB heap, the process's virtual memory can be larger than 1 GB.
  • root1982, about 12 years ago
    That is very application- and hardware-dependent. If data is aggregated well on the mapper side, less data travels over the network; in that case, if the reducers start too early, they will just sit waiting for data to process. If you have a fast network, it is the same situation. On the other hand, delaying the reducers delays the job. The goal is not to run more mappers, but to get the job to finish faster. (See the configuration sketch after this thread.)
  • SSaikia_JtheRocker, about 12 years ago
    Root: gave you an upvote! I'm not sure, but just to clarify: say we have an 8-core machine without HT, and we run 5 parallel map tasks and 2 parallel reduce tasks, so we have reserved 2 slots for reduce tasks. If we lazily load the reducers, can't those 2 slots be used by map tasks instead, raising the number of parallel map tasks to 7?
  • techlad, about 12 years ago
    JtheRocker: if we set mappers to 5 and reducers to 2, we can't use the reducers' slots; at most 5 mappers can run at any moment.
  • SSaikia_JtheRocker, about 12 years ago
    techlad: got it. So will lazy loading work for you? I'm not sure, though.
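To make the thread concrete, here is a hedged mapred-site.xml sketch of the two knobs discussed above, using MRv1 property names; the values are illustrative assumptions for the 8-core/48 GB node described, not tested recommendations:

    <?xml version="1.0"?>
    <!-- mapred-site.xml: per-task memory and reducer scheduling (MRv1) -->
    <configuration>
      <!-- Heap for each child task JVM. The JVM's virtual size will
           exceed this figure, which matches the top output discussed
           in the comments. -->
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx1024m</value>
      </property>
      <!-- Fraction of map tasks that must complete before reducers are
           scheduled (default 0.05). Raising it "lazily loads" reducers
           so they don't sit idle in the shuffle while maps still run;
           note that in MRv1, map tasks still cannot use reduce slots. -->
      <property>
        <name>mapred.reduce.slowstart.completed.maps</name>
        <value>0.80</value>
      </property>
    </configuration>

Whether delaying the reducers actually helps depends on how much map output must be shuffled over the network, as root1982 notes above.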