How to schedule Hadoop map tasks in a multi-core 8-node cluster?

Solution 1

"mapred.tasktracker.map.tasks.maximum" deals with the number of map tasks that should be launched on each node, not the number of nodes to be used for each map task. In the Hadoop architecture, there is 1 tasktracker for each node (slaves) and 1 job tracker on a master node (master). So if you set the property mapred.tasktracker.map.tasks.maximum, it will only change the number of map tasks to be executed per node. The range of "mapred.tasktracker.map.tasks.maximum" is from 1/2*cores/node to 2*cores/node

The total number of map tasks that you want for the job should be set using setNumMapTasks(int) on the JobConf.
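
For concreteness, here is a minimal sketch of a map-only driver using the classic org.apache.hadoop.mapred API; the class name and the input/output paths taken from args are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MapOnlyDriver.class);
            conf.setJobName("map-only-example");

            // Hint to the jobtracker: desired number of map tasks. The
            // actual count is still derived from the input splits.
            conf.setNumMapTasks(7);

            // Map-only job: disable the reduce phase entirely.
            conf.setNumReduceTasks(0);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

Because the reduce phase is disabled, each mapper writes its output directly to HDFS, which is where the part-000 to part-006 files mentioned in the question come from.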

Solution 2

"Now, my cluster has 8 nodes, each with 8 cores and 8 GB of memory, and a shared filesystem hosted at the head node."

When you say a shared filesystem hosted at the head node, do you mean the data is hosted on HDFS, or on some NFS-like file system mounted on each node? I'm guessing you mean HDFS, but if you're using NFS or something similar, then you should expect to see higher throughput with HDFS (you want to move the processing code to the data, rather than move the data to the processing machine).

How big is your input file, and what are its split size, file format (text, sequence, etc.), replication factor, and compression method?

Depending on the answers to the above questions, with your 8x8 setup you might be able to get better throughput if you reduce the map split size and raise the replication factor.
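
As a sketch of the second knob, the replication factor of an existing input file can be raised from Java through the FileSystem API; the path below is hypothetical, and hadoop fs -setrep does the same thing from the shell:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaiseReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical input path: with more replicas, more nodes hold
            // a local copy of each block, so more map tasks can be
            // scheduled data-locally.
            fs.setReplication(new Path("/user/hadoop/input/data.txt"), (short) 4);
        }
    }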

Solution 3

You should definitely run the 7 map tasks on 7 different nodes, if possible. The whole advantage of MapReduce is to be able to parallelize your computing so that each task runs as efficiently as possible. If you ran 7 map tasks on one node, each task would be competing for the same resources (RAM, CPU, IO) on that single node.

A standard setting for mapred.tasktracker.map.tasks.maximum is one map slot per core, so you could change your setting to 8.
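
Note that this property is read by each tasktracker daemon at startup, so it belongs in mapred-site.xml on every worker node (not in per-job code) and takes effect only after the tasktrackers are restarted. A sketch for the 8-core machines described in the question:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>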

Additionally, if you have a map-only job, you'll need a good reason to set the number of mappers to a particular value. Setting the number of map tasks is just a "hint" to the jobtracker on how many maps to run; the final count is ultimately decided by the jobtracker based on how HDFS stores your input data. The Hadoop wiki has more details.

You do want to control the number of reduce tasks in certain cases, however. For example, if I wanted a list of numbers sorted, I would want to ensure that all my data passed through a single reducer.
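
With the classic JobConf API, that amounts to a single call on the job configuration (conf here is the JobConf from your driver):

    // Force a single reduce task so every key passes through one
    // reducer, producing one globally sorted output file.
    conf.setNumReduceTasks(1);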


Comments

  • justin waugh, almost 2 years ago

    I have a "map only" (no reduce phase) program. The size of input file is large enough to create 7 map tasks and I have verified that by looking the output produced (part-000 to part006) . Now, my cluster has 8 nodes each with 8 cores and 8 GB of memory and shared filesystem hosted at head node.

    My question is: can I choose between running all 7 map tasks on 1 node only, or running the 7 map tasks on 7 different slave nodes (1 task per node)? If I can do so, what changes are needed in my code and configuration file?

    I tried setting the parameter "mapred.tasktracker.map.tasks.maximum" to 1 and to 7 in my code only, but I did not find any appreciable time difference. In my configuration file it is set to 1.