When do reduce tasks start in Hadoop?

hadoop mapreduce reduce

37,834

Solution 1

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage: 0-33% means its doing shuffle, 34-66% is sort, 67%-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%-- it's waiting for mappers to finish.

Reducers start shuffling based on a threshold of percentage of mappers that have finished. You can change the parameter to get reducers to start sooner or later.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for mappers to finish. Another job that starts later that will actually use the reduce slots now can't use them.

You can customize when the reducers startup by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In new versions of Hadoop (at least 2.4.1) the parameter is called is mapreduce.job.reduce.slowstart.completedmaps (thanks user yegor256).

Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, doing 0.1 would probably be appropriate.

Solution 2

The reduce phase can start long before a reducer is called. As soon as "a" mapper finishes the job, the generated data undergoes some sorting and shuffling (which includes call to combiner and partitioner). The reducer "phase" kicks in the moment post mapper data processing is started. As these processing is done, you will see progress in reducers percentage. However, none of the reducers have been called in yet. Depending on number of processors available/used, nature of data and number of expected reducers, you may want to change the parameter as described by @Donald-miner above.

Solution 3

As much I understand Reduce phase start with the map phase and keep consuming the record from maps. However since there is sort and shuffle phase after the map phase all the outputs have to be sorted and sent to the reducer. So logically you can imagine that reduce phase starts only after map phase but actually for performance reason reducers are also initialized with the mappers.

37,834

Slayer

Updated on October 30, 2020

Comments

Slayer over 3 years

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?
daydreamer almost 12 years

do you know where can I read more about what you have mentioned?
Donald Miner almost 12 years

Slowstart is quite poorly documented in my opinion.... as are most of the obscure configuration parameters.
sufinawaz over 10 years

Good answer @Donald Miner. Just want to add that in newer Hadoop version (I'm using 1.1.2), the value is defaulted to 0.05. hadoop.apache.org/docs/r1.1.2/mapred-default.html
harry potter almost 10 years

@Donald I am using the version 0.20.205.0 of hadoop and set the parameter "mapred.reduce.slowstart.completed.maps" in mapred-site.xml to 0.1, but the reducer still runs after the mappers complete. May I know why ?
Donald Miner almost 10 years

@harrypotter only thing I can think of is if you restarted your jobtracker and maybe tasktrackers after making the change. Also, are you sure the config file propagated throughout your entire cluster?
yegor256 over 9 years

in Hadoop 2.4.1 it is mapreduce.job.reduce.slowstart.completedmaps
NishM about 9 years

@DonaldMiner From what i understand, this parameter only affects the shuffle phase(copy phase) but the sort and reduce phase only start after all the map outputs have been copied. Am i wrong?
Donald Miner about 9 years

@nishm i think you are confusing the terminology of the entire reduce phase vs. Just reduce inside of the reduce phase. The reduce phase is the shuffle, sort, and reduce. Slowstart tells it when to start the overall phase. You are right that the reduce inside of the reduce phase only start once mappers finish.
NishM about 9 years

@Donald I was mentioning the latter reduce ( invocation of reduce method with records) in this case. And thank you for confirming
zihaolucky almost 9 years

thanks! An brief answer might be: just think the wordcount case, the word "abc" is distributed in several mappers, then they are sent to shuffle which means these words must collect together before the Reduce phrase. So all reduce jobs must wait all completion of mappers.
Avinash Ganta almost 6 years

@DonaldMiner "On the other hand, sort and reduce can only start once all the mappers are done." Is not quite accurate. The COPY and SORT (MERGE) phase run in parallel.
Sai Kiriti Badam about 5 years

Doesn't address OP's question of WHEN does the reducer start.