Hadoop MapReduce: Possible to define two mappers and reducers in one hadoop job class?
Solution 1
You can have multiple mappers, but in one job you can only have one reducer. The features you need are MultipleInputs, MultipleOutputs and GenericWritable.
Using MultipleInputs, you can set a mapper and a corresponding InputFormat for each input path. Here is my post about how to use it.
Using GenericWritable, you can separate the different input classes in the reducer. Here is my post about how to use it.
Using MultipleOutputs, you can output different classes from the same reducer.
Solution 2
You can use the MultipleInputs and MultipleOutputs classes for this, but the output of both mappers will go to the same reduce phase. If the data flows for the two mapper/reducer pairs really are independent of one another, then keep them as two separate jobs. By the way, MultipleInputs will run your mappers without change, but the reducers would have to be modified in order to use MultipleOutputs.
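To sketch the reducer change mentioned above: a reducer can hold a MultipleOutputs instance and route records to differently named output files. MultipleOutputs and its write/addNamedOutput methods are real Hadoop APIs, but the output names "out1"/"out2" and the "map1" tag convention below are hypothetical assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplitReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Route each record by the tag its mapper attached (hypothetical convention).
            String name = value.toString().startsWith("map1") ? "out1" : "out2";
            mos.write(name, key, value); // files land as out1-r-NNNNN / out2-r-NNNNN
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush the named outputs
    }

    // In the driver, every named output must be registered up front:
    public static void register(Job job) {
        MultipleOutputs.addNamedOutput(job, "out1", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "out2", TextOutputFormat.class, Text.class, Text.class);
    }
}
```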
Bob
Updated on September 16, 2022
Comments
- Bob over 1 year
I have two separate Java classes for doing two different MapReduce jobs. I can run them independently. The input files on which they operate are the same for both jobs. So my question is whether it is possible to define two mappers and two reducers in one Java class, like
mapper1.class, mapper2.class, reducer1.class, reducer2.class
and then
job.setMapperClass(mapper1.class);
job.setMapperClass(mapper2.class);
job.setCombinerClass(reducer1.class);
job.setCombinerClass(reducer2.class);
job.setReducerClass(reducer1.class);
job.setReducerClass(reducer2.class);
Do these set methods actually override the previous calls, or add new classes? I tried the code, but it executes only the latest given classes, which makes me think that they override. But there must be a way of doing this, right?
The reason I am asking is that I could then read the input files only once (one I/O pass) and process both MapReduce jobs from that single read. I would also like to know how I can write the output files into two different folders. At the moment, the jobs are separate and each requires its own input and output directory.
- Bob almost 12 years: @Chris Both pairs of MR share the same input, which made me think of being able to read the input only once. The mappers work with different keys; the keys for one mapper will be different from the ones for the other mapper. My thinking is that I could read the input files only once and process them in two different MR pairs which work independently.
- Bob almost 12 years: Cool, I did not think of that as a possibility. But how can I separate the output files? Say I can handle it in my reducer implementation, I then need to somehow specify which keys are written where.
- pyfunc almost 12 years: No, Bob: you can't do that. What you can do in map1 and map2 is emit K,V as K,(map1,V) so that in the reducer you know where the data is coming from. Each reducer creates its own file in the job output, so your output is already segregated.
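pyfunc's tagging idea can be illustrated without Hadoop at all: each "mapper" prefixes its value with a source tag, and the "reduce" side groups records by that tag. This is a plain-Java simulation of the data flow only; the class and method names are invented for the example, and a real job would carry the tag in a Writable instead of a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TagDemo {
    // Each simulated mapper emits (key, "tag:value") so the origin survives the shuffle.
    public static Map.Entry<String, String> map1(String key, String value) {
        return Map.entry(key, "map1:" + value);
    }

    public static Map.Entry<String, String> map2(String key, String value) {
        return Map.entry(key, "map2:" + value);
    }

    // Simulated reduce side: split the tagged records back out by source,
    // the way a real reducer would route them to separate outputs.
    public static Map<String, List<String>> reduce(List<Map.Entry<String, String>> shuffled) {
        Map<String, List<String>> out = new TreeMap<>();
        for (Map.Entry<String, String> e : shuffled) {
            String[] tagged = e.getValue().split(":", 2); // [tag, original value]
            out.computeIfAbsent(tagged[0], t -> new ArrayList<>())
               .add(e.getKey() + "=" + tagged[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> shuffled = List.of(
                map1("a", "1"), map2("a", "9"), map1("b", "2"));
        System.out.println(reduce(shuffled)); // prints {map1=[a=1, b=2], map2=[a=9]}
    }
}
```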