Hadoop MapReduce: Possible to define two mappers and reducers in one Hadoop job class?

Solution 1

You can have multiple mappers, but one job can have only one reducer. The features you need are MultipleInputs, MultipleOutputs and GenericWritable.

Using MultipleInputs, you can set each mapper and its corresponding InputFormat. Here is my post about how to use it.

Using GenericWritable, you can separate the different input classes in the reducer. Here is my post about how to use it.

Using MultipleOutputs, you can output different classes from the same reducer.
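Putting the three pieces together, a job driver might be wired roughly as below. This is only a sketch: `Mapper1`, `Mapper2`, `JoinReducer`, and the named outputs `"first"`/`"second"` are hypothetical placeholders for your own classes, and the real driver would also need the key/value types your mappers actually emit.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoMapperDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "two mappers, one reducer");
        job.setJarByClass(TwoMapperDriver.class);

        // One mapper (and InputFormat) per input path.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, Mapper1.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, Mapper2.class);

        // The single reducer that sees the output of both mappers.
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Named outputs let the reducer write to separate files.
        MultipleOutputs.addNamedOutput(job, "first",
                TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "second",
                TextOutputFormat.class, Text.class, Text.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```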

Solution 2

You can use the MultipleInputs and MultipleOutputs classes for this, but the output of both mappers will go to the same reduce phase. If the data flows for the two mapper/reducer pairs really are independent of one another, keep them as two separate jobs. By the way, MultipleInputs will run your mappers without change, but the reducers would have to be modified to use MultipleOutputs.
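The reducer modification mentioned here amounts to writing through a MultipleOutputs instance instead of the context. A minimal sketch, assuming Text keys and values and two named outputs `"first"`/`"second"` registered in the driver (all names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class RoutingReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Route each record to a named output instead of context.write();
            // the routing rule here is a made-up example.
            String name = key.toString().startsWith("a") ? "first" : "second";
            out.write(name, key, value);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        out.close(); // Required, or the named-output files may be incomplete.
    }
}
```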



Author: Bob, updated on September 16, 2022

Comments

  • Bob
    Bob over 1 year

    I have two separate Java classes for doing two different MapReduce jobs. I can run them independently. The input files they operate on are the same for both jobs. So my question is whether it is possible to define two mappers and two reducers in one Java class, like

    mapper1.class
    mapper2.class
    reducer1.class
    reducer2.class
    

    and then like

    job.setMapperClass(mapper1.class);
    job.setMapperClass(mapper2.class);
    job.setCombinerClass(reducer1.class);
    job.setCombinerClass(reducer2.class);
    job.setReducerClass(reducer1.class);
    job.setReducerClass(reducer2.class);
    

    Do these set methods actually override the previous ones or add new ones? I tried the code, but it executes only the latest given classes, which makes me think it overrides. But there must be a way of doing this, right?

    The reason I am asking is that I could read the input files only once (one I/O) and then process two MapReduce jobs. I would also like to know how I can write the output files into two different folders. At the moment, both jobs are separate and each requires its own input and output directory.

  • Bob
    Bob almost 12 years
    @Chris Both pairs of MR share the same input, which made me think of reading the input only once. The mappers work with different keys; the keys for one mapper will be different from those for the other mapper. The reason I am considering this is that I could read the input files only once and process them in two different pairs of MRs that work independently.
  • Bob
    Bob almost 12 years
    Cool, I did not think of that as a possibility. But how can I separate the output files? Say I can handle it in my reducer implementation; I would then need to somehow specify which keys are written where.
  • pyfunc
    pyfunc almost 12 years
    No Bob: you can't do that. What you can do in map1 and map2 is emit K,V as K,(map1,V) so that in the reducer you know where the data came from. Each reducer creates its own file in the job output, so your output is already segregated.
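    The tagging idea in this comment can be sketched in plain Java, independent of Hadoop: each mapper prefixes its values with a source tag, and the reducer strips the tag to learn which mapper emitted the record. The tag names ("map1", "map2") and the tab separator are illustrative assumptions, not part of any Hadoop API.

    ```java
    public class TaggedValue {
        // Prefix a value with the name of the mapper that produced it.
        static String tag(String source, String value) {
            return source + "\t" + value;
        }

        // Split back into {source, value}; limit 2 keeps tabs inside the value intact.
        static String[] untag(String tagged) {
            return tagged.split("\t", 2);
        }

        public static void main(String[] args) {
            // What a reducer would do with one record from each mapper.
            String[] a = untag(tag("map1", "42"));
            String[] b = untag(tag("map2", "hello"));
            System.out.println(a[0] + " emitted " + a[1]); // map1 emitted 42
            System.out.println(b[0] + " emitted " + b[1]); // map2 emitted hello
        }
    }
    ```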