Hadoop MapReduce multiple input files


Solution 1

Based on the stack trace, your output directory already exists. The simplest fix is to delete it before running the job:

bin/hadoop fs -rmr /user/cloudera/capital/output

Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided).

Basically you need to shift all your indices one to the right:

Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);

Don't forget to clear your output folder though!
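
Whether the class name shows up in args can depend on how the job is launched, so a more forgiving option is to read the three paths from the end of the argument array. The following is only a minimal sketch, not code from the question: the class name JobPaths and its fields are hypothetical, and it assumes the last three arguments are always the city input, the country input, and the output, in that order.

```java
// Hypothetical helper: reads the three job paths from the END of args,
// so it gives the same result whether or not a class name precedes them.
final class JobPaths {
    final String cityInput;
    final String countryInput;
    final String output;

    private JobPaths(String cityInput, String countryInput, String output) {
        this.cityInput = cityInput;
        this.countryInput = countryInput;
        this.output = output;
    }

    static JobPaths fromArgs(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                "Usage: <city input> <country input> <output>");
        }
        int n = args.length;
        // Assumption: the paths are always the last three arguments.
        return new JobPaths(args[n - 3], args[n - 2], args[n - 1]);
    }

    public static void main(String[] args) {
        JobPaths paths = fromArgs(args);
        System.out.println("city input:    " + paths.cityInput);
        System.out.println("country input: " + paths.countryInput);
        System.out.println("output:        " + paths.output);
    }
}
```

The resulting strings can then be wrapped in new Path(...) in the same way as in the snippet above.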

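Also, a small tip: you can separate the input files with a comma so that you can set them with a single call. Note that this call takes no mapper class, so it only fits when every input is processed by the job's single mapper. The command would look like this: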

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat

And in your Java code:

FileInputFormat.addInputPaths(job, args[1]);
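
If you go the comma-separated route, it can help to check the argument before handing it to Hadoop. This is a small sketch in plain Java: the class InputList and the method splitPaths are hypothetical names, and the plain split on commas is a simplification that assumes none of the paths contain commas or glob braces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-check for a comma-separated list of input paths.
final class InputList {
    static List<String> splitPaths(String commaSeparated) {
        List<String> paths = new ArrayList<>();
        // Limit -1 keeps trailing empty entries so "a,b," is rejected too.
        for (String part : commaSeparated.split(",", -1)) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                throw new IllegalArgumentException(
                    "Empty path in input list: " + commaSeparated);
            }
            paths.add(trimmed);
        }
        return paths;
    }

    public static void main(String[] args) {
        String inputs = "/user/cloudera/capital/input/City.dat,"
                      + "/user/cloudera/capital/input/Country.dat";
        for (String path : splitPaths(inputs)) {
            System.out.println(path);
        }
    }
}
```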

Solution 2

What is happening here is that the class name is deemed to be the first argument!

By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.

So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like:

Manifest-Version: 1.0
Main-Class: org.myorg.Capital

And now your command would look like:

hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output

You can certainly just change the index values used in your code, but that is not an advisable solution.
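
To make the role of the manifest concrete, here is a rough sketch of how a launcher might decide what ends up in args. It uses only the standard java.util.jar API. The class LauncherSketch and its methods are hypothetical, and the rule it encodes (a Main-Class entry means every remaining argument is passed through, otherwise the first argument is consumed as the class name) is my understanding of how hadoop jar behaves, so treat it as an assumption rather than a description of Hadoop's actual source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical model of a jar launcher; not Hadoop code.
final class LauncherSketch {
    // Returns the Main-Class entry of a manifest, or null if there is none.
    static String mainClassOf(String manifestText) {
        try {
            Manifest manifest = new Manifest(new ByteArrayInputStream(
                manifestText.getBytes(StandardCharsets.UTF_8)));
            return manifest.getMainAttributes()
                           .getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the arguments that would reach main(String[]).
    static String[] programArgs(String manifestMainClass, String[] argsAfterJar) {
        if (manifestMainClass != null) {
            // Main class known from the manifest: pass everything through.
            return argsAfterJar.clone();
        }
        if (argsAfterJar.length == 0) {
            throw new IllegalArgumentException("No main class given");
        }
        // No Main-Class entry: the first argument names the class.
        return Arrays.copyOfRange(argsAfterJar, 1, argsAfterJar.length);
    }

    public static void main(String[] args) {
        String[] cli = {"org.myorg.Capital", "City.dat", "Country.dat", "output"};
        String withMain = mainClassOf(
            "Manifest-Version: 1.0\nMain-Class: org.myorg.Capital\n");
        String withoutMain = mainClassOf("Manifest-Version: 1.0\n");
        System.out.println(Arrays.toString(programArgs(withMain, cli)));
        System.out.println(Arrays.toString(programArgs(withoutMain, cli)));
    }
}
```

Under this model, a jar that already declares Main-Class and is still invoked with the class name on the command line would see the class name as args[0], which pushes Country.dat into the output slot at args[2]. That would be consistent with the error message in the question.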

Author by

gaussd

I'm a computer science student in Freiburg (Germany).

Updated on November 14, 2020

Comments

  • gaussd over 3 years

    I need two files as input to my MapReduce program: City.dat and Country.dat.

    In my main method I'm parsing the command-line arguments like this:

    Path cityInputPath = new Path(args[0]);
    Path countryInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);
    MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
    MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
    FileOutputFormat.setOutputPath(job, outputPath);
    

    When I run my program with the following command:

    hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
    

    I get the following error:

    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/cloudera/capital/input/Country.dat already exists
    

    Why does it treat this as my output directory? I specified another directory as the output directory. Can somebody explain this?

  • gaussd over 11 years
    That is strange, because I have always started my programs with this command and it never treated org.myorg.Class as the zeroth argument. Strangely, shifting all my indices leads to the same error. Also, my output folder does not exist. The problem is that it thinks /user/cloudera/input/Country.dat is my output folder, and that's why it's not empty. The question is why it thinks that this is my output folder.
  • Thomas Jungblut over 11 years
    If it leads to the exact same error, you are not running the code you have provided.
  • pk10 over 9 years
    As far as I have worked with such problems, @gaussd is right: org.myorg.Capital is not the 0th element in args. It's just saying "start with the class org.myorg.Capital in the capital.jar file".