Hadoop MapReduce multiple input files
Solution 1
Based on the stack trace, your output directory already exists. So the simplest fix is to delete it before running the job:
bin/hadoop fs -rmr /user/cloudera/capital/output
Besides that, your argument list starts with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided).
Basically you need to shift all your indices one to the right:
Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
Don't forget to clear your output folder though!
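As a sanity check, the shift can be illustrated with a plain-Java sketch (no Hadoop dependency; this only simulates the argument array the answer claims your driver receives):

```java
public class ArgShiftDemo {
    public static void main(String[] ignored) {
        // Simulated argument array: the main-class name at index 0,
        // followed by the three paths from the original command line.
        String[] args = {
            "org.myorg.Capital",                        // index 0: class name
            "/user/cloudera/capital/input/City.dat",    // index 1
            "/user/cloudera/capital/input/Country.dat", // index 2
            "/user/cloudera/capital/output"             // index 3
        };
        // With the indices shifted one to the right, the output path
        // lands on index 3 instead of the Country.dat path at index 2:
        System.out.println("city    = " + args[1]);
        System.out.println("country = " + args[2]);
        System.out.println("output  = " + args[3]);
    }
}
```

Reading args[2] as the output path in this layout is exactly what produced the FileAlreadyExistsException on Country.dat.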
Also a small tip for you: you can separate the files with a comma (","), so you can set them with a single call like this:
hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat
And in your java code:
FileInputFormat.addInputPaths(job, args[1]);
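Under the hood, FileInputFormat.addInputPaths splits that single string on commas. A minimal plain-Java sketch of the same splitting (String.split is used here only for illustration; the hypothetical paths match the ones above):

```java
public class CommaPathsDemo {
    public static void main(String[] ignored) {
        // One comma-separated string, as it would arrive in args[1]:
        String commaSeparated =
            "/user/cloudera/capital/input/City.dat,"
            + "/user/cloudera/capital/input/Country.dat";
        // Splitting on the comma recovers the individual input paths:
        for (String path : commaSeparated.split(",")) {
            System.out.println(path);
        }
    }
}
```

Note that this comma form routes both files through the same mapper, so it only applies if you do not need the per-file mappers that MultipleInputs gives you.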
Solution 2
What is happening here is that the class name is deemed to be the first argument!
By default, the first non-option argument is the name of the class to be invoked. A fully-qualified class name should be used. If the -jar option is specified, the first non-option argument is the name of a JAR archive containing class and resource files for the application, with the startup class indicated by the Main-Class manifest header.
So what I would suggest is that you add a manifest file to your jar in which you specify the main class. Your MANIFEST.MF file may look like:
Manifest-Version: 1.0
Main-Class: org.myorg.Capital
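If you want to verify the manifest is well formed, the JDK's own java.util.jar.Manifest class can parse it. A small sketch (one caveat: a manifest must end with a newline, or the last attribute is silently dropped):

```java
import java.io.ByteArrayInputStream;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestDemo {
    public static void main(String[] args) throws Exception {
        // The same MANIFEST.MF content as above; note the trailing newline.
        String mf = "Manifest-Version: 1.0\r\n"
                  + "Main-Class: org.myorg.Capital\r\n\r\n";
        Manifest manifest =
            new Manifest(new ByteArrayInputStream(mf.getBytes("UTF-8")));
        // This is the attribute the launcher reads to pick the startup class:
        String mainClass = manifest.getMainAttributes()
                                   .getValue(Attributes.Name.MAIN_CLASS);
        System.out.println(mainClass); // org.myorg.Capital
    }
}
```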
And now your command would look like:
hadoop jar capital.jar /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
You can certainly just change the index values used in your code, but that's not an advisable solution.
Comments
-
gaussd over 3 years
So I need two files as an Input to my mapreduce program: City.dat and Country.dat
In my main method I'm parsing the command line arguments like this:
Path cityInputPath = new Path(args[0]);
Path countryInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
If I'm now running my program with the following command:
hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output
I get the following error:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/cloudera/capital/input/Country.dat already exists
Why does it treat this as my output directory? I specified another directory as the output directory. Can somebody explain this?
-
gaussd over 11 years: That is strange, because I always started my programs with this command and it never treated org.myorg.Class as the zeroth argument. Shifting all my indices strangely leads to the same error. Also, my output folder does not exist. The problem is that it thinks /user/cloudera/input/Country.dat is my output folder; that's why it's not empty. The question is why it thinks that this is my output folder.
-
Thomas Jungblut over 11 years: If it leads to the exact same error, you are not running the code you have provided.
-
pk10 over 9 years: As far as I have worked with such problems, @gaussd is right. org.myorg.Capital is not the 0th element in args. It's just saying "start with the class org.myorg.Capital in the capital.jar file".