Hadoop: How can I merge reducer outputs into a single file?

java hadoop merge mapreduce hdfs

15,214

Solution 1

But what should I do if I want to merge these outputs after the job using the HDFS API for Java?

I'm guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is the method that FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments. FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead use Path.getFileSystem on the destination path to obtain an HDFS FileSystem.

That said, I don't think it wins you very much: the merge still happens in the local JVM, so you aren't really saving much over -getmerge followed by -put.
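To make the "merge happens in the local JVM" point concrete, here is a minimal sketch of what such a merge boils down to. It is not Hadoop code: it uses plain java.io streams so that it runs without a cluster, and the names ConcatSketch and concat are hypothetical. With the HDFS API, the inputs would presumably come from FileSystem.open and the output from FileSystem.create, so every byte would still be read and written by the client process.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class ConcatSketch {

    /**
     * Copies each source stream, in list order, into dest and closes the
     * source afterwards. Returns the total number of bytes written.
     * Assumption: on HDFS the streams would be obtained via FileSystem.open.
     */
    static long concat(List<? extends InputStream> parts, OutputStream dest)
            throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        for (InputStream part : parts) {
            try (InputStream in = part) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    dest.write(buffer, 0, n);
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Two in-memory "part files" stand in for reducer outputs.
        List<InputStream> parts = Arrays.asList(
                new ByteArrayInputStream("a\t1\n".getBytes(StandardCharsets.UTF_8)),
                new ByteArrayInputStream("b\t2\n".getBytes(StandardCharsets.UTF_8)));
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        long written = concat(parts, merged);
        System.out.println(written + " bytes merged");
        System.out.print(new String(merged.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The order of the merged output is simply the order of the list passed in, so whatever produces that list decides the final layout.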

Solution 2

You can get a single output file by configuring a single reducer in your code:

job.setNumReduceTasks(1);

This will meet your requirement, but it is costly, because all of the reduce work then runs in one task.

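Alternatively, you can invoke the shell command from Java with Shell.execCommand. Its Javadoc describes it as a static method to execute a shell command, which covers most of the simple cases without requiring the user to implement the Shell interface.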

Parameters:
env the map of environment key=value
cmd shell command to execute.
Returns:
the output of the executed command.

org.apache.hadoop.util.Shell.execCommand(String[])

Author by

thomaslee

Updated on June 05, 2022

Comments

thomaslee almost 2 years

I know that the "getmerge" shell command can do this.

But what should I do if I want to merge these outputs after the job using the HDFS API for Java?

What I actually want is a single merged file on HDFS.

The only thing I can think of is to start an additional job after that.

Thanks!
thomaslee over 11 years

Thanks for your answer. That indeed works, but it is costly, as you say. Is there a way to merge them with the HDFS API?
saurabh shashank over 11 years

I would even go with your idea of running another job for it. Alternatively, see my edited answer.
thomaslee over 11 years

Yeah, maybe starting another job is better. I will also try execCommand before making a choice. Thank you very much!
thomaslee over 11 years

Thanks for your answer. I have just tried it like this:

String srcPath = "/user/hadoop/output";
String dstPath = "/user/hadoop/merged_file";
Configuration conf = new Configuration();
try {
    FileSystem hdfs = FileSystem.get(conf);
    FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, conf, null);
} catch (IOException e) {
    e.printStackTrace(); // report the failure instead of silently swallowing it
}

That successfully merged the output files into a single file on HDFS, and the order is just as I expected. But I have another question now: how does the function know the order of the files?
Ben McCracken over 11 years

Here's the implementation of copyMerge: grepcode.com/file/repository.cloudera.com/content/repositories/… It looks like it all comes down to the ordering of the items returned by the FileSystem's listStatus method. I'd guess that your output files are just concatenated together.
Nikhil Das Nomula over 11 years

@Thomas, @Ben: I am trying to merge the files from my reducers' output using FileUtil.copyMerge. However, I have a question: the source directory contains _SUCCESS and _log files in addition to part-r-00000 and part-r-00001. Does copyMerge pick up only the reducer output files, or should I explicitly filter which files are to be merged? If so, how can I do that? Thanks.
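On the filtering question: as far as I can tell, copyMerge does not filter by name and simply concatenates the regular files it finds in the source directory, so selecting the inputs yourself is the safer route. The sketch below shows one possible selection rule on plain file names, assuming the common Hadoop convention that names starting with "_" or "." are not data files. PartFileSelector, isDataFile and selectParts are hypothetical names; with the Hadoop API the same predicate could presumably be wrapped in a PathFilter and passed to FileSystem.listStatus(Path, PathFilter).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class PartFileSelector {

    /** Treats names beginning with "_" or "." as markers or metadata, not data. */
    static boolean isDataFile(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    /**
     * Returns the data file names in lexicographic order, so that
     * part-r-00000 precedes part-r-00001 regardless of listing order.
     */
    static List<String> selectParts(List<String> names) {
        List<String> parts = new ArrayList<>();
        for (String name : names) {
            if (isDataFile(name)) {
                parts.add(name);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) {
        // A made-up directory listing in arbitrary order.
        List<String> listing = Arrays.asList(
                "part-r-00001", "_SUCCESS", "_logs", "part-r-00000");
        System.out.println(selectParts(listing));
    }
}
```

Sorting explicitly also removes the dependence on whatever order the file system happens to return, which is the concern raised earlier in this thread.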
Viacheslav Dobromyslov over 9 years

Great answer. It's helpful if you want to prepare a compressed Avro file for some external system. For example, I process 5 JSON files of 1 GB each and reduce the output to 1 Avro file compressed with XZ to 100 MB. Otherwise I would get 5 Avro files of 50 MB each, about 250 MB in total.