How to change the replication factor of a directory in Hadoop


You can change the replication factor of a file using the command:

hdfs dfs -setrep -w 3 /user/hdfs/file.txt

You can also change the replication factor of a directory (the -R flag applies it recursively to everything under it) using the command:

hdfs dfs -setrep -R 2 /user/hdfs/test
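To confirm the change took effect, the replication factor appears in the second column of a directory listing. These commands assume a running HDFS cluster and the example path above; the file name is a placeholder:

```shell
# List files under the directory; the second column is the replication factor
hdfs dfs -ls /user/hdfs/test

# Or print only the replication factor of a single file (%r format specifier);
# part-00000 is a placeholder file name
hdfs dfs -stat %r /user/hdfs/test/part-00000
```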

But changing the replication factor of a directory only affects the files that already exist; new files created under that directory will still use the cluster's default replication factor (dfs.replication from hdfs-site.xml).
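That cluster-wide default comes from hdfs-site.xml; a typical entry (the value shown is illustrative) looks like:

```xml
<!-- hdfs-site.xml: default replication factor for newly created files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```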

Please see the linked documentation to understand more about configuring the replication factor for HDFS.

But you can temporarily override the HDFS default replication factor by passing:

-D dfs.replication=1

This works well when you pass it to a MapReduce job; the setting applies only to that job's output.
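For example, when launching a job from the command line, the generic -D option goes right after the main class, assuming the driver uses ToolRunner/GenericOptionsParser. The jar, class, and paths below are placeholders:

```shell
# Run a MapReduce job whose output files are written with replication factor 1.
# my-job.jar, com.example.MyJob, /input and /output are placeholder names.
hadoop jar my-job.jar com.example.MyJob -D dfs.replication=1 /input /output
```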

Author: Anish Gupta

Updated on October 16, 2020

Comments

  • Anish Gupta
    Anish Gupta over 3 years

    Is there any way to change the replication factor of a directory in Hadoop so that the change also applies to files written to that directory in the future?

  • Anish Gupta
    Anish Gupta almost 9 years
    So as of now there is no way to set the replication factor for files that will be written to a directory in the future. Actually, I am using MultipleOutputs to write some data out from the Mapper while sending the other data to the Reducer. But the Mapper runs awfully slowly because it replicates the data it writes to HDFS. Since the data written to HDFS is only needed for a while, I want to set the replication factor for that data to 1. Can you suggest a better approach?
  • Sandeep Singh
    Sandeep Singh almost 9 years
    Do you want to write the Mapper output to HDFS before sending it to the Reducer? Can you explain a bit more?
  • Anish Gupta
    Anish Gupta almost 9 years
    I am doing a bit of processing in the Mapper. I have overridden the context.write method: whenever context.write is called, it does its usual work and also writes some additional output based on the key-value pair passed to it. To write that data to HDFS I'm using MultipleOutputFormat, which writes my data to a pre-designated location. The code works fine but is quite slow. The writes from the Mapper total about 400 MB, so the replication makes them take more time.
  • Sandeep Singh
    Sandeep Singh almost 9 years
    I have updated my answer with one possible solution: pass this configuration when you invoke your job. It should turn off replication for your Hadoop job's output. You can also tune your job's performance by adjusting the number of Mappers and Reducers.
  • Sandeep Singh
    Sandeep Singh almost 9 years
    Actually, replication should not take much time, because the client writes the data to only one datanode, and that primary datanode then replicates the data to the other datanodes.