Moving 2TB (10 mil files + dirs), what's my bottleneck?
Solution 1
Ever heard of splitting large tasks into smaller tasks?
/home/data/repo contains 1M dirs, each of which contains 11 dirs and 10 files. It totals 2TB.
rsync -a /source/1/ /destination/1/
rsync -a /source/2/ /destination/2/
rsync -a /source/3/ /destination/3/
rsync -a /source/4/ /destination/4/
rsync -a /source/5/ /destination/5/
rsync -a /source/6/ /destination/6/
rsync -a /source/7/ /destination/7/
rsync -a /source/8/ /destination/8/
rsync -a /source/9/ /destination/9/
rsync -a /source/10/ /destination/10/
rsync -a /source/11/ /destination/11/
(...)
Coffee break time.
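If writing those lines out by hand gets old, a loop does the same job and keeps a record of what finished. This is only a sketch, assuming the numbered /source/N layout above; the done.log file name is made up:
# one rsync per subdirectory; logging completed dirs means an interrupted
# run can skip straight past the parts that already finished
for d in /source/*/; do
  name=$(basename "$d")
  rsync -a "$d" "/destination/$name/" && echo "$name" >> done.log
done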
Solution 2
This is what is happening:
- Initially rsync will build the list of files.
- Building this list is really slow, due to an initial sorting of the file list.
- This can be avoided by using ls -f -1 and combining it with xargs to build the set of files rsync will use, or by redirecting the output to a file containing the file list.
- Passing this list to rsync instead of the folder makes rsync start working immediately (see the sketch after this list).
- This trick of ls -f -1 over folders with millions of files is perfectly described in this article: http://unixetc.co.uk/2012/05/20/large-directory-causes-ls-to-hang/
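For example, a rough, untested sketch of the file-list variant (paths taken from the question; the list-file name /tmp/filelist.txt is arbitrary):
# -f turns off ls's sorting, which is what makes it usable on a directory
# with a million entries; it also implies -a, so "." and ".." are filtered out
ls -f -1 /home/data/repo | grep -vE '^\.\.?$' > /tmp/filelist.txt
# --files-from lets rsync skip its own scan of the parent directory; note that
# it disables the recursion normally implied by -a, so -r is needed to descend
# into each listed directory
rsync -a -r --files-from=/tmp/filelist.txt /home/data/repo/ /home/data2/repo/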
Solution 3
Even if rsync is slow (why is it slow? maybe -z will help), it sounds like you've already moved a lot of it over, so you could just keep trying:
If you used --remove-source-files, you could then follow up by removing the empty directories. --remove-source-files removes all the files but leaves the directories behind.
Just make sure you DO NOT use --remove-source-files with --delete to do multiple passes.
Also, for increased speed, you can use --inplace.
If you're getting kicked out because you're trying to do this remotely on a server, go ahead and run this inside a 'screen' session. At least that way you can let it run.
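Putting those flags together, the whole thing might look like this. A sketch only; the paths are the ones from the question:
screen                      # survive a dropped SSH connection
rsync -a --ignore-existing --remove-source-files --inplace /home/data/repo/ /home/data2/repo/
# --remove-source-files leaves the (now empty) directories behind,
# so clean them up afterwards
find /home/data/repo -depth -type d -empty -delete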
Comments
-
Tim over 1 year
Background
I ran out of space on /home/data and need to transfer /home/data/repo to /home/data2.
/home/data/repo contains 1M dirs, each of which contains 11 dirs and 10 files. It totals 2TB. /home/data is on ext3 with dir_index enabled. /home/data2 is on ext4. Running CentOS 6.4.
I assume these approaches are slow because repo/ has 1 million dirs directly underneath it.
Attempt 1: mv is fast but gets interrupted
I could be done if this had finished:
/home/data> mv repo ../data2
But it was interrupted after 1.5TB was transferred. It was writing at about 1GB/min.
Attempt 2: rsync crawls after 8 hours of building file list
/home/data> rsync --ignore-existing -rv repo ../data2
It took several hours to build the 'incremental file list' and then it transfers at 100MB/min.
I cancel it to try a faster approach.
Attempt 3a: mv complains
Testing it on a subdirectory:
/home/data/repo> mv -f foobar ../../data2/repo/
mv: inter-device move failed: '(foobar)' to '../../data2/repo/foobar'; unable to remove target: Is a directory
I'm not sure what this error is about, but maybe cp can bail me out...
Attempt 3b: cp gets nowhere after 8 hours
/home/data> cp -nr repo ../data2
It reads the disk for 8 hours and I decide to cancel it and go back to rsync.
Attempt 4: rsync crawls after 8 hours of building file list
/home/data> rsync --ignore-existing --remove-source-files -rv repo ../data2
I used --remove-source-files thinking it might make it faster if I start cleanup now.
It takes at least 6 hours to build the file list, then it transfers at 100-200MB/min.
But the server was burdened overnight and my connection closed.
Attempt 5: THERE'S ONLY 300GB LEFT TO MOVE, WHY IS THIS SO PAINFUL
/home/data> rsync --ignore-existing --remove-source-files -rvW repo ../data2
Interrupted again. The -W almost seemed to make "sending incremental file list" faster, which to my understanding shouldn't make sense. Regardless, the transfer is horribly slow and I'm giving up on this one.
Attempt 6: tar
/home/data> nohup tar cf - . |(cd ../data2; tar xvfk -)
Basically attempting to re-copy everything but ignoring existing files. It has to wade thru 1.7TB of existing files but at least it's reading at 1.2GB/min.
So far, this is the only command which gives instant gratification.
Update: interrupted again, somehow, even with nohup..
Attempt 7: harakiri
Still debating this one
Attempt 8: scripted 'merge' with mv
The destination dir had about 120k empty dirs, so I ran:
/home/data2/repo> find . -type d -empty -exec rmdir {} \;
Ruby script:
SRC = "/home/data/repo"
DEST = "/home/data2/repo"

`ls #{SRC} --color=never > lst1.tmp`
`ls #{DEST} --color=never > lst2.tmp`
`diff lst1.tmp lst2.tmp | grep '<' > /home/data/missing.tmp`

t = `cat /home/data/missing.tmp | wc -l`.to_i
puts "Todo: #{t}"

# Manually `mv` each missing directory
File.open('missing.tmp').each do |line|
  dir = line.strip.gsub('< ', '')
  puts `mv #{SRC}/#{dir} #{DEST}/`
end
DONE.
-
Ярослав Рахматуллин over 10 years
The benefit I'm vaguely emphasizing is that you track the progress in small parts manually, so that resuming the task takes less time if some part is aborted (because you know which steps were completed successfully).
-
Tim over 10 years
This is basically what I ended up doing in the end, except with mv. It's unfortunate there is no tool meeting mv and rsync halfway.
-
d-b over 9 years
Can you give an example of how to use ls with rsync? I have a similar but not identical situation. On machine A I have rsyncd running and a large directory tree I want to transfer to machine B (actually, 90% of the directory is already at B). The problem is that I have to do this over an unstable mobile connection that frequently drops. Spending an hour on building the file list every time I restart is pretty inefficient. Also, B is behind a NAT that I don't control, so it is hard to connect A -> B, while B -> A is easy.
-
redfox05 about 5 years
Agree with @d-b. If an example could be given, that would make this answer much more useful.