Moving 2TB (10 mil files + dirs), what's my bottleneck?

Solution 1

Ever heard of splitting large tasks into smaller tasks?

/home/data/repo contains 1M dirs, each of which contains 11 dirs and 10 files. It totals 2TB.

rsync -a /source/1/ /destination/1/
rsync -a /source/2/ /destination/2/
rsync -a /source/3/ /destination/3/
rsync -a /source/4/ /destination/4/
rsync -a /source/5/ /destination/5/
rsync -a /source/6/ /destination/6/
rsync -a /source/7/ /destination/7/
rsync -a /source/8/ /destination/8/
rsync -a /source/9/ /destination/9/
rsync -a /source/10/ /destination/10/
rsync -a /source/11/ /destination/11/

(...)

Coffee break time.
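
If you'd rather not type those out, here is a minimal sketch of the same idea as a loop (untested; /source and /destination are placeholders for /home/data/repo and /home/data2/repo):

#!/bin/bash
# One rsync per top-level subdirectory; log each completed part so an
# interrupted run can resume without redoing finished directories.
for d in /source/*/; do
    name=$(basename "$d")
    rsync -a "$d" "/destination/$name/" && echo "$name" >> completed.log
done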

Solution 2

This is what is happening:

  • Initially, rsync builds the list of files.
  • Building this list is really slow, due to an initial sorting of the file list.
  • This can be avoided by using ls -f -1 and combining it with xargs to build the set of files rsync will use, or by redirecting the output to a file containing the file list (see the sketch after this list).
  • Passing this list to rsync instead of the folder makes rsync start working immediately.
  • This trick of ls -f -1 over folders with millions of files is perfectly described in this article: http://unixetc.co.uk/2012/05/20/large-directory-causes-ls-to-hang/
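
A minimal sketch of the trick, assuming the paths from the question (note that --files-from turns off the recursion implied by -a, so -r is passed explicitly):

# ls -f skips sorting; it also implies -a, hence the filter for . and ..
ls -f -1 /home/data/repo | grep -v '^\.\.\?$' > /tmp/filelist.txt
# Feed the unsorted list to rsync so it starts copying right away
rsync -a -r --files-from=/tmp/filelist.txt /home/data/repo/ /home/data2/repo/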

Solution 3

Even if rsync is slow (why is it slow? maybe -z will help), it sounds like you've gotten a lot of it moved over, so you could just keep trying:

If you used --remove-source-files, you could then follow up by removing the empty directories. --remove-source-files removes all the files, but leaves the directories behind.
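
A sketch of that follow-up, assuming the paths from the question (GNU find's -delete implies depth-first traversal, so child dirs are removed before their parents):

# Prune the now-empty directory tree left behind by --remove-source-files
find /home/data/repo -type d -empty -delete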

Just make sure you DO NOT use --remove-source-files with --delete to do multiple passes.

Also, for increased speed, you can use --inplace.

If you're getting kicked out because you're trying to do this remotely on a server, go ahead and run this inside a 'screen' session. At least that way you can let it run.
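
For example (the session name here is arbitrary):

# Start a named screen session, then launch the rsync inside it
screen -S bigmove
# After a dropped connection, log back in and reattach with:
screen -r bigmove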

Author: Tim

Updated on September 18, 2022

Comments

  • Tim over 1 year

    Background

    I ran out of space on /home/data and need to transfer /home/data/repo to /home/data2.

    /home/data/repo contains 1M dirs, each of which contains 11 dirs and 10 files. It totals 2TB.

    /home/data is on ext3 with dir_index enabled. /home/data2 is on ext4. Running CentOS 6.4.

    I assume these approaches are slow because repo/ has 1 million dirs directly underneath it.


    Attempt 1: mv is fast but gets interrupted

    I could be done if this had finished:

    /home/data> mv repo ../data2
    

    But it was interrupted after 1.5TB was transferred. It was writing at about 1GB/min.

    Attempt 2: rsync crawls after 8 hours of building file list

    /home/data> rsync --ignore-existing -rv repo ../data2
    

    It took several hours to build the 'incremental file list' and then it transfers at 100MB/min.

    I cancel it to try a faster approach.

    Attempt 3a: mv complains

    Testing it on a subdirectory:

    /home/data/repo> mv -f foobar ../../data2/repo/
    mv: inter-device move failed: '(foobar)' to '../../data2/repo/foobar'; unable to remove target: Is a directory
    

    I'm not sure what this error is about, but maybe cp can bail me out...

    Attempt 3b: cp gets nowhere after 8 hours

    /home/data> cp -nr repo ../data2
    

    It reads the disk for 8 hours and I decide to cancel it and go back to rsync.

    Attempt 4: rsync crawls after 8 hours of building file list

    /home/data> rsync --ignore-existing --remove-source-files -rv repo ../data2
    

    I used --remove-source-files thinking it might make it faster if I start cleanup now.

    It takes at least 6 hours to build the file list then it transfers at 100-200MB/min.

    But the server was burdened overnight and my connection closed.

    Attempt 5: THERE'S ONLY 300GB LEFT TO MOVE, WHY IS THIS SO PAINFUL

    /home/data> rsync --ignore-existing --remove-source-files -rvW repo ../data2
    

    Interrupted again. The -W almost seemed to make "sending incremental file list" faster, which to my understanding shouldn't make sense. Regardless, the transfer is horribly slow and I'm giving up on this one.

    Attempt 6: tar

    /home/data> nohup tar cf - . |(cd ../data2; tar xvfk -)
    

    Basically attempting to re-copy everything but ignoring existing files. It has to wade through 1.7TB of existing files, but at least it's reading at 1.2GB/min.

    So far, this is the only command which gives instant gratification.

    Update: interrupted again, somehow, even with nohup...
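
    (A likely cause: nohup only applies to the first command of a pipeline, so the extracting tar still dies with the session. A sketch that shields the whole pipeline, with a hypothetical log file name:)

    nohup sh -c 'tar cf - . | (cd ../data2; tar xfk -)' > tar.log 2>&1 &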

    Attempt 7: harakiri

    Still debating this one

    Attempt 8: scripted 'merge' with mv

    The destination dir had about 120k empty dirs, so I ran

    /home/data2/repo> find . -type d -empty -exec rmdir {} \;
    

    Ruby script:

    SRC  = "/home/data/repo"
    DEST = "/home/data2/repo"
    
    # List the top-level entries on each side (ls output is sorted, as diff needs)
    `ls #{SRC}  --color=never > lst1.tmp`
    `ls #{DEST} --color=never > lst2.tmp`
    # Keep only the lines unique to SRC's listing (prefixed "< " by diff)
    `diff lst1.tmp lst2.tmp | grep '<' > missing.tmp`
    
    t = `wc -l < missing.tmp`.to_i
    puts "Todo: #{t}"
    
    # `mv` each directory that is missing from DEST
    File.open('missing.tmp').each do |line|
      dir = line.strip.sub('< ', '')
      puts `mv #{SRC}/#{dir} #{DEST}/`
    end
    

    DONE.
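
    (For reference, a roughly equivalent shell sketch of the same merge, assuming the top-level names contain no whitespace; ls output is already sorted, as comm requires:)

    # Move every top-level entry that exists in SRC but not yet in DEST
    comm -23 <(ls /home/data/repo) <(ls /home/data2/repo) | while read -r d; do
        mv "/home/data/repo/$d" /home/data2/repo/
    done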

  • Ярослав Рахматуллин over 10 years
    The benefit I'm vaguely emphasizing is that you track the progress in small parts manually, so resuming the task takes less time if some part is aborted (because you know which steps completed successfully).
  • Tim over 10 years
    This is basically what I ended up doing in the end, except with mv. It's unfortunate there is no tool that meets mv and rsync halfway.
  • d-b over 9 years
    Can you give an example of how to use ls with rsync? I have a similar but not identical situation. On machine A I have rsyncd running and a large directory tree I want to transfer to machine B (actually, 90% of the directory is already at B). The problem is that I have to do this over an unstable mobile connection that frequently drops. Spending an hour building the file list every time I restart is pretty inefficient. Also, B is behind a NAT that I don't control, so it is hard to connect A -> B, while B -> A is easy.
  • redfox05 about 5 years
    Agree with @d-b. If an example could be given, that would make this answer much more useful.