tar + rsync + untar. Any speed benefit over just rsync?

Solution 1

When you send the same set of files repeatedly, rsync is better suited because it only sends the differences. tar always sends everything, and this is a waste of resources when a lot of the data is already there. tar + rsync + untar loses this advantage, as well as the advantage of keeping the folders in sync with rsync --delete.
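
For a repeated sync, that typically looks like this (the paths here are just placeholders):

rsync -a --delete /src/dir/ user@server:/dest/dir/

On the second and later runs only the changed files are transferred, and --delete removes files on the destination that no longer exist on the source.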

If you are copying the files for the first time, packing them, then sending, then unpacking (AFAIK rsync doesn't take piped input) is cumbersome and always worse than just rsyncing, because rsync doesn't have any more work to do than tar anyway.

Tip: rsync version 3 or later does incremental recursion, meaning it starts copying almost immediately, before it has counted all the files.
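
To check which version you have (incremental recursion arrived in rsync 3.0.0):

rsync --version | head -n 1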

Tip 2: If you use rsync over ssh, you may also use either tar+ssh

tar -C /src/dir -jcf - ./ | ssh user@server 'tar -C /dest/dir -jxf -'

or just scp

scp -Cr srcdir user@server:destdir

General rule, keep it simple.

UPDATE:

I've created 59MB of demo data

mkdir tmp; cd tmp
for i in {1..5000}; do dd if=/dev/urandom of=file$i count=1 bs=10k; done

and tested several times the file transfer to a remote server (not in the same lan), using both methods

time rsync -r  tmp server:tmp2

real    0m11.520s
user    0m0.940s
sys     0m0.472s

time (tar cf demo.tar tmp; rsync demo.tar server: ; ssh server 'tar xf demo.tar; rm demo.tar'; rm demo.tar)

real    0m15.026s
user    0m0.944s
sys     0m0.700s

while keeping separate logs of the ssh traffic packets sent

wc -l rsync.log rsync+tar.log 
   36730 rsync.log
   37962 rsync+tar.log
   74692 total

In this case, I can't see any advantage in reduced network traffic from using rsync+tar, which is expected when the default MTU is 1500 and the files are 10k in size. rsync+tar generated more traffic, was 2-3 seconds slower, and left behind two garbage files that had to be cleaned up.

I did the same tests on two machines on the same LAN, and there rsync+tar achieved much better times and much, much less network traffic. I assume that is because of jumbo frames.

Maybe rsync+tar would be better than plain rsync on a much larger data set. But frankly I don't think it's worth the trouble: you need double the space on each side for packing and unpacking, and there are a couple of other options, as I've already mentioned above.

Solution 2

rsync also does compression. Use the -z flag. If running over ssh, you can also use ssh's compression mode. My feeling is that repeated levels of compression are not useful; they will just burn cycles without significant benefit. I'd recommend experimenting with rsync compression. It seems quite effective. And I'd suggest skipping tar or any other pre/post compression.

I usually use rsync as rsync -abvz --partial....
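
For example, either of these compresses the data on the wire, the first with rsync's own -z and the second with ssh's -C instead (the paths are placeholders):

rsync -avz /src/dir/ user@server:/dest/dir/
rsync -av -e 'ssh -C' /src/dir/ user@server:/dest/dir/

Pick one or the other; as said above, stacking both usually just burns cycles.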

Solution 3

I had to back up my home directory to NAS today and ran into this discussion, thought I'd add my results. Long story short, tar'ing over the network to the target file system is way faster in my environment than rsyncing to the same destination.

Environment: Source machine is an i7 desktop using an SSD. Destination machine is a Synology NAS DS413j on a gigabit LAN connection to the source machine.

The exact spec of the kit involved will impact performance, naturally, and I don't know the details of my exact setup with regard to quality of network hardware at each end.

The source files are my ~/.cache folder, which contains 1.2GB of mostly very small files.

1a/ tar files from source machine over the network to a .tar file on remote machine

$ tar cf /mnt/backup/cache.tar ~/.cache

1b/ untar that tar file on the remote machine itself

$ ssh admin@nas_box
[admin@nas_box] $ tar xf cache.tar

2/ rsync files from source machine over the network to remote machine

$ mkdir /mnt/backup/cachetest
$ rsync -ah .cache /mnt/backup/cachetest

I kept 1a and 1b as completely separate steps just to illustrate the task. For practical applications I'd recommend what Gilles posted above, piping tar output via ssh to an untarring process on the receiver.
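
Combined into one piped step it would look roughly like this (the destination path on the NAS is an assumption, adjust it to wherever the share actually lives):

tar -C ~/.cache -cf - . | ssh admin@nas_box 'tar -C /volume1/backup/cache -xf -'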

Timings:

1a - 33 seconds

1b - 1 minute 48 seconds

2 - 22 minutes

It's very clear that rsync performed amazingly poorly compared to the tar operation, which can presumably be attributed to the network performance characteristics mentioned above.

I'd recommend that anyone who wants to back up large quantities of mostly small files, such as a home directory backup, use the tar approach. rsync seems a very poor choice. I'll come back to this post if it turns out I've been inaccurate in my procedure.

Nick

Solution 4

Using rsync to send a tar archive, as asked, would actually be a waste of resources, since you'd add a verification layer to the process. rsync would checksum the tar file for correctness, when you'd rather have the check on the individual files. (It doesn't help to know that a tar file which may have been defective on the sending side already shows the same defect on the receiving end.) If you're sending an archive, ssh/scp is all you need.

The one reason you might have to choose sending an archive would be if the tar of your choice were able to preserve more of the filesystem specials, such as Access Control Lists or other metadata often stored in Extended Attributes (Solaris) or Resource Forks (macOS). When dealing with such things, your main concern will be which tools are able to preserve all the information associated with the file on the source filesystem, provided the target filesystem is capable of keeping track of it as well.
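
As a rough sketch (not tested against the filesystems mentioned above), a recent rsync can be asked to carry ACLs and extended attributes, and GNU tar has matching options; the paths are placeholders:

rsync -aAX /src/dir/ user@server:/dest/dir/

tar --acls --xattrs -C /src/dir -cf - . | ssh user@server 'tar --acls --xattrs -C /dest/dir -xf -'

Whether the attributes actually survive depends on both ends supporting them.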

When speed is your main concern, it depends a lot on the size of your files. In general, a multitude of tiny files will scale badly over rsync or scp, since each of them wastes individual network packets, whereas a tar file would include several of them within the payload of a single network packet. Even better if the tar file is compressed, since the small files would most likely compress better as a whole than individually. As far as I know, both rsync and scp fail to optimize when sending entire single files, as in an initial transfer, with each file occupying an entire data frame with its full protocol overhead (and wasting more on checking back and forth). However, Janecek states this is true for scp only, detailing that rsync does optimize the network traffic, but at the cost of building huge data structures in memory. See the article Efficient File Transfer, Janecek 2006. So according to him it's still true that both scp and rsync scale badly on small files, but for entirely different reasons. Guess I'll have to dig into the sources this weekend to find out.

For practical relevance: if you know you're sending mostly larger files, there won't be much of a difference in speed, and using rsync has the added benefit of being able to pick up where it left off when interrupted.
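
For example, to make a large transfer resumable after an interruption (the paths are placeholders):

rsync -av --partial /src/dir/ user@server:/dest/dir/

--partial keeps partially transferred files on the destination, so a rerun can continue from them instead of starting over.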

Postscriptum: These days, rdist seems to be sinking into oblivion, but before the days of rsync it was a very capable tool and widely used (safe when used over ssh, unsafe otherwise). It would not perform as well as rsync, though, since it didn't optimize to transfer only the content that had changed. Its main difference from rsync lies in the way it is configured and how the rules for updating files are spelled out.

Solution 5

For small directories (small as in used disk space), it depends on the overhead of checking the file information for the files being synced. On the one hand, rsync saves the time of transferring the unmodified files; on the other hand, it does have to transfer information about each file.

I don't know exactly the internals of rsync. Whether the file stats cause lag depends on how rsync transfers data — if file stats are transferred one by one, then the RTT may make tar+rsync+untar faster.

But if you have, say 1 GiB of data, rsync will be way faster, well, unless your connection is really fast!

Comments

  • Amelio Vazquez-Reina
    Amelio Vazquez-Reina almost 2 years

    I often find myself sending folders with 10K - 100K of files to a remote machine (within the same network on-campus).

    I was just wondering if there are reasons to believe that,

     tar + rsync + untar
    

    Or simply

     tar (from src to dest) + untar
    

    could be faster in practice than

    rsync 
    

    when transferring the files for the first time.

    I am interested in an answer that addresses the above in two scenarios: using compression and not using it.

    Update

    I have just run some experiments moving 10,000 small files (total size = 50 MB), and tar+rsync+untar was consistently faster than running rsync directly (both without compression).

    • JBRWilkinson
      JBRWilkinson over 12 years
      Are you running rsync in daemon mode at the other end?
    • Gilles 'SO- stop being evil'
      Gilles 'SO- stop being evil' over 12 years
      Re. your ancillary question: tar cf - . | ssh remotehost 'cd /target/dir && tar xf -'
    • Aska Ray
      Aska Ray over 12 years
      Syncing smaller files individually through rsync or scp results in each file requiring at least one data packet of its own over the net. If the files are small and the packets are many, this results in increased protocol overhead. Now factor in that there is more than one data packet for each file in the rsync protocol as well (transferring checksums, comparing...), and the protocol overhead quickly builds up. See Wikipedia on MTU size
    • Amelio Vazquez-Reina
      Amelio Vazquez-Reina over 12 years
      Thanks @TatjanaHeuser - if you add this to your answer and don't mind backing up the claim that rsync uses at least one packet per file, I would accept it.
    • Aska Ray
      Aska Ray over 12 years
      I found an interesting read stating that with scp and rsync the delay is to be blamed on different reasons: scp behaves basically like I described, but rsync optimizes the network payload at the increased cost of building up large data structures to handle that. I've included that in my answer and will check on it this weekend.
    • Kevin
      Kevin over 11 years
      The comparative speed depends a great deal on the speed of the connection between the computers and the speed of the computers themselves.
  • forcefsck
    forcefsck over 12 years
    Rsync doesn't add a verification layer. It only uses checksums to find differences in existing files, not to verify the result. In the case where the copy is fresh, no checksums are made. In the case where the copy is not fresh, checksums save you bandwidth.
  • 0xC0000022L
    0xC0000022L over 12 years
    Indeed. The "only what's needed" is an important aspect, although it can sometimes be unruly, that beast called rsync ;)
  • Populus
    Populus over 9 years
    BTW if you use the flag z with rsync it will compress the connection. With the amount of CPU power we have nowadays, the compression is trivial compared to the amount of bandwidth you save, which can be ~1/10 of uncompressed for text files
  • forcefsck
    forcefsck over 9 years
    @Populus, you'll notice I'm using compression in my original reply. However, in the tests I added later it doesn't matter that much; data from urandom doesn't compress much... if at all.
  • Wildcard
    Wildcard over 6 years
    Note that rsync by default skips compressing files with certain suffixes including .gz and .tgz and others; search the rsync man page for --skip-compress for the full list.
  • Wildcard
    Wildcard over 6 years
    Without using -z to have rsync do compression, this test seems incomplete.
  • Neek
    Neek over 6 years
    Tar without its own z argument, as I used it, does not compress data (see unix.stackexchange.com/questions/127169/…), so as far as I can see using rsync without compression is a fair comparison. If I were passing the tar output through a compression library like bzip2 or gzip then yes, -z would be sensible.
  • user1683793
    user1683793 about 5 years
    Update: Six hours into the ssh/tar transfer, my system decided to drop the connection to the SAN device I was moving data to. Now I am going to have to figure out what was transferred and what was not, which I will probably do with rsync. Sometimes, it is not worth the time you have to spend to save time.
  • Ciprian Tomoiagă
    Ciprian Tomoiagă over 2 years
    in this case it seems that the remote is mounted locally, probably using the NFS protocol. If this is the case, you're benchmarking the performance of NFS with 1 vs thousands of files, and rsync/tar have very little to say in this
  • Neek
    Neek over 2 years
    You are absolutely right @CiprianTomoiagă .. I think I was half joking in this 9 year old post, wanting to help others doing the same thing. It seems obvious that something like tar, which collects data on the source machine and writes to a single destination file, would perform better than a competitor working over any kind of filesystem barrier (NFS, as you point out), e.g. rsync, which operates over multiple files. I think I was hoping rsync would be able to achieve some optimisation on the receiving end, perhaps what rsyncd does, though I've even now never used it :)