Why is tar|tar so much faster than cp?

Solution 1

cp does open-read-close-open-write-close in a loop over all files, so reading from one place and writing to another are fully interleaved. tar | tar does the reading and the writing in separate processes, and in addition tar uses multiple threads to read (and write) several files 'at once', effectively allowing the disk controller to fetch, buffer and store many blocks of data at a time. All in all, tar lets each component work efficiently, while cp breaks the problem down into disparate, inefficiently small chunks.
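
For reference, here is a minimal sketch of the two approaches being compared (srcdir and dstdir are placeholder paths, and the exact flags accepted vary between tar implementations):

    # stream a tar archive between two tar processes:
    # the left tar only reads, the right tar only writes
    mkdir -p dstdir
    (cd srcdir && tar cf - .) | (cd dstdir && tar xf -)

    # the plain recursive copy it is being compared against
    cp -a srcdir/. dstdir/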

Solution 2

Your edit goes in the right direction: cp isn't necessarily slower than tar | tar. It depends, for example, on the quantity and size of the files. For big files a plain cp is best, since it's a simple job of pushing data around. For lots of small files the logistics are different and tar may do a better job. See for example this answer.
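
If in doubt for a particular tree, it's easy to measure both. A rough sketch (hypothetical paths; on Linux you would also want to drop the page cache between runs so the second method doesn't benefit from already-cached data):

    # plain recursive copy
    mkdir -p copy-cp
    time cp -a srcdir/. copy-cp/

    # drop the page cache between runs (Linux-specific, needs root)
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

    # tar pipeline over the same tree
    mkdir -p copy-tar
    time sh -c '(cd srcdir && tar cf - .) | (cd copy-tar && tar xf -)'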

Comments

  • callum
    callum almost 2 years

    For recursively copying a directory, using tar to pack up a directory and then piping the output to another tar to unpack seems to be much faster than using cp -r (or cp -a).

    Why is this? And why can't cp be made faster by doing it the same way under the hood?

    Edit: I noticed this difference when trying to copy a huge directory structure containing tens of thousands of files and folders, deeply nested, but totalling only about 50MB. Not sure if that's relevant.

  • jan deeg
    jan deeg over 7 years
    Can we really say that's true of all cp implementations? How do we know that's true? And why would cp be written in such an inefficient way? Any textbook implementation of a file copy reads a buffer of n bytes at a time, and writes them to disk before reading another n bytes. But you're saying cp always reads the whole file before writing the whole copy?
  • hmijail mourns resignees
    hmijail mourns resignees almost 4 years
    "uses multiple threads to read (and write) several files 'at once'" -> that makes no sense. One disk can only read or write one file at once. Plus, the fastest way to read / write is to do it in big, contiguous chunks to take advantage of readahead optimizations and minimize seeking (including solid state drives), so using multiple threads to try to parallelize things only makes the process more inefficient.
  • gronostaj
    gronostaj almost 4 years
    @hmijailmournsresignees Your assumptions are correct for large files, but not for huge numbers of small ones or for heavily fragmented files. In that case it's optimal to group partial operations on multiple files into these contiguous chunks. Modern disks have huge buffers - up to 256 MB for HDDs - that can be employed to shuffle operations to improve efficiency. There's also NCQ, which can reorder operations to take advantage of the disk's mechanical properties. Disks do read/write concurrently, just not in parallel.
  • hmijail mourns resignees
    hmijail mourns resignees almost 4 years
    @gronostaj aren't you making my point? It's as simple as "sequential accesses good, random accesses bad". The buffers are there just to compensate for the bottleneck that is just afterwards: the actual storage. Data fragmentation is a problem, which you make even worse by fragmenting the accesses and therefore requiring random accesses. NCQ tries to coalesce those random accesses, but if you do big bulk reads (by avoiding multithreading!) there is nothing to coalesce.
  • gronostaj
    gronostaj almost 4 years
    @hmijailmournsresignees Increased fragmentation due to concurrent writes is an aspect worth exploring, I think. I suppose a modern FS would try to reduce the impact, but I'm not sure if that's the case and how effective that would be. But for reads I think it's a different story: assuming you're dealing with plenty of small files, they are almost guaranteed to be non-contiguous. NCQ + long read queue should improve the performance. The less concurrent your reads are, the less room for NCQ to work you have.
  • hmijail mourns resignees
    hmijail mourns resignees almost 4 years
    "NCQ + long read queue should improve the performance. The less concurrent your reads are, the less room for NCQ to work you have." -> If the pattern of concurrent accesses fits the pattern of fragmentation of the files, then... maybe? If there are so many parallel requests that the accesses become more fragmented than the file, then no. I'd guess it's like trying to run N threads on M processors when N>>M: you waste time in the switching. Only, the penalization for switching too early is terribly worse in the case of storage.