Copying faster than cp?

Solution 1

I was recently puzzled by the sometimes slow speed of cp. Specifically, how could df = pandas.read_hdf('file1', 'df') (700ms for a 1.2GB file) followed by df.to_hdf('file2') (530ms) be so much faster than cp file1 file2 (8s)?

Digging into this:

  • cat file1 > file2 isn't any better (8.1s).
  • dd bs=1500000000 if=file1 of=file2 isn't any better either (8.3s).
  • rsync file1 file2 is worse (11.4s): because file2 already exists, it tries to do its rolling-checksum and block-update magic.

Oh, wait a second! How about unlinking (deleting) file2 first if it exists?

Now we are talking:

  • rm -f file2: 0.2s (to add to any figure below).
  • cp file1 file2: 1.0s.
  • cat file1 > file2: 1.0s.
  • dd bs=1500000000 if=file1 of=file2: 1.2s.
  • rsync file1 file2: 4s.

So there you have it. Make sure the target files don't exist (or truncate them, which is presumably what pandas.to_hdf() does).
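
To make the comparison easy to reproduce, here is a minimal sketch of the three variants (file names as above; timings will of course vary with hardware and filesystem):

# Overwriting an existing target in place: slow on the ext3/ext4 systems tested.
time cp file1 file2

# Unlinking the target first: fast.
rm -f file2
time cp file1 file2

# Truncating the target first behaves the same as unlinking it.
truncate -s 0 file2
time cp file1 file2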

Edit: this was without emptying the cache before any of the commands, but as noted in the comments, doing so just consistently adds ~3.8s to all numbers above.
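
For reference, the usual way to empty the Linux page cache before such a timing run looks like this (needs root); it is shown only for completeness:

# Flush dirty pages, then drop the page cache, dentries and inodes (Linux only).
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches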

Also noteworthy: this was tried on several Linux setups (CentOS with a 2.6.18-408.el5 kernel and Ubuntu with a 3.13.0-77-generic kernel), on ext4 as well as ext3. Interestingly, on a MacBook with Darwin 10.12.6 there is no difference: both variants (with or without an existing file at the destination) are fast.

Solution 2

Copying a file on the local disk spends ~99% of its time reading from and writing to the disk. If you try to compress the data you increase CPU load without reducing the amount of data that has to be read and written... it will actually slow down your copy.

rsync will help if you already have a copy of the data and just need to bring it "up to date".
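
For example, refreshing an existing local copy of a tree might look like the sketch below (source and destination paths are placeholders); note that for local paths rsync skips its delta algorithm and copies whole files, so it is at best on par with cp:

# Update an existing local copy; files whose size and mtime are unchanged are skipped.
rsync -a --delete /data/source/ /data/backup/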

But if you want to create a brand new copy of a tree then you can't really do much better than your cp command.

Solution 3

On the same partition (and filesystem) you can use cp -l to create hard links instead of copies. Creating hard links is much faster than copying the data (but, of course, it does not work across different disk partitions).

As a small example:

$ time cp -r mydir mydira

real    0m1.999s
user    0m0.000s
sys     0m0.490s

$ time cp -rl mydir mydirb

real    0m0.072s
user    0m0.000s
sys     0m0.007s

That's a 28× improvement, but the test used only ~300 (rather small) files. A few larger files would show an even bigger gain, while a lot of smaller files would show a smaller one.
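
One way to convince yourself that cp -rl produced hard links rather than copies is to compare inode numbers, and to see the caveat raised in the comments below: writing through either name changes what both names show. The file name in this sketch is made up:

# Same inode number and a link count of 2 confirm both names refer to the same file.
ls -li mydir/somefile mydirb/somefile

# Hard links are not independent copies: modifying one "copy" modifies the other.
echo changed > mydirb/somefile
cat mydir/somefile    # also prints "changed"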

Comments

  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 7 years
    If this is on zfs, you can make a snapshot, which is practically instantaneous. The cost of the copy (both in time and in disk space) is then only paid when one of the sides is modified. I don't know what commands to use for this, I encourage someone who does to post an answer explaining how to do it. (A rough sketch of those commands is appended after these comments.)
  • Andrew Henle
    Andrew Henle over 7 years
    If you could post the output of iostat while this copy operation is running, you might get more help from readers. Assuming from the /solaris tag that you're running on Solaris, post several lines from iostat -sndzx 2. That will emit an output line every 2 seconds, with the first line being not very useful. Again, that needs to be run while your cp -r ... command is running.
  • CJ7
    CJ7 over 7 years
    It's on the same partition
  • CJ7
    CJ7 over 7 years
    What are hard-links? I need actual copies of the files to play around with.
  • Stephen Harris
    Stephen Harris over 7 years
    Hard links make each filename map to the same file; they're not copies. If you modify the new name you modify the original.
  • grochmal
    grochmal over 7 years
    @CJ7 - Hard links are just extra directory entries pointing to the same inode (the same data). If you change the "copy", the original file is changed too.
  • CJ7
    CJ7 over 7 years
    Also I'm on solaris and my cp does not have a -l option.
  • Bratchley
    Bratchley over 7 years
    They could use CoW snapshots. That's essentially creating a new copy of the files; you only pay for the initial snapshot operation and for slightly increased latency on writes to the new "copies".
  • grochmal
    grochmal over 7 years
    @CJ7 - Yeah, POSIX cp will not have -l. You will need to go with find -exec ln -P ... \;. But then again, you're after copies, not hard links.
  • Stephen Harris
    Stephen Harris over 7 years
    @Bratchley did you read the question? CoW does not meet the requirements (and it's Solaris, anyway).
  • Stephen Harris
    Stephen Harris over 7 years
    This doesn't meet the requirements; hard links to existing files are not copies; modify the "copy" and you modify the original.
  • grochmal
    grochmal over 7 years
    Well, you might do some form of tmpfs, mount it somewhere, CoW onto it, and then edit the "copies". But (1) that's horribly far fetched, and (2) not on solaris.
  • Bratchley
    Bratchley over 7 years
    It does meet the requirements. It creates an additional set of copies. Also Solaris quite famously supports snapshots.
  • Bratchley
    Bratchley over 7 years
    @grochmal If it's reasonably current version of Solaris, it's almost certainly going to be using ZFS which supports snapshotting. There are then a variety of ways to get those "copies" to show up in a desired part of the filesystem.
  • grochmal
    grochmal over 7 years
    @Bratchley - I forgot that Solaris has ZFS, yeah, that's actually a good bet. I know little about ZFS though.
  • Bratchley
    Bratchley over 7 years
    ha yeah. Solaris is the birthplace for ZFS ;-)
  • grochmal
    grochmal over 7 years
    @StephenHarris - Well, it may meet the requirements of someone who gets here from a Google search. In U&L theory it answers the question at hand. The comment discussion with the OP is an extra that shows that the answer does not meet the OP's requirements. (It's the difference between an answer to the question and the most useful answer to the OP.)
  • grochmal
    grochmal over 7 years
    @StephenHarris - Semantics are heavily subjective, and you're too heated today/yesterday. I use hard links a lot in my home directory, for example. Had I searched for different ways of doing cp, I'd like to see an alternative with hard links. If I remember correctly (that was several years ago), had I learned about hard links sooner, I would have used them sooner.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 7 years
    @Eric dd can only copy a single file or a whole filesystem image, but not a directory tree, which is what was asked here.
  • Eric
    Eric over 7 years
    @gilles I suppose you could use find to run dd for a quicker copy, but you're right that I missed the recursive part of the question. I wonder if cpio or tar would be quicker than cp.
  • Andrew Henle
    Andrew Henle almost 6 years
    Did you account for the source file contents potentially being held in cache?
  • roaima
    roaima almost 6 years
    Rsync won't do delta magic on local copies. At best it becomes like cp; in practice it's a little slower.
  • vijay
    vijay over 3 years
    So, why was it faster to rm first, then cp/cat rather than just plain cp/cat? Fragmentation? Something else?
  • TIm Richardson
    TIm Richardson over 3 years
    I am not sure, but it may be related to the way overwriting a large file is implemented on certain filesystems, e.g. ext3 and ext4. BTW, I also just tested with truncate -s 0 file2 && cp file1 file2: same timing as rm -f file2 && cp file1 file2 (fast).
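
As a follow-up to Gilles's comment above about ZFS: a rough sketch of the snapshot/clone approach looks like this (pool and dataset names are made up); the clone is a writable copy that shares blocks with the snapshot and only consumes extra space as either side is modified:

# Snapshot the dataset, then create a writable clone that shares its blocks.
zfs snapshot tank/data@before-copy
zfs clone tank/data@before-copy tank/data-copy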