Multithreaded cp on Linux?

Solution 1

As Celada mentioned, there is little point in using multiple threads of execution, since a copy operation barely uses the CPU. As ryekayo mentioned, you can run multiple instances of cp so that you end up with multiple concurrent I/O streams, but even this is typically counterproductive. If you are copying files from one location to another on the same disk, doing more than one at a time makes the disk waste time seeking back and forth between the files, which slows things down. Copying multiple files at once is really only beneficial if you are, for instance, copying several files from several different slow, removable disks onto your fast hard disk, or vice versa.

Solution 2

I believe you could use GNU parallel to accomplish your task.

 seq 70 | parallel -j70 cp filename

You can find a detailed explanation of using GNU parallel in my other answer here.

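I just tested the above command on my system and it made 70 copies of the file. They are named 1 through 70 in the current directory, because parallel appends each number from seq to the cp command as the destination.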
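For the more common case of copying many different files with a chosen number of concurrent cp processes, a minimal sketch follows. It uses `xargs -P` instead of GNU parallel, so it is a swapped-in technique and not the command from this answer. The function name `pcp` and its argument order are hypothetical, and it assumes an `xargs` that supports `-0`, `-I` and `-P` (the GNU and BSD versions do).

```shell
# pcp JOBS DEST FILE...   (hypothetical helper, not a standard command)
# Copies each FILE into the directory DEST, running at most JOBS cp
# processes at the same time.
# Assumption: xargs supports -0, -I and -P (GNU and BSD xargs do).
pcp() {
    pcp_jobs=$1
    pcp_dest=$2
    shift 2
    # Nothing to copy: avoid feeding xargs an empty file name.
    [ "$#" -gt 0 ] || return 0
    # Emit the file names NUL-separated so spaces and newlines survive,
    # then let xargs start one cp per file, up to $pcp_jobs at a time.
    printf '%s\0' "$@" | xargs -0 -P "$pcp_jobs" -I {} cp -- {} "$pcp_dest"
}
```

For example, `pcp 4 /mnt/backup *.iso` would copy the matching files with up to four cp processes running at once (the path is only an illustration). Whether this is faster than a plain cp depends on the storage, as the other answer and the comments discuss.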


Author by

leeand00


Updated on September 18, 2022

Comments

  • leeand00
    leeand00 almost 2 years

    Is there a multi-threaded cp command on Linux?

    I know how to do this on Windows, but I don't know how this is approached in a Linux environment.

    • Celada
      Celada over 9 years
      Since cp is I/O-bound, I'm not sure how much multithreading would help.
    • Jon Bringhurst
      Jon Bringhurst over 9 years
      Do you have a filesystem with multiple read-write heads? If you do, take a look at github.com/hpc/dcp
    • maxschlepzig
      maxschlepzig over 9 years
      I don't see how a question about cp could be a duplicate of a question about dd ...
  • leeand00
    leeand00 over 9 years
    I want to be able to specify the number of parallel threads used.
  • Matej Vrzala M4
    Matej Vrzala M4 over 9 years
    You can probably put this in a function that takes an integer parameter for the number of times you want the command run. That'll be a bit of coding on your end, though.
  • Matej Vrzala M4
    Matej Vrzala M4 over 9 years
    Even easier way than what I got.
  • Thorbjørn Ravn Andersen
    Thorbjørn Ravn Andersen over 9 years
    Disk seeks are only relevant for non-SSD disks.
  • psusi
    psusi over 9 years
    @ThorbjørnRavnAndersen, the severe penalty for seeks on HDD is almost none on SSD, yet the point remains that there is no benefit to trying to read or write to/from multiple parts of a single disk at the same time.
  • Thorbjørn Ravn Andersen
    Thorbjørn Ravn Andersen over 9 years
    @MatteoItalia if the IO-channel is saturated there is nothing caches can improve.
  • Thorbjørn Ravn Andersen
    Thorbjørn Ravn Andersen over 9 years
    @psusi The argument was that the disk wasted time seeking, not that the disk could not serve data any faster.
  • Paul Draper
    Paul Draper over 7 years
    Not all setups are one SSD-disk. For example, right now I am waiting on a 1 hour copy with AWS EFS, which uses multiple disks and has high latency.
  • peterh
    peterh over 7 years
    Not true: reading from the source and writing to the destination should be done in parallel, but they are not. Yes, the GNU fileutils need some tuning. There are also other, less common cases where a parallel copy would pay off, for example on network drives or on RAID/LVM.
  • psusi
    psusi over 7 years
    @peterh, it isn't good on raid either for the same reason it isn't on a single drive: you're just making multiple drives seek their heads back and forth. Network drives are not going to benefit either unless your drive and network connection are both faster than at least one of the server drives and this isn't likely to be the case.
  • peterh
    peterh over 7 years
    @psusi There are many RAID personalities. With linearly ordered RAID devices, for example, it is not an issue, and it is also not an issue if the RAID uses larger blocks than the worker threads of the cp tool do. Furthermore, the actual order of disk block accesses is controlled not by cp but by the disk layer (OK, our most pathological ext4 writes out the write cache every 5 seconds by default...). For reads there is the readahead trick, although it is far less effective.
  • peterh
    peterh over 7 years
    @psusi But, the focus of my comment was this: the current copy tool 1. reads a block from the input, 2. THEN writes this block to the output. While it reads, the write operation stalls, and while it writes, the read operation stalls. At least its reading and writing should be done on two different threads, this is my point.
  • psusi
    psusi over 7 years
    @peterh, cp can do one byte at a time and it doesn't matter (much). Readahead and write-behind make sure that the individual read and write calls do not stall, at least until there is plenty of data in the write-behind cache to keep the disk(s) busy, at which point the kernel starts letting the write calls block for a bit to avoid filling all of RAM with dirty pages.
  • John
    John over 5 years
    On Amazon you can get up to 10 times the sequential read speed when using multithreaded access to the same sequential file, so the answer isn't really accurate anymore.
  • Paul Knopf
    Paul Knopf over 5 years
    I develop recording software that runs on embedded Linux devices. I need to archive my internal media to external thumb drives. I used to use .NET, and having read/write threads increased performance by around 40%. I'd like a similar approach, but with cp/native.
  • joker
    joker over 4 years
    Running a command in the background has nothing to do, at all, with multithreading.
  • Szczepan Hołyszewski
    Szczepan Hołyszewski over 2 years
    And that copies filename 70 times?
  • Jason Newton
    Jason Newton over 2 years
    This answer is pretty misinformative; there are many cases where a parallel copy has different aggregate performance and timings, often drastically so (easily 10x throughput). NAS and RAID setups, which are very common, and PCIe-based memory devices are just a few of the environments where I've observed this. Sometimes, for one reason or another, this is true of any tech with sockets in the loop as well. It also does not suggest an original solution to the problem; it just says to try what others suggest even though it would be counterproductive.
  • psusi
    psusi over 2 years
    Normal raids won't benefit from it either. I suppose if you are using JBOD/linear mode and get lucky and happen to have some of the files on one underlying disk, and some files on another, then you might see some improvement, but that's unusual and unlikely.
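
The reader/writer split that peterh describes in the comments above can be approximated in the shell with a two-process pipeline. This is only a sketch under stated assumptions: the function name `pipe_copy` is hypothetical, it uses processes rather than threads, and it does not preserve file mode, ownership or timestamps. Whether it beats a plain cp is exactly what the thread disputes (psusi's point is that kernel readahead and write-behind already overlap the reads and writes), so measure on your own storage before relying on it.

```shell
# pipe_copy SRC DST   (hypothetical helper, not a standard command)
# The first cat only reads SRC; the second cat only writes DST.
# The kernel pipe buffer between the two processes lets a read and a
# write be in flight at the same time.
# Note: unlike cp -p, this does not preserve mode, owner or timestamps.
pipe_copy() {
    cat -- "$1" | cat > "$2"
}
```

For example, `pipe_copy recording.raw /media/usb/recording.raw` would stream the file through the pipe (both paths are illustrations only).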