How to parallelize the scp command?


Solution 1

Parallelizing SCP is counterproductive unless both sides are running on SSDs. The slowest part of SCP is either the network, in which case parallelizing won't help at all, or the disks on either side, which parallelizing will make worse: seek time is going to kill you.

You say machineA is on an SSD, so parallelizing per machine should be enough. The simplest way to do that is to wrap the first for loop in a subshell and background it.

# Run the whole PARTITION1 loop in a backgrounded subshell so the next
# loop can start immediately.
( for el in "${PARTITION1[@]}"
do
    scp david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/. || scp david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/.
done ) &
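
The second loop can then stay in the foreground (as discussed in the comments below, there is little point backgrounding it too); finish with wait so the script does not exit before the backgrounded subshell is done:

for sl in "${PARTITION2[@]}"
do
    scp david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/. || scp david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/.
done

# Block until the backgrounded PARTITION1 subshell has finished.
wait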

Solution 2

You could use GNU Parallel to help you run multiple tasks in parallel.
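
For instance, a minimal sketch using the question's variables (assuming GNU Parallel is installed; -j5 caps the number of simultaneous transfers, and {} is replaced by each partition element in turn):

# Run up to 5 scp transfers at once, keeping the machineC fallback
# from the original loop; the quoted string is run by a shell per job.
parallel -j5 "scp david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_{}_200003_5.data $PRIMARY/. || scp david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_{}_200003_5.data $PRIMARY/." ::: "${PARTITION1[@]}"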

However, in your situation, it would appear that you're establishing a separate secure connection for each file transfer, which is likely quite inefficient, especially if the other machines are not on a local network.
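
One standard way to reduce that per-file connection overhead (not part of the original answer) is OpenSSH connection multiplexing, so repeated scp calls reuse a single authenticated connection; a sketch of a ~/.ssh/config entry:

# ~/.ssh/config: share one SSH connection across scp/ssh invocations.
Host machineB machineC
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m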

The best approach would be to use a tool designed for batch file transfer, such as rsync, which can also work over plain ssh.
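
A sketch with rsync over ssh, reusing the question's variables (the filename pattern is kept quoted locally so it is expanded on the remote side):

# Fetch all matching files from machineB's snapshot directory in one
# ssh session instead of one connection per file.
rsync -av david@${FILERS_LOCATION[0]}:"$dir1/t1_weekly_1680_*_200003_5.data" "$PRIMARY/"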

If rsync is not available, you could instead use zip, or tar with gzip or bzip2, and then scp the resulting archive (then connect with ssh and do the unpacking).
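
A common variant of that idea streams the archive over ssh instead of writing it to disk first; a sketch, again with the question's variables (note this copies the whole snapshot directory, so pass specific file names to tar if only the partition files are wanted):

# Pack the remote snapshot directory with tar+gzip on machineB and
# unpack it directly into the local primary folder.
ssh david@${FILERS_LOCATION[0]} "tar czf - -C '$dir1' ." | tar xzf - -C "$PRIMARY"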


Author: arsenal

Updated on September 18, 2022

Comments

  • arsenal over 1 year

    I need to scp the files from machineB and machineC to machineA. I am running the shell script below from machineA. I have set up the ssh keys properly.

    If a file is not on machineB, then it should be on machineC. I need to move all the PARTITION1 and PARTITION2 files into the respective folders on machineA, as shown in my shell script below:

    #!/bin/bash
    
    readonly PRIMARY=/export/home/david/dist/primary
    readonly SECONDARY=/export/home/david/dist/secondary
    readonly FILERS_LOCATION=(machineB machineC)
    readonly MAPPED_LOCATION=/bat/data/snapshot
    PARTITION1=(0 3 5 7 9)
    PARTITION2=(1 2 4 6 8)
    
    # Find the newest date-stamped snapshot directory on each filer.
    dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
    dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
    
    # Count the files in each snapshot directory.
    length1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} "ls '$dir1' | wc -l")
    length2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} "ls '$dir2' | wc -l")
    
    # Proceed only when both filers agree on the snapshot and neither is empty.
    if [ "$dir1" = "$dir2" ] && [ "$length1" -gt 0 ] && [ "$length2" -gt 0 ]
    then
        rm -r $PRIMARY/*
        rm -r $SECONDARY/*
        # Copy each PARTITION1 file from machineB, falling back to machineC.
        for el in "${PARTITION1[@]}"
        do
            scp david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/. || scp david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/.
        done
        # Copy each PARTITION2 file the same way, into the secondary folder.
        for sl in "${PARTITION2[@]}"
        do
            scp david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/. || scp david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/.
        done
    fi
    

    Currently I have 5 files in PARTITION1 and PARTITION2, but in general there will be around 420 files, which means the files will be moved one by one; I think this might be pretty slow. Is there any way to speed up the process?

    I am running Ubuntu 12.04.

    • Matthew Ife over 10 years
      I don't think there is much benefit to making this concurrent; it's only two hosts. If you had two thousand, there might be a good case for the extra complexity; otherwise you're falling into the trap of over-engineering this.
  • arsenal over 10 years
    machineA is running on SSDs; not sure about machineB and machineC.
  • arsenal over 10 years
    And if parallelizing won't help at all, is the way I'm currently doing it already the most efficient, or can we improve it slightly?
  • Dennis Kaarsemaker over 10 years
    Rats, that means I'll have to give you an actual answer :)
  • arsenal over 10 years
    And similarly, can I do it for the second for loop as well? They both move files to the same machine, but into different folders.
  • Dennis Kaarsemaker over 10 years
    Sure, but why? There's nothing left to run in the foreground after that :) By the way, you'll also want to add a wait at the end of the script.
  • arsenal over 10 years
    I see what you meant. Then how do I run my other for loop in the background as well?
  • Ole Tange over 10 years
    I have had a situation where neither the disk bandwidth nor the network bandwidth limited performance; it was network latency. In that situation I got a factor of 3 performance boost by using GNU Parallel (see the other answer).