Intersection of two arrays in BASH
Solution 1
comm(1) is a tool that compares two lists and can give you the intersection or difference between two lists. The lists need to be sorted, but that's easy to achieve.
To get your arrays into a sorted list suitable for comm:
$ printf '%s\n' "${A[@]}" | LC_ALL=C sort
That will turn array A into a sorted list. Do the same for B.
To use comm to return the intersection:
$ comm -1 -2 file1 file2
-1 -2 tells comm to suppress the entries unique to file1 (A) and those unique to file2 (B), which leaves the intersection of the two.
To have it return what is in file2 (B) but not file1 (A):
$ comm -1 -3 file1 file2
-1 -3 tells comm to suppress the entries unique to file1 and those common to both, leaving only those unique to file2.
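To see what the flags suppress, it can help to look at the full three-column output first. The following is a small sketch with made-up sample data (the fruit names are illustrative, not from the question); sorted_a and sorted_b are hypothetical helper functions introduced only for this example:

```bash
#!/usr/bin/env bash
# Illustrative sample data (hypothetical values).
A=(banana apple cherry)
B=(cherry date banana)

# Hypothetical helpers: print each array as a sorted, newline-separated list.
sorted_a() { printf '%s\n' "${A[@]}" | LC_ALL=C sort; }
sorted_b() { printf '%s\n' "${B[@]}" | LC_ALL=C sort; }

# No flags: column 1 = only in A, column 2 = only in B, column 3 = in both.
# Column 2 is indented by one tab, column 3 by two tabs.
all_columns=$(LC_ALL=C comm <(sorted_a) <(sorted_b))

# Suppress columns 1 and 2: the intersection.
both=$(LC_ALL=C comm -12 <(sorted_a) <(sorted_b))

# Suppress columns 1 and 3: only in B.
only_b=$(LC_ALL=C comm -13 <(sorted_a) <(sorted_b))

# Suppress columns 2 and 3: only in A.
only_a=$(LC_ALL=C comm -23 <(sorted_a) <(sorted_b))

printf 'all columns:\n%s\n' "$all_columns"
printf 'in both:\n%s\n' "$both"
printf 'only in B:\n%s\n' "$only_b"
printf 'only in A:\n%s\n' "$only_a"
```

Running comm under the same locale as sort keeps the two tools in agreement about ordering.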
To feed two pipelines into comm, use the "Process Substitution" feature of bash:
$ comm -1 -2 <(pipeline1) <(pipeline2)
To capture this in an array:
$ C=($(command))
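Note that C=($(command)) splits the output on any whitespace and also expands glob characters, so it is only safe for values like the volume IDs in the question. A hedged alternative, assuming bash 4 or newer, is the mapfile builtin, which reads one array element per line. The sample values below are made up for illustration:

```bash
#!/usr/bin/env bash
# Requires bash 4+ for mapfile. Sample data is hypothetical.
A=("vol two" "vol one" "vol three")

# -t strips the trailing newline from each line. Each line becomes exactly
# one element, so embedded spaces survive (embedded newlines still would not).
mapfile -t sorted < <(printf '%s\n' "${A[@]}" | LC_ALL=C sort)

printf '%d elements\n' "${#sorted[@]}"
printf '<%s>\n' "${sorted[@]}"
```

The same pattern works for the comm pipelines: replace the sort pipeline inside <( ... ) with the comm command.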
Putting it all together:
# 1. Intersection
$ C=($(comm -12 <(printf '%s\n' "${A[@]}" | LC_ALL=C sort) <(printf '%s\n' "${B[@]}" | LC_ALL=C sort)))
# 2. B - A
$ D=($(comm -13 <(printf '%s\n' "${A[@]}" | LC_ALL=C sort) <(printf '%s\n' "${B[@]}" | LC_ALL=C sort)))
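The pipelines above treat each line as one element, so they break if an element contains a newline. One possible workaround is to delimit elements with NUL bytes instead. This sketch assumes GNU coreutils (sort -z, and comm -z, which as far as I know was added in coreutils 8.25) and bash 4.4 or newer (mapfile -d). The sample arrays are invented for illustration:

```bash
#!/usr/bin/env bash
# Assumes GNU sort/comm with -z and bash 4.4+ (mapfile -d).
# Sample data is hypothetical; one element contains a newline.
A=($'one\ntwo' three four)
B=(four $'one\ntwo' five)

# printf '%s\0' emits each element followed by a NUL byte;
# sort -z and comm -z use NUL instead of newline as the record separator.
mapfile -d '' -t C < <(
    LC_ALL=C comm -z -12 \
        <(printf '%s\0' "${A[@]}" | LC_ALL=C sort -z) \
        <(printf '%s\0' "${B[@]}" | LC_ALL=C sort -z)
)

printf '%d common elements\n' "${#C[@]}"
printf '<%s>\n' "${C[@]}"
```

On systems without these GNU options, the looping solutions further down avoid the delimiter problem entirely.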
Solution 2
There is a rather elegant and efficient approach using uniq, but it requires first eliminating duplicates from each array so that only unique items remain. If you need to preserve duplicates, the only way is "looping through both arrays and comparing".
Consider we have two arrays:
A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)
First of all, let's transform these arrays into sets. We do this because intersection is a mathematical operation defined on sets, and a set is a collection of distinct (unique) objects. To be honest, I don't know what "intersection" means for lists or sequences. We can pick out a subsequence from a sequence, but that operation (selection) has a slightly different meaning.
So, let's transform!
$ A=($(echo ${A[@]} | sed 's/ /\n/g' | sort | uniq))
$ B=($(echo ${B[@]} | sed 's/ /\n/g' | sort | uniq))
Intersection:
$ echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d
If you want to store the elements in another variable:
$ intersection_set=$(echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d)
$ echo $intersection_set
vol-175a3b54 vol-71600106 vol-98c2bbef
uniq -d means show only duplicates (I think uniq is rather fast because of its implementation: I guess that it is done with an XOR operation).
Get the list of elements that appear in B and are not available in A, i.e. B\A:
$ echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d | xargs echo ${B[@]} | sed 's/ /\n/g' | sort | uniq -u
Or, with saving in a variable:
$ subtraction_set=$(echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d | xargs echo ${B[@]} | sed 's/ /\n/g' | sort | uniq -u)
$ echo $subtraction_set
vol-27991850 vol-2a19386a vol-615e1222 vol-7320102b vol-8f6226cc vol-b846c5cf vol-e38d0c94
Thus, we first got the intersection of A and B (which is simply the set of duplicates between them), say it is A/\B. Then we combined B with A/\B and kept only the elements that occur exactly once (uniq -u). Because every element of A/\B is also in B, this gives B\A = B \ (A/\B).
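The two steps can be checked on a small example. This is only a sketch with invented values. It uses printf '%s\n' instead of echo ... | sed so that it does not depend on sed interpreting \n in the replacement, and it assumes both arrays are already duplicate-free and contain no whitespace:

```bash
#!/usr/bin/env bash
# Hypothetical, already de-duplicated sets.
A=(a b c d)
B=(c d e f)

# Elements that occur twice in the combined list are in both sets: A /\ B.
intersection=$(printf '%s\n' "${A[@]}" "${B[@]}" | sort | uniq -d)

# Combine B with A /\ B. Elements of B that are also in A now occur twice,
# so keeping only the lines that occur once leaves B \ A.
# $intersection is deliberately unquoted so each element becomes one argument.
difference=$(printf '%s\n' "${B[@]}" $intersection | sort | uniq -u)

printf 'intersection:\n%s\n' "$intersection"
printf 'B minus A:\n%s\n' "$difference"
```

If either array still contained duplicates, uniq -d would report them as common elements, which is why the de-duplication step comes first.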
P.S. uniq was written by Richard M. Stallman and David MacKenzie.
Solution 3
You can get all elements that are in both A and B by looping through both arrays and comparing:
A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)
intersections=()
for item1 in "${A[@]}"; do
    for item2 in "${B[@]}"; do
        if [[ $item1 == "$item2" ]]; then
            intersections+=( "$item1" )
            break
        fi
    done
done
printf '%s\n' "${intersections[@]}"
You can get all elements in B but not in A in a similar manner:
A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)
not_in_a=()
for item1 in "${B[@]}"; do
    for item2 in "${A[@]}"; do
        [[ $item1 == "$item2" ]] && continue 2
    done
    # If we reached here, nothing matched.
    not_in_a+=( "$item1" )
done
printf '%s\n' "${not_in_a[@]}"
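One property of these loops is worth noting: duplicates in the outer array are kept, so when the arrays contain repeated elements (which the question allows), the result depends on which array is iterated first. A small sketch with made-up values, where a_first and b_first are names chosen only for this illustration:

```bash
#!/usr/bin/env bash
# Hypothetical arrays that both contain duplicates.
A=(x x y)
B=(x y y)

# Same nested loop as above, with A as the outer array.
a_first=()
for item1 in "${A[@]}"; do
    for item2 in "${B[@]}"; do
        if [[ $item1 == "$item2" ]]; then
            a_first+=( "$item1" )
            break
        fi
    done
done

# The same loop with the roles of A and B swapped.
b_first=()
for item1 in "${B[@]}"; do
    for item2 in "${A[@]}"; do
        if [[ $item1 == "$item2" ]]; then
            b_first+=( "$item1" )
            break
        fi
    done
done

echo "outer A: ${a_first[*]}"
echo "outer B: ${b_first[*]}"
```

The two results contain the same distinct values but different multiplicities, so de-duplicate afterwards if a true set is needed.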
Solution 4
Ignoring efficiency, here is an approach:
declare -a intersect
declare -a b_only
for bvol in "${B[@]}"
do
    in_both=""
    for avol in "${A[@]}"
    do
        [ "$bvol" = "$avol" ] && in_both=Yes
    done
    if [ "$in_both" ]
    then
        intersect+=("$bvol")
    else
        b_only+=("$bvol")
    fi
done
echo "intersection=${intersect[*]}"
echo "In B only=${b_only[@]}"
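If efficiency does matter, the inner loop can be replaced by a lookup table. This is a different technique from the nested loops above, sketched here under the assumption of bash 4 or newer (associative arrays). The names intersect and b_only mirror the ones used above, in_a is a hypothetical name for the lookup table, and the sample arrays are made up:

```bash
#!/usr/bin/env bash
# Requires bash 4+ (declare -A). Sample data is hypothetical.
A=(vol-1 vol-2 vol-3 vol-2)
B=(vol-3 vol-4 vol-2 vol-5)

# Record every element of A as a key; duplicates in A collapse automatically.
# Limitation: bash does not accept an empty string as an associative array key.
declare -A in_a=()
for avol in "${A[@]}"; do
    in_a[$avol]=1
done

intersect=()
b_only=()
for bvol in "${B[@]}"; do
    # A non-empty value means the key was recorded from A.
    if [[ -n ${in_a[$bvol]:-} ]]; then
        intersect+=("$bvol")
    else
        b_only+=("$bvol")
    fi
done

echo "intersection=${intersect[*]}"
echo "In B only=${b_only[*]}"
```

Each element is visited once, so the work grows with the combined size of the arrays rather than with their product. Duplicates in B are still carried through to the results.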
Bogdan
Updated on September 18, 2022
Comments
-
Bogdan over 1 year
I have two arrays like this:
A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)
The arrays are not sorted and might possibly even contain duplicated elements.
I would like to make the intersection of these two arrays and store the elements in another array. How would I do that?
Also, how would I get the list of elements that appear in B and are not available in A?
-
F. Hauri over 10 years
Of course, if duplicate lines are useless, they could simply be dropped.
-
clerksx over 10 years
This will only work if your values don't contain \n.
-
Gilles 'SO- stop being evil' over 10 years
Exercise: if you interchange A and B, is intersections always the same up to reordering?
-
camh over 10 years
@ChrisDown: That's right. I always try to write shell scripts that are properly quoted and handle all chars, but I've given up on \n. I have NEVER seen it in a filename, and such a large bunch of Unix tools work with \n-delimited records that you lose a lot if you try to handle \n as a valid char.
-
clerksx over 10 years
I've seen it in filenames when using GUI file managers that do not properly sanitise input filenames that are copied from somewhere else (also, nobody said anything about filenames).
-
clerksx over 10 years
@Gilles If the arrays may contain duplicate elements, no.
-
Jason R. Mick over 8 years
To protect \n try this:
arr1=( one two three "four five\nsix\nseven" ); arr2=( ${arr1[@]:1} "four five\\nsix" ); n1=${#arr1[@]}; n2=${#arr2[@]}; arr=( ${arr1[@]/ /'-_-'} ${arr2[@]/ /'-_-'} ); arr=( $( echo "${arr[@]}"|tr '\t' '-t-'|tr '\n' '-n-'|tr '\r' '-r-' ) ); arr1=( ${arr[@]:0:${n1}} ); arr2=( ${arr[@]:${n1}:${n2}} ); unset arr; printf "%0.s-" {1..10}; printf '\n'; printf '{'; printf " \"%s\" " "${arr1[@]}"; printf '}\n'; printf "%0.s-" {1..10}; printf '\n'; printf '{'; printf " \"%s\" " "${arr2[@]}"; printf '}\n'; printf "%0.s-" {1..10}; printf '\n\n'; unset arr1; unset arr2
-
Sorpigal over 5 years
One should not set LC_ALL=C. Instead set LC_COLLATE=C for the same performance gain without other side effects. In order to obtain correct results you will also need to set the same collation for comm that was used for sort, e.g.:
unset LC_ALL; export LC_COLLATE=C; comm -12 <(printf '%s\n' "${A[@]}" | sort) <(printf '%s\n' "${B[@]}" | sort)
-
Matt Alexander over 2 years
If you have spaces in the names, you'll be in trouble doing echo ${A[@]} | sed 's/ /\n/g' | sort | uniq. Better to do IFS=$'\n'; printf '%s\n' "${A[@]}" | sort | uniq.
-
Matt Alexander over 2 years
By the way, this is a really slick solution, using uniq -d.