Removing duplicates on a variable without sorting

Solution 1

new_variable=$( printf '%s ' $variable | awk 'BEGIN{RS=ORS=" "}!a[$0]++' );

Here's how it works:

RS (Input Record Separator) is set to a single space so that awk treats each fruit in $variable as a record instead of a field. The non-sorting unique magic happens with !a[$0]++. Since awk supports associative arrays, it uses the current record ($0) as the key into the array a[]. If that key has not been seen before, a[$0] evaluates to 0 (awk's default numeric value for an unset element), which is then negated to return TRUE. I then exploit the fact that awk defaults to 'print $0' when a pattern is TRUE and no '{ commands }' are given. Because ++ is a post-increment, a[$0] is incremented only after the test, so this key can never return TRUE again and repeated values are never printed. ORS (Output Record Separator) is set to a space as well to mimic the input format. Note that the input must not end in a newline: with RS set to a space, a final record of "banana\n" would not match an earlier "banana" and would be printed as if it were new.

A less terse version of this command which produces the same output would be the following:

awk 'BEGIN{RS=ORS=" "}{ if (a[$0] == 0){ a[$0] += 1; print $0}}'

Gotta love awk =)
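
For comparison, here is a minimal sketch of the same !a[$0]++ idiom with the default newline record separator, which avoids the trailing-newline pitfall. The variable name deduped and the array name seen are my own choices, and paste is only there to join the lines back together with spaces.

```shell
variable="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"

# $variable is deliberately unquoted so the shell splits it into words;
# printf then emits one word per line, which is awk's default record format.
# awk prints a line only the first time it is seen, and paste joins the
# surviving lines back into a single space-separated string.
deduped=$(printf '%s\n' $variable | awk '!seen[$0]++' | paste -sd' ' -)

echo "$deduped"
```

Because every word ends in a newline here, duplicates at the very end of the list are handled the same way as any others.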

EDIT

If you needed to do this in pure Bash 2.1+, I would suggest this:

#!/bin/bash    

variable="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"
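temp=" ${variable// /  } "   # give every word its own leading and trailing space (assumes single spaces between words)
new_variable=""

while [[ -n "$temp" ]]; do
   word=${temp# }; word=${word%% *}                      # first remaining word
   new_variable="${new_variable:+$new_variable }$word"   # append it to the result
   temp=${temp//" $word "/}                              # drop every whole-word occurrence, including trailing ones
done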

echo $new_variable;

Solution 2

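This pipeline version preserves the original order by numbering each word, removing duplicate words, sorting by those numbers again, and finally joining the words back together with spaces:

variable=$(echo "$variable" | tr ' ' '\n' | nl | sort -u -k2 | sort -n | cut -f2- | paste -sd' ' -)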
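
To make the number / dedupe / renumber steps explicit, here is a hedged sketch of the same idea with an explicit tab delimiter. The names tab and deduped are mine, awk's NR stands in for nl, and sort -s (stable sort) is a widely supported extension rather than a POSIX option, so treat it as an assumption about your sort.

```shell
variable="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"
tab=$(printf '\t')

# 1. one word per line          (printf)
# 2. prefix each with its index (awk NR)
# 3. keep the first line for each distinct word (stable, unique sort on field 2)
# 4. restore the original order (numeric sort on the index)
# 5. strip the index and rejoin with spaces (cut, paste)
deduped=$(
  printf '%s\n' $variable |
    awk '{ print NR "\t" $0 }' |
    sort -t "$tab" -k2,2 -s -u |
    sort -n |
    cut -f2- |
    paste -sd' ' -
)

echo "$deduped"
```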

Solution 3

Pure Bash:

variable="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"

declare new_value=''

for item in $variable; do
  if [[ ! " $new_value " =~ " $item " ]] ; then   # first time? (padded so only whole words match)
    new_value="$new_value $item"
  fi
done
new_value=${new_value:1}                  # remove leading blank
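
If Bash 4 or later is available (an assumption; associative arrays do not exist in older versions), a lookup table avoids rescanning the result string for every word. This is only a sketch, and the array name seen is my own.

```shell
variable="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"

declare -A seen=()      # associative array: word -> 1 once the word has been emitted
new_value=""

for item in $variable; do
  if [[ -z ${seen[$item]} ]]; then                  # first time?
    seen[$item]=1
    new_value="${new_value:+$new_value }$item"      # append without a leading blank
  fi
done

echo "$new_value"
```

Since the lookup is keyed on the whole word, "apple" and "pineapple" are never confused.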

Solution 4

In pure, portable sh:

words="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"
seen=
for word in $words; do
  case $seen in
    $word\ * | *\ $word | *\ $word\ * | $word) 
      # already seen
      ;;
    *)
      seen="$seen $word"
      ;;
  esac
done
echo $seen
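
The same case-based test can be wrapped in a function. This is a rough sketch: dedupe_words is a hypothetical helper name, and padding both sides with a space lets a single quoted pattern do the whole-word check.

```shell
# dedupe_words: hypothetical helper, prints the words of $1 with later duplicates removed.
# $1 is expanded unquoted on purpose, so it is split on whitespace (and is subject to globbing).
dedupe_words() {
  seen=
  for word in $1; do
    case " $seen " in
      *" $word "*) ;;                        # already seen as a whole word
      *) seen="${seen:+$seen }$word" ;;      # first occurrence: append
    esac
  done
  printf '%s\n' "$seen"
}

result=$(dedupe_words "apple pineapple apple lemon pineapple")
echo "$result"
```

The input deliberately contains "apple" and "pineapple" to show that a word is only dropped when it matches a whole earlier word.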

Solution 5

shell

declare -a arr
variable="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"
set -- $variable
count=0
for c in $@
do
    flag=0
    for((i=0;i<=${#arr[@]}-1;i++))
    do
        if [ "${arr[$i]}" == "$c" ] ;then
            flag=1
            break
        fi
    done
    if  [ "$flag" -eq 0 ] ; then
        arr[$count]="$c"
        count=$((count+1))
    fi
done
for((i=0;i<=${#arr[@]}-1;i++))
do
   echo "result: ${arr[$i]}"
done

Result when run:

linux# ./myscript.sh
result: apple
result: lemon
result: papaya
result: avocado
result: grapes
result: mango
result: banana

OR if you want to use gawk

awk 'BEGIN{RS=ORS=" "} (!($0 in a) ){a[$0];print}'
Author by

user224178

Updated on June 13, 2022

Comments

  • user224178 almost 2 years

    I have a variable that contains the following space separated entries.

    variable="apple lemon papaya avocado lemon grapes papaya apple avocado mango banana"
    

    How do I remove the duplicates without sorting?

    #Something like this.
    new_variable="apple lemon papaya avocado grapes mango banana"
    

    I found a script that removes the duplicates from a variable, but it also sorts the contents.

    #Not something like this.
    new_variable=$(echo "$variable"|tr " " "\n"|sort|uniq|tr "\n" " ")
    echo $new_variable
    apple avocado banana grapes lemon mango papaya
    
  • jhwist over 14 years
    Sweet :) Thanks for the explanation.
  • Mark Edgar over 14 years
    Simply testing for membership is better than counting: awk 'BEGIN{RS=ORS=" "} { if (!($0 in a)) { a[$0]; print } }' Or more tersely: awk 'BEGIN{RS=ORS=" "} !($0 in a || a[$0])'
  • SiegeX over 14 years
    Good solution, but note that this locks you into Bash 3.X due to the '=~' operator.
  • SiegeX over 14 years
    @Mark: Doing a 'time' over a loop of 10,000 iterations shows that yours is just over 3% slower. Not very much but nonetheless, not better. This difference will only become larger as the number of elements grows since your version takes O(n) time while mine is always a constant O(1).
  • MiloDC over 5 years
    This is the only solution here that worked for me. The awk solution still had duplicates. Thanks.
  • Gregg over 3 years
    Really nice solution, thanks. Except if duplicates are found at the end, one after the other then it doesn't work. Ex: variable="apple lemon papaya papaya" prints: apple lemon papaya papaya. Whereas if I have: variable="apple lemon papaya papaya mango" then it removes the duplicate papaya and prints: apple lemon papaya mango. Thoughts?
  • Gregg over 3 years
    Found the following solution which helped with the problem outlined in my previous comment: stackoverflow.com/questions/46185241/… Thank you for sharing your solution.