How do I remove words ending in the letter S if duplicates without the S are on the same list?

shell-script text-processing awk sed perl


Solution 1

Using awk and reading the file twice: on the first pass, store every line with an s appended as a key in an array. On the second pass, check each line against the array and print it only if it is not a key.

awk 'FNR==NR{a[$0 "s"]++;next}!($0 in a)' file.txt file.txt

To use a little less memory you could also do

awk 'FNR==NR{!/s$/ && a[$0 "s"]++;next}!($0 in a)' file.txt file.txt
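For readers who find the one-liner dense, here is a commented sketch of the same two-pass idea wrapped in a shell function. The function name `drop_plural_dupes` and the array name `seen` are made up for illustration; the logic is meant to match the first command above.

```shell
#!/bin/sh
# drop_plural_dupes is a hypothetical helper name; $1 is the word list.
drop_plural_dupes() {
    awk '
        # First pass: FNR == NR holds only while the first file argument
        # is being read. Record every line with "s" appended as a key.
        FNR == NR { seen[$0 "s"] = 1; next }
        # Second pass: print a line only if it is not some "word + s".
        !($0 in seen)
    ' "$1" "$1"
}

# Demo on the sample list from the question.
list=$(mktemp)
printf '%s\n' frog dogs cats cat dog frogs catfish octopus > "$list"
drop_plural_dupes "$list"   # prints: frog cat dog catfish octopus (one per line)
rm -f "$list"
```

Because the file is read twice, the input has to be a regular file rather than a pipe, and the original order of the remaining words is kept.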

Solution 2

You can do this in several ways. The simplest is to sort the data, so that each singular form comes before its plural, and record the plural of every word seen so far:

sort foo |awk '{ if ( plural[$1] == "" ) print; plural[$1 "s"] = 1; }'

Given input

frog
dogs
cats
catfish
cat
dog
frogs

output

cat
catfish
dog
frog

Without sorting:

#!/bin/sh
awk 'BEGIN { count=0; }
{
        words[count++] = $1;
        plurals[$1 "s"] = $1;
}
END {
        for ( n = 0; n < count; ++n) {
                if ( plurals[words[n]] == "")
                        print words[n];
        }
}
' <foo

Output:

frog
catfish
cat
dog

Solution 3

Using a bash script:

#!/bin/bash

readarray -t mylist

# compare each item on the list with a new list created by appending `s'
# to each item of the original list

for i in "${mylist[@]}"; do
  for j in "${mylist[@]/%/s}"; do
    [[ "$i" == "$j" ]] && continue 2
  done
  echo "$i"
done

The list is read from stdin. Here is a test run:

$ cat file1
frog
dogs
cats
cat
dog
frogs
catfish
$ ./remove-s.sh < file1 
frog
cat
dog
catfish

Solution 4

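This is a simplified solution using awk. It does not preserve the order of the words, and it strips a trailing s from every word, so a word ending in s with no singular counterpart (octopus, for example) is printed without its s: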

    {
        len = length($1);
        prefix = $1;
        if (substr($1, len) == "s") {
            prefix = substr($1, 1, len - 1);
        }
        if (prefix in data) {
            next;
        } else {
            print prefix;
            data[prefix] = 1;
        }
    }

If it is essential to preserve the order of words, then you will have to keep all lines in memory and process the list after the entire file has been read.

{
    line[FNR] = $0;
    len = length($1);
    if (substr($1, len) == "s") {
        prefix = substr($1, 1, len - 1);
        if (prefix in data) {
            line[FNR] = "";
            next;
        } else {
            data[prefix] = FNR;
        }
    } else {
        num = data[$1];
        if (num) {
            line[num] = "";
        } else {
            data[$1] = FNR;
        }
    }
}

END {
    for (i = 1; i <= FNR; i++) {
        if (line[i]) {
            print line[i];
        }
    }
}

Solution 5

With excessive use of grep's -f (obtain patterns from file) option:

grep 's$' input       | # output: all lines ending with s 
  sed -e 's/s$//'     | # those same entries, minus the s
  grep -F -x -f input | # the entries whose plurals appear
  sed -e 's/$/s/'     | # the plurals to remove
  grep -F -x -v -f - input
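A comment on this answer points out that the first grep is not needed, since `sed -n 's/s$//p'` already selects the lines ending in s. The sketch below applies that one simplification and wraps the pipeline in a function that takes the file name as an argument. The function name `remove_plurals_grep` is hypothetical, and the sketch assumes, as the original pipeline does, a grep that accepts `-f -` to read patterns from standard input.

```shell
#!/bin/sh
# remove_plurals_grep is a hypothetical wrapper name; $1 is the word list.
remove_plurals_grep() {
    sed -n 's/s$//p' "$1" |   # words ending in s, printed without the s
      grep -F -x -f "$1" |    # keep those whose singular is also on the list
      sed -e 's/$/s/' |       # put the s back: these are the lines to remove
      grep -F -x -v -f - "$1" # print the list minus those lines
}

# Demo on the sample list from the question.
list=$(mktemp)
printf '%s\n' frog dogs cats cat dog frogs catfish octopus > "$list"
remove_plurals_grep "$list"   # prints: frog cat dog catfish octopus (one per line)
rm -f "$list"
```

The `-F -x` pair makes grep compare whole lines as fixed strings, so `cat` does not match `catfish`.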


J363

Hello, I have been using Linux and BSD at home almost exclusively for more than a decade. I also work with Unix-like operating systems professionally. I enjoy working on physical projects, like home improvement, working on cars, and generally tinkering with things. When I was a kid I used to take things apart and put them back together again. I enjoy solving puzzles and learning new information.

Updated on September 18, 2022

Comments

J363 over 1 year
I have a large list of words. Many of the words are only different because they have the letter s on the end. If a word on the list is the exact same as another word on the list, except one of the words ends with the letter s, I would like to remove the duplicate word that ends in s. I would also like to accomplish this without having to sort the list so that I can maintain the current position of the words.

example input:
```
frog
dogs
cats
cat
dog
frogs
catfish
octopus
```
example output:
```
frog
cat
dog
catfish
octopus
```
- 123 almost 8 years
  
  Do you want to keep lines that are plural but don't have a singular counterpart ?
- J363 almost 8 years
  
  I'd like to preserve the current order and eliminate the duplicate words ending in s, keeping the singular form. If a word ends in s, I would like to check whether the same word without the s exists elsewhere on the list. If it does, I'd like to eliminate the word ending in s.
- Chris H almost 8 years
  
  @123 how would you handle an octopus?
- Angel Todorov almost 8 years
  
  @J363, you didn't answer 123's question: add "horses" to this list without "horse" -- should you output "horses"?
- Angel Todorov almost 8 years
  
  According to your sample data and criteria, sed '/s$/d' would work.
- J363 almost 8 years
  
  @glennjackman in the event that horse is not on the list, horses should remain, for the exact same situation Chris H is postulating.
123 almost 8 years

Might want to mention the language used.
123 almost 8 years

What about if another animal has the same start of the name? E.g cat and catfish.
Michael Vehrs almost 8 years

@123 Silly me...
J363 almost 8 years

When I pass these commands I seem to be getting no output? I'm not sure if I'm doing something wrong on my end.
don_crissti almost 8 years

You don't need the first grep and you can avoid the second sed via paste (which is faster than any regex engine on the planet) e.g. sed -n 's/s$//p' infile | grep -Fxf infile | paste -d s - /dev/null | grep -vFxf- infile or, alternatively, sed '/s$/!s/$/s/' infile | sort | uniq -d | grep -vFxf- infile
Wildcard almost 8 years

@J363, what if you use file.txt file.txt in place of test{,}? (By the way, 123, I like brace expansion a lot but I usually omit it in answers in favor of better clarity and readability. Just a suggestion.)
123 almost 8 years

@Wildcard I have edited, thanks for the tip.
J363 almost 8 years

@123 This works wonderfully, thank you again. Is there any way you can tell me how to get the inverse output so I can see what your tool is going to remove before I go ahead and finalize the change by writing to a new file? I've been sorting and using comm to compare files but doing so is extremely inefficient.
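One possible way to preview the removals, offered as a sketch and not as the answerer's reply: drop the negation from the second pass of the Solution 1 command, so that it prints only the lines that are found in the array. The function name `show_removed` is hypothetical.

```shell
#!/bin/sh
# show_removed is a hypothetical helper name; $1 is the word list.
# It prints the lines that the Solution 1 command would drop.
show_removed() {
    awk '
        # First pass: record every line with "s" appended.
        FNR == NR { a[$0 "s"]++; next }
        # Second pass: print only lines that ARE a recorded "word + s".
        ($0 in a)
    ' "$1" "$1"
}

# Demo on the sample list from the question.
list=$(mktemp)
printf '%s\n' frog dogs cats cat dog frogs catfish octopus > "$list"
show_removed "$list"   # prints: dogs cats frogs (one per line)
rm -f "$list"
```

This reads the list directly in its original order, so no sorting or `comm` step is involved.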