How do I remove words ending in the letter S if duplicates without an S on the end are on the same list?
Solution 1
Using awk and reading the file twice. On the first pass, save every line in an array with an s appended. On the second pass, check each line against the array and print it only if it is not in the array.
awk 'FNR==NR{a[$0 "s"]++;next}!($0 in a)' file.txt file.txt
To use a little less memory, you could store only the lines that do not already end in s:
awk 'FNR==NR{!/s$/ && a[$0 "s"]++;next}!($0 in a)' file.txt file.txt
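To preview which lines would be dropped before writing a new file, the final condition can be inverted. This is a minimal sketch based on the first command above; the function name show_removed is made up for illustration, and the argument must be a regular file because it is read twice.

```shell
# show_removed FILE: print the lines the two-pass awk command would drop,
# i.e. lines that equal some other line of FILE with an "s" appended.
show_removed() {
    awk 'FNR==NR { a[$0 "s"]++; next } ($0 in a)' "$1" "$1"
}
```

Calling show_removed file.txt lists only the plurals that have a singular counterpart; everything else is what the original command keeps.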
Solution 2
You can do this in several ways. The simplest is to sort the data, so that each singular comes before its plural, and skip any word that has already been recorded as a plural:
sort foo |awk '{ if ( plural[$1] == "" ) print; plural[$1 "s"] = 1; }'
Given the input:
frog
dogs
cats
catfish
cat
dog
frogs
the output is:
cat
catfish
dog
frog
Without sorting:
#!/bin/sh
awk 'BEGIN { count=0; }
{
    words[count++] = $1;
    plurals[$1 "s"] = $1;
}
END {
    for ( n = 0; n < count; ++n) {
        if ( plurals[words[n]] == "")
            print words[n];
    }
}
' <foo
Output:
frog
catfish
cat
dog
Solution 3
Using a bash script:
#!/bin/bash
readarray -t mylist
# compare each item on the list with a new list created by appending `s'
# to each item of the original list
for i in "${mylist[@]}"; do
    for j in "${mylist[@]/%/s}"; do
        [[ "$i" == "$j" ]] && continue 2
    done
    echo "$i"
done
The list is read from stdin. Here is a test run:
$ cat file1
frog
dogs
cats
cat
dog
frogs
catfish
$ ./remove-s.sh < file1
frog
cat
dog
catfish
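The nested loop compares every word with every other word, which can get slow on a large list. A possible variation is sketched below, assuming bash 4 or later (for associative arrays and readarray). The function name remove_plurals is hypothetical, and blank lines are simply passed through.

```shell
#!/bin/bash
# remove_plurals: read words from stdin, one per line, and print them in
# their original order, skipping any word that is another listed word + "s".
remove_plurals() {
    local -a words
    local -A seen
    local w stem
    readarray -t words
    # first pass: remember every non-empty word on the list
    for w in "${words[@]}"; do
        [[ -n $w ]] && seen[$w]=1
    done
    # second pass: drop a word ending in s when its stem is also listed
    for w in "${words[@]}"; do
        stem=${w%s}
        if [[ $w == ?*s && -n ${seen[$stem]:-} ]]; then
            continue
        fi
        printf '%s\n' "$w"
    done
}
```

It reads from stdin like the script above, for example remove_plurals < file1.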
Solution 4
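This is a simplified solution using awk, which does not preserve the order of words. Note that it strips the trailing s from every word it prints, so a plural with no singular on the list (e.g. horses) comes out as horse: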
{
    len = length($1);
    prefix = $1;
    if (substr($1, len) == "s") {
        prefix = substr($1, 1, len - 1);
    }
    if (prefix in data) {
        next;
    } else {
        print prefix;
        data[prefix] = 1;
    }
}
If it is essential to preserve the order of words, then you will have to keep all lines in memory and process the list after the entire file has been read.
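{
    line[FNR] = $0;
    len = length($1);
    if (substr($1, len) == "s") {
        # word ends in s: drop it if its stem has been seen already
        prefix = substr($1, 1, len - 1);
        if (prefix in data) {
            line[FNR] = "";
            next;
        } else {
            data[prefix] = FNR;
        }
    } else {
        # singular: blank out a plural recorded earlier, if any
        num = data[$1];
        if (num) {
            line[num] = "";
        } else {
            data[$1] = FNR;
        }
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        # compare against "" so that a line consisting of 0 is still printed
        if (line[i] != "") {
            print line[i];
        }
    }
}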
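For comparison, the keep-everything-in-memory idea can also be written without tracking a line number per stem. This is a minimal sketch, assuming one word per line; the function name drop_plurals is made up here and is not part of the answer above.

```shell
# drop_plurals [FILE...]: print the input in its original order, omitting
# every line that ends in "s" when the same line without that "s" is present.
drop_plurals() {
    awk '
    {
        line[NR] = $0      # remember every line in input order
        seen[$0] = 1       # and the set of distinct lines
    }
    END {
        for (i = 1; i <= NR; i++) {
            w = line[i]
            stem = substr(w, 1, length(w) - 1)
            # skip a plural only when its singular is on the list
            if (w ~ /s$/ && (stem in seen))
                continue
            print w
        }
    }' "$@"
}
```

Run it as drop_plurals file.txt, or pipe the list into it. Because the decision is made only after the whole input has been read, a plural is dropped no matter whether its singular comes before or after it, and repeated words are left as they are.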
Solution 5
With excessive use of grep's -f (obtain patterns from file) option:
grep 's$' input | # output: all lines ending with s
sed -e 's/s$//' | # those same entries, minus the s
grep -F -x -f input | # the entries whose plurals appear
sed -e 's/$/s/' | # the plurals to remove
grep -F -x -v -f - input
J363
Updated on September 18, 2022
Comments
-
J363 over 1 year
I have a large list of words. Many of the words are only different because they have the letter s on the end. If a word on the list is the exact same as another word on the list, except one of the words ends with the letter s, I would like to remove the duplicate word that ends in s. I would also like to accomplish this without having to sort the list so that I can maintain the current position of the words.
example input:
frog dogs cats cat dog frogs catfish octopus
example output:
frog cat dog catfish octopus
-
123 almost 8 years
Do you want to keep lines that are plural but don't have a singular counterpart?
-
J363 almost 8 years
I'd like to preserve the current sort order and eliminate the words ending in s. I'd like to keep the singular form. If there is a word ending in s, I would like to check whether the same word without the s exists elsewhere on the list. If it does, I'd like to eliminate the word ending in s.
-
Chris H almost 8 years
@123 how would you handle an octopus?
-
Angel Todorov almost 8 years
@J363, you didn't answer 123's question: add "horses" to this list without "horse" -- should you output "horses"?
-
Angel Todorov almost 8 years
According to your sample data and criteria, sed '/s$/d' would work.
-
J363 almost 8 years
@glennjackman in the event that horse is not on the list, horses should remain, for the exact same situation Chris H is postulating.
-
123 almost 8 years
Might want to mention the language used.
-
123 almost 8 years
What if another animal's name starts the same way, e.g. cat and catfish?
-
Michael Vehrs almost 8 years
@123 Silly me...
-
J363 almost 8 years
When I run these commands I get no output. I'm not sure if I'm doing something wrong on my end.
-
don_crissti almost 8 years
You don't need the first grep, and you can avoid the second sed via paste (which is faster than any regex engine on the planet), e.g.
sed -n 's/s$//p' infile | grep -Fxf infile | paste -d s - /dev/null | grep -vFxf- infile
or, alternatively,
sed '/s$/!s/$/s/' infile | sort | uniq -d | grep -vFxf- infile
-
Wildcard almost 8 years
@J363, what if you use file.txt file.txt in place of test{,}? (By the way, 123, I like brace expansion a lot but I usually omit it in answers in favor of better clarity and readability. Just a suggestion.)
-
123 almost 8 years
@Wildcard I have edited, thanks for the tip.
-
J363 almost 8 years
@123 This works wonderfully, thank you again. Is there any way you can tell me how to get the inverse output so I can see what your tool is going to remove before I go ahead and finalize the change by writing to a new file? I've been sorting and using comm to compare files but doing so is extremely inefficient.