Pattern Matching Exclude Duplicate Characters
Solution 1
With regular expressions in the mathematical sense, it's possible, but the size of the regular expressions grows exponentially relative to the size of the alphabet, so it isn't practical.
There's a simple way with negation and backreferences.
grep '[spine]' | grep -Ev '([spine]).*\1'
The first grep selects lines that contain at least one of the characters einps; the second grep rejects lines in which any of those characters occurs more than once. Together they allow, for example, spinal and spend, but not foobar (none of the characters) or see (e twice).
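To avoid retyping the character set, both patterns can be built from one variable. The following is a minimal sketch, not part of the original answer: the function name match_once and its arguments are made up for illustration, and it assumes the set contains no characters that are special inside a bracket expression (such as ], ^ or -) and a grep that accepts backreferences with -E, as GNU grep does.

```shell
# match_once CHARS [FILE...]: print lines that contain at least one character
# from CHARS and no character from CHARS more than once.
# (hypothetical helper; CHARS must not contain ], ^, - or \)
match_once() {
  chars=$1
  shift
  # First grep: at least one character of the set is present.
  # Second grep: drop lines where one of them appears a second time.
  grep "[$chars]" "$@" | grep -Ev "([$chars]).*\1"
}

printf '%s\n' spine spines spin pine seep spins | match_once spine
```

With the sample input from the question this should print spine, spin and pine, one per line.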
Solution 2
Inspired by your expression, I can come up with a shorter one, using egrep:
egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE
which is equivalent to
sed '/s.*s/d;/p.*p/d;/i.*i/d;/n.*n/d;/e.*e/d' FILE
And this is how to produce the sed-command from the input automatically:
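#!/bin/bash
# Usage: script WORD FILE
word=$1
file=$2
# Build one "/c.*c/d;" delete command for each character c of the word.
expr=$(for c in $(echo "$word" | sed 's/./& /g'); do printf '/%s.*%s/d;' "$c" "$c"; done)
# Quote the expansions so the shell neither splits nor globs them.
sed "$expr" "$file"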
I tried a similar approach with grep, but couldn't convince the shell to take the grep pattern from a variable. When I echoed the command and pasted the result back in by hand, it worked:
expr="'("$(for c in $(echo $word | sed 's/./& /g'); do echo -n $c".*"$c"|"; done)
egrep -v ${expr/%|/)\'} FILE
# doesn't work, filters nothing, whole file is printed
# check:
echo egrep -v ${expr/%|/)\'} FILE
egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE
# manually:
egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE
spine
spin
pine
Maybe I made a mistake with variable expansion.
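A likely explanation is that quote characters stored inside a variable are not treated as shell quoting when the variable is expanded, so grep receives the apostrophes as literal parts of the pattern. No line contains them, nothing matches, and -v prints the whole file. The sketch below keeps the quotes out of the variable and quotes the expansion instead. The helper name build_pattern is hypothetical, and the word is assumed to contain only letters.

```shell
# build_pattern WORD: print "c.*c" for each character c of WORD, joined by "|".
# (hypothetical helper; WORD is assumed to contain only letters)
build_pattern() {
  printf '%s\n' "$1" | sed 's/./&.*&|/g; s/|$//'
}

pattern=$(build_pattern spine)     # s.*s|p.*p|i.*i|n.*n|e.*e
# The double quotes protect the pattern from the shell; they are not part of it.
printf '%s\n' spine spines spin pine seep spins | grep -Ev "$pattern"
```

Here the outer parentheses are dropped as well, since the alternation already spans the whole pattern.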
Solution 3
Here's a non-regex way of doing it without knowing ahead of time what the string is. Not saying this is the most efficient but it was fast enough for my needs.
$ (echo a;echo abc;echo aabc;echo def;echo two words;echo one pair) | awk '
> {
> split($0,a,"");
> n=asort(a);
> for(i=1;i<=n;i++){
> if(a[i]==a[i+1]){
> next
> }
> }
> }
> n'
a
abc
def
one pair
What this does is split each line $0 into an array a of single characters and then sort that array in place, returning the length n of the array. It then iterates through the sorted array and skips to the next line if two adjacent characters are the same. If it makes it all the way through the line, it prints the (whole) input line. Note that a line of three words or more will always fail to print, because the space character is repeated.
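Since asort is a gawk extension, the same check can also be sketched without sorting, by counting the characters seen so far on each line. This is an alternative technique rather than the method above. The function name no_repeats is made up for illustration, and the sketch assumes single-byte characters. Unlike the original, it also prints empty lines.

```shell
# no_repeats [FILE...]: print only lines in which no character occurs twice.
# (hypothetical helper; uses a "seen" table instead of gawk's asort)
no_repeats() {
  awk '{
    split("", seen)                  # portable way to empty the table per line
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (seen[c]++) next            # second occurrence: skip this line
    }
    print
  }' "$@"
}

printf '%s\n' a abc aabc def 'two words' 'one pair' | no_repeats
```

On the sample input this prints a, abc, def and one pair, matching the output of the gawk version.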
Example - find all five letter words without a repeated character:
$ grep '^.....$' /usr/share/dict/words | tr '[A-Z]' '[a-z]' | awk '{split($1,a,"");n=asort(a);for(i=1;i<=n;i++){if(a[i]==a[i+1]){next}}}n' | head -5
abhor
abide
abies
abilo
abler
Comments
-
Steven over 1 year
Is there a regular expression for the following that matches characters in a character set but only once? In other words, once a character is found, remove it from the set.
If grep cannot do this, is there a built-in utility which can?
Example:
Characters to match only once: spine
Input:
spine spines spin pine seep spins
Output:
spine spin pine
EDIT:
There are many ways to achieve this output (one example below), but I'm looking for a way to do this without having to customize the command for each pattern I want to match.
grep '[spine]' input_file | grep -v 's.*s' | ... | grep -v 'e.*e'
-
text almost 13 years
Question: What is the application for this?
-
Steven almost 13 years
See my edited post for desired output. Also, I'm looking for a solution which doesn't require a complex, tedious, pattern-specific command.
-
user unknown almost 13 years
Yes, I see. Maybe I can find a way to produce the sed-command from the word 'spine'.
-
user unknown almost 13 years
Finally found out how to solve it with sed - is that acceptable?
-
D Mac over 2 years
Nice, general solution. It needs gawk for the asort function, btw. Regular awk (at least on macOS Monterey) doesn't have asort.