Pattern Matching Exclude Duplicate Characters

grep regular-expression patterns

5,471

Solution 1

With regular expressions in the mathematical sense, it's possible, but the size of the regular expressions grows exponentially relative to the size of the alphabet, so it isn't practical.

There's a simple way with negation and backreferences.

grep '[spine]' | grep -Ev '([spine]).*\1'

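The first grep selects lines that contain at least one of the letters e, i, n, p, s; the second grep rejects lines in which any one of those letters occurs more than once. For example, spinal and spend pass, foobar is dropped by the first grep because it contains none of the letters, and see is dropped by the second because it repeats e.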
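To avoid retyping the character set, the two-stage pipeline can be wrapped in a small function. This is only a sketch: `once_only` is a made-up name, the set is assumed to consist of plain letters (nothing special inside a bracket expression, such as `]`, `^` or `-`), and grep is assumed to accept backreferences with `-E`, as GNU grep does.

```shell
# Hypothetical helper: $1 is the set of characters, input comes from stdin.
once_only() {
    # Stage 1: keep lines containing at least one character of the set.
    # Stage 2: drop lines where any character of the set occurs twice.
    grep "[$1]" | grep -Ev "([$1]).*\\1"
}

printf '%s\n' spine spines spin pine seep spins | once_only spine
```

Run on the sample input from the question, this should print spine, spin and pine.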

Solution 2

Inspired by your expression, I can come up with a shorter one, using egrep:

egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE

which is equivalent to

sed '/s.*s/d;/p.*p/d;/i.*i/d;/n.*n/d;/e.*e/d' FILE

And this is how to produce the sed-command from the input automatically:

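#!/bin/bash
# Usage: script WORD FILE
word=$1
file=$2
# Build one "/c.*c/d;" delete command for each character c of the word.
expr=$(for c in $(echo "$word" | sed 's/./& /g'); do printf '/%s.*%s/d;' "$c" "$c"; done)
# Quote both expansions so the shell passes them to sed unchanged.
sed "$expr" "$file"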
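The same sed program can also be generated without a loop, by letting sed rewrite each character of the word into its own delete command. A minimal sketch, assuming the word contains only plain letters; `make_sed_expr` is a hypothetical helper name.

```shell
# Hypothetical helper: turn each character c of $1 into the sed command /c.*c/d;
make_sed_expr() {
    printf '%s\n' "$1" | sed 's/./\/&.*&\/d;/g'
}

expr=$(make_sed_expr spine)
printf '%s\n' "$expr"    # show the generated sed program
printf '%s\n' spine spines spin pine seep spins | sed "$expr"
```

The generated program is passed to sed as a single quoted argument, so the shell never sees the semicolons or asterisks in it.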

I tried a similar approach with grep, but couldn't get the shell to take the grep pattern from a variable. When I echoed the command and pasted the result back in, it worked:

expr="'("$(for c in $(echo $word | sed 's/./& /g'); do echo -n $c".*"$c"|"; done)

egrep -v ${expr/%|/)\'} FILE
# doesn't work, filters nothing, whole file is printed
# check:    
echo egrep -v $(echo $exp) FILE 
egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE
# manually: 
egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE
spine
spin
pine

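I probably made a mistake with variable expansion: the single quotes stored in the variable reach egrep as literal pattern characters instead of being removed by the shell, so nothing matches and -v prints every line.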
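One way to take the grep pattern from a variable is to keep quote characters out of the variable's value and quote the expansion instead. The following is a sketch under that assumption, not a tested replacement for the script above; `build_pattern` is a hypothetical helper name and the word is assumed to contain only plain letters.

```shell
# Hypothetical helper: build "s.*s|p.*p|..." from the word given as $1.
build_pattern() {
    word=$1
    pattern=
    while [ -n "$word" ]; do
        rest=${word#?}        # the word without its first character
        c=${word%"$rest"}     # the first character
        pattern="${pattern:+$pattern|}$c.*$c"
        word=$rest
    done
    printf '%s\n' "$pattern"
}

pattern=$(build_pattern spine)
# No quotes inside the value; the double quotes here are shell syntax only.
printf '%s\n' spine spines spin pine seep spins | grep -Ev "$pattern"
```

Because the alternation needs no surrounding parentheses, the variable holds nothing but the expression itself.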

Solution 3

Here's a non-regex way of doing it without knowing ahead of time what the string is. Not saying this is the most efficient but it was fast enough for my needs.

$ (echo a;echo abc;echo aabc;echo def;echo two words;echo one pair) | awk '
>   {
>     split($0,a,"");
>     n=asort(a);
>     for(i=1;i<=n;i++){
>       if(a[i]==a[i+1]){
>         next
>       }
>     }
>   }
>   n'
a
abc
def
one pair

This splits each line $0 into an array a of single characters, then sorts that array in place with asort (a gawk extension), which returns the length n of the array. It then iterates through the sorted array and skips to the next line as soon as two adjacent elements are the same. If it makes it all the way through, the final pattern n, which is non-zero for any non-empty line, prints the whole input line. Note that a line of three or more words will always fail to print because the space character is repeated.

Example - find all five letter words without a repeated character:

$ grep '^.....$' /usr/share/dict/words | tr '[A-Z]' '[a-z]' | awk '{split($1,a,"");n=asort(a);for(i=1;i<=n;i++){if(a[i]==a[i+1]){next}}}n' | head -5
abhor
abide
abies
abilo
abler
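The asort function used above is specific to gawk. Where only a plain awk is available, the same check can be done with a different technique: instead of sorting, remember each character already seen on the line. A minimal sketch, assuming any POSIX awk; `no_repeats` is a hypothetical function name.

```shell
# Hypothetical helper: print only input lines in which no character repeats.
no_repeats() {
    awk '{
        split("", seen)               # clear the per-line set of characters
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c in seen) next       # repeated character: skip this line
            seen[c] = 1
        }
    }
    n'                                # print any non-empty line that got this far
}

printf '%s\n' a abc aabc def 'two words' 'one pair' | no_repeats
```

On the same sample input as the gawk version, this should print a, abc, def and one pair.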


Steven


Updated on September 18, 2022

Comments

Steven over 1 year
Is there a regular expression for the following that matches characters in a character set but only once? In other words, once a character is found, remove it from the set.

If grep cannot do this, is there a built-in utility which can?

Example:
```
Characters to match only once:   spine
```
Input:
```
spine
spines
spin
pine
seep 
spins
```
Output:
```
spine
spin
pine
```
EDIT:
There are many ways to achieve this output (one example below), but I'm looking for a way to do this without having to customize the command for each pattern I want to match.

grep '[spine]' input_file | grep -v 's.*s' | ... | grep -v 'e.*e'
- text almost 13 years
  
  Question: What is the application for this?
Steven almost 13 years

See my edited post for desired output. Also, I'm looking for a solution which doesn't require a complex, tedious, pattern-specific command.
user unknown almost 13 years

Yes, I see. Maybe I can find a way to produce the sed command from the word 'spine'.
user unknown almost 13 years

Finally found out how to solve it with sed - is that acceptable?
D Mac over 2 years

Nice, general solution. It needs gawk for the asort function, btw. Regular awk (at least on macOS Monterey) doesn't have asort.