Pattern Matching Exclude Duplicate Characters

5,471

Solution 1

With regular expressions in the mathematical sense, it's possible, but the size of the regular expressions grows exponentially relative to the size of the alphabet, so it isn't practical.

There's a simple way with negation and backreferences.

grep '[spine]' | grep -Ev '([spine]).*\1'

The first grep selects lines that contain at least one of einps; the second grep rejects lines that contain any of those characters more than once (e.g. allowing spinal and spend, but not foobar, which contains none of them, or see, which repeats e).
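To avoid retyping the character set, the two greps can be wrapped in a small function. This is only a sketch: the name once_only is made up for illustration, it assumes a grep that accepts backreferences with -E (GNU grep does), and it assumes the set contains no characters that are special inside a bracket expression, such as ], ^ or -.

```shell
# once_only SET: keep lines that contain at least one character of SET
# and do not contain any character of SET twice.
# (Hypothetical helper name; SET must not contain ], ^ or -.)
once_only() {
  grep "[$1]" | grep -Ev "([$1]).*\\1"
}

# Sample input from the question.
printf '%s\n' spine spines spin pine seep spins | once_only spine
```

With the question's sample input this should print spine, spin and pine.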

Solution 2

Inspired by your expression, I can come up with a shorter one, using egrep:

egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE

which is equivalent to

sed '/s.*s/d;/p.*p/d;/i.*i/d;/n.*n/d;/e.*e/d' FILE

And this is how to produce the sed-command from the input automatically:

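#!/bin/bash
# Usage: SCRIPT WORD FILE
word=$1
file=$2
# Build one "/c.*c/d;" delete command for each character c of the word.
expr=$(for c in $(echo "$word" | sed 's/./& /g'); do printf '/%s.*%s/d;' "$c" "$c"; done)
# Quote both expansions so the shell neither globs nor splits them.
sed "$expr" "$file"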

I tried a similar approach with grep, but couldn't get the shell to take the grep pattern from a variable. When I echoed the command and pasted the result back in by hand, it worked:

expr="'("$(for c in $(echo $word | sed 's/./& /g'); do echo -n $c".*"$c"|"; done)

egrep -v ${expr/%|/)\'} FILE
# doesn't work, filters nothing, whole file is printed
# check:    
echo egrep -v $(echo $exp) FILE 
egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE
# manually: 
egrep -v '(s.*s|p.*p|i.*i|n.*n|e.*e)' FILE
spine
spin
pine

Maybe I made a mistake with the variable expansion.
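One likely explanation for the failed attempt above: quote characters stored inside a variable are not shell syntax after expansion, so grep receives them as literal apostrophes at both ends of the pattern, no line matches, and -v prints everything. A minimal sketch that builds the same alternation without embedding quotes follows. The function name build_pattern is invented for illustration, and it assumes an ASCII word with no regex metacharacters.

```shell
# build_pattern WORD: print "s.*s|p.*p|..." for the characters of WORD.
# (Hypothetical helper; assumes ASCII input with no regex metacharacters.)
build_pattern() {
  rest=$1
  pattern=
  while [ -n "$rest" ]; do
    c=${rest%"${rest#?}"}   # first character of what is left
    rest=${rest#?}          # drop that character
    pattern="${pattern:+$pattern|}$c.*$c"
  done
  printf '%s\n' "$pattern"
}

pattern=$(build_pattern spine)
# The double quotes are shell syntax here; they never reach grep.
printf '%s\n' spine spines spin pine seep spins | grep -Ev "$pattern"
```

Like the egrep command above, this only rejects lines with a repeated character; it does not require that a line contain any character of the word.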

Solution 3

Here's a non-regex way of doing it without knowing ahead of time what the string is. Not saying this is the most efficient but it was fast enough for my needs.

$ (echo a;echo abc;echo aabc;echo def;echo two words;echo one pair) | awk '
>   {
>     split($0,a,"");
>     n=asort(a);
>     for(i=1;i<=n;i++){
>       if(a[i]==a[i+1]){
>         next
>       }
>     }
>   }
>   n'
a
abc
def
one pair

This splits each line $0 into an array a of single characters, then sorts that array in place with asort (a gawk extension), which returns the array length n. It then walks the sorted array and skips to the next line as soon as two adjacent elements are equal. If it gets all the way through, the final pattern n is non-zero and the whole input line is printed. Note that a line of three or more words never prints, because its spaces count as a repeated character.
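For an awk without asort, the same check can be done with a table of characters already seen instead of sorting. This is a sketch of a swapped-in technique, not the method above: the function name no_repeats is made up, and it reads characters with substr rather than relying on split with an empty separator.

```shell
# no_repeats: print input lines in which no character occurs twice.
# (Hypothetical helper; uses a "seen" table instead of gawk's asort.)
no_repeats() {
  awk 'length($0) > 0 {
    split("", seen)                 # clear the table for this line
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (seen[c]++) next           # second occurrence: skip the line
    }
    print
  }'
}

printf '%s\n' a abc aabc def 'two words' 'one pair' | no_repeats
```

On the sample lines used above this should print a, abc, def and one pair, and like the original it skips empty lines.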

Example - find all five letter words without a repeated character:

$ grep '^.....$' /usr/share/dict/words | tr '[A-Z]' '[a-z]' | awk '{split($1,a,"");n=asort(a);for(i=1;i<=n;i++){if(a[i]==a[i+1]){next}}}n' | head -5
abhor
abide
abies
abilo
abler

Steven
Author by

Steven

...

Updated on September 18, 2022

Comments

  • Steven
    Steven over 1 year

    Is there a regular expression for the following that matches characters in a character set but only once? In other words, once a character is found, remove it from the set.

    If grep cannot do this, is there a built-in utility which can?

    Example:

    Characters to match only once:   spine
    

    Input:

    spine
    spines
    spin
    pine
    seep 
    spins
    

    Output:

    spine
    spin
    pine
    

    EDIT:
    There are many ways to achieve this output (one example below), but I'm looking for a way to do this without having to customize the command for each pattern I want to match.

    grep '[spine]' input_file | grep -v 's.*s' | ... | grep -v 'e.*e'

    • text
      text almost 13 years
      Question: What is the application for this?
  • Steven
    Steven almost 13 years
    See my edited post for desired output. Also, I'm looking for a solution which doesn't require a complex, tedious, pattern-specific command.
  • user unknown
    user unknown almost 13 years
    Yes, I see. Maybe I can find a way to produce the sed command from the word 'spine'.
  • user unknown
    user unknown almost 13 years
    Finally found out how to solve it with sed - is that acceptable?
  • D Mac
    D Mac over 2 years
    Nice, general solution. It needs gawk for the asort function, btw. Regular awk (at least on macOS Monterey) doesn't have asort.