extract word with regular expression

18,009

Solution 1

Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:

qr{ (       # begin group 
      \d+   # at least one digit
      /     # followed by a slash
     (\w+)  # followed by at least one word characters
     ,?     # maybe a comma
    )*      # ANY number of repetitions of this pattern.
}x;

'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.

Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.

When I do this:

my $str   = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my @matches = $str =~ /$regex/g ) { 
    print Dumper( \@matches );
}

I get this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB',
          '23/33',
          '33',
          '55/66',
          '66'
        ];

Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.

So, if this is an issue, you can change your regex to the equivalent: qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:

$VAR1 = [
          '1/temperatoA',
          'temperatoA',
          '2/CelcieusB',
          'CelcieusB'
        ];

And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.

However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array, you can then sort through the array to see if it contains the data you want.

Solution 2

The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?

The expression you want is simply as follows:

(\w+)

Solution 3

With a Perl-compatible regex engine you can search for

(?<=\d/)\w+(?=.*!)

(?<=\d/) asserts that there is a digit and a slash before the start of the match

\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.

(?=.*!) asserts that there is a ! ahead in the string - i. e. the regex will fail once we have passed the !.

Depending on the language you're using, you might need to escape some of the characters in the regex.

E. g., for use in C (with the PCRE library), you need to escape the backslashes:

myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
Share:
18,009
farka
Author by

farka

Updated on June 14, 2022

Comments

  • farka
    farka almost 2 years

    I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.

    I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!

    Why?