python - regex search and findall
Solution 1
Ok, I see what's going on... from the docs:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
As it turns out, you do have a group, "(\d+,?)", and it repeats. So findall returns only what that group captured on its last repetition, which is 000.
One solution is to wrap the entire regex in a group, like this:
regex = re.compile(r'((\d+,?)+)')
Then findall returns [('9,000,000', '000')], a list holding one tuple with both matched groups. Of course, you only care about the first one.
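To make the difference concrete, here is a small sketch using the sample sentence from the question. The variable names (wrapped, whole_matches) are only illustrative.

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'

# One capturing group: findall returns only that group's last repetition.
single_group = re.findall(r'(\d+,?)+', s)
print(single_group)       # ['000']

# Two groups: findall returns one tuple per match, outer group first.
wrapped = re.findall(r'((\d+,?)+)', s)
print(wrapped)            # [('9,000,000', '000')]

# Keep only the outer group, i.e. the text of the whole match.
whole_matches = [outer for outer, _inner in wrapped]
print(whole_matches)      # ['9,000,000']
```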
Personally, I would use the following regex:
regex = re.compile(r'((\d+,)*\d+)')
It requires the match to end in a digit, so it won't pick up the trailing comma in something like "this is a bad number 9,123,".
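A quick comparison of the two patterns on that input, as a sketch (the names loose and strict are just labels for this example):

```python
import re

s = 'this is a bad number 9,123,'

# The original pattern allows an optional comma after every digit run,
# so the trailing comma becomes part of the match.
loose = re.search(r'(\d+,?)+', s).group(0)

# The stricter pattern must end in digits, so the trailing comma is left out.
strict = re.search(r'((\d+,)*\d+)', s).group(0)

print(repr(loose))    # '9,123,'
print(repr(strict))   # '9,123'
```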
Edit.
Here's a way to avoid having to surround the expression with parentheses or deal with tuples:
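import re

s = "..."
regex = re.compile(r'(\d+,?)+')
# finditer yields one match object per non-overlapping match
for match in regex.finditer(s):
    # group(0) is the whole match, regardless of capturing groups
    print(match.group(0))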
finditer returns an iterator that you can use to access all the matches found. These match objects are the same kind that re.search returns, so group(0) returns the result you expect.
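If a list is more convenient than a loop, the same idea can be sketched with list comprehensions. The input string here is the sample sentence from the question, chosen for illustration.

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'
regex = re.compile(r'(\d+,?)+')

# group(0) is the whole match; collect it for every match object.
matches = [m.group(0) for m in regex.finditer(s)]

# Match objects also carry positions, which findall cannot give you.
spans = [m.span() for m in regex.finditer(s)]

# The capturing group is still available if you want it.
last_groups = [m.group(1) for m in regex.finditer(s)]

print(matches)       # ['9,000,000']
print(spans)         # [(10, 19)]
print(last_groups)   # ['000']
```

Note that last_groups is exactly what findall returned for this pattern, which is where the surprising '000' came from.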
Solution 2
@aleph_null's answer correctly explains what's causing your problem, but I think I have a better solution. Use this regex:
regex = re.compile(r'\d+(?:,\d+)*')
Some reasons why it's better:
- (?:...) is a non-capturing group, so you only get the one result for each match.
- \d+(?:,\d+)* is a better regex, more efficient and less likely to return false positives.
- You should always use Python's raw strings for regexes if possible; you're less likely to be surprised by regex escape sequences (like \b for word boundary) being interpreted as string-literal escape sequences (like \b for backspace).
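A short sketch of both points, using the sample sentence from the question. The \bare\b pattern is only an illustration of the escape-sequence issue and does not come from the original posts.

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'

# No capturing groups, so findall returns the whole match as a plain string.
numbers = re.findall(r'\d+(?:,\d+)*', s)
print(numbers)             # ['9,000,000']

# In a raw string, \b reaches the regex engine as a word boundary.
raw_hits = re.findall(r'\bare\b', s)
print(raw_hits)            # ['are']

# In a normal string literal, '\b' is a single backspace character,
# so the pattern looks for backspaces around "are" and finds nothing.
plain_hits = re.findall('\bare\b', s)
print(plain_hits)          # []

print(len(r'\b'), len('\b'))   # 2 1
```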
armandino
Updated on April 15, 2020
Comments
-
armandino about 4 years
I need to find all matches in a string for a given regex. I've been using findall() to do that until I came across a case where it wasn't doing what I expected. For example:
regex = re.compile('(\d+,?)+')
s = 'There are 9,000,000 bicycles in Beijing.'
print re.search(regex, s).group(0)
> 9,000,000
print re.findall(regex, s)
> ['000']
In this case search() returns what I need (the longest match) but findall() behaves differently, although the docs imply it should be the same:
findall() matches all occurrences of a pattern, not just the first one as search() does.
Why is the behaviour different?
How can I achieve the result of search() with findall() (or something else)?
-
armandino over 12 years
Thanks for the explanation. It turns out finditer was actually better suited to what I was doing, as you suggested. The regex comes from user input so I don't have control over it.
-
armandino over 12 years
Thanks Alan! I should have mentioned before but I don't have control over the regex as it's user input.
-
Alan Moore over 12 years
No problem! But, for the record, letting users input regexes to be executed by your app is a bad idea. When their badly-written (or just hurriedly-typed) regexes fail to match, or crash the system, they're going to blame you for it. ;)
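For readers in the same situation (patterns typed in by users), one possible way to combine the advice in this thread is sketched below. The function name find_all_matches is hypothetical. It only guards against patterns that fail to compile, and does nothing about slow or pathological patterns.

```python
import re

def find_all_matches(pattern_text, text):
    """Return the whole text of every match, or None if the pattern is invalid.

    Hypothetical helper: it relies on finditer and group(0), so the result
    does not depend on how many capturing groups the user's pattern has.
    """
    try:
        regex = re.compile(pattern_text)
    except re.error:
        # The user's pattern is not a valid regular expression.
        return None
    return [m.group(0) for m in regex.finditer(text)]

s = 'There are 9,000,000 bicycles in Beijing.'
print(find_all_matches(r'(\d+,?)+', s))   # ['9,000,000']
print(find_all_matches(r'(\d+', s))       # None (unbalanced parenthesis)
```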