Why does re.findall return a list of tuples when my pattern only contains one group?

python regex findall

21,557

Solution 1

Your pattern has two groups. The first is the outer group:

(1([a-z]+)2|[a-z])

and the second, smaller group, which is nested inside the first:

([a-z]+)
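
As a quick check, here is a small sketch that asks Python how many capturing groups it sees in the pattern from the question. The variable names `pattern` and `m` are only illustrative; `groups` is the attribute of a compiled pattern that holds the number of capturing groups.

```python
import re

pattern = re.compile(r'(1([a-z]+)2|[a-z])')

# Number of capturing groups in the compiled pattern.
print(pattern.groups)  # 2

# Groups are numbered by the position of their opening parenthesis,
# so the outer group is 1 and the nested one is 2.
m = pattern.search('1cd2')
print(m.group(1))  # 1cd2
print(m.group(2))  # cd
```

Because the count is 2, findall returns one 2-tuple per match, which is exactly the output shown in the question.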

Here is a solution that gives you the expected result, although it is really ugly and there is probably a better way that I just can't figure out:

import re
s = 'ab1cd2efg1hij2k'
a = re.findall( r'((?:1)([a-z]+)(?:2)|([a-z]))', s )
a = [tuple(j for j in i if j)[-1] for i in a]

>>> print(a)
['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']

Solution 2

I am 5 years too late to the party, but I think I might have found an elegant solution to the re.findall() ugly tuple-ridden output with multiple capture groups.

In general, if you end up with output that looks something like this:

[('pattern_1', '', ''), ('', 'pattern_2', ''), ('pattern_1', '', ''), ('', '', 'pattern_3')]

Then you can bring it into a flat list with this little trick:

["".join(x) for x in re.findall(all_patterns, iterable)]

The expected output will be like so:

['pattern_1', 'pattern_2', 'pattern_1', 'pattern_3']

It was tested on Python 3.7. Hope it helps!
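
For a concrete run of the trick, here is a minimal sketch. The pattern `all_patterns` and the input `text` are made up for illustration; they are not from the question. Each alternative has its own capture group, so every tuple has exactly one non-empty entry.

```python
import re

# Hypothetical pattern: three alternatives, one capture group each.
all_patterns = r'(\d+)|([a-z]+)|([A-Z]+)'
text = 'abc 123 DEF 45'

raw = re.findall(all_patterns, text)
# raw is [('', 'abc', ''), ('123', '', ''), ('', '', 'DEF'), ('45', '', '')]

# Joining each tuple drops the empty strings and keeps the one real match.
flat = ["".join(groups) for groups in raw]
print(flat)  # ['abc', '123', 'DEF', '45']
```

Note that this only works when at most one group is non-empty per match. With the nested groups from the question, the tuple ('1cd2', 'cd') would be joined into '1cd2cd'.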

Solution 3

Your regular expression has 2 groups, just look at the number of opening parentheses you are using :). One group is ([a-z]+) and the other is (1([a-z]+)2|[a-z]). The key is that you can have groups inside other groups. So, if possible, you should build a regular expression with only one group, so that you don't have to post-process the result.

An example of a regular expression with only one group would be:

>>> import re
>>> s = 'ab1cd2efg1hij2k'
>>> re.findall('((?<=1)[a-z]+(?=2)|[a-z])', s)
['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']

Solution 4

See this discussion of a similar question: https://bugs.python.org/issue6663 Just drop the parentheses if you are using findall:

import re
s = 'ab1cd2efg1hij2k'
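# With no capturing groups, findall returns the whole matches as plain strings.
print(re.findall(r'(?<=1)[a-z]+(?=2)|[a-z]', s))
# ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']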

Solution 5

If you want an 'or' match without it being split out into its own match group, add '?:' right after the opening parenthesis of the 'or' group. This makes the group non-capturing.

Without '?:'

re.findall('(test (word1|word2))', 'test word1')

Output:
[('test word1', 'word1')]

With '?:'

re.findall('(test (?:word1|word2))', 'test word1')

Output:
['test word1']

Further explanation: https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
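
One caveat when applying this to the pattern from the question: a sketch, assuming you make only the outer group non-capturing and leave the inner ([a-z]+) as it is. The name `only_inner` is just illustrative. With exactly one capturing group left, findall returns that group's text for every match, which is an empty string whenever the single-letter alternative matched.

```python
import re

s = 'ab1cd2efg1hij2k'

# Outer group is now non-capturing; the inner ([a-z]+) is the only group.
only_inner = re.findall(r'(?:1([a-z]+)2|[a-z])', s)
print(only_inner)  # ['', '', 'cd', '', '', '', 'hij', '']
```

So for that particular pattern, '?:' alone is not enough; the single letters are lost unless they are captured too or the pattern has no groups at all.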

Author by

usual me

Updated on December 30, 2020

Comments

usual me over 3 years
Say I have a string s containing letters and two delimiters 1 and 2. I want to split the string in the following way:
- if a substring t falls between 1 and 2, return t
- otherwise, return each character
So if s = 'ab1cd2efg1hij2k', the expected output is ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k'].

I tried to use regular expressions:
```
import re
s = 'ab1cd2efg1hij2k'
re.findall( r'(1([a-z]+)2|[a-z])', s )

[('a', ''),
 ('b', ''),
 ('1cd2', 'cd'),
 ('e', ''),
 ('f', ''),
 ('g', ''),
 ('1hij2', 'hij'),
 ('k', '')]
```
From there I can do [ x[x[-1]!=''] for x in re.findall( r'(1([a-z]+)2|[a-z])', s ) ] to get my answer, but I still don't understand the output. The documentation says that findall returns a list of tuples if the pattern has more than one group. However, my pattern only contains one group. Any explanation is welcome.
Blckknght almost 10 years

Your pattern is pretty odd. You don't need the non-capturing groups around 1 and 2, or the group around the whole pattern (which you expend a bunch of effort to skip in the output). Instead, just accept that the findall call will return 2-tuples and turn them into single values with a = [x or y for x, y in a].
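
A minimal sketch of that suggestion follows. The simplified pattern with one group per alternative is an assumption about what the comment has in mind, and the names `pairs` and `result` are illustrative.

```python
import re

s = 'ab1cd2efg1hij2k'

# Assumed simplified pattern: one capture group per alternative, no outer group.
pairs = re.findall(r'1([a-z]+)2|([a-z])', s)

# In each 2-tuple exactly one entry is non-empty, so `x or y` picks it.
result = [x or y for x, y in pairs]
print(result)  # ['a', 'b', 'cd', 'e', 'f', 'g', 'hij', 'k']
```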
A. Rabus over 3 years

"non-capturing group" is the keyword here... (added solely for the search engine)
grantr about 2 years

saved me-thanks