How do I split a string into a list?
Solution 1
It just so happens that the tokens you want split are already Python tokens, so you can use the built-in tokenize
module. It's almost a one-liner; this program:
from io import StringIO
from tokenize import generate_tokens
STRING = 1  # position of the token text within each token tuple
print(
    list(
        token[STRING]
        for token in generate_tokens(StringIO("2+24*48/32").readline)
        if token[STRING]
    )
)
produces this output:
['2', '+', '24', '*', '48', '/', '32']
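On Python 3, generate_tokens yields TokenInfo named tuples, so the token text can also be read through the .string attribute instead of a numeric index. A minimal sketch, assuming Python 3; the helper name split_tokens is made up for illustration:

```python
from io import StringIO
from tokenize import generate_tokens


def split_tokens(expression):
    """Return the non-empty token strings of a Python-like expression."""
    readline = StringIO(expression).readline
    # For this kind of input the trailing NEWLINE and ENDMARKER tokens
    # have an empty string, so they are filtered out.
    return [token.string for token in generate_tokens(readline) if token.string]


print(split_tokens("2+24*48/32"))
```

Note that this only works for input that the Python tokenizer accepts; other characters may raise an error or produce error tokens.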
Solution 2
You can use split from the re module.
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Example code:
import re
data = re.split(r'(\D)', '2+24*48/32')
\D
When the UNICODE flag is not specified, \D matches any non-digit character; this is equivalent to the set [^0-9].
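To make the behaviour concrete, the sketch below runs the split and also shows one caveat: when two non-digit characters are adjacent, re.split emits an empty string between them, which usually needs filtering. The second input string is an illustrative assumption, not taken from the question.

```python
import re

# The capturing group keeps each separator in the result.
data = re.split(r'(\D)', '2+24*48/32')
print(data)  # ['2', '+', '24', '*', '48', '/', '32']

# Adjacent separators produce an empty string between them, so drop it.
raw = re.split(r'(\D)', '2*-3')
cleaned = [part for part in raw if part]
print(raw)      # ['2', '*', '', '-', '3']
print(cleaned)  # ['2', '*', '-', '3']
```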
Solution 3
>>> import re
>>> re.findall(r'\d+|\D+', '2+24*48/32=10')
['2', '+', '24', '*', '48', '/', '32', '=', '10']
Matches consecutive digits or consecutive non-digits.
Each match is returned as a new element in the list.
Depending on the usage, you may need to alter the regular expression, for example to match numbers with a decimal point.
>>> re.findall(r'[0-9\.]+|[^0-9\.]+', '2+24*48/32=10.1')
['2', '+', '24', '*', '48', '/', '32', '=', '10.1']
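The character class [0-9\.]+ also accepts strings such as 1.2.3 or a lone dot. If that matters, a stricter pattern could look like the sketch below. Treating every other non-whitespace character as a one-character token is an assumption made here; it differs from \D+ above, which groups consecutive non-digits into a single element.

```python
import re

# A number is digits with an optional fractional part; any other
# non-whitespace character becomes a token of its own.
TOKEN = re.compile(r'\d+(?:\.\d+)?|[^\d\s]')

print(TOKEN.findall('2 + 24*48/32 = 10.1'))
# ['2', '+', '24', '*', '48', '/', '32', '=', '10.1']
```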
Solution 4
This looks like a parsing problem, and thus I am compelled to present a solution based on parsing techniques.
While it may seem that you want to 'split' this string, I think what you actually want to do is 'tokenize' it. Tokenization, or lexing, is the compilation step before parsing. I have amended my original example in an edit to implement a proper recursive descent parser here. This is the easiest way to implement a parser by hand.
import re

# (token type, pattern) pairs, tried in order at the current position
patterns = [
    ('number', re.compile(r'\d+')),
    ('*', re.compile(r'\*')),
    ('/', re.compile(r'\/')),
    ('+', re.compile(r'\+')),
    ('-', re.compile(r'\-')),
]
# \s rather than \W: \W also matches the operators and would swallow them
whitespace = re.compile(r'\s+')

def tokenize(string):
    while string:
        # strip off whitespace
        m = whitespace.match(string)
        if m:
            string = string[m.end():]
        for tokentype, pattern in patterns:
            m = pattern.match(string)
            if m:
                yield tokentype, m.group(0)
                string = string[m.end():]
                break
        else:
            if string:
                # nothing matched: fail instead of looping forever
                raise ValueError("Unexpected input %r" % string)

def parseNumber(tokens):
    tokentype, literal = tokens.pop(0)
    assert tokentype == 'number'
    return int(literal)

def parseMultiplication(tokens):
    product = parseNumber(tokens)
    while tokens and tokens[0][0] in ('*', '/'):
        tokentype, literal = tokens.pop(0)
        if tokentype == '*':
            product *= parseNumber(tokens)
        elif tokentype == '/':
            # true division on Python 3, so the result becomes a float
            product /= parseNumber(tokens)
        else:
            raise ValueError("Parse Error, unexpected %s %s" % (tokentype, literal))
    return product

def parseAddition(tokens):
    total = parseMultiplication(tokens)
    while tokens and tokens[0][0] in ('+', '-'):
        tokentype, literal = tokens.pop(0)
        if tokentype == '+':
            total += parseMultiplication(tokens)
        elif tokentype == '-':
            total -= parseMultiplication(tokens)
        else:
            raise ValueError("Parse Error, unexpected %s %s" % (tokentype, literal))
    return total

def parse(tokens):
    tokenlist = list(tokens)
    returnvalue = parseAddition(tokenlist)
    if tokenlist:
        print('Unconsumed data', tokenlist)
    return returnvalue

def main():
    string = '2+24*48/32'
    for tokentype, literal in tokenize(string):
        print(tokentype, literal)
    print(parse(tokenize(string)))

if __name__ == '__main__':
    main()
Implementation of handling of brackets is left as an exercise for the reader. This example will correctly do multiplication before addition.
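For readers who want to check their attempt at the exercise, here is one possible sketch of bracket handling. It is a compact, self-contained variant and not the code above: the function names (parse_atom, parse_multiplication, parse_addition, evaluate) and the simplified tokenizer are assumptions made for illustration. The idea is that the lowest-level rule accepts either a number or a parenthesised expression, and in the latter case recurses back into the top-level rule.

```python
import re

# Simplified tokenizer: numbers and single-character operators/brackets.
# Any other character is silently skipped in this sketch.
TOKEN = re.compile(r'\d+|[-+*/()]')


def parse_atom(tokens):
    """Parse a number or a parenthesised sub-expression."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == '(':
        value = parse_addition(tokens)  # recurse into the top-level rule
        if not tokens or tokens.pop(0) != ')':
            raise ValueError("expected ')'")
        return value
    return int(token)  # raises ValueError if the token is not a number


def parse_multiplication(tokens):
    product = parse_atom(tokens)
    while tokens and tokens[0] in ('*', '/'):
        op = tokens.pop(0)
        right = parse_atom(tokens)
        product = product * right if op == '*' else product / right
    return product


def parse_addition(tokens):
    total = parse_multiplication(tokens)
    while tokens and tokens[0] in ('+', '-'):
        op = tokens.pop(0)
        right = parse_multiplication(tokens)
        total = total + right if op == '+' else total - right
    return total


def evaluate(text):
    tokens = TOKEN.findall(text)
    value = parse_addition(tokens)
    if tokens:
        raise ValueError("unconsumed tokens: %r" % tokens)
    return value


print(evaluate('(2+24)*48/32'))
```

Because the brackets are handled at the lowest level, a bracketed expression binds tighter than any operator, which is the usual grammar for arithmetic.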
Solution 5
This is a parsing problem, so neither regex nor split() is the "good" solution. Use a parser generator instead.
I would look closely at pyparsing. There have also been some decent articles about pyparsing in the Python Magazine.
Admin
Updated on July 17, 2020
Comments
-
Admin almost 4 years
If I have this string:
2+24*48/32
what is the most efficient approach for creating this list:
['2', '+', '24', '*', '48', '/', '32']
-
Jerub over 15 years
Why don't you just go implement Forth, it'll only be 5 more lines!
-
Admin over 15 years
I'm reading up on tokenizing now to understand it, so I'm not able to say where the problem is, though I think it's in the fact that this script will eval * and / at the same time, which is incorrect. The string 8/2*2 should print a result of 2, but it prints a result of 8.
-
Admin over 15 years
Excuse me, I'm wrong. I always took BOMDAS literally; it turns out multiplication and division are equal in order of precedence, and whichever occurs first is evaluated first.
-
Kiv over 15 years
Great answer, I didn't realize this module existed :)
-
P Daddy over 13 years
Forgive me if I'm wrong, but wouldn't it be preferable to use result = eval(expression)?
-
Diamond Python over 13 years
Indeed it would; my apologies.
-
roskakori about 12 years
Instead of manually assigning STRING=1 you could use the constant from the token module by doing a from token import STRING. This is particularly useful if you need several token constants.
-
Victor S almost 12 years
Why would such a complicated answer be rated so high? It's a pretty simple question. Whatever happened to finding the cleanest, most concise answer?
-
Air over 10 years
In tokenize: Why use re to remove whitespace over a built-in string function?