Tokenize a string keeping delimiters in Python
Solution 1
How about
import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
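As a quick check against the string from the question, here is a minimal sketch (the helper name `tokenize` is made up for illustration and is not part of the re module). Every character is either whitespace or non-whitespace, so the two alternatives together consume the whole input and joining the tokens gives back the original string.

```python
import re

# Alternate between runs of whitespace and runs of non-whitespace.
splitter = re.compile(r'(\s+|\S+)')

def tokenize(s):
    # hypothetical helper name, used only for this illustration
    return splitter.findall(s)

tokens = tokenize("\tthis is an example")
print(tokens)
```

Unlike the re.split approach, this should not produce empty strings at the edges, since findall only reports non-empty matches for this pattern.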
Solution 2
>>> import re
>>> re.compile(r'(\s+)').split("\tthis is an example")
['', '\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
Solution 3
The re module provides this functionality:
>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
(quoted from the Python documentation).
For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').
The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.
Edit: As pointed out, a leading or trailing delimiter is also added to the list, together with an empty string at that end. To avoid that, you can call the .strip() method on your input string first.
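Note that stripping the input also throws away a leading tab, which the question wants to keep. A possible alternative, sketched here with a hypothetical helper name `split_keep`, is to filter out the empty strings that re.split produces at the edges:

```python
import re

def split_keep(s):
    # hypothetical helper: split on whitespace runs, keep them as tokens,
    # and drop the empty strings re.split emits when s starts or ends
    # with a delimiter
    return [tok for tok in re.split(r'(\s+)', s) if tok]

print(split_keep("\tthis is an example"))
```

The filter only removes empty strings, so the whitespace tokens themselves are preserved and the layout can be rebuilt with ''.join().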
Solution 4
Have you looked at pyparsing? Example borrowed from the pyparsing wiki:
>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
... print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
...
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})
Comments
-
fortran over 3 years
Is there any equivalent to str.split in Python that also returns the delimiters? I need to preserve the whitespace layout for my output after processing some of the tokens.
Example:
>>> s="\tthis is an example"
>>> print s.split()
['this', 'is', 'an', 'example']
>>> print what_I_want(s)
['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
Thanks!
-
Dominic Rodger over 14 years
+1 - Interesting question, splitlines seems to have a keepends parameter, but no such thing for split. Seems odd (docs.python.org/library/stdtypes.html#str.splitlines).
-
Admin over 14 years
Elegant and easily expandable (think (\s+|\w+|\S+)).
-
Admin over 14 years
Not using the OP's string masks the fact that the empty string is included as the first element of the returned list.
-
Tim Pietzcker over 14 years
Thanks. I edited my post accordingly (although in this case, the OP's spec ("want to preserve whitespace") and his example were contradictory).
-
ghostdog74 over 14 years
No need for regex or reinventing the wheel if you have Python 2.5 onwards; see my answer.
-
fortran over 14 years
No, it wasn't... there was one example of the current behaviour, and another of the desired one.