Tokenize a string keeping delimiters in Python


Solution 1

How about matching alternating runs of whitespace and non-whitespace:

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
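
As a quick check against the string from the question, here is a minimal sketch. The helper name `tokenize` is only illustrative and not part of the `re` module.

```python
import re

# Alternation of whitespace runs and non-whitespace runs: every character
# of the input belongs to exactly one match, so nothing is dropped.
SPLITTER = re.compile(r'\s+|\S+')

def tokenize(s):
    """Return whitespace and non-whitespace runs in order (hypothetical helper)."""
    return SPLITTER.findall(s)

s = "\tthis is an  example"
tokens = tokenize(s)
print(tokens)
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

# Joining the tokens reproduces the input exactly.
assert ''.join(tokens) == s
```

Because `findall` only reports matches, there is no empty string at the start of the list even though the input begins with whitespace.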

Solution 2

>>> import re
>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
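
The leading `''` appears because the input starts with a delimiter. One possible way to drop it without losing the tab is to filter out empty strings afterwards. This is a sketch, and the helper name `split_keep` is made up for illustration.

```python
import re

def split_keep(s, pattern=r'(\s+)'):
    # The capturing group makes re.split return the delimiters as well.
    # Filtering on truthiness removes the empty strings that show up
    # when the input starts or ends with a delimiter.
    return [tok for tok in re.split(pattern, s) if tok]

print(split_keep("\tthis is an  example"))
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
```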

Solution 3

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

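Edit: As pointed out, if the string starts or ends with a delimiter, that delimiter is captured too, and re.split adds an empty string before or after it. To avoid that, you can call .strip() on the input string first; note that this also discards the leading and trailing whitespace itself.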
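
Since the goal in the question is to process some tokens and keep the whitespace layout, here is a rough sketch of that round trip. The function name `transform_words` and the use of `str.upper` as the processing step are assumptions for illustration only.

```python
import re

def transform_words(s, func):
    # With a single capturing group, re.split alternates between
    # non-delimiter text (even indices, possibly empty strings) and
    # the captured whitespace delimiters (odd indices).
    parts = re.split(r'(\s+)', s)
    # Apply func to the words only; the delimiters stay untouched.
    parts[::2] = [func(word) for word in parts[::2]]
    return ''.join(parts)

print(repr(transform_words("\tthis is an  example", str.upper)))
# '\tTHIS IS AN  EXAMPLE'
```

The empty strings at the edges are harmless here as long as `func` maps an empty string to an empty string, so no `.strip()` is needed and the leading tab survives.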

Solution 4

Have you looked at pyparsing? Example adapted from the pyparsing wiki (updated for Python 3):

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print('%s \u2192 %s' % (hello, greet.parseString(hello).asList()))
...
Hello, World! → ['Hello', ',', 'World', '!']
Greetings, Earthlings! → ['Greetings', ',', 'Earthlings', '!']

Author by

fortran

general purpose developer

Updated on December 01, 2020

Comments

  • fortran over 3 years

    Is there any equivalent to str.split in Python that also returns the delimiters?

    I need to preserve the whitespace layout for my output after processing some of the tokens.

    Example:

    >>> s = "\tthis is an  example"
    >>> print(s.split())
    ['this', 'is', 'an', 'example']

    >>> print(what_I_want(s))
    ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
    

    Thanks!

  • Admin over 14 years
    Elegant and easily expandable (think (\s+|\w+|\S+)).
  • Admin over 14 years
    Not using the OP's string masks the fact that an empty string is included as the first element of the returned list.
  • Tim Pietzcker over 14 years
    Thanks. I edited my post accordingly (although in this case, the OP's spec ("want to preserve whitespace") and his example were contradictory).
  • ghostdog74 over 14 years
    No need for a regex or for reinventing the wheel if you have Python 2.5 or later; see my answer.
  • fortran over 14 years
    No, it wasn't... there was one example of the current behaviour, and another of the desired one.
