nltk regular expression tokenizer

You should turn all capturing groups into non-capturing ones:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize appears to use re.findall, which returns lists of capture-group tuples when the pattern defines multiple capturing groups. See the nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
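
For illustration (a minimal sketch using the standard re module directly rather than NLTK), this is exactly how re.findall behaves once a pattern contains capturing groups:

    >>> import re
    >>> # one capturing group: findall returns the group contents, not the matches
    >>> re.findall(r'\w+(-\w+)*', 'That poster-print costs')
    ['', '-print', '']
    >>> # two or more capturing groups: findall returns tuples of group contents
    >>> re.findall(r'([A-Z]\.)+|\w+(-\w+)*', 'That U.S.A. poster-print')
    [('', ''), ('A.', ''), ('', '-print')]
    >>> # non-capturing groups: findall returns the full matches
    >>> re.findall(r'\w+(?:-\w+)*', 'That poster-print costs')
    ['That', 'poster-print', 'costs']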

Also, I am not sure you meant to use :-_ inside the character class: it defines a range that includes all the uppercase letters (among other characters). Move the - to the end of the character class so it is treated as a literal hyphen.
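
A quick check with re shows the unintended range (':' is code point 58 and '_' is 95, so the range spans the uppercase letters in between):

    >>> import re
    >>> re.findall(r'[:-_]', 'A-Z')  # ':-_' is a range, so it matches 'A' and 'Z'
    ['A', 'Z']
    >>> re.findall(r'[:_-]', 'A-Z')  # a trailing '-' is a literal hyphen
    ['-']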

Thus, use

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
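
With the non-capturing groups and the fixed character class, nltk.regexp_tokenize should return the full matches, i.e. the tokens you expect:

    >>> import nltk
    >>> text = 'That U.S.A. poster-print costs $12.40...'
    >>> nltk.regexp_tokenize(text, pattern)
    ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']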

Comments

  • Juan Menashsheh almost 2 years

    I tried to implement a regular expression tokenizer with NLTK in Python, but this is the result:

    >>> import nltk
    >>> text = 'That U.S.A. poster-print costs $12.40...'
    >>> pattern = r'''(?x)    # set flag to allow verbose regexps
    ...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    ...   | \w+(-\w+)*        # words with optional internal hyphens
    ...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    ...   | \.\.\.            # ellipsis
    ...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
    ... '''
    >>> nltk.regexp_tokenize(text, pattern)
    [('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
    

    But the desired result is this:

    ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
    

    Why? Where is the mistake?