nltk regular expression tokenizer
You should turn all capturing groups into non-capturing groups:

([A-Z]\.)+        ->  (?:[A-Z]\.)+
\w+(-\w+)*        ->  \w+(?:-\w+)*
\$?\d+(\.\d+)?%?  ->  \$?\d+(?:\.\d+)?%?
The issue is that regexp_tokenize appears to use re.findall, which returns lists of capture-group tuples when the pattern contains multiple capturing groups. See the nltk.tokenize package reference:
pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead)
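To see the findall behavior in isolation, here is a minimal sketch using only two of the question's alternatives (not the full pattern): with capturing groups, re.findall returns a tuple of group contents per match; with non-capturing groups, it returns the full match text.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Two capturing groups: findall returns one tuple per match,
# holding only the group contents -- mostly empty strings.
with_groups = re.findall(r'([A-Z]\.)+|\w+(-\w+)*', text)

# Same alternatives with non-capturing groups: findall now
# returns the full text of each match.
without_groups = re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*', text)

print(with_groups)
print(without_groups)
```

This is exactly why the question's output is full of tuples like ('', '-print', '').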
Also, I am not sure you wanted to use :-_ inside the character class: it defines a range from ':' to '_' that includes all uppercase letters. To match a literal hyphen, put the - at the end of the character class.
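A quick sketch of the difference (the sample string 'AB:_-' is mine, chosen to show the pitfall): ':' is 58 and '_' is 95 in ASCII, so [:-_] spans the uppercase letters (65–90) but does not match '-' itself.

```python
import re

sample = 'AB:_-'

# ':-_' is a range from ':' (58) to '_' (95), so it swallows
# every uppercase letter -- and does NOT match a literal hyphen.
as_range = re.findall(r'[:-_]', sample)

# With '-' at the end of the class, all three characters are literals.
as_literals = re.findall(r'[:_-]', sample)

print(as_range)     # the uppercase letters get matched
print(as_literals)  # only ':', '_' and '-' get matched
```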
Thus, use
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)* # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
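With this corrected pattern, the tokenizer produces the wanted output. Since regexp_tokenize delegates to re.findall, plain re reproduces the result without needing nltk installed:

```python
import re

pattern = r'''(?x)             # set flag to allow verbose regexps
    (?:[A-Z]\.)+               # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*               # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?         # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                     # ellipsis
  | [][.,;"'?():_`-]           # these are separate tokens; includes ], [
'''

text = 'That U.S.A. poster-print costs $12.40...'

# nltk.regexp_tokenize(text, pattern) gives the same list:
tokens = re.findall(pattern, text)
print(tokens)
```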
Author: Juan Menashsheh
Updated on June 19, 2022

Comments
-
Juan Menashsheh, almost 2 years ago
I tried to implement a regular expression tokenizer with nltk in python, but the result is this:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
But the wanted result is this:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Why? Where is the mistake?