nltk regular expression tokenizer

You should turn all capturing groups into non-capturing ones:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize appears to use re.findall, which returns lists of capture-group tuples when the pattern defines multiple capturing groups. See the nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
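
For illustration (a minimal sketch using the standard re module directly rather than NLTK), this is exactly how re.findall behaves once a pattern contains capturing groups:

    >>> import re
    >>> # one capturing group: findall returns the group contents, not the matches
    >>> re.findall(r'\w+(-\w+)*', 'That poster-print costs')
    ['', '-print', '']
    >>> # two or more capturing groups: findall returns tuples of group contents
    >>> re.findall(r'([A-Z]\.)+|\w+(-\w+)*', 'That U.S.A. poster-print')
    [('', ''), ('A.', ''), ('', '-print')]
    >>> # non-capturing groups: findall returns the full matches
    >>> re.findall(r'\w+(?:-\w+)*', 'That poster-print costs')
    ['That', 'poster-print', 'costs']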

Also, I am not sure you meant to use :-_ inside the character class: it defines a range that includes all the uppercase letters (among other characters). Move the - to the end of the character class so it is treated as a literal hyphen.
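
A quick check with re shows the unintended range (':' is code point 58 and '_' is 95, so the range spans the uppercase letters in between):

    >>> import re
    >>> re.findall(r'[:-_]', 'A-Z')  # ':-_' is a range, so it matches 'A' and 'Z'
    ['A', 'Z']
    >>> re.findall(r'[:_-]', 'A-Z')  # a trailing '-' is a literal hyphen
    ['-']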

Thus, use

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
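
With the non-capturing groups and the fixed character class, nltk.regexp_tokenize should return the full matches, i.e. the tokens you expect:

    >>> import nltk
    >>> text = 'That U.S.A. poster-print costs $12.40...'
    >>> nltk.regexp_tokenize(text, pattern)
    ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']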

Comments

  • Juan Menashsheh almost 2 years

    I tried to implement a regular expression tokenizer with NLTK in Python, but this is the result:

    >>> import nltk
    >>> text = 'That U.S.A. poster-print costs $12.40...'
    >>> pattern = r'''(?x)    # set flag to allow verbose regexps
    ...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    ...   | \w+(-\w+)*        # words with optional internal hyphens
    ...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    ...   | \.\.\.            # ellipsis
    ...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
    ... '''
    >>> nltk.regexp_tokenize(text, pattern)
    [('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
    

    But the desired result is this:

    ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
    

    Why? Where is the mistake?