Separate all special characters and words into items in a String List - Regex

Solution 1

Instead of splitting, you could match the pieces you want to keep: either a single one of the listed punctuation characters, or a run of one or more characters that are neither listed punctuation nor whitespace.

[,.?!“”()]|[^,.?!“”()\s]+

Explanation

  • [,.?!“”()] Match any of the listed
  • | Or
  • [^,.?!“”()\s]+ Match 1+ chars that are neither in the list above nor whitespace

Regex demo | Dart demo

Example code

void main() {
    final regExp = RegExp(r'[,.?!“”()]|[^,.?!“”()\s]+');
    // m[0] is the full match; it is never null for a successful match.
    Iterable<String> matches = regExp.allMatches("testing  “testing”  “one two three”  (hi there.) !word").map((m) => m[0]!);
    print(matches);
}

Output

(testing, “, testing, ”, “, one, two, three, ”, (, hi, there, ., ), !, word)
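
If a List<String> is needed rather than a lazy Iterable (as asked in the comments), the matches can be collected with toList(). The following is a minimal sketch, not part of the original answer; the helper name tokenize is hypothetical. It also shows that - and ' stay inside words, because they are not in the listed punctuation set.

```dart
/// Hypothetical helper: returns words and the listed punctuation marks
/// as separate list items, in their original order.
List<String> tokenize(String input) {
  // Same pattern as above: one listed punctuation character, or a run of
  // characters that are neither listed punctuation nor whitespace.
  final pattern = RegExp(r'[,.?!“”()]|[^,.?!“”()\s]+');
  // group(0) is the whole match and is non-null for a successful match.
  return pattern.allMatches(input).map((m) => m.group(0)!).toList();
}

void main() {
  print(tokenize("“don't stop” (well-known)!"));
  // [“, don't, stop, ”, (, well-known, ), !]
}
```

Because whitespace is never matched, no item carries leading or trailing spaces.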

Solution 2

Matching the bits to keep, as The fourth bird has done in Solution 1, seems like the most effective approach. If you are determined to split, however, and your regex engine supports positive lookbehinds and lookaheads, you could split on matches of the following regular expression, some of which are zero-width.

\ +|(?<=[^\w ])(?=\w)|(?<=\w)(?=[^\w ])|(?<=[^\w ])(?=[^\w ])

Demo

At the link I've shown the effect of replacing each match with a comma to make it easier to identify the matches.

The regex engine performs the following operations.

\ +          # match 1+ spaces (escape not necessary) 
|            # or
(?<=[^\w ])  # following must be preceded by a char other
             # than word char or space
(?=\w)       # preceding must be followed by a word char
|            # or 
(?<=\w)      # following must be preceded by a word char
(?=[^\w ])   # preceding must be followed by a char other
             # than word char or space
|            # or
(?<=[^\w ])  # following must be preceded by a char other
             # than word char or space
(?=[^\w ])   # preceding must be followed by a char other
             # than word char or space

All but \ + (I've escaped the space so that it can be seen more easily) are zero-width matches, meaning that the string is split between two successive characters (e.g., between " and a in ..."a...) and no characters are consumed. (?<=...) are positive lookbehinds; (?=...) are positive lookaheads.
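
In Dart, this split-based approach could look roughly like the sketch below. It assumes a Dart version whose RegExp supports lookbehind, and the helper name splitTokens is hypothetical. Empty strings are filtered out because a run of spaces at the start or end of the input would otherwise produce them.

```dart
/// Hypothetical helper: splits on the lookaround pattern and drops any
/// empty strings left by leading or trailing spaces.
List<String> splitTokens(String input) {
  final separator = RegExp(
    r' +|(?<=[^\w ])(?=\w)|(?<=\w)(?=[^\w ])|(?<=[^\w ])(?=[^\w ])',
  );
  return input.split(separator).where((s) => s.isNotEmpty).toList();
}

void main() {
  print(splitTokens('testing “testing” (hi there.) !word'));
  // [testing, “, testing, ”, (, hi, there, ., ), !, word]
}
```

Note that this pattern treats every character other than a word character or a space as its own item, so an input such as don't would be split into don, ' and t. If - and ' must stay inside words, the matching approach with an explicit punctuation list is the better fit.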

Author: Yonkee

Updated on December 19, 2022

Comments

  • Yonkee 11 months

    I am attempting to split a string into a list of strings, with words as separate items and surrounding characters, e.g. "?()“”!", as separate items too.

    The string to separate is "testing “testing” “one two three” (hi there.) !word"

    Output I would like is

    [",testing,",testing,",",one,two,three,",(,hi,there,.,),!,word]
    

    I have been using the following regex, which almost works, but it doesn't seem to pick up the characters that come before a word, like ( and “.

    RegExp regex = RegExp("(?=[,.?!“”()])|\\s+");
    
    
    list = context.split(regex).toList();
    

    Any suggestions or help from Regex masters out there would be greatly appreciated.

  • Yonkee over 3 years
    Thanks for your answer, without split how would I convert the output into a List<String>?
  • Yonkee over 3 years
    Sorry if I don't understand, but I don't want to remove the characters; I just want each of them to be a separate element in the string list, in the same order as in the string. I also don't want to separate on things like - and ', as they are part of the word's makeup.
  • The fourth bird over 3 years
    @Yonkee I have updated the answer. I think you could use this function to get all matches api.flutter.dev/flutter/dart-core/RegExp-class.html
  • The fourth bird over 3 years
    @Yonkee I have added an example.
  • Yonkee over 3 years
    Thanks for taking the time to explain it as well, I appreciate the help. Stay safe.
  • Cary Swoveland over 3 years
    Might \w+|[^\s\w] be enough?
  • The fourth bird over 3 years
    @CarySwoveland That was my first answer :-) stackoverflow.com/posts/61452940/revisions
  • Yonkee over 3 years
    I just noticed that there are still spaces attached to the words. Is there an option to remove them as well? E.g. [this, notthis,this, nothis]
  • The fourth bird over 3 years
    Not sure what you mean by [this, notthis,this, nothis]. Can you create a regex101 link with the text for which the matches are not OK?