Split sentence into words but having trouble with the punctuations in C#

Solution 1

A regex solution.

(\b[^\s]+\b)

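And if you really want to keep that last "." on "i.e.", you could use this (the final dot is escaped so that it matches only a literal period, not any character).

((\b[^\s]+\b)((?<=\.\w)\.)?)

Here's the code I'm using.

  using System;
  using System.Text.RegularExpressions;

  var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
  // A word, optionally followed by one period when it comes right after ".<letter>"
  var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w)\.)?)");

  foreach (Match match in matches)
  {
     Console.WriteLine(match.Value);
  }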

Results:

The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth
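
For inputs beyond the example sentence, such as contractions or longer dotted abbreviations, the following is a minimal sketch of the same idea wrapped in a helper. The class and method names are made up for illustration, and the pattern escapes the trailing dot, which is an assumption about the intent rather than the exact pattern shown above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class BoundaryTokenizer
{
    // Word-boundary-delimited run of non-whitespace, optionally followed by
    // one literal period when the run ends in ".<word char>" (as in "i.e").
    const string Pattern = @"\b[^\s]+\b(?:(?<=\.\w)\.)?";

    public static string[] Tokenize(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    static void Main()
    {
        var words = Tokenize("I can't join u.n.i.c.e.f today, i.e. not now.");
        Console.WriteLine(string.Join("|", words));
        // prints: I|can't|join|u.n.i.c.e.f|today|i.e.|not|now
    }
}
```

Because `[^\s]+` accepts any non-whitespace character, "can't" stays in one piece; only punctuation at the very start or end of a token is dropped.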

Solution 2

I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?

Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
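
A rough sketch of what such a dictionary lookup might look like. The allow-list contents, the class name and the method name are all hypothetical; a real list would need many more entries, and this does not resolve the case where a period both belongs to the word and ends the sentence.

```csharp
using System;
using System.Collections.Generic;

static class DictionaryTokenizer
{
    // Hypothetical allow-list of "words that contain punctuation".
    static readonly HashSet<string> KnownTokens =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "i.e.", "e.g.", "etc." };

    public static List<string> Tokenize(string text)
    {
        var words = new List<string>();
        var separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (var raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Remove trailing marks that never belong to an allow-listed word.
            var token = raw.TrimEnd(',', ';', ':', '!', '?');
            if (KnownTokens.Contains(token))
            {
                words.Add(token); // keep the embedded punctuation
                continue;
            }

            // Otherwise strip punctuation from both ends of the token.
            int start = 0, end = token.Length;
            while (start < end && char.IsPunctuation(token[start])) start++;
            while (end > start && char.IsPunctuation(token[end - 1])) end--;
            if (end > start) words.Add(token.Substring(start, end - start));
        }
        return words;
    }

    static void Main()
    {
        var words = Tokenize("The moon is our natural satellite, i.e. it rotates around the Earth!");
        Console.WriteLine(string.Join("|", words));
        // prints: The|moon|is|our|natural|satellite|i.e.|it|rotates|around|the|Earth
    }
}
```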

Solution 3

Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value)
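
For context, here is the same one-liner as a small runnable sketch (the wrapper class and method names are illustrative only). Note that `\w` does not match a period, so this pattern splits "i.e." into two separate words.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class WordCharSplit
{
    // Returns every maximal run of \w characters (letters, digits, underscore).
    public static string[] Words(string input) =>
        Regex.Matches(input, @"\b\w+\b")
             .OfType<Match>()
             .Select(m => m.Value)
             .ToArray();

    static void Main()
    {
        // '.' is not a \w character, so "i.e." yields "i" and "e".
        Console.WriteLine(string.Join("|", Words("satellite, i.e. it rotates")));
        // prints: satellite|i|e|it|rotates
    }
}
```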

Solution 4

This works for me.

var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";
// Split on spaces and tabs only; punctuation stays attached to each word
var a = str.Split(new char[] { ' ', '\t' });
for (int i = 0; i < a.Length; i++)
{
    Console.WriteLine(" -{0}", a[i]);
}

Results:

 -The
 -moon
 -is
 -our
 -natural
 -satellite,
 -i.e.
 -it
 -rotates
 -around
 -the
 -Earth!

You could do some post-processing of the results, removing commas, semicolons, etc.
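
One possible shape for that post-processing step, as a sketch. The set of characters to strip is an assumption, and the period is deliberately left out so that "i.e." survives, which also means a sentence-ending period would stay attached to its word.

```csharp
using System;
using System.Linq;

static class SplitThenTrim
{
    // Assumed set of trailing marks to remove; '.' is intentionally excluded.
    static readonly char[] Trailing = { ',', ';', ':', '!', '?' };

    public static string[] Words(string text) =>
        text.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.TrimEnd(Trailing))
            .Where(w => w.Length > 0)
            .ToArray();

    static void Main()
    {
        var words = Words("The moon is our natural satellite, i.e. it rotates around the Earth!");
        Console.WriteLine(string.Join("|", words));
        // prints: The|moon|is|our|natural|satellite|i.e.|it|rotates|around|the|Earth
    }
}
```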

Author: Richard N

Updated on July 09, 2022

Comments

  • Richard N
    Richard N almost 2 years

    I have seen a few similar questions but I am trying to achieve this.

    Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!" I want to extract the words and store them in an array. The expected array elements would be this.

    the 
    moon 
    is 
    our 
    natural 
    satellite 
    i.e. 
    it  
    rotates 
    around 
    the 
    earth
    

    I tried using String.Split( ','\t','\r') but this does not work correctly. I also tried removing ".", "," and other punctuation marks, but I want a string like "i.e." to be kept intact too. What is the best way to achieve this? I also tried using Regex.Split, to no avail.

    string[] words = Regex.Split(line, @"\W+");
    

    Would surely appreciate some nudges in the right direction.

  • Jim Mischel
    Jim Mischel over 12 years
    But isn't that going to include punctuation as part of the word? So in the example above the last word would be "Earth!" ...
  • TheCodeKing
    TheCodeKing over 12 years
    No it won't match the punctuation in earth. \b matches on word boundaries.
  • TheCodeKing
    TheCodeKing over 12 years
    Regex does this with \b so you don't have to; admittedly there are some grey areas. For instance, "i.e." will match as "i.e".
  • Richard N
    Richard N over 12 years
    @Thecodeking, What about matching "i.e."? or something like "u.n.i.c.e.f"?
  • TheCodeKing
    TheCodeKing over 12 years
    Comes out as u.n.i.c.e.f or i.e :)
  • Jim Mischel
    Jim Mischel over 12 years
    How about "can't"? Won't that come out as "can" and "t"?
  • Richard N
    Richard N over 12 years
    And are you using this with Regex.Split?
  • Richard N
    Richard N over 12 years
    Would this be the best solution? Would post-processing be considered inefficient for cases like these?
  • Richard N
    Richard N over 12 years
    Thanks for the update. I'll mark this as the answer. I will use my existing string split method but will keep this as an option. Clearly, I need to learn more about regular expressions.
  • Lincoln Bergeson
    Lincoln Bergeson almost 7 years
    Six years later it's still beautiful :)