Split sentence into words but having trouble with the punctuations in C#
Solution 1
A regex solution.
(\b[^\s]+\b)
And if you really want to keep that last . on i.e., you could use this:
((\b[^\s]+\b)((?<=\.\w).)?)
Here's the code I'm using.
var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");
foreach (Match match in matches)
{
    // Print each matched word on its own line
    Console.WriteLine(match.Value);
}
Results:
The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth
Solution 2
I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?
Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
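A minimal sketch of the dictionary idea, under some loud assumptions: the class name DictionaryTokenizer, the KnownWords allow-list and its three entries are all made up for illustration, and it only handles abbreviations that end in a period. It splits on whitespace, strips punctuation from both ends of each token, and puts the period back only when the dotted form is on the allow-list.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryTokenizer
{
    // Hypothetical allow-list: abbreviations whose trailing period belongs to the word.
    private static readonly HashSet<string> KnownWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "i.e.", "e.g.", "etc." };

    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };
    private static readonly char[] Punctuation = { '.', ',', ';', ':', '!', '?', '"', '(', ')' };

    public static List<string> Tokenize(string input)
    {
        var words = new List<string>();
        foreach (var token in input.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries))
        {
            // Remove punctuation from both ends, including periods.
            var bare = token.Trim(Punctuation);
            if (bare.Length == 0)
            {
                continue; // the token was punctuation only
            }

            // Put the period back when the original token had one right after
            // the bare word and the dotted form is on the allow-list.
            var dotted = bare + ".";
            if (token.Contains(dotted) && KnownWords.Contains(dotted))
            {
                words.Add(dotted);
            }
            else
            {
                words.Add(bare);
            }
        }
        return words;
    }
}
```

Calling DictionaryTokenizer.Tokenize on the sentence from the question keeps "i.e." intact while "satellite," and "Earth!" lose their punctuation. It does not try to decide whether a period both ends an abbreviation and ends the sentence; it simply keeps the period with the abbreviation.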
Solution 3
Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value)
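For anyone who wants to try that expression as-is, here is a small self-contained wrapper. The class and method names (WordRunExtractor, Extract) are my own and not part of the answer; the pattern is the one above, unchanged.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordRunExtractor
{
    // Returns every run of word characters (letters, digits, underscore).
    public static List<string> Extract(string input)
    {
        return Regex.Matches(input, @"\b\w+\b")
            .OfType<Match>()          // MatchCollection is non-generic on older frameworks
            .Select(m => m.Value)
            .ToList();
    }
}
```

Because \w does not match a period, this pattern drops all punctuation, so on the sentence from the question "i.e." comes back as two separate words, "i" and "e". That is fine if you only want plain words, but it does not meet the "i.e." requirement.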
Solution 4
This works for me.
var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";
var a = str.Split(new char[] {' ', '\t'});
for (int i = 0; i < a.Length; i++)
{
    Console.WriteLine(" -{0}", a[i]);
}
Results:
-The
-moon
-is
-our
-natural
-satellite,
-i.e.
-it
-rotates
-around
-the
-Earth!
You could do some post-processing of the results, removing commas, semicolons, etc.
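One way the post-processing could look, as a rough sketch rather than a finished solution. The names SplitPostProcessor, SplitWords and TrimChars are hypothetical, and the set of characters to strip is an assumption you would adjust. Periods are deliberately left alone so that "i.e." survives, which also means a word at the end of a sentence keeps its period.

```csharp
using System;
using System.Linq;

public static class SplitPostProcessor
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Assumed set of punctuation to strip; '.' is intentionally not included.
    private static readonly char[] TrimChars = { ',', ';', ':', '!', '?' };

    public static string[] SplitWords(string input)
    {
        return input
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.Trim(TrimChars))   // strip punctuation at both ends
            .Where(word => word.Length > 0)           // drop tokens that were only punctuation
            .ToArray();
    }
}
```

With the sentence above, SplitPostProcessor.SplitWords turns "satellite," into "satellite" and "Earth!" into "Earth", and leaves "i.e." as it is.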
Richard N
Updated on July 09, 2022
Comments
-
Richard N almost 2 years
I have seen a few similar questions but I am trying to achieve this.
Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!" I want to extract the words and store them in an array. The expected array elements would be this.
the moon is our natural satellite i.e. it rotates around the earth
I tried using String.split( ','\t','\r') but this does not work correctly. I also tried removing periods, commas, and other punctuation marks, but I want a string like "i.e." to be kept intact too. What is the best way to achieve this? I also tried using Regex.Split, to no avail.
string[] words = Regex.Split(line, @"\W+");
Would surely appreciate some nudges in the right direction.
-
Jim Mischel over 12 years
But isn't that going to include punctuation as part of the word? So in the example above the last word would be "Earth!" ...
-
TheCodeKing over 12 years
No, it won't match the punctuation in Earth. \b matches on word boundaries.
-
TheCodeKing over 12 years
Regex does this with \b so you don't have to; admittedly there are some grey areas. For instance, "i.e." will match as "i.e".
-
Richard N over 12 years
@Thecodeking, what about matching "i.e."? Or something like "u.n.i.c.e.f"?
-
TheCodeKing over 12 years
Comes out as u.n.i.c.e.f or i.e :)
-
Jim Mischel over 12 years
How about "can't"? Won't that come out as "can" and "t"?
-
Richard N over 12 years
And are you using this with Regex.Split?
-
Richard N over 12 years
Would this be the best solution? Would post-processing be considered inefficient for cases like these?
-
Richard N over 12 years
Thanks for the update. I'll mark this as the answer. I will use my existing string split method but will keep this as an option. Clearly, I need to learn more about regular expressions.
-
Lincoln Bergeson almost 7 years
Six years later it's still beautiful :)