Converting a sentence string to a string array of words in Java


Solution 1

String.split() will do most of what you want. You may then need to loop over the words to pull out any punctuation.

For example:

String s = "This is a sample sentence.";
String[] words = s.split("\\s+");
for (int i = 0; i < words.length; i++) {
    // You may want to check for a non-word character before blindly
    // performing a replacement
    // It may also be necessary to adjust the character class
    words[i] = words[i].replaceAll("[^\\w]", "");
}
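Putting the two steps together, here is a minimal sketch that also lower-cases the words, since the question asks for {"this","is","a","sample","sentence"}. The class name SentenceWords and the method toWords are made up for illustration; tokens that consist only of punctuation are dropped rather than kept as empty strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceWords {
    // Hypothetical helper: split on whitespace, strip non-word characters,
    // lower-case each token, and skip tokens that end up empty (e.g. a lone "-").
    static String[] toWords(String sentence) {
        List<String> words = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            String cleaned = token.replaceAll("\\W", "").toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                words.add(cleaned);
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints this,is,a,sample,sentence
        System.out.println(String.join(",", toWords("This is a sample sentence.")));
    }
}
```

Note that \W only keeps ASCII letters, digits, and underscore by default, so accented letters would be stripped by this version.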

Solution 2

Since split() takes a regular expression, this can be done with a single call and no replacement loop:

String s = "This is a sample sentence with []s.";
String[] words = s.split("\\W+");

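This gives words as: {"This","is","a","sample","sentence","with","s"}

\\W+ matches one or more consecutive non-word characters, that is, anything other than letters, digits, and underscore. So there is no need to replace anything afterwards. Note that split() does not change the case of the words. You can try other patterns as well.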
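One thing to watch for with this approach: if the string starts with a non-word character (a quote or bracket, say), split() returns an empty string as the first element. A possible workaround, sketched below, is to strip the leading separators before splitting. The class name SplitOnNonWord and the method words are only illustrative.

```java
class SplitOnNonWord {
    // Hypothetical helper: split on runs of non-word characters without
    // producing a leading empty string.
    static String[] words(String s) {
        // Drop leading non-word characters so split() has nothing to match at index 0.
        String trimmed = s.replaceFirst("^\\W+", "");
        if (trimmed.isEmpty()) {
            return new String[0]; // "".split(...) would otherwise return {""}
        }
        // Trailing empty strings are already discarded by split().
        return trimmed.split("\\W+");
    }

    public static void main(String[] args) {
        // prints Hello,she,said
        System.out.println(String.join(",", words("\"Hello,\" she said.")));
    }
}
```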

Solution 3

You can use BreakIterator.getWordInstance to find all words in a string.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public static List<String> getWords(String text) {
    List<String> words = new ArrayList<>();
    BreakIterator breakIterator = BreakIterator.getWordInstance();
    breakIterator.setText(text);
    int lastIndex = breakIterator.first();
    while (BreakIterator.DONE != lastIndex) {
        int firstIndex = lastIndex;
        lastIndex = breakIterator.next();
        // Keep only segments that start with a letter or digit; this skips
        // the whitespace and punctuation segments the iterator also reports.
        if (lastIndex != BreakIterator.DONE && Character.isLetterOrDigit(text.charAt(firstIndex))) {
            words.add(text.substring(firstIndex, lastIndex));
        }
    }

    return words;
}

Test:

public static void main(String[] args) {
    System.out.println(getWords("A PT CR M0RT BOUSG SABN NTE TR/GB/(G) = RAND(MIN(XXX, YY + ABC))"));
}

Output:

[A, PT, CR, M0RT, BOUSG, SABN, NTE, TR, GB, G, RAND, MIN, XXX, YY, ABC]
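If you need a lower-cased String[] as in the question rather than a List, the same idea can be adapted roughly as follows. This is only a sketch: the class BreakIteratorWords and the method toWordArray are hypothetical names, and passing the locale explicitly is my own choice, not something the question requires.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorWords {
    // Hypothetical variant of getWords: explicit locale, lower-cased words,
    // result returned as a String[].
    static String[] toWordArray(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Segments that begin with a letter or digit are treated as words;
            // whitespace and punctuation segments are skipped.
            if (Character.isLetterOrDigit(text.charAt(start))) {
                words.add(text.substring(start, end).toLowerCase(locale));
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints this,is,a,sample,sentence
        String[] words = toWordArray("This is a sample sentence.", Locale.ENGLISH);
        System.out.println(String.join(",", words));
    }
}
```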

Solution 4

You can also use BreakIterator.getWordInstance.

Solution 5

You can also split the string directly on runs of spaces and punctuation, using a character class in the regular expression:

String l = "sofia, malgré tout aimait : la laitue et le choux !";
// Split on one or more spaces or any of , . : / ! ? +
String[] words = l.split("[ ,.:/!?+]+");
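Listing the punctuation by hand works for text like this where \\W+ would not, because \W treats accented letters such as é as non-word characters by default and would cut "malgré" short. If you would rather not enumerate every punctuation mark, one option is to split on anything that is not a Unicode letter or digit. A rough sketch, with UnicodeWords and words as made-up names:

```java
class UnicodeWords {
    // Any run of characters that are neither Unicode letters nor decimal digits
    // is treated as a separator.
    private static final String SEPARATOR = "[^\\p{L}\\p{Nd}]+";

    // Hypothetical helper: Unicode-aware word split without a leading empty string.
    static String[] words(String s) {
        String trimmed = s.replaceFirst("^" + SEPARATOR, "");
        return trimmed.isEmpty() ? new String[0] : trimmed.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // The accented letter is written as a Unicode escape (\u00e9) so the
        // source compiles regardless of the file encoding.
        String l = "sofia, malgr\u00e9 tout aimait : la laitue et le choux !";
        System.out.println(String.join("|", words(l)));
    }
}
```

Keep in mind this also splits on apostrophes and hyphens, so "l'eau" becomes two tokens.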
Author by

AnimatedRNG

Updated on July 08, 2022

Comments

  • AnimatedRNG
    AnimatedRNG almost 2 years

    I need my Java program to take a string like:

    "This is a sample sentence."
    

    and turn it into a string array like:

    {"this","is","a","sample","sentence"}
    

    No periods, or punctuation (preferably). By the way, the string input is always one sentence.

    Is there an easy way to do this that I'm not seeing? Or do we really have to search for spaces a lot and create new strings from the areas between the spaces (which are words)?