Java simple sentence parser

13,075

Building on @Jarrod Roberson's answer, I wrote a utility method that uses BreakIterator to split the text and returns the list of sentences.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Splits text into sentences using the locale-aware rules of BreakIterator.
public static List<String> tokenize(String text, String language, String country) {
    List<String> sentences = new ArrayList<>();
    Locale currentLocale = new Locale(language, country);
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
    sentenceIterator.setText(text);
    // Walk consecutive boundaries; each [start, end) pair delimits one sentence.
    int start = sentenceIterator.first();
    for (int end = sentenceIterator.next(); end != BreakIterator.DONE; end = sentenceIterator.next()) {
        // The substring keeps any whitespace that follows the sentence.
        sentences.add(text.substring(start, end));
        start = end;
    }
    return sentences;
}
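As a rough usage sketch, the method can be called as below. The class name SentenceDemo and the sample text are made up for illustration, and tokenize is repeated inside the class only so that the example compiles on its own.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceDemo {

    // Same logic as the tokenize method in the answer, repeated so this compiles standalone.
    public static List<String> tokenize(String text, String language, String country) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(new Locale(language, country));
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            sentences.add(text.substring(start, end));
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        String text = "Hello there! How are you? I am fine.";
        for (String sentence : tokenize(text, "en", "US")) {
            // Each sentence keeps its trailing whitespace, so trim before printing.
            System.out.println("[" + sentence.trim() + "]");
        }
    }
}
```

Note that the returned pieces, concatenated, reproduce the input exactly, so callers who want clean sentences will probably want to trim each one.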
Author: Admin

Updated on June 27, 2022

Comments

  • Admin (almost 2 years ago)

    Is there a simple way to create a sentence parser in plain Java, without adding any libraries or JARs?

    The parser should not just handle the spaces between words. It should be smarter: handle . ! ?, recognize where a sentence ends, and so on.

    After parsing, only the actual words should be stored in a database or file, without any special characters.

    Thank you all very much in advance :)
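
The last requirement in the question, keeping only the words and no special characters, could be handled with the standard library as well. The following is a minimal sketch, assuming that a "real word" is any token that starts with a letter or digit; the class name WordExtractor and the method words are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordExtractor {

    // Hypothetical helper: returns only the word tokens of a sentence,
    // dropping punctuation and whitespace tokens.
    public static List<String> words(String sentence, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(sentence);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            String token = sentence.substring(start, end);
            // Assumption: a token counts as a word if it starts with a letter or digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [Hello, world, How, are, you]
        System.out.println(words("Hello, world! How are you?", Locale.US));
    }
}
```

The resulting list can then be written to a file or database as needed; how contractions and hyphenated words are split depends on the locale rules of BreakIterator, so it is worth checking against real input.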