Split a text into sentences

Since you want to "split" the text into sentences, why are you trying to match them?

For this case let's use preg_split().

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
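// Split on whitespace that follows ., ? or ! and precedes a letter (the i modifier makes it case-insensitive).
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);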
print_r($sentences);

Output:

Array
(
    [0] => Fry me a Beaver.
    [1] => Fry me a Beaver!
    [2] => Fry me a Beaver?
    [3] => Fry me Beaver no. 4?!
    [4] => Fry me many Beavers...
    [5] => End
)

Explanation:

Put simply, we split on one or more whitespace characters (\s+), but only where two conditions hold:

  1. (?<=[.?!]) Positive lookbehind assertion: the whitespace must be preceded by a period, question mark, or exclamation mark.

  2. (?=[a-z]) Positive lookahead assertion: the whitespace must be followed by a letter (upper or lower case, because of the i modifier). This is a workaround for the "no. 4" problem.
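
To see what each assertion contributes, here is a small sketch (not part of the original answer) that runs the lookbehind-only pattern and the full pattern on the same string. The variable names are made up for illustration.

```php
<?php
$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';

// Lookbehind only: every whitespace run after ., ? or ! is a split point,
// so "no. 4?!" is cut in two.
$lookbehindOnly = preg_split('/(?<=[.?!])\s+/', $str);

// Lookbehind plus lookahead: the whitespace must also be followed by a letter,
// so the space before "4" is no longer a split point.
$both = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);

echo count($lookbehindOnly), "\n";                          // 7
echo $lookbehindOnly[3], ' | ', $lookbehindOnly[4], "\n";   // Fry me Beaver no. | 4?!
echo count($both), "\n";                                    // 6
echo $both[3], "\n";                                        // Fry me Beaver no. 4?!
```

The lookahead is a heuristic, not a rule of grammar: it only helps because the token after "no." happens to start with a digit.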

Author by

thelolcat

Updated on July 14, 2022

Comments

  • thelolcat
    thelolcat almost 2 years

    How can I split a text into an array of sentences?

    Example text:

    Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End

    Should output:

    0 => Fry me a Beaver.
    1 => Fry me a Beaver!
    2 => Fry me a Beaver?
    3 => Fry me Beaver no. 4?!
    4 => Fry me many Beavers...
    5 => End
    

    I tried some solutions that I've found on SO through search, but they all fail, especially at the 4th sentence.

    /(?<=[!?.])./
    
    /\.|\?|!/
    
    /((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/
    
    /(?<=[.!?]|[.!?][\'"])\s+/    // <- closest one
    
  • thelolcat
    thelolcat almost 11 years
    Just a question: shouldn't \s be \s+? I mean, to ignore multiple spaces grouped together.
  • HamZa
    HamZa almost 11 years
    @thelolcat Well, you're right in case there are multiple spaces!
  • voidMainReturn
    voidMainReturn almost 11 years
    @HamZa: What would this translate to in Java? I tried the same thing in Java, but it doesn't work. Can you guide me?
  • HamZa
    HamZa almost 11 years
    @tejas I would guess you need to use a double backslash instead of a single one: (?<=[.?!])\\s+(?=[a-z])
  • voidMainReturn
    voidMainReturn almost 11 years
    Yes, this is what I am using: str.split("(?<=[.?!])\\s+(?=[a-z])"); but to no avail.
  • HamZa
    HamZa almost 11 years
    @tejas What do you mean by "not correctly"? Would you mind joining me in the regex chatroom?
  • voidMainReturn
    voidMainReturn almost 11 years
    Actually, the following one worked: str.split("(?<=[.?!])\\s+(?=[a-zA-Z])")
  • HamZa
    HamZa almost 11 years
    @tejas You see that little i after the closing / ? It makes the match case-insensitive. I think you could use my expression and add (?i) to the beginning of it :)
  • voidMainReturn
    voidMainReturn almost 11 years
    Yeah, OK. I didn't know it's written as (?i) in Java. I tried using /i and it didn't work.
  • HamZa
    HamZa almost 11 years
    @tejas Not only in Java; it's possible in many more languages.
  • Ryan
    Ryan about 7 years
    I love this and would love it even more if I could disqualify ... from counting as the end of a sentence and include .) as the end of a sentence. Ideas? Thanks.
  • HamZa
    HamZa about 7 years
    @Ryan A quick attempt: (?<!\.\.\.)(?<=[.?!]|\.\))\s+(?=[a-z]) See if it suits your needs.
  • Ryan
    Ryan about 7 years
    Based on what I learned from yours, I was able to edit it to handle even more corner cases that I'm running into: regex101.com/r/e4NYyd/4 Cool stuff.
  • Richard
    Richard almost 6 years
    This doesn't work. Try adding "i.e. " to the sentence; the regex fails on it.
  • holden321
    holden321 over 5 years
    Also, it doesn't work for sentences with Mr. or with !" at the end of the sentence.
  • Markus AO
    Markus AO over 3 years
    "2020 is the year the system failed." Sentences may start with a digit... which makes avoiding "See (A. 1) for reference." more complex.
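
Several of the comments above point at abbreviations ("i.e.", "Mr.") and sentences that start with a digit. One possible way to handle some of these is to add a negative lookbehind per known abbreviation and to let the lookahead accept digits as well. The sketch below is only an illustration under those assumptions: the function name splitSentences and the abbreviation list are hypothetical, not something from the answer or from PHP itself.

```php
<?php
// Hypothetical helper, not part of the original answer. The default
// abbreviation list is an assumption and would need tuning for real text.
function splitSentences(string $text, array $abbreviations = ['Mr.', 'i.e.', 'no.']): array
{
    // Build one negative lookbehind per abbreviation, e.g. (?<!\bMr\.)
    $guards = '';
    foreach ($abbreviations as $abbr) {
        $guards .= '(?<!\b' . preg_quote($abbr, '/') . ')';
    }

    // Same shape as the answer's pattern, plus the guards. The lookahead also
    // accepts digits so that "2020 is the year..." can start a sentence.
    $pattern = '/(?<=[.?!])' . $guards . '\s+(?=[a-z0-9])/i';

    // preg_split() returns false on failure; fall back to the unsplit text.
    return preg_split($pattern, $text) ?: [$text];
}

print_r(splitSentences('Mr. Smith fried it, i.e. he cooked it. 2020 is the year the system failed. End'));
```

With the list above this should give three pieces: "Mr. Smith fried it, i.e. he cooked it.", "2020 is the year the system failed." and "End". It has trade-offs: a sentence that really ends in "no." will not be split, "See (A. 1) for reference." is still cut after "A.", and endings like !" are not covered. For arbitrary prose a regex like this remains a heuristic.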