Split a text into sentences
Since you want to "split" the text into sentences, why are you trying to match them?
For this case let's use preg_split().
Code:
$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);
print_r($sentences);
Output:
Array
(
[0] => Fry me a Beaver.
[1] => Fry me a Beaver!
[2] => Fry me a Beaver?
[3] => Fry me Beaver no. 4?!
[4] => Fry me many Beavers...
[5] => End
)
Explanation:
Put simply, we split on runs of whitespace (\s+), subject to two conditions:
(?<=[.?!]) is a positive lookbehind assertion: the whitespace must be preceded by a period, question mark, or exclamation mark.
(?=[a-z]) is a positive lookahead assertion: the whitespace must be followed by a letter (upper or lower case, because of the i modifier). This is a workaround for the "no. 4" problem.
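To see what the lookahead buys, here is a small sketch that runs the same split with and without it on a shortened version of the example string. The variable names $naive and $guarded are illustrative only.

```php
<?php
$str = 'Fry me Beaver no. 4?! Fry me many Beavers... End';

// Without the lookahead: split after every ., ? or ! that is followed by whitespace.
$naive = preg_split('/(?<=[.?!])\s+/', $str);

// With the lookahead: split only when a letter follows the whitespace,
// so "no. 4" stays in one piece.
$guarded = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);

print_r($naive);   // 4 pieces: "Fry me Beaver no.", "4?!", "Fry me many Beavers...", "End"
print_r($guarded); // 3 pieces: "Fry me Beaver no. 4?!", "Fry me many Beavers...", "End"
```

The lookahead only helps when the abbreviation is followed by a non-letter; an abbreviation followed by a word would still be split.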
Comments
-
thelolcat almost 2 years
How can I split a text into an array of sentences?
Example text:
Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End
Should output:
0 => Fry me a Beaver.
1 => Fry me a Beaver!
2 => Fry me a Beaver?
3 => Fry me Beaver no. 4?!
4 => Fry me many Beavers...
5 => End
I tried some solutions that I've found on SO through search, but they all fail, especially at the 4th sentence.
/(?<=[!?.])./
/\.|\?|!/
/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/
/(?<=[.!?]|[.!?][\'"])\s+/ // <- closest one
-
thelolcat almost 11 years
Just a question: shouldn't \s be \s+? I mean, to ignore multiple spaces grouped together.
-
HamZa almost 11 years
@thelolcat Well, you're right in case there are multiple spaces!
-
voidMainReturn almost 11 years
@HamZa: What would it translate to in Java? I tried the same thing in Java but it doesn't work. Can you guide me?
-
HamZa almost 11 years
@tejas I would guess you need to use a double backslash instead of a single one: (?<=[.?!])\\s+(?=[a-z])
-
voidMainReturn almost 11 years
Yes, this is what I am using: str.split("(?<=[.?!])\\s+(?=[a-z])"); but to no avail.
-
HamZa almost 11 years
@tejas What do you mean by "not correctly"? Would you mind joining me in the regex chatroom?
-
voidMainReturn almost 11 years
Actually, the following one worked: str.split("(?<=[.?!])\\s+(?=[a-zA-Z])")
-
HamZa almost 11 years
@tejas You see that little i after the closing /? That means the match is case-insensitive. I think you could use my expression and add (?i) to the beginning of it :)
-
voidMainReturn almost 11 years
Yeah, OK. I didn't know it's written as (?i) in Java. I tried using it as /i and it didn't work.
-
HamZa almost 11 years
@tejas Not only in Java; it's possible in many more languages.
-
Cosmologist over 7 years
Thank you! Added it to my helper library - github.com/Cosmologist/Gears/blob/master/src/Gears/StringType/…
-
Ryan about 7 years
I love this and would love it even more if I could disqualify ... from counting as the end of a sentence and include .) as the end of a sentence. Ideas? Thanks.
-
HamZa about 7 years
@Ryan A quick one: (?<!\.\.\.)(?<=[.?!]|\.\))\s+(?=[a-z]) See if it suits your needs.
-
Ryan about 7 years
Based on what I learned from yours, I was able to edit it to handle even more corner cases that I'm running into: regex101.com/r/e4NYyd/4 Cool stuff.
-
Richard almost 6 years
This doesn't work. Try adding "i.e. " to the sentence; the regex fails on it.
-
holden321 over 5 years
Also, it doesn't work for sentences containing "Mr." or ending with !".
-
Markus AO over 3 years
"2020 is the year the system failed." Sentences may start with a digit... which makes avoiding "See (A. 1) for reference." more complex.
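
For anyone wanting to try the points raised in this thread, the following is a rough sketch, not a fix. It runs the ellipsis/".)" variant quoted in the comments and then reproduces the "i.e." failure with the answer's original pattern. The /i modifier on the variant, the sample sentences, and the variable names are assumptions made for illustration.

```php
<?php
// Variant from the comment thread: never split after "...", and also treat ".)"
// as a sentence end. The /i modifier is assumed, matching the answer's pattern.
$variant = '/(?<!\.\.\.)(?<=[.?!]|\.\))\s+(?=[a-z])/i';
$ellipsis = preg_split($variant, 'Wait... really? Yes (it did.) Good.');
print_r($ellipsis); // "Wait... really?", "Yes (it did.)", "Good."

// The answer's pattern still splits inside abbreviations such as "i.e.",
// because the period is followed by whitespace and then a letter.
$original = '/(?<=[.?!])\s+(?=[a-z])/i';
$abbrev = preg_split($original, 'Use a pan, i.e. a skillet. Done.');
print_r($abbrev); // "Use a pan, i.e.", "a skillet.", "Done."
```

Neither pattern addresses sentences that begin with a digit or abbreviations such as "Mr.", as noted above.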