REGEX in R: extracting words from a string
Solution 1
You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.
There are two problems with your gsub approach:
- You used single backslashes (\). R requires you to escape those since they are special characters. You escape them by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns 1.
- You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
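Both points can be checked at the console. The snippet below is only a minimal sketch, and the sample string is illustrative rather than taken from the question.

```r
# "\\" in R source code is a single backslash character once parsed.
nchar("\\")
# [1] 1

# cat() prints the characters the regex engine actually receives.
cat("\\S+\\s+\n")
# \S+\s+

# A capture group (...) is referenced as "\\1" in the replacement.
sub("(love)", "<\\1>", "I love stack")
# [1] "I <love> stack"
```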
You should have tried something like:
sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"
This is essentially saying:
- Work from the start of the contents of "z".
- Start creating group 1.
- Find non-whitespace (like a word) followed by whitespace (\S+\s+) two times {2} and then the next set of non-whitespaces (\S+). This will get us 3 words, without also getting the whitespace after the third word. Thus, if you wanted a different number of words, change the {2} to be one less than the number you are actually after.
- End group 1 there.
- Then, just return the contents of group 1 (\1) from "z".
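The "one less than the number you are after" rule can be wrapped in a small function. This is a sketch only: first_n_words is a hypothetical helper name, not something from base R or from the answer, and it assumes the string does not start with whitespace.

```r
# Hypothetical helper (not part of base R): keep the first n words of x.
first_n_words <- function(x, n) {
  stopifnot(n >= 1)
  # n words = (n - 1) repetitions of "word + whitespace", then one more word.
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", as.integer(n) - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
first_n_words(z, 3)
# [1] "I love stack"
first_n_words(z, 4)
# [1] "I love stack overflow"
```

When the pattern does not match (fewer than n words, or leading whitespace), sub() returns the input unchanged.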
To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.
sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
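The same idea works from the end of the string. Again a sketch under assumptions: last_n_words is a hypothetical helper name, and the input is assumed to have no trailing whitespace.

```r
# Hypothetical helper (not part of base R): keep the last n words of x.
last_n_words <- function(x, n) {
  stopifnot(n >= 1)
  # Greedy ".*" followed by "\\s+" forces the group to start at a word boundary.
  pattern <- sprintf("^.*\\s+((?:\\S+\\s+){%d}\\S+)$", as.integer(n) - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
last_n_words(z, 3)
# [1] "a cool site"
last_n_words(z, 4)
# [1] "such a cool site"
```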
Solution 2
For getting the first four words.
library(stringr)
str_extract(x, "^\\s*(?:\\S+\\s+){3}\\S+")
For getting the last four.
str_extract(x, "(?:\\S+\\s+){3}\\S+(?=\\s*$)")
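If you would rather not depend on stringr, the same two patterns can be used for extraction in base R with regexpr() and regmatches(). This is a sketch that assumes x holds the example sentence from the question.

```r
# Assumption: x is the example sentence from the question.
x <- "I love stack overflow it is such a cool site"

# regexpr() locates the first match; regmatches() pulls out the matched text.
first4 <- regmatches(x, regexpr("^\\s*(?:\\S+\\s+){3}\\S+", x, perl = TRUE))
first4
# [1] "I love stack overflow"

# The lookahead (?=\\s*$) needs perl = TRUE.
last4 <- regmatches(x, regexpr("(?:\\S+\\s+){3}\\S+(?=\\s*$)", x, perl = TRUE))
last4
# [1] "such a cool site"
```

One difference to keep in mind: regmatches() drops elements with no match, whereas str_extract() returns NA for them.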
Fagui Curtain
Updated on July 07, 2022
Comments
-
Fagui Curtain almost 2 years
I guess this is a common problem, and I found quite a lot of webpages, including some from SO, but I failed to understand how to implement it.
I am new to REGEX, and I'd like to use it in R to extract the first few words from a sentence.
For example, if my sentence is
z = "I love stack overflow it is such a cool site"
I'd like my output to be (if I need the first four words)
[1] "I love stack overflow"
or (if I need the last four words)
[1] "such a cool site"
Of course, the following works:
paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")
But I'd like to try a regex solution for performance reasons, as I need to deal with very large files (and also for the sake of learning about it).
I looked at several links, including Regex to extract first 3 words from a string and http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html
So I tried things like
gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"
I tried other things, but they usually returned either the whole string or the empty string.
Another problem with strsplit is that it returns a list. Maybe the [[]] operator is slowing things down a bit (??) when dealing with large files and doing apply stuff.
It looks like the syntax used in R is somewhat different? Thanks!
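On the concern about lists and apply: sub() is vectorized over a character vector, so a regex version needs no [[ ]] indexing or explicit loop. The sketch below uses a made-up second sentence for illustration and shows a vectorized split-based version next to it for comparison; no timing claims are made here.

```r
# Illustrative input; the second sentence is invented for this example.
sentences <- c(
  "I love stack overflow it is such a cool site",
  "regex in R takes some practice to learn"
)

# Regex version: one call handles the whole vector.
first4_regex <- sub("^((?:\\S+\\s+){3}\\S+).*", "\\1", sentences, perl = TRUE)
first4_regex
# [1] "I love stack overflow" "regex in R takes"

# Split version: strsplit() returns a list, so vapply() walks over it.
# head() and tail() avoid hard-coding index ranges such as 7:10.
word_lists <- strsplit(sentences, " ", fixed = TRUE)
first4_split <- vapply(word_lists, function(w) paste(head(w, 4), collapse = " "), character(1))
last4_split <- vapply(word_lists, function(w) paste(tail(w, 4), collapse = " "), character(1))
last4_split
# [1] "such a cool site"       "some practice to learn"
```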
-
Avinash Raj over 8 years
Or:
sub("^\\s*((?:\\S+\\s+){3}\\S+).*", "\\1", x)
-
Fagui Curtain over 8 years
Can you give me the correct regex using the function sub? I ran a test on a sample of 10,000 and the sub function from base R is about 30 times faster than str_extract from library(stringr). Thanks.
-
Fagui Curtain over 8 years
Thanks, @Ananda Mahto. Could you give the regex for the last 4 words using the same function, sub?
-
Fagui Curtain over 8 years
I'm stupid but don't know how to tweak the function.
sub("(?:\\S+\\s+){3}\\S+(?=\\s*$)",replacement="",z,perl=TRUE)
is returning me "I love stack overflow it is ", which is everything BUT the last 4 words...
-
Fagui Curtain over 8 years
sub('^.* (\\w+\\s+\\w+\\s+\\w+\\s+\\w+)$', '\\1', z)
works for the last 4 words, but I don't understand how to properly use the {...} quantifier to make a simpler expression in this case.
-
Avinash Raj over 8 years
Like:
sub('^.* (\\w+(?:\\s+\\w+){4})$', '\\1', z)
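A quick check of how the quantifier relates to the word count in this form, as a sketch: the leading \\w+ is one word and each repetition of the group adds one more, so {4} yields five words and {3} yields four. perl = TRUE is added here as an assumption so that the engine is explicit.

```r
z <- "I love stack overflow it is such a cool site"

# One leading word plus four repetitions: five words.
five <- sub('^.* (\\w+(?:\\s+\\w+){4})$', '\\1', z, perl = TRUE)
five
# [1] "is such a cool site"

# One leading word plus three repetitions: four words.
four <- sub('^.* (\\w+(?:\\s+\\w+){3})$', '\\1', z, perl = TRUE)
four
# [1] "such a cool site"

# The spelled-out pattern with four \\w+ parts is equivalent to the {3} form.
spelled_out <- sub('^.* (\\w+\\s+\\w+\\s+\\w+\\s+\\w+)$', '\\1', z, perl = TRUE)
```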
-
A5C1D2H2I1M1N2O1R2T1 over 8 years
@FaguiCurtain, I just swapped the reference from being fixed to the start of the line to the end instead, like:
^.*((?:\\S+\\s+){2}\\S+)$
Change "2" to "3" to get 4 words instead of 3.
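One caveat worth checking when using the pattern exactly as written in this comment: without whitespace between the greedy .* and the group, the group can begin in the middle of a word. The example sentence is fine because its third-from-last word is the single letter "a". The sketch below uses an invented sentence to show the difference from the version in the answer, which has \\s+ before the group.

```r
# Invented sentence whose third-from-last word has more than one letter.
x <- "this is a very cool site"

# Greedy ".*" swallows as much as it can, including most of "very".
no_boundary <- sub("^.*((?:\\S+\\s+){2}\\S+)$", "\\1", x, perl = TRUE)
no_boundary
# [1] "y cool site"

# Requiring whitespace before the group pins its start to a word boundary.
with_boundary <- sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", x, perl = TRUE)
with_boundary
# [1] "very cool site"
```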