How to remove specific special characters in R
Solution 1
gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
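The pattern above works because a ^ placed first inside a bracket expression negates it, so the call deletes every character that is not listed. Outside brackets, ^ anchors the match to the start of the string. The sketch below contrasts the two uses. The name txt stands in for the question's c, and the variant that also keeps the apostrophe is an assumption based on the list in the question, not part of the original answer.

```r
txt <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# Outside brackets, "^" anchors at the start: only the leading "I" is removed.
anchored <- sub("^I", "", txt)

# Inside brackets, a leading "^" negates the class: every character that is
# NOT alphanumeric, blank, or one of + ? & / - is deleted.
kept <- gsub("[^[:alnum:][:blank:]+?&/\\-]", "", txt)
print(kept)

# Assumed variant: add the apostrophe to the kept set as well.
with_apostrophe <- gsub("[^[:alnum:][:blank:]'+?&/\\-]", "", "don't (stop)!")
print(with_apostrophe)
```

Listing what to keep is usually shorter than listing what to remove, because anything not named in the class is dropped.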
Solution 2
In order to get your method to work, you need to put the literal "]" immediately after the leading "["
gsub("[][!#$%()*,.:;<=>@^_`|~.{}]", "", c)
[1] "In Acid-base reaction page4 why does it create water and not H+?"
You can then put the inner "[" anywhere. If you needed to exclude minus, it would then need to be last. See the paragraph on the ?regex page that follows the list of predefined character classes.
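As a minimal sketch of the ordering rules above, the characters to delete can be kept in an ordinary string and wrapped in brackets with paste0(). Note that cat() only prints its argument and returns NULL, so it cannot be used to build a pattern. The names txt, special_string, and pattern are illustrative, and txt stands in for the question's c.

```r
txt <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# Characters to delete. "]" comes first and "[" second so both are literal;
# "^" is kept away from the first position so it is not read as negation.
# The backslash before the double quote is for R's string parser only.
special_string <- "][!\"#$%()*,.:;<=>@^_`{|}~"

# Wrap the string in brackets to form the bracket expression.
pattern <- paste0("[", special_string, "]")

cleaned <- gsub(pattern, "", txt)
print(cleaned)
```

Because +, ?, and - are simply absent from special_string, they survive without any escaping.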
Solution 3
I think you're after a regex solution. I'll give you a messy solution and a package add on solution (shameless self promotion).
There's likely a better regex:
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
keeps <- c("+", "-", "?")
## Regex solution
gsub(paste0(".*?($|'|", paste(paste0("\\",
keeps), collapse = "|"), "|[^[:punct:]]).*?"), "\\1", x)
#qdap: addon package solution
library(qdap)
strip(x, keeps, lower = FALSE)
## [1] "In Acid-base reaction page why does it create water and not H+?"
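For comparison, here is a base R sketch that takes the same keeps vector but uses a negative lookahead with perl = TRUE, so that punctuation is removed unless it is one of the kept characters. This swaps in a different technique from the regex above. The helper name strip_punct_except is hypothetical, and it assumes every element of keeps is a single non-alphanumeric character.

```r
x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
keeps <- c("+", "-", "?")

# Hypothetical helper: delete punctuation except the characters in `keeps`.
strip_punct_except <- function(x, keeps) {
  # Backslash-escape each kept character; with perl = TRUE an escaped
  # non-alphanumeric character is a literal inside a character class.
  keep_class <- paste0("\\", keeps, collapse = "")
  # (?![...]) fails at any position where a kept character comes next,
  # so [[:punct:]] only matches the remaining punctuation.
  pattern <- paste0("(?![", keep_class, "])[[:punct:]]")
  gsub(pattern, "", x, perl = TRUE)
}

result <- strip_punct_except(x, keeps)
print(result)
```

Unlike the strip() output shown above, this sketch leaves digits in place, so "page4" keeps its 4.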
wen
Updated on July 09, 2022
Comments
-
wen almost 2 years
I have some sentences like this one.
c = "In Acid-base reaction (page[4]), why does it create water and not H+?"
I want to remove all special characters except for '?&+-/
I know that if I want to remove all special characters, I can simply use
gsub("[[:punct:]]", "", c)
# [1] "In Acidbase reaction page4 why does it create water and not H"
However, some special characters such as + - ? are also removed, which I intend to keep.
I tried to create a string of special characters that I can use in some code like this
gsub("[special_string]", "", c)
The best I can do is to come up with this
cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")
However, the following code just won't work
gsub("[cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")]", "", c)
What should I do to remove special characters, except for a few that I want to keep?
Thanks
-
wen about 10 years
This really works. I only know that ^ marks the beginning of a line (and $ marks the end). Why are you using it to mean "keep"? Could you explain a little?
-
IRTFM about 10 years
"^" is the character class negation marker (when it occurs first). Read ?regex.
-
BrodieG about 10 years
@user3193265, as IShouldBuyABoat notes, the ^ inside a character range ([]) has a different meaning than outside. Several other characters have different meanings too. For example, ? and + are not special characters in character ranges, but - is (so we had to escape that one). Inside, as the first character, ^ means negate, that is, match everything other than what's inside the expression. If this answers your question, please consider checking it as answered. Thanks.
-
Fabian Werner over 8 years
Does not seem to work for all special characters, like '•'. They seem to survive...