How to remove specific special characters in R


Solution 1

gsub("[^[:alnum:][:blank:]+?&/\\-]", "", c)
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
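
As a hedged illustration of why this works: the pattern is a negated character class, so it deletes every character that is not alphanumeric, not a blank, and not one of the listed symbols. The sketch below uses a hypothetical variable name `txt` instead of `c`, and places the hyphen last in the class so that it needs no escaping.

```r
txt <- "In Acid-base reaction (page[4]), why does it create water and not H+?"

# "[^...]" matches any single character NOT listed inside the brackets.
# Kept: letters and digits, spaces and tabs, and the symbols ? & + / -
# The hyphen is written last so it is read as a literal, not as a range.
res1 <- gsub("[^[:alnum:][:blank:]?&+/-]", "", txt)
res1
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
```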

Solution 2

To get your method to work, you need to put the literal "]" immediately after the leading "[".

 gsub("[][!#$%()*,.:;<=>@^_`|~.{}]", "", c)
[1] "In Acid-base reaction page4 why does it create water and not H+?"

You can then put the inner "[" anywhere. If you also needed to remove the minus sign, it would have to come last in the class. See the ?regex help page; the rule is described right after the list of predefined character classes.
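
The placement rules above can be checked on a smaller input. This is only a sketch; the test string `s` is made up for illustration.

```r
s <- "(page[4]), a-b"

# "]" comes first inside the class, so it is a literal; "[" can go anywhere.
res_brackets <- gsub("[][(),]", "", s)
res_brackets
# [1] "page4 a-b"

# To remove the hyphen as well, add it as the last character of the class.
res_hyphen <- gsub("[][(),-]", "", s)
res_hyphen
# [1] "page4 ab"
```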

Solution 3

I think you're after a regex solution. I'll give you a messy regex solution and an add-on package solution (shameless self promotion).

There's likely a better regex:

x <- "In Acid-base reaction (page[4]), why does it create water and not H+?" 
keeps <- c("+", "-", "?")

## Regex solution
gsub(paste0(".*?($|'|", paste(paste0("\\", 
    keeps), collapse = "|"), "|[^[:punct:]]).*?"), "\\1", x)

#qdap: addon package solution
library(qdap)
strip(x, keeps, lower = FALSE)

## [1] "In Acid-base reaction page why does it create water and not H+?"
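
One possible tidier alternative to the messy regex, offered as a sketch rather than as this answer's method: with perl = TRUE, a negative lookahead can exclude the characters in `keeps` before matching [[:punct:]]. The helper name `strip_except` is hypothetical, and it assumes every element of `keeps` is a single punctuation character. Unlike the strip() output shown above, it leaves digits in place.

```r
# Hypothetical helper: remove punctuation except the characters in `keeps`.
# Assumes each element of `keeps` is a single punctuation character.
strip_except <- function(x, keeps) {
  # Backslash-escape each kept character so that "-" cannot form a range.
  keep_class <- paste0("\\", keeps, collapse = "")
  # (?![...]) refuses to match at a kept character; [[:punct:]] matches the rest.
  pattern <- paste0("(?![", keep_class, "])[[:punct:]]")
  gsub(pattern, "", x, perl = TRUE)
}

x <- "In Acid-base reaction (page[4]), why does it create water and not H+?"
res3 <- strip_except(x, c("+", "-", "?"))
res3
# [1] "In Acid-base reaction page4 why does it create water and not H+?"
```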

Author by

wen

Updated on July 09, 2022

Comments

  • wen
    wen almost 2 years

    I have some sentences like this one.

    c = "In Acid-base reaction (page[4]), why does it create water and not H+?" 
    

    I want to remove all special characters except for '?&+-/

    I know that if I want to remove all special characters, I can simply use

    gsub("[[:punct:]]", "", c)
    "In Acidbase reaction page4 why does it create water and not H"
    

    However, some special characters such as + - ? are also removed, which I intend to keep.

    I tried to create a string of special characters that I can use in some code like this

    gsub("[special_string]", "", c)
    

    The best I can do is to come up with this

    cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")
    

    However, the following code just won't work

    gsub("[cat("!\"#$%()*,.:;<=>@[\\]^_`{|}~.")]", "", c)
    

    What should I do to remove special characters, except for a few that I want to keep?

    Thanks

  • wen
    wen about 10 years
    This really works. I only know that ^ marks the beginning of a line (and $ marks the end). Why are you using it to mean "keep"? Could you explain a little?
  • IRTFM
    IRTFM about 10 years
    "^" is the character class negation marker (when it occurs first). Read ?regex.
  • BrodieG
    BrodieG about 10 years
    @user3193265, as IShouldBuyABoat notes, the ^ inside a character range ([]) has a different meaning than outside. Several other characters have different meanings too. For example, ? and + are not special characters in character ranges, but - is (so we had to escape that one). Inside a character range, as the first character, ^ means negate, or pretty much everything other than what's inside the expression. If this answers your question, please consider checking it as answered. Thanks.
  • Fabian Werner
    Fabian Werner over 8 years
    Does not seem to work for all special characters, like '•'. They seem to survive...
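
The different meanings of ^ discussed in the comments above can be seen side by side in a short sketch; the input strings are made up for illustration.

```r
# Outside brackets, ^ anchors the match to the start of the string.
anchored <- gsub("^a", "", "abcabc")
anchored
# [1] "bcabc"

# As the first character inside brackets, ^ negates the class:
# everything that is not "a" is removed.
negated <- gsub("[^a]", "", "abcabc")
negated
# [1] "aa"

# Anywhere else inside brackets, ^ is just a literal caret.
literal <- gsub("[a^]", "", "a^b")
literal
# [1] "b"
```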