Remove emoticons in R using the tm package

Solution 1

You can use gsub to get rid of all non-ASCII characters.

Texts = c("Let the stormy clouds chase, everyone from the place ☁  ♪ ♬",
    "See you soon brother ☮ ",
    "A boring old-fashioned message" ) 

gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place    "
[2] "See you soon brother  "                                  
[3] "A boring old-fashioned message"

Details: In a regex, [ ] defines a character class. When the class starts with ^, it matches every character except those listed. Here the class excludes characters 1-127 (\x01-\x7F), that is, standard ASCII, so the pattern matches every non-ASCII character, and gsub replaces each match with the empty string.
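As a minimal sketch of the same idea, the snippet below applies the negated character class and then tidies the whitespace left behind by the removed symbols. The variable names and the whitespace cleanup step are assumptions added for illustration, not part of the original answer.

```r
# Sample input; the \u escape keeps this source file pure ASCII.
texts <- c("See you soon brother \u262E ",
           "A boring old-fashioned message")

# Drop every character outside the ASCII range 1-127.
ascii_only <- gsub("[^\x01-\x7F]", "", texts)

# Collapse runs of whitespace and trim both ends (illustrative extra step).
cleaned <- trimws(gsub("\\s+", " ", ascii_only))

print(cleaned)
```

The cleanup step is optional; it only matters if the doubled or trailing spaces would interfere with later tokenization.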

Solution 2

You can try the iconv function, which converts the text to ASCII and, with sub="", drops any bytes that cannot be converted:

iconv(July4th_clean, "latin1", "ASCII", sub="")

Duplicate issue, see post
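A small self-contained sketch of the iconv approach on a plain character vector follows. The sample text is hypothetical, and the source encoding is assumed to be UTF-8 here rather than the latin1 used in the answer; adjust `from` to match the actual data.

```r
# Hypothetical tweet text; the emoji is written as a \U escape.
tweets <- c("Love of country \U0001F351 July4th",
            "A boring old-fashioned message")

# Convert UTF-8 to ASCII; sub = "" drops bytes that cannot be converted.
ascii_tweets <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

# TRUE for every element that still contains a non-ASCII character.
has_non_ascii <- grepl("[^\x01-\x7F]", ascii_tweets)

print(ascii_tweets)
print(has_non_ascii)
```

With tm, the same call could presumably be wrapped in content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = "")) and passed to tm_map, or applied to the raw text before the corpus is built.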

Author: Luis

Updated on June 04, 2022

Comments

  • Luis
    Luis almost 2 years

    I'm using the tm package to clean up a Twitter Corpus. However, the package is unable to clean up emoticons.

    Here is code that reproduces the error:

    July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
    Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'
    

    Can someone point me in the right direction to remove the emoticons using the tm package?

    Thank you,

    Luis

    • G5W
      G5W almost 7 years
      It is not clear from your example what you wish to eliminate. Do you want to eliminate substrings that contain multiple consecutive punctuation marks like :-) and (-_-) or are you trying to eliminate odd Unicode characters like ☺ and ❀ ?
    • Luis
      Luis almost 7 years
      You are right. I assumed that it was a 🤓 or something similar.
    • Luis
      Luis almost 7 years
      I am an R newbie. Do you know how I could check that particular tweet? I imagine you use [], but I'm not sure about the function or any other part of the code.
    • Luis
      Luis almost 7 years
      Hi G5W, the emoticon is a peach and a USA flag. 🍑
    • Luis
      Luis almost 7 years
      I am trying to eliminate odd Unicode characters.
  • Luis
    Luis almost 7 years
    Hi Zeyad, I did see that one but hesitated to use it because the code was different from the tm code I was using. I was using the <- tm_map function.
  • zdeeb
    zdeeb almost 7 years
    You should run this before using the tm package.