Emoticons in Twitter Sentiment Analysis in R

Solution 1

This should get rid of the emoticons, using iconv as suggested by ndoogan.

Some reproducible data:

require(twitteR) 
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem") 

# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))

# inspect, yes there are some odd characters in row five
head(df)

                                                                                                                                                text
1                                                                      ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 “@teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons &amp; \nall the other stuff i cant see on android!" \n#Emoticons
3                      E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                                #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5  I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6                                                                                         What you text What I see #Emoticons http://t.co/BKowBSLJ0s

Here's the key line that will remove the emoticons:

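# Clean text to remove odd characters: every byte without an ASCII
# equivalent is replaced by sub = "" (i.e. dropped).
# iconv() is vectorised, so it can be applied to the whole column at once.
df$text <- iconv(df$text, "latin1", "ASCII", sub = "")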

Now inspect again to check that the odd characters are gone (see row 5):

head(df)    
                                                                                                                               text
1                                                                     ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 @teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons &amp; \nall the other stuff i cant see on android!" \n#Emoticons
3                     E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                               #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5                                                                                 I use emoticons too much. #addicted #admittingit #emoticons  haha
6                                                                                        What you text What I see #Emoticons http://t.co/BKowBSLJ0s
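
One thing to keep in mind with this approach: converting to ASCII drops every non-ASCII byte, not only the emoticons, so accented letters disappear as well. A minimal sketch on a made-up string (the sample text and the variable names x and cleaned are assumptions for illustration, not part of the data above):

```r
# Hypothetical sample: an accented letter plus an emoji, written with
# Unicode escapes so the example does not depend on the file encoding.
x <- "caf\u00e9 \U0001F622 haha"

# Same conversion as above: bytes without an ASCII equivalent are dropped.
cleaned <- iconv(x, "latin1", "ASCII", sub = "")

print(cleaned)
```

Both the accented letter and the emoji are removed, which matters if the tweets are not in English.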

Solution 2

I recommend the function ji_replace_all(string, replacement) from the emo package, which is installed from GitHub:

# install_github() is provided by the devtools (and remotes) package
devtools::install_github("hadley/emo")

I needed to remove the emojis from tweets written in Spanish. I tried several options, but some of them garbled the text. This one works perfectly:

library(emo)

text <- "#VIDEO 😢💔🙏🏻,Alguien sabe si en Afganistán hay cigarro?"

# replace every emoji with the empty string
ji_replace_all(text, "")

Result:

"#VIDEO ,Alguien sabe si en Afganistán hay cigarro?"

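If installing a package from GitHub is not an option, a rough base R approximation is to drop every code point above U+FFFF, which is where most emoji live. This is only a sketch under that assumption and not what emo does internally: symbols inside the Basic Multilingual Plane (for example U+263A) are left in place, and the helper name strip_astral is made up for illustration.

```r
# Hypothetical helper (not part of emo): drop all characters outside the
# Basic Multilingual Plane, i.e. code points above U+FFFF.
strip_astral <- function(x) {
  vapply(enc2utf8(x), function(s) {
    cp <- utf8ToInt(s)               # string -> integer code points
    intToUtf8(cp[cp <= 0xFFFF])      # keep BMP code points, rebuild string
  }, character(1), USE.NAMES = FALSE)
}

# Same tweet as above, written with escapes to avoid encoding issues.
tweet <- "#VIDEO \U0001F622\U0001F494\U0001F64F\U0001F3FB,Alguien sabe si en Afganist\u00e1n hay cigarro?"
out <- strip_astral(tweet)
print(out)
```

Unlike a conversion to ASCII, this keeps accented letters such as the á in Afganistán.
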
Solution 3

You can use a regular expression to find the words that contain at least one letter or digit and drop all other words. Sample code:

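rmNonAlphabet <- function(str) {
  # split the string into space-separated tokens
  words <- unlist(strsplit(str, " "))
  # indices of the tokens that contain at least one ASCII letter or digit
  in.alphabet <- grep("[a-z0-9]", words, ignore.case = TRUE)
  # glue the surviving tokens back together
  paste(words[in.alphabet], collapse = " ")
}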
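
Because strsplit() followed by unlist() merges all input strings into one token list, the function is meant for one string at a time. A minimal sketch of applying it to several tweets, assuming a made-up character vector tweets; the function is restated with the plain class [a-z0-9] so that the block runs on its own:

```r
# Token filter restated so the example is self-contained.
rmNonAlphabet <- function(str) {
  words <- unlist(strsplit(str, " "))
  in.alphabet <- grep("[a-z0-9]", words, ignore.case = TRUE)
  paste(words[in.alphabet], collapse = " ")
}

# Hypothetical input: the second tweet has a stand-alone emoji token,
# the third has an emoji glued to a word.
tweets <- c("plain text", "so sad \U0001F622 today", "haha\U0001F622 ok")

# vapply() calls the function once per tweet and returns a character vector.
cleaned <- vapply(tweets, rmNonAlphabet, character(1), USE.NAMES = FALSE)
print(cleaned)
```

Note that only whole tokens are dropped: the stand-alone emoji in the second tweet disappears, but the one attached to "haha" in the third tweet stays, since that token also contains letters.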

Author: Rhodo

Updated on June 22, 2022

Comments

  • Rhodo almost 2 years

    How do I handle/get rid of emoticons so that I can sort tweets for sentiment analysis?

    Getting: Error in sort.list(y) : invalid input

    Thanks

    This is how the emoticons look when they come from Twitter into R:

    \xed��\xed�\u0083\xed��\xed��
    \xed��\xed�\u008d\xed��\xed�\u0089 
    
  • Rhodo about 11 years
    Ben- Thank you so much- that cleaned it up- Finally!
  • Ben about 11 years
    You're welcome! In case you're not familiar with the site: you should upvote if the answer was useful to you (that's the preferred way to say thanks here) and click on the tick (under the up/down arrows) to indicate that it was the best answer to your question. That will help other people who have the same question as you. (This process is more relevant when there are multiple answers; in this case it's more for the fun of it.)