Emoticons in Twitter Sentiment Analysis in R
Solution 1
This should get rid of the emoticons, using iconv as suggested by ndoogan.
Some reproducible data:
require(twitteR)
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem")
# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))
# inspect, yes there are some odd characters in row five
head(df)
text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania ;-)
2 “@teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s
Here's the key line that will remove the emoticons (along with every other non-ASCII character, such as the curly quote in row 2):
# Clean text to remove odd characters
df$text <- sapply(df$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
Now inspect again to see whether the odd characters are gone (see row 5):
head(df)
text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania ;-)
2 @teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s
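The same call can be tried without Twitter credentials. Below is a minimal sketch, assuming UTF-8 input; the sample strings and the names tweets and cleaned are made up for illustration. It also shows a side effect worth knowing about: because iconv with sub = "" drops every byte that has no ASCII equivalent, accented letters disappear together with the emoji.

```r
# Hypothetical sample tweets (not fetched from Twitter);
# \U and \u escapes keep this snippet itself ASCII-only
tweets <- c("I use emoticons too much \U0001F62C haha",
            "Alguien sabe si en Afganist\u00e1n hay cigarro?")

# iconv is vectorised, so the whole character vector can be converted at once;
# sub = "" deletes every byte that cannot be represented in ASCII
cleaned <- iconv(tweets, from = "latin1", to = "ASCII", sub = "")

print(cleaned)
```

If accented text has to survive (Spanish tweets, for instance), a more targeted approach such as the one in Solution 2 is probably a better fit.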
Solution 2
I recommend the function:
ji_replace_all(string, replacement)
from the emo package, which can be installed from GitHub (here via devtools):
devtools::install_github("hadley/emo")
I needed to remove emojis from tweets written in Spanish. I tried several options, but some of them mangled the text. This one worked perfectly for me:
library(emo)
text <- "#VIDEO 😢💔🙏🏻,Alguien sabe si en Afganistán hay cigarro?"
ji_replace_all(text, "")
Result:
"#VIDEO ,Alguien sabe si en Afganistán hay cigarro?"
Solution 3
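You can use a regular expression to find the space-separated tokens that contain no letters or digits and drop them. Sample code:
rmNonAlphabet <- function(str) {
  # split the text into space-separated tokens
  words <- unlist(strsplit(str, " "))
  # indices of tokens with at least one ASCII letter or digit
  # (a "|" inside [...] is a literal pipe, so it is left out)
  in.alphabet <- grep(pattern = "[a-z0-9]", x = words, ignore.case = TRUE)
  # keep only those tokens and glue them back together
  nice.str <- paste(words[in.alphabet], collapse = " ")
  nice.str
}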
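To see what this function keeps and drops, here is a small usage sketch. The function is repeated (with the character class written as [a-z0-9]) so that the block runs on its own; the input strings and the names r1 to r3 are made up. Filtering happens per space-separated token: a token is dropped only when it contains no ASCII letter or digit at all, so an emoji glued to a word stays in the text.

```r
rmNonAlphabet <- function(str) {
  words <- unlist(strsplit(str, " "))
  in.alphabet <- grep("[a-z0-9]", words, ignore.case = TRUE)
  paste(words[in.alphabet], collapse = " ")
}

# ASCII emoticons standing alone are dropped
r1 <- rmNonAlphabet("haha :-) ;-) ok 123")
print(r1)  # "haha ok 123"

# a stand-alone emoji token is dropped as well
r2 <- rmNonAlphabet("great \U0001F600 day")
print(r2)  # "great day"

# an emoji attached to a word is kept, because that token also contains letters
r3 <- rmNonAlphabet("haha\U0001F600 ok")
print(r3)
```

If stray symbols inside tokens matter too, a character-level clean-up such as the iconv call in Solution 1 may be the better choice.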
Rhodo
Updated on June 22, 2022
Comments
-
Rhodo almost 2 years
How do I handle/get rid of emoticons so that I can sort tweets for sentiment analysis?
Getting: Error in sort.list(y) : invalid input
Thanks
This is how the emoticons look when they come from Twitter into R:
\xed��\xed�\u0083\xed��\xed�� \xed��\xed�\u008d\xed��\xed�\u0089
-
Rhodo about 11 years
Ben, thank you so much. That cleaned it up, finally!
-
Ben about 11 years
You're welcome! In case you're not familiar with it: you should upvote the answer if it was useful to you (that's the preferred way to say thanks here) and click on the tick (under the up/down arrows) to indicate that it was the best answer to your question. That helps other people who have the same question. (This process matters more when there are multiple answers; in this case it's more for the fun of it.)