Remove emoticons in R using the tm package
Solution 1
You can use gsub to get rid of all non-ASCII characters.
Texts = c("Let the stormy clouds chase, everyone from the place ☁ ♪ ♬",
"See you soon brother ☮ ",
"A boring old-fashioned message" )
gsub("[^\x01-\x7F]", "", Texts)
[1] "Let the stormy clouds chase, everyone from the place "
[2] "See you soon brother "
[3] "A boring old-fashioned message"
Details:
You can specify character classes in regexes with [ ]. When the class description starts with ^, it means everything except these characters. Here, I have specified everything except characters 1-127, i.e. everything except standard ASCII, and I have specified that they should be replaced with the empty string.
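As a minimal sketch of how the negated character class above might be packaged for reuse, the snippet below wraps the same gsub call in a helper. The name strip_non_ascii is hypothetical and not part of the original answer. The commented-out last line assumes the asker's corpus July4th_clean and relies on tm's content_transformer(), which is the usual way to pass a custom function to tm_map; it is untested against that data.

```r
# Hypothetical helper (not from the original answer): drop every character
# outside the ASCII range 1-127 using the negated character class.
strip_non_ascii <- function(x) {
  gsub("[^\x01-\x7F]", "", x)
}

# Small sample input; \u262E is the peace symbol from the answer's example.
texts <- c("See you soon brother \u262E", "A boring old-fashioned message")
cleaned <- strip_non_ascii(texts)
print(cleaned)

# Assumption: with tm loaded and a corpus named July4th_clean, the helper
# could be applied like this (left commented out, since tm is not loaded here):
# July4th_clean <- tm_map(July4th_clean, content_transformer(strip_non_ascii))
```

Keeping the regex in one named function makes it easy to run the same cleaning step on a plain character vector or inside a tm pipeline.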
Solution 2
You can try this function:
iconv(July4th_clean, "latin1", "ASCII", sub="")
Duplicate issue, see post
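Since iconv works on character vectors, one way to read this solution is to clean the raw text before it ever becomes a corpus. The sketch below assumes a hypothetical character vector named tweets standing in for the raw tweet text; it is an illustration, not the asker's actual data.

```r
# Hypothetical raw tweet text; in practice this would be the character
# vector the corpus is built from.
tweets <- c("Love of country \U0001F351 July4th", "caf\u00e9 at noon")

# Treat the input as latin1 bytes and drop every byte that has no ASCII
# equivalent (sub = "" replaces non-convertible bytes with nothing).
tweets_ascii <- iconv(tweets, from = "latin1", to = "ASCII", sub = "")
print(tweets_ascii)
```

The cleaned vector could then be handed to tm, for example via VCorpus(VectorSource(tweets_ascii)), so that later tm_map steps such as tolower only ever see ASCII text.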
Author by
Luis
Updated on June 04, 2022
Comments
-
Luis almost 2 years
I'm using the tm package to clean up a Twitter Corpus. However, the package is unable to clean up emoticons.
Here's code that reproduces the problem:
July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) :
  invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'
Can someone point me in the right direction to remove the emoticons using the tm package?
Thank you,
Luis
-
G5W almost 7 years
It is not clear from your example what you wish to eliminate. Do you want to eliminate substrings that contain multiple consecutive punctuation marks like :-) and (-_-), or are you trying to eliminate odd Unicode characters like ☺ and ❀?
-
Luis almost 7 years
You are right. I assumed that it was a 🤓 or something similar.
-
Luis almost 7 years
I am an R newbie. Do you know how I could check that particular tweet? I imagine you use [ ], but I'm not sure about the function or any other part of the code.
-
Luis almost 7 years
Hi G5W, the emoticon is a peach and a USA flag. 🍑
-
Luis almost 7 years
I am trying to eliminate odd Unicode characters.
-
Luis almost 7 years
Hi Zeyad, I did see that one but hesitated to use it because the code was different from the tm code I was using. I was using the <- tm_map function.
-
zdeeb almost 7 years
You should run this before using the tm package.