R tm package invalid input in 'utf8towcs'
Solution 1
None of the above answers worked for me. The only way I could work around this problem was to remove all non-graphical characters (see http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).
The code is this simple, using str_replace_all from the stringr package:
library(stringr)
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")
Solution 2
This is from the tm FAQ (http://tm.r-forge.r-project.org/faq.html):
tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.
I hope this helps, for me it does.
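To see what this kind of call does to a single string, here is a minimal sketch outside of tm. The sample string is made up for illustration, and the explicit from/to arguments are my own assumption so the result does not depend on the session locale; the FAQ version relies on enc2utf8 instead.

```r
# Build a string with one invalid UTF-8 byte (0xd6 is Latin-1 "Ö").
# Hypothetical sample, loosely modelled on the tweet in the question.
x <- rawToChar(c(charToRaw("Alt-"), as.raw(0xd6), charToRaw("lteppich")))

validUTF8(x)      # FALSE: this is the kind of input that breaks tolower()

# Replace each non-convertible byte with its hex code in angle brackets.
fixed <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

validUTF8(fixed)  # TRUE
lowered <- tolower(fixed)  # no 'utf8towcs' error any more
```

The text survives, but the bad byte stays visible as a hex code, so you may still want to clean those markers up afterwards.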
Solution 3
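I think it is clear by now that the problem is caused by emojis, which tolower is not able to handle.
# to remove emojis (and any other non-ASCII characters);
# without sub = '', iconv returns NA for every element it cannot convert
dataSet <- iconv(dataSet, 'UTF-8', 'ASCII', sub = '')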
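One thing to watch when converting to ASCII: whether or not you pass a sub argument changes the result quite a bit. A small sketch with made-up sample strings (not from the question):

```r
# Hypothetical sample: one clean string, one with an accented letter and an emoji.
x <- c("plain", "caf\u00e9 \U0001F600")

# Without sub, any element that cannot be converted becomes NA as a whole.
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# With sub = "", only the non-convertible bytes are dropped.
stripped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
```

Note that this drops accented letters along with the emojis, so it is a blunt tool if your text is not English.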
Solution 4
I have just run afoul of this problem. By chance, are you using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).
What I was seeing is that using the solution from the FAQ
tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
was giving me this warning:
Warning message:
it is not known that wchar_t is Unicode on this platform
This I traced to the enc2utf8 function. Bad news is that this is a problem with my underlying OS and not R.
So here is what I did as a work around:
tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
This forces iconv to use the utf8 encoding on the macintosh and works fine without the need to recompile.
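If the same script has to run on machines other than a Mac, one option is to check whether the encoding name is available before using it. This is only a sketch: clean_text is a hypothetical helper name, and falling back to plain "UTF-8" is my assumption, not something from the mailing list thread.

```r
# Use "UTF-8-MAC" only if this build's iconv lists it; otherwise plain "UTF-8".
target <- if ("UTF-8-MAC" %in% iconvlist()) "UTF-8-MAC" else "UTF-8"

# clean_text is a hypothetical helper name, not part of tm.
clean_text <- function(x) iconv(x, to = target, sub = "byte")

clean_text("plain ASCII text")
```

You would then pass clean_text to tm_map in place of the anonymous function above.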
Solution 5
I have often run into this issue and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (like converting it’s to it’s).
I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
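A quick way to try this on a single string before touching the corpus. The sample text is made up, and the requireNamespace guard is only there so the snippet does nothing if stringi is missing:

```r
# Needs the stringi package; skipped quietly if it is not installed.
if (requireNamespace("stringi", quietly = TRUE)) {
  # Hypothetical sample with a curly apostrophe (U+2019) and an accented capital.
  x <- "IT\u2019S AN \u00d6LTEPPICH"
  lowered <- stringi::stri_trans_tolower(x)
  print(lowered)
}
```

In a corpus you would pass the function to tm_map instead of tolower. As far as I know, newer tm versions expect plain character functions to be wrapped, e.g. tm_map(dataSet, content_transformer(stringi::stri_trans_tolower)), but check that against the tm version you have.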
maiaini
Updated on June 07, 2020
Comments
-
maiaini almost 4 years
I'm trying to use the tm package in R to perform some text analysis. I tried the following:
require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) :
  invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
The problem is some characters are not valid. I'd like to exclude the invalid characters from analysis either from within R or before importing the files for processing.
I tried using iconv to convert all files to utf-8 and exclude anything that can't be converted to that as follows:
find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;
as pointed out here: "Batch convert latin-1 files to utf-8 using iconv".
But I still get the same error.
I'd appreciate any help.
-
maiaini about 12 years
Thanks for your reply Ben! For some reason, that same line of code that failed for me works now. I don't know if this is another lucky coincidence :) I didn't change anything, just reran it, and this time it works without any hiccups.
-
Hack-R almost 7 years
This should be marked as the solution. It works and it's been popular for years, but the OP didn't stick around to mark it as being correct.
-
Agile Bean about 6 years
As an alternative using base R, you can try:
usableText <- iconv(tweets$text, "ASCII", "UTF-8", sub="")