R tm package invalid input in 'utf8towcs'

r utf-8 iconv text-mining

Solution 1

None of the other answers worked for me. The only way I could work around this problem was to replace all non-graphical characters, using the [:graph:] character class (see http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).

The code is this simple (str_replace_all comes from the stringr package):

library(stringr)
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")

Solution 2

This is from the tm FAQ (http://tm.r-forge.r-project.org/faq.html):

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.

I hope this helps; for me it does.
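
To see what sub = "byte" does outside of tm, here is a minimal base R sketch. The sample string is made up for illustration: it contains a single Latin-1 byte (0xD6, "Ö") that is not valid UTF-8, similar to the character in the question's error message. Passing UTF-8 as both the source and target encoding is an assumption of this sketch, not part of the FAQ snippet.

```r
# Hypothetical sample: "\xd6" is "Ö" in Latin-1 but an invalid byte in UTF-8.
x <- "Alt-\xd6lteppich im Golf von Mexiko"

# validUTF8() (base R >= 3.3.0) flags elements that are not valid UTF-8.
validUTF8(x)

# Replace each non-convertible byte with its hex code in angle brackets.
hexed <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

# Alternatively, drop the non-convertible bytes entirely.
dropped <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

# If the real source encoding is known (here Latin-1), convert it properly.
fixed <- iconv(x, from = "latin1", to = "UTF-8")

# The cleaned string is plain ASCII, so tolower() has nothing to choke on.
tolower(hexed)
```

Which variant fits best depends on whether the original characters matter for the analysis: sub = "byte" keeps a trace of what was there, sub = "" silently discards it, and converting from the true source encoding preserves the text.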

Solution 3

I think it is clear by now that the problem is caused by emojis, which tolower is not able to handle.

# to remove emojis (sub = '' drops non-convertible bytes; without it, affected strings become NA)
dataSet <- iconv(dataSet, 'UTF-8', 'ASCII', sub = '')
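
A small base R sketch of how iconv treats an emoji when converting to ASCII; the sample string is invented for illustration. Without a sub argument, an element that cannot be converted comes back as NA rather than having the emoji stripped, whereas sub = "" drops only the offending bytes.

```r
# Hypothetical sample text containing one emoji (U+1F600).
x <- "Nice \U0001F600 day"

# Without 'sub', the whole element cannot be converted and becomes NA.
asNA <- iconv(x, from = "UTF-8", to = "ASCII")

# With sub = "", the non-convertible bytes are dropped and the rest is kept.
stripped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# The result is plain ASCII, so tolower() works on it.
tolower(stripped)
```

Note that this also removes every other non-ASCII character (accented letters, umlauts), which may or may not be acceptable for the analysis.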

Solution 4

I have just run afoul of this problem. By chance, are you using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

What I was seeing is that using the solution from the FAQ

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform

I traced this to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not with R.

So here is what I did as a workaround:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

This forces iconv to use the UTF-8 encoding on the Macintosh and works fine without the need to recompile.

Solution 5

I have often run into this issue and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (like converting it’s to itâ€™s).

I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
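
A minimal sketch of that swap, assuming the stringi package is installed. The sample string is made up; with tm, the same function would be passed to tm_map in place of tolower, which is not shown here.

```r
library(stringi)

# Hypothetical sample with a non-ASCII letter written as a Unicode escape.
x <- "RT \u00d6LTEPPICH im Golf von MEXIKO"

# stri_trans_tolower() lower-cases using ICU's Unicode case mapping.
lowered <- stri_trans_tolower(x)
lowered
```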

Author by

maiaini

Updated on June 07, 2020

Comments

maiaini almost 4 years
I'm trying to use the tm package in R to perform some text analysis. I tried the following:
```
require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
```
The problem is some characters are not valid. I'd like to exclude the invalid characters from analysis either from within R or before importing the files for processing.

I tried using iconv to convert all files to utf-8 and exclude anything that can't be converted to that as follows:
```
find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \; 
```
as pointed out in "Batch convert latin-1 files to utf-8 using iconv".

But I still get the same error.

I'd appreciate any help.
maiaini about 12 years

Thanks for your reply, Ben! For some reason, the same line of code that failed for me works now. I don't know if this is another lucky coincidence :) I didn't change anything, just reran it, and this time it works without any hiccups.
Hack-R almost 7 years

This should be marked as the solution. It works and it's been popular for years, but the OP didn't stick around to mark it as being correct.
Agile Bean about 6 years

As an alternative using base R, you can try: usableText <- iconv(tweets$text, "ASCII", "UTF-8", sub="")