R tm package invalid input in 'utf8towcs'

Solution 1

None of the other answers worked for me. The only way I could work around this problem was to remove all non-graphical characters, that is, everything outside the [:graph:] class described at http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html.

The code is this simple (str_replace_all comes from the stringr package):

library(stringr)
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")

Solution 2

This is from the tm FAQ (http://tm.r-forge.r-project.org/faq.html):

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

It will replace non-convertible bytes in yourCorpus with strings showing their hex codes.

I hope this helps; for me it does.
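
To see what the byte substitution does outside of tm, here is a minimal sketch on a plain character vector. The sample strings are made up, and the explicit from = "UTF-8", to = "UTF-8" is an assumption that the text is meant to be UTF-8 (the FAQ line relies on enc2utf8 and the iconv defaults instead).

```r
# Hypothetical sample: the second string contains a lone 0xE9 byte,
# which is not valid UTF-8
x <- c("valid text", "bad byte: \xe9 here")

# Locate the offending elements before changing anything
bad <- which(!validUTF8(x))

# Assumption: the input is supposed to be UTF-8, so convert UTF-8 -> UTF-8
# and let iconv replace each invalid byte with its hex code
fixed <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

print(bad)
print(fixed)
```

Running validUTF8 first can also help find which documents trigger the 'utf8towcs' error in the first place.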

Solution 3

I think it is clear by now that the problem is caused by emojis, which tolower is not able to handle.

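# Remove emojis: sub = '' drops the bytes that cannot be converted to ASCII
# (without sub, any element containing such bytes becomes NA)
dataSet <- iconv(dataSet, 'UTF-8', 'ASCII', sub = '')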
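
One detail worth checking with this approach is the sub argument of iconv. The sketch below uses made-up sample strings to compare the default (sub = NA) with sub = ""; it assumes the input really is UTF-8.

```r
# Hypothetical sample: one plain string and one containing an emoji
x <- c("plain text", "smile \U0001F600 end")

# Default sub = NA: an element that cannot be fully converted becomes NA
strict <- iconv(x, from = "UTF-8", to = "ASCII")

# sub = "": only the non-convertible bytes are dropped, the rest is kept
stripped <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

print(strict)
print(stripped)
```

If whole tweets turning into NA is not what you want, passing sub explicitly is probably the safer choice.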

Solution 4

I have just run afoul of this problem. By chance, are you using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

What I was seeing is that using the solution from the FAQ

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform 

This I traced to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not with R.

So here is what I did as a workaround:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

This forces iconv to use the UTF-8 encoding on the Mac and works fine without the need to recompile.

Solution 5

I have often run into this issue and this Stack Overflow post is always what comes up first. I have used the top solution before, but it can strip out characters and replace them with garbage (like converting it’s to it’s).

I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.
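
A minimal sketch of that replacement, using made-up sample strings. The commented tm_map line assumes a tm version that provides content_transformer().

```r
library(stringi)

# Hypothetical sample text containing non-ASCII characters
tweets <- c("RT @user Riesiger \u00d6LTEPPICH im Golf", "It\u2019s WORKING")

# Unicode-aware lowercasing; non-ASCII characters are kept, not stripped
lowered <- stri_trans_tolower(tweets)
print(lowered)

# With a tm corpus (assumption: tm provides content_transformer()):
# dataSet <- tm_map(dataSet, content_transformer(stri_trans_tolower))
```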

Author by

maiaini

Updated on June 07, 2020

Comments

  • maiaini
    maiaini almost 4 years

    I'm trying to use the tm package in R to perform some text analysis. I tried the following:

    require(tm)
    dataSet <- Corpus(DirSource('tmp/'))
    dataSet <- tm_map(dataSet, tolower)
    Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
    

    The problem is some characters are not valid. I'd like to exclude the invalid characters from analysis either from within R or before importing the files for processing.

    I tried using iconv to convert all files to utf-8 and exclude anything that can't be converted to that as follows:

    find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \; 
    

    as pointed out in "Batch convert latin-1 files to utf-8 using iconv".

    But I still get the same error.

    I'd appreciate any help.

  • maiaini
    maiaini about 12 years
    Thanks for your reply, Ben! For some reason, the same line of code that failed for me works now. I don't know if this is another lucky coincidence :) I didn't change anything, just reran it, and this time it works without any hiccups.
  • Hack-R
    Hack-R almost 7 years
    This should be marked as the solution. It works and it's been popular for years, but the OP didn't stick around to mark it as being correct.
  • Agile Bean
    Agile Bean about 6 years
    As an alternative using base R, you can try: usableText <- iconv(tweets$text, "ASCII", "UTF-8", sub="")