Displaying UTF-8 encoded Chinese characters in R


Solution 1

This is not a bug. It is a misunderstanding of the type conversion (from the character type to the factor type) that happens when a data.frame is constructed.

Start with data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE), which keeps your Chinese strings as the character type instead of converting them to factors, so printing them should show what you expect.

@nograpes: similarly, x <- c('中華民族'); x; y <- data.frame(x, stringsAsFactors=FALSE) and everything should be OK.
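
To make the character-versus-factor difference concrete, here is a minimal sketch. It assumes a temporary file as a stand-in for mydata.csv and a hypothetical column called name; the sample string is the one from the comment above, written with \u escapes so the script itself stays ASCII. How the final print() renders depends on your locale, so no output is shown.

```r
# Sample string from the answer (中華民族), written with \u escapes.
x <- "\u4e2d\u83ef\u6c11\u65cf"

# Write a small UTF-8 CSV to a temporary file (stand-in for mydata.csv).
# useBytes = TRUE writes the UTF-8 bytes as-is, without re-encoding.
path <- tempfile(fileext = ".csv")
writeLines(c("name", x), path, useBytes = TRUE)

# Same file, read twice: once converting strings to factors, once not.
as_factor <- read.csv(path, encoding = "UTF-8", stringsAsFactors = TRUE)
as_char   <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

class(as_factor$name)  # factor column
class(as_char$name)    # character column

# Printing the character version is what the answer suggests trying.
print(as_char)
```

Note that stringsAsFactors defaulted to TRUE in R versions before 4.0.0, which is why it has to be switched off explicitly there.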

Solution 2

In my case, UTF-8 encoding does not work in R on my system, but the GB* encodings do. UTF-8 works on Ubuntu. First, you need to figure out the default encoding of your OS and encode the file accordingly. Excel cannot encode the file as UTF-8 properly, even though it claims to save it as UTF-8.

(1) Download 'Open Sheet' software.

(2) Open the file with it. Scroll through the encoding options until you see the Chinese characters displayed correctly in the preview window.

(3) Save it as UTF-8 (if you want UTF-8). (UTF-8 is not the solution to every problem; you HAVE TO know the default encoding of your system first.)
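
If you would rather check the default encoding from inside R before re-saving anything, something like the following may help. It is only a sketch: GB18030 is picked here as one example of a GB* encoding, and it assumes the iconv implementation behind your R build supports that name (see iconvlist()).

```r
# Inspect the locale R is running under and what it implies.
Sys.getlocale("LC_CTYPE")
l10n_info()  # reports whether the session is multibyte / UTF-8 / Latin-1

# Example string (中華民族), written with \u escapes so it is marked as UTF-8.
s <- "\u4e2d\u83ef\u6c11\u65cf"

# Convert to GB18030 and back again to confirm the round trip is lossless.
gb   <- iconv(s,  from = "UTF-8",   to = "GB18030")
back <- iconv(gb, from = "GB18030", to = "UTF-8")

identical(back, s)
```

For a file rather than a string, read.csv() has a fileEncoding argument that declares the encoding the file is stored in, which is different from the encoding argument used in the question.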

Author by

Admin

Updated on July 13, 2020

Comments

  • Admin
    Admin almost 4 years

    I am trying to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R sometimes displays the information as Chinese characters and sometimes as Unicode characters.

    For instance:

    data <- read.csv("mydata.csv", encoding="UTF-8")
    data

    will produce Unicode characters, while:

    data <- read.csv("mydata.csv", encoding="UTF-8")
    data[,1]

    will actually display Chinese characters.

    If I turn it into a matrix, it will also display Chinese characters, but if I try to look at the data (with View(data) or fix(data)), it is in Unicode again.

    I've asked for advice from people who use a Mac (I'm using a PC with Windows 7); some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading it into R that way, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I tried to adjust the locale (e.g. to Chinese), but either R didn't let me change it or the result was gibberish instead of Unicode characters.

    My current locale is:

    "LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"

    Any help to get R to consistently display Chinese characters would be greatly appreciated...