Write UTF-8 files from R
Solution 1
This is less an answer than a demonstration that something odd is going on behind the scenes:
"hīersumian" does not seem to make it into the data frame intact. In every case below, the "ī" is converted to "i".
# options(encoding = ) sets the default encoding for connections;
# the loop below shows that it does not change how the literal is stored.
options("encoding" = "native.enc")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors = FALSE)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-8")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors = FALSE)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-16")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors = FALSE)
t1
# a
# 1 hiersumian
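One way to check whether the character is lost when the literal is parsed, rather than when the file is written, is to build the string from a Unicode escape and look at the bytes R actually stores. This is a minimal sketch using base R only; the variable name x is illustrative and not part of the original answer.

```r
# Build the string from a Unicode escape so the result does not depend on
# how the script file or the console input is encoded.
x <- "h\u012bersumian"

Encoding(x)               # encoding mark R attached to the string
nchar(x, type = "chars")  # 10 characters
nchar(x, type = "bytes")  # 11 bytes, because U+012B takes two bytes in UTF-8
charToRaw(x)              # the second and third bytes should be c4 ab
```

If charToRaw() shows c4 ab for a string built this way but 69 for the same word typed as a literal, the substitution happened while the literal was read, before data.frame() was ever called.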
The following sequence successfully writes "ǣmettigan" to the text file:
t2 <- data.frame(a = c("ǣmettigan"), stringsAsFactors=F)
getOption("encoding")
# [1] "native.enc"
# Note: ?Encoding documents only "latin1", "UTF-8", "bytes" and "unknown" as values.
Encoding(t2[,"a"]) <- "UTF-16"
write.table(t2,"test.txt",row.names=F,col.names=F,quote=F)
This does not work with the "encoding" option set to "UTF-8" or "UTF-16", and additionally specifying "fileEncoding" leads to either a corrupted file or no output at all.
Somewhat disappointing, as until now I had managed to fix every Unicode issue somehow.
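A possible workaround, sketched here rather than taken from the answer above, is to bypass re-encoding altogether: convert the strings to UTF-8 in memory and write their bytes unchanged through a binary connection. The output path and the object name bytes are assumptions made for the example, and only a single-column data frame is handled.

```r
# Example data built from Unicode escapes (independent of source encoding).
t2 <- data.frame(a = c("h\u012bersumian", "\u01e3mettigan"),
                 stringsAsFactors = FALSE)

out <- tempfile(fileext = ".txt")   # hypothetical output path

# Binary connection: no encoding conversion and no newline translation.
con <- file(out, open = "wb")
# useBytes = TRUE writes the UTF-8 bytes as they are instead of
# translating them to the native encoding first.
writeLines(enc2utf8(t2$a), con, useBytes = TRUE)
close(con)

# Read the raw bytes back to see what actually landed on disk.
bytes <- readBin(out, what = "raw", n = file.size(out))
bytes
```

For a data frame with several columns, the rows would first have to be pasted into one string per line; that step is not shown here.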
Solution 2
I may be missing something OS-specific, but data.table appears to have no problem with this (or, perhaps more likely, R internals have been updated since this question was originally posed):
library(data.table)
t1 = data.table(a = c("hīersumian", "ǣmettigan"))
tmp = tempfile()
fwrite(t1, tmp)
system(paste('cat', tmp))
# a
# hīersumian
# ǣmettigan
fread(tmp)
# a
# 1: hīersumian
# 2: ǣmettigan
Comments
-
Sverre over 3 years
Whereas R seems to handle Unicode characters well internally, I'm not able to output a data frame in R with such UTF-8 Unicode characters. Is there any way to force this?
data.frame(c("hīersumian","ǣmettigan"))->test
write.table(test,"test.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
The output text file reads:
hiersumian
<U+01E3>mettigan
I am using R version 3.0.2 in a Windows environment (Windows 7).
EDIT
It's been suggested in the answers that R is writing the file correctly in UTF-8, and that the problem lies with the software I'm using to view the file. Here's some code where I'm doing everything in R. I'm reading in a text file encoded in UTF-8, and R reads it correctly. Then R writes the file out in UTF-8 and reads it back in again, and now the correct Unicode characters are gone.
read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
myinputfile[1,1]
write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
myoutputfile[1,1]
Console output:
> read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
> myinputfile[1,1]
[1] hīersumian
Levels: hīersumian ǣmettigan
> write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
> read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
> myoutputfile[1,1]
[1] <U+FEFF>hiersumian
Levels: <U+01E3>mettigan <U+FEFF>hiersumian
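The <U+FEFF> in the last output is the code point used as a byte order mark (BOM), which in UTF-8 is the byte sequence EF BB BF. The following is a minimal sketch, not part of the original question, of detecting and dropping a BOM at the byte level so that it does not end up glued to the first value. The file is created inside the example, and the names path, bom and rows are illustrative.

```r
# Create a small UTF-8 file that starts with a BOM (for demonstration only).
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xef, 0xbb, 0xbf)), con)   # UTF-8 byte order mark
writeLines(c("h\u012bersumian", "\u01e3mettigan"), con, useBytes = TRUE)
close(con)

# Read the file as raw bytes and strip the BOM if it is present.
bytes <- readBin(path, what = "raw", n = file.size(path))
bom <- as.raw(c(0xef, 0xbb, 0xbf))
has_bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)
if (has_bom) bytes <- bytes[-(1:3)]

# Turn the remaining bytes into strings and mark them as UTF-8.
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"
rows <- strsplit(txt, "\n", fixed = TRUE)[[1]]
```

Working on raw bytes keeps the check independent of the locale; whether the strings then print correctly still depends on the console.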
-
Ben Bolker over 10 years
Works for me (R-devel on Ubuntu 12.04) when viewing the file in the terminal, vi, or emacs.
-
Sverre over 10 years
@BenBolker Does this mean that this problem is specific to the Windows version of R?
-
Konrad Rudolph over 10 years
To clarify: this is a Windows-specific problem. On OS X the result is verifiably correct: file test.txt replies with "test.txt: UTF-8 Unicode text". A hexdump shows the correct bytes. Well-written question though.
-
Sverre over 10 years
Isn't it more correct to say that this is a problem specific to the R version for Windows (R exists in different versions depending on the OS)? I don't have any problems with using UTF-8 and Unicode in Windows otherwise, so I doubt the problem lies with Windows.
-
Sverre over 10 years
I've submitted a request to the R-devel mailing list for UTF-8 to be properly supported in future versions of R for Windows.
-
Ben Bolker over 10 years
Now that you got a chilly response on r-devel (article.gmane.org/gmane.comp.lang.r.devel/34861), I wonder if answers here could focus on workarounds.
-
user almost 7 years
Possible duplicate of UTF-8 file output in R
-
MichaelChirico about 7 years
While write.table still appears to fail on my machine (Ubuntu), the automatic conversion of "hīersumian" no longer seems to be an issue in my current version of R (3.3.2).