How to source() .R file saved using UTF-8 encoding?
Solution 1
We talked about this a lot in the comments to my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. It works both with input from the R console (see the screenshot in the comments) and with input from a file, as this screenshot shows:
The file "myfile.r" contains:
russian <- function() print ("Американские с...");
The console contains:
source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."
Note that reading the file in fails, and the error points to the same character as the original poster's error (the one after "R). I cannot do this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: you just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).
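As a rough sketch of the locale switch described above, the helper below sets LC_CTYPE temporarily and restores the previous value afterwards. The name with_ctype is hypothetical (it is not part of R or of the original answer), and locale names such as "ru" or "chinese" are Windows-specific and may not be available on every machine.

```r
# Hypothetical helper (not part of base R): evaluate `expr` while LC_CTYPE is
# temporarily set to `locale`, then restore the previous setting.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (and warns) when the request cannot be honored.
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(new, "")) {
    stop("locale '", locale, "' is not available on this system")
  }
  force(expr)  # `expr` is evaluated lazily, i.e. only after the switch
}

# Intended use on Windows, assuming the "ru" locale exists there:
#   with_ctype("ru", source("myfile.r", encoding = "utf-8"))

# Portable demonstration with the always-available "C" locale:
print(with_ctype("C", Sys.getlocale("LC_CTYPE")))
```

Restoring the old locale in on.exit() means the session is left unchanged even if the sourced file raises an error.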
Solution 2
On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.
This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:
eval(parse(filename, encoding="UTF-8"))
This'll work almost exactly like source() with default arguments, but won't let you do echo=T, eval.print=T, etc.
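A self-contained way to try the eval(parse(...)) approach is sketched below. It writes a throwaway UTF-8 file first; the file contents and the function name danish are only for illustration.

```r
# Write a small script as raw UTF-8 bytes so the result does not depend on the
# current locale. The \u escape is expanded by R before the text is written.
path <- tempfile(fileext = ".R")
writeBin(charToRaw(enc2utf8('danish <- function() "Sk\u00f8nt"\n')), path)

# Parse the file as UTF-8 and evaluate the resulting expressions.
eval(parse(path, encoding = "UTF-8"))

print(danish())
```

Because parse() only marks the strings as UTF-8, the characters do not have to be representable in the session's code page.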
Solution 3
I think the problem lies with R. I can happily source UTF-8 files, or UCS-2LE files, with many non-ASCII characters in them. But some characters cause it to fail. For example, the following
danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.
Locale seems irrelevant here. It's just a file, you tell it what encoding the file is, why should your locale matter?
Solution 4
For me (on Windows), I do:
source.utf8 <- function(f) {
  # Read the lines and mark them as UTF-8
  l <- readLines(f, encoding="UTF-8")
  # Parse and evaluate the text in the global environment
  eval(parse(text=l),envir=.GlobalEnv)
}
It works fine.
Solution 5
Building on crow's answer, this solution makes RStudio's Source button work.
When you hit that Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:
# Override source() with a version that reads the file as UTF-8
source <- function(f, encoding = 'UTF-8') {
  l <- readLines(f, encoding=encoding)
  eval(parse(text=l),envir=.GlobalEnv)
}
You can then add that script to an .Rprofile file, so it will execute on startup.
Tony Breyal
Updated on July 09, 2022
Comments
-
Tony Breyal almost 2 years
The following, when copied and pasted directly into R works fine:
> character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
> character_test()
[1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:
> source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") :
  C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
1: character.test <- function() print("R
2:
  ^
In addition: Warning message:
In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
  invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.
> sessionInfo() # Windows 7 Pro x64
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

loaded via a namespace (and not attached):
[1] tools_2.12.1
and
> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252
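The l10n_info() output above reports a single-byte Latin-1 session (code page 1252), which cannot represent Chinese or Cyrillic characters. A minimal sketch for checking the same thing programmatically follows; the variable name utf8_native and the wording of the messages are only illustrative.

```r
# Inspect the session's native encoding before sourcing non-ASCII files.
info <- l10n_info()
utf8_native <- isTRUE(info[["UTF-8"]])

if (utf8_native) {
  message("Native encoding is UTF-8; source(..., encoding = \"UTF-8\") needs no conversion.")
} else {
  message("Native encoding is not UTF-8; characters outside it may fail to convert.")
}
```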
-
Bernd Elkemann about 13 years
I just tried WinEdt (for which there is an often-used R plugin, RWinEdt) and it does not work (version 5.5). So you might want to try it with "Notepad2" first. You can also write the UTF-8 text file yourself using R's writeChar(); I think it uses the encoding you set in Sys.setlocale().
-
David Heffernan about 13 years
It doesn't matter which text editor writes the file; they can all write the file correctly. R on Windows just fails to read it.
-
Bernd Elkemann about 13 years
@David Heffernan The problem the original poster is having is different from yours. Yes, R can read UTF-8 files, but the way his editor is set up doesn't even create a UTF-8 file. He uses an editor that is not set to UTF-8 mode, and thus if he copies "R同时也" into it, the file becomes the bytes [52 3F 3F 3F] "R???".
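Whether a file really contains UTF-8 bytes or literal question marks (0x3F) can be checked by reading it as raw bytes. The sketch below creates its own sample file for illustration; for a real script, point path at that file instead. validUTF8() requires R 3.3.0 or later.

```r
# Create a sample file holding "R" followed by three Chinese characters,
# written as raw UTF-8 bytes (illustrative stand-in for the real script).
path <- tempfile(fileext = ".R")
writeBin(charToRaw(enc2utf8("R\u540c\u65f6\u4e5f")), path)

# Read the file back byte by byte.
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)

# A file saved without UTF-8 support would show 3f ("?") here instead of
# three-byte sequences starting with e4, e5 or e6.
has_question_marks <- any(bytes == as.raw(0x3f))
is_valid_utf8 <- validUTF8(rawToChar(bytes))
print(c(question_marks = has_question_marks, valid_utf8 = is_valid_utf8))
```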
-
David Heffernan about 13 years
@eznme I don't think so. OP states that the file is saved with UTF-8 encoding. I save the same file with UTF-8 encoding (or indeed UTF-16) and get the same error. The problem is with R.
-
David Heffernan about 13 years
@eznme Just take a look at my answer and try to get R to source the file with the Russian in!
-
Bernd Elkemann about 13 years
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")
russian()
[1] "Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями."
-
Bernd Elkemann about 13 years
To do that, use: Sys.setlocale("LC_CTYPE","ru")
-
Tony Breyal about 13 years
@eznme Cheers, but as @David says, my file was originally saved in Notepad, set to UTF-8 format mode. I installed Notepad2 to try it out (quite nice, thanks for mentioning it, didn't know about it before), changed it to UTF-8 and still have the same issue.
-
David Heffernan about 13 years
@Tony Notepad2 is nice, Notepad++ is even nicer!
-
David Heffernan about 13 years
@eznme @Tony What does locale have to do with anything? It's just a file read. Anyway, my machine says "OS reports request to set locale to "ru" cannot be honored". How did you get it to work?
-
Tony Breyal about 13 years
@David I actually agree with you that the locale shouldn't matter, because I'm specifically telling R to read in the file as UTF-8 encoding, but I'm not an R expert and so am very willing to try different things out if they work. I get the same "cannot be honored" message as you. Also, just downloaded Notepad++ and very nice it is too!
-
David Heffernan about 13 years
@Tony Really, how can this be anything other than a bug in R, as I suggest in my answer?
-
Bernd Elkemann about 13 years
In my screenshot you can see that when I set the locale to "ru" the Russian text displays correctly; when I set it to "German" it does not.
-
Tony Breyal about 13 years
I'm going to post my question to the official R-help list, just in case it really is an error of R on Windows.
-
David Heffernan about 13 years
@eznme I don't see you calling source on a UTF-8 file with that text in it in that screenshot. That's what doesn't work. The use of locales you are illustrating is for dealing with 8-bit character sets. A modern Unicode program uses Unicode text, and so locales are only used for things like date/time/number formatting preferences.
-
Tony Breyal about 13 years
Many thanks, this worked! I used Sys.setlocale("LC_CTYPE","chinese")
-
Bernd Elkemann about 13 years
Anytime, sir. ("chinese", not "Chinese"; interesting how inconsistent they are. Good that you found out.)
-
David Heffernan about 13 years
How do you load a file that contains multiple languages? Something is wrong in R!
-
Bernd Elkemann about 13 years
You just switch the locale multiple times inside that file. I'm not sure the problem is with R; some commenters said that it's fine on Linux (without locale switching). It may be R, but it may be the Windows API (widechar instead of UTF-8) or a combination thereof.
-
Tony Breyal about 13 years
@David @eznme Just saw this on the official R-help list, in which Prof Ripley says something about UTF-8 locales on Windows: goo.gl/cUZCm
-
David Heffernan about 13 years
@Tony Prof. Ripley is talking out of his hat! Windows supports UTF-8 just fine. Windows has supported Unicode since 1991 and the reason it uses UTF-16 rather than UTF-8 as on Linux is that it supported Unicode before UTF-8 was even invented! My Windows app eats all these characters for breakfast. Locales should be irrelevant when you specify an encoding. I'm fingering iconv as the culprit here, but I'm afraid that if Prof. Ripley is taking that attitude then R on Windows has little hope of ever supporting Unicode properly.
-
David Heffernan about 13 years
@eznme There just should be no need for locales. That might be how it's done on Linux, but it makes no sense on Windows. You just use the WideChar versions of all the API functions, hold the text as LPWSTR, and convert to different encodings at the boundaries (file import/export). It's not that difficult, but I understand that it becomes more difficult if you want to support Linux and Windows from a single codebase!
-
David Heffernan about 13 years
@eznme Of course I can't get this locale thing to go because I can't select the ru locale on my machine. What a mess!
-
Anton Tarasenko over 10 years
I confirm that this works. source() requires setting Sys.setlocale() all along the file. eval does the job without this requirement.
-
Konrad Rudolph almost 10 years
source forwards the encoding argument to file, which, in turn, converts the textual input in memory to whatever locale encoding is specified (and fails) – this seems to be the culprit. parse by contrast doesn’t do this, it reads the file as-is and just marks the bytes in memory with the correct encoding. – I’m not entirely sure what this tells us, except that R’s internal handling of encodings is messy (we already knew that), and should be fixed, backwards compatibility be damned.
-
crow16384 over 9 years
Yes. R 3.1.1 also can't do source(file, encoding="UTF-8") for Russian.
-
retorquere about 7 years
The solution doesn't work for me. If I have this in my R source:
boxplot(weight~Diet,data=ChickWeight,subset = Time ==21,col = "yellow", main="Gewicht van kuikens in gram op dag 21 bij verschillende diëten", xlab="dieet", ylab="gewicht in gram", sub="bron:package datasets in R")
I still get INCOMPLETE_STRING. Also, is there a way to make RStudio source in UTF-8 by default?
-
Konrad Rudolph about 3 years
The readLines call is redundant. See Joe Cheng’s answer. Furthermore, when replacing the source function it’s a good idea to handle the remaining arguments, e.g. local, correctly.
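One possible reading of this suggestion is sketched below: parse the file directly (no readLines) and honour a local argument the way source() documents it (FALSE for the global environment, TRUE for the caller's environment, or an explicit environment). The name source_utf8 is hypothetical, and other source() arguments such as echo are deliberately not handled.

```r
# Hypothetical wrapper (not from the thread): parse the file as UTF-8 and
# evaluate it in the environment selected by `local`.
source_utf8 <- function(file, local = FALSE) {
  envir <- if (is.environment(local)) {
    local
  } else if (isTRUE(local)) {
    parent.frame()
  } else {
    globalenv()
  }
  invisible(eval(parse(file, encoding = "UTF-8"), envir = envir))
}

# Demonstration on a throwaway UTF-8 file, evaluated in its own environment.
path <- tempfile(fileext = ".R")
writeBin(charToRaw(enc2utf8('greeting <- "\u0410\u043c\u0435\u0440"\n')), path)
env <- new.env()
source_utf8(path, local = env)
print(get("greeting", envir = env))
```

Giving the wrapper its own name, rather than masking source, keeps the original function available for files that do not need this treatment.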