How to source() .R file saved using UTF-8 encoding?


Solution 1

We discussed this at length in the comments on my previous post, but I don't want it to get lost on page 3 of the comments: you have to set the locale. This works both with input typed at the R console (see the screenshot in the comments) and with input from a file (see this screenshot):

The file "myfile.r" contains:

russian <- function() print ("Американские с...");

The console contains:

source("myfile.r", encoding="utf-8")
> Error in source(".....
Sys.setlocale("LC_CTYPE","ru")
> [1] "Russian_Russia.1251"
russian()
[1] "Американские с..."

Note that reading the file in fails at first, and the error points to the same character as in the original poster's error (the one after "R). I cannot do this with Chinese because I would have to install "Microsoft Pinyin IME 3.0", but the process is the same: just replace the locale with "chinese" (the naming is a bit inconsistent; consult the documentation).
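The locale switch can also be wrapped so that the previous setting is restored afterwards. This is only a sketch: the helper name source_with_locale is hypothetical, and which locale names are accepted ("ru", "chinese", ...) depends on the operating system.

```r
# Hypothetical helper: temporarily switch LC_CTYPE, source the file,
# then restore the previous locale even if source() fails.
source_with_locale <- function(path, locale, encoding = "UTF-8") {
    old <- Sys.getlocale("LC_CTYPE")
    on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)

    # Sys.setlocale() returns "" (with a warning) when the OS refuses the request.
    new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
    if (identical(new, "")) {
        stop("locale '", locale, "' cannot be set on this system")
    }
    source(path, encoding = encoding)
}
```

On a Windows machine where the locale is available, the call would look like source_with_locale("myfile.r", "ru"). If the OS reports that the request cannot be honored, the helper stops with an error instead of sourcing the file under the wrong locale.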

Solution 2

On R/Windows, source runs into problems with any UTF-8 characters that can't be represented in the current locale (or ANSI Code Page in Windows-speak). And unfortunately Windows doesn't have UTF-8 available as an ANSI code page--Windows has a technical limitation that ANSI code pages can only be one- or two-byte-per-character encodings, not variable-byte encodings like UTF-8.

This doesn't seem to be a fundamental, unsolvable problem--there's just something wrong with the source function. You can get 90% of the way there by doing this instead:

eval(parse(filename, encoding="UTF-8"))

This works almost exactly like source() with default arguments, but it won't let you use echo=TRUE, eval.print=TRUE, etc.
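As a rough sketch of this approach, the snippet below wraps eval(parse(...)) in a helper and tries it on a throwaway file. The name source_parse and the temporary file are illustrative assumptions, not part of the answer; the test string is written with \u escapes so that the snippet itself stays ASCII.

```r
# Hypothetical helper: parse the file as UTF-8, then evaluate every
# expression in the global environment.
source_parse <- function(path) {
    invisible(eval(parse(path, encoding = "UTF-8"), envir = globalenv()))
}

# Write a one-line UTF-8 script to a temporary file. useBytes = TRUE keeps
# writeLines() from re-encoding the text into the native locale.
tmp <- tempfile(fileext = ".R")
writeLines(enc2utf8("msg <- \"\u0410\u043c\""), tmp, useBytes = TRUE)

source_parse(tmp)
utf8ToInt(msg)  # code points of the two Cyrillic letters read from the file
```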

Solution 3

I think the problem lies with R. I can happily source UTF-8 or UCS-2LE files containing many non-ASCII characters, but some characters cause it to fail. For example, the following

danish <- function() print("Skønt H. C. Andersens barndomsomgivelser var meget fattige, blev de i hans rige fantasi solbeskinnede.")
croatian <- function() print("Dodigović. Kako se Vi zovete?")
new_testament <- function() print("Ne provizu al vi trezorojn sur la tero, kie tineo kaj rusto konsumas, kaj jie ŝtelistoj trafosas kaj ŝtelas; sed provizu al vi trezoron en la ĉielo")
russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.")

is fine in both UTF-8 and UCS-2LE without the Russian line. But if that is included then it fails. I'm pointing the finger at R. Your Chinese text also appears to be too hard for R on Windows.

Locale seems irrelevant here. It's just a file: you tell R what encoding the file uses, so why should your locale matter?

Solution 4

On Windows, I use:

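source.utf8 <- function(f) {
    # Read the file's lines and mark them as UTF-8
    l <- readLines(f, encoding="UTF-8")
    # Parse the lines and evaluate them in the global environment
    eval(parse(text=l),envir=.GlobalEnv)
}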

It works fine.

Solution 5

Building on crow's answer, this solution makes RStudio's Source button work.

When you hit the Source button, RStudio executes source('myfile.r', encoding = 'UTF-8'), so overriding source makes the errors disappear and runs the code as expected:

source <- function(f, encoding = 'UTF-8') {
    l <- readLines(f, encoding=encoding)
    eval(parse(text=l),envir=.GlobalEnv)
}

You can then add that script to a .Rprofile file, so that it executes on startup.
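A comment on this answer points out that the readLines call is redundant and that a replacement should also handle the remaining arguments, such as local. A minimal sketch of that suggestion follows; the name source_utf8 is hypothetical, and only the local argument is handled, mirroring how source() interprets it.

```r
# Hypothetical helper; not a full replacement for source().
source_utf8 <- function(path, local = FALSE, encoding = "UTF-8") {
    # local = FALSE: global environment; local = TRUE: the caller's
    # environment; an environment is used as given.
    envir <- if (is.environment(local)) {
        local
    } else if (isTRUE(local)) {
        parent.frame()
    } else {
        globalenv()
    }
    # parse() reads the file directly, so readLines() is not needed.
    invisible(eval(parse(path, encoding = encoding), envir = envir))
}

# Usage: evaluate a throwaway script inside a separate environment.
tmp <- tempfile(fileext = ".R")
writeLines("answer <- 6 * 7", tmp)
sandbox <- new.env()
source_utf8(tmp, local = sandbox)
get("answer", envir = sandbox)
```

Giving the helper its own name rather than masking source keeps the base function available for the arguments that are not covered here, such as echo.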

Author by

Tony Breyal

Updated on July 09, 2022

Comments

  • Tony Breyal
    Tony Breyal almost 2 years

    The following, when copied and pasted directly into R works fine:

    > character_test <- function() print("R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示...")
    > character_test()
    [1] "R同时也被称为GNU S是一个强烈的功能性语言和环境,探索统计数据集,使许多从自定义数据图形显示..."
    

    However, if I make a file called character_test.R containing the EXACT SAME code, save it in UTF-8 encoding (so as to retain the special Chinese characters), then when I source() it in R, I get the following error:

    > source(file="C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8")
    Error in source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "utf-8") : 
      C:\Users\Tony\Desktop\character_test.R:3:0: unexpected end of input
    1: character.test <- function() print("R
    2: 
      ^
    In addition: Warning message:
    In source(file = "C:\\Users\\Tony\\Desktop\\character_test.R", encoding = "UTF-8") :
      invalid input found on input connection 'C:\Users\Tony\Desktop\character_test.R'
    

    Any help you can offer in solving and helping me to understand what is going on here would be much appreciated.

    > sessionInfo() # Windows 7 Pro x64
    R version 2.12.1 (2010-12-16)
    Platform: x86_64-pc-mingw32/x64 (64-bit)
    
    locale:
    [1] LC_COLLATE=English_United Kingdom.1252 
    [2] LC_CTYPE=English_United Kingdom.1252   
    [3] LC_MONETARY=English_United Kingdom.1252
    [4] LC_NUMERIC=C                           
    [5] LC_TIME=English_United Kingdom.1252    
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods  
    [7] base     
    
    loaded via a namespace (and not attached):
    [1] tools_2.12.1
    

    and

    > l10n_info()
    $MBCS
    [1] FALSE
    
    $`UTF-8`
    [1] FALSE
    
    $`Latin-1`
    [1] TRUE
    
    $codepage
    [1] 1252
    
  • Bernd Elkemann
    Bernd Elkemann about 13 years
    I just tried WinEdt (for which there is an often-used R plugin, RWinEdt) and it does not work (version 5.5). So you might want to try it with "Notepad2" first. You can also write the UTF-8 text file yourself using [R] writeChar(); I think it uses the encoding you set in Sys.setlocale().
  • David Heffernan
    David Heffernan about 13 years
    It doesn't matter which text editor writes the file, they can all write the file correctly, R on Windows just fails to read it.
  • Bernd Elkemann
    Bernd Elkemann about 13 years
    @David Heffernan The problem the original poster is having is different from yours. Yes, R can read UTF-8 files, but the way his editor is set up doesn't even create a UTF-8 file. He uses an editor that is not set to UTF-8 mode, and thus if he copies "R同时也" into it, the file becomes the bytes [52 3F 3F 3F] "R???".
  • David Heffernan
    David Heffernan about 13 years
    @eznme I don't think so. OP states that the file is saved with UTF-8 encoding. I save the same file with UTF-8 encoding (or indeed UTF-16) and get the same error. The problem is with R.
  • David Heffernan
    David Heffernan about 13 years
    @eznme just take a look at my answer and try to get R to source the file with the Russian in!
  • Bernd Elkemann
    Bernd Elkemann about 13 years
    russian <- function() print ("Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями.") russian() [1] "Американские суда находятся в международных водах. Япония выразила серьезное беспокойство советскими действиями."
  • Bernd Elkemann
    Bernd Elkemann about 13 years
    To do that use: Sys.setlocale("LC_CTYPE","ru")
  • Tony Breyal
    Tony Breyal about 13 years
    @eznme Cheers, but as @David says, my file was originally saved in Notepad, set to UTF-8 format mode. I installed Notepad2 to try it out (quite nice, thanks for mentioning it, didn't know about it before), changed it to UTF-8 and still have the same issue.
  • David Heffernan
    David Heffernan about 13 years
    @Tony Notepad2 is nice, Notepad++ is even nicer!
  • David Heffernan
    David Heffernan about 13 years
    @eznme @Tony What does locale have to do with anything? It's just a file read. Anyway, my machine says "OS reports request to set locale to "ru" cannot be honored". How did you get it to work?
  • Tony Breyal
    Tony Breyal about 13 years
    @David I actually agree with you in that the locale shouldn't matter because I'm specifically telling R to read in the file as utf-8 encoding, but I'm not an R expert and so am very willing to try different things out if they work. I get the same "cannot be honored" message as you. Also, just downloaded Notepad++ and very nice it is too!
  • David Heffernan
    David Heffernan about 13 years
    @Tony Really, how can this be anything other than a bug in R, as I suggest in my answer?
  • Bernd Elkemann
    Bernd Elkemann about 13 years
    In my screenshot you can see that when i set the locale to "ru" the russian text displays correctly, when i set it to "German" it does not.
  • Tony Breyal
    Tony Breyal about 13 years
    I'm going to post my question to the official R-help list, just in case it really is an error of R on Windows.
  • David Heffernan
    David Heffernan about 13 years
    @eznme I don't see you calling source on a UTF-8 file with that text in it in that screenshot. That's what doesn't work. The use of locales you are illustrating is for dealing with 8-bit character sets. A modern Unicode program uses Unicode text, and so locales are only used for things like date/time/number formatting preferences.
  • Tony Breyal
    Tony Breyal about 13 years
    Many thanks, this worked! I used Sys.setlocale("LC_CTYPE","chinese")
  • Bernd Elkemann
    Bernd Elkemann about 13 years
    Anytime, sir. ("chinese", not "Chinese"; interesting how inconsistent they are. Good that you found out.)
  • David Heffernan
    David Heffernan about 13 years
    how do you load a file that contains multiple languages? Something is wrong in R!
  • Bernd Elkemann
    Bernd Elkemann about 13 years
    You just switch the locale multiple times inside that file. I'm not sure the problem is with R; some commenters said that it's fine on Linux (without locale switching). It may be R, but it may be the Windows API (widechar instead of UTF-8) or a combination thereof.
  • Tony Breyal
    Tony Breyal about 13 years
    @David @eznme Just saw this on the official R-help list, in which Prof Ripley says something about UTF-8 locales on Windows: goo.gl/cUZCm
  • David Heffernan
    David Heffernan about 13 years
    @Tony Prof. Ripley is talking out of his hat! Windows supports UTF-8 just fine. Windows has supported Unicode since 1991 and the reason it uses UTF-16 rather than UTF-8 as on Linux is that it supported Unicode before UTF-8 was even invented! My Windows app eats all these characters for breakfast. Locales should be irrelevant when you specify an encoding. I'm fingering iconv as the culprit here, but I'm afraid that if Prof. Ripley is taking that attitude then R on Windows has little hope of ever supporting Unicode properly.
  • David Heffernan
    David Heffernan about 13 years
    @eznme There just should be no need for locales. That might be how it's done on Linux, but it makes no sense on Windows. You just use the WideChar versions of all the API functions, hold the text as LPWSTR, and convert to different encodings at the boundaries (file import/export). It's not that difficult, but I understand that it becomes more difficult if you want to support Linux and Windows from a single codebase!
  • David Heffernan
    David Heffernan about 13 years
    @eznme Of course I can't get this locale thing to go because I can't select the ru locale on my machine. What a mess!
  • Anton Tarasenko
    Anton Tarasenko over 10 years
    I confirm that this works. source() requires setting Sys.setlocale() all along the file. eval does the job without this requirement.
  • Konrad Rudolph
    Konrad Rudolph almost 10 years
    source forwards the encoding argument to file, which, in turn, converts the textual input in memory to whatever locale encoding is specified (and fails) – this seems to be the culprit. parse by contrast doesn’t do this, it reads the file as-is and just marks the bytes in memory with the correct encoding. – I’m not entirely sure what this tells us, except that R’s internal handling of encodings is messy (we already knew that), and should be fixed, backwards compatibility be damned.
  • crow16384
    crow16384 over 9 years
    Yes. R 3.1.1 also can't do source(file, encoding="UTF-8") for Russian.
  • retorquere
    retorquere about 7 years
    The solution doesn't work for me. If I have this in my R source: boxplot(weight~Diet,data=ChickWeight,subset = Time ==21,col = "yellow", main="Gewicht van kuikens in gram op dag 21 bij verschillende diëten", xlab="dieet", ylab="gewicht in gram", sub="bron:package datasets in R") I still get INCOMPLETE_STRING. Also, is there a way to make r-studio source in utf-8 by default?
  • Konrad Rudolph
    Konrad Rudolph about 3 years
    The readLines call is redundant. See Joe Cheng’s answer. Furthermore, when replacing the source function it’s a good idea to handle the remaining arguments, e.g. local, correctly.