R: can't read Unicode text files even when specifying the encoding


After reading the documentation more closely, I found the answer to my question.

The encoding argument of readLines only declares how the input strings should be marked; it does not convert them. The documentation says:

encoding to be assumed for input strings. It is used to mark character strings as known to be in Latin-1 or UTF-8: it is not used to re-encode the input. To do the latter, specify the encoding as part of the connection con or via options(encoding=): see the examples. See also ‘Details’.
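The difference between marking and re-encoding can be seen on a single byte. The following is only a small sketch; the variable names are made up for illustration. Setting `Encoding()` changes the label on a string and leaves its bytes alone, while `iconv()` actually produces new bytes.

```r
# One byte, 0xF1, which is "ñ" in Latin-1 (built from raw to avoid source-encoding issues).
x <- rawToChar(as.raw(0xf1))

# Marking: only the declared encoding changes, the byte stays the same.
Encoding(x) <- "latin1"
marked_bytes <- charToRaw(x)

# Re-encoding: iconv() converts the content, so the bytes change.
y <- iconv(x, from = "latin1", to = "UTF-8")
converted_bytes <- charToRaw(y)

print(marked_bytes)     # f1
print(converted_bytes)  # c3 b1
```

The `encoding` argument of `readLines` does the first of these two things, which is why it cannot help with a file stored as UCS-2.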

The proper way of reading a file with an uncommon encoding is, then,

con <- file("UnicodeFile.txt", encoding = "UCS-2LE")  # the connection re-encodes while reading
filetext <- readLines(con)
close(con)
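To try this without the original file, here is a self-contained sketch. It assumes a throwaway file in `tempfile()` and two made-up sample lines, and it assumes the session locale can represent the Spanish characters (a UTF-8 or CP1252 locale would do). It writes a UCS-2LE file with a byte order mark, then reads it back both ways.

```r
# Hypothetical demo file: two lines written as UCS-2LE, preceded by the FF FE byte order mark.
path <- tempfile(fileext = ".txt")
lines_utf8 <- c("Espa\u00f1a", "canci\u00f3n")  # \u escapes give UTF-8 marked strings
payload <- iconv(paste0(lines_utf8, "\r\n", collapse = ""),
                 from = "UTF-8", to = "UCS-2LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), payload), path)

# encoding= on readLines() only labels the strings, so the result is still garbled
# (R also warns about embedded nuls, hence suppressWarnings()).
garbled <- suppressWarnings(readLines(path, encoding = "UTF-8"))

# encoding= on the connection converts the bytes as they are read.
con <- file(path, encoding = "UCS-2LE")
decoded <- readLines(con)
close(con)
unlink(path)

# Compare on a common encoding; this should hold when the locale can represent the text.
round_trip_ok <- identical(enc2utf8(decoded), lines_utf8)
print(round_trip_ok)
```

Assigning the connection to a variable first, instead of nesting it inside the `readLines` call, makes it harder to forget the `close(con)`.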
Author by

s_a

Updated on June 18, 2022

Comments

  • s_a
    s_a almost 2 years

    I'm using R 3.1.1 on 32-bit Windows 7. I'm having a lot of problems reading some text files on which I want to perform textual analysis. According to Notepad++, the files are encoded as "UCS-2 Little Endian". (grepWin, a tool whose name says it all, says the file is "Unicode".)

    The problem is that I can't seem to read the file even when I specify that encoding. (The characters are from the standard Spanish Latin set -ñáó- and should be handled easily by CP1252 or anything like that.)

    > Sys.getlocale()
    [1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
    > readLines("filename.txt")
     [1] "ÿþE" ""    ""    ""    ""   ...
    > readLines("filename.txt",encoding="UTF-8")
     [1] "\xff\xfeE" ""          ""          ""          ""    ...
    > readLines("filename.txt",encoding="UCS2LE")
     [1] "ÿþE" ""    ""    ""    ""    ""    ""     ...
    > readLines("filename.txt",encoding="UCS2")
     [1] "ÿþE" ""    ""    ""    ""    ...
    

    Any ideas?

    Thanks!!


    Edit: the "UTF-16", "UTF-16LE" and "UTF-16BE" encodings fail similarly.