UTF-8 / Unicode Text Encoding with RPostgreSQL

Solution 1

After exporting to R it's shown as: "StÃ©phane" (the é is encoded as Ã©)

Your R environment is using a single-byte encoding such as latin-1 or windows-1252. This test in Python shows that the UTF-8 bytes for é, decoded as if they were latin-1, produce the text you see:

>>> print("é".encode("utf-8").decode("latin-1"))
Ã©
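
The same misreading can be reproduced in base R without a database. This is a minimal sketch; the variable names are illustrative and not part of the original answer:

```r
# UTF-8 encodes "é" (U+00E9) as the two bytes 0xC3 0xA9.
utf8_bytes <- as.raw(c(0xc3, 0xa9))
s <- rawToChar(utf8_bytes)

# Reinterpret those two bytes as latin-1: each byte becomes its own
# character, U+00C3 ("Ã") followed by U+00A9 ("©").
misread <- iconv(s, from = "latin1", to = "UTF-8")

print(misread)
print(charToRaw(misread))  # c3 83 c2 a9, the UTF-8 encoding of the two-character string
```

Two characters appearing where one was stored is the signature of UTF-8 bytes being read with a single-byte encoding.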

Either SET client_encoding = 'windows-1252' or fix the encoding your R environment uses. If it's running in a cmd.exe console you'll need to mess with the chcp console command; otherwise it's specific to whatever your R runtime is.

Solution 2

As Craig Ringer said, setting client_encoding to windows-1252 is probably not the best thing to do. Indeed, if the data you're retrieving contains a single exotic character, you're in trouble:

Error in postgresqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not Retrieve the result : ERROR: character 0xcca7 of encoding "UTF8" has no equivalent in "WIN1252" )
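
The bytes 0xcca7 in that message are the UTF-8 encoding of U+0327 (COMBINING CEDILLA), which windows-1252 cannot represent. The sketch below uses base R's iconv() as a stand-in for the server-side conversion. It assumes your platform's iconv accepts "CP1252" as the name for windows-1252, and the variable names are illustrative:

```r
# U+0327 COMBINING CEDILLA is the character behind "0xcca7" in the error.
cedilla <- "\u0327"
print(charToRaw(cedilla))  # cc a7

# "é" has a windows-1252 code point (0xE9), so this conversion succeeds.
ok <- iconv("St\u00e9phane", from = "UTF-8", to = "CP1252")
print(charToRaw(ok))       # 53 74 e9 70 68 61 6e 65

# "c" followed by U+0327 has no windows-1252 equivalent. Base R usually
# returns NA here; some iconv implementations substitute a placeholder.
bad <- iconv(paste0("c", cedilla), from = "UTF-8", to = "CP1252")
print(bad)
```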

On the other hand, getting your R environment to use Unicode may be impossible (I hit the same problem as you with Sys.setlocale, and so did the asker of another question).
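
Before trying to change the locale, it can help to check what R currently treats as its native encoding. A small diagnostic sketch using only base R; the locale string in the comment is an example, not output from the asker's machine:

```r
# Current character-type locale, for example "English_United States.1252"
# on a Windows setup that uses windows-1252.
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() reports whether the native encoding is UTF-8 or Latin-1.
info <- l10n_info()
print(info[["UTF-8"]])
print(info[["Latin-1"]])
```

If `info[["UTF-8"]]` is FALSE, UTF-8 bytes that reach R without an encoding mark are displayed in the native encoding, which matches the symptom in the question.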

A workaround is to manually declare UTF-8 encoding on all your data, using a function like this one:

set_utf8 <- function(x) {
  # Declare UTF-8 encoding on all character columns:
  chr <- sapply(x, is.character)
  x[, chr] <- lapply(x[, chr, drop = FALSE], `Encoding<-`, "UTF-8")
  # Same on column names:
  Encoding(names(x)) <- "UTF-8"
  x
}

And you have to use this function in all your queries:

set_utf8(dbGetQuery(con, "SELECT myvar FROM mytable"))
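
To see why this works, note that `Encoding<-` only changes the mark R attaches to a string; it does not convert any bytes. A minimal sketch on a single string, assuming the driver hands R the UTF-8 bytes of "Stéphane" without a mark (the variable names are illustrative):

```r
# Bytes of "Stéphane" in UTF-8, as a driver might return them unmarked.
raw_bytes <- as.raw(c(0x53, 0x74, 0xc3, 0xa9, 0x70, 0x68, 0x61, 0x6e, 0x65))
x <- rawToChar(raw_bytes)
print(Encoding(x))  # "unknown": R assumes the native encoding

# Declare the string as UTF-8. This changes the mark, not the bytes.
Encoding(x) <- "UTF-8"
print(Encoding(x))                         # "UTF-8"
print(identical(charToRaw(x), raw_bytes))  # TRUE
print(nchar(x))                            # 8 characters, stored in 9 bytes
```

Because nothing is converted, the declaration is only correct when the bytes really are UTF-8, which is the case when client_encoding is UTF8.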

EDIT: Another possibility is to use RPostgres instead of RPostgreSQL. I tested it (with the same config as in your question), and as far as I can see all declared encodings are automatically set to UTF-8.
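
One way to check the declared encodings on your own query results is to list them per character column. The helper below, `column_encodings`, is hypothetical and not part of either package, and the `demo` data frame stands in for what dbGetQuery() would return:

```r
# Report the distinct declared encodings found in each character column.
column_encodings <- function(df) {
  chr <- vapply(df, is.character, logical(1))
  lapply(df[chr], function(col) unique(Encoding(col)))
}

# Stand-in for a query result: one text column, one integer column.
demo <- data.frame(
  name = c("St\u00e9phane", "Anna"),
  n = 1:2,
  stringsAsFactors = FALSE
)

result <- column_encodings(demo)
print(result)
```

Pure-ASCII strings such as "Anna" always report "unknown", so a mix of "UTF-8" and "unknown" in a column is expected; non-ASCII values reporting "unknown" are the ones that would display incorrectly.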

Solution 3

If you use RPostgres::Postgres() as the first argument of dbConnect(), you should normally not have any encoding problems.

I tried this in a script where I had the same problem, and my accented characters now display correctly.

con <- dbConnect(RPostgres::Postgres(), user = "user", password = "psw", host = "host", port = 5432, dbname = "db_name")

Solution 4

This fixes the display of UTF-8 text on Windows, provided every character in the result can be represented in windows-1252. It must be executed before querying the database.

postgresqlpqExec(con, "SET client_encoding = 'windows-1252'")

Drawn from asker's misplaced self-answer, visible in question revision history


Author: David L

Updated on June 06, 2022

Comments

  • David L
    David L almost 2 years

    I'm running R on a Windows machine which is directly linked to a PostgreSQL database. I'm not using RODBC. My database is encoded in UTF-8 as confirmed by the following R command:

    dbGetQuery(con, "SHOW CLIENT_ENCODING")
    #   client_encoding
    # 1            UTF8
    

    However, when some text is read into R, it displays as strange text in R.

    For example, the following text is shown in my PostgreSQL database: "Stéphane"

    After exporting to R it's shown as: "StÃ©phane" (the é is encoded as Ã©)

    When importing to R I use the dbConnect command to establish a connection and the dbGetQuery command to query data using SQL. I do not specify any text encoding anywhere when connecting to the database or when running a query.

    I've searched online and can't find a direct resolution to my issue. I found this link, but their issue is with RODBC, which I'm not using.

    This link is helpful in identifying the symbols, but I don't just want to do a find & replace in R... way too much data.

    I tried running the commands below and got a warning.

    Sys.setlocale("LC_ALL", "en_US.UTF-8")
    # [1] ""
    # Warning message:
    # In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
    #   OS reports request to set locale to "en_US.UTF-8" cannot be honored
    Sys.setenv(LANG="en_US.UTF-8")
    Sys.setenv(LC_CTYPE="UTF-8")
    

    The warning occurs on the Sys.setlocale("LC_ALL", "en_US.UTF-8") command. My intuition is that this is a Windows specific issue and doesn't occur with Mac/Linux/Unix.

    • nwellnhof
      nwellnhof about 10 years
      Note that client_encoding is not the actual encoding used by your database. You can find the encoding for a database using the psql -l option or the \l command.
  • David L
    David L about 10 years
    Yes this works. I ran the command postgresqlpqExec(con, "SET client_encoding = 'windows-1252'") before loading the data from PostgreSQL, and even though the system returns FALSE, it still converts to the desired character. Thanks!
  • Craig Ringer
    Craig Ringer about 10 years
    @DavidL Just be aware that if you take that approach, and your data contains chars that cannot be represented in windows-1252, queries will fail with encoding errors. If possible it'd be better to get your R environment using Unicode instead.
  • Nathan Tuggy
    Nathan Tuggy almost 7 years
    @Scarabee: I checked before posting and Craig's had less detail about this, mentioning only that there would need to be some R-runtime-specific way to set client_encoding.
  • Peter.k
    Peter.k about 6 years
    How to set your R environment using Unicode instead?
  • moj
    moj almost 6 years
    I can confirm that using RPostgres solves the problem.
  • Fato39
    Fato39 about 4 years
    I get Error in postgresqlNewConnection(drv, ...) : unused argument (encoding = "latin1")
  • Fabio Correa
    Fabio Correa almost 3 years
    Both Postgres and the set_utf8 solutions worked very well.