UTF-8 / Unicode Text Encoding with RPostgreSQL
Solution 1
After exporting to R it's shown as: "StÃ©phane" (the é is encoded as Ã©)
Your R environment is using a 1-byte non-composed encoding like latin-1 or windows-1252. Witness this test in Python, demonstrating that the UTF-8 bytes for é, decoded as if they were latin-1, produce the text you see:
>>> print(u"é".encode("utf-8").decode("latin-1"))
Ã©
Either SET client_encoding = 'windows-1252' or fix the encoding your R environment uses. If it's running in a cmd.exe console you'll need to mess with the chcp console command; otherwise it's specific to whatever your R runtime is.
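The same round trip can be run in both directions to confirm the diagnosis. This is only a sketch in Python 3: `garble` and `ungarble` are hypothetical helper names, and latin-1 is assumed as the wrong decoding, as in the test above.

```python
def garble(text: str) -> str:
    """Reproduce the symptom: encode as UTF-8, then misread the bytes as latin-1."""
    return text.encode("utf-8").decode("latin-1")


def ungarble(text: str) -> str:
    """Undo it: recover the original bytes, then decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


broken = garble("Stéphane")
print(broken)            # StÃ©phane
print(ungarble(broken))  # Stéphane
```

If the reverse step restores the expected text, the bytes arriving from the database are valid UTF-8 and only the decoding on the client side is wrong.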
Solution 2
As Craig Ringer said, setting client_encoding to windows-1252 is probably not the best thing to do. Indeed, if the data you're retrieving contains a single exotic character, you're in trouble:
Error in postgresqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not Retrieve the result : ERROR: character 0xcca7 of encoding "UTF8" has no equivalent in "WIN1252" )
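To see why that particular character fails, the bytes from the error message can be inspected in Python. This is a sketch that assumes Python's `cp1252` codec is a reasonable stand-in for PostgreSQL's WIN1252; the two are separate implementations.

```python
import unicodedata

# The two bytes reported in the error message: 0xcc 0xa7
raw = bytes([0xCC, 0xA7])
char = raw.decode("utf-8")
print(f"U+{ord(char):04X}", unicodedata.name(char))  # U+0327 COMBINING CEDILLA

# windows-1252 has no slot for this code point, so encoding fails.
try:
    char.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False
```

Any value containing a code point outside the windows-1252 repertoire will trigger the same kind of failure, which is why the conversion breaks on a single such character.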
On the other hand, getting your R environment to use Unicode could be impossible (I have the same problem as you with Sys.setlocale... Same in this question too.).
A workaround is to manually declare UTF-8 encoding on all your data, using a function like this one:
set_utf8 <- function(x) {
  # Declare UTF-8 encoding on all character columns:
  chr <- sapply(x, is.character)
  x[, chr] <- lapply(x[, chr, drop = FALSE], `Encoding<-`, "UTF-8")
  # Same on column names:
  Encoding(names(x)) <- "UTF-8"
  x
}
And you have to use this function in all your queries:
set_utf8(dbGetQuery(con, "SELECT myvar FROM mytable"))
EDIT: Another possibility is to use RPostgres instead of RPostgreSQL. I tested it (with the same config as in your question), and as far as I can see all declared encodings are automatically set to UTF-8.
Solution 3
If you use RPostgres::Postgres() as the first parameter of dbConnect(), you will normally not have any problem with encoding.
I tried this in a script where I had the same problem, and now my accented characters are OK.
dbConnect(RPostgres::Postgres(), user = "user", password = "psw", host = "host", port = 5432, dbname = "db_name")
Solution 4
This will fix any Unicode/UTF-8 problems in Windows. It must be executed before querying the database.
postgresqlpqExec(con, "SET client_encoding = 'windows-1252'")
Drawn from asker's misplaced self-answer, visible in question revision history
David L
Updated on June 06, 2022
Comments
-
David L almost 2 years
I'm running R on a Windows machine which is directly linked to a PostgreSQL database. I'm not using RODBC. My database is encoded in UTF-8 as confirmed by the following R command:
dbGetQuery(con, "SHOW CLIENT_ENCODING")
#   client_encoding
# 1            UTF8
However, when some text is read into R, it displays as strange text in R.
For example, the following text is shown in my PostgreSQL database: "Stéphane"
After exporting to R it's shown as: "StÃ©phane" (the é is encoded as Ã©)
When importing to R I use the dbConnect command to establish a connection and the dbGetQuery command to query data using SQL. I do not specify any text encoding anywhere when connecting to the database or when running a query.
I've searched online and can't find a direct resolution to my issue. I found this link, but their issue is with RODBC, which I'm not using.
This link is helpful in identifying the symbols, but I don't just want to do a find & replace in R... way too much data.
I did try running the following commands and got a warning.
Sys.setlocale("LC_ALL", "en_US.UTF-8")
# [1] ""
# Warning message:
# In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
#   OS reports request to set locale to "en_US.UTF-8" cannot be honored
Sys.setenv(LANG="en_US.UTF-8")
Sys.setenv(LC_CTYPE="UTF-8")
The warning occurs on the Sys.setlocale("LC_ALL", "en_US.UTF-8") command. My intuition is that this is a Windows-specific issue and doesn't occur with Mac/Linux/Unix.
-
nwellnhof about 10 years
Note that client_encoding is not the actual encoding used by your database. You can find the encoding for a database using the psql -l option or the \l command.
-
-
David L about 10 years
Yes this works. I ran the command postgresqlpqExec(con, "SET client_encoding = 'windows-1252'") before loading the data from PostgreSQL, and even though the system returns FALSE, it still converts to the desired character. Thanks!
-
Craig Ringer about 10 years
@DavidL Just be aware that if you take that approach, and your data contains chars that cannot be represented in windows-1252, queries will fail with encoding errors. If possible it'd be better to get your R environment using Unicode instead.
-
Nathan Tuggy almost 7 years
@Scarabee: I checked before posting and Craig's had less detail about this, mentioning only that there would need to be some R-runtime-specific way to set client_encoding.
-
Peter.k about 6 years
How to set your R environment using Unicode instead?
-
moj almost 6 years
I can confirm that using RPostgres solves the problem.
-
Fato39 about 4 years
I get Error in postgresqlNewConnection(drv, ...) : unused argument (encoding = "latin1")
-
Fabio Correa almost 3 years
Both Postgres and the set_utf8 solutions worked very well.