Reading Rdata file with different encoding

Solution 1

Thanks to 42's comment, I've managed to write a function to recode the file:

fix.encoding <- function(df, originalEncoding = "latin1") {
  numCols <- ncol(df)
  for (col in 1:numCols) Encoding(df[, col]) <- originalEncoding
  return(df)
}

The key line is Encoding(df[, col]) <- "latin1", which marks the strings in column col of data frame df as Latin-1. It does not rewrite the bytes; it only tells R how to interpret them. Encoding works on character vectors rather than on whole data frames, so I wrote a function that loops over all columns and applies the assignment to each one.
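To see what the Encoding assignment does to a single string, here is a small sketch using a hand-built test string (not data from the original file). It suggests that marking the encoding leaves the bytes untouched, and that a function such as enc2utf8() is what actually transcodes them:

```r
# "façile" written with the single Latin-1 byte 0xE7 for "ç" (made-up test string)
x <- "fa\xe7ile"

# Declare that the bytes are Latin-1; the bytes themselves do not change
Encoding(x) <- "latin1"
Encoding(x)    # "latin1"
charToRaw(x)   # 66 61 e7 69 6c 65

# enc2utf8() performs the actual conversion to UTF-8 bytes
y <- enc2utf8(x)
Encoding(y)    # "UTF-8"
charToRaw(y)   # 66 61 c3 a7 69 6c 65
```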

Of course, if the problem affects only a couple of columns, you're better off applying Encoding to just those columns instead of the whole data frame (you can modify the function above to take a set of columns as input). Also, if you're facing the inverse problem, i.e. reading an R object created on Linux or Mac OS into Windows, use originalEncoding = "UTF-8".
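One possible way to restrict the fix to a set of columns is sketched below. The function name fix.encoding.cols, its cols argument, and the example data are all made up for illustration; cols is assumed to hold column names or positions.

```r
# Hypothetical variant of fix.encoding() restricted to selected columns
fix.encoding.cols <- function(df, cols, originalEncoding = "latin1") {
  for (col in cols) {
    # df[[col]] extracts a single column by name or position;
    # non-character columns are skipped
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- originalEncoding
    }
  }
  df
}

# Made-up example data: one Latin-1 text column and one numeric column
people <- data.frame(name = "Jos\xe9", age = 30, stringsAsFactors = FALSE)
people <- fix.encoding.cols(people, cols = "name")
Encoding(people$name)   # "latin1"
```

Passing a numeric column by mistake is harmless here, because the is.character() check skips it.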

Solution 2

Following up on the previous answers, this is a minor update that makes the function work with factors and dplyr's tibbles. Thanks for the inspiration.

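fix.encoding <- function(df, originalEncoding = "UTF-8") {
  # Work on a plain data.frame so that df[, col] returns a vector, not a tibble
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  for (col in seq_len(ncol(df))) {
    if (is.character(df[, col])) {
      Encoding(df[, col]) <- originalEncoding
    }
    # For factors, the text lives in the levels
    if (is.factor(df[, col])) {
      Encoding(levels(df[, col])) <- originalEncoding
    }
  }
  # as_tibble() replaces the deprecated as_data_frame(); needs the tibble package
  return(tibble::as_tibble(df))
}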

Solution 3

Thank you for posting this. I took the liberty of modifying your function to handle a data frame in which some columns are character and some are not. Without the change, an error occurs:

> fix.encoding(adress)
Error in `Encoding<-`(`*tmp*`, value = "latin1") :
 a character vector argument expected

So here is the modified function:

fix.encoding <- function(df, originalEncoding = "latin1") {
    numCols <- ncol(df)
    for (col in seq_len(numCols)) {
        # Only character columns can have an encoding set
        if (is.character(df[, col])) {
            Encoding(df[, col]) <- originalEncoding
        }
    }
    return(df)
}

However, this will not change the encoding of the level names in a factor column. Luckily, I found the following snippet, which converts all factors in the data frame to character (this may not be the best approach, but in my case it was what I needed):

i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)

Author: Waldir Leoncio

Updated on June 17, 2022

Comments

  • Waldir Leoncio, almost 2 years

    I have an .RData file to read on my Linux (UTF-8) machine, but I know the file is in Latin1 because I created it myself on Windows. Unfortunately, I don't have access to the original files or a Windows machine, and I need to read the file on my Linux machine.

    To read an Rdata file, the normal procedure is to run load("file.Rdata"). Functions such as read.csv have an encoding argument that you can use to solve this kind of issue, but load has no such thing. If I try load("file.Rdata", encoding = "latin1"), I just get this (expected) error:

    Error in load("file.Rdata", encoding = "latin1") : unused argument (encoding = "latin1")

    What else can I do? My files contain text variables with accented characters that get corrupted when opened in a UTF-8 environment.
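Since load has no encoding argument, one workable pattern is to load the file into a separate environment and re-mark the encoding of every data frame afterwards. The sketch below is self-contained: it first writes a small .RData file containing Latin-1 bytes to stand in for the Windows file, so the file path, the object name cities, and the helper shown here are all assumptions made for illustration.

```r
# Helper in the spirit of the answers above: mark character columns and
# factor levels as being in the original encoding
fix.encoding <- function(df, originalEncoding = "latin1") {
  for (col in seq_along(df)) {
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- originalEncoding
    } else if (is.factor(df[[col]])) {
      Encoding(levels(df[[col]])) <- originalEncoding
    }
  }
  df
}

# Simulate the file from Windows: a string holding raw Latin-1 bytes
cities <- data.frame(city = "S\xe3o Paulo", stringsAsFactors = FALSE)
rdata_path <- tempfile(fileext = ".RData")
save(cities, file = rdata_path)
rm(cities)

# Load into a separate environment, then fix every data frame found there
loaded <- new.env()
load(rdata_path, envir = loaded)
for (name in ls(loaded)) {
  obj <- get(name, envir = loaded)
  if (is.data.frame(obj)) assign(name, fix.encoding(obj), envir = loaded)
}

Encoding(loaded$cities$city)   # "latin1"
enc2utf8(loaded$cities$city)   # the same text converted to UTF-8
```

Loading into a dedicated environment keeps the loop from touching unrelated objects in the global workspace.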