Apply Encoding to Entire Data.Table


Solution 1

I tried this:

Encoding(raw$title) <- "UTF-8"

This sets the encoding for the entire column, which works fine for now. I'm still open to other options that would do this automatically on import.

Solution 2

This was recently implemented in the development version of data.table, v1.9.5, and will soon be pushed to CRAN (as v1.9.6). Could you please give the devel version a try to see whether it solves this for you?

fread() has gained an encoding argument, specifically for issues on Windows.

library(data.table) # v1.9.5+
fread("file.txt", encoding="UTF-8")

should solve the issue. There's no file for me to test. If it doesn't solve your issue, please file an issue on the project page, with a reproducible example/file.
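Since there is no file to test against, here is a minimal sketch for checking the argument yourself. It assumes data.table v1.9.6 or later is installed; the temporary file and its two made-up lines are purely illustrative and are not the asker's data.

```r
library(data.table)  # assumes v1.9.6+ (fread with an encoding argument)

# Hypothetical sample data: a header plus one row with a Cyrillic title.
sample_lines <- c("id\ttitle", "1\t\u041f\u0440\u0438\u0432\u0435\u0442")

# Write the lines as raw UTF-8 bytes so the locale does not re-encode them.
tmp <- tempfile(fileext = ".tsv")
writeLines(sample_lines, tmp, useBytes = TRUE)

# Read the file back, asking fread to mark non-ASCII strings as UTF-8.
dt <- fread(tmp, sep = "\t", encoding = "UTF-8")

# Inspect the declared encoding of the character column.
print(Encoding(dt$title))

unlink(tmp)
```

Note that the encoding argument only marks the strings with the given encoding; it does not convert the file's bytes, so the file itself must already be stored as UTF-8.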

Solution 3

Sadly, there does not seem to be a way of doing this while importing (yet) with fread.

While you seem to have figured it out already, I'll post a way of setting the encoding of the entire dt after import.

One way of getting it done would be to loop that over all the character columns in a data table:

# Mark every character column as UTF-8
for (name in names(raw)[sapply(raw, is.character)]) {
  Encoding(raw[[name]]) <- "UTF-8"
}

The names(raw)[sapply(raw, is.character)] expression picks out the names of the character columns, and the loop then marks each of those columns as UTF-8. In short, this applies what you have already found to work, but across all character columns.

The column names themselves (including those of integer, numeric, and other non-character columns) may also need their encoding set. The following should handle that:

# Encoding<- works on the whole vector of names, so no loop is needed
Encoding(colnames(raw)) <- "UTF-8"
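
The loop over character columns can also be wrapped in a small helper that updates the table by reference with data.table's set(). This is only a sketch: mark_utf8 is a hypothetical name, not part of data.table, and the one-row table below is made-up data standing in for raw.

```r
library(data.table)

# Hypothetical helper: mark all character columns of a data.table as UTF-8.
# Like Encoding<-, it only changes the declared encoding, not the bytes.
mark_utf8 <- function(dt) {
  char_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in char_cols) {
    x <- dt[[col]]
    Encoding(x) <- "UTF-8"
    set(dt, j = col, value = x)  # replace the column by reference
  }
  invisible(dt)
}

# Made-up example: a string whose UTF-8 bytes have lost their encoding mark.
title <- "\u041f\u0440\u0438\u0432\u0435\u0442"
Encoding(title) <- "unknown"
demo <- data.table(id = 1L, title = title)

mark_utf8(demo)
print(Encoding(demo$title))
```

Because set() modifies the table in place, there is no need to assign the result back; calling mark_utf8(raw) once after fread would cover every character column.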
Author: user1477388


Updated on June 23, 2022

Comments

  • user1477388, almost 2 years ago

    I have the following file read into a data.table like so:

    raw <- fread("avito_train.tsv", nrows=1000)
    

    Then, if I change the encoding of a particular column and row like this:

    Encoding(raw$title[2]) <- "UTF-8"
    

    It works perfectly.

    But, how can I apply the encoding to all columns, and all rows?

    I checked the fread documentation but there doesn't appear to be any encoding option. Also, I tried Encoding(raw) but that gives me an error (a character vector argument expected).

    Edit: This article gives more detail on handling foreign text in RStudio on Windows: http://quantifyingmemory.blogspot.com/2013/01/r-and-foreign-characters.html