Read a UTF-8 text file with BOM

Solution 1

Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The help page ?file says:

As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).
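A minimal sketch of this approach, using only base R. The temporary file and its two-column contents are made up for illustration; only the `fileEncoding = "UTF-8-BOM"` argument comes from the answer above.

```r
# Write a small CSV that starts with the UTF-8 byte order mark (EF BB BF).
# The file contents are an assumption made for this example.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("reg_date,value\n2014-01-01,1\n")),
  tmp
)

# fileEncoding = "UTF-8-BOM" asks R to drop the BOM, if present, while reading.
frame_pers <- read.csv(tmp, fileEncoding = "UTF-8-BOM")

# The first column name should no longer carry the stray BOM characters.
print(names(frame_pers)[1])
```

The same encoding name should also work with other readers that go through a file connection, for example `readLines(file(tmp, encoding = "UTF-8-BOM"))`.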

Solution 2

This was handled between versions 1.9.6 and 1.9.8 with this commit; update your data.table installation to fix this.

Once done, you can just use fread:

fread("file_name.csv")
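Before relying on this, it may help to confirm that the installed data.table is recent enough. The sketch below assumes the fix is present from version 1.9.8 onward, as the answer states; the temporary file and the variable name `bom_fix_available` are made up for illustration.

```r
# Write a small CSV that starts with the UTF-8 byte order mark (EF BB BF).
# The file contents are an assumption made for this example.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("reg_date,value\n2014-01-01,1\n")),
  tmp
)

# TRUE only if data.table is installed and at least version 1.9.8.
bom_fix_available <- requireNamespace("data.table", quietly = TRUE) &&
  packageVersion("data.table") >= "1.9.8"

if (bom_fix_available) {
  frame_pers <- data.table::fread(tmp)
  # With the fix in place, the first name should be free of BOM characters.
  print(names(frame_pers)[1])
} else {
  message("Install or update data.table (1.9.8 or later) before using fread here.")
}
```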
Author: djhurio

Data scientist, statistician

Updated on July 02, 2022

Comments

  • djhurio
    djhurio over 1 year

    I have a text file with Byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the Byte order mark?

    The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:

    > names(frame_pers)[1]
    [1] "ļ»æreg_date"
    

    The same happens with the read.csv function.

    For now I have written a function that removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.

    remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))
    
    > names(frame_pers)[1]
    [1] "ļ»æreg_date"
    > remove.BOM(frame_pers)
    > names(frame_pers)[1]
    [1] "reg_date"
    

    I am using the native encoding for the R session:

    > options("encoding" = "")
    > options("encoding")
    $encoding
    [1] ""
    
  • EngrStudent
    EngrStudent over 6 years
    This is also not working for me. My raw data looks like "31.1" when copied and pasted from Notepad++, but in R fread splits it into 2 columns, and with read.csv I get the following as a prefix: "" (using as.is = TRUE). I used AutoHotkey and convert2txt to get OCR output from a GUI display, and I wrote it to a file. The problem is that "31.2" becomes " .331".
  • EngrStudent
    EngrStudent over 6 years
    I'm using 1.10.4. I ended up using "read_csv" and setting "col_types = "c" ", then trimming the first character before converting to numeric. It was a kludge.
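When a reader still leaves stray characters, as in the comments above, one way to narrow things down is to check whether the file really starts with a UTF-8 BOM. The following is a rough sketch in base R; the helper name `has_utf8_bom` and the sample files are hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE if the file begins with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  # Read at most the first three bytes of the file as raw values.
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Two sample files, made up for this example: one with a BOM, one without.
with_bom <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("reg_date\n2014-01-01\n")),
  with_bom
)

without_bom <- tempfile(fileext = ".csv")
writeBin(charToRaw("reg_date\n2014-01-01\n"), without_bom)

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```

If the check returns FALSE, the leading characters are probably not a UTF-8 BOM, and the cause may lie elsewhere, such as a different file encoding.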