Importing data with special characters in R

18,595

Your problem is an encoding issue, and there are two aspects to it. First, the encoding that Notepad++ uses when saving may not be the encoding you expect in the text file. Second, read.csv() may be reading the file in R under a different encoding. The second is especially likely here: using Notepad++ suggests you are on Windows, where you may be unable to have UTF-8 as your system locale for R.
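To make the mismatch concrete, here is a minimal sketch in base R of what happens when the same bytes are interpreted under two different encodings. It assumes the file was saved as UTF-8 and then read as Latin-1, which is one plausible explanation for the symptom and not a confirmed diagnosis; the variable names are only for illustration.

```r
# "æ" saved as UTF-8 is stored as the two bytes C3 A6.
utf8_bytes <- as.raw(c(0xC3, 0xA6))

# Interpreted as UTF-8, the two bytes form a single character.
as_utf8 <- rawToChar(utf8_bytes)
Encoding(as_utf8) <- "UTF-8"

# Interpreted as Latin-1, each byte becomes its own character,
# which is the kind of garbling seen after a mismatched import.
as_latin1 <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

nchar(as_utf8)    # 1
nchar(as_latin1)  # 2
```

If the garbled text in your data has twice as many characters as expected wherever a special letter should be, this kind of mismatch is a likely cause.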

So taking each issue in turn:

  1. Getting Notepad++ to save your file in a specific encoding. You can set the encoding for the file from Notepad++'s Encoding menu before saving. I always use UTF-8, but since your texts are Danish, Latin-1 should work too.

    To verify the encoding of your texts, you may wish to use the file utility supplied with RTools. This will tell you something about the probable encoding of your file from the command line, although it is not perfect. (OS X and Linux users already have this without needing to install additional utilities.)

  2. Setting encoding when importing the .csv file into R. When you import the file using read.csv(), specify encoding = "UTF-8" or encoding = "latin1" (R's name for Latin-1). You might also want to check what your system encoding is, and match that. You can do this with Sys.getlocale() (and set it with Sys.setlocale()). On my system, for instance:

    > Sys.getlocale()
    [1] "en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8"
    

    You could of course set this to Windows-1252, but you might then have trouble with portability when using it on other platforms. UTF-8 is the best solution to this.
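The two steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the temporary file and its contents (the columns id and word) are made-up sample data, written byte by byte so that the file is Latin-1 regardless of your locale. The `encoding` argument marks the imported strings as Latin-1 without converting them; `fileEncoding` is the related read.csv() argument that re-encodes the input while reading.

```r
# Hypothetical sample file: "æ" is the single byte E6 in Latin-1.
path <- tempfile(fileext = ".csv")
bytes <- c(charToRaw("id,word\n1,"), as.raw(0xE6), charToRaw("bler\n"))
writeBin(bytes, path)

# Step 1: a rough check of the saved encoding.
# A Latin-1 file containing special characters is usually not valid UTF-8.
validUTF8(rawToChar(bytes))  # FALSE

# Step 2: tell read.csv() which encoding the strings are in.
dat <- read.csv(path, encoding = "latin1", stringsAsFactors = FALSE)

# Convert to UTF-8 so the strings behave the same on every platform.
word_utf8 <- enc2utf8(dat$word)
```

For a file that Notepad++ saved as UTF-8, the same call with encoding = "UTF-8" would be the counterpart.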

Author: Mpizos Dimitris

Updated on June 05, 2022

Comments

  • Mpizos Dimitris
    Mpizos Dimitris almost 2 years

    The following picture shows how the data looks before I import it into R (in Notepad++) and after importing.

    [Image: the data in Notepad++ on the left, the data after import into R on the right]

    I use the following command to import it in R:

    Data <- read.csv('data.csv',stringsAsFactors = FALSE,header = TRUE,quote = "")
    

    It can be seen that special characters such as æ are replaced with something like Ã¦ (line 19 on the left, line 18 on the right). Is there a way to import the CSV file as it is, using R?

  • xilliam
    xilliam about 2 years
    This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review