R: read.table interprets \r as new line


Since R runs on multiple OSes, and different OSes use different line endings, it can be difficult to control exactly which line-ending convention gets used in a way that works across all of them. The easiest fix here would be to wrap the tweet column in quotes: embedded linefeeds are allowed inside quoted fields. Otherwise you can manipulate the bytes with regular expressions and the like. It all depends on whether you intend to preserve the embedded newlines or not.
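To illustrate the quoting point, here is a minimal sketch (the data and temp file are made up) showing that a quoted field carries an embedded linefeed through a write/read round trip:

```r
# A field containing an embedded newline
d <- data.frame(id = 1, tweet = "line one\nline two", stringsAsFactors = FALSE)

# With quoting enabled (the default), write.table protects the newline
tf <- tempfile(fileext = ".tsv")
write.table(d, tf, sep = "\t", row.names = FALSE)

# read.table keeps the quoted field intact, newline and all
d2 <- read.table(tf, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
d2$tweet
```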

Here's a dump of your sample file

ctx <- "488397464357974017\t2168124983\t20140713190004\t24.584653\t46.540044\tالرياض, المملكة العربية السعودية\tأتوقع البطولة أرجنتينية ، من بداية البطولة كل الظروف والعوامل تريد الأرجنتين ..\r\n488397464438071297\t403662206\t20140713190004\t19.320504\t-76.426316\t\t@Toneishe_Lovee @purifiedhoran \r(:\r\n488397464442265600\t2510306157\t20140713190004\t36.517741\t-5.317234\tGaucín, Málaga\t#AlemaniaArgentina Vamos #GER\r\n488397464584871936\t539048975\t20140713190004\t42.550627\t9.440454\tLucciana, Haute-Corse\ton a tous le seum contre Pauline 4/5 mais dsl zayn l'a pas unfollow , ça fait 5 mois que vous sortez ça \U0001f615\r\n488397463997276160\t194876164\t20140713190004\t37.724866\t-120.93389\tRiverbank, CA\t@AlexxisAvila Shhh! Lol\r\n"

We can split it up into a character matrix with

mm <- do.call(rbind, strsplit(strsplit(ctx, "\r\n")[[1]], "\t"))

Then we can convert to a data.frame

dd <- data.frame(mm, stringsAsFactors = FALSE)
dd[, c(1, 2, 4, 5)] <- lapply(dd[, c(1, 2, 4, 5)], as.numeric)

Then, if you write this out to a file (and allow the character values to be quoted)

write.table(dd, "tweets2.csv", row.names=F, col.names=F, sep="\t")

You can read it back in without problems with

dd2 <- read.table("tweets2.csv", sep = "\t", comment.char = "",
    col.names = c("id", "user", "date", "latitude", 
        "longitude", "location", "tweet"),
    colClasses = c("character", "numeric", "character",
        "double", "double", "character",
         "character"),
    encoding = "utf8")

So if the file came to you with the quotes around the last column, it would be much easier to import it.

And if you want to read the file in as one big character string as I did to create ctx, you can do that with

ctx <- readChar(fileName, file.info(fileName)$size)

which may be helpful if you want to do other manipulations first. For example, you might want to replace the \r values that are not followed by \n. You could do that with

gsub("\\r(?!\\n)","[nl]", ctx, perl=T)

and I think you can read that directly into read.table

read.table(text=gsub("\\r(?!\\n)","[nl]", ctx, perl=T), sep="\t")

(I'm testing on a Mac, which uses different line endings, so it doesn't work here, but it might on Windows.)
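A tiny self-contained check of that lookahead (the sample string here is invented): only the bare \r is replaced, while the \r\n pair is left alone:

```r
s <- "a\rb\r\nc"

# (?!\n) is a negative lookahead: match \r only when it is NOT followed by \n
gsub("\\r(?!\\n)", "[nl]", s, perl = TRUE)
# "a[nl]b\r\nc"
```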

Sebastian Höffner

Updated on June 04, 2022

Comments

  • Sebastian Höffner almost 2 years

    Summary

I am trying to read Twitter data with read.table. But some lines are terminated only by \r, which causes problems, so I'd like to skip those lines.

    Data format

    The data is in a tab-separated csv and of the following form:

    id \t userid \t date \t latitude \t longitude \t location \t tweet \r\n
    

    (Note: I added spaces for readability, and \t, \r and \n are as expected TAB, CR and LF)

    Some examples are:

    488397447040086017  1220042672  20140713190000  -22.923528  -43.238966  Rio de Janeiro, Rio de Janeiro  os moradores da minha rua devem me odiar
    488397446960381952  1960969112  20140713190000  60.998575   68.998468   Ханты-Мансийск, Ханты-Мансийск  Вот интересом, мне одной пофиг на футбол?
    488397446997762049  1449959828  20140713190000  32.777693   -97.307257  Fort Worth, TX  Buena suerte Argentina
    

    Reading in data

There were some problems (# as comment character, ' as quote character, encoding, ...), which I have partly solved already:

    readTweets <- function(fileName) {
      # read tweets from file
      tweets <- read.table(fileName, sep = "\t", quote = "", comment.char = "",
                           col.names = c("id", "user", "date", "latitude", 
                                         "longitude", "location", "tweet"),
                           colClasses = c("numeric", "numeric", "character",
                                          "double", "double", "character",
                                          "character"), encoding = "utf8")
    
      tweets
    }
    

    As you can see, I also added the colClasses parameter to give the fields some useful types. (I also changed the date column to POSIXct, but I have to do the formatting myself. Side quest: is there a way to apply functions to imported columns automatically?)
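    Regarding the side quest: one way to have read.table convert a column on import is to register a coercion via setAs and then name the pseudo-class in colClasses. A sketch, assuming the YYYYmmddHHMMSS date format from the sample data (the class name tweetDate is made up):

    ```r
    library(methods)

    # Register a pseudo-class whose coercion parses the timestamp format
    setClass("tweetDate")
    setAs("character", "tweetDate",
          function(from) as.POSIXct(from, format = "%Y%m%d%H%M%S", tz = "UTC"))

    # colClasses can then request the conversion directly, e.g.:
    # read.table(fileName, sep = "\t",
    #            colClasses = c("numeric", "numeric", "tweetDate",
    #                           "double", "double", "character", "character"))
    ```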

    The error

    This worked on a small test set like the one given above. However, when I tried to load a bigger dataset, I got the following error:

    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
      scan() expected 'a real', got '(:'
    

    A little bit of searching through the file shows the following entry:

    488397464438071297  403662206   20140713190004  19.320504   -76.426316      @Toneishe_Lovee @purifiedhoran 
    (:
    

    This looks like there is just a newline in the wrong place! That's a huge problem: how can I tell whether a line starts a new record or not? And why is it that way? I decided to have a closer look and found out, using the "Show all characters" option in Notepad++, what the entry really looks like (spaces added again; now you see why I posted the format so exactly):

    488397464438071297 \t 403662206 \t 20140713190004 \t 19.320504 \t -76.426316 \t @Toneishe_Lovee @purifiedhoran \r (: \r\n
    

    Note the CR in front of the smiley.

    The simple solution

    I somehow "solved" this problem by reading in the first column as characters, filling up the rows and setting empty fields to NA and then using complete.cases:

    readTweets <- function(fileName) {
      # read tweets from file
      tweets <- read.table(fileName, sep = "\t", quote = "", comment.char = "",
                           col.names = c("id", "user", "date", "latitude", 
                                         "longitude", "location", "tweet"),
                           colClasses = c("character", "numeric", "character",
                                          "double", "double", "character",
                                          "character"), encoding = "utf8",
                           fill = TRUE, na.strings = "")
      # remove incorrect rows and convert id to numeric
      tweets      <- tweets[complete.cases(tweets[,c("id", "user", "date")]),]
      tweets$id   <- as.numeric(tweets$id)
      rownames(tweets) <- NULL
      tweets
    }
    

    I still wonder whether it's even possible to enter CRs on Twitter, or whether the person who gave me the csv files just messed up the format.

    The professional solution

    Is it possible to skip non-full lines (without processing all the data again) so that I can use the colClass numeric for the ID directly?

    OS/File/etc.

    As requested in the comments here some more technical information:

    • $platform: "x86_64-w64-mingw32"
    • $system: "x86_64, mingw32"
    • $svn rev: "66115"
    • $version.string: "R version 3.1.1 (2014-07-10)"
    • OS: Windows 8 (I didn't expect R to be running with my mingw installation)

    Example file:

    • Download, 788 B, csv (tab separated), contains 5 tweets including the erroneous one (the second)
    • File format is UTF-8 without BOM, Notepad++ identifies the line endings as Dos\Windows