In R, read_csv() parsing failures: Converting integers into NA's


Solution 1

These numbers are too big to fit into R's integer type:

.Machine$integer.max
[1] 2147483647
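
Any value larger than that overflows R's integer range, which is what readr reports as a parsing failure. As a quick illustration, coercing one of the failing values from the question in base R (approximate console output shown):

as.integer(2946793000)
[1] NA
Warning message:
NAs introduced by coercion to integer range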

Solution 2

As others have mentioned, read_csv uses the first 1000 rows to guess your column types. It sounds like the first 1000 rows of these columns contain integers, so read_csv parses the whole column as integer, and then hits values later in the file that are too big for the integer class to hold. You would run into a similar problem if row 1001 contained any non-integer value. Some examples:

library(readr)

#build text data - read_csv uses the first 1000 rows to guess column types
csv_ok <- "column_header"
for(t in 1:1000){
  csv_ok <- paste(csv_ok, t, sep = "\n")
}

#add a "problematic" double to row 1001:
csv_w_dbl <- paste(csv_ok, 1000.25, sep = "\n")

#add a "problematic" too-big integer to row 1001:
csv_w_bigint <- paste(csv_ok, .Machine$integer.max + 1, sep = "\n")

#these produce parsing failures (NAs) unless the column type is specified
read_csv(csv_w_dbl)
read_csv(csv_w_bigint)

#these parse cleanly
read_csv(csv_ok)                        #all integers
read_csv(csv_w_dbl, col_types = "d")    #specify double as the column type
read_csv(csv_w_bigint, col_types = "d") #double also handles big integers
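
Not part of the original answer, but possibly useful: recent versions of read_csv also accept a guess_max argument, which controls how many rows are scanned before the column type is guessed. Letting it scan past the first problematic row can avoid the bad integer guess altogether:

#let readr scan all 1001 rows before guessing, so it picks double on its own
read_csv(csv_w_bigint, guess_max = 1001)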

Comments

  • Mars Chen (almost 2 years ago)

    I just ran into a problem when using read_csv() and read.csv() to import CSV files into R. My file contains 1.7 million rows and 78 variables, most of which are integers. When I use read_csv(), some cells that contain integers are converted into NA's and I get the following warnings. Those cells hold integers too, so I do not understand why the parsing fails.

    10487 parsing failures.
    row  col expected   actual
    3507 X27 an integer 2946793000
    3507 X46 an integer 5246675000
    3508 X8  an integer 11599000000
    3508 X23 an integer 2185000000
    3508 X26 an integer 2185000000
    

    When I access df[3507, 27], it just shows NA. Also, X27, X46 and X8 are all integer columns, so I do not understand why the function works for most rows but fails for these particular ones.

    However, when I use read.csv(), it works and returns 2946793000. Can someone tell me why these two functions behave differently here?

    • Andrew Brēza (almost 7 years ago)
      read_csv looks at the first rows of your data and guesses the data type of each column. There are times when it guesses incorrectly, especially with massive datasets. For example, I had a dataset with a gender column that readr thought was boolean (all of the first rows were "F"). Try reading the head of the file up to the row where the first error occurs and see if there's some string formatting. You could also force it to read the offending columns as characters and then convert them to numeric (a sketch of this approach follows these comments).
  • joran (almost 7 years ago)
    ...so they could read them in as numeric (rather than integer) by specifying the column classes, or they could read them as character and use one of the big-integer packages from CRAN (I can't recall their names, but I think there are several; one possibility is sketched after these comments).
  • Rui Barradas (almost 7 years ago)
    For big integers see link
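
Following up on Andrew Brēza's comment, a minimal sketch of reading the offending columns as character and converting them afterwards. The file name "my_file.csv" is a placeholder and the column names are taken from the warning in the question; adjust both to your data.

library(readr)

#read the columns that overflow the integer type as character
df <- read_csv(
  "my_file.csv",
  col_types = cols(
    X8  = col_character(),
    X27 = col_character(),
    X46 = col_character(),
    .default = col_guess()
  )
)

#convert them back to numeric; doubles hold whole numbers of this size exactly
df$X8  <- as.numeric(df$X8)
df$X27 <- as.numeric(df$X27)
df$X46 <- as.numeric(df$X46)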
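
joran's comment doesn't name a specific package; one big-integer option on CRAN is bit64 (my example, not the commenter's), which stores 64-bit integers. A sketch, again reading the column as character first:

library(readr)
library(bit64)

df <- read_csv("my_file.csv", col_types = cols(X27 = col_character()))

#integer64 can represent whole numbers up to roughly 9.2e18
df$X27 <- as.integer64(df$X27)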