read.csv blank fields to NA

94,715

Solution 1

After reading the csv file, try the following. It will replace the NA values with "".

b[is.na(b)]<-""

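Note that is.na() is also TRUE for NaN in numeric columns, so the statement above replaces NaN values as well. If NaN has to be handled in a separate statement, keep in mind that is.nan() has no data frame method, so build the logical matrix column by column (and do it before the is.na() replacement turns the columns into character):

b[do.call(cbind, lapply(b, is.nan))]<-""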
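As a minimal sketch of what the is.na() replacement does, the toy data frame below (hypothetical values, standing in for the result of read.csv) shows that NaN is flagged by is.na() and that assigning "" turns the affected columns into character:

```r
# Toy data frame standing in for the result of read.csv (illustrative values only).
b <- data.frame(A = c(10, NA, 40), C = c(NaN, 40, 20))

# is.na() on a data frame returns a logical matrix; NaN counts as missing too.
flags <- is.na(b)
print(flags)

# Replacing the flagged cells with "" coerces the affected columns to character.
b[flags] <- ""
print(sapply(b, class))
```

Once the columns are character, arithmetic on them no longer works, which is the trade-off raised in the comments.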

Solution 2

Late edit: After re-reading this after the edits and extended comments, I'm wondering if what was needed (or asked for, at least) was pretty much the exact opposite of what I advise below. The request for this:

Unfortunately, read.csv is converting all the blanks and NA to "NA". I want to read NA and NaN as characters.

... might have been satisfied (somewhat paradoxically) with the arguments: colClasses="character", stringsAsFactors=FALSE, na.strings="."

Then any character value, including an empty string, would come in as itself. Arguing against this is the acceptance of the answer that converts empty character values ("") to R NA_character_ values.

Here's a test example with various results (the output is from R before 4.0.0, where stringsAsFactors defaulted to TRUE):

 sapply(read.csv(text='A\tB\tC\tD\na\t""\tNA\tNaN', sep='\t', na.strings=""), class )
#        A         B         C         D 
# "factor" "logical"  "factor" "numeric" 
 sapply(read.csv(text='A\tB\tC\tD\na\t""\tNA\tNaN', sep='\t', na.strings="x"), class )
#        A         B         C         D 
# "factor" "logical"  "factor" "numeric" 
 sapply(read.csv(text='A\tB\tC\tD\na\t""\tNA\tNaN', sep='\t', na.strings="x", stringsAsFactors=FALSE), class )
#          A           B           C           D 
#"character"   "logical" "character"   "numeric" 

#Almost the expressed desired result
 sapply(read.csv(text='A\tB\tC\tD\na\t""\tNA\tNaN', sep='\t', colClasses="character", stringsAsFactors=FALSE), class )
#          A           B           C           D 
#"character" "character" "character" "character" 
#But ... still get a real R <NA>
read.csv(text='A\tB\tC\tD\na\t""\tNA\tNaN', sep='\t', colClasses="character", stringsAsFactors=FALSE)
#  A B    C   D
#1 a   <NA> NaN
#So add all three
 read.csv(text='A\tB\tC\tD\na\t""\tNA\tNaN', sep='\t', colClasses="character", stringsAsFactors=FALSE,na.strings=".")
#  A B  C   D
#1 a   NA NaN
# Finally all columns are character and no "real" R NA's
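The following is a sketch of the round trip the question seems to ask for, combined with the suggestion from the comments to keep a text copy for output and a numeric copy for calculations. The input text, the object names (txt, num, out) and the choice of "." as a string that never occurs as a whole field are assumptions made for illustration:

```r
# Tab-separated input mirroring the question's layout (values are illustrative).
txt <- "A\tB\tC\tD\n10\t20\tNaN\t\n30\t\t40\t\n40\t30\t20\t\n20\tNA\t20\t"

# Read every field as text. "." is assumed never to occur as a whole field,
# so no field is turned into a real NA.
b <- read.delim(text = txt, colClasses = "character", na.strings = ".",
                check.names = FALSE)

# Numeric copy for calculations; blanks and "NA" become NA, "NaN" becomes NaN.
cols <- c("A", "B", "C")
num <- b
num[cols] <- lapply(b[cols], as.numeric)

# Write the untouched text copy back out and compare it with the input lines.
out <- tempfile(fileext = ".txt")
write.table(b, out, sep = "\t", quote = FALSE, row.names = FALSE)
print(identical(readLines(out), strsplit(txt, "\n")[[1]]))
```

The design choice here is to never let R decide what is missing at read time: the character copy preserves blanks, NA and NaN exactly as written, while the numeric copy is only used for computation.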

The default for na.strings is just "NA", so you perhaps need to add "NaN". In numeric and logical columns, true blanks ("") are set to missing but spaces (" ") are not; in character columns a blank stays an empty string:

 b<- read.csv("a.txt",  skip =0,  
               comment.char = "",check.names = FALSE, quote="",
               na.strings=c("NA","NaN", " ") )

It's not clear that this is the problem since your data example is malformed and does not have commas. That may be the fundamental problem, since read.csv assumes comma separation unless you pass sep="\t". Use read.delim, or read.table with sep="\t", if your data has tab-separation.

b<- read.table("a.txt", sep="\t", skip =0, header = TRUE, 
               comment.char = "",check.names = FALSE, quote="",
               na.strings=c("NA","NaN", " ") )

# worked example for csv text file connection
 bt <- "A,B,C
10,20,NaN
30,,40
40,30,20
,NA,20"

 b<- read.csv(text=bt, sep=",", 
                comment.char = "",check.names = FALSE, quote="\"",
                na.strings=c("NA","NaN", " ") )
 b
#--------------
   A  B  C
1 10 20 NA
2 30 NA 40
3 40 30 20
4 NA NA 20

Example 2:

bt <- "A,B,C,D
10,20,NaN
30,,40
40,30,20
,NA,20"

 b<- read.csv(text=bt, sep=",", 
                comment.char = "",check.names = FALSE, quote="\"",
                na.strings=c("NA","NaN", " ") , colClasses=c(rep("numeric", 3), "logical")) 
 b
#----------------
   A  B  C  D
1 10 20 NA NA
2 30 NA 40 NA
3 40 30 20 NA
4 NA NA 20 NA
> str(b)
'data.frame':   4 obs. of  4 variables:
 $ A: num  10 30 40 NA
 $ B: num  20 NA 30 NA
 $ C: num  NA 40 20 20
 $ D: logi  NA NA NA NA

It's mildly interesting that NA and NaN are not identical for numeric vectors. NaN is returned by operations that have no mathematical meaning (but as noted in the help page you get with ?NaN, the results of operations may depend on the particular OS). Tests of equality are not appropriate for either NaN or NA. There are specific is.* functions for them:

> Inf*0
[1] NaN

> is.nan(c(1,2.2,3,NaN, NA) )
[1] FALSE FALSE FALSE  TRUE FALSE
> is.na(c(1,2.2,3,NaN, NA) )
[1] FALSE FALSE FALSE  TRUE  TRUE  # note the difference
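Because is.na() is TRUE for both values, separating "plain" NA from NaN takes a combination of the two tests. A minimal sketch, using the same vector as above (the names only_na and only_nan are illustrative):

```r
x <- c(1, 2.2, 3, NaN, NA)

# NaN is flagged by both functions, NA only by is.na().
only_nan <- is.nan(x)
only_na  <- is.na(x) & !is.nan(x)

print(which(only_nan))
print(which(only_na))

# Equality comparisons yield NA instead of TRUE or FALSE, so they cannot be used as tests.
print(NaN == NaN)
print(NA == NA)
```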

Solution 3

You can specify colClasses in the read.csv statement to read the column as text.
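As a hedged sketch of this suggestion, colClasses can also be a named vector so that only one column is read as text while the others are still type-converted. The sample text and the choice of column B are assumptions for illustration:

```r
bt <- "A,B,C\n10,20,NaN\n30,,40\n,NA,20"

# Only B is forced to character; A and C are converted automatically.
b <- read.csv(text = bt, colClasses = c(B = "character"))

print(sapply(b, class))

# In the character column the blank stays "", but the literal NA still
# becomes a real NA because na.strings keeps its default of "NA".
print(b$B)
```

So colClasses alone keeps blanks intact in text columns, but it has to be combined with na.strings if the literal string NA should survive as well.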

Solution 4

Use the na.strings argument.
na.strings defines which strings in the data are read as NA. So if you write

read.csv(text=bt, na.strings = "abc")

then every field whose value is "abc" is converted to NA.
Since "abc" is not found in your data, it won't convert any value to NA on that basis (blank fields in numeric and logical columns still become NA regardless).
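A small sketch of the side effect of overriding na.strings: the default "NA" is no longer treated as missing, so a column containing the literal NA is read as text. The sample data is made up for illustration:

```r
bt <- "A,B\n1,abc\n2,x\nNA,y"

# Only "abc" is treated as missing; the literal NA in column A is kept as text.
b <- read.csv(text = bt, na.strings = "abc", stringsAsFactors = FALSE)

print(is.na(b$B))
print(class(b$A))
print(b$A)
```

To keep the default behaviour as well, list both values, for example na.strings = c("NA", "abc").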

Author by

user1631306

Updated on February 01, 2020

Comments

  • user1631306
    user1631306 over 4 years

    I have a tab-delimited text file, named 'a.txt'. The D column is empty.

     A       B       C    D
    10      20     NaN
    30              40
    40      30      20
    20      NA      20
    

    I want to have the dataframe looking and acting exactly as the text file, with a space in the 2nd row and in the 2nd column.

    Unfortunately, read.csv is converting all the blanks and NA to "NA". I want to read NA and NaN as characters.

     b<- read.csv("a.txt",sep="\t", skip =0, header = TRUE, comment.char = "",check.names = FALSE, quote="")
    

    To summarize: I want to replicate the same values in output file without modifying them:

    • If there is a blank in input, the output should be blank.
    • If the input has NA or NaN, then the output should also have NA or NaN.
  • joran
    joran over 10 years
    This is not correct. colClasses doesn't help here. Or maybe more precisely, there is a more appropriate argument to use, I think.
  • beroe
    beroe over 10 years
    Very cryptic... stringsAsFactors=FALSE?
  • TheComeOnMan
    TheComeOnMan over 10 years
    I don't understand. When the OP says he/she wants to have the dataframe exactly as the text file, do they mean that we should get three columns with a blank, an "NA", and a "NaN" as three of the entries as shown in the question?
  • joran
    joran over 10 years
    I agree the OP's question is a bit unclear. My only point was that, as DWin points out, the na.strings argument seems a more likely suspect than colClasses here.
  • Brian Diggs
    Brian Diggs over 10 years
    That will convert all the columns to string variables. The second statement will then not fix the NaNs, because all of the columns of b will be strings after the first one.
  • user1631306
    user1631306 over 10 years
    This will convert the current NA and NaN values to "". I won't be able to get them back. I need to keep NA, NaN and blank values.
  • user1631306
    user1631306 over 10 years
    What I meant is that if I write "b" to a new text file, it should be exactly like the input a.txt, with spaces, not newly introduced NA.
  • user1631306
    user1631306 over 10 years
    Setting colClasses="character" worked, but it changes everything to string, even the numeric part, which restricts the calculation part.
  • TheComeOnMan
    TheComeOnMan over 10 years
    Let's take a step back: why do you need to treat blanks as different from NAs? If you intend to write out b as a csv then maybe you can keep a copy with all columns read in as text, and have a temporary dataset in which you convert the necessary columns to numeric. As you can see from most of the answers and comments, nobody is sure what exactly you're trying to do. It will help if you make your question clearer.
  • IRTFM
    IRTFM over 10 years
    No. It will convert them all to NA.
  • user1631306
    user1631306 over 10 years
    I dug more into my data. The column D is totally empty, with class "logical".
  • IRTFM
    IRTFM over 10 years
    How can something be totally empty with class logical? PLEASE POST YOUR DATA. You have not shown us what the text file looks like and are perhaps showing us what screen output you are seeing... not the same thing.
  • user1631306
    user1631306 over 10 years
    The column D is empty. If you do sapply(b,class), you get A "integer", B "integer", C "numeric", D "logical".
  • IRTFM
    IRTFM over 10 years
    You still are not telling us what your data file looks like. I can make an empty logical column if I use colClasses. See above.
  • user1631306
    user1631306 over 10 years
    How can I post my file as attachment?
  • IRTFM
    IRTFM over 10 years
    Open it up in a text editor and paste it into your question. (NOT into a comment.)
  • user1631306
    user1631306 over 10 years
    Sorry for the confusion. I have the file, with one column empty and some columns with NA and NaN. I am doing some calculations on this file and writing it to new files, adding some columns. But the empty column is not empty there anymore, it's NA. What I want is exactly the same format of file in my output file.
  • TheComeOnMan
    TheComeOnMan over 10 years
    I'm sorry, I still don't completely understand your problem. This shall be my last response to this question: read all the columns as text into b. Make a copy of b and convert the columns to numeric in this copy. Run your calculations on the copy. If you need to update b before you write it out again, then update specific columns (like b$A <- something), and not the whole dataset (i.e. not b <- something), leaving D untouched. Use write.csv (along with the na argument if needed).
  • mtelesha
    mtelesha over 9 years
    In b<- read.csv("a.txt", skip =0, comment.char = "",check.names = FALSE, quote="", na.strings=c("NA","NaN", " ") ), for the blank fields to become NA you just need to have the quotes with no spaces: na.strings = c("NA", "NaN", ""). From what I read he wants to keep NaN separate from NA, which is not clear from his question.