Unseen factor levels when appending new records with unseen string values to a dataframe cause a warning and result in NA

Solution 1

This can be caused by a type mismatch between the two data.frames.

First of all, check the column types (classes). For diagnostic purposes, run this:

new2old <- rbind( alltime, all2008 ) # this gives you a warning
old2new <- rbind( all2008, alltime ) # this should be without warning

cbind(
    alltime = sapply( alltime, class),
    all2008 = sapply( all2008, class),
    new2old = sapply( new2old, class),
    old2new = sapply( old2new, class)
)

I expect there will be a row that looks like this:

            alltime  all2008   new2old  old2new
...         ...      ...       ...      ...
some_column "factor" "numeric" "factor" "character"
...         ...      ...       ...      ...

If so, here is the explanation: rbind does not check that the column types match. If you read the rbind.data.frame code, you can see that the first argument determines the output types. If the column is a factor in the first data.frame, the output column is a factor with levels unique(c(levels(x1),levels(x2))). But when the column in the second data.frame is not a factor, levels(x2) is NULL, so the levels are not extended.

This means that your output data are wrong! There are NAs in place of the true values.
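A minimal sketch of this failure, using made-up data (the column name doctor and its values are assumptions for illustration only). The same column is a factor in the first frame and numeric in the second:

```r
# Hypothetical data: 'doctor' is a factor in the old frame, numeric in the new one.
old <- data.frame(doctor = factor(c("Smith", "Jones")))
new <- data.frame(doctor = c(101, 102))

# rbind() keeps the first frame's type, so the numeric values cannot be
# matched against the existing levels. R raises the "invalid factor level"
# warning here; it is suppressed only to keep the example quiet.
combined <- suppressWarnings(rbind(old, new))

class(combined$doctor)   # "factor"
is.na(combined$doctor)   # FALSE FALSE  TRUE  TRUE
```

The two appended rows are silently replaced by NA, which is why the warning should not be ignored.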

I suppose that:

  1. you created your old data with a different R/RODBC version, so the types were derived by different methods (different settings, perhaps the decimal separator)
  2. there are NULLs or some unusual values in the problematic column, e.g. someone changed the column in the database.

Solution:

Find the wrong column, find out why it is wrong, and fix it. Eliminate the cause, not the symptoms.
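One possible way to find the wrong column programmatically is sketched below. The two small frames are stand-ins for alltime and all2008, and the helper col_class is a hypothetical name, not part of base R. It assumes both frames have the same columns in the same order:

```r
# Hypothetical frames standing in for alltime and all2008.
alltime <- data.frame(patient = factor(c("A", "B")), amount = c(10, 20))
all2008 <- data.frame(patient = c(1, 2), amount = c(30, 40))

# Collapse each column's class vector to one string so that multi-class
# columns (e.g. ordered factors) still compare cleanly.
col_class <- function(df) {
  vapply(df, function(x) paste(class(x), collapse = "/"), character(1))
}

# Names of the columns whose classes differ between the two frames.
mismatched <- names(alltime)[col_class(alltime) != col_class(all2008)]
mismatched   # "patient"
```

Once the column is identified, the real fix is still to trace why the two sources disagree about its type.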

Solution 2

An "easy" way is to simply not have your strings set as factors when importing text data.

Note that the read.{table,csv,...} functions take a stringsAsFactors parameter, which defaulted to TRUE before R 4.0.0 (from 4.0.0 onward the default is FALSE). You can set it to FALSE while you're importing and rbind-ing your data.

If you'd like to set the column to be a factor at the end, you can do that too.

For example:

alltime <- read.table("alltime.txt", stringsAsFactors=FALSE)
all2008 <- read.table("all2008.txt", stringsAsFactors=FALSE)
alltime <- rbind(alltime, all2008)
# If you want the doctor column to be a factor, make it so:
alltime$doctor <- as.factor(alltime$doctor)

Solution 3

1) Create the data frame with stringsAsFactors set to FALSE. This should resolve the factor issue.

2) Afterwards, don't use rbind: it messes up the column names if the data frame is empty. Simply do it this way:

df[nrow(df)+1,] <- c("d","gsgsgd",4)

/

> df <- data.frame(a = character(0), b=character(0), c=numeric(0))

> df[nrow(df)+1,] <- c("d","gsgsgd",4)

Warning messages:
1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
  invalid factor level, NAs generated
2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
  invalid factor level, NAs generated

> df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)

> df[nrow(df)+1,] <- c("d","gsgsgd",4)

> df
  a      b c
1 d gsgsgd 4
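One caveat with the assignment above: c("d","gsgsgd",4) builds a single character vector, so the 4 arrives as the string "4". A small variation, sketched here as a suggestion rather than as part of the original answer, is to pass a list so that each value keeps its own type:

```r
df <- data.frame(a = character(0), b = character(0), c = numeric(0),
                 stringsAsFactors = FALSE)

# list() keeps each element's type, so column c stays numeric.
df[nrow(df) + 1, ] <- list("d", "gsgsgd", 4)

sapply(df, class)
```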

Solution 4

As suggested in the previous answer, read the columns as character and convert them to factors after the rbind. sqlFetch (I assume you are using RODBC) also accepts the stringsAsFactors and as.is arguments to control the conversion of character columns. The allowed values are the same as for read.table, e.g. as.is=TRUE or a column number.

Solution 5

I had the same problem with type mismatches, especially with factors. I had to glue together two otherwise compatible datasets.

My solution is to convert factors in both dataframes to "character". Then it works like a charm :-)

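    # Convert every factor column of a data frame to character.
    convert.factors.to.strings.in.dataframe <- function(dataframe)
    {
        # is.factor() also catches ordered factors, whose class() has two
        # elements and would break a comparison against "factor"
        is.fac <- vapply(dataframe, is.factor, logical(1))
        for (colname in names(dataframe)[is.fac])
        {
            dataframe[[colname]] <- as.character(dataframe[[colname]])
        }
        return (dataframe)
    }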

If you want to see the types in your two dataframes run (change var names):

    cbind("orig"=sapply(allSurveyData, class), 
          "merge" = sapply(curSurveyDataMerge, class),
          "eq"=sapply(allSurveyData, class) == sapply(curSurveyDataMerge, class)
    )
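A short end-to-end sketch of this approach, on made-up data: convert the factor columns of both frames to character, bind them, and optionally turn the column back into a factor. The frames and the helper name factors_to_character are assumptions for illustration:

```r
# Hypothetical frames with the same column but different factor levels.
old <- data.frame(doctor = factor(c("Smith", "Jones")), fee = c(10, 20))
new <- data.frame(doctor = factor(c("Brown", "Smith")), fee = c(30, 40))

# Replace every factor column with its character representation.
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], as.character)
  df
}

combined <- rbind(factors_to_character(old), factors_to_character(new))
combined$doctor <- factor(combined$doctor)  # optional: back to a factor

levels(combined$doctor)
anyNA(combined$doctor)
```

Rebuilding the factor after the bind means its levels are computed from the full, combined set of values.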
Author by

Farrel

Not a programmer but not afraid to use simple programming / macroing such as manipulate and analyze data in R or write 3-10 line autohotkey commands.

Updated on March 22, 2020

Comments

  • Farrel
    Farrel over 4 years

    I have a dataframe (14.5K rows by 15 columns) containing billing data from 2001 to 2007.

    I append new 2008 data to it with: alltime <- rbind(alltime,all2008)

    Unfortunately that generates a warning:

    > Warning message:
    In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
      invalid factor level, NAs generated
    

    My guess is that there are some new patients whose names were not in the previous dataframe and therefore it would not know what level to give those. Similarly new unseen names in the 'referring doctor' column.

    What's the solution?

  • Farrel
    Farrel over 14 years
    Yip. You are correct. In one data frame a column's class was factor and in the other it was numeric. That messed things up badly. I converted the numeric to a factor and all was OK. Thank you for your guidance. There were some other discrepancies as well. For instance, a factor-character discrepancy did not mess things up.
  • Marek
    Marek over 14 years
    You are right about factor-character; somewhere in the code I found that the levels for this combination will be unique(c(levels(x1),x2)). One thing: the combination factor-character leads to a factor, while character-factor leads to character. So it's better when the types match.
  • A5C1D2H2I1M1N2O1R2T1
    A5C1D2H2I1M1N2O1R2T1 almost 11 years
    mydf[sapply(mydf, is.factor)] <- lapply(mydf[sapply(mydf, is.factor)], as.character) seems like a simpler approach.
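
The factor/character asymmetry described in the comments can be checked with a small sketch. The frames below are made up for illustration, and the result may depend on the R version in use:

```r
fac <- data.frame(x = factor(c("a", "b")))
chr <- data.frame(x = c("c", "d"), stringsAsFactors = FALSE)

fac_first <- rbind(fac, chr)   # factor column first
chr_first <- rbind(chr, fac)   # character column first

class(fac_first$x)
class(chr_first$x)
anyNA(fac_first$x)
```

In both orders no values are lost, but the resulting column type follows the first data frame, which is one more reason to align the types before binding.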