party package for decision tree in R does not support character data type?


The scale of the response variable and all explanatory variables is important for two aspects of the CTree algorithm: (1) The association tests that are carried out in each node to determine which variable should be used for splitting. (2) The selection of the best split point in a given explanatory variable.

The association tests always capture "correlation" or "lack of independence" between the response and each explanatory variable. The type of correlation measure depends on the scale of the variables involved (see this post on Cross Validated: https://stats.stackexchange.com/questions/144143). The variables can be numeric (or integer), unordered categorical (i.e., factor), ordered categorical, or censored (Surv objects). Selecting an appropriate type for each variable in a data frame is crucial to obtaining meaningful results from the tree.
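As a minimal sketch of how these scales can be declared in base R before fitting a tree, consider the following. The data frame `d` and its column names are hypothetical and used only for illustration; censored responses (Surv objects) are left out here.

```r
# Hypothetical toy data; the column names are illustrative only.
d <- data.frame(
  height = c(1.2, 3.4, 2.2, 5.1),
  colour = c("red", "blue", "red", "green"),
  grade  = c("low", "high", "mid", "low"),
  stringsAsFactors = FALSE
)

# Unordered categorical: a plain factor.
d$colour <- factor(d$colour)

# Ordered categorical: state the level order explicitly.
d$grade <- factor(d$grade, levels = c("low", "mid", "high"), ordered = TRUE)

# Inspect the primary class of each column.
sapply(d, function(x) class(x)[1])
```

Declaring the level order explicitly matters for an ordered factor, because otherwise R sorts the levels alphabetically.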

Similarly, the set of possible binary splits in a given variable depends crucially on its scale. Character is not a scale for which there is a standard way to assess correlation or to determine splits.
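One way to act on this, sketched under the assumption that each character column really is a categorical variable, is to convert all character columns to factors before calling ctree(). The data below is a copy of the built-in iris data with one column deliberately stored as character to mimic the situation in the question; it is not the asker's actual data frame.

```r
# Copy of iris with one column stored as character (illustrative assumption).
df <- iris
df$Species <- as.character(df$Species)

# Find the character columns and convert each one to a factor.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# Fit the tree only if the party package is installed.
if (requireNamespace("party", quietly = TRUE)) {
  library("party")
  fit <- ctree(Sepal.Length ~ ., data = df)
  print(fit)
}
```

If a character column holds free text or unique identifiers rather than categories, converting it to a factor is unlikely to be meaningful, and dropping the column may be the better choice.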



Author: user121

Updated on September 16, 2022

Comments

  • user121
    user121 over 1 year

    If one of the columns in my data frame is of data type character, I get the error below.

    > library("party")
    > r2 <- ctree(Sepal.Length ~ .,data=df)
    Error in trafo(data = data, numeric_trafo = numeric_trafo, factor_trafo = factor_trafo,  : 
      data class character is not supported
    > plot(r2)    
    > sapply(df,class)
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
        "factor"     "factor"     "factor"  "character"     "factor" 
    

    Sometimes, I also get this error

     Error in match.arg(type) : 
      'arg' should be one of “response”, “node”, “prob”
    > sapply(df,class)
              AGE        GENDER          STAY      GRADE          XYNS        CHARGE 
        "integer"     "integer"      "factor"     "integer"     "integer"     "integer" 
    

    How do I get around these?

    • MrFlick
      MrFlick about 9 years
      Convert your character values to factors. df$Petal.Width <- factor(df$Petal.Width). You can't really model arbitrary string values. You need to at least assume they are a discrete/categorical variable.
    • MrFlick
      MrFlick about 9 years
      That's a methodological problem. If you have questions about the statistical methods that these packages use and why they scale poorly with the number of factors, you'd have better luck on Cross Validated, where such statistical discussions are on-topic.