party package for decision tree in R does not support character data type?
The scale of the response variable and all explanatory variables is important for two aspects of the CTree algorithm: (1) The association tests that are carried out in each node to determine which variable should be used for splitting. (2) The selection of the best split point in a given explanatory variable.
The association tests always capture "correlation" or "lack of independence" between the response and each explanatory variable. And the type of correlation measure depends on the scale of the variables involved (see this post on Cross Validated: https://stats.stackexchange.com/questions/144143). The variables can be numeric (or integer), unordered categorical (i.e., factor), ordered categorical, or censored (Surv objects). Selecting an appropriate variable type for a given variable in a data frame is crucial to obtain meaningful results from the tree.
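As a rough illustration of choosing a variable type per column, here is a minimal base-R sketch. The data frame, its values, and the reading of GENDER as unordered and GRADE as ordered are assumptions made up for this example; they are not taken from the asker's data.

```r
# Hypothetical data frame: column names echo the question, values are invented.
df <- data.frame(
  AGE    = c(34L, 51L, 27L, 63L),
  GENDER = c(1L, 2L, 2L, 1L),
  GRADE  = c(1L, 3L, 2L, 3L)
)

# Integer codes with no natural order: declare an unordered factor.
df$GENDER <- factor(df$GENDER)

# Integer codes with a natural order: declare an ordered factor.
df$GRADE <- factor(df$GRADE, levels = 1:3, ordered = TRUE)

# AGE stays integer and is treated as a numeric variable.
# Show the leading class of each column.
sapply(df, function(x) class(x)[1])
```

Whether a coded column should be numeric, unordered, or ordered is a modelling decision about the data, so the conversions above only show the mechanics.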
Similarly, the determination of the possible binary splits in a given variable depends crucially on the scale. And character is not a scale for which there is a standard way to assess correlation or splits.
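In practice this usually means declaring character columns as factors before fitting. The following is a minimal sketch, assuming the built-in iris data with Species deliberately stored as character to mimic the problem; the ctree call is guarded so the conversion runs even where party is not installed.

```r
# Assumed example data: iris, with Species turned into character on purpose.
df <- iris
df$Species <- as.character(df$Species)

# Find the character columns and convert each one to an unordered factor.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# Fit the tree only if the party package is available.
if (requireNamespace("party", quietly = TRUE)) {
  library("party")
  r2 <- ctree(Sepal.Length ~ ., data = df)
  plot(r2)
}
```

Converting to a factor treats each distinct string as a category, which is only sensible when the column really is categorical rather than free text.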
user121
Updated on September 16, 2022
Comments
-
user121 over 1 year
If one of the columns in my data frame is of data type character, I get the error below.
> library("party")
> r2 <- ctree(Sepal.Length ~ .,data=df)
Error in trafo(data = data, numeric_trafo = numeric_trafo, factor_trafo = factor_trafo, :
  data class character is not supported
> plot(r2)
> sapply(df,class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
    "factor"     "factor"     "factor"  "character"     "factor"
Sometimes, I also get this error
Error in match.arg(type) :
  'arg' should be one of “response”, “node”, “prob”
>
> sapply(df,class)
      AGE    GENDER      STAY     GRADE      XYNS    CHARGE
"integer" "integer"  "factor" "integer" "integer" "integer"
How do I get around these?
-
MrFlick about 9 years
Convert your character values to factors: df$Petal.Width <- factor(df$Petal.Width). You can't really model arbitrary string values. You need to at least assume they are a discrete/categorical variable.
-
MrFlick about 9 years
That's a methodological problem. If you have questions about the statistical methods that these packages use and why they scale poorly with the number of factors, you'd have better luck on Cross Validated, where such statistical discussions are on-topic.
-