Different decision tree algorithms with comparison of complexity or performance

Decision Tree implementations differ primarily along these axes:

  • the splitting criterion (i.e., how "variance" is calculated; see the short sketch after this list)

  • whether it builds models for regression (continuous variables, e.g., a score) as well as classification (discrete variables, e.g., a class label)

  • technique to eliminate/reduce over-fitting

  • whether it can handle incomplete data
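
To make the first bullet concrete, below is a minimal Python sketch (my own, not code taken from any of the libraries discussed here) of the two impurity measures most commonly used as classification splitting criteria, Gini impurity and information entropy; a candidate split is then scored by the size-weighted impurity of the child nodes it produces.

    from collections import Counter
    import math

    def gini_impurity(labels):
        """Gini impurity: 1 minus the sum of squared class proportions."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        """Information entropy: -sum of p * log2(p) over the class proportions p."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def weighted_impurity(left, right, criterion=gini_impurity):
        """Score a candidate split by the size-weighted impurity of its children."""
        n = len(left) + len(right)
        return (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)

    print(gini_impurity(["a", "a", "b", "b"]))        # 0.5 -- maximally mixed two-class node
    print(weighted_impurity(["a", "a"], ["b", "b"]))  # 0.0 -- a perfect split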


The major Decision Tree implementations are:

  • ID3, or Iterative Dichotomiser 3, was the first of three Decision Tree implementations developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106).

  • CART, or Classification And Regression Trees, is often used as a generic acronym for the term Decision Tree, though it apparently has a more specific meaning. In sum, the CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data, whereas C4.5 includes the intermediate step of constructing rule sets (a toy recursive-partitioning sketch follows this list).

  • C4.5, Quinlan's next iteration. The new features (versus ID3) are: (i) it accepts both continuous and discrete features; (ii) it handles incomplete data points; (iii) it addresses the over-fitting problem with a (very clever) bottom-up technique usually known as "pruning"; and (iv) different weights can be applied to the features that comprise the training data. Of these, the first three are very important--and I would suggest that any DT implementation you choose has all three. The fourth (differential weighting) is much less important.

  • C5.0, the most recent Quinlan iteration. This implementation is covered by a patent and, probably as a result, is rarely implemented (outside of commercial software packages). I have never coded a C5.0 implementation myself (I have never even seen the source code), so I can't offer an informed comparison of C5.0 versus C4.5. I have always been skeptical about the improvements claimed by its inventor (Ross Quinlan)--for instance, he claims it is "several orders of magnitude" faster than C4.5. Other claims are similarly broad ("significantly more memory efficient") and so forth. I'll just point you to studies that report the results of comparing the two techniques, and you can decide for yourself.

  • CHAID (chi-square automatic interaction detector) actually predates the original ID3 implementation by about six years (published in a Ph.D. thesis by Gordon Kass in 1980). I know very little about this technique. The R platform has a package called CHAID which includes excellent documentation.

  • MARS (multivariate adaptive regression splines) is actually a term trademarked by Salford Systems. As a result, MARS clones in libraries not sold by Salford are named something other than MARS--e.g., in R, the relevant function is polymars in the polspline library. Matlab and Statistica also have implementations with MARS functionality.
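
To illustrate the recursive, numerical splitting that the CART bullet above refers to, here is a toy Python sketch that grows a small binary tree. It is my own simplification under several assumptions--continuous features only, Gini impurity as the criterion, no pruning, no missing-data handling--so treat it as a teaching aid rather than a faithful reimplementation of CART or C4.5.

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split(X, y):
        """Exhaustively search every (feature, threshold) pair for the lowest weighted Gini."""
        best = None  # (score, feature_index, threshold)
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                left = [y[i] for i, row in enumerate(X) if row[j] <= t]
                right = [y[i] for i, row in enumerate(X) if row[j] > t]
                if not left or not right:
                    continue
                score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
                if best is None or score < best[0]:
                    best = (score, j, t)
        return best

    def build_tree(X, y, depth=0, max_depth=3):
        """Recursively partition until a node is pure or max_depth is reached."""
        split = None if len(set(y)) == 1 or depth == max_depth else best_split(X, y)
        if split is None:
            return Counter(y).most_common(1)[0][0]  # leaf: majority class
        _, j, t = split
        left = [i for i, row in enumerate(X) if row[j] <= t]
        right = [i for i, row in enumerate(X) if row[j] > t]
        return {"feature": j, "threshold": t,
                "left": build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
                "right": build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth)}

    def predict(node, row):
        while isinstance(node, dict):
            node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
        return node

    X = [[2.0, 1.0], [1.5, 2.0], [3.0, 2.5], [3.5, 3.0]]
    y = ["a", "a", "b", "b"]
    print(predict(build_tree(X, y), [1.8, 1.5]))   # expected: "a"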

I would recommend CART or C4.5 (though again, I have no direct experience with C5.0 or CHAID beyond familiarity with their feature sets).

C4.5 is the Decision Tree flavor implemented in Orange; CART is the flavor in sklearn--both excellent implementations in excellent ML libraries.
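
As a minimal usage sketch of sklearn's CART-style tree (assuming a reasonably recent sklearn release): the criterion parameter selects Gini impurity or entropy, and ccp_alpha turns on cost-complexity pruning, which is sklearn's pruning mechanism rather than C4.5's error-based pruning.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # CART-style tree: pick the splitting criterion and prune via cost-complexity pruning.
    clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))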

C4.5 is a major step beyond ID3--both in terms of range (C4.5 has a far broader use case spectrum because it can handle continuous variables in the training data) and in terms of model quality.

Perhaps the most significant claimed improvement of C5.0 versus C4.5 is support for boosted trees. Ensemble support for DTs--boosted trees and Random Forests--has been included in the DT implementation in Orange; here, ensemble support was added to a C4.5 algorithm. sklearn also features a range of random forest and boosting methods.
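
For reference, a rough sklearn sketch of both ensemble flavors (defaults and exact behavior vary by release, so take this as illustrative rather than a recommendation of particular settings):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # Bagged ensemble of randomized CART trees.
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    # Boosted ensemble: each tree is fit to the errors of the trees before it.
    boosted = GradientBoostingClassifier(n_estimators=200, random_state=0)

    for name, model in [("random forest", forest), ("gradient boosting", boosted)]:
        print(name, cross_val_score(model, X, y, cv=5).mean())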

Comments

  • Y2theZ
    Y2theZ almost 2 years

    I am doing research on data mining and more precisely, decision trees.

    I would like to know if there are multiple algorithms to build a decision tree (or just one?), and which is better, based on criteria such as

    • Performance
    • Complexity
    • Errors in decision making
    • and more.
    • Has QUIT--Anony-Mousse
      Has QUIT--Anony-Mousse about 12 years
      Retagged this as classification, machine-learning instead of the buzzwordish data-mining.
  • Deepak Mathpal
    Deepak Mathpal about 12 years
    @Youssef : no problem. (Please note that my original answer contained a mis-statement regarding sklearn's implementation; I checked it after posting and corrected it just now.)
  • chubbsondubs
    chubbsondubs about 12 years
    CART and ID3, C4.5, C5.0 differ in the way splits are performed. CART builds a binary tree whereas the others do not. That means CART will choose several discrete values to split on. For example, if a feature is {red, green, blue} it could split on {red, green} on the left and {blue} on the right, or any combination of the 3. CART also handles discrete as well as continuous values.
  • chubbsondubs
    chubbsondubs about 12 years
    CART also supports surrogate splits, which split along several features at once. That produces splits that visually can be thought of as lines of any slope, whereas splitting along a single feature produces lines of either vertical or horizontal slope. The idea behind this is that separating clustered data might not be possible without lots of splits when all you have are vertical or horizontal splits. With lines of any slope we can surround clusters in fewer splits, and thus get more robust trees.
  • user1207217
    user1207217 over 10 years
    I'd like to point out that C5.0 (as of Feb 2014 at least) is available in full, but single-threaded, form under the GPL, although that obviously doesn't resolve any patent licensing issues: rulequest.com/download.html
  • Kenston Choi
    Kenston Choi over 10 years
    And an implementation of C5.0 is now available for R
  • Admin
    Admin about 6 years
    @doug I have a question regarding the splitting criteria of decision trees in classification. Are all splitting criteria for classification decision trees impurity-based?
  • Deepak Mathpal
    Deepak Mathpal about 6 years
    @Victor it depends on the choice made by the library author. I am only familiar with two techniques used to calculate variance for "discrete" data: Gini impurity and information entropy. In my experience there is very little practical difference between them. Of course, if you are building a regression tree then you can just use plain variance.
  • wordsforthewise
    wordsforthewise about 3 years
    I'm confused, can you use the R package or code your own C5.0 algo even though it's patented? Or are you going to get sued or be breaking the law? I cannot find the C5.0 patent.