Looking for a C++ implementation of the C4.5 algorithm


Solution 1

I may have found a possible C++ "implementation" of C5.0 (See5.0), but I haven't been able to dig into the source code enough to determine if it really works as advertised.

To reiterate my original concerns, the author of the port states the following about the C5.0 algorithm:

Another drawback with See5Sam [C5.0] is the impossibility to have more than one application tree at the same time. An application is read from files each time the executable is run and is stored in global variables here and there.

I will update my answer as soon as I get some time to look into the source code.

Update

It's looking pretty good. Here is the C++ interface:

class CMee5
{
  public:

    /**
      Create a See 5 engine from tree/rules files.
      \param pcFileStem The stem of the See 5 file system. The engine
             initialisation will look for the following files:
              - pcFileStem.names Vanilla See 5 names file (mandatory)
              - pcFileStem.tree or pcFileStem.rules Vanilla See 5 tree or rules
                file (mandatory)
              - pcFileStem.costs Vanilla See 5 costs file (mandatory)
    */
    inline CMee5(const char* pcFileStem, bool bUseRules);

    /**
      Release allocated memory for this engine.
    */
    inline ~CMee5();

    /**
      General classification routine accepting a data record.
    */
    inline unsigned int classifyDataRec(DataRec Case, float* pOutConfidence);

    /**
      Show rules that were used to classify the last case.
      Classify() will have set RulesUsed[] to
      number of active rules for trial 0,
      first active rule, second active rule, ..., last active rule,
      number of active rules for trial 1,
      first active rule, second active rule, ..., last active rule,
      and so on.
    */
    inline void showRules(int Spaces);

    /**
      Open file with given extension for read/write with the actual file stem.
    */
    inline FILE* GetFile(String Extension, String RW);

    /**
      Read a raw case from file Df.

      For each attribute, read the attribute value from the file.
      If it is a discrete valued attribute, find the associated no.
      of this attribute value (if the value is unknown this is 0).

      Returns the array of attribute values.
    */
    inline DataRec GetDataRec(FILE *Df, Boolean Train);
    inline DataRec GetDataRecFromVec(float* pfVals, Boolean Train);
    inline float TranslateStringField(int Att, const char* Name);

    inline void Error(int ErrNo, String S1, String S2);

    inline int getMaxClass() const;
    inline int getClassAtt() const;
    inline int getLabelAtt() const;
    inline int getCWtAtt() const;
    inline unsigned int getMaxAtt() const;
    inline const char* getClassName(int nClassNo) const;
    inline char* getIgnoredVals();

    inline void FreeLastCase(void* DVec);
};

I would say that this is the best alternative I've found so far.
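The showRules() comment above describes a flat layout for RulesUsed[]: a count for trial 0, that many rule numbers, then a count for trial 1, and so on. Here is a minimal sketch of how such a buffer could be split per trial. splitRulesUsed is a hypothetical helper of my own, not part of CMee5, and it assumes the buffer is handed over as a std::vector<int> holding exactly the entries for all trials.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CMee5): splits a flat buffer laid out as
//   count_0, rule, rule, ..., count_1, rule, rule, ...
// into one list of active rule numbers per trial.
std::vector<std::vector<int>> splitRulesUsed(const std::vector<int>& flat) {
    std::vector<std::vector<int>> perTrial;
    std::size_t i = 0;
    while (i < flat.size()) {
        int count = flat[i++];          // number of active rules for this trial
        if (count < 0) break;           // malformed input: stop rather than guess
        std::vector<int> rules;
        for (int k = 0; k < count && i < flat.size(); ++k) {
            rules.push_back(flat[i++]); // k-th active rule of this trial
        }
        perTrial.push_back(rules);
    }
    return perTrial;
}
```

For example, the buffer {2, 5, 7, 1, 3} splits into rules 5 and 7 for trial 0 and rule 3 for trial 1.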

Solution 2

A C++ implementation for C4.5 called YaDT is available here, in the "Decision Trees" section:
http://www.di.unipi.it/~ruggieri/software.html

This is the source code for the last version:
http://www.di.unipi.it/~ruggieri/YaDT/YaDT1.2.5.zip

From the paper where the tool is described:

[...] In this paper, we describe a new from-scratch C++ implementation of a decision tree induction algorithm, which yields entropy-based decision trees in the style of C4.5. The implementation is called YaDT, an acronym for Yet another Decision Tree builder. The intended contribution of this paper is to present the design principles of the implementation that allowed for obtaining a highly efficient system. We discuss our choices on memory representation and modelling of data and metadata, on the algorithmic optimizations and their effect on memory and time performances, and on the trade-off between efficiency and accuracy of pruning heuristics. [...]

The paper is available here.
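For readers unfamiliar with what "entropy-based in the style of C4.5" means in practice: C4.5 ranks candidate splits by gain ratio, i.e. information gain divided by the split information. The sketch below is the textbook formula only. It is not taken from YaDT's source, and the function names entropy and gainRatio are my own.

```cpp
#include <cmath>
#include <vector>

// Shannon entropy (in bits) of a histogram of counts.
double entropy(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    if (total == 0) return 0.0;
    double h = 0.0;
    for (int c : counts) {
        if (c == 0) continue;                       // 0 * log(0) is taken as 0
        double p = static_cast<double>(c) / total;
        h -= p * std::log2(p);
    }
    return h;
}

// Gain ratio of a split.
//   parent:   class counts before the split
//   children: class counts in each branch after the split
double gainRatio(const std::vector<int>& parent,
                 const std::vector<std::vector<int>>& children) {
    int total = 0;
    for (int c : parent) total += c;
    if (total == 0) return 0.0;

    double remainder = 0.0;          // weighted entropy of the branches
    std::vector<int> branchSizes;    // used for the split information
    for (const std::vector<int>& child : children) {
        int n = 0;
        for (int c : child) n += c;
        branchSizes.push_back(n);
        remainder += static_cast<double>(n) / total * entropy(child);
    }
    double gain = entropy(parent) - remainder;
    double splitInfo = entropy(branchSizes);
    return splitInfo > 0.0 ? gain / splitInfo : 0.0;
}
```

A split that separates two equally sized classes perfectly, e.g. gainRatio({5, 5}, {{5, 0}, {0, 5}}), has a gain of 1 bit and a split information of 1 bit, so its gain ratio is 1.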

Solution 3

If I'm reading this correctly...it appears not to be organized as a C API, but as a C program. A data set is fed in, then it runs an algorithm and gives you back some rule descriptions.

I'd think the path you should take depends on whether you:

  1. merely want a C++ interface for supplying data and retrieving rules from the existing engine, or...

  2. want a C++ implementation that you can tinker with in order to tweak the algorithm to your own ends

If what you want is (1) then you could really just spawn the program as a process, feed it input as strings, and take the output as strings. That would probably be the easiest and most future-proof way of developing a "wrapper", and then you'd only have to develop C++ classes to represent the inputs and model the rule results (or match existing classes to these abstractions).
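A minimal sketch of the "spawn it and read strings back" approach, assuming a POSIX system where popen() is available. runAndCapture is a name I made up, and the exact command line for the C4.5/C5.0 executable is deliberately left out because it depends on how you built and installed it.

```cpp
#include <cstddef>
#include <stdexcept>
#include <stdio.h>   // popen/pclose are POSIX, declared in <stdio.h>
#include <string>

// Runs `command` through the shell and returns everything it wrote to stdout.
// Input to the child is assumed to go through its command line or its data files.
std::string runAndCapture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("popen failed for: " + command);
    }
    std::string output;
    char buffer[4096];
    std::size_t n = 0;
    while ((n = fread(buffer, 1, sizeof buffer, pipe)) > 0) {
        output.append(buffer, n);   // accumulate raw bytes from the child
    }
    if (pclose(pipe) == -1) {
        throw std::runtime_error("pclose failed for: " + command);
    }
    return output;
}
```

Your C++ classes would then parse the returned string into whatever rule or tree model you settle on. Trying it first with something trivial like runAndCapture("echo hello") is a quick way to check the plumbing before pointing it at the real engine.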

But if what you want is (2)...then I'd suggest trying whatever hacks you have in mind on top of the existing code in either C or Java--whichever you are most comfortable with. You'll get to know the code that way, and if you have any improvements you may be able to feed them upstream to the author. If you build a relationship over the longer term then maybe you could collaborate and bring the C codebase slowly forward to C++, one aspect at a time, as the language was designed for.

Guess I just think the "When in Rome" philosophy usually works better than Port-In-One-Go, especially at the outset.


RESPONSE TO UPDATE: Process isolation takes care of your global variable issue. As for performance and data set size, you only have as many cores/CPUs and as much memory as you have. Whether you're using processes or threads usually isn't the issue when you're talking about matters of scale at that level. The overhead that matters is the marshalling, and only if it turns out to be too expensive.

Prove the marshalling is the bottleneck, and to what extent... and you can build a case for why a process is a problem over a thread. But, there may be small tweaks to existing code to make marshalling cheaper which don't require a rewrite.
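One way to start building that case is to time the marshalling on its own, separate from the classification work. The sketch below assumes a made-up wire format (one case as space-separated attribute values); marshal, unmarshal and microsPerRoundTrip are hypothetical names, and a real measurement should use the format the engine actually reads.

```cpp
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wire format: one case as space-separated attribute values.
std::string marshal(const std::vector<float>& values) {
    std::ostringstream out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out << ' ';
        out << values[i];
    }
    return out.str();
}

// Parses a line produced by marshal() back into attribute values.
std::vector<float> unmarshal(const std::string& line) {
    std::istringstream in(line);
    std::vector<float> values;
    float v = 0.0f;
    while (in >> v) values.push_back(v);
    return values;
}

// Average wall-clock microseconds for one marshal + unmarshal round trip.
double microsPerRoundTrip(const std::vector<float>& sample, int iterations) {
    if (iterations <= 0) return 0.0;
    using Clock = std::chrono::steady_clock;
    volatile std::size_t sink = 0;   // keeps the loop from being optimized away
    Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = sink + unmarshal(marshal(sample)).size();
    }
    std::chrono::duration<double, std::micro> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Compare that per-case number with the time the engine spends classifying one case. If marshalling is a small fraction of it, the process boundary is not your problem.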

Author: Kiril

Updated on June 22, 2022

Original question (asked by Kiril)

    I've been looking for a C++ implementation of the C4.5 algorithm, but I haven't been able to find one yet. I found Quinlan's C4.5 Release 8, but it's written in C... has anybody seen any open source C++ implementations of the C4.5 algorithm?

    I'm thinking about porting the J48 source code (or simply writing a wrapper around the C version) if I can't find an open source C++ implementation out there, but I hope I don't have to do that! Please let me know if you have come across a C++ implementation of the algorithm.

    Update

    I've been considering the option of writing a thin C++ wrapper around the C implementation of the C5.0 algorithm (C5.0 is the improved version of C4.5). I downloaded and compiled the C implementation of the C5.0 algorithm, but it doesn't look like it's easily portable to C++. The C implementation uses a lot of global variables, so simply writing a thin C++ wrapper around the C functions will not result in an object-oriented design: every class instance would be modifying the same global parameters. In other words, I would have no encapsulation, and that's a pretty basic thing that I need.

    In order to get encapsulation I will need to make a full-blown port of the C code into C++, which is about the same as porting the Java version (J48) into C++.
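To make the encapsulation problem concrete, here is a minimal sketch with a stand-in "engine" whose state lives in a file-scope global. The names g_maxClass, engineLoad, engineMaxClass and ThinWrapper are all hypothetical and only mimic the pattern; they are not taken from the C5.0 source.

```cpp
// Stand-in for the C engine: its state is a file-scope global (hypothetical).
static int g_maxClass = 0;

void engineLoad(int maxClass) { g_maxClass = maxClass; }
int engineMaxClass() { return g_maxClass; }

// A thin wrapper that only forwards to the C-style functions.
// It looks like an object, but it owns no state of its own.
class ThinWrapper {
public:
    explicit ThinWrapper(int maxClass) { engineLoad(maxClass); }
    int maxClass() const { return engineMaxClass(); }
};
```

Construct ThinWrapper a(2) and then ThinWrapper b(5), and a.maxClass() reports 5 as well: the second instance silently overwrote the first one's model, because both read and write the same global.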

    Update 2.0

    Here are some specific requirements:

    1. Each classifier instance must encapsulate its own data (i.e. no global variables aside from constant ones).
    2. Support the concurrent training of classifiers and the concurrent evaluation of the classifiers.

    Here is a good scenario: suppose I'm doing 10-fold cross-validation. I would like to concurrently train 10 decision trees, each on its respective slice of the training set. If I just run the C program for each slice, I would have to run 10 processes, which is not horrible. However, if I need to classify thousands of data samples in real time, then I would have to start a new process for each sample I want to classify, and that's not very efficient.
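The shape of what these requirements ask for can be sketched with a toy classifier standing in for the decision tree. Everything here is hypothetical (Sample, MajorityClassifier, trainFolds), and the "model" just predicts the majority label. The point is only that each instance owns its state, so training can run in one thread per fold and classify() can be called on many samples without starting a process.

```cpp
#include <cstddef>
#include <future>
#include <map>
#include <vector>

struct Sample {
    std::vector<float> features;
    int label;
};

// Toy stand-in for a decision tree: predicts the majority training label.
// All state is a member, so instances never interfere with each other.
class MajorityClassifier {
public:
    void train(const std::vector<Sample>& data) {
        std::map<int, int> counts;
        for (const Sample& s : data) ++counts[s.label];
        int best = 0;
        for (const auto& kv : counts) {
            if (kv.second > best) { best = kv.second; label_ = kv.first; }
        }
    }
    // const and free of shared state, so concurrent calls are safe.
    int classify(const Sample&) const { return label_; }
private:
    int label_ = -1;
};

// Trains one classifier per fold concurrently.
// Fold k holds out the samples whose index i satisfies i % folds == k.
std::vector<MajorityClassifier> trainFolds(const std::vector<Sample>& data,
                                           std::size_t folds) {
    std::vector<std::future<MajorityClassifier>> pending;
    for (std::size_t k = 0; k < folds; ++k) {
        pending.push_back(std::async(std::launch::async, [&data, folds, k] {
            std::vector<Sample> trainSet;
            for (std::size_t i = 0; i < data.size(); ++i) {
                if (i % folds != k) trainSet.push_back(data[i]);
            }
            MajorityClassifier c;
            c.train(trainSet);
            return c;
        }));
    }
    std::vector<MajorityClassifier> trained;
    for (std::future<MajorityClassifier>& f : pending) trained.push_back(f.get());
    return trained;
}
```

On older toolchains this may need to be built with -pthread. Calling trainFolds(data, 10) gives ten independent classifiers, one per cross-validation fold.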