Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

Solution 1

You have at least two options:

  • Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on that categorical representation of your data (see the first sketch after this list).

  • Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class-assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on those new features (see the second sketch after this list).
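
A minimal sketch of the first option on made-up data (the quintile edges, the np.digitize binning and the one-hot encoding are just one possible concrete choice, not a prescribed recipe):

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

X_cat = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [0, 2]])        # already categorical
X_num = np.array([[180.9, 75.0], [165.2, 61.5], [166.3, 60.3],
                  [173.0, 68.2], [178.4, 71.0]])                   # continuous
y = np.array([0, 0, 1, 1, 0])

# percentile-based bin edges (20th/40th/60th/80th) learned on the training set
edges = [np.percentile(col, [20, 40, 60, 80]) for col in X_num.T]
X_binned = np.column_stack([np.digitize(col, e) for col, e in zip(X_num.T, edges)])

# one-hot encode every (now categorical) column and fit a single multinomial NB
enc = OneHotEncoder(handle_unknown="ignore")
clf = MultinomialNB().fit(enc.fit_transform(np.hstack([X_cat, X_binned])), y)
clf.predict(enc.transform(np.hstack([X_cat, X_binned])))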

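A minimal sketch of the second (stacking) option on the same made-up data; in practice the probabilities fed to the final model are better produced on held-out folds rather than on the very rows the first-stage models were fitted on:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import OneHotEncoder

X_cat = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [0, 2]])
X_num = np.array([[180.9, 75.0], [165.2, 61.5], [166.3, 60.3],
                  [173.0, 68.2], [178.4, 71.0]])
y = np.array([0, 0, 1, 1, 0])

enc = OneHotEncoder(handle_unknown="ignore")
multinomial_nb = MultinomialNB().fit(enc.fit_transform(X_cat), y)
gaussian_nb = GaussianNB().fit(X_num, y)

# class-assignment probabilities from each model become the new feature space
multinomial_probas = multinomial_nb.predict_proba(enc.transform(X_cat))
gaussian_probas = gaussian_nb.predict_proba(X_num)
stacked = np.hstack((multinomial_probas, gaussian_probas))

final_clf = GaussianNB().fit(stacked, y)    # refit a new model on the stacked probabilities
final_clf.predict(stacked)
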
Solution 2

Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.

https://github.com/remykarem/mixed-naive-bayes

The library is written such that the APIs are similar to scikit-learn's.

In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. In the constructor, just specify categorical_features=[0,1], indicating that columns 0 and 1 are to follow a categorical distribution.

from mixed_naive_bayes import MixedNB
X = [[0, 0, 180.9, 75.0],
     [1, 1, 165.2, 61.5],
     [2, 1, 166.3, 60.3],
     [1, 1, 173.0, 68.2],
     [0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]
clf = MixedNB(categorical_features=[0,1])
clf.fit(X,y)
clf.predict(X)

It is pip-installable via pip install mixed-naive-bayes. More information on usage is in the README.md file. Pull requests are greatly appreciated :)

Solution 3

The simple answer: multiply the results! It comes out the same.

Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features, meaning you calculate the probability contribution of each feature without conditioning on the others. In other words, the algorithm multiplies the probability from one feature by the probability from the next (and the denominator is ignored entirely, since it is just a normalizer).

so the right answer is:

  1. calculate the probability from the categorical variables.
  2. calculate the probability from the continuous variables.
  3. multiply 1. and 2. (a sketch follows below)
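
A minimal sketch of this recipe with plain scikit-learn estimators (CategoricalNB requires scikit-learn >= 0.22; the data is made up). One caveat: each model's predict_proba already includes the class prior, so the prior has to be divided out once before multiplying, otherwise it is counted twice:

import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

X_cat = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [0, 2]])        # categorical features
X_num = np.array([[180.9, 75.0], [165.2, 61.5], [166.3, 60.3],
                  [173.0, 68.2], [178.4, 71.0]])                   # continuous features
y = np.array([0, 0, 1, 1, 0])

cat_nb = CategoricalNB().fit(X_cat, y)
num_nb = GaussianNB().fit(X_num, y)

# P(y|x) is proportional to P(x_cat|y) * P(x_num|y) * P(y); each predict_proba
# already folds in P(y), so divide it out once to avoid double counting
posterior = cat_nb.predict_proba(X_cat) * num_nb.predict_proba(X_num) / num_nb.class_prior_
posterior /= posterior.sum(axis=1, keepdims=True)   # normalization step (see Solution 4)
posterior.argmax(axis=1)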

Solution 4

@Yaron's approach needs an extra step (4. below):

  1. Calculate the probability from the categorical variables.
  2. Calculate the probability from the continuous variables.
  3. Multiply 1. and 2. AND
  4. Divide 3. by the sum of the products of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence sum to 1.

Step 4. is the normalization step. Take a look at @remykarem's mixed-naive-bayes as an example (lines 268-278):

        if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
            finals = t * p * self.priors
        elif self.gaussian_features.size != 0:
            finals = t * self.priors
        elif self.categorical_features.size != 0:
            finals = p * self.priors

        normalised = finals.T/(np.sum(finals, axis=1) + 1e-6)
        normalised = np.moveaxis(normalised, [0, 1], [1, 0])

        return normalised

The probabilities of the Gaussian and Categorical models (t and p respectively) are multiplied together in line 269 (line 2 in extract above) and then normalized as in 4. in line 275 (fourth line from the bottom in extract above).

Solution 5

For hybrid features, you can check this implementation.

The author has presented a mathematical justification in his Quora answer, which you might want to check.


Comments

  • user1499144
    user1499144 almost 2 years

    I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications" etc.) and continuous data (e.g. "Age", "Length of membership" etc.). I haven't used scikit much before but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

  • unutbu
    unutbu over 11 years
    @ogrisel: Am I right in believing that the second method might miss correlations between the continuous and categorical data? For example, suppose young people who register online are typically male, but young people who do not register online are typically female. But further suppose for the sake of concreteness that the gaussian NB model predicts young people (without knowledge of the categorical data) are generally male. Since only this probability is being passed on to the second-stage gaussian NB, it will miss the correlation.
  • Sam
    Sam almost 10 years
    @unutbu: Naive Bayes classifiers assume independence of the features given the class. The first method listed above will learn P(age|gender) and P(registration_type|gender) independently. The correlation between age and registration_type will not be captured for a given gender.
  • jai
    jai about 6 years
    @ogrisel can we use one-hot-encoding to convert the categorical variables to values between 0 and n-1 for n classes and keep the continuous variables as they are for GaussianNB() ? based on this post: dataaspirant.com/2017/02/20/…
  • Him
    Him about 6 years
    @jai, No! First, one-hot encoding is not the same as converting to values between 0 and n-1. Second, converting categorical variables to values between 0 and n-1 and then treating them as continuous variables makes no sense. Third, one-hot categorical variables are so non-Gaussian, that treating them as Gaussian (which GaussianNB assumes) does not, in my experience, produce good results.
  • Davis
    Davis about 6 years
    Gaussian NB gives a density estimate for the prior. I'm not sure about what you meant for the second part.
  • Yaron
    Yaron about 6 years
    @Davis, I'm not sure what you meant, but Gaussian NB means that the likelihood of the features is assumed to be Gaussian, and this is how P(x|y) is calculated.
  • Davis
    Davis about 6 years
    I mean there isn't Pr(x_i | y) anymore, but this prior is replaced by Norm(mu_i, sig_i), which is a density estimate, because the probability Pr(X_i = x | y) is zero as the RV X_i is continuous.
  • Yaron
    Yaron about 6 years
    I think your question is not related to the topic but you can get your answer from: stats.stackexchange.com/questions/26624/…
  • Chuck
    Chuck over 5 years
    @ogrisel, I thought that predict_proba is to be used for predicting probabilities on the 'test' data. E.g. I create 2 separate classifiers on the train data, and I can then use these to predict the probabilities for my remaining test data. If I then train another Gaussian model on the predict_proba result from the test data, doesn't that leave me with nothing to test on? Am I looking at this correctly? Cheers
  • Chuck
    Chuck over 5 years
    @Yaron, when you say calculate the probabilities, is this on the test or train data, and what function would you use to do this? Are you using predict_proba on the train data as well as doing the fit on the train data? I'm struggling to figure out what I should be multiplying... Cheers
  • Yaron
    Yaron over 5 years
    Yes, you should use predict_proba to get the probabilities. You should apply it to whichever dataset you need probabilities for (train for train and test for test). The model itself should be fitted using only the train set, of course.
  • Chuck
    Chuck over 5 years
    @Yaron I'm confused, isn't predict_proba carried out on the test data, not on the train data? Are you saying to do your clf.fit() on the test, and then also do your predict_proba on the test as well?
  • Yaron
    Yaron over 5 years
    @Chuck, certainly not; you should use predict only on the test set.
  • paperskilltrees
    paperskilltrees about 2 years
    Regarding the second bullet point. I can't quite figure out what fitting a Gaussian NB to the concatenated probabilities effectively does, and how the result compares to a proper naive Bayes model, as described in other answers. Any ideas?