Text Classification using Decision Trees in Python

Decision trees only work when your feature vectors are all the same length. Personally, I have no idea how effective decision trees would be at text analysis like this, but if you want to try it, the way I'd suggest is a "one-hot", "bag-of-words" style vector.

Essentially, keep track of how many times each word appears in your example, and put those counts in a vector that represents the whole corpus. Say that, once you removed all the stop words, the vocabulary of the entire corpus was:

{"Apple", "Banana", "Cherry", "Date", "Eggplant"}

You represent each example by a vector the same size as the vocabulary, with each value indicating whether or not the corresponding word appears. In our example, that's a length-5 vector where the first element is associated with "Apple", the second with "Banana", and so on. You might get something like:

bag("Apple Banana Date")
#: [1, 1, 0, 1, 0]
bag("Cherry")
#: [0, 0, 1, 0, 0]
bag("Date Eggplant Banana Banana")
#: [0, 1, 0, 1, 1]
# For this case, I have no clue if Banana having the value 2 would improve results.
# It might. It might not. Something you'd need to test.

This way, you have the same-sized vector regardless of the input, and the decision tree knows where to look for certain features. Say "Banana" corresponds strongly to bug reports; in that case the decision tree will learn that a 1 in the second element makes a bug report more likely.

Of course, your vocabulary might be thousands of words long. In that case a decision tree probably won't be the best tool for the job, not unless you first take some time to trim down your features.
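
A minimal sketch of such a bag function over the toy vocabulary above; the function name and the binary-versus-count choice are just illustrative, not a particular library's API:

# fixed vocabulary order, so every vector has the same length regardless of input
VOCAB = ["Apple", "Banana", "Cherry", "Date", "Eggplant"]

def bag(text):
    words = text.split()
    # 1 if the word appears at least once, 0 otherwise ("one-hot" style);
    # swap in words.count(w) if you want raw counts instead
    return [1 if w in words else 0 for w in VOCAB]

print(bag("Apple Banana Date"))            # [1, 1, 0, 1, 0]
print(bag("Cherry"))                       # [0, 0, 1, 0, 0]
print(bag("Date Eggplant Banana Banana"))  # [0, 1, 0, 1, 1]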


Comments

  • Venkatesh
    Venkatesh almost 2 years

    I am new to Python as well as machine learning. My implementation is based on the IEEE research paper http://ieeexplore.ieee.org/document/7320414/ (Bug report, feature request, or simply praise? On automatically classifying app reviews)

    I want to classify text into categories. The text consists of user reviews from the Google Play Store or the Apple App Store. The categories used in the research were Bug, Feature, User Experience, and Rating. Given this, I am trying to implement a decision tree using the sklearn package in Python. I came across the example 'IRIS' data set provided by sklearn, which builds a tree model using the features and their values mapped to the target. In that example, the data is numeric.

    I am trying to classify text instead of numeric data. Examples:

    1. I liked very much the upgrade to pdfs. However, they aren't displaying anymore Fix it and it will be perfect [BUG]
    2. I just wish it would notify me if I go below a certain dollar amount [FEATURE]
    3. This app is very helpful in my line of business [Rating]
    4. Easy to find songs and purchase in iTunes [UserExperience]

    Given these texts and a lot more user reviews in these categories, I want to create a classifier that can train on the data and predict the target of any given user review.

    So far I have pre-processed the text and created training data in the form of a list of tuples, each containing the pre-processed data and its target.

    My Pre-processing:

    1. Tokenize multi-line comments into single sentences
    2. Tokenize each sentence into words
    3. Remove stop words in the tokenized sentence
    4. Lemmatize the words in the tokenized sentence

    (['i', 'liked', 'much', 'upgrade', 'pdfs', 'however', 'displaying', 'anymore', 'fix', 'perfect'], "BUG")

    Here's what I have so far:

    import json
    from sklearn import tree
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import sent_tokenize, RegexpTokenizer

    # define a tokenizer to tokenize sentences and also remove punctuation
    tokenizer = RegexpTokenizer(r'\w+')

    # this list stores all the training data along with its label
    tagged_tokenized_comments_corpus = []


    # Method: to add data to the training set
    # Parameter: tuple in the format (Data, Label)
    def tag_tokenized_comments_corpus(*tuple_data):
        tagged_tokenized_comments_corpus.append(tuple_data)


    # step 1: load all the stop words from the nltk package
    stop_words = stopwords.words("english")
    stop_words.remove('not')

    # step 2: create a temporary copy of the stop words so the original list
    # can be modified while iterating; keep negation words like "don't"
    temp_stop_words = list(stop_words)
    for word in temp_stop_words:
        if "n't" in word:
            stop_words.remove(word)

    # load the data set
    files = ["Bug.txt", "Feature.txt", "Rating.txt", "UserExperience.txt"]

    d = {"Bug": 0, "Feature": 1, "Rating": 2, "UserExperience": 3}

    for file in files:
        input_file = open(file, "r")
        file_text = input_file.read()
        json_content = json.loads(file_text)

        # step 3: tokenize multi-sentence user comments into single sentences
        comments_corpus = []
        for i in range(len(json_content)):
            comments = json_content[i]['comment']
            if len(sent_tokenize(comments)) > 1:
                for comment in sent_tokenize(comments):
                    comments_corpus.append(comment)
            else:
                comments_corpus.append(comments)

        # step 4: tokenize each sentence, remove stop words and lemmatize the comments corpus
        lemmatizer = WordNetLemmatizer()
        tokenized_comments_corpus = []
        for i in range(len(comments_corpus)):
            words = tokenizer.tokenize(comments_corpus[i])
            tokenized_sentence = []
            for w in words:
                if w not in stop_words:
                    tokenized_sentence.append(lemmatizer.lemmatize(w.lower()))
            if tokenized_sentence:
                tokenized_comments_corpus.append(tokenized_sentence)
                tag_tokenized_comments_corpus(tokenized_sentence, d[input_file.name.split(".")[0]])

    # step 5: create a dictionary of words from the tokenized comments corpus
    unique_words = []
    for sentence in tagged_tokenized_comments_corpus:
        for word in sentence[0]:
            unique_words.append(word)
    unique_words = set(unique_words)

    dictionary = {}
    i = 0
    for dict_word in unique_words:
        dictionary.update({i: dict_word})
        i = i + 1

    train_target = []
    train_data = []
    for sentence in tagged_tokenized_comments_corpus:
        train_target.append(sentence[0])
        train_data.append(sentence[1])

    clf = tree.DecisionTreeClassifier()
    clf.fit(train_data, train_target)

    test_data = ("Beautiful Keep it up.. this far is the most usable app editor.. "
                 "it makes my photos more beautiful and alive..")

    test_words = tokenizer.tokenize(test_data)
    test_tokenized_sentence = []
    for test_word in test_words:
        if test_word not in stop_words:
            test_tokenized_sentence.append(lemmatizer.lemmatize(test_word.lower()))

    # predict using the classifier
    print("predicting the labels: ")
    print(clf.predict(test_tokenized_sentence))

    However, this doesn't seem to work, since it throws an error at run time when we train the algorithm. I was thinking I could map the words in the tuple to the dictionary, convert the text into numeric form, and then train the algorithm, but I am not sure if this would work.

    Can anyone suggest how I can fix this code, or whether there is a better way to implement this decision tree?

    Traceback (most recent call last):
      File "C:/Users/venka/Documents/GitHub/RE-18/Test.py", line 87, in <module>
        clf.fit(train_data, train_target)
      File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
        X_idx_sorted=X_idx_sorted)
      File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 116, in fit
        X = check_array(X, dtype=DTYPE, accept_sparse="csc")
      File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
        "if it contains a single sample.".format(array))
    ValueError: Expected 2D array, got 1D array instead:
    array=[ 0.  0.  0. ...,  3.  3.  3.].
    Reshape your data either using array.reshape(-1, 1) if your data has a
    single feature or array.reshape(1, -1) if it contains a single sample.
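
    A minimal sketch of the dictionary idea mentioned above, assuming tagged_tokenized_comments_corpus has been filled by the code earlier; the vocabulary and count-vector loop here are illustrative, not the original implementation:

    # hypothetical fix: map every tokenized sentence onto a fixed-length count
    # vector over the whole vocabulary, so that clf.fit receives a 2D array
    vocabulary = sorted({w for sentence, _ in tagged_tokenized_comments_corpus for w in sentence})
    word_index = {w: i for i, w in enumerate(vocabulary)}

    train_data = []
    train_target = []
    for sentence, label in tagged_tokenized_comments_corpus:
        vector = [0] * len(vocabulary)
        for w in sentence:
            vector[word_index[w]] += 1   # count occurrences; use 1 for a pure one-hot bag
        train_data.append(vector)        # every row now has the same length
        train_target.append(label)

    clf = tree.DecisionTreeClassifier()
    clf.fit(train_data, train_target)    # X is now a 2D list, so the ValueError goes away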
    
    • SCB
      SCB over 6 years
      Do you have the traceback of the error that occurs? We can't exactly help you without knowing what actually went wrong. That said, I'm going to assume you might struggle plugging sentences straight into sklearn Decision Trees.
    • Venkatesh
      Venkatesh over 6 years
      I have updated the post with the traceback. I understand that it is very difficult to plug sentences straight into decision trees. I was thinking of creating a dictionary of all the unique words in the corpus and assigning a unique numeric value to each word, so that we can pass tuples of numeric data instead of sentences, but I'm not sure if this would work.
    • SCB
      SCB over 6 years
      Have you tried it? Why do you think it won't work? The error here is reasonably self-explanatory: you need to hand in a 2D array. Is the one erroring your data or your target?
    • Venkatesh
      Venkatesh over 6 years
      I am yet to try it. In the IRIS data set example that I've mentioned in the post, the training data consists of a constant 4 features for every single sample, and I am assuming that these 4 values make it easy for the algorithm to understand and construct a model. However, when I convert text into numerical data, each sentence is not going to be of the same length. I agree that this issue exists even now, without converting into numerical data. And regarding the error, I am looking into the stack trace again. It makes more sense now after you've mentioned it, and I am going to try again.
    • Vivek Kumar
      Vivek Kumar over 6 years
      You can use CountVectorizer or TfidfVectorizer in sklearn to convert your filtered words into numerical data ready to be used by ML algorithms.
    • Venkatesh
      Venkatesh over 6 years
      Thank you for the suggestion; I did that now to convert my textual data to numerical data.
  • Venkatesh
    Venkatesh over 6 years
    Thanks for the answer. As expected, and like you mentioned, decision trees can only work when the feature vectors are all the same length. I came to understand that what you've suggested is to convert the text to a vector. This tutorial, blog.christianperone.com/2011/09/…, had a very good explanation of what goes on behind the scenes in sklearn's CountVectorizer and TfidfTransformer. The decision tree is done; I used model_selection to split the data into train and test sets. I get an accuracy of 50.15%, whereas Naive Bayes gives me 58.30% (a sketch of this kind of pipeline is shown after these comments).
  • Venkatesh
    Venkatesh over 6 years
    I believe the accuracy can vary with the training data. For the accuracy I mentioned above, I had Bug: 814, Feature: 744, Rating: 3961, UserExperience: 1225, and split the data into 60% train and 40% test. The accuracy I achieved does not match the paper ieeexplore.ieee.org/document/7320414; theirs is much higher. I will, however, continue to collect more data or choose a different way to pick the training data to improve the accuracy.
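
A minimal sketch of the kind of pipeline described in the comments above, assuming the pre-processed review strings and their numeric labels are already available; the variable names and the toy data are illustrative, not taken from the original post:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy stand-ins; in practice these would be the pre-processed reviews and the
# labels from d = {"Bug": 0, "Feature": 1, "Rating": 2, "UserExperience": 3}
texts = [
    "liked much upgrade pdfs however displaying anymore fix perfect",
    "wish would notify go below certain dollar amount",
    "app helpful line business",
    "easy find songs purchase itunes",
]
labels = [0, 1, 2, 3]

# CountVectorizer builds the vocabulary and turns each review into a
# fixed-length count vector, which is exactly what the tree expects
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# new reviews must go through the same fitted vectorizer before predicting
print(clf.predict(vectorizer.transform(["notify me when the price drops"])))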