Simple k-means clustering for a bag-of-words model using Python


K-means is a good idea.

Some examples and code from the web:

1) Document Clustering with Python link

2) Clustering text documents using scikit-learn kmeans in Python link

3) Clustering a long list of strings (words) into similarity groups link

4) Kaggle post link
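
As a minimal sketch (assuming the '0'/'1' vectors built in the question's comments below; the variable names are illustrative, not from the linked posts), the dict of binary vectors can be converted to a numeric matrix and passed to scikit-learn's KMeans:

    import numpy as np
    from sklearn.cluster import KMeans

    # assumed input: user_id -> binary bag-of-words vector
    vectors = {'007': ['0', '0', '1'],
               '666': ['0', '1', '1'],
               '888': ['1', '0', '0']}

    user_ids = list(vectors)
    # turn the '0'/'1' strings into a numeric feature matrix
    X = np.array([vectors[u] for u in user_ids], dtype=float)

    # n_clusters is a free parameter; 2 is just a guess for this toy data
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    for user_id, label in zip(user_ids, kmeans.labels_):
        print(user_id, '-> cluster', label)

On these three toy vectors, 007 and 666 share a term, so two-cluster k-means would typically group them together and leave 888 in its own cluster.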


Comments

  • kanglais almost 2 years

    The input dataset looks like this:

    {"666": ["abc",
             "xyz"],
     "888": ["xxxo",
             "xxxo"], 
     "007": ["abc"]}  
    

    We start by creating a bag-of-words model using the following function:

    import pprint

    def associate_terms_with_user(unique_term_set, all_users_terms_dict):

        associated_value_return_dict = {}

        # freeze one ordering for the global vocabulary (sets are unordered)
        unique_terms = list(unique_term_set)

        # consider each user in turn
        for user_id in all_users_terms_dict:

            # what terms *did* this user use
            terms_belong_to_this_user = all_users_terms_dict[user_id]

            # mark '1' for every vocabulary term this user used, '0' otherwise
            this_user_vector = ['1' if term in terms_belong_to_this_user else '0'
                                for term in unique_terms]

            associated_value_return_dict[user_id] = this_user_vector

        pprint.pprint(associated_value_return_dict)
        return associated_value_return_dict
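
    For reference, unique_term_set is never constructed inside that function. A minimal driver (hypothetical, not part of the original post) could derive the vocabulary from the input dict and invoke the function like this:

    import itertools

    all_users_terms = {"666": ["abc", "xyz"],
                       "888": ["xxxo", "xxxo"],
                       "007": ["abc"]}

    # the vocabulary is every term any user has used, deduplicated
    unique_terms = set(itertools.chain.from_iterable(all_users_terms.values()))

    associate_terms_with_user(unique_terms, all_users_terms)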
    

    The output of the program looks like this:

    {'007': ['0', '0', '1'], 
     '666': ['0', '1', '1'], 
     '888': ['1', '0', '0']}
    

    How could we implement a simple function to cluster those vectors based on their similarity to one another? I envisage using k-means and possibly scikit-learn.

    I've never done this before and don't know how; I'm new to machine learning in general and don't really know where to start.

    In the end, 666 and 007 would probably be clustered together, and 888 would be alone in its own cluster, wouldn't they?

    The full code lives here.

    • seralouk almost 7 years
      K-means is a good idea, I think. You can see an example here: link
    • kanglais almost 7 years
      Ah, cool, thank you. But do you know how I would feed the bag-of-words dict structure I have into a k-means function? Do I need to change it at all first?
    • seralouk almost 7 years
      I will post some links in an answer; there are some examples and answers there. Hope this helps.
    • Has QUIT--Anony-Mousse almost 7 years
      K-means does not work well on short text.