simple k-means clustering for bag of words model using python
K-means is a good idea.
Some examples and code from the web:
1) Document Clustering with Python link
2) Clustering text documents using scikit-learn kmeans in Python link
3) Clustering a long list of strings (words) into similarity groups link
4) Kaggle post link
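For instance, here is a minimal sketch of the scikit-learn route (this assumes scikit-learn and NumPy are installed; the variable names are illustrative, not from the question) that feeds the question's binary vectors into `KMeans`:

```python
import numpy as np
from sklearn.cluster import KMeans

# the binary bag-of-words vectors from the question, as integers
user_vectors = {
    "007": [0, 0, 1],
    "666": [0, 1, 1],
    "888": [1, 0, 0],
}
X = np.array(list(user_vectors.values()))

# n_init=10 restarts k-means from several seedings and keeps the best fit
km = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = dict(zip(user_vectors, km.fit_predict(X)))
```

With only three points this is overkill, but the same call scales to real data: each user becomes one row of `X`, and `clusters` maps user ids to cluster labels.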
kanglais, updated on June 04, 2022
The input dataset looks like this:
{"666": ["abc", "xyz"], "888": ["xxxo", "xxxo"], "007": ["abc"]}
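The function below takes a `unique_term_set` argument that is never shown being built; assuming it is simply the union of every user's terms, it could be derived from that dict like this:

```python
all_users_terms_dict = {"666": ["abc", "xyz"], "888": ["xxxo", "xxxo"], "007": ["abc"]}

# the vocabulary is the union of all users' term lists;
# using a set also de-duplicates 888's repeated "xxxo"
unique_term_set = set()
for terms in all_users_terms_dict.values():
    unique_term_set.update(terms)
```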
We start by creating a bag-of-words model using the following function:
```python
import pprint

def associate_terms_with_user(unique_term_set, all_users_terms_dict):
    associated_value_return_dict = {}
    # consider each user in turn
    for user_id in all_users_terms_dict:
        # what terms *could* this user have possibly used
        this_user_zero_vector = []
        # this could be refactored somehow
        for term in unique_term_set:
            this_user_zero_vector.extend('0')
        # what terms *did* this user use
        terms_belong_to_this_user = all_users_terms_dict.get(user_id)
        # count through all the possible terms that each term in the personal
        # user list of words could correspond to...
        global_term_element_index = 0
        # while this one term is in the range of all possible terms
        while global_term_element_index < len(unique_term_set):
            # start counting the terms this user used
            local_term_set_item_index = 0
            # walk the user's own terms one by one
            while local_term_set_item_index < len(terms_belong_to_this_user):
                # if this user term is the same as this global term, mark it
                if list(unique_term_set)[global_term_element_index] == terms_belong_to_this_user[local_term_set_item_index]:
                    this_user_zero_vector[global_term_element_index] = '1'
                # go to the next term for this user
                local_term_set_item_index += 1
            # go to the next term in the global list of all possible terms
            global_term_element_index += 1
        associated_value_return_dict.update({user_id: this_user_zero_vector})
    pprint.pprint(associated_value_return_dict)
```
The output of the program looks like this:
{'007': ['0', '0', '1'], '666': ['0', '1', '1'], '888': ['1', '0', '0']}
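For comparison, the same vectors can be built much more compactly. The sketch below (an alternative, not the question's code) uses integer 0/1 instead of `'0'`/`'1'` strings and a sorted vocabulary, so the column order is reproducible; a plain set, as above, has no guaranteed order, which is why these columns differ from the output just shown:

```python
from pprint import pprint

all_users_terms_dict = {"666": ["abc", "xyz"], "888": ["xxxo", "xxxo"], "007": ["abc"]}

# sorted() fixes the column order; a bare set would make it run-dependent
vocabulary = sorted({t for terms in all_users_terms_dict.values() for t in terms})
# vocabulary == ['abc', 'xxxo', 'xyz']

user_vectors = {
    user_id: [1 if term in set(terms) else 0 for term in vocabulary]
    for user_id, terms in all_users_terms_dict.items()
}
pprint(user_vectors)
```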
How could we implement a simple function to cluster those vectors based on their similarity to one another? I envisage using k-means and possibly scikit-learn.
I've never done that before and I don't know how; I'm new to machine learning in general and I don't really even know where to start.
Finally, 666 and 007 would probably be clustered together, and 888 would be in a cluster by itself, right? The full code lives here.
- seralouk (almost 7 years): K-means is a good idea, I think. You can see an example here: link
- kanglais (almost 7 years): Ah, cool, thank you. But do you know how I would feed the bag-of-words dict data structure that I have into a k-means function? Do I need to change it at all first?
- seralouk (almost 7 years): I will post some websites in an answer. There are some examples and answers. Hope this helps.
- Has QUIT--Anony-Mousse (almost 7 years): K-means does not work well on short text.