Clustering categorical data using Jaccard similarity


Solution 1

Interpreted Python code is slow. Really slow.

That is why good Python toolkits contain plenty of Cython code and even C and Fortran code (e.g. matrix operations in numpy), and only use Python for driving the overall process.

You may be able to speed up your code substantially by using as much numpy as possible, or by using Cython instead.

Instead of fighting with centroids, consider using a distance-based clustering algorithm (a short sketch with a precomputed distance matrix follows this list):

  • Hierarchical Agglomerative Clustering (HAC), which expects a distance matrix
  • DBSCAN, which can work with arbitrary distances. It does not even need a distance matrix, only a list of similar items for some threshold.
  • K-medoids/PAM of course is worth a try, too; but usually not very fast.
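
For instance, here is a minimal sketch of the DBSCAN route with a precomputed Jaccard distance matrix, assuming data_set is a list of Python sets (one per row, as built in Solution 2 below) and that scikit-learn is available; the eps and min_samples values are placeholders to tune:

from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform
from sklearn.cluster import DBSCAN

def jaccard_distance(s1, s2):
    # Jaccard distance = 1 - similarity, so the result is a proper distance matrix
    intersection = len(s1 & s2)
    return 1.0 - intersection / float(len(s1) + len(s2) - intersection)

# condensed upper-triangle vector -> full square distance matrix
condensed = np.fromiter(
    (jaccard_distance(a, b) for a, b in combinations(data_set, 2)),
    dtype=np.float64)
dist_matrix = squareform(condensed)

# DBSCAN accepts the precomputed square distance matrix directly
labels = DBSCAN(eps=0.4, min_samples=5, metric='precomputed').fit_predict(dist_matrix)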

Solution 2

First of all, the way you calculate Jaccard seems to be inefficient (if not erroneous). You are using nested for loops, which are probably the slowest way to do this in Python. I would recommend using Python's set to store the rows. Sets provide fast intersection, because they are hash tables and all the calculations are performed in C/C++, not in Python itself. Imagine r1 and r2 are two rows.

r1 = set(some_row1)
r2 = set(some_row2)
intersection_len = len(r1.intersection(r2))
union_len = len(r1) + len(r2) - intersection_len
jaccard = intersection_len / float(union_len)

Set construction is expensive, thus you should initially store all your rows as sets. Then you should get rid of the

for i in range(0,len(data_set)):
    for j in range(0,len(data_set)):

part as well. Use itertools instead. Let's assume data_set is a list of rows.

for row1, row2 in itertools.combinations(data_set, r=2):
    ...

This runs a lot faster and removes the need for the if j>=i check; this way you get the upper triangle of the matrix. Let's make a sketch of the final algorithm. Update: adding numpy.

from scipy.spatial import distance
from itertools import combinations
import numpy as np


def jaccard(set1, set2):
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / float(union_len)

original_data_set = [row1, row2, row3,..., row_m]
data_set = [set(row) for row in original_data_set]

jaccard_generator = (jaccard(row1, row2) for row1, row2 in combinations(data_set, r=2))
flattened_matrix = np.fromiter(jaccard_generator, dtype=np.float64)

# since flattened_matrix is the flattened upper triangle of the matrix
# we need to expand it.
normal_matrix = distance.squareform(flattened_matrix)
# replacing zeros with ones at the diagonal. 
normal_matrix += np.identity(len(data_set))

That's it. You've got your matrix. From this point you might consider taking this block of code and porting it to Cython (there isn't much work left to do, you only need to define the jaccard function in a slightly different way, i.e. add type declarations for local variables). Something like:

cpdef double jaccard(set set1, set set2):
    cdef long intersection_len, union_len  # or consider int
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / <double> union_len

but I'm not sure this will compile correctly (my Cython experience is very limited)
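
One hedged way to try the snippet, assuming it is saved as fast_jaccard.pyx (a hypothetical file name): pyximport compiles .pyx modules on import, so no setup.py is needed for a quick test.

import pyximport
pyximport.install()

from fast_jaccard import jaccard  # hypothetical module name

print(jaccard({1, 2, 3}, {2, 3, 4}))  # prints 0.5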

P.S. You can use numpy arrays instead of sets, since np.intersect1d also runs in C, but it works by sorting its inputs, which is roughly O(n log n), while intersection of two hash tables (set objects) takes O(n) time, provided the collision rate is close to zero.
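
As a quick illustration with made-up rows, both approaches return the same count; only the underlying algorithm differs:

import numpy as np

row_a = [1, 3, 5, 7]
row_b = [3, 5, 9]

set_count = len(set(row_a) & set(row_b))        # hash-based intersection, ~O(n)
arr_count = len(np.intersect1d(row_a, row_b))   # sort-based intersection, ~O(n log n)
assert set_count == arr_count == 2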

Author: Sam. Updated on June 04, 2022.

Comments

  • Sam (almost 2 years ago)

    I am trying to build a clustering algorithm for categorical data.

    I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build my own and compare its accuracy and cost to the others.

    I have (m) training examples and (n=22) features.

    Approach

    My approach is simple:

    • Step 1: I calculate the Jaccard similarity between each pair of my training examples, forming an (m*m) similarity matrix.
    • Step 2: Then I perform some operations to find the best centroids and find the clusters by using a simple k-means approach.

    The similarity matrix created in step 1 is then used while performing the k-means algorithm.

    Matrix creation:

    total_columns=22
    for i in range(0,len(data_set)):
        for j in range(0,len(data_set)):
            if j>=i:
                # Calculating jaccard similarity between two data rows i and j
                common_count = 0
                for column in data_set.columns:
                    if data_orig[column][j]==data_new[column][i]:
                        common_count = common_count+1
                probability = common_count/float(total_columns)
                fnl_matrix[i][j] = probability
                fnl_matrix[j][i] = probability
    

    Partial snapshot of my fnl_matrix (for 6 rows) is given below:

    [image: partial snapshot of fnl_matrix]

    Problem Statement:

    The problem I face is that when I create the (m*m) matrix, for a larger data set my performance goes for a toss. Even for a smaller dataset with 8,000 rows, creating the similarity matrix takes an unbearable amount of time. Is there any way I could tune my code, or do something to the matrix, that would be cost-efficient?

    • Eli Korvigo (almost 9 years ago)
      In case you've already seen my answer, I've just added some dramatic improvements and further recommendations.
  • Eli Korvigo (almost 9 years ago)
    In fact, Python built-ins and most extensively used modules (like itertools) are entirely written in C/C++, hence it's sufficient to get rid of the for loops and pure Python function calls in every possible place. Using as much numpy as possible may slow things down if numpy is not used wisely.
  • Sam (almost 9 years ago)
    Thanks man, for the awesome explanation. Your code really helped me a lot
  • Eli Korvigo (almost 9 years ago)
    @user2404193 You're welcome. I've added a sketch of the Cythonized jaccard function.
  • Sam (almost 9 years ago)
    Eli, by using the above technique all my diagonal elements turn to 0; however, I want to keep the diagonals as 1. I am aware that I can update all the diagonals to 1 after the matrix formation. Can I do the same on the fly, i.e. while the matrix is being formed?
  • Eli Korvigo (almost 9 years ago)
    @Sam I'm not aware of such a function. Nevertheless the shortest solution is to add the identity matrix of size m to the expanded matrix, where m is the number of rows in the data_set, i.e. normal_matrix += np.identity(m). This will turn all diagonal elements into 1. The performance overhead is close to none.
  • Sam (almost 9 years ago)
    Yeah, I get that, but can't I do something on the fly, while the matrix is being created? Because the identity matrix can only be added once the matrix is ready.
  • Eli Korvigo (almost 9 years ago)
    @Sam As I said, there is no such function in SciPy/NumPy. The thing is that these functions are meant to create distance matrices with zeros on the diagonal by definition. Another way around this caveat would be to use the Jaccard distance instead of similarity, i.e. return 1 - intersection_len/union_len in the jaccard function. In this case the definition of a distance matrix will hold and you won't need to add the identity matrix. Additionally, this solution will let you use SciPy's hierarchical clustering, since those functions require a distance matrix. Google scipy.cluster.hierarchy (a short sketch of this variant follows the comments).
  • Sam (almost 9 years ago)
    Hi Eli, I have faced another problem; I have defined the problem statement under a different question and would like your inputs on it. Please find the question at stackoverflow.com/questions/30144462/…
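
A minimal sketch of the distance-based variant from the comment above, assuming the answer's jaccard function is changed to return 1 - intersection_len/union_len so that flattened_matrix holds distances rather than similarities:

from scipy.cluster.hierarchy import linkage, fcluster

# The condensed distance vector can be fed to SciPy's hierarchical clustering
# directly; no identity-matrix fix-up is needed, because a distance matrix
# has zeros on the diagonal by definition.
Z = linkage(flattened_matrix, method='average')
labels = fcluster(Z, t=5, criterion='maxclust')  # e.g. cut the tree into 5 clusters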