Clustering categorical data using Jaccard similarity
Solution 1
Interpreted Python code is slow. Really slow.
That is why the good Python toolkits contain plenty of Cython code, and even C and Fortran code (e.g. matrix operations in numpy), and only use Python for driving the overall process.
You may be able to speed up your code substantially if you try to use as much numpy as possible, or if you use Cython instead.
Instead of fighting centroids, consider using a distance-based clustering algorithm:
- Hierarchical Agglomerative Clustering (HAC), which expects a distance matrix
- DBSCAN, which can work with arbitrary distances. It does not even need a distance matrix, only a list of similar items for some threshold.
- K-medoids/PAM of course is worth a try, too; but usually not very fast.
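Of the three, DBSCAN is perhaps the easiest to try: scikit-learn's implementation accepts a precomputed distance matrix directly. A minimal sketch (the toy distances, eps, and min_samples below are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy precomputed Jaccard *distance* matrix (1 - similarity) for 4 items;
# the values are made up for illustration.
dist = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.15],
    [0.8, 0.9, 0.15, 0.0],
])

# metric='precomputed' tells DBSCAN that `dist` already holds pairwise
# distances, so no feature vectors are needed.
labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit_predict(dist)
print(labels)  # items 0 and 1 end up together, as do items 2 and 3
```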
Solution 2
First of all, the way you calculate Jaccard seems to be inefficient (if not erroneous). You are using a for loop, which is probably the slowest way to do stuff in Python. I would recommend you utilize Python's set to store the rows. Sets provide fast intersection, because they are hash tables and all the calculations are performed in C/C++, not in Python itself. Imagine r1 and r2 are two rows.
r1 = set(some_row1)
r2 = set(some_row2)
intersection_len = len(r1.intersection(r2))
union_len = len(r1) + len(r2) - intersection_len
jaccard = intersection_len / union_len
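For example, with two small rows (values chosen arbitrarily):

```python
# {'b', 'c'} is shared, so the intersection has 2 elements
r1 = set(['a', 'b', 'c'])
r2 = set(['b', 'c', 'd'])

intersection_len = len(r1.intersection(r2))       # 2
union_len = len(r1) + len(r2) - intersection_len  # 3 + 3 - 2 = 4
jaccard = intersection_len / union_len
print(jaccard)  # 0.5
```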
Set construction is expensive, thus you should initially store all your rows as sets. Then you should get rid of the
for i in range(0, len(data_set)):
    for j in range(0, len(data_set)):
part as well. Use itertools
instead. Let's assume data_set is a list of rows.
for row1, row2 in itertools.combinations(data_set, r=2):
...
This thing runs a lot faster and kills the need for the if j >= i check. This way you get the upper triangle of the matrix. Let's make a sketch of the final algorithm. Update: adding numpy.
from scipy.spatial import distance
from itertools import combinations
import numpy as np
def jaccard(set1, set2):
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / union_len
original_data_set = [row1, row2, row3,..., row_m]
data_set = [set(row) for row in original_data_set]
jaccard_generator = (jaccard(row1, row2) for row1, row2 in combinations(data_set, r=2))
flattened_matrix = np.fromiter(jaccard_generator, dtype=np.float64)
# since flattened_matrix is the flattened upper triangle of the matrix
# we need to expand it.
normal_matrix = distance.squareform(flattened_matrix)
# replacing zeros with ones at the diagonal.
normal_matrix += np.identity(len(data_set))
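To see the whole pipeline work end to end, here is the sketch run on three toy rows (the rows are made up; the expected values follow from the pairwise Jaccard similarities):

```python
from itertools import combinations

import numpy as np
from scipy.spatial import distance

def jaccard(set1, set2):
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / union_len

data_set = [{'a', 'b'}, {'b', 'c'}, {'c', 'd'}]

# Upper triangle of the similarity matrix, flattened
flattened_matrix = np.fromiter(
    (jaccard(r1, r2) for r1, r2 in combinations(data_set, r=2)),
    dtype=np.float64)

# Expand to a full square matrix and set the diagonal to 1
normal_matrix = distance.squareform(flattened_matrix)
normal_matrix += np.identity(len(data_set))
print(normal_matrix)  # diagonal is 1.0; rows 0 and 2 share nothing, so [0][2] is 0.0
```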
That's it. You've got your matrix. From this point you might consider taking this block of code and porting it to Cython (there isn't much work left to do, you only need to define the jaccard
function in a slightly different way, i.e. add type declarations for local variables). Something like:
cpdef double jaccard(set set1, set set2):
    cdef long intersection_len, union_len  # or consider int
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / <double>union_len
but I'm not sure this will compile correctly (my Cython experience is very limited).
P.S.
You can use numpy arrays instead of sets, because they provide a similar intersection routine (np.intersect1d) and run in C as well, but intersecting two arrays takes roughly O(n log n) time (the arrays are sorted first), while intersecting two hash tables (set objects) takes O(n) time, provided the collision rate is close to zero.
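For completeness, the numpy counterpart is np.intersect1d (a quick sketch with toy arrays):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([3, 4, 5])

# np.intersect1d sorts its inputs internally, hence the O(n log n) cost
common = np.intersect1d(a, b)
print(common)  # [3 4]

# A hash-based set intersection gives the same elements in O(n) on average
assert set(common) == set(a) & set(b)
```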
Sam
Updated on June 04, 2022
Comments
- Sam almost 2 years:
I am trying to build a clustering algorithm for categorical data.
I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build my own and compare its accuracy and cost to the others.
I have (m) training examples and (n=22) features.
Approach
My approach is simple:
- Step 1: I calculate the Jaccard similarity between each pair of my training examples, forming an (m*m) similarity matrix.
- Step 2: Then I perform some operations to find the best centroids and find the clusters by using a simple k-means approach.
The similarity matrix I create in step 1 is then used while performing the k-means algorithm.
Matrix creation:
total_columns = 22
for i in range(0, len(data_set)):
    for j in range(0, len(data_set)):
        if j >= i:
            # Calculating Jaccard similarity between two data rows i and j
            common_count = 0
            for column in data_set.columns:
                if data_orig[column][j] == data_new[column][i]:
                    common_count = common_count + 1
            probability = common_count / float(total_columns)
            fnl_matrix[i][j] = probability
            fnl_matrix[j][i] = probability
A partial snapshot of my fnl_matrix (for 6 rows) is given below:
[image of the 6x6 similarity matrix omitted]
Problem Statement:
The problem I face is that when I create the (m*m) matrix for a larger data set, my performance goes for a toss. Even for a smaller dataset with 8000 rows, the creation of the similarity matrix takes unbearable time. Is there any way I could tune my code or do something to the matrix that would be cost efficient?
- Eli Korvigo almost 9 years: In case you've already seen my answer, I've just added some dramatic improvements and further recommendations.
- Eli Korvigo almost 9 years: In fact, Python built-ins and the most extensively used modules (like itertools) are entirely written in C/C++, hence it's sufficient to get rid of the for loops and pure Python function calls in every possible place. Using as much numpy as possible may slow things down if numpy is not used wisely.
- Sam almost 9 years: Thanks man, for the awesome explanation. Your code really helped me a lot.
- Eli Korvigo almost 9 years: @user2404193 You're welcome. I've added a sketch of the Cythonized jaccard function.
- Sam almost 9 years: Eli, by using the above technique all my diagonals turn to 0; however, I want to keep the diagonals as 1. I am aware that I can update all the diagonals to 1 after the matrix formation. Can I do the same on the fly, i.e. while the matrix is being formed?
- Eli Korvigo almost 9 years: @Sam I'm not aware of such a function. Nevertheless, the shortest solution is to add the identity matrix of size m to the expanded matrix, where m is the number of rows in the data_set, i.e. normal_matrix += np.identity(m). This will turn all diagonal elements into 1. The performance overhead is close to none.
- Sam almost 9 years: Ya, I get that, but can't I do something on the fly, when the matrix is getting created? Because the identity matrix can only be added once the matrix is ready.
- Eli Korvigo almost 9 years: @Sam I told you that there is no such function in SciPy/NumPy. The thing is that these functions are meant to create distance matrices with zeros on the diagonal by definition. Another way around this caveat would be to use the Jaccard distance instead of similarity, i.e. return 1 - intersection_len / union_len in the jaccard function. In this case the definition of a distance matrix will hold and you won't need to add the identity matrix. Additionally, this solution will enable you to use SciPy's hierarchical clustering, since those functions require a distance matrix. Google scipy.cluster.hierarchy.
- Sam almost 9 years: Hi Eli, I have faced another problem; I have defined it under a different question and would like your inputs on that. Please find the question at stackoverflow.com/questions/30144462/…
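Following up on the hierarchical-clustering suggestion in the comments, here is a minimal sketch with SciPy (the toy rows and the cut threshold t=0.6 are illustrative):

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy rows as sets, standing in for the question's categorical rows
data_set = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'x', 'y'}, {'x', 'y', 'z'}]

def jaccard_distance(s1, s2):
    # 1 - similarity, so this is a proper distance with zeros on the diagonal
    inter = len(s1 & s2)
    return 1 - inter / (len(s1) + len(s2) - inter)

# Condensed (upper-triangle) distance vector, the form linkage expects
condensed = np.array([jaccard_distance(a, b)
                      for a, b in combinations(data_set, r=2)])

# Average-linkage tree, cut at distance 0.6
labels = fcluster(linkage(condensed, method='average'), t=0.6,
                  criterion='distance')
print(labels)  # rows 0 and 1 share a cluster; rows 2 and 3 share another
```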