Hierarchical clustering of 1 million objects

python machine-learning cluster-analysis data-mining hierarchical-clustering

21,714

Solution 1

To beat O(n^2), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:

build a hierarchical tree from say 15k points, then add the rest one by one: time ~ 1M * treedepth
first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres.

How well either of these might work depends critically on the size and shape of your target tree -- how many levels, how many leaves ?
What software are you using, and how many hours / days do you have to do the clustering ?

For the flat-cluster approach, K-d_tree s work fine for points in 2d, 3d, 20d, even 128d -- not your case. I know hardly anything about clustering text; Locality-sensitive_hashing ?

Take a look at scikit-learn clustering -- it has several methods, including DBSCAN.

Added: see also
google-all-pairs-similarity-search "Algorithms for finding all similar pairs of vectors in sparse vector data", Beyardo et el. 2007
SO hierarchical-clusterization-heuristics

Solution 2

The problem probably is that they will try to compute the full 2D distance matrix (about 8 GB naively with double precision) and then their algorithm will run in O(n^3) time anyway.

You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow and the results are not at all convincing usually. In particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.

If you really want to continue hierarchical clustering, I belive that ELKI (Java though) has a O(n^2) implementation of SLINK. Which at 1 million objects should be approximately 1 million times as fast. I don't know if they already have CLINK, too. And I'm not sure if there actually is any sub-O(n^3) algorithm for other variants than single-link and complete-link.

Consider using other algorithms. k-means for example scales very well with the number of objects (it's just not very good usually either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low dimensional, they can be accelerated quite well with an appropriate index structure. They should then run in O(n log n), if you have an index with O(log n) query time. Which can make a huge difference for large data sets. I've personally used OPTICS on a 110k images data set without problems, so I can imagine it scales up well to 1 million on your system.

21,714

Author by

Atish Kathpal

Updated on October 14, 2020

Comments

Atish Kathpal over 3 years

Can anyone point me to a hierarchical clustering tool (preferable in python) that can cluster ~1 Million objects? I have tried hcluster and also Orange.

hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (saturated memory and eventually crashed).

I am running on a 64bit Xeon CPU (2.53GHz) and 8GB of RAM + 3GB swap on Ubuntu 11.10.