Memory Error when calculating pairwise distances in scipy


Solution 1

Well, hierarchical clustering doesn't make that much sense for large datasets; it's mostly a textbook example, in my opinion. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14,000 objects the dendrogram becomes pretty much unusable, and very few implementations of hierarchical clustering have non-trivial methods for extracting sensible clusters from the dendrogram. Plus, in the general case, hierarchical clustering has O(n^3) complexity, which makes it scale really badly to large datasets.

DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix it will be slow, since computing the distance matrix already takes O(n^2) time. And even then, you can save the O(n^2) memory cost for DBSCAN by computing the distances on the fly, at the cost of computing each distance twice. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to reduce even that, since DBSCAN only needs to know which objects are below the epsilon threshold. When epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory than O(n^2), at the same CPU cost as computing the distance matrix.
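Something along these lines (a rough sketch, not tested on your data; I'm assuming your 14039 x 10 array is called allUserVector as in your question, and the eps / min_samples values are made up for illustration):

    import numpy as np
    from scipy.spatial.distance import cdist

    def radius_neighbors(X, eps, chunk=1000):
        """Yield (point index, indices of neighbors within eps), one chunk of rows at a time."""
        n = X.shape[0]
        for start in range(0, n, chunk):
            stop = min(start + chunk, n)
            # Only a chunk x n block of distances exists at any moment;
            # it is discarded before the next chunk is computed.
            d = cdist(X[start:stop], X, metric='cosine')
            for i, row in enumerate(d):
                yield start + i, np.flatnonzero(row <= eps)

    # Example: count core points for min_samples = 10 (illustrative values only).
    # core = sum(len(nbrs) >= 10 for _, nbrs in radius_neighbors(allUserVector, 0.3))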

Any really good implementation of DBSCAN (it is spelled all uppercase, by the way, since it is an abbreviation, not a "scan"), however, should have support for index structures, and then run in O(n log n) time.
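For example, scikit-learn's DBSCAN (my suggestion, not something your question uses) can back its neighborhood queries with a ball tree. Cosine distance itself is not tree-friendly, but on L2-normalized vectors Euclidean and cosine neighborhoods coincide, since ||a - b||^2 = 2 * d_cosine(a, b). A sketch, with a made-up threshold:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import normalize

    X = normalize(allUserVector)          # L2-normalize so Euclidean neighborhoods match cosine ones
    eps_cosine = 0.3                      # hypothetical threshold, tune for your data
    labels = DBSCAN(eps=np.sqrt(2 * eps_cosine),   # equivalent Euclidean radius
                    min_samples=10,
                    algorithm='ball_tree').fit_predict(X)
    # labels of -1 mark noise; all other values are cluster ids.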

On http://elki.dbs.ifi.lmu.de/wiki/Benchmarking they run DBSCAN on a 110,250-object dataset with 8 dimensions; the non-indexed variant takes 1446 seconds, the one with an index just 219. That is about 7 times faster, including index construction. (It's not Python, however.) Similarly, OPTICS is 5 times faster with the index. And in my experiments their k-means implementation was around 6x faster than WEKA's k-means while using much less memory. Their single-link hierarchical clustering is also an optimized O(n^2) implementation; it's actually the only one I've seen so far that is not the naive O(n^3) matrix-editing approach. If you are willing to go beyond Python, that might be a good choice.

Solution 2

It's possible that you really are running out of RAM. Finding pairwise distances between N objects means storing N^2 distances. In your case, N^2 is 14039^2 = 1.97 * 10^8. If we assume each distance takes only four bytes (which is almost certainly not the case, since they have to be held in some sort of data structure that may have non-constant overhead), that works out to roughly 800 megabytes. That's a lot of memory for the interpreter to be working with. 32-bit architectures only allow up to 2 GB of process memory, and your raw data alone takes up around 40% of that. With the overhead of the data structure you could be looking at much higher usage; I can't say how much, because I don't know the memory model behind SciPy/NumPy.
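A quick back-of-envelope check (my numbers, not yours): SciPy's pdist actually returns a condensed float64 vector of N(N-1)/2 entries, which lands in the same ballpark as the estimate above:

    n = 14039
    full = n * n                  # ~1.97e8 entries if a full square matrix were stored
    condensed = n * (n - 1) // 2  # what pdist actually returns, as float64
    print(full * 4 / 1e6)         # ~788 MB at 4 bytes per entry
    print(condensed * 8 / 1e6)    # ~788 MB for the condensed float64 vector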

I would try breaking your data set up into smaller sets, or not constructing the full distance matrix. You can break it down into more manageable chunks (say, 14 subsets of around 1000 elements) and do nearest-neighbor search between each chunk and all of the vectors; then you're loading an order of magnitude less into memory at any one time (14000 * 1000, 14 times, instead of 14000 * 14000 once).
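A sketch of that chunking idea (again assuming your array is called allUserVector; the chunk size of 1000 is just the example from above):

    import numpy as np
    from scipy.spatial.distance import cdist

    def nearest_neighbors_chunked(X, chunk=1000):
        """For each row of X, return the index of its nearest other row (cosine distance)."""
        n = X.shape[0]
        nn = np.empty(n, dtype=np.intp)
        for start in range(0, n, chunk):
            stop = min(start + chunk, n)
            d = cdist(X[start:stop], X, metric='cosine')   # only a chunk x n block in memory
            # Mask out each point's distance to itself before taking the argmin.
            d[np.arange(stop - start), np.arange(start, stop)] = np.inf
            nn[start:stop] = d.argmin(axis=1)
        return nn

    # nn = nearest_neighbors_chunked(allUserVector)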

Edit: agf is completely right on both counts: I missed your edit, and the problem probably arises when it tries to construct the giant string that represents your matrix. If it's printing floating-point values, and we assume 10 characters are printed per element and the string is stored with one byte per character, then you're looking at roughly 2 GB of memory usage just for the string.
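If the printout was only meant for inspection, the simplest workaround is not to force the full repr (your set_printoptions(threshold='nan') call is what triggers the giant string). For example:

    from scipy.spatial.distance import pdist

    Y = pdist(allUserVector, 'cosine')        # condensed distance vector
    print(Y[:20])                             # look at a slice instead of everything
    print("%d distances, %.0f MB" % (Y.size, Y.nbytes / 1e6))   # cheap summary, no giant string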


Comments

  • Maxwell almost 2 years

    I am trying to apply hierarchical clustering to my dataset, which consists of 14039 vectors of users. Each vector has 10 features, where each feature is basically the frequency of tags tagged by that user. I am using the SciPy API for clustering. Now I need to calculate pairwise distances between these 14039 users and pass this distance matrix to the linkage function.

      import scipy.cluster.hierarchy as sch
      Y = sch.distance.pdist( allUserVector,'cosine')
      set_printoptions(threshold='nan')
      print Y
    

    But my program gives me a MemoryError while calculating the distance matrix itself:

      File "/usr/lib/pymodules/python2.7/numpy/core/numeric.py", line 1424, in array_str
        return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
      File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 306, in array2string
        separator, prefix)
      File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 210, in _array2string
        format_function = FloatFormat(data, precision, suppress_small)
      File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 392, in __init__
        self.fillFormat(data)
      File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 399, in fillFormat
        non_zero = absolute(data.compress(not_equal(data, 0) & ~special))
    MemoryError
    

    Any idea how to fix this? Is my dataset too large? But I guess clustering 14k users shouldn't be so much that it causes a MemoryError. I am running it on an i3 with 4 GB of RAM. I need to apply DBScan clustering too, but that also needs a distance matrix as input.

    Any suggestions appreciated.

    Edit: I get the error only when I print Y. Any ideas why?

  • agf about 12 years
    I think you missed his edit. He gets the error only when trying to print the entire distance matrix. It's constructing the huge string that is exhausting his memory.
  • ximiki over 6 years
    The pre-edited answer was really helpful for a related problem. Please keep this answer so I can link other questions to it.