How can I cluster a graph in Python?


Solution 1

Why not use a real graph library, like Python-Graph? It has a function to determine connected components (though no example is provided). I'd imagine a dedicated library is going to be faster than whatever ad-hoc graph code you've cooked up.

EDIT: NetworkX seems like it might be a better choice than python-graph; its documentation (here for the connected components function) certainly is.
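
For illustration, a minimal sketch with NetworkX (assuming a recent version; the 5x5 adjacency matrix is made up to mirror the example in the question):

    import networkx as nx
    import numpy

    # Made-up adjacency matrix: only nodes 1 and 3 are linked.
    M = numpy.zeros((5, 5))
    M[1, 3] = M[3, 1] = 1

    G = nx.from_numpy_array(M)               # undirected graph from the matrix
    print(list(nx.connected_components(G)))  # [{0}, {1, 3}, {2}, {4}]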

Solution 2

In SciPy you can use sparse matrices. Note also that there are more efficient ways of multiplying a matrix by itself. In any case, what you're trying to do can be done with an SVD (singular value decomposition).

Introduction with useful links.
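
Short of the SVD route, scipy can also extract connected components from a sparse matrix directly (scipy.sparse.csgraph); a minimal sketch with a made-up 5x5 matrix:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    # Made-up sparse adjacency matrix: node 1 linked to node 3.
    M = csr_matrix((np.ones(2), ([1, 3], [3, 1])), shape=(5, 5))

    # One call labels every node; no repeated matrix multiplication needed.
    n_components, labels = connected_components(M, directed=False)
    print(n_components, labels)  # 4 [0 1 2 1 3]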

Solution 3

There are also graph_tool and networkit, which have efficient routines for connected components and store the network compactly. If you're going to work with millions of nodes, networkx will likely not be sufficient (as far as I know it's pure Python). Both tools are written in C++, so they can handle analysis of large graphs with reasonable run times.

As Phil points out, your method will have horribly long compute times for large graphs (we're talking days, weeks, months...), and a dense matrix representation of a graph with a million nodes has 10^12 entries, which is on the order of terabytes of memory!
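
A minimal sketch with networkit (the toy graph is made up; API names as in recent networkit releases):

    import networkit as nk

    # Made-up toy graph: 5 nodes, a single edge between nodes 1 and 3.
    G = nk.Graph(5)
    G.addEdge(1, 3)

    cc = nk.components.ConnectedComponents(G)
    cc.run()
    print(cc.numberOfComponents())  # 4
    print(cc.getComponents())       # one list of node ids per component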

Solution 4

Finding an optimal graph partition is an NP-hard problem, so whatever the algorithm, it is going to be an approximation or a heuristic. Not surprisingly, different clustering algorithms produce (wildly) different results.

Python implementation of Newman's modularity algorithm: modularity

Also: MCL, MCODE, CFinder, NeMo, clusterONE
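
For a pure-Python starting point, NetworkX also ships a greedy modularity clustering (Clauset-Newman-Moore), distinct from the implementations linked above; a minimal sketch on a made-up graph of two triangles joined by one edge:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Made-up graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
    G = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)])

    communities = greedy_modularity_communities(G)
    print([sorted(c) for c in communities])  # [[0, 1, 2], [3, 4, 5]]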

Solution 5

As others have pointed out, no need to reinvent the wheel. A lot of thought has been put into optimal clustering techniques. Here is one well-known clustering program.


Comments

  • amg almost 4 years

    Let G be a graph: a set of nodes and a set of links. I need to find a fast way to partition it. The graph I am working on now has only 120*160 nodes, but I might soon be working on an equivalent problem, in another context (not medicine, but website development), with millions of nodes.

    So what I did was store all the links in an adjacency matrix:

    M=numpy.mat(numpy.zeros((len(data.keys()),len(data.keys()))))
    

    Now M holds a 1 in position s,t if node s is connected to node t. I make sure M is symmetric (M[s,t]=M[t,s]) and that each node links to itself (M[s,s]=1).

    If I remember correctly, multiplying M by M gives a matrix representing the graph in which vertices are connected if they can be reached in at most two steps.

    So I keep multiplying M by itself until the number of zeros in the matrix no longer decreases. At that point M encodes the connected components. And now I need to cluster this matrix.
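
    (As a quick check of that claim: with a made-up 3-node path 0-1-2 plus self-loops, squaring connects the endpoints, which are two steps apart.)

    import numpy

    # Path 0-1-2 with self-loops on the diagonal.
    M = numpy.mat([[1, 1, 0],
                   [1, 1, 1],
                   [0, 1, 1]])
    print(M * M)  # entry (0, 2) becomes nonzero: 0 reaches 2 in two steps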

    Up to now I am pretty satisfied with the algorithm. I think it is easy, elegant, and reasonably fast. The part I am having trouble with is what comes next.

    Essentially I need to split this graph into its connected components.

    I can go through all the nodes and see what they are connected to.

    Another option would be to sort the matrix by reordering its rows, but I don't know whether that is possible.

    What follows is the code so far:

    def findzeros(M):
        # Count the zero entries of M (equivalently: (M == 0).sum()).
        nZeros = 0
        for t in M.flat:
            if not t:
                nZeros += 1
        return nZeros
    
    # Build the adjacency matrix: link s and t when their data correlate
    # above the threshold; every node is also linked to itself.
    M = numpy.mat(numpy.zeros((len(data.keys()), len(data.keys()))))
    for s in data.keys():
        M[s, s] = 1
        for t in data.keys():
            if t < s:
                if scipy.corrcoef(data[t], data[s])[0, 1] > threshold:
                    M[s, t] = 1
                    M[t, s] = 1

    # Square M until the number of zeros stops decreasing: at that point
    # M[s, t] is nonzero exactly when s and t are in the same component.
    nZeros = findzeros(M)
    M2 = M * M
    nZeros2 = findzeros(M2)

    while nZeros - nZeros2:
        nZeros = nZeros2
        M = M2
        M2 = M * M
        nZeros2 = findzeros(M2)
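
    Once M has converged, one way to extract the components: the nonzero pattern of row s is exactly the indicator of s's component, so grouping rows by that pattern yields the clusters. A sketch:

    # Group nodes by the nonzero pattern of their row in the converged M.
    components = {}
    for i, pattern in enumerate(numpy.asarray(M) != 0):
        components.setdefault(tuple(pattern), []).append(i)
    print(list(components.values()))  # one list of node indices per component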
    

    Edit:

    It has been suggested that I use an SVD decomposition. Here is a simple example of the problem on a 5x5 graph; I use a small example because the clusters are not easy to see in the full 19200x19200 matrix.

    import numpy

    # 5x5 example: self-loops on every node, plus a link between nodes 1 and 3.
    M = numpy.mat(numpy.zeros((5, 5)))

    M[1, 3] = 1
    M[3, 1] = 1
    M[1, 1] = 1
    M[2, 2] = 1
    M[3, 3] = 1
    M[4, 4] = 1
    M[0, 0] = 1

    print(M)

    u, s, vh = numpy.linalg.svd(M)
    print(u)
    print(s)
    print(vh)
    

    Essentially there are 4 clusters here: (0),(1,3),(2),(4). But I still don't see how the SVD can help in this context.
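
    (One standard spectral fact does connect decompositions to components, though it involves the graph Laplacian rather than M itself: the multiplicity of the eigenvalue 0 equals the number of connected components. A sketch on the same example:)

    import numpy

    # Adjacency matrix without self-loops: only nodes 1 and 3 are linked.
    A = numpy.zeros((5, 5))
    A[1, 3] = A[3, 1] = 1

    # Graph Laplacian L = D - A, where D is the diagonal matrix of degrees.
    L = numpy.diag(A.sum(axis=1)) - A

    eigvals = numpy.linalg.eigvalsh(L)
    print((eigvals < 1e-10).sum())  # 4, the number of connected components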

  • amg about 15 years
    Thank you. I looked up the resource, but I honestly do not see how it can help. I updated the question with a simple example of how SVD does not seem to solve it. Or maybe I am using it wrongly? But how then? Thanks in any case :)
  • amg about 15 years
    Thank you! Looks like a great resource. I shall investigate it thoroughly.
  • vartec about 15 years
    That's SVD (Singular Value Decomposition). Basically, for something as large as millions of nodes you'll need an approximation algorithm rather than an exact one (optimal graph clustering is NP-hard). The article has links to papers explaining such algorithms.
  • vartec about 15 years
    BTW. are you trying to reinvent PageRank or HITS?
  • amg about 15 years
    Not really. Right now I am just sorting which data belong to which biological cell. In the future I have an equivalent problem that will eventually generate a search engine, but not on pages, and not using links. (Can't say more at this stage :) ). In any case, congratulations! Well spotted, LOL.
  • amg about 15 years
    Thanks. I admit the SVD suggestion totally went over my head. I am aware that graph partitioning is a well studied problem, so I was hoping to get some good insights when I posted here. But I also wanted to write what I knew, to show my good will :-)
  • vartec about 15 years
    Latent Semantic Analysis then? ;-) OK, I won't press you for details. Just keep in mind that what is possible at small scale gets really complicated at large scale. Most graph algorithms have high polynomial complexity, so it is not feasible to use them on a million nodes.
  • Phil H about 15 years
    I think the key is to decide whether you want to learn about partitioning enough to rewrite the software on it (probably not), or whether you just want to partition a graph. If you decide just to use existing solutions, pick a library and use it. Seek to solve it at the highest level.
  • amg about 15 years
    I tried to install PyMetis, but it seems to have a hard time installing; there seems to be no configuration file. Looking for the easiest way out, I shall install networkx instead. Thanks, Pietro