Cutting SciPy hierarchical dendrogram into clusters via a threshold value

16,964

Here's the answer - I didn't add 'distance' as an option to fcluster. With it, I get the correct (3) cluster assignments.

assignments = fcluster(linkage(distanceMatrix, method='complete'),4,'distance')

print assignments
       [3 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

cluster_output = pandas.DataFrame({'team':df.teamID.tolist() , 'cluster':assignments})

print cluster_output
    cluster team
0         3  NYA
1         2  BOS
2         2  PHI
3         2  CHA
4         2  SFN
5         2  LAN
6         2  TEX
7         2  ATL
8         2  SLN
9         2  SEA
10        2  NYN
11        2  HOU
12        1  BAL
13        2  DET
14        1  ARI
15        2  CHN
16        1  CLE
17        1  CIN
18        1  TOR
19        1  COL
20        1  OAK
21        1  MIL
22        1  MIN
23        1  SDN
24        1  KCA
25        1  TBA
26        1  FLO
27        1  PIT
28        1  LAA
29        1  WAS
30        1  ANA
31        1  MON
32        1  MIA
Share:
16,964

Related videos on Youtube

Bryan
Author by

Bryan

Updated on September 15, 2022

Comments

  • Bryan
    Bryan over 1 year

    I'm trying to use SciPy's dendrogram method to cut my data into a number of clusters based on a threshold value. However, once I create a dendrogram and retrieve its color_list, there is one fewer entry in the list than there are labels.

    Alternatively, I've tried using fcluster with the same threshold value I identified in dendrogram; however, this does not render the same result -- it gives me one cluster instead of three.

    here's my code.

    import pandas
    data = pandas.DataFrame({'total_runs': {0: 2.489857755536053,
    1: 1.2877651950650333, 2: 0.8898850111727028, 3: 0.77750321282732704, 4: 0.72593099987615461, 5: 0.70064977003207007, 
    6: 0.68217502514600825, 7: 0.67963194285399975, 8: 0.64238326692987524, 9: 0.6102581538587678, 10: 0.52588765899448564,
    11: 0.44813665774322564, 12: 0.30434031343774476, 13: 0.26151929543260161, 14: 0.18623657993534984, 15: 0.17494230269731209,
    16: 0.14023670906519603, 17: 0.096817318756050832, 18: 0.085822227670014059, 19: 0.042178447746868117, 20: -0.073494398270518693,
    21: -0.13699665903273103, 22: -0.13733324345373216, 23: -0.31112299949731331, 24: -0.42369178918768974, 25: -0.54826542322710636,
    26: -0.56090603814914863, 27: -0.63252372328438811, 28: -0.68787316140457322, 29: -1.1981351436422796, 30: -1.944118415387774,
    31: -2.1899746357945964, 32: -2.9077222144449961},
    'total_salaries': {0: 3.5998991340231234,
    1: 1.6158435140488829, 2: 0.87501176080187315, 3: 0.57584734201367749, 4: 0.54559862861592978, 5: 0.85178295446270169,
    6: 0.18345463930386757, 7: 0.81380836410678736, 8: 0.43412670908952178, 9: 0.29560433676606418, 10: 1.0636736398252848,
    11: 0.08930130612600648, 12: -0.20839133305170349, 13: 0.33676911316165403, 14: -0.12404710480916628, 15: 0.82454221267393346,
    16: -0.34510456295395986, 17: -0.17162157282367937, 18: -0.064803261585569982, 19: -0.22807757277294818, 20: -0.61709008778669083,
    21: -0.42506873158089231, 22: -0.42637946918743924, 23: -0.53516500398181921, 24: -0.68219830809296633, 25: -1.0051418692474947,
    26: -1.0900316082184143, 27: -0.82421065378673986, 28: 0.095758053930450004, 29: -0.91540963929213015, 30: -1.3296449323844519,
    31: -1.5512503530547552, 32: -1.6573856443389405}})
    
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, dendrogram
    
    distanceMatrix = pdist(data)
    dend = dendrogram(linkage(distanceMatrix, method='complete'), 
               color_threshold=4, 
               leaf_font_size=10,
               labels = df.teamID.tolist())
    

    dendrogram

    len(dend['color_list'])
    Out[169]: 32
    len(df.index)
    Out[170]: 33
    

    Why is dendrogram only assigning colors to 32 labels, although there are 33 observations in the data? Is this how I extract the labels and their corresponding clusters (colored in blue, green and red above)? If not, how else do I 'cut' the tree properly?

    Here's my attempt at using fcluster. Why does it return only one cluster for the set, when the same threshold for dend returns three?

    from scipy.cluster.hierarchy import fcluster
    fcluster(linkage(distanceMatrix, method='complete'), 4)
    Out[175]: 
    array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
    
  • user3479780
    user3479780 almost 8 years
    This problem has been bothering me too. I have finally worked it out that the order of the fcluster is the same as the order of the input distance matrix, not the same as the leaves from left to right in the dendrogram.
  • Statham
    Statham over 5 years
    the parameter criterion is very important, different criterion set different convergence criterion