Kmeans matlab "Empty cluster created at iteration 1" error

10,685

Solution 1

It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or that the data has less inherent clusters than you specified.

Try changing the initialization method using the start option. Kmeans provides four possible techniques to initialize clusters:

  • sample: sample K points randomly from the data as initial clusters (default)
  • uniform: select K points uniformly across the range of the data
  • cluster: perform preliminary clustering on a small subset
  • manual: manually specify initial clusters

Also you can try the different values of emptyaction option, which tells MATLAB what to do when a cluster becomes empty.

Ultimately, I think you need to reduce the number of clusters, i.e try K=2 clusters.


I tried to visualize your data to get a feel for it:

load matlab_X.mat
figure('renderer','zbuffer')
line(XX(:,1), XX(:,2), XX(:,3), ...
    'LineStyle','none', 'Marker','.', 'MarkerSize',1)
axis vis3d; view(3); grid on

After some manual zooming/panning, it looks like a silhouette of a person:

3d_points

You can see that the data of 307200 points is really dense and compact, which confirms what I suspected; the data doesnt have that many clusters.


Here is the code I tried:

>> [IDX,C] = kmeans(XX, 3, 'start','uniform', 'emptyaction','singleton');
>> tabulate(IDX)
  Value    Count   Percent
      1    18023      5.87%
      2    264690     86.16%
      3    24487      7.97%

Whats more, the entire points in cluster 2 are all duplicate points ([0 0 0]):

>> unique(XX(IDX==2,:),'rows')
ans =
     0     0     0

The other two clusters look like:

clr = lines(max(IDX));
for i=1:max(IDX)
line(XX(IDX==i,1), XX(IDX==i,2), XX(IDX==i,3), ...
    'Color',clr(i,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',1)
end

clustered points

So you might get better clusters if you first remove duplicate points first...


In addition, you have a few outliers that might affect the result of clustering. Visually, I narrowed down the range of the data to the following intervals which encompasses most of the data:

>> xlim([-500 100])
>> ylim([-500 100])
>> zlim([900 1500])

Here is the result after removing dupe points (over 250K points) and outliers (around 250 data points), and clustering with K=3 (best of out of 5 runs with the replicates option):

XX = unique(XX,'rows');
XX(XX(:,1) < -500 | XX(:,1) > 100, :) = [];
XX(XX(:,2) < -500 | XX(:,2) > 100, :) = [];
XX(XX(:,3) < 900 | XX(:,3) > 1500, :) = [];

[IDX,C] = kmeans(XX, 3, 'replicates',5);

with almost an equal split across the three clusters:

>> tabulate(IDX)
  Value    Count   Percent
      1    15605     36.92%
      2    15048     35.60%
      3    11613     27.48%

Recall that the default distance function is euclidean distance, which explains the shape of the formed clusters.

final clustering

Solution 2

If you are confident with your choice of "k=3", here is the code I wrote for not getting an empty cluster:

[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');

while length(unique(IDX))<3 ||  histc(histc(IDX,[1 2 3]),1)~=0
% i.e. while one of the clusters is empty -- or -- we have one or more clusters with only one member
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
end  
Share:
10,685
Tak
Author by

Tak

Apparently, this user prefers to keep an air of mystery about them

Updated on July 22, 2022

Comments

  • Tak
    Tak almost 2 years

    I'm using this script to cluster a set of 3D points using the kmeans matlab function but I always get this error "Empty cluster created at iteration 1". The script I'm using:

    [G,C] = kmeans(XX, K, 'distance','sqEuclidean', 'start','sample');
    

    XX can be found in this link XX value and the K is set to 3 So if anyone could please advise me why this is happening.

  • Tak
    Tak almost 11 years
    Thank you very much, I really appreciate your kind assistance! I've used this code but the figure appears without clustering the points with different colors: XX = unique(XX,'rows'); XX(XX(:,1) < -500 | XX(:,1) > 100, :) = []; XX(XX(:,2) < -500 | XX(:,2) > 100, :) = []; XX(XX(:,3) < 900 | XX(:,3) > 1500, :) = []; [IDX,C] = kmeans(XX, 3, 'replicates',5); clr = lines(max(IDX)); for ii=1:1 line(XX(IDX==ii,1), XX(IDX==ii,2), XX(IDX==ii,3), ... 'Color',clr(ii,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',1) axis vis3d; view(3); grid on end
  • Amro
    Amro almost 11 years
    oops there was a typo (for i=1:1 should have been for i=1:K where K is the number of clusters). Fixed now, see the edit :)
  • Tak
    Tak almost 11 years
    Thanks for this amazing solution, I really appreciate it. I'm wondering if you could assist me in doing another thing. Is it possible to remove the regions outside the big region? For example, in the above figure on the bottom right corner there are two regions of points not attached to the big region, I was wondering how can I remove the outlier regions? Many Thanks
  • Amro
    Amro almost 11 years
    @user1460166: perhaps another clustering algorithm like DBSCAN or mean-shift (density-based) might separate the points in the way you are expecting. Here are some other posts that will point you in the right direction: stackoverflow.com/questions/4567515/…, stackoverflow.com/a/2303583/97160 .
  • Amro
    Amro almost 11 years
    You can always do this as a one-time interactive filtering using MATLAB's brushing capabilities. See this one for an example: stackoverflow.com/a/7455390/97160
  • Tak
    Tak almost 11 years
    Thanks for your suggestions, but the problem is that I use this code on running frames of data, so I can't stop on each frame and manually edit it. That's why I'm asking if there is any other solution to apply it here? Many Thanks
  • Amro
    Amro almost 11 years
    figured as much :) You should look into the other clustering algorithms I mentioned. I think hierarchical clustering (which is implemented in the Statistics toolbox) might also work. See here for an example: stackoverflow.com/a/7133422/97160 . Experiment with the different linkage criteria
  • Tak
    Tak almost 11 years
    I've also tried this [class,type]=dbscanN(XX,3,[]); from the file found here but its taking avery long time dropbox.com/s/ugc7a0qyx5l3rav/DBSCANN.M
  • Tak
    Tak almost 11 years
    DO you have any suggestions? Many Thanks
  • Tak
    Tak almost 11 years
    Also I have a another small question, If I'm looping on more than XX matrices, the figures drawn from this clr = lines(max(IDX)); for i=1:max(IDX) line(XX(IDX==i,1), XX(IDX==i,2), XX(IDX==i,3), ... 'Color',clr(i,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',1) end; are displayed on top of each other, how can I clear the figure before each iteration? Thanks
  • Has QUIT--Anony-Mousse
    Has QUIT--Anony-Mousse almost 11 years
    DBSCAN should be used with index acceleration. Try the version in ELKI instead. I don't know if there is any index accelerated anything available in Matlab. Oh, and you should be fine downsampling this data without much loss.
  • Amro
    Amro almost 11 years
    @user1460166: I am not familiar with that implementation so I cant comment on it, but there is a note on Wikipedia about the algorithm complexity and how it can be accelerated using a spatial indexing structure as Anony-Mousse mentioned. His remark about downsampling is also worth trying... As for you other comment, if you want to produce separate plots, you could either open new figures inside the loop, or clear existing ones using clf at the beginning of each iteration.
  • Tak
    Tak almost 11 years
    @Amro Thank you very much, I really appreciate your assistance
  • Tak
    Tak almost 11 years
    @Anony-Mousse Thank you very much for your suggestion!
  • Tak
    Tak almost 11 years
    @Amro I'm asking if you could please check my question found here, thank you very much stackoverflow.com/questions/18052326/…
  • Tak
    Tak almost 11 years
    @Amro I want to rearrange the result IDX into 307200x1 or 640x480x1 to know in which cluster each pixel of the image belongs to. I don't know how can I do this as after clustering the IDX and XX become 37426 which is very good in performance but I don't know which pixel belongs to which cluster, do you have any suggestions please? Thanks
  • Tak
    Tak almost 11 years
    @Amro I want to do this as I want to know the nearest 2 points "pixels" between each cluster, I've posted a question regarding this here so I'd appreciate if you could have a look and let me know. Thanka alot stackoverflow.com/questions/18095542/…
  • Tak
    Tak almost 11 years
    @Amro I've replaced the zeros with NaN and the Kmeans run with no empty clusters, but this made me not use the Unique fn and increases the processing time. So I'm asking if you have any suggestions, as my target is to get the nearest point between two clusters "in any of them" as described in my question stated in my previous comment. Thanks
  • Amro
    Amro almost 11 years
    @user1460166: I posted an answer on your linked question
  • Tak
    Tak almost 11 years
    @Amro Thank you very much