How can I perform K-means clustering on time series data?

Solution 1

Time series are usually high-dimensional, and you need a specialized distance function to compare them for similarity. Plus, there might be outliers.

k-means is designed for low-dimensional spaces with a (meaningful) Euclidean distance. It is not very robust to outliers, because it puts squared weight on them.
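For reference, the objective that k-means minimizes can be written as below. This is the standard textbook formulation; the symbols $C_j$ (the set of series assigned to cluster $j$) and $\mu_j$ (its centroid) are notation introduced here, not taken from the question.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2 ,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

The pointwise mean $\mu_j$ minimizes the inner sum only for squared Euclidean distance, so swapping in another distance invalidates the centroid update. Because deviations are squared, one far-away series also pulls its centroid disproportionately.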

Doesn't sound like a good idea to me to use k-means on time series data. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW.
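To make "time series distances such as DTW" concrete, here is a minimal sketch of the classic dynamic-programming DTW recurrence in base MATLAB. The function name `dtw_distance` is hypothetical, the local cost is assumed to be the absolute difference, and there is no warping-window constraint, so treat it as an illustration rather than a tuned implementation.

```matlab
function d = dtw_distance(a, b)
% DTW_DISTANCE  Dynamic time warping distance between two 1-D series.
%   Hypothetical helper for illustration: absolute-difference local cost,
%   no warping-window constraint, O(numel(a)*numel(b)) time and memory.
    na = numel(a);
    nb = numel(b);
    % C(i+1, j+1) holds the cost of the best alignment of a(1:i) with b(1:j).
    C = inf(na + 1, nb + 1);
    C(1, 1) = 0;
    for i = 1:na
        for j = 1:nb
            step = abs(a(i) - b(j));               % local cost of pairing a(i), b(j)
            best = min([C(i, j + 1), ...           % b(j) also matched a(i-1)
                        C(i + 1, j), ...           % a(i) also matched b(j-1)
                        C(i, j)]);                 % both series advance
            C(i + 1, j + 1) = step + best;
        end
    end
    d = C(na + 1, nb + 1);
end
```

For example, `dtw_distance([1 2 3], [1 2 2 3])` returns 0, because the repeated 2 can be absorbed by warping. A plain Euclidean distance is not even defined for these two series, since their lengths differ.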

Solution 2

It's probably too late for an answer, but:

The methods above use R. You'll find more methods by looking, e.g., for "Iterative Incremental Clustering of Time Series".

Solution 3

I have recently come across the kml R package which claims to implement k-means clustering for longitudinal data. I have not tried it out myself.

Also, the paper "Time-series clustering - A decade review" by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might help you find alternatives. Another nice paper, although somewhat dated, is "Clustering of time series data - a survey" by T. Warren Liao.

Solution 4

If you really did want to use clustering, then depending on your application you could generate a low-dimensional feature vector for each time series, for example the mean, the standard deviation, and the dominant frequency from a Fourier transform. This would be suitable for use with k-means, but whether it gives you useful results depends on your specific application and the content of your time series.
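A rough sketch of that feature extraction in base MATLAB might look like the following. The function name `series_features` and the sampling-rate argument `fs` are assumptions made for the example; the three features are the ones named above.

```matlab
function F = series_features(X, fs)
% SERIES_FEATURES  Map each row of X to [mean, std, dominant frequency].
%   X  : N-by-M matrix, one time series of length M per row
%   fs : sampling rate in Hz (assumed known; use 1 for cycles per sample)
%   F  : N-by-3 feature matrix
    M = size(X, 2);
    mu = mean(X, 2);                          % per-series mean
    sd = std(X, 0, 2);                        % per-series standard deviation
    % Magnitude spectrum of each mean-removed series (FFT along the rows).
    S = abs(fft(X - mu * ones(1, M), [], 2));
    half = floor(M / 2) + 1;                  % keep non-negative frequencies
    [~, idx] = max(S(:, 1:half), [], 2);      % strongest bin per series
    domFreq = (idx - 1) * fs / M;             % bin index -> frequency in Hz
    F = [mu, sd, domFreq];
end
```

The rows of `F` can then be handed to any k-means implementation. Since the three columns have different units, scaling them first (for instance to z-scores) is usually sensible so that no single feature dominates the Euclidean distance.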

Author by

Jaz

Updated on September 28, 2020

Comments

  • Jaz
    Jaz over 3 years

    How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster time series that are each 1×M, where M is the series length. In particular, I'm not sure how to update the mean of a cluster for time series data.

    I have a set of labelled time series, and I want to use the K-means algorithm to check whether it recovers similar labels. My X matrix will be N×M, where N is the number of time series and M is the series length as mentioned above.

    Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it would work for time series data? Also, I would like to be able to use different distance metrics besides Euclidean distance.

    To better illustrate my doubts, here is the code I modified for time series data:


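    % n = number of time series (rows of X). Its definition was not part of
    % the posted snippet, so this line is assumed.
    n=size(X,1);

    % Check if second input is centroids
    if ~isscalar(k)
        c=k;
        k=size(c,1);
    else
        c=X(randperm(n,k),:); % pick k distinct series as initial centroids
    end

    % allocating variables
    g0=ones(n,1);
    gIdx=zeros(n,1);
    D=zeros(n,k);

    % Main loop: converged when the partition no longer changes
    while any(g0~=gIdx)
        g0=gIdx;
        % Loop for each centroid
        for t=1:k
            % Loop for each time series (row of X)
            for s=1:n
                D(s,t) = sqrt(sum((X(s,:)-c(t,:)).^2)); % Euclidean distance
            end
        end
        % Partition data to closest centroids
        [~,gIdx]=min(D,[],2);
        % Update centroids using means of partitions
        for t=1:k
            members=(gIdx==t);
            % Is this how we calculate new mean of the time series?
            % Pointwise mean over the member rows; the dimension is explicit
            % so a single-member cluster is not collapsed to a scalar.
            if any(members) % keep the old centroid if the cluster is empty
                c(t,:)=mean(X(members,:),1);
            end
        end
    end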
    
  • samkhan13
    samkhan13 almost 11 years
    Could you please suggest some of the robust clustering algorithms? And what is DTW? Thanks.
  • Has QUIT--Anony-Mousse
    Has QUIT--Anony-Mousse almost 11 years
    Grab any book on time series, and it will teach you DTW. Or google for "time series DTW". It's state of the art. As for clustering, look up DBSCAN and OPTICS on Wikipedia. They can be used with DTW, k-means cannot.
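As a sketch of why density-based methods combine easily with an arbitrary distance, the DBSCAN variant below works purely from a precomputed N-by-N distance matrix, which could hold pairwise DTW distances or any other measure. The function name `dbscan_precomputed` and its parameter names are hypothetical, and the code is meant as an illustration of the idea rather than an optimized implementation.

```matlab
function labels = dbscan_precomputed(D, epsilon, minPts)
% DBSCAN_PRECOMPUTED  DBSCAN on a precomputed distance matrix (sketch).
%   D       : N-by-N symmetric matrix of pairwise distances
%   epsilon : neighborhood radius
%   minPts  : minimum neighborhood size (point itself included) for a core point
%   labels  : N-by-1 cluster indices, 0 marks noise
    N = size(D, 1);
    labels = zeros(N, 1);
    visited = false(N, 1);
    cluster = 0;
    for i = 1:N
        if visited(i), continue; end
        visited(i) = true;
        neighbors = find(D(i, :) <= epsilon);      % includes i itself
        if numel(neighbors) < minPts
            continue;   % noise for now; may still become a border point
        end
        cluster = cluster + 1;
        labels(i) = cluster;
        queue = neighbors;
        q = 1;
        while q <= numel(queue)
            j = queue(q);
            q = q + 1;
            if labels(j) == 0
                labels(j) = cluster;               % claim unassigned point
            end
            if ~visited(j)
                visited(j) = true;
                nj = find(D(j, :) <= epsilon);
                if numel(nj) >= minPts
                    queue = [queue, nj];           % j is a core point: expand
                end
            end
        end
    end
end
```

No centroid is ever computed here, only distances between existing series are compared, which is why the choice of distance is unconstrained. The values of `epsilon` and `minPts` have to be tuned to the scale of whichever distance fills `D`.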