How can I write the clustering results from mclust to file?


To calculate the actual clustering parameters themselves (mean, variance, what cluster each point belongs to), you need to use Mclust. To do the writing you can use (for example) write.csv.

By default, Mclust fits the candidate models, picks the best one according to BIC, and returns the parameters for that model. If that is what you want, you can do:

myMclust <- Mclust(myData)

Then myMclust$BIC will contain the BIC results for all the models that were tried (i.e. myMclust$BIC is more or less the same as mclustBIC(myData)).

See ?Mclust in the Value: section to see what other information myMclust has. For example, myMclust$parameters$mean is the mean for each cluster, myMclust$parameters$variance the variance for each cluster, ...

In particular, myMclust$classification contains the cluster each point belongs to, calculated for the best model.

So, to get the output you want, you can do:

library(mclust)

# create some data for example purposes -- you have your read.csv(...) instead.
myData <- data.frame(x=runif(100),y=runif(100),z=runif(100))
# get parameters for the best model (as chosen by BIC)
myMclust <- Mclust(myData)
# if you wanted to do your summary like before:
mySummary <- summary( myMclust$BIC, data=myData )

# add a column in myData CLUST with the cluster.
myData$CLUST <- myMclust$classification
# now to write it out:
write.csv(myData[,c("CLUST","x","y","z")], # reorder columns to put CLUST first
          file="out.csv",                  # output filename
          row.names=FALSE,                 # don't save the row numbers
          quote=FALSE)                     # don't surround column names in ""

A note on write.csv: if you leave out row.names=FALSE you'll get an extra column in your csv containing the row names (by default, the row numbers). Also, quote=FALSE writes your column headings as CLUST,x,y,z whereas otherwise they'd be "CLUST","x","y","z" (it leaves any character values unquoted too). It's your choice.
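To see the effect of those two arguments side by side, here is a minimal sketch using a tiny made-up data frame (demo and the temporary file names are only for illustration):

```r
# Tiny made-up data frame standing in for the clustered data.
demo <- data.frame(CLUST = c(1L, 2L), x = c(0.5, 1.5))

default_file <- tempfile(fileext = ".csv")
plain_file   <- tempfile(fileext = ".csv")

# Defaults: row names are written and strings/headings are quoted.
write.csv(demo, file = default_file)
# With the two options used above: no row names, no quotes.
write.csv(demo, file = plain_file, row.names = FALSE, quote = FALSE)

default_lines <- readLines(default_file)
plain_lines   <- readLines(plain_file)

print(default_lines)  # header is "","CLUST","x"
print(plain_lines)    # header is CLUST,x
```

The second form is usually easier to feed into other tools, while the default form keeps the row names if you need them to match records back to the input.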

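Suppose we wanted to do the same, but with the parameters from a different model that is not the best one. By default, Mclust only returns parameters for the best model. To calculate parameters for a particular model (say "EEI"), pass modelNames. Note that if you have already added the CLUST column to myData, you should fit on the original columns only, otherwise the old cluster labels get clustered as well:

myMclust <- Mclust(myData[,c("x","y","z")],modelNames="EEI")

and then proceed as before.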
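The question also asks how to write out the cluster parameters. A rough sketch follows, using only base R. Here params is a hand-made stand-in with the same general shape as myMclust$parameters (mixing proportions, a mean matrix with one column per cluster, and a nested variance list); the numbers are invented, and in a real session you would use myMclust$parameters instead.

```r
# Hypothetical stand-in for myMclust$parameters: 3 variables, 2 clusters.
params <- list(
  pro = c(0.6, 0.4),
  mean = matrix(c(1.2, 3.4, 5.2,    # cluster 1 mean (x, y, z)
                  5.5, 1.3, 1.3),   # cluster 2 mean (x, y, z)
                nrow = 3,
                dimnames = list(c("x", "y", "z"), NULL)),
  variance = list(sigma = array(diag(3), dim = c(3, 3, 2)))
)

# 1) Dump the whole nested object the way the console would print it.
params_file <- file.path(tempdir(), "params.txt")
capture.output(print(params), file = params_file)

# 2) Write the means as a table: one row per cluster, one column per variable.
means <- data.frame(CLUST = seq_len(ncol(params$mean)), t(params$mean))
means_file <- file.path(tempdir(), "means.csv")
write.csv(means, file = means_file, row.names = FALSE, quote = FALSE)

means_lines <- readLines(means_file)
print(means_lines)
```

capture.output is the quick route when you just want a readable dump of everything, nesting included. The transposed mean table is the better choice if another program has to read the centroids back in.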

Author by

si28719e

Updated on June 25, 2022

Comments

  • si28719e
    si28719e almost 2 years

    I'm using the mclust library for R ( http://www.stat.washington.edu/mclust ) to do some experimental EM-based GMM clustering. The package is great and seems to generally find very good clusters for my data.

    The problem is that I don't really know R at all, and while I have managed to muddle through the clustering process based on the help() contents and the extensive readme, I cannot for the life of me figure out how to write out the actual cluster results to file. I am using the following absurdly simple script to perform the clustering,

    myData <- read.csv("data.csv", sep=",", header=FALSE)
    attach(myData)
    myBIC <- mclustBIC(myData)
    mySummary <- summary( myBIC, data=myData )
    

    at which point I have cluster results and a summary. The data in data.csv is just a list of multi-dimensional points, one per line. So each line looks like 'x,y,z' (in the case of 3 dimensions).

    If I use 2d points (e.g. just the x and y vals) I can then use the internal plot function to get a very pretty graph that plots the original points and color codes each point based on the cluster it was assigned to. So I know all the info is somewhere in 'myBIC', but the docs and help don't seem to provide any insight as to how to print out this data!

    I want to print out a new file based on the results I believe are encoded in myBIC. Something like,

    CLUST x, y, z
    1 1.2, 3.4, 5.2
    1 1.2, 3.3, 5.2
    2 5.5, 1.3, 1.3
    3 7.1, 1.2, -1.0
    3 7.2, 1.2, -1.1
    

    and then - hopefully - also print out the parameters/centroids of the individual gaussians/clusters that the clustering process found.

    Surely this is an absurdly easy thing to do and I'm just too ignorant of R to figure it out...

    EDIT: I seem to have gotten a little bit further along. Doing the following prints out a somewhat cryptic matrix,

        > mySummary$classification
    [1] 1 1 2 1 3
    [6] 1 1 1 3 1
    [12] 1 2 1 3 1
    [18] 1 3 
    

    which upon reflection I realized is actually the list of samples and their classifications. I guess it is not possible to write this directly via the write command, but a bit more experimentation in the R console led me to realize that I can do this:

    > newData <- mySummary$classification
    > write.csv( newData, file="class.csv" )
    

    and that the result actually looks pretty nice!

     $ head class.csv
    "","x"
    "1",1
    "2",2
    "3",2
    

    where the first column apparently matches the index for the input data, and the second column describes the assigned class identity.

    The 'mySummary$parameters' object appears to be nested though, and has a bunch of sub-objects corresponding to the individual gaussians and their parameters, etc. The 'write' function fails when I try to just write it out, but individually writing out each sub object name is a bit tedious. Which leads me to a new question: how do I iterate over a nested object in R and print the elements out in a serial fashion to a file descriptor?

    I have this 'mySummary$parameters' object. It is composed of several sub-objects like 'mySummary$parameters$variance$sigma', etc. I would like to just iterate over everything and print it all to file in the same way that this is done to the CLI automatically...

  • si28719e
    si28719e over 12 years
    Awesome! thank you very much for the detailed rundown. I guess it will just take a while to get used to the quirks of R (as with any new language). This gave me some important insights into what is going on. I'm also pretty amazed at how much I've managed to do without knowing anything about R. Thanks again.
  • mathematical.coffee
    mathematical.coffee over 12 years
    I fell into R much like you did so I know what you mean, I felt like a monkey bashing away at my keyboard when I first started :P Good luck!
  • NiuBiBang
    NiuBiBang almost 10 years
    I know comments are supposed to avoid "+1" & "thanks", but myData$CLUST <- myMclust$classification & myMclust <- Mclust(myData,modelNames="EEI") are beautiful, exactly what I needed. I also appended the z-scores (MyData$PROB <- MyClust$z) to look @ the relative probabilities for each of the records' cluster membership.
  • rafa.pereira
    rafa.pereira about 7 years
    @NiuBiBang, just be aware that MyClust$z returns a probability matrix that shows the probability of each observation falling in each and every cluster. So when you do MyData$PROB <- MyClust$z you're only getting the probabilities for the observations to fall within the first cluster. Apart from that, thanks for the question! If you want the probability of the assigned cluster (the highest one), you should do `MyData$PROB <- apply(MyClust$z, 1, max)`
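A small sketch of the row-wise maximum suggested in the last comment. The matrix z below is made up and merely stands in for MyClust$z (one row per observation, one column per cluster), and records is a hypothetical data frame standing in for the input data.

```r
# Made-up membership probabilities: 3 observations, 2 clusters.
z <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.5, 0.5),
            ncol = 2, byrow = TRUE)

# Hypothetical input data with one row per observation.
records <- data.frame(x = c(1.2, 5.5, 3.0))

# Most probable cluster per row, and the probability of that cluster.
records$CLUST <- apply(z, 1, which.max)
records$PROB  <- apply(z, 1, max)

# Optionally keep every per-cluster probability as its own named column.
probs <- as.data.frame(z)
names(probs) <- paste0("P", seq_len(ncol(z)))
records <- cbind(records, probs)

print(records)
```

Keeping the full set of P columns makes it easy to spot borderline points such as the third row, where neither cluster is clearly preferred.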