Mahout: CSV to vector and running the program

Solution 1

For getting your data in SequenceFile format, you have a couple of strategies you can take. Both involve writing your own code -- i.e., not strictly command-line.

Strategy 1: Use Mahout's CSVVectorIterator class. You pass it a java.io.Reader and it reads your CSV file, turning each row into a DenseVector. I've never used this, but I saw it in the API. It looks straightforward enough if you're OK with DenseVectors.

Strategy 2: Write your own parser. This is really easy, since you just split each line on "," and get an array you can loop through. For each line's array of values, you instantiate a vector using something like this:

new DenseVector(<your array here>);

and add it to a List (for example).
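The parsing half of Strategy 2 can be sketched with the standard library alone. This is a minimal sketch, not Mahout code: the class name CsvRows and the method parse are made up for illustration, and it assumes every field in every row is numeric. Each double[] it returns is what you would hand to new DenseVector(...).

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Mahout): turns CSV text into one double[] per row.
class CsvRows {

    // Reads every non-blank line, splits it on "," and parses each field as a double.
    // Assumption: all fields are numeric; a non-numeric field throws NumberFormatException.
    static List<double[]> parse(Reader source) {
        List<double[]> rows = new ArrayList<>();
        // BufferedReader.lines() wraps any IOException in an UncheckedIOException.
        new BufferedReader(source).lines().forEach(line -> {
            if (line.trim().isEmpty()) {
                return; // skip blank lines
            }
            String[] fields = line.split(",");
            double[] values = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                values[i] = Double.parseDouble(fields[i].trim());
            }
            rows.add(values);
        });
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a real CSV file.
        List<double[]> rows = parse(new StringReader("1.0,2.5,3\n4,5,6\n"));
        // Each array here is what would be passed to new DenseVector(...).
        System.out.println(rows.size() + " rows, first row has " + rows.get(0).length + " values");
    }
}
```

For a real file you would pass a FileReader instead of the StringReader used here.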

Then, once you have a List of Vectors, you can write them to SequenceFiles using something like this (I'm using NamedVectors in the code below):

FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();

List<NamedVector> vectors = <here's your List of vectors obtained from CSVVectorIterator>;

// Write the data to SequenceFile
try {
    fs = FileSystem.get(conf);

    Path path = new Path(<your path> + <your filename>);
    writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);

    VectorWritable vec = new VectorWritable();
    for (NamedVector vector : vectors) {

        vec.set(vector);
        writer.append(new Text(vector.getName()), vec);

    }
    writer.close();

} catch (Exception e) {
    System.out.println("ERROR: "+e);
}

Now you have a directory of "points" in SequenceFile format that you can use for your K-means clustering. You can point the command line Mahout commands at this directory as input.

Anyway, that's the general idea. There are probably other approaches as well.

Solution 2

To run k-means on a CSV file, you first have to create a SequenceFile to pass as an argument to KMeansDriver. The following code reads each line of the CSV file "points.csv", converts it into a vector, and writes it to the SequenceFile "points.seq".

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Both resources are closed automatically when the try block exits
try (
    BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
) {
    String line;
    long counter = 0;
    while ((line = reader.readLine()) != null) {
        String[] c = line.split(",");
        // Skip lines that do not contain at least two fields
        if (c.length > 1) {
            double[] d = new double[c.length];
            for (int i = 0; i < c.length; i++) {
                d[i] = Double.parseDouble(c[i]);
            }
            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);

            // The key is simply the running row number
            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
    }
}

Hope it helps!!

Solution 3

I ran into a few issues when running the code above, so here is the working code after a few syntax modifications.

String inputfiledata = Input_file_path;
String outputfile = output_path_for_sequence_file;
Configuration conf = new Configuration();
VectorWritable vec = new VectorWritable();
try {
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(outputfile);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
    BufferedReader br = new BufferedReader(new FileReader(inputfiledata));
    String s;
    while ((s = br.readLine()) != null) {

        // My columns are split by tabs with each entry in a new line as rows
        String[] spl = s.split("\\t");
        // The first column is the key; the remaining columns are the values
        String key = spl[0];
        double[] colvalues = new double[spl.length - 1];
        for (int k = 1; k < spl.length; k++) {
            colvalues[k - 1] = Double.parseDouble(spl[k]);
        }
        NamedVector nmv = new NamedVector(new DenseVector(colvalues), key);
        vec.set(nmv);
        writer.append(new Text(nmv.getName()), vec);
    }
    br.close();
    writer.close();

} catch (Exception e) {
    System.out.println("ERROR: " + e);
}
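The key/values split used above (first column as the name, remaining columns as the numbers) can be tried without Hadoop or Mahout on the classpath. The following is a minimal sketch using only the standard library: LabeledRow, its fields, and the sample ID "rec42" are hypothetical names for illustration. It takes the delimiter as a parameter, so the same helper covers tab-separated input and a comma-separated file whose first field is an ID.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Mahout): one delimited line split into a name and its numeric values.
class LabeledRow {
    final String name;
    final double[] values;

    LabeledRow(String name, double[] values) {
        this.name = name;
        this.values = values;
    }

    // delimiterRegex is a regular expression, e.g. "," for CSV or "\\t" for tab-separated input.
    // Assumption: the first field is the ID and every remaining field is numeric.
    static LabeledRow parse(String line, String delimiterRegex) {
        String[] fields = line.split(delimiterRegex);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected an ID and at least one value: " + line);
        }
        double[] values = new double[fields.length - 1];
        for (int k = 1; k < fields.length; k++) {
            values[k - 1] = Double.parseDouble(fields[k].trim());
        }
        return new LabeledRow(fields[0].trim(), values);
    }

    public static void main(String[] args) {
        // Made-up sample record: an ID followed by three features.
        LabeledRow row = parse("rec42,0.5,1.5,2.5", ",");
        System.out.println(row.name + " -> " + Arrays.toString(row.values));
    }
}
```

The name and values of each row map directly onto the two arguments of the NamedVector constructor call in the code above.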
Author by

Eduard Gamonal

Updated on June 16, 2022

Comments

  • Eduard Gamonal almost 2 years

    I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and do some statistics with the results I get.

    I can't figure out the way to run my own program within Mahout. However, the command-line interface might be enough.

    To run the sample program I do

    $ mahout seqdirectory --input uscensus --output uscensus-seq
    $ mahout seq2sparse -i uscensus-seq -o uscensus-vec
    $ mahout kmeans -i reuters-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
    

    The dataset is one large CSV file. Each line is a record. Features are comma separated. The first field is an ID. Because of the input format, I cannot use seqdirectory right away. I'm trying to implement the answer to this similar question, "How to perform k-means clustering in mahout with vector data stored as CSV?", but I still have two questions:

    1. How do I convert from CSV to SeqFile? I guess I can write my own program using Mahout to make this conversion and then use its output as input for seq2sparse. I guess I can use CSVIterator (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations). What class should I use to read and write?
    2. How do I build and run my new program? I couldn't figure it out from the book Mahout in Action or from other questions here.