How to convert a map to Spark's RDD


Solution 1

I guess you want something like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Set this if you know it upfront; otherwise it can be computed
// from the data, e.g.:
// trainMap.values.flatMap(_._2.keys).max + 1
val nFeatures: Int = ???

val trainMap = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

val trainRdd: RDD[(String, LabeledPoint)] = sc
  // Convert Map to Seq so it can be passed to parallelize
  .parallelize(trainMap.toSeq)
  .map { case (id, (labelInt, values)) =>

    // Convert nested map to Seq so it can be passed to Vectors.sparse
    val features = Vectors.sparse(nFeatures, values.toSeq)

    // Convert label to Double so it can be used for LabeledPoint
    val label = labelInt.toDouble

    (id, LabeledPoint(label, features))
  }
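
If you only need the LabeledPoint values (for example to train the DecisionTree from the question), a minimal follow-up sketch, assuming the trainRdd built above, drops the ids and does the same 70/30 split as the question's code:

// Drop the ids; only the LabeledPoints are needed for training
val data = trainRdd.map(_._2)

// Split into training and test sets, as in the question
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))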

Solution 2

It can be done in two ways:

  1. sc.textFile("libsvm_data.txt").map(s => createObject())
  2. Convert map into collection of objects and use sc.parallelize()

The first one is preferable.
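
In the first option, createObject() is only pseudocode (MLUtils.loadLibSVMFile already does this parsing for you). A minimal sketch of what such a parser could look like, assuming one record per line in the LibSVM format label index:value index:value ... with 1-based indices and a known nFeatures:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical stand-in for createObject(): parse a single LibSVM line,
// e.g. "1 3:5.0 4:6.0"
def parseLibSVMLine(line: String, nFeatures: Int): LabeledPoint = {
  val parts = line.trim.split("\\s+")
  val label = parts.head.toDouble
  val indexedValues = parts.tail.map { token =>
    val Array(idx, value) = token.split(":")
    (idx.toInt - 1, value.toDouble) // LibSVM indices are 1-based
  }
  LabeledPoint(label, Vectors.sparse(nFeatures, indexedValues))
}

// Usage, mirroring the first option:
// val data = sc.textFile("libsvm_data.txt").map(parseLibSVMLine(_, nFeatures))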

Author by Alt

Updated on June 15, 2022

Comments

  • Alt almost 2 years

    I have a data set which is in the form of some nested maps, and its Scala type is:

    Map[String, (LabelType, Map[Int, Double])]
    

    The first String key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1), and a nested map which is the sparse representation of the non-zero elements which are associated with the sample.

    I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms.

    It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:

    writeMapToLibSVMFile(data_map, "libsvm_data.txt") // Implemented somewhere else
    val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]")
    val sc = new SparkContext(conf)
    
    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt")
    // Split the data into training and test sets
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))
    
    // Train a DecisionTree model.
    

    I know it should be as easy to directly load the data variable from data_map, but I don't know how.

    Any help is appreciated!

  • Alt over 8 years
    Please note that "libsvm_data.txt" needs to be written to a file first, which I want to avoid.
  • evgenii over 8 years
    That's true, it has to be avoided.