How to convert a map to Spark's RDD
16,456
Solution 1
I guess you want something like this
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// If you know this upfront, otherwise it can be computed
// using flatMap
// trainMap.values.flatMap(_._2.keys).max + 1
val nFeatures: Int = ???
val trainMap = Map(
"x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
"x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))
val trainRdd: RDD[(String, LabeledPoint)] = sc
// Convert Map to Seq so it can passed to parallelize
.parallelize(trainMap.toSeq)
.map{case (id, (labelInt, values)) => {
// Convert nested map to Seq so it can be passed to Vector
val features = Vectors.sparse(nFeatures, values.toSeq)
// Convert label to Double so it can be used for LabeledPoint
val label = labelInt.toDouble
(id, LabeledPoint(label, features))
}}
Solution 2
It can be done in two ways
sc.textFile("libsvm_data.txt").map(s => createObject())
- Convert map into collection of objects and use
sc.parallelize()
The first one is preferrable.
Author by
Alt
Updated on June 15, 2022Comments
-
Alt almost 2 years
I have a data set which is in the form of some nested maps, and its Scala type is:
Map[String, (LabelType,Map[Int, Double])]
The first
String
key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1), and a nested map which is the sparse representation of the non-zero elements which are associated with the sample.I would like to load this data into Spark (using MUtil) and train and test some machine learning algorithms.
It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:
writeMapToLibSVMFile(data_map,"libsvm_data.txt") // Implemeneted some where else val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]") val sc = new SparkContext(conf) // Load and parse the data file. val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt") // Split the data into training and test sets val splits = data.randomSplit(Array(0.7, 0.3)) val (trainingData, testData) = (splits(0), splits(1)) // Train a DecisionTree model.
I know it should be as easy to directly load the
data
variable fromdata_map
, but I don't know how.Any help is appreciated!
-
Alt over 8 yearsPlease note that the "libsvm_data.txt" needs to be written to file first, which I want to avoid.
-
evgenii over 8 yearsThat's true, it has to be avoided.