Convert a Spark Vector of features into an array

arrays scala apache-spark vector apache-spark-sql

14,755

Solution 1

Spark 3.0 added vector_to_array UDF. No need to implement yourself https://github.com/apache/spark/pull/26910

import org.apache.spark.ml.linalg.{SparseVector, Vector}
import org.apache.spark.mllib.linalg.{Vector => OldVector}

private val vectorToArrayUdf = udf { vec: Any =>
    vec match {
      case v: Vector => v.toArray
      case v: OldVector => v.toArray
      case v => throw new IllegalArgumentException(
        "function vector_to_array requires a non-null input argument and input type must be " +
        "`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`, " +
        s"but got ${ if (v == null) "null" else v.getClass.getName }.")
    }
  }.asNonNullable()

Solution 2

If you just want to convert DenseVector into Array[Double] this is fairly simple with the UDF:

import org.apache.spark.ml.linalg.DenseVector
val toArr: Any => Array[Double] = _.asInstanceOf[DenseVector].toArray
val toArrUdf = udf(toArr)
val dataWithFeaturesArr = dataWithFeatures.withColumn("features_arr",toArrUdf('features))

This will give you a new column:

|-- features_arr: array (nullable = true)
|    |-- element: double (containsNull = false)

Solution 3

Here is a way (without udf) to get a Datagrame(String, Array) from a Dataframe (String, Vector). Main idea is to use an intermediate RDD to cast as a Vector, and use its toArray method:

val arrayDF = vectorDF.rdd
    .map(x => x.getAs[String](0) -> x.getAs[Vector](1).toArray)
    .toDF("word","array")

14,755

Author by

LucieCBurgess

Updated on June 13, 2022

Comments

LucieCBurgess almost 2 years
I have a features column which is packaged into a Vector of vectors using Spark's VectorAssembler, as follows. data is the input DataFrame (of type spark.sql.DataFrame).
```
val featureCols = Array("feature_1","feature_2","feature_3")
val featureAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataWithFeatures = featureAssembler.transform(data)
```
I am developing a custom Classifier using the Classifier and ClassificationModel developer API. ClassificationModel requires development of a predictRaw() function which outputs a vector of predicted labels from the model.
```
def predictRaw(features: FeaturesType) : Vector
```
This function is set by the API and takes a parameter, features of FeaturesType and outputs a Vector (which in my case I'm taking to be a Spark DenseVector as DenseVector extends the Vector trait).

Due to the packaging by VectorAssembler, the features column is of type Vector and each element is itself a vector, of the original features for each training sample. For example:

features column - of type Vector
[1.0, 2.0, 3.0] - element1, itself a vector
[3.5, 4.5, 5.5] - element2, itself a vector

I need to extract these features into an Array[Double] in order to implement my predictRaw() logic. Ideally I would like the following result in order to preserve the cardinality:
```
`val result: Array[Double] = Array(1.0, 3.5, 2.0, 4.5, 3.0, 4.5)` 
```
i.e. in column-major order as I will turn this into a matrix.

I've tried:
```
val array = features.toArray // this gives an array of vectors and doesn't work
```
I've also tried to input the features as a DataFrame object rather than a Vector but the API is expecting a Vector, due to the packaging of the features from VectorAssembler. For example, this function inherently works, but doesn't conform to the API as it's expecting FeaturesType to be Vector as opposed to DataFrame:
```
def predictRaw(features: DataFrame) :DenseVector = {
  val featuresArray: Array[Double] = features.rdd.map(r => r.getAs[Vector](0).toArray).collect 
//rest of logic would go here
}
```
My problem is that features is of type Vector, not DataFrame. The other option might be to package features as a DataFrame but I don't know how to do that without using VectorAssembler.

All suggestions appreciated, thanks! I have looked at Access element of a vector in a Spark DataFrame (Logistic Regression probability vector) but this is in python and I'm using Scala.
LucieCBurgess over 6 years

Hello - I'm not sure if any of these are really doing quite what I want. With the extract_features UDF above I seem to get the same column as the features column, as follows: +--------------------+--------------------+ | features| extracted_features| +--------------------+--------------------+ |[-9.5357,0.016682...|[-9.5357, 0.01668...| +--------------------+--------------------+
LucieCBurgess over 6 years

In other words the features column and extracted features look exactly the same. I can get to each element like this: only showing top 1 row. If I then do the following: val featuresArray1: Array[Double] = temp.rdd.map(r => r.getAs[Double](0)).collect (using index elements 1 and 2) - will ask another question as running out of space
LucieCBurgess over 6 years

I think the problem is toArray gives an array with 3 elements for each row and then I struggle to access these. I'm going to ask a separate question so this is clearer. Please take a look, thanksI