Apache Spark: How to create a matrix from a DataFrame?


Since you didn't provide example input, I'll assume it looks more or less like this, where id is a row number and image contains the values:

traindf = sqlContext.createDataFrame([
    (1, [1, 2, 3]),
    (2, [4, 5, 6]),
    (3, [7, 8, 9])
], ("id", "image"))

The first thing you have to understand is that DenseMatrix is a local data structure. To be precise, it is a wrapper around numpy.ndarray. As of now (Spark 1.4.1) there are no distributed equivalents in PySpark MLlib.
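As a minimal local sketch (not part of the original answer), this is what that means in practice: the matrix lives on the driver, toArray() returns a numpy.ndarray, and the values are read in column-major order:

from pyspark.mllib.linalg import DenseMatrix

# Toy 2x2 example; values fill the matrix column by column.
m = DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0])

print(type(m.toArray()))  # numpy.ndarray
print(m.toArray())
# [[ 1.  3.]
#  [ 2.  4.]]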

DenseMatrix takes three mandatory arguments, numRows, numCols, and values, where values is a local data structure (a flat sequence of entries). In your case you have to collect first:

values = (traindf.
    rdd.
    map(lambda r: (r.id, r.image)).  # Extract row id and data
    sortByKey().                     # Sort by row id
    flatMap(lambda kv: kv[1]).       # Flatten the image arrays, row by row
    collect())


ncol = len(traindf.rdd.map(lambda r: r.image).first())
nrow = traindf.count()

from pyspark.mllib.linalg import DenseMatrix

dm = DenseMatrix(nrow, ncol, values)

Finally (note that DenseMatrix reads the values in column-major order, so each input row ends up as a column):

> print dm.toArray()
[[ 1.  4.  7.]
 [ 2.  5.  8.]
 [ 3.  6.  9.]]
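
If you would rather keep the original row orientation, a plain NumPy reshape of the already collected values is a simple alternative (a sketch building on the variables above, not part of the original answer):

import numpy as np

# values were collected row by row, so a row-major reshape
# restores the original layout.
arr = np.array(values, dtype=float).reshape(nrow, ncol)
print(arr)
# [[ 1.  2.  3.]
#  [ 4.  5.  6.]
#  [ 7.  8.  9.]]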

Edit:

In Spark 1.5+ you can use mllib.linalg.distributed as follows:

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(traindf.rdd.map(lambda row: IndexedRow(*row)))
mat.numRows()
## 4
mat.numCols()
## 3

numRows() reports 4 rather than 3 because the row indices start at 1, so the matrix is assumed to span indices 0 through 3. As of now, though, the API is still too limited to be useful in practice.
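
Since the question ultimately wants PCA, one option worth noting is to skip the matrix entirely and feed an RDD of vectors to MLlib's PCA transformer. A minimal sketch, assuming Spark 1.5+ (where pyspark.mllib.feature.PCA is available) and an arbitrary choice of k=2 components:

from pyspark.mllib.feature import PCA
from pyspark.mllib.linalg import Vectors

# One dense vector per image row.
vectors = traindf.rdd.map(lambda r: Vectors.dense(r.image))

# k=2 is just an example; pick the number of components you need.
model = PCA(2).fit(vectors)
projected = model.transform(vectors)
print(projected.collect())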

Comments

  • NormallySane (almost 2 years ago):

    I have a DataFrame in Apache Spark containing an array of integers; the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from an RDD?

    > imagerdd = traindf.map(lambda row: map(float, row.image))
    > mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
    Traceback (most recent call last):

      File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
        mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)

      File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
        values = self._convert_to_array(values, np.float64)

      File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
        return np.asarray(array_like, dtype=dtype)

      File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
        return array(a, dtype, copy=False, order=order)

    TypeError: float() argument must be a string or a number
    

    I'm getting the same error from every possible arrangement I can think of:

    imagerdd = traindf.map(lambda row: Vectors.dense(row.image))
    imagerdd = traindf.map(lambda row: row.image)
    imagerdd = traindf.map(lambda row: np.array(row.image))
    

    If I try

    > imagedf = traindf.select("image")
    > mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
    Traceback (most recent call last):

      File "<ipython-input-26-a8cbdad10291>", line 2, in <module>
        mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)

      File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
        values = self._convert_to_array(values, np.float64)

      File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
        return np.asarray(array_like, dtype=dtype)

      File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
        return array(a, dtype, copy=False, order=order)

    ValueError: setting an array element with a sequence.
    
  • Moustafa Mahmoud (over 6 years ago):
    Do you know how to do the same in Scala? stackoverflow.com/questions/47010126/…