Apache Spark: How to create a matrix from a DataFrame?
Since you didn't provide an example input I'll assume it looks more or less like this where id
is a row number and image
contains values.
traindf = sqlContext.createDataFrame([
(1, [1, 2, 3]),
(2, [4, 5, 6]),
(3, (7, 8, 9))
], ("id", "image"))
First thing you have to understand is that the DenseMatrix
is a local data structure. To be precise it is a wrapper around numpy.ndarray
. As for now (Spark 1.4.1) there are no distributed equivalents in PySpark MLlib.
Dense Matrix take three mandatory arguments numRows
, numCols
, values
where values
is a local data structure. In your case you have to collect first:
values = (traindf.
rdd.
map(lambda r: (r.id, r.image)). # Extract row id and data
sortByKey(). # Sort by row id
flatMap(lambda (id, image): image).
collect())
ncol = len(traindf.rdd.map(lambda r: r.image).first())
nrow = traindf.count()
dm = DenseMatrix(nrow, ncol, values)
Finally:
> print dm.toArray()
[[ 1. 4. 7.]
[ 2. 5. 8.]
[ 3. 6. 9.]]
Edit:
In Spark 1.5+ you can use mllib.linalg.distributed
as follows:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
mat = IndexedRowMatrix(traindf.map(lambda row: IndexedRow(*row)))
mat.numRows()
## 4
mat.numCols()
## 3
although as for now API is still to limited to be useful in practice.
Related videos on Youtube
NormallySane
Undergraduate Student at University of Waterloo studying Statistics and Computational Mathematics.
Updated on June 07, 2022Comments
-
NormallySane almost 2 years
I have a DataFrame in Apache Spark with an array of integers, the source is a set of images. I ultimately want to do PCA on it, but I am having trouble just creating a matrix from my arrays. How do I create a matrix from a RDD?
> imagerdd = traindf.map(lambda row: map(float, row.image)) > mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd) Traceback (most recent call last): File "<ipython-input-21-6fdaa8cde069>", line 2, in <module> mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd) File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__ values = self._convert_to_array(values, np.float64) File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array return np.asarray(array_like, dtype=dtype) File "/usr/local/python/conda/lib/python2.7/site- packages/numpy/core/numeric.py", line 462, in asarray return array(a, dtype, copy=False, order=order) TypeError: float() argument must be a string or a number
I'm getting the same error from every possible arrangement I can think of:
imagerdd = traindf.map(lambda row: Vectors.dense(row.image)) imagerdd = traindf.map(lambda row: row.image) imagerdd = traindf.map(lambda row: np.array(row.image))
If I try
> imagedf = traindf.select("image") > mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
Traceback (most recent call last):
File "<ipython-input-26-a8cbdad10291>", line 2, in <module> mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf) File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__ values = self._convert_to_array(values, np.float64) File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array return np.asarray(array_like, dtype=dtype) File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray return array(a, dtype, copy=False, order=order) ValueError: setting an array element with a sequence.
-
Moustafa Mahmoud over 6 yearsDo you know how to do the same into scala? stackoverflow.com/questions/47010126/…