SparseVector to DenseVector conversion in Pyspark

Spark 2.0.2+

You should be able to iterate SparseVectors. See: SPARK-17587.
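For example, a quick sketch (assuming Spark >= 2.0.2, where SPARK-17587 makes SparseVector iterable; the output shown is what I'd expect, not something stated above):

>>> from pyspark.mllib.linalg import SparseVector, DenseVector
>>> sv = SparseVector(5, {4: 1.})
>>> list(sv)         # iteration should now work
[0.0, 0.0, 0.0, 0.0, 1.0]
>>> DenseVector(sv)  # so direct conversion should as well
DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])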

Spark < 2.0.2

Well, the first case is quite interesting, but the overall behavior doesn't look like a bug at all. If you take a look at the DenseVector constructor, it considers only two cases (sketched below):

  1. ar is a bytes object (immutable sequence of integers in the range 0 <= x < 256)
  2. Otherwise we simply call np.array(ar, dtype=np.float64)
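A rough paraphrase of that logic, for illustration only (a simplified sketch, not the actual Spark source):

import numpy as np

def dense_vector_init_sketch(ar):  # hypothetical helper mirroring DenseVector.__init__
    if isinstance(ar, bytes):
        # case 1: a bytes object, treated here as a raw buffer of float64 values
        ar = np.frombuffer(ar, dtype=np.float64)
    else:
        # case 2: everything else is handed straight to np.array
        ar = np.array(ar, dtype=np.float64)
    return ar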

SparseVector is clearly not a bytes object, so when you pass it to the constructor it is used as the object argument for the np.array call. If you check the numpy.array docs you learn that object should be

An array, any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence.

You can check that SparseVector doesn't meet the above criteria. It is not a Python sequence type and:

>>> sv = SparseVector(5, {4: 1.})
>>> isinstance(sv, np.ndarray)
False
>>> hasattr(sv, "__array_interface__")
False
>>> hasattr(sv, "__array__")
False
>>> hasattr(sv, "__iter__")
False
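And yet the tracebacks in the question show np.array ending up inside SparseVector.__getitem__. That is because (my observation, not spelled out above) SparseVector does expose __len__ and __getitem__, so numpy can still try to read it element by element, and what happens next depends on how that read fails:

>>> hasattr(sv, "__len__")
True
>>> hasattr(sv, "__getitem__")
True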

If you want to convert a SparseVector to a DenseVector you should probably use the toArray method:

DenseVector(sv.toArray())
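Putting it together (this assumes the pyspark.mllib.linalg imports from the question; the printed repr is the expected result, not output re-verified on every version):

>>> from pyspark.mllib.linalg import SparseVector, DenseVector
>>> sv = SparseVector(5, {4: 1.})
>>> DenseVector(sv.toArray())  # toArray() returns a numpy.ndarray, which the constructor accepts
DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])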

Edit:

I think this behavior explains why DenseVector(SparseVector(...)) may work in some cases: when numpy reads the vector element by element through __getitem__, where that read fails, and with which exception, depends on where the non-zero entries sit:

>>> [x for x in SparseVector(5, {0: 1.})]
[1.0]
>>> [x for x in SparseVector(5, {4: 1.})]
Traceback (most recent call last):
...
ValueError: Index 5 out of bounds.

Comments

  • Admin

    Unexpected errors when converting a SparseVector to a DenseVector in PySpark 1.4.1:

    from pyspark.mllib.linalg import SparseVector, DenseVector
    
    DenseVector(SparseVector(5, {4: 1.}))
    

    This runs properly on Ubuntu, running pyspark, returning:

    DenseVector([0.0, 0.0, 0.0, 0.0, 1.0])

    This results in an error on RedHat, running pyspark, returning:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 206, in __init__
        ar = np.array(ar, dtype=np.float64)
      File "/usr/lib/spark/python/pyspark/mllib/linalg.py", line 673, in __getitem__
        raise ValueError("Index %d out of bounds." % index)
    ValueError: Index 5 out of bounds.


    Also, on both platforms, evaluating the following results in an error:

    DenseVector(SparseVector(5, {0: 1.}))
    

    I would expect:

    DenseVector([1.0, 0.0, 0.0, 0.0, 0.0])

    but get:

    • Ubuntu:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 206, in __init__
        ar = np.array(ar, dtype=np.float64)
      File "/home/skander/spark-1.4.1-bin-hadoop2.6/python/pyspark/mllib/linalg.py", line 676, in __getitem__
        row_ind = inds[insert_index]
    IndexError: index out of bounds

    Note: this error message is different from the previous one, although the error occurs in the same function (code at https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg.html)

    • RedHat: the same command results in a segmentation fault, which crashes Spark.