Creating Spark dataframe from numpy matrix
Solution 1
You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint when using spark-ml:
sc.version
# u'2.1.1'

import numpy as np
from pyspark.ml.linalg import Vectors

df = np.concatenate([np.random.randint(0, 2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000, -1)
# list() is needed on Python 3, where map returns a lazy iterator
dff = list(map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df))
mydf = spark.createDataFrame(dff, schema=["label", "features"])
mydf.show(5)
# +-----+-------------+
# |label| features|
# +-----+-------------+
# | 1|[0.0,0.0,0.0]|
# | 0|[0.0,1.0,1.0]|
# | 0|[0.0,1.0,0.0]|
# | 1|[0.0,0.0,1.0]|
# | 0|[0.0,1.0,0.0]|
# +-----+-------------+
PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]
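One thing worth noting about the data construction copied from the question: `np.concatenate(...).reshape(1000, -1)` fills each row with four *consecutive* elements of the flattened array, so column 0 is not actually the label column for every row (the first 250 rows contain nothing but labels, which is why the `features` in the output above are all 0s and 1s). If the intent is one source array per column, `np.column_stack` is the likely fix. A small numpy-only sketch of the difference (array names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
f1 = rng.standard_normal(1000)
f2 = 3 * rng.standard_normal(1000) + 2
f3 = 6 * rng.standard_normal(1000) - 2

# reshape on the concatenated 1-D array fills rows block-wise with
# consecutive elements, so the first rows contain only label values:
flat = np.concatenate([labels, f1, f2, f3]).reshape(1000, -1)

# column_stack keeps each source array as its own column instead:
mat = np.column_stack([labels, f1, f2, f3])

print(np.array_equal(mat[:, 0], labels))   # True: column 0 really is the label
print(np.array_equal(flat[:, 0], labels))  # False: rows were filled block-wise
```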
Solution 2
From Numpy to Pandas to Spark:
import numpy as np
import pandas as pd

data = np.random.rand(4, 4)
df = pd.DataFrame(data, columns=list('abcd'))
spark.createDataFrame(df).show()
Output:
+-------------------+-------------------+------------------+-------------------+
| a| b| c| d|
+-------------------+-------------------+------------------+-------------------+
| 0.8026427193838694|0.16867056812634307|0.2284873209015007|0.17141853164400833|
| 0.2559088794287595| 0.3896957084615589|0.3806810025185623| 0.9362280141470332|
|0.41313827425060257| 0.8087580640179158|0.5547653674054028| 0.5386190454838264|
| 0.2948395900484454| 0.4085807623354264|0.6814694724946697|0.32031773805256325|
+-------------------+-------------------+------------------+-------------------+
Solution 3
The problem is easy to solve: you're using the ml and the mllib API at the same time. Stick to one; otherwise you get this error. This is the solution for the mllib API:
import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
# list() is needed on Python 3, where map returns a lazy iterator
df = list(map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df))
mydf = spark.createDataFrame(df, ["label", "features"])
For the ml API, you don't really need LabeledPoint anymore; see Solution 1 above for an example. I would suggest using the ml API, since the mllib API is going to be deprecated soon.
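With the ml API, plain (label, features) tuples are all that createDataFrame needs. As a numpy-only sketch of the row shape involved (the Spark-side calls are shown as comments because they require a running SparkSession; the array sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
mat = np.column_stack([
    rng.integers(0, 2, size=5),     # label column
    rng.standard_normal((5, 3)),    # three feature columns
])

# With pyspark available, this would be:
#   from pyspark.ml.linalg import Vectors
#   rows = [(int(r[0]), Vectors.dense(r[1:])) for r in mat]
#   mydf = spark.createDataFrame(rows, ["label", "features"])
# The plain-Python row shape createDataFrame receives looks like:
rows = [(int(r[0]), list(r[1:])) for r in mat]
print(rows[0])
```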
Jan Sila
Student of life, quantitative finance, mathematics and economics from Czech Republic https://cz.linkedin.com/in/jansila
Updated on June 15, 2022

Comments
- Jan Sila, almost 2 years ago:
It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a logit model. I ran the tutorial successfully and would like to pass my own data into it.
I've tried this:
%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(df,["label", "features"])
but I cannot get rid of this error:
TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector
I'm using the ML library for vector and the input is a double array, so what's the catch, please? It should be fine according to the documentation.
Many thanks.