Convert Array[Row] to DataFrame in Spark/Scala
Solution 1
You have a bug in the first line: `collect` returns an `Array`, so the `map` that follows it runs locally on the driver instead of on the distributed data. Map on the DataFrame/RDD first and collect afterwards:

```scala
val arrayOfRows = myDataFrame.map(t => myfun(t)).collect()
```
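As a hedged sketch (the names `myDataFrame` and `myfun` come from the question): in Spark 2.x, calling `.map` directly on a `DataFrame` that produces `Row` requires an implicit `Encoder[Row]`, which is not available by default, so going through `.rdd` first is the safer variant of the same idea:

```scala
import org.apache.spark.sql.Row

// Sketch only: myDataFrame and myfun are assumed from the question.
// Mapping on the underlying RDD runs myfun on the executors; collect()
// then only ships the already-transformed rows to the driver.
val arrayOfRows: Array[Row] = myDataFrame.rdd.map(t => myfun(t)).collect()
```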
Solution 2
```scala
// Create a case class describing one row
case class PgRnk(userId: Long, pageRank: Double)

import spark.implicits._ // needed for .toDS()

// sc.parallelize converts the array into an RDD, and toDS() then into a Dataset
sc.parallelize(pg10.map(r1 => PgRnk(r1.getLong(0), r1.getDouble(1)))).toDS()
```
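A fuller sketch of the same idea, assuming a `SparkSession` named `spark` and an `Array[Row]` named `pg10` as in the thread (the example data is invented for illustration; in a compiled application, define the case class at top level so Spark can derive its encoder):

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Typed schema for the resulting Dataset
case class PgRnk(userId: Long, pageRank: Double)

val spark = SparkSession.builder()
  .appName("array-to-dataset")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // brings in the Encoder for PgRnk and .toDS()

// Hypothetical stand-in for the asker's Array[Row]
val pg10: Array[Row] = Array(Row(1L, 0.25), Row(2L, 0.75))

// Map each Row to the case class, parallelize into an RDD, convert to Dataset
val ds = spark.sparkContext
  .parallelize(pg10.map(r => PgRnk(r.getLong(0), r.getDouble(1))))
  .toDS()

ds.show()
```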
rvp
Interests in Distributed Systems, Artificial Intelligence, Big Data Analytics and Android FW.
Updated on June 26, 2022

Comments
- rvp, almost 2 years ago:
I want to convert an `Array[org.apache.spark.sql.Row]` to a `DataFrame`. Could anyone suggest a better way?

I tried to first convert it into an `RDD` and then into a `DataFrame`, but when I perform any operation on the `DataFrame`, exceptions are shown.

```scala
val arrayOfRows = myDataFrame.collect().map(t => myfun(t))
val distDataRDD = sc.parallelize(arrayOfRows)
val newDataframe = sqlContext.createDataFrame(distDataRDD, myschema)
```

Here `myfun()` is a function that returns a `Row` (`org.apache.spark.sql.Row`). The contents of the array are correct, and I am able to print them without any problem.

But when I tried to count the records in the `RDD`, it gave me the count along with a warning that one of the stages contains a task of very large size. I guess I am doing something wrong. Please help.