Convert Array[Row] to DataFrame in Spark/Scala


Solution 1

The problem is in the first line: collect() pulls every row back to the driver as a local Array, so the subsequent map runs single-threaded on the driver rather than in parallel on the cluster. Re-parallelizing that large local array is also what triggers the "task of very large size" warning.

Try val arrayOfRows = myDataFrame.map(t => myfun(t)).collect() instead, so the mapping happens on the executors before the results are collected.
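In fact, if the end goal is a DataFrame rather than a local Array, the collect step can be skipped entirely. A minimal sketch, assuming Spark 1.x as in the question (where DataFrame.map returns an RDD); the bodies of myschema and myfun below are placeholders standing in for the ones from the question:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}

    // Placeholder schema and transform; the real myschema/myfun come from the question
    val myschema = StructType(Seq(
      StructField("userId", LongType, nullable = false),
      StructField("pageRank", DoubleType, nullable = false)))

    def myfun(r: Row): Row = Row(r.getLong(0), r.getDouble(1))

    // map runs on the executors; no large array is ever shipped back through the driver
    val transformedRDD = myDataFrame.map(t => myfun(t))          // RDD[Row] on Spark 1.x
    val newDataframe   = sqlContext.createDataFrame(transformedRDD, myschema)

This avoids both the round trip to the driver and the oversized-task warning, since the data never leaves the cluster.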

Solution 2

case class PgRnk(userId: Long, pageRank: Double)
// define a case class so the rows get named, typed fields

sc.parallelize(pg10.map(r1 => PgRnk(r1.getLong(0), r1.getDouble(1)))).toDS()
// sc.parallelize converts the array into an RDD; toDS() then builds a Dataset
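For completeness, a self-contained sketch of the same approach, assuming Spark 2.x with a SparkSession (on Spark 1.6, import sqlContext.implicits._ instead to bring toDS() into scope). Here pg10 is built from dummy data standing in for the collected Array[Row], and the field accessors assume each Row holds a Long followed by a Double:

    import org.apache.spark.sql.{Row, SparkSession}

    // top-level case class so Spark can derive an encoder for it
    case class PgRnk(userId: Long, pageRank: Double)

    object PgRnkExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rows-to-dataset").getOrCreate()
        import spark.implicits._ // brings .toDS() / .toDF() into scope

        // dummy Array[Row] standing in for the one collected in the question
        val pg10: Array[Row] =
          spark.range(3).selectExpr("id", "cast(id as double) / 10").collect()

        // Array[Row] -> Array[PgRnk] -> RDD[PgRnk] -> Dataset[PgRnk]
        val ds = spark.sparkContext
          .parallelize(pg10.map(r => PgRnk(r.getLong(0), r.getDouble(1))))
          .toDS()

        ds.show()
        // call ds.toDF() if a DataFrame is needed instead of a Dataset
      }
    }

Mapping the rows into a case class before parallelizing is what lets Spark infer the schema, so no explicit StructType is needed here.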

Comments

  • rvp (almost 2 years ago)

    I want to convert an Array[org.apache.spark.sql.Row] to a DataFrame. Could anyone suggest a better way?

    I tried first converting it into an RDD and then into a DataFrame, but when I perform any operation on the resulting DataFrame, exceptions are thrown.

    val arrayOfRows = myDataFrame.collect().map(t => myfun(t))
    val distDataRDD = sc.parallelize(arrayOfRows)
    val newDataframe = sqlContext.createDataFrame(distDataRDD,myschema)
    

    Here myfun() is a function which returns a Row (org.apache.spark.sql.Row). The contents of the array are correct, and I am able to print them without any problem.

    But when I tried to count the records in the RDD, it gave me the count along with a warning that one of the stages contains a task of very large size. I guess I am doing something wrong. Please help.