How to convert a case-class-based RDD into a DataFrame?
Solution 1
All you need is:
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
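Solution 1 in context — a minimal sketch, assuming a modern SparkSession named spark (with Spark 1.x, use sqlContext as in the question); Dog and the sample data follow the question:

```scala
import org.apache.spark.sql.SparkSession

// Defined at top level (not inside main) so a TypeTag is available.
case class Dog(name: String)

val spark = SparkSession.builder().master("local[*]").appName("dogs").getOrCreate()
val dogRDD = spark.sparkContext.parallelize(Seq(Dog("Rex"), Dog("Fido")))

// One-argument overload: the schema is inferred from the case class fields
// via Scala reflection, so no second parameter is needed.
val dogDF = spark.createDataFrame(dogRDD)
dogDF.show()
// +----+
// |name|
// +----+
// | Rex|
// |Fido|
// +----+
```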
Solution 2
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
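Solution 2 in context — a sketch assuming a SparkSession named spark; the spark.implicits._ import is what brings toDF into scope:

```scala
import org.apache.spark.sql.SparkSession

case class Dog(name: String)

val spark = SparkSession.builder().master("local[*]").appName("dogs").getOrCreate()
import spark.implicits._  // required for the toDF syntax

val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
dogDf.show()
```

The column is named after the case class field (name); pass explicit names, e.g. toDF("dogName"), if you want different ones.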
Author: sparkour
Updated on January 07, 2020

Comments
-
sparkour over 4 years
The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:

// sc is the SparkContext, while sqlContext is the SQLContext.

// Define the case class and raw data
case class Dog(name: String)
val data = Array(Dog("Rex"), Dog("Fido"))

// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)

// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)

// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])

// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()

The output I'm seeing is:

Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
-
sparkour almost 8 years
This worked. I also had to move the definition of the case class outside of my main function to avoid error: No TypeTag available for Dog. Thanks!
-
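The comment above can be illustrated with a plain-Scala sketch (no Spark needed): the compiler can only materialize a TypeTag for a case class declared at top level or inside an object, not for one declared inside a method:

```scala
import scala.reflect.runtime.universe.typeTag

// Top-level declaration: a TypeTag can be materialized.
case class Dog(name: String)

object TypeTagDemo {
  def main(args: Array[String]): Unit = {
    println(typeTag[Dog]) // compiles fine

    // A case class declared *inside* main would fail here:
    // case class LocalDog(name: String)
    // typeTag[LocalDog]  // error: No TypeTag available for LocalDog
  }
}
```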
qwwqwwq over 7 years
I see, very interesting, so the second parameter is only ever required when calling from the Java API, and Scala will just automagically detect the fields of the type that should be converted to columns?
-
Praveen L about 6 years
It worked only if the case class is moved outside of main. @Vitalii, @sparkour: is there any explanation for why the case class needs to be moved outside of main?
-
Anish over 5 years
I am getting "abstract is a reserved keyword and cannot be used as field name" since my case class has abstract as a field name. Any workaround for this?
-
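One possible workaround for the comment above: Scala allows reserved words as identifiers when wrapped in backticks, so a case class field can literally be named abstract (this is plain Scala; the resulting DataFrame column would be named abstract):

```scala
// Backticks let a reserved word be used as an identifier.
case class Paper(`abstract`: String, title: String)

val p = Paper("A short summary", "My Paper")
println(p.`abstract`)  // prints: A short summary
```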
Remis Haroon - رامز almost 5 years
Any explanation why it won't work in cluster mode?
-
Federico Rizzo almost 3 years
Remember to add import spark.implicits._