How to convert a case-class-based RDD into a DataFrame?


Solution 1

All you need is just:

val dogDF = sqlContext.createDataFrame(dogRDD)

The second parameter belongs to the Java API and expects your class to follow the Java Beans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
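For comparison, here is a minimal sketch of a Java-bean-style class that the two-argument overload can reflect on (the JavaDog class and its setup below are illustrative, not from the original post; sc and sqlContext are assumed to be in scope as in the question):

// Hypothetical bean-style class: a no-arg constructor plus getter/setter
// pairs is what the bean-based overload inspects to discover columns.
class JavaDog extends java.io.Serializable {
  private var name: String = _
  def getName: String = name                           // bean getter -> "name" column
  def setName(value: String): Unit = { name = value }  // bean setter
}

val beanRDD = sc.parallelize(Seq("Rex", "Fido").map { n =>
  val d = new JavaDog
  d.setName(n)
  d
})

// With a bean class, the two-argument overload detects the "name" property
val beanDF = sqlContext.createDataFrame(beanRDD, classOf[JavaDog])
beanDF.show()  // two rows, single "name" column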

Solution 2

You can create a DataFrame directly from a Seq of case class instances using toDF as follows:

val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
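Note that toDF comes from Spark's implicit conversions, so they need to be imported, and the case class has to be defined at the top level rather than inside a method (otherwise you hit the No TypeTag available error mentioned in the comments below). A minimal sketch for a Spark 1.x shell where sqlContext is already in scope (in Spark 2.x the equivalent import is spark.implicits._ on the SparkSession):

import sqlContext.implicits._

// Defined at top level, not inside main, so a TypeTag can be generated
case class Dog(name: String)

val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF()
dogDf.show()  // two rows, single "name" column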

Comments

  • sparkour
    sparkour over 4 years

    The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:

    // sc is the SparkContext, while sqlContext is the SQLContext.
    
    // Define the case class and raw data
    case class Dog(name: String)
    val data = Array(
        Dog("Rex"),
        Dog("Fido")
    )
    
    // Create an RDD from the raw data
    val dogRDD = sc.parallelize(data)
    
    // Print the RDD for debugging (this works, shows 2 dogs)
    dogRDD.collect().foreach(println)
    
    // Create a DataFrame from the RDD
    val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
    
    // Print the DataFrame for debugging (this fails, shows 0 dogs)
    dogDF.show()
    

    The output I'm seeing is:

    Dog(Rex)
    Dog(Fido)
    ++
    ||
    ++
    ||
    ||
    ++
    

    What am I missing?

    Thanks!

  • sparkour
    sparkour almost 8 years
    This worked. I also had to move the definition of the case class outside of my main function to avoid error: No TypeTag available for Dog. Thanks!
  • qwwqwwq
    qwwqwwq over 7 years
    I see, very interesting. So the second parameter is only ever required when calling from the Java API, and Scala will just automagically detect the fields of the type that should be converted to columns?
  • Praveen L
    Praveen L about 6 years
    It worked only when the case class was moved outside of main. @Vitalii, @sparkour, is there any explanation for why the case class needs to be moved outside of main?
  • Anish
    Anish over 5 years
    I am getting "abstract is a reserved keyword and cannot be used as field name" because my case class has abstract as a field name. Any workaround for this?
  • Remis Haroon - رامز
    Remis Haroon - رامز almost 5 years
    Any explanation why it won't work in cluster mode?
  • Federico Rizzo
    Federico Rizzo almost 3 years
    Remember to add import spark.implicits._