Spark map dataframe using the dataframe's schema

Well, you can, but the result is rather useless:

import org.apache.spark.sql.Row
import spark.implicits._  // assumes a SparkSession in scope as `spark`

val df = Seq(("Justin", 19, "red")).toDF("name", "age", "color")

// Build a Map from column name to (untyped) value for a single row
def getValues(row: Row, names: Seq[String]) = names.map(
  name => name -> row.getAs[Any](name)
).toMap

val names = df.columns
df.rdd.map(getValues(_, names)).first

// scala.collection.immutable.Map[String,Any] =
//   Map(name -> Justin, age -> 19, color -> red)

To get something actually useful one would need a proper mapping between SQL types and Scala types. It is not hard in simple cases, but it is hard in general. For example, there is no built-in type which can be used to represent an arbitrary struct. This can be done using a little bit of meta-programming, but arguably it is not worth all the fuss.
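
For the simple cases mentioned above, a rough sketch of such a mapping could look like the following; it assumes only a handful of primitive SQL types and ignores nulls, and `typedValue` is just an illustrative helper, not a Spark API:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Map a few primitive SQL types to their typed accessors; structs, arrays,
// maps and null handling would need extra cases, which is where the
// general problem gets hard.
def typedValue(row: Row, idx: Int, dt: DataType): Any = dt match {
  case StringType  => row.getString(idx)
  case IntegerType => row.getInt(idx)
  case LongType    => row.getLong(idx)
  case DoubleType  => row.getDouble(idx)
  case _           => row.get(idx)  // fall back to the untyped accessor
}

val schema = df.schema
df.rdd.map { row =>
  schema.fields.zipWithIndex.map { case (f, i) =>
    f.name -> typedValue(row, i, f.dataType)
  }.toMap
}.first

// Map(name -> Justin, age -> 19, color -> red)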

Comments

  • Havnar

    I have a dataframe, created from a JSON object. I can query this dataframe and write it to parquet.

    Since I infer the schema, I don't necessarily know what's in the dataframe.

    Is there a way to get the column names out, or to map the dataframe using its own schema?

    // The results of SQL queries are DataFrames and support all the normal RDD operations.
    // The columns of a row in the result can be accessed by field index:
    df.map(t => "Name: " + t(0)).collect().foreach(println)
    
    // or by field name:
    df.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
    
    // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
    df.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
    // Map("name" -> "Justin", "age" -> 19)
    

    I would want to do something like

    df.map (_.getValuesMap[Any](ListAll())).collect().foreach(println)
    // Map ("name" -> "Justin", "age" -> 19, "color" -> "red")
    

    without knowing the actual number or names of the columns.
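
One way to get close to this, sketched here rather than guaranteed (it goes through df.rdd to avoid needing an encoder for Map[String, Any], and assumes each row carries its schema, as rows of a dataframe with a known schema do), is to feed getValuesMap the row's own field names:

    // Sketch only: no column names are hard-coded
    df.rdd.map(row => row.getValuesMap[Any](row.schema.fieldNames)).collect().foreach(println)
    // Map(name -> Justin, age -> 19, color -> red)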
