Apache Spark, createDataFrame example in Java using List<?> as first argument


The problem is your use of the bean class. java.lang.String is not a JavaBean, so Spark cannot derive a struct schema (rows with named columns) from String.class; that is why you see the StringType-cannot-be-cast-to-StructType error.

From JavaBeans Wikipedia:

JavaBeans are classes that encapsulate many objects into a single object (the bean). They are serializable, have a zero-argument constructor, and allow access to properties using getter and setter methods. The name "Bean" was given to encompass this standard, which aims to create reusable software components for Java.

To make this clearer, here is an example of using a JavaBean with Spark.

Suppose we use this Bean class:

import java.io.Serializable;

public class Bean implements Serializable {
    private static final long serialVersionUID = 1L;

    private String k;
    private String something;

    public String getK() {return k;}
    public String getSomething() {return something;}

    public void setK(String k) {this.k = k;}
    public void setSomething(String something) {this.something = something;}
}
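As an aside, the getter/setter pairs are exactly what bean-aware code reflects over to discover columns. A minimal sketch (plain Java, no Spark required; the class and method names here are only for illustration) using the standard java.beans.Introspector shows which properties would be discovered on a bean like the one above:

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanIntrospection {
    // Illustrative bean mirroring the Bean class from the answer.
    public static class Bean implements Serializable {
        private static final long serialVersionUID = 1L;
        private String k;
        private String something;
        public String getK() { return k; }
        public void setK(String k) { this.k = k; }
        public String getSomething() { return something; }
        public void setSomething(String something) { this.something = something; }
    }

    // Returns, in sorted order, the property names a bean-aware
    // framework would discover via standard JavaBeans introspection.
    public static List<String> propertyNames(Class<?> beanClass)
            throws IntrospectionException {
        // Object.class as the stop class excludes Object's own properties.
        BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            names.add(pd.getName());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(propertyNames(Bean.class)); // prints [k, something]
    }
}
```

Only fields exposed through a public getter become properties; a field without a getter would simply not appear in this list, and would likewise not show up as a column in the resulting DataFrame.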

Next, create two instances of Bean, b0 and b1:

Bean b0 = new Bean();
b0.setK("k0");
b0.setSomething("sth0");
Bean b1 = new Bean();
b1.setK("k1");
b1.setSomething("sth1");

Then add the beans (b0 and b1 here) to a List&lt;Bean&gt; called data:

List<Bean> data = new ArrayList<Bean>();
data.add(b0);
data.add(b1);

Now we can create a DataFrame using List<Bean> and Bean class:

DataFrame df = sqlContext.createDataFrame(data, Bean.class);

If we do df.show(), here is the output:

+---+---------+
|  k|something|
+---+---------+
| k0|     sth0|
| k1|     sth1|
+---+---------+

THE BETTER WAY: CREATE A DATAFRAME FROM JSON STRINGS

In Spark, you can create a DataFrame directly from a List of JSON strings:

DataFrame df = sqlContext.read().json(jsc.parallelize(data));

where jsc is an instance of JavaSparkContext.
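For completeness, here is a sketch of how such a list of JSON strings might be built. The Spark call itself is left as a comment because it needs a live SQLContext and JavaSparkContext; the field names k and something simply mirror the bean example above:

```java
import java.util.Arrays;
import java.util.List;

public class JsonData {
    // Builds the list of JSON strings to hand to Spark.
    public static List<String> jsonData() {
        return Arrays.asList(
            "{\"k\":\"k0\",\"something\":\"sth0\"}",
            "{\"k\":\"k1\",\"something\":\"sth1\"}");
    }

    public static void main(String[] args) {
        List<String> data = jsonData();
        // With a SQLContext and JavaSparkContext in scope (Spark 1.x),
        // the DataFrame would then be created as in the answer:
        //   DataFrame df = sqlContext.read().json(jsc.parallelize(data));
        System.out.println(data.size()); // prints 2
    }
}
```

With this route Spark infers the schema from the JSON keys, so no bean class is needed at all, which is exactly why it sidesteps the String.class problem from the question.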

Author: sc so

Updated on July 29, 2022

Comments

  • sc so, almost 2 years ago:

Can someone give an example of a Java implementation of the public DataFrame createDataFrame(java.util.List<?> data, java.lang.Class<?> beanClass) function, as mentioned in the Spark JavaDoc?

I have a list of JSON strings that I am passing as the first argument, and hence I am passing String.class as the second argument, but it gives this error:

java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot be cast to org.apache.spark.sql.types.StructType

I'm not sure why, hence I am looking for an example.