Apache Spark, createDataFrame example in Java using List<?> as first argument
The problem is your use of the bean class: String is not a JavaBean with getter/setter properties, so Spark cannot derive a struct schema from String.class, which is why the cast to StructType fails.
From the JavaBeans article on Wikipedia:
JavaBeans are classes that encapsulate many objects into a single object (the bean). They are serializable, have a zero-argument constructor, and allow access to properties using getter and setter methods. The name "Bean" was given to encompass this standard, which aims to create reusable software components for Java.
To make this clearer, let me give you an example of using a Java bean with Spark:
Suppose we use this Bean class:
import java.io.Serializable;
public class Bean implements Serializable {
private static final long serialVersionUID = 1L;
private String k;
private String something;
public String getK() {return k;}
public String getSomething() {return something;}
public void setK(String k) {this.k = k;}
public void setSomething(String something) {this.something = something;}
}
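Spark discovers the columns of the resulting DataFrame by introspecting the bean's getters. A minimal plain-Java sketch (using the standard java.beans.Introspector, no Spark required) shows which property names would be found on this Bean class:

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanIntrospection {

    // Same Bean class as above.
    public static class Bean implements Serializable {
        private static final long serialVersionUID = 1L;
        private String k;
        private String something;
        public String getK() { return k; }
        public String getSomething() { return something; }
        public void setK(String k) { this.k = k; }
        public void setSomething(String something) { this.something = something; }
    }

    // Collect the JavaBean property names, the same names Spark uses as columns.
    public static List<String> propertyNames() throws Exception {
        // Object.class as the stop class excludes inherited properties like getClass().
        BeanInfo info = Introspector.getBeanInfo(Bean.class, Object.class);
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            names.add(pd.getName());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // getK()/getSomething() yield the property names "k" and "something"
        System.out.println(propertyNames()); // prints [k, something]
    }
}
```

This is exactly why the columns in the output below are named k and something, and why String.class (which has no such properties) cannot produce a schema.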
Next, we create b0 and b1, two instances of Bean:
Bean b0 = new Bean();
b0.setK("k0");
b0.setSomething("sth0");
Bean b1 = new Bean();
b1.setK("k1");
b1.setSomething("sth1");
We also add the beans (b0 and b1 here) to a List<Bean> called data:
List<Bean> data = new ArrayList<Bean>();
data.add(b0);
data.add(b1);
Now we can create a DataFrame from the List<Bean> and the Bean class:
DataFrame df = sqlContext.createDataFrame(data, Bean.class);
If we call df.show(), here is the output:
+---+---------+
| k|something|
+---+---------+
| k0| sth0|
| k1| sth1|
+---+---------+
A BETTER WAY TO CREATE A DATAFRAME FROM JSON STRINGS
In Spark, you can create a DataFrame directly from a list of JSON strings:
DataFrame df = sqlContext.read().json(jsc.parallelize(data));
where jsc is an instance of JavaSparkContext and data is your List<String> of JSON documents.
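Put together, the JSON approach looks roughly like this. This is a sketch against the Spark 1.x API used in this answer (SQLContext/DataFrame); it needs the spark-core and spark-sql jars on the classpath, and in Spark 2+ you would use SparkSession and Dataset<Row> instead:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class JsonToDataFrame {
    public static void main(String[] args) {
        // Local-mode context for the example.
        SparkConf conf = new SparkConf().setAppName("json-example").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // A list of JSON strings, as in the question.
        List<String> data = Arrays.asList(
                "{\"k\":\"k0\",\"something\":\"sth0\"}",
                "{\"k\":\"k1\",\"something\":\"sth1\"}");

        // Parallelize the strings into an RDD and let Spark infer the schema.
        DataFrame df = sqlContext.read().json(jsc.parallelize(data));
        df.show();

        jsc.stop();
    }
}
```

Note that no bean class is needed here at all: Spark infers the schema (k and something as string columns) from the JSON documents themselves.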
Updated on July 29, 2022

Comments

sc so, almost 2 years ago:
Can someone give an example of a Java implementation of the
public DataFrame createDataFrame(java.util.List<?> data, java.lang.Class<?> beanClass)
function, as mentioned in the Spark JavaDoc? I have a list of JSON strings that I am passing as the first argument, and hence I am passing String.class as the second argument, but it gives the error
java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot be cast to org.apache.spark.sql.types.StructType
Not sure why, hence looking for an example.