Copy schema from one dataframe to another dataframe
Solution 1
If schema is flat I would use simply map over per-existing schema and select
required columns:
val exprs = df1.schema.fields.map { f =>
if (df2.schema.fields.contains(f)) col(f.name)
else lit(null).cast(f.dataType).alias(f.name)
}
df2.select(exprs: _*).printSchema
// root
// |-- A: string (nullable = true)
// |-- B: integer (nullable = false)
// |-- C: double (nullable = true)
// |-- D: integer (nullable = true)
Solution 2
Working in 2018 (Spark 2.3) reading a .sas7bdat
Scala
val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema //getting the schema from another dataframe
val df = spark.read.format("csv").option("header","true").schema(myManualSchema).load(csvFile)
PD: spark.sqlContext.sasFile use saurfang library, you could skip that part of code and get the schema from another dataframe.
Solution 3
Below are simple PYSPARK steps to achieve same:
df = <dataframe whose schema needs to be copied>
df_tmp = <dataframe with result with fewer fields>
#Note: field names from df_tmp must match with field names from df
df_tmp_cols = [colmn.lower() for colmn in df_tmp.columns]
for col_dtls in df.dtypes:
col_name, dtype = col_dtls
if col_name.lower() in df_tmp_cols:
df_tmp = df_tmp.withColumn(col_name,f.col(col_name).cast(dtype))
else:
df_tmp = df_tmp.withColumn(col_name,f.lit(None).cast(dtype))
df_fin = df_tmp.select(df.columns) #Final dataframe
Comments
-
RudyVerboven almost 2 years
I'm trying to change the schema of an existing dataframe to the schema of another dataframe.
DataFrame 1:
Column A | Column B | Column C | Column D "a" | 1 | 2.0 | 300 "b" | 2 | 3.0 | 400 "c" | 3 | 4.0 | 500
DataFrame 2:
Column K | Column B | Column F "c" | 4 | 5.0 "b" | 5 | 6.0 "f" | 6 | 7.0
So I want to apply the schema of the first dataframe on the second. So all the columns which are the same remain. The columns in dataframe 2 that are not in 1 get deleted. The others become "NULL".
Output
Column A | Column B | Column C | Column D "NULL" | 4 | "NULL" | "NULL" "NULL" | 5 | "NULL" | "NULL" "NULL" | 6 | "NULL" | "NULL"
So I came with a possible solution:
val schema = df1.schema val newRows: RDD[Row] = df2.map(row => { val values = row.schema.fields.map(s => { if(schema.fields.contains(s)){ row.getAs(s.name).toString }else{ "NULL" } }) Row.fromSeq(values) }) sqlContext.createDataFrame(newRows, schema)}
Now as you can see this will not work because the schema contains String, Int and Double. And all my rows have String values.
This is where I'm stuck, is there a way to automatically convert the type of my values to the schema?
-
RudyVerboven about 8 yearsThanks for the reply ! I gave it a try and it worked, exactly what I needed! (cannot upvote yet)
-
RudyVerboven about 8 yearsThanks for the reply, I edited my question. It is important to note that the dataframes are not relational.
-
pallupz almost 4 yearsHow do I do this in PySpark? I have this exact same requirement but in Python