How can I create a DataFrame from another DataFrame (PySpark)?

from pyspark.sql.functions import unix_timestamp, col, to_date, struct

####
#sample data ('sc' is the SparkContext available in the PySpark shell)
####
df = sc.parallelize([[25, 'Prem', 'M', '12-21-2006 11:00:05','abc', '1'],
                      [20, 'Kate', 'F', '05-30-2007 10:05:00', 'asdf', '2'],
                      [40, 'Cheng', 'M', '12-30-2017 01:00:01', 'qwerty', '3']]).\
    toDF(["age","name","sex","datetime_in_strFormat","initial_col_name","col_in_strFormat"])

#create 'struct' type column by combining first 3 columns of sample data - (this is built to answer query #1)
df = df.withColumn("struct_col", struct('age', 'name', 'sex')).\
    drop('age', 'name', 'sex')
df.show()
df.printSchema()

####
#query 1
####
#Expand a field that holds a struct of three values (i.e. 'struct_col') into separate columns (i.e. 'name', 'age' & 'sex')
df = df.withColumn('name', col('struct_col.name')).\
    withColumn('age', col('struct_col.age')).\
    withColumn('sex', col('struct_col.sex')).\
    drop('struct_col')
df.show()
df.printSchema()

####
#query 2
####
#Convert the timestamp from string (i.e. 'datetime_in_strFormat') to datetime (i.e. 'datetime_in_tsFormat')
df = df.withColumn('datetime_in_tsFormat',
                   unix_timestamp(col('datetime_in_strFormat'), 'MM-dd-yyyy hh:mm:ss').cast("timestamp"))
df.show()
df.printSchema()

####
#query 3
####
#create more columns using above timestamp (e.g. fetch date value from timestamp column)
df = df.withColumn('datetime_in_dateFormat', to_date(col('datetime_in_tsFormat')))
df.show()

####
#query 4.a
####
#Change column name (e.g. 'initial_col_name' is renamed to 'new_col_name')
df = df.withColumnRenamed('initial_col_name', 'new_col_name')
df.show()

####
#query 4.b
####
#Change column type (e.g. string type in 'col_in_strFormat' is converted to double type in 'col_in_doubleFormat')
df = df.withColumn("col_in_doubleFormat", col('col_in_strFormat').cast("double"))
df.show()
df.printSchema()

Sample data:

+---------------------+----------------+----------------+------------+
|datetime_in_strFormat|initial_col_name|col_in_strFormat|  struct_col|
+---------------------+----------------+----------------+------------+
|  12-21-2006 11:00:05|             abc|               1| [25,Prem,M]|
|  05-30-2007 10:05:00|            asdf|               2| [20,Kate,F]|
|  12-30-2017 01:00:01|          qwerty|               3|[40,Cheng,M]|
+---------------------+----------------+----------------+------------+
root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- initial_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- struct_col: struct (nullable = false)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- sex: string (nullable = true)

Final output data:

+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|datetime_in_strFormat|new_col_name|col_in_strFormat| name|age|sex|datetime_in_tsFormat|datetime_in_dateFormat|col_in_doubleFormat|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|  12-21-2006 11:00:05|         abc|               1| Prem| 25|  M| 2006-12-21 11:00:05|            2006-12-21|                1.0|
|  05-30-2007 10:05:00|        asdf|               2| Kate| 20|  F| 2007-05-30 10:05:00|            2007-05-30|                2.0|
|  12-30-2017 01:00:01|      qwerty|               3|Cheng| 40|  M| 2017-12-30 01:00:01|            2017-12-30|                3.0|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+

root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- new_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- datetime_in_tsFormat: timestamp (nullable = true)
 |-- datetime_in_dateFormat: date (nullable = true)
 |-- col_in_doubleFormat: double (nullable = true)
Author: frm

Updated on June 28, 2022

Comments

  • frm
    frm almost 2 years

    I'm using PySpark v1.6.1 and I want to create a dataframe using another one:

    • Convert a field that has a struct of three values in different columns
    • Convert the timestamp from string to datetime
    • Create more columns using that timestamp
    • Change the rest of the column names and types

    Right now I'm using .map(func), which builds an RDD with a function that transforms each row from the original format and returns a row in the new one. But this creates an RDD, and I don't want that.

    Is there a nicer way to do this?

    • eiram_mahera
      eiram_mahera over 6 years
      If you are already able to create an RDD, you can easily transform it into DF.
    • frm
      frm over 6 years
      But I don't want to create an RDD; I want to avoid using RDDs since they are a performance bottleneck for Python. I just want to do DataFrame transformations.
    • pault
      pault over 6 years
      Please provide some code of what you've tried so we can help. In the meantime, look up pyspark.sql.functions.udf and the withColumn() method of Spark DataFrames.