How can I create a dataframe using other dataframe (PySpark)?

python dataframe pyspark

12,821

from pyspark.sql.functions import unix_timestamp, col, to_date, struct

####
#sample data
####
df = sc.parallelize([[25, 'Prem', 'M', '12-21-2006 11:00:05','abc', '1'],
                      [20, 'Kate', 'F', '05-30-2007 10:05:00', 'asdf', '2'],
                      [40, 'Cheng', 'M', '12-30-2017 01:00:01', 'qwerty', '3']]).\
    toDF(["age","name","sex","datetime_in_strFormat","initial_col_name","col_in_strFormat"])

#create 'struct' type column by combining first 3 columns of sample data - (this is built to answer query #1)
df = df.withColumn("struct_col", struct('age', 'name', 'sex')).\
    drop('age', 'name', 'sex')
df.show()
df.printSchema()

####
#query 1
####
#Convert a field that has a struct of three values (i.e. 'struct_col') in different columns (i.e. 'name', 'age' & 'sex')
df = df.withColumn('name', col('struct_col.name')).\
    withColumn('age', col('struct_col.age')).\
    withColumn('sex', col('struct_col.sex')).\
    drop('struct_col')
df.show()
df.printSchema()

####
#query 2
####
#Convert the timestamp from string (i.e. 'datetime_in_strFormat') to datetime (i.e. 'datetime_in_tsFormat')
df = df.withColumn('datetime_in_tsFormat',
                   unix_timestamp(col('datetime_in_strFormat'), 'MM-dd-yyyy hh:mm:ss').cast("timestamp"))
df.show()
df.printSchema()

####
#query 3
####
#create more columns using above timestamp (e.g. fetch date value from timestamp column)
df = df.withColumn('datetime_in_dateFormat', to_date(col('datetime_in_tsFormat')))
df.show()

####
#query 4.a
####
#Change column name (e.g. 'initial_col_name' is renamed to 'new_col_name)
df = df.withColumnRenamed('initial_col_name', 'new_col_name')
df.show()

####
#query 4.b
####
#Change column type (e.g. string type in 'col_in_strFormat' is coverted to double type in 'col_in_doubleFormat')
df = df.withColumn("col_in_doubleFormat", col('col_in_strFormat').cast("double"))
df.show()
df.printSchema()

Sample data:

+---------------------+----------------+----------------+------------+
|datetime_in_strFormat|initial_col_name|col_in_strFormat|  struct_col|
+---------------------+----------------+----------------+------------+
|  12-21-2006 11:00:05|             abc|               1| [25,Prem,M]|
|  05-30-2007 10:05:00|            asdf|               2| [20,Kate,F]|
|  12-30-2017 01:00:01|          qwerty|               3|[40,Cheng,M]|
+---------------------+----------------+----------------+------------+
root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- initial_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- struct_col: struct (nullable = false)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- sex: string (nullable = true)

Final output data:

+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|datetime_in_strFormat|new_col_name|col_in_strFormat| name|age|sex|datetime_in_tsFormat|datetime_in_dateFormat|col_in_doubleFormat|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|  12-21-2006 11:00:05|         abc|               1| Prem| 25|  M| 2006-12-21 11:00:05|            2006-12-21|                1.0|
|  05-30-2007 10:05:00|        asdf|               2| Kate| 20|  F| 2007-05-30 10:05:00|            2007-05-30|                2.0|
|  12-30-2017 01:00:01|      qwerty|               3|Cheng| 40|  M| 2017-12-30 01:00:01|            2017-12-30|                3.0|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+

root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- new_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- datetime_in_tsFormat: timestamp (nullable = true)
 |-- datetime_in_dateFormat: date (nullable = true)
 |-- col_in_doubleFormat: double (nullable = true)

12,821

Author by

frm

Updated on June 28, 2022

Comments

frm almost 2 years
I'm using PySpark v1.6.1 and I want to create a dataframe using another one:
- Convert a field that has a struct of three values in different columns
- Convert the timestamp from string to datatime
- Create more columns using that timestamp
- Change the rest of the column names and types
Right now is using .map(func) creating an RDD using that function (which transforms from one row from the original type and returns a row with the new one). But this is creating an RDD and I don't wont that.

Is there a nicer way to do this?
- eiram_mahera over 6 years
  
  If you are already able to create an RDD, you can easily transform it into DF.
- frm over 6 years
  
  but i don't want to create an RDD, i want to avoid using RDDs since they are a performance bottle neck for python, i just want to do DF transformations
- pault over 6 years
  
  Please provide some code of what you've tried so we can help. In the meantime, look up pyspark.sql.functions.udf and the withColumn() method of spark dataframes.