List to DataFrame in pyspark

You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:

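from pyspark.sql import Row

# sample data from the question; `spark` is the SparkSession provided by the pyspark shell
my_data = [['apple', 'ball', 'ballon'], ['cat', 'camel', 'james'], ['none', 'focus', 'cake']]
R = Row('ID', 'words')

# use enumerate to add the ID column (IDs start at 0)
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()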
+---+--------------------+
| ID|               words|
+---+--------------------+
|  0|[apple, ball, bal...|
|  1| [cat, camel, james]|
|  2| [none, focus, cake]|
+---+--------------------+
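
The question asks for IDs that start at 1, while a plain enumerate starts at 0. As a minimal sketch (not part of the original answer), the variant below passes a start value to enumerate and declares the schema explicitly instead of relying on inference. The helper names with_ids and to_dataframe are hypothetical, and the code assumes an existing SparkSession is passed in as spark.

```python
def with_ids(data, start=1):
    """Pair each inner list with a sequential ID, beginning at `start`."""
    return [(i, words) for i, words in enumerate(data, start=start)]


def to_dataframe(spark, rows):
    """Build a DataFrame with an explicit (ID, words) schema.

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported here so the row-building step above can run without Spark installed.
    from pyspark.sql.types import (
        ArrayType, LongType, StringType, StructField, StructType,
    )

    schema = StructType([
        StructField("ID", LongType(), nullable=False),
        StructField("words", ArrayType(StringType()), nullable=False),
    ])
    return spark.createDataFrame(rows, schema=schema)


my_data = [['apple', 'ball', 'ballon'],
           ['cat', 'camel', 'james'],
           ['none', 'focus', 'cake']]

rows = with_ids(my_data)  # list of (ID, words) tuples, IDs starting at 1
```

In the pyspark shell, to_dataframe(spark, rows).show() would then display the result. Declaring the schema up front is optional here, since inference also works on this data, but it makes the column types explicit.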

Author: user9226665

Updated on July 09, 2022

Comments

  • user9226665
    user9226665 almost 2 years

    Can someone tell me how to convert a list containing strings to a DataFrame in pyspark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like this:

    my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
    

    Now I want to create a DataFrame as follows:

    ---------------------------------
    |ID | words                     |
    ---------------------------------
     1  | ['apple','ball','ballon'] |
     2  | ['cat','camel','james']   |
    

    I also want to add an ID column, which is not present in the data.

  • user9226665
    user9226665 over 6 years
    Thanks for your reply, but I am getting the following error when I run the code: Py4JJavaError: An error occurred while calling o40.describe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 3, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "pyspark/worker.py", line 123, in main ("%d.%d" % sys.version_info[:2], version))
  • Psidom
    Psidom over 6 years
    Try restarting the pyspark shell. The error doesn't seem to be related to the code.
  • Bala
    Bala about 6 years
    Isn't it awesome! Exactly what I was searching for.