List to DataFrame in pyspark

pyspark pyspark-sql

20,414

You can convert the list to a list of Row objects, then pass it to spark.createDataFrame, which infers the schema from your data:

from pyspark.sql import Row
R = Row('ID', 'words')

# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show() 
+---+--------------------+
| ID|               words|
+---+--------------------+
|  0|[apple, ball, bal...|
|  1| [cat, camel, james]|
|  2| [none, focus, cake]|
+---+--------------------+
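
If you want the IDs to start at 1, as in the table in the question, a rough variant is to pass a start value to enumerate and hand createDataFrame a list of column names instead of Row objects. This is only a sketch: add_ids and list_to_dataframe are hypothetical helper names made up for illustration, and spark is assumed to be an existing SparkSession (the pyspark shell creates one for you).

```python
def add_ids(rows, start=1):
    """Pair each inner list with a sequential ID, beginning at `start`."""
    return [(i, words) for i, words in enumerate(rows, start=start)]


def list_to_dataframe(spark, rows):
    """Build a two-column DataFrame (ID, words) from a list of string lists.

    `spark` is assumed to be an existing SparkSession; the column types are
    inferred from the data, just as with the Row-based version.
    """
    return spark.createDataFrame(add_ids(rows), ["ID", "words"])


my_data = [['apple', 'ball', 'ballon'], ['cat', 'camel', 'james'], ['none', 'focus', 'cake']]

# Plain Python step: attach the IDs before anything is handed to Spark.
rows = add_ids(my_data)
```

With a session available, list_to_dataframe(spark, my_data).show() should display the same table as above with IDs 1 to 3.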


user9226665

Updated on July 09, 2022

Comments

user9226665 almost 2 years
Can someone tell me how to convert a list containing strings to a DataFrame in pyspark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment, and my data looks like this:
```
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
```
Now I want to create a DataFrame as follows:
```
---------------------------------
|ID | words                     |
---------------------------------
| 1 | ['apple','ball','ballon'] |
| 2 | ['cat','camel','james']   |
```
I also want to add an ID column, which is not part of the data.
user9226665 over 6 years

Thank you for your reply, but I am getting the following error when I run the code: Py4JJavaError: An error occurred while calling o40.describe. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 3, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "pyspark/worker.py", line 123, in main ("%d.%d" % sys.version_info[:2], version))
Psidom over 6 years

Try restarting the pyspark shell. The error doesn't seem to be related to the code.
Bala about 6 years

It's awesome. Exactly what I was searching for.