Convert a standard Python key-value dictionary list to a PySpark DataFrame
Solution 1
The other answers work, but here's one more one-liner that works well with nested data. It may not be the most efficient, but if you're making a DataFrame from an in-memory dictionary, you're either working with small data sets like test data or using Spark wrong, so efficiency really shouldn't be a concern:
import json

d = {...}  # any JSON-compatible dict
spark.read.json(sc.parallelize([json.dumps(d)]))
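For example, a nested dict round-trips into struct and array columns (a minimal sketch; the dict contents here are made up for illustration, and spark/sc are assumed to exist as in the snippet above):
import json

d = {"user": {"name": "alice", "tags": ["x", "y"]}, "count": 3}  # hypothetical nested dict
df = spark.read.json(sc.parallelize([json.dumps(d)]))
df.printSchema()
# root
#  |-- count: long (nullable = true)
#  |-- user: struct (nullable = true)
#  |    |-- name: string (nullable = true)
#  |    |-- tags: array (nullable = true)
#  |    |    |-- element: string (containsNull = true)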
Solution 2
Old way (note that Spark 2.x warns that inferring the schema from a dict is deprecated and recommends using Row instead):
sc.parallelize([{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]).toDF()
New way:
from pyspark.sql import Row
from collections import OrderedDict
def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))
sc.parallelize([{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
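The sorting matters: depending on the Spark version, Row(**kwargs) either preserves the keyword order (Spark 3.0+) or sorts the field names itself (earlier releases), and dicts in Python < 3.7 don't preserve insertion order anyway. Sorting the keys up front gives every row the same deterministic column order. A minimal sketch, runnable without a SparkSession:
from collections import OrderedDict
from pyspark.sql import Row

r = Row(**OrderedDict(sorted({"b": 2, "a": 1}.items())))
print(r)  # Row(a=1, b=2) -- fields come out in sorted key order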
Solution 3
For anyone trying to solve something slightly different: I had a single dictionary of key-value pairs and wanted to convert it into a two-column PySpark DataFrame, with the keys in one column and the values in the other.
So
{k1: v1, k2: v2, ...}
Becomes
---------------
| col1 | col2 |
|------|------|
| k1   | v1   |
| k2   | v2   |
---------------
lol = list(map(list, mydict.items()))  # list of [key, value] lists
df = spark.createDataFrame(lol, ["col1", "col2"])
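For instance, with a small hypothetical dict (the keys and values here are made up):
mydict = {"k1": "v1", "k2": "v2"}  # hypothetical input
lol = list(map(list, mydict.items()))  # [["k1", "v1"], ["k2", "v2"]]
df = spark.createDataFrame(lol, ["col1", "col2"])
df.show()
# +----+----+
# |col1|col2|
# +----+----+
# |  k1|  v1|
# |  k2|  v2|
# +----+----+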
Solution 4
I had to modify the accepted answer in order for it to work for me in Python 2.7 running Spark 2.0.
from collections import OrderedDict
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType
spark = (SparkSession
         .builder
         .getOrCreate())
schema = StructType([
StructField('arg1', StringType(), True),
StructField('arg2', StringType(), True)
])
dta = [{"arg1": "", "arg2": ""}, {"arg1": "", "arg2": ""}]
dtaRDD = spark.sparkContext.parallelize(dta) \
.map(lambda x: Row(**OrderedDict(sorted(x.items()))))
dtaDF = spark.createDataFrame(dtaRDD, schema)
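Sorting the keys ensures the Row fields line up positionally with the StructType fields (arg1 before arg2); with unsorted keys the two string columns could end up swapped. A quick sanity check, assuming the snippet above has run:
dtaDF.show()
# +----+----+
# |arg1|arg2|
# +----+----+
# |    |    |
# |    |    |
# +----+----+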
Original question (stackit):
Consider I have a list of Python dictionaries of key-value pairs, where each key corresponds to a column name of a table. For the list below, how do I convert it into a PySpark DataFrame with the two columns arg1 and arg2?
[{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""},{"arg1": "", "arg2": ""}]
How can I use the following construct to do it?
df = sc.parallelize([ ... ]).toDF()
Where do arg1 and arg2 go in the code above (in place of the ...)?