How to cast string to ArrayType of dictionary (JSON) in PySpark
Solution 1
Use from_json
with a schema that matches the actual data in attribute3
column to convert json to ArrayType:
Original data frame:
df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: string (nullable = true)
from pyspark.sql.functions import from_json
from pyspark.sql.types import *
Create the schema:
schema = ArrayType(
StructType([StructField("key", StringType()),
StructField("key2", IntegerType())]))
Use from_json
:
df = df.withColumn("attribute3", from_json(df.attribute3, schema))
df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- key: string (nullable = true)
# | | |-- key2: integer (nullable = true)
df.show(1, False)
#+----------+----------+-----+------------------------------------+
#|date |attribute2|count|attribute3 |
#+----------+----------+-----+------------------------------------+
#|2017-09-03|attribute1|2 |[[value, 2], [value, 2], [value, 2]]|
#+----------+----------+-----+------------------------------------+
Solution 2
The answer by @Psidom does not work for me because I am using Spark 2.1.
In my case, I had to slightly modify your attribute3
string to wrap it in a dictionary:
import pyspark.sql.functions as f
df2 = df.withColumn("attribute3", f.concat(f.lit('{"data": '), "attribute3", f.lit("}")))
df2.select("attribute3").show(truncate=False)
#+--------------------------------------------------------------------------------------+
#|attribute3 |
#+--------------------------------------------------------------------------------------+
#|{"data": [{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]}|
#+--------------------------------------------------------------------------------------+
Now I can define the schema as follows:
schema = StructType(
[
StructField(
"data",
ArrayType(
StructType(
[
StructField("key", StringType()),
StructField("key2", IntegerType())
]
)
)
)
]
)
Now use from_json
followed by getItem()
:
df3 = df2.withColumn("attribute3", f.from_json("attribute3", schema).getItem("data"))
df3.show(truncate=False)
#+----------+----------+-----+---------------------------------+
#|date |attribute2|count|attribute3 |
#+----------+----------+-----+---------------------------------+
#|2017-09-03|attribute1|2 |[[value,2], [value,2], [value,2]]|
#+----------+----------+-----+---------------------------------+
And the schema:
df3.printSchema()
# root
# |-- attribute3: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- key: string (nullable = true)
# | | |-- key2: integer (nullable = true)
Comments
-
malhar almost 2 years
Trying to cast StringType to ArrayType of JSON for a dataframe generated form CSV.
Using
pyspark
onSpark2
The CSV file I am dealing with; is as follows -
date,attribute2,count,attribute3 2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]' 2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'
As shown above, it contains one attribute
"attribute3"
in literal string, which is technically a list of dictionary(JSON) with exact length of 2. (This is the output of function distinct)Snippet from the
printSchema()
attribute3: string (nullable = true)
I am trying to cast the
"attribute3"
toArrayType
as followstemp = dataframe.withColumn( "attribute3_modified", dataframe["attribute3"].cast(ArrayType()) )
Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: __init__() takes at least 2 arguments (1 given)
Indeed,
ArrayType
expects datatype as argument. I tried with"json"
, but it did not work.Desired Output - In the end, I need to convert
attribute3
toArrayType()
or plain simple Python list. (I am trying to avoid use ofeval
)How do I convert it to
ArrayType
, so that I can treat it as list of JSONs?Am I missing anything here?
(The documentation,does not address this problem in straightforward way)