PySpark explode list into multiple columns based on name
Solution 1
You can use flatMap
and pivot
to achieve this. Starting from the results from the first stage:
rdd = sc.parallelize([(1,['AA 1234 ZXYV','BB A 890','CC B 321']),
(2,['AA 1234 LMNO','BB D 123','CC E 321']),
(3,['AA 1234 ZXYV','CC B 321'])])
df = rdd.toDF(['index', 'result'])
You can first explode the array into multiple rows using flatMap
and extract the two letter identifier into a separate column.
df_flattened = df.rdd.flatMap(lambda x: [(x[0],y, y[0:2],y[3::]) for y in x[1]])\
.toDF(['index','result', 'identifier','identifiertype'])
and use pivot
to change the two letter identifier into column names:
df_result = df_flattened.groupby(df_flattened.index,)\
.pivot("identifier")\
.agg(first("identifiertype"))\
.join(df,'index')
I added the join to get the result
column back
Solution 2
Alternate way to explode array without converting to rdd,
from pyspark.sql import functions as F
udf1 = F.udf(lambda x : x.split()[0])
df.select('index',F.explode('result').alias('id'),udf1(F.col('id')).alias('idtype')).show()
+-----+-------------+------+
|index| id|idtype|
+-----+-------------+------+
| 1|AA 1234 ZXYV| AA|
| 1| BB A 890| BB|
| 1| CC B 321| CC|
| 2|AA 1234 LMNO| AA|
| 2| BB D 123| BB|
| 2| CC E 321| CC|
| 3|AA 1234 ZXYV| AA|
| 3| CC B 321| CC|
+-----+-------------+------+
df1.groupby('index').pivot('idtype').agg(F.first('id')).join(df,'index').show()
robarthur1
Updated on June 17, 2022Comments
-
robarthur1 almost 2 years
Hi I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using Pyspark to process the data into a dataframe.
The file looks similar to this:
AA 1234 ZXYW BB A 890 CC B 321 AA 1234 LMNO BB D 123 CC E 321 AA 1234 ZXYW CC E 456
Each 'AA' record defines the start of a logical group or records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record types. They are always identified with a two letter code at the start of each line. There can be 1 or many different record types in each group (i.e. not all record types are present for each group)
As a first stage, I've managed to group the records together in this format:
+----------------+---------------------------------+ | index| result| +----------------+---------------------------------+ | 1|[AA 1234 ZXYV,BB A 890,CC B 321]| | 2|[AA 1234 LMNO,BB D 123,CC E 321]| | 3|[AA 1234 ZXYV,CC B 321] | +----------------+---------------------------------+
And as a second stage I really want to get data into the following columns in a dataframe:
+----------------+---------------------------------+-------------+--------+--------+ | index| result| AA| BB| CC| +----------------+---------------------------------+-------------+--------+--------+ | 1|[AA 1234 ZXYV,BB A 890,CC B 321]|AA 1234 ZXYV|BB A 890|CC B 321| | 2|[AA 1234 LMNO,BB D 123,CC E 321]|AA 1234 LMNO|BB D 123|CC E 321| | 3|[AA 1234 ZXYV,CC B 321] |AA 1234 ZXYV| Null|CC B 321| +----------------+---------------------------------+-------------+--------+--------+
Because at that point extracting the information that I need should be trivial.
Does anyone have any suggestions as to how I might be able to do this?
Many Thanks.
-
robarthur1 over 6 yearsThat worked absolutely perfectly, exactly what I needed. Thanks very much for your help.